<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Prithwish Nath</title>
    <description>The latest articles on DEV Community by Prithwish Nath (@prithwish_nath).</description>
    <link>https://dev.to/prithwish_nath</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.us-east-2.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F2949644%2F62f159e9-294c-4c8f-a449-3857d7e5fbb6.jpg</url>
      <title>DEV Community: Prithwish Nath</title>
      <link>https://dev.to/prithwish_nath</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/prithwish_nath"/>
    <language>en</language>
    <item>
      <title>5 MCP Servers That Can Replace Expensive SaaS Subscriptions in 2026</title>
      <dc:creator>Prithwish Nath</dc:creator>
      <pubDate>Tue, 16 Jun 2026 06:50:06 +0000</pubDate>
      <link>https://dev.to/prithwish_nath/5-mcp-servers-that-can-replace-expensive-saas-subscriptions-in-2026-3l8b</link>
      <guid>https://dev.to/prithwish_nath/5-mcp-servers-that-can-replace-expensive-saas-subscriptions-in-2026-3l8b</guid>
      <description>&lt;p&gt;TL;DR: Cancel PhantomBuster, ZoomInfo, Ahrefs, Zapier, Otter.ai, and Mixpanel by swapping to these open-source MCP servers in Claude and Cursor.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwf3f4ta9k5xlxjdcy6k2.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwf3f4ta9k5xlxjdcy6k2.png" width="640" height="454"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;TL;DR:&lt;/strong&gt; Stop paying $129/month for an SEO dashboard you query twice a week, $30/seat for a transcription tool, and another $69 for an automation platform — most of which you now drive through Claude or Cursor anyway. MCP collapses those logins into one interface. Here are 5 MCP servers that each let you cancel a SaaS.&lt;/p&gt;

&lt;h2&gt;
  
  
  What You’ll Learn
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;  How to replace a full $500+/month web-scraping + lead-data stack with one &lt;a href="https://get.brightdata.com/scraping-browser7181?utm_content=five_mcp_servers_that_can_replace_expensive_saas_subscriptions_in_2026" rel="noopener noreferrer"&gt;free-tier MCP&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;  Replacing expensive voice and transcription tools with &lt;a href="https://elevenlabs.io/docs/eleven-agents/customization/tools/mcp" rel="noopener noreferrer"&gt;usage-based speech your MCP client generates and transcribes mid-conversation&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;  Killing a $999/month programmatic SEO wall with &lt;a href="https://dataforseo.com/help-center/setting-up-the-official-dataforseo-mcp-server-simple-guide" rel="noopener noreferrer"&gt;an MCP that runs keyword research, SERP checks, and backlink audits on demand&lt;/a&gt;, in natural language.&lt;/li&gt;
&lt;li&gt;  Replacing Zapier with &lt;a href="https://www.n8n-mcp.com/" rel="noopener noreferrer"&gt;a self-hosted automation engine your AI designs, builds, and deploys for you&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;  Dropping a product-analytics subscription for something &lt;a href="https://posthog.com/docs/model-context-protocol" rel="noopener noreferrer"&gt;you can query using natural language or SQL.&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Prerequisites
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;  Claude Desktop, Cursor, or another MCP-compatible client&lt;/li&gt;
&lt;li&gt;  Node.js v18+ for most installations (Python/&lt;code&gt;uv&lt;/code&gt; for one)&lt;/li&gt;
&lt;li&gt;  An API key per service (free tiers where available)&lt;/li&gt;
&lt;li&gt;  10 minutes per server setup&lt;/li&gt;
&lt;/ul&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;&lt;em&gt;What Are MCP Servers?&lt;/em&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://modelcontextprotocol.io/docs/getting-started/intro" rel="noopener noreferrer"&gt;MCP (Model Context Protocol)&lt;/a&gt; &lt;em&gt;is an open standard that lets MCP-compatible clients like Claude, Cursor, and many others connect to external data sources and tools through standardized server interfaces that expose specific tools and resources to Large Language Models (LLMs).&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Once connected, the AI uses these tools naturally in conversation. Ask it to “pull this quarter’s backlinks for our domain and flag the toxic ones,” and it just works — without a separate dashboard or subscription.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;To find out more about how MCP works, check out the&lt;/em&gt; &lt;a href="https://modelcontextprotocol.io/docs/getting-started/intro" rel="noopener noreferrer"&gt;MCP documentation&lt;/a&gt;&lt;em&gt;.&lt;/em&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;This isn’t about official MCPs for SaaS you keep paying for (Stripe, Notion, Sentry). It’s about &lt;strong&gt;consolidation that actually removes a bill.&lt;/strong&gt; So, I’m only including usage-based APIs and self-hostable tools that do the job for a fraction of the seat price.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why MCP Servers Can Now Replace SaaS Subscriptions
&lt;/h2&gt;

&lt;p&gt;Most SaaS tools charge for a &lt;strong&gt;dashboard and a seat&lt;/strong&gt;. But the underlying data or action is almost always an API. For years you paid the seat premium because the API was painful: auth, pagination, JSON wrangling, rate limits.&lt;/p&gt;

&lt;p&gt;MCP removes that friction. Your agent handles the API calls, parses the output, and chains the result into whatever you’re actually doing. Once that’s true, the math flips: a $139/month subscription with API access gated behind a $1,499/month enterprise tier loses to a pay-per-query API that costs cents per call.&lt;/p&gt;

&lt;p&gt;The five below are the easiest swaps available in mid-2026. Let’s dive in.&lt;/p&gt;

&lt;h2&gt;
  
  
  Bright Data Web MCP: Replace PhantomBuster, ZoomInfo &amp;amp; Scraping Tools
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://github.com/brightdata/brightdata-mcp?source=post_page-----e89cb078d398---------------------------------------" rel="noopener noreferrer"&gt;GitHub - brightdata/brightdata-mcp: A powerful Model Context Protocol (MCP) server that provides an all-in-one solution for public web access.&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Repository:&lt;/strong&gt; &lt;a href="https://github.com/brightdata/brightdata-mcp" rel="noopener noreferrer"&gt;https://github.com/brightdata/brightdata-mcp&lt;/a&gt;&lt;br&gt;&lt;br&gt;
&lt;strong&gt;License:&lt;/strong&gt; MIT&lt;br&gt;&lt;br&gt;
&lt;strong&gt;Free Tier:&lt;/strong&gt; 5,000 requests/month&lt;br&gt;&lt;br&gt;
&lt;strong&gt;Replaces:&lt;/strong&gt; &lt;a href="https://phantombuster.com" rel="noopener noreferrer"&gt;PhantomBuster&lt;/a&gt; (&lt;strong&gt;$69–439/month&lt;/strong&gt;), &lt;a href="https://www.zoominfo.com" rel="noopener noreferrer"&gt;ZoomInfo&lt;/a&gt; company/contact data (&lt;strong&gt;custom annual contracts, commonly ~$15,000+/year&lt;/strong&gt;), and general scraping SaaS like ScrapingBee / Octoparse.&lt;/p&gt;
&lt;h3&gt;
  
  
  What Bright Data Web MCP Does
&lt;/h3&gt;

&lt;p&gt;This is &lt;em&gt;the&lt;/em&gt; web access layer for LLMs and AI agents. It’s an MCP wrapper around all of Bright Data’s offerings — exposing &lt;a href="https://get.brightdata.com/bd-serp-api?utm_content=five_mcp_servers_that_can_replace_expensive_saas_subscriptions_in_2026" rel="noopener noreferrer"&gt;search&lt;/a&gt;, &lt;a href="https://get.brightdata.com/bd-web-unlocker?utm_content=five_mcp_servers_that_can_replace_expensive_saas_subscriptions_in_2026" rel="noopener noreferrer"&gt;scraping&lt;/a&gt;, &lt;a href="https://get.brightdata.com/bd-datasets?utm_content=five_mcp_servers_that_can_replace_expensive_saas_subscriptions_in_2026" rel="noopener noreferrer"&gt;structured datasets&lt;/a&gt;, and &lt;a href="https://get.brightdata.com/bd-scraping-browser?utm_content=five_mcp_servers_that_can_replace_expensive_saas_subscriptions_in_2026" rel="noopener noreferrer"&gt;browser automation&lt;/a&gt; through one server. It handles the things that break naive scrapers: bot detection, CAPTCHAs, geo-restrictions, and JavaScript-heavy sites.&lt;/p&gt;

&lt;p&gt;The Pro mode for this server includes tools that pull the exact datasets you used to buy seats for: LinkedIn profiles/posts, Crunchbase and ZoomInfo company profiles, Amazon/Walmart products, and more.&lt;/p&gt;
&lt;h3&gt;
  
  
  How Bright Data Web MCP Replaces PhantomBuster &amp;amp; ZoomInfo
&lt;/h3&gt;

&lt;p&gt;PhantomBuster charges $69–439/month for execution time and “Phantom slots” to scrape LinkedIn and build lead lists. ZoomInfo charges five figures a year for company and contact data, locked behind annual contracts. Bright Data’s MCP exposes the same &lt;em&gt;kinds&lt;/em&gt; of data — &lt;code&gt;web_data_linkedin_posts&lt;/code&gt;, &lt;code&gt;web_data_zoominfo_company_profile&lt;/code&gt;, &lt;code&gt;web_data_crunchbase_company&lt;/code&gt; — on a free tier of 5,000 requests/month, then pay-as-you-go.&lt;/p&gt;

&lt;p&gt;You pay for the calls your agent actually makes — and you make them from inside Claude or Cursor instead of logging into three separate dashboards.&lt;/p&gt;
&lt;h3&gt;
  
  
  Bright Data Web MCP Features &amp;amp; Capabilities
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;  &lt;strong&gt;Real-time Web Access:&lt;/strong&gt; Up-to-date data straight from the live web, scaling on demand&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;60+ specialized tools&lt;/strong&gt; (Pro) including social media, e-commerce, business, and search engines&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Structured datasets:&lt;/strong&gt; ZoomInfo/Crunchbase company profiles, LinkedIn data, e-commerce product data&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Geo-restriction Bypass:&lt;/strong&gt; Target specific geographies for localized data&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Browser Control:&lt;/strong&gt; Remote browser automation for complex interactions (Pro)&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;
  
  
  How to Install Bright Data Web MCP (Claude &amp;amp; Cursor)
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://get.brightdata.com/bd7914?utm_content=five_mcp_servers_that_can_replace_expensive_saas_subscriptions_in_2026" rel="noopener noreferrer"&gt;Sign up&lt;/a&gt; and grab your API token from the Bright Data dashboard (&lt;strong&gt;Settings → API tokens&lt;/strong&gt;).&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Option 1: Remote Server (Easiest, zero install)&lt;/em&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;{  
  "mcpServers": {  
    "Bright Data": {  
      "command": "npx",  
      "args": ["mcp-remote", "https://mcp.brightdata.com/mcp?token=YOUR_API_TOKEN_HERE"]  
    }  
  }  
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;em&gt;Option 2: Running Locally + Pro Mode&lt;/em&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;{  
  "mcpServers": {  
    "Bright Data": {  
      "command": "npx",  
      "args": ["@brightdata/mcp"],  
      "env": {  
        "API_TOKEN": "YOUR_API_TOKEN_HERE",  
        "PRO_MODE": "true"  
      }  
    }  
  }  
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;&lt;em&gt;Note:&lt;/em&gt;&lt;/strong&gt; &lt;em&gt;The free tier covers&lt;/em&gt; &lt;code&gt;_search_engine_&lt;/code&gt;&lt;em&gt;,&lt;/em&gt; &lt;code&gt;_scrape_as_markdown_&lt;/code&gt;&lt;em&gt;, and&lt;/em&gt; &lt;code&gt;_discover_&lt;/code&gt;&lt;em&gt;. Pro mode unlocks the 60+ tools (browser automation + structured Web Data APIs) and incurs pay-as-you-go charges.&lt;/em&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h3&gt;
  
  
  Bright Data Web MCP Tools List (~60 in Pro)
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;  &lt;strong&gt;General Web Scraping:&lt;/strong&gt; search engine results + generic webpage scraping with bot-detection bypass&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Structured Business Data:&lt;/strong&gt; ZoomInfo, Crunchbase, LinkedIn jobs, Yahoo Finance&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Social Media:&lt;/strong&gt; LinkedIn, Instagram, Facebook, X/Twitter, TikTok, YouTube, Reddit&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;E-commerce:&lt;/strong&gt; Amazon, Walmart, eBay, Zillow, Booking.com, and more&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Browser Automation:&lt;/strong&gt; click, type, navigate, screenshot&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Bright Data Web MCP Example Prompts
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;  &lt;strong&gt;Lead Lists (replacing PhantomBuster):&lt;/strong&gt; &lt;em&gt;“Find VP of Engineering profiles at Series-A fintechs and pull their company data.”&lt;/em&gt;
&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Company Intel (replacing ZoomInfo):&lt;/strong&gt; &lt;em&gt;“Get the ZoomInfo company profile and Crunchbase funding history for Acme Corp.”&lt;/em&gt;
&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Competitive Analysis:&lt;/strong&gt; &lt;em&gt;“Scrape the pricing pages of the top 5 project management tools.”&lt;/em&gt;
&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Real-time Data:&lt;/strong&gt; &lt;em&gt;“What’s Tesla’s current market cap?”&lt;/em&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Best for:&lt;/strong&gt; developers building data-collection pipelines, sales/growth teams replacing lead-gen subscriptions, and anyone needing reliable web data without per-seat scraping tools.&lt;/p&gt;

&lt;h2&gt;
  
  
  ElevenLabs MCP: Replace Murf, Play.ht, Otter.ai &amp;amp; Rev
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://github.com/elevenlabs/elevenlabs-mcp?source=post_page-----e89cb078d398---------------------------------------" rel="noopener noreferrer"&gt;GitHub - elevenlabs/elevenlabs-mcp: The official ElevenLabs MCP serverThe official ElevenLabs MCP server. Contribute to elevenlabs/elevenlabs-mcp development by creating an account on…github.com&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Repository:&lt;/strong&gt; &lt;a href="https://github.com/elevenlabs/elevenlabs-mcp" rel="noopener noreferrer"&gt;https://github.com/elevenlabs/elevenlabs-mcp&lt;/a&gt;&lt;br&gt;&lt;br&gt;
&lt;strong&gt;License:&lt;/strong&gt; MIT&lt;br&gt;&lt;br&gt;
&lt;strong&gt;Free Tier:&lt;/strong&gt; 10,000 credits/month (ElevenLabs free plan)&lt;br&gt;&lt;br&gt;
&lt;strong&gt;Replaces:&lt;/strong&gt; Voiceover tools like &lt;a href="https://murf.ai" rel="noopener noreferrer"&gt;Murf&lt;/a&gt; (&lt;strong&gt;Creator $29/month, Business $75/month&lt;/strong&gt;) and &lt;a href="https://play.ht" rel="noopener noreferrer"&gt;Play.ht&lt;/a&gt; (&lt;strong&gt;Creator $39/month, Pro $99/month&lt;/strong&gt;), plus transcription tools like &lt;a href="https://otter.ai" rel="noopener noreferrer"&gt;Otter.ai&lt;/a&gt; (&lt;strong&gt;Pro $8.33–16.99/month, Business $20–30/seat/month&lt;/strong&gt;) and &lt;a href="https://www.rev.com" rel="noopener noreferrer"&gt;Rev&lt;/a&gt; (&lt;strong&gt;Essentials $25.49/seat/month, Pro $47.99/seat/month&lt;/strong&gt;).&lt;/p&gt;
&lt;h3&gt;
  
  
  What the ElevenLabs MCP Does
&lt;/h3&gt;

&lt;p&gt;The official ElevenLabs MCP server is a thin, prompt-driven layer over the full ElevenLabs audio platform. Instead of writing API code, you tell Claude or Cursor what you want — &lt;em&gt;“generate a voiceover in the Rachel voice”&lt;/em&gt; or &lt;em&gt;“transcribe this recording with speaker labels”&lt;/em&gt; — and the server handles auth, the API call, and file output.&lt;/p&gt;

&lt;p&gt;It covers &lt;strong&gt;both directions of audio&lt;/strong&gt;: text-to-speech (studio-grade voice generation, cloning, and design), plus speech-to-text via the Scribe model, with diarization and word-level timestamps.&lt;/p&gt;
&lt;h3&gt;
  
  
  How ElevenLabs MCP Replaces Murf, Otter &amp;amp; Rev
&lt;/h3&gt;

&lt;p&gt;If you’re paying for Murf or Play.ht to narrate content, generate product demo voiceovers, or produce audio for any part of your stack, ElevenLabs replaces both on a usage model — with a 10k-credit/month free tier that most indie devs won’t blow through. The voices are at least as good, and you’re driving it from your editor instead of a separate dashboard.&lt;/p&gt;

&lt;p&gt;The bonus: the &lt;em&gt;same&lt;/em&gt; MCP also does transcription via Scribe. If you’re also paying for an Otter or Rev seat to transcribe meetings or recordings, that goes too. For a team doing both, that’s two-to-three subscriptions collapsed into one usage-based account.&lt;/p&gt;
&lt;h3&gt;
  
  
  ElevenLabs MCP Features (TTS, STT, Cloning)
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;  &lt;strong&gt;Text-to-Speech&lt;/strong&gt; &lt;em&gt;(replaces Murf/Play.ht)&lt;/em&gt;: studio-quality voices with stability/speed/style control&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Voice Cloning &amp;amp; Design&lt;/strong&gt; &lt;em&gt;(replaces Murf’s custom voice features)&lt;/em&gt;: clone from a sample or design a voice from a text prompt&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Speech-to-Text via Scribe&lt;/strong&gt; &lt;em&gt;(replaces Otter/Rev)&lt;/em&gt;: transcription with speaker diarization and word-level timestamps&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Audio Processing:&lt;/strong&gt; sound effects, audio isolation, voice conversion, and music generation&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Flexible Output:&lt;/strong&gt; save to disk or return files as MCP resources (handy for cloud/serverless clients)&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;
  
  
  How to Install the ElevenLabs MCP
&lt;/h3&gt;

&lt;p&gt;Get your API key from &lt;a href="https://elevenlabs.io/app/settings/api-keys" rel="noopener noreferrer"&gt;ElevenLabs → Settings → API Keys&lt;/a&gt; (free tier: 10k credits/month). Install &lt;code&gt;uv&lt;/code&gt;, then add to &lt;code&gt;claude_desktop_config.json&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;{  
  "mcpServers": {  
    "ElevenLabs": {  
      "command": "uvx",  
      "args": ["elevenlabs-mcp"],  
      "env": {  
        "ELEVENLABS_API_KEY": "`&amp;lt;insert-your-api-key-here&amp;gt;`"  
      }  
    }  
  }  
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;For Cursor/Windsurf: &lt;code&gt;pip install elevenlabs-mcp&lt;/code&gt;, then run &lt;code&gt;python -m elevenlabs_mcp --api-key=YOUR_KEY --print&lt;/code&gt; to generate the config block.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;em&gt;Note: Windows users, enable “Developer Mode” in Claude Desktop (Help → Enable Developer Mode) to use the server.&lt;/em&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h3&gt;
  
  
  ElevenLabs MCP Tools List
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;  &lt;strong&gt;Text-to-Speech:&lt;/strong&gt; &lt;code&gt;text_to_speech&lt;/code&gt;, &lt;code&gt;text_to_voice&lt;/code&gt; (design a voice from a prompt), &lt;code&gt;play_audio&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Speech-to-Text:&lt;/strong&gt; &lt;code&gt;speech_to_text&lt;/code&gt; (Scribe, with diarization + word-level timestamps)&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Voice Management:&lt;/strong&gt; &lt;code&gt;voice_clone&lt;/code&gt;, &lt;code&gt;search_voices&lt;/code&gt;, &lt;code&gt;search_voice_library&lt;/code&gt;, &lt;code&gt;get_voice&lt;/code&gt;, &lt;code&gt;create_voice_from_preview&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Audio Processing:&lt;/strong&gt; &lt;code&gt;text_to_sound_effects&lt;/code&gt;, &lt;code&gt;isolate_audio&lt;/code&gt;, &lt;code&gt;speech_to_speech&lt;/code&gt;, &lt;code&gt;compose_music&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Account:&lt;/strong&gt; &lt;code&gt;check_subscription&lt;/code&gt; (credit/usage monitoring)&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  ElevenLabs MCP Example Prompts
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;  &lt;strong&gt;Voiceover (replacing Murf/Play.ht):&lt;/strong&gt; &lt;em&gt;“Generate a 60-second narration of this script in a warm, documentary voice.”&lt;/em&gt;
&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Voice Prototyping:&lt;/strong&gt; &lt;em&gt;“Generate this product intro in 5 different voices so I can pick one.”&lt;/em&gt;
&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Meeting Notes (replacing Otter/Rev):&lt;/strong&gt; &lt;em&gt;“Transcribe&lt;/em&gt; &lt;code&gt;_/recordings/standup.mp3_&lt;/code&gt; &lt;em&gt;with speaker diarization and save the transcript to&lt;/em&gt; &lt;code&gt;_notes.md_&lt;/code&gt;&lt;em&gt;."&lt;/em&gt;
&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Content Repurposing:&lt;/strong&gt; &lt;em&gt;“Turn this podcast clip into text, then re-voice each speaker with a distinct voice.”&lt;/em&gt;
&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Sound Design:&lt;/strong&gt; &lt;em&gt;“Generate a thunderstorm ambience track for this video’s intro.”&lt;/em&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Best for:&lt;/strong&gt; developers and content teams paying for a voiceover tool — with transcription as a free side-effect of the same account.&lt;/p&gt;

&lt;h3&gt;
  
  
  Pro Tips for using ElevenLabs MCP
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;  &lt;strong&gt;Clean noisy audio first.&lt;/strong&gt; Run &lt;code&gt;isolate_audio&lt;/code&gt; before &lt;code&gt;speech_to_text&lt;/code&gt; on multi-speaker recordings with background noise — cleaner input meaningfully improves Scribe's diarization quality.&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Credit costs vary wildly by tool.&lt;/strong&gt; TTS is cheap per character; voice cloning and music generation cost a lot more. Before running a bulk job, ask Claude to check your balance with &lt;code&gt;check_subscription&lt;/code&gt; and estimate the cost first.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  DataForSEO MCP: Replace Ahrefs &amp;amp; SEMrush (Pay-Per-Query SEO Data)
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://github.com/dataforseo/mcp-server-typescript?source=post_page-----e89cb078d398---------------------------------------" rel="noopener noreferrer"&gt;GitHub - dataforseo/mcp-server-typescript: DataForSEO API modelcontextprotocol server&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Repository:&lt;/strong&gt; &lt;a href="https://github.com/dataforseo/mcp-server-typescript" rel="noopener noreferrer"&gt;https://github.com/dataforseo/mcp-server-typescript&lt;/a&gt;&lt;br&gt;&lt;br&gt;
&lt;strong&gt;Documentation:&lt;/strong&gt; &lt;a href="https://docs.dataforseo.com/v3/" rel="noopener noreferrer"&gt;https://docs.dataforseo.com/v3/&lt;/a&gt;&lt;br&gt;&lt;br&gt;
&lt;strong&gt;License:&lt;/strong&gt; Apache-2.0&lt;br&gt;&lt;br&gt;
&lt;strong&gt;Pricing:&lt;/strong&gt; No subscription — pay-as-you-go with a one-time &lt;strong&gt;~$50 minimum deposit&lt;/strong&gt; (plus a free sandbox for testing).&lt;br&gt;&lt;br&gt;
&lt;strong&gt;Replaces:&lt;/strong&gt; &lt;a href="https://ahrefs.com" rel="noopener noreferrer"&gt;Ahrefs&lt;/a&gt; (&lt;strong&gt;$129/month Lite; programmatic API access gated behind Enterprise at ~$999–1,499/month&lt;/strong&gt;) and &lt;a href="https://www.semrush.com" rel="noopener noreferrer"&gt;SEMrush&lt;/a&gt; (&lt;strong&gt;$139.95/month Pro&lt;/strong&gt;). Typically &lt;strong&gt;70–97% cheaper&lt;/strong&gt; at equivalent query volume.&lt;/p&gt;
&lt;h3&gt;
  
  
  What the DataForSEO MCP Does
&lt;/h3&gt;

&lt;p&gt;The official DataForSEO MCP server exposes DataForSEO’s API suite — SERP, keyword, on-page, backlinks, domain analytics, and more — as MCP tools. Your agent runs keyword research, SERP checks, backlink audits, and site crawls in natural language. DataForSEO returns the raw data and the agent structures it.&lt;/p&gt;
&lt;h3&gt;
  
  
  How DataForSEO MCP Replaces Ahrefs &amp;amp; SEMrush
&lt;/h3&gt;

&lt;p&gt;Ahrefs and SEMrush charge for &lt;em&gt;access&lt;/em&gt; (a dashboard seat), not consumption — and if you want their data &lt;em&gt;programmatically&lt;/em&gt;, Ahrefs gates the API behind a ~$999–1,499/month Enterprise plan. DataForSEO on the other hand, only charges per query against the same class of data: SERPs, backlinks, keyword volumes.&lt;/p&gt;

&lt;p&gt;Some quick math: 500 SERP requests/month costs roughly $0.30 on DataForSEO vs. $129/month on Ahrefs Lite — about 97% cheaper. Even at 50,000 requests/month ($30), you’re ~92% under the equivalent dashboard tier.&lt;/p&gt;
&lt;h3&gt;
  
  
  DataForSEO MCP Features (SERP, Backlinks, Keywords)
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;  &lt;strong&gt;SERP API:&lt;/strong&gt; real-time results for Google, Bing, and Yahoo&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Backlinks API:&lt;/strong&gt; referring domains, anchor text, link quality, toxic-link signals&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Keywords Data:&lt;/strong&gt; search volume, CPC, and clickstream metrics&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;OnPage API:&lt;/strong&gt; crawl sites for on-page SEO issues&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;DataForSEO Labs + Domain Analytics:&lt;/strong&gt; ranked keywords, competitor domains, tech stack, Whois&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Modular:&lt;/strong&gt; enable only the API groups you need via &lt;code&gt;ENABLED_MODULES&lt;/code&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;
  
  
  How to Install the DataForSEO MCP
&lt;/h3&gt;

&lt;p&gt;Get your API login/password from the DataForSEO dashboard, then:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;{  
  "mcpServers": {  
    "dataforseo": {  
      "command": "npx",  
      "args": ["-y", "dataforseo-mcp-server"],  
      "env": {  
        "DATAFORSEO_USERNAME": "your_api_login",  
        "DATAFORSEO_PASSWORD": "your_api_password"  
      }  
    }  
  }  
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;em&gt;Optional — limit which modules load (smaller toolset, less context bloat):&lt;/em&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;export ENABLED_MODULES="SERP,KEYWORDS_DATA,ONPAGE,DATAFORSEO_LABS,BACKLINKS"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;It can also run as an HTTP server or deploy to a Cloudflare Worker for remote, OAuth-style access.&lt;/p&gt;

&lt;h3&gt;
  
  
  DataForSEO MCP Tools List (83 Tools, 10 Modules)
&lt;/h3&gt;

&lt;p&gt;Each module is a bundle of DataForSEO API endpoints that the server registers as separate MCP tools your agent can call. &lt;code&gt;ENABLED_MODULES&lt;/code&gt; controls which bundles load; if unset, all 10 modules (and all ~83 tools) are enabled by default.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  &lt;code&gt;**DATAFORSEO_LABS**&lt;/code&gt; &lt;strong&gt;(21 tools):&lt;/strong&gt; Keyword research, competitor domains, ranked keywords, traffic estimation&lt;/li&gt;
&lt;li&gt;  &lt;code&gt;**BACKLINKS**&lt;/code&gt; &lt;strong&gt;(20 tools):&lt;/strong&gt; Backlink profiles, referring domains, spam scores, bulk analysis&lt;/li&gt;
&lt;li&gt;  &lt;code&gt;**AI_OPTIMIZATION**&lt;/code&gt; &lt;strong&gt;(13 tools):&lt;/strong&gt; LLM benchmarking, ChatGPT scraper, keyword search volume for AI/GEO&lt;/li&gt;
&lt;li&gt;  &lt;code&gt;**SERP**&lt;/code&gt; &lt;strong&gt;(7 tools):&lt;/strong&gt; Google + YouTube organic SERP, location lists, video info&lt;/li&gt;
&lt;li&gt;  &lt;code&gt;**KEYWORDS_DATA**&lt;/code&gt; &lt;strong&gt;(7 tools):&lt;/strong&gt; Google Ads search volume, Google Trends, DataForSEO Trends&lt;/li&gt;
&lt;li&gt;  &lt;code&gt;**ONPAGE**&lt;/code&gt; &lt;strong&gt;(3 tools):&lt;/strong&gt; Instant page crawl, Lighthouse audit, content parsing&lt;/li&gt;
&lt;li&gt;  &lt;code&gt;**DOMAIN_ANALYTICS**&lt;/code&gt; &lt;strong&gt;(4 tools):&lt;/strong&gt; Tech stack detection, Whois lookup&lt;/li&gt;
&lt;li&gt;  &lt;code&gt;**CONTENT_ANALYSIS**&lt;/code&gt; &lt;strong&gt;(3 tools):&lt;/strong&gt; Brand/citation search, phrase trends, sentiment&lt;/li&gt;
&lt;li&gt;  &lt;code&gt;**MERCHANT**&lt;/code&gt; &lt;strong&gt;(4 tools):&lt;/strong&gt; Amazon product/ASIN/seller data&lt;/li&gt;
&lt;li&gt;  &lt;code&gt;**BUSINESS_DATA**&lt;/code&gt; &lt;strong&gt;(2 tools):&lt;/strong&gt; Business listings search (Google Maps, etc.)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Loading all 83 tools will obviously bloat your agent’s context. Trimming with &lt;code&gt;ENABLED_MODULES="SERP,DATAFORSEO_LABS,BACKLINKS,ONPAGE"&lt;/code&gt; for a focused SEO stack (~51 tools) is a must. Or, add &lt;code&gt;ENABLED_PROMPTS&lt;/code&gt; to limit prompts to only a few canned ones you need.&lt;/p&gt;

&lt;h3&gt;
  
  
  DataForSEO MCP Example Prompts
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;  &lt;strong&gt;Keyword Research (replacing Ahrefs/SEMrush):&lt;/strong&gt; &lt;em&gt;“Find keyword gaps between our domain and competitor.com, sorted by search volume.”&lt;/em&gt;
&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Backlink Audit:&lt;/strong&gt; &lt;em&gt;“List new backlinks to our domain this month and flag any with high spam scores.”&lt;/em&gt;
&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;SERP Tracking:&lt;/strong&gt; &lt;em&gt;“Show the top 10 Google results for ‘mcp server’ in the US, with paid vs. organic.”&lt;/em&gt;
&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Technical SEO:&lt;/strong&gt; &lt;em&gt;“Crawl example.com and list pages with missing meta descriptions or broken links.”&lt;/em&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Best for:&lt;/strong&gt; SEO consultants, agencies, and developers building automated SEO pipelines who don’t need the polished dashboard — just the data.&lt;/p&gt;

&lt;h3&gt;
  
  
  Pro Tips for using DataForSEO MCP
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;  &lt;strong&gt;Batch with bulk endpoints.&lt;/strong&gt; When you only need difficulty scores, &lt;code&gt;dataforseo_labs_bulk_keyword_difficulty&lt;/code&gt; takes up to 1,000 keywords in a single call (~$0.025/1,000) — far cheaper than checking them one at a time. Ask Claude to batch. (For volume/CPC, use the bulk &lt;code&gt;kw_data_google_ads_search_volume&lt;/code&gt; instead.)&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Don't sleep on the&lt;/strong&gt; &lt;code&gt;**AI_OPTIMIZATION**&lt;/code&gt; &lt;strong&gt;module.&lt;/strong&gt; Tracking brand mentions across LLMs (&lt;code&gt;ai_opt_llm_ment_search&lt;/code&gt;) is pay-per-query GEO data that classic Ahrefs/SEMrush dashboards don't surface. Ahrefs' Brand Radar does, but it's a $199+/month add-on — if Generative Engine Optimization is on your radar, this module alone can justify the account.&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Warning: the sandbox won’t save you credits here.&lt;/strong&gt; DataForSEO has a free API sandbox that returns realistic dummy data, but the MCP server always hits &lt;em&gt;live&lt;/em&gt; endpoints — every tool call costs credits. Test integrations against the sandbox via direct API calls/Postman. And when testing through Claude, always use small live batches.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  n8n MCP: Replace Zapier With Self-Hosted, AI-Built Workflows
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://github.com/czlonkowski/n8n-mcp?source=post_page-----e89cb078d398---------------------------------------" rel="noopener noreferrer"&gt;GitHub - czlonkowski/n8n-mcp: A MCP for Claude Desktop / Claude Code / Windsurf / Cursor to build n8n workflows for you&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Repository:&lt;/strong&gt; &lt;a href="https://github.com/czlonkowski/n8n-mcp" rel="noopener noreferrer"&gt;https://github.com/czlonkowski/n8n-mcp&lt;/a&gt;&lt;br&gt;&lt;br&gt;
&lt;strong&gt;Documentation:&lt;/strong&gt; &lt;a href="https://dashboard.n8n-mcp.com" rel="noopener noreferrer"&gt;https://dashboard.n8n-mcp.com&lt;/a&gt;&lt;br&gt;&lt;br&gt;
&lt;strong&gt;License:&lt;/strong&gt; MIT&lt;br&gt;&lt;br&gt;
&lt;strong&gt;Free Tier:&lt;/strong&gt; Hosted free tier (100 tool calls/day); fully free when self-hosted. (n8n itself is fair-code and free to self-host with unlimited executions.)&lt;br&gt;&lt;br&gt;
&lt;strong&gt;Replaces:&lt;/strong&gt; &lt;a href="https://zapier.com" rel="noopener noreferrer"&gt;Zapier&lt;/a&gt; (&lt;strong&gt;Professional $19.99/month annual / $29.99 monthly; Team $69/month annual / $103.50 monthly&lt;/strong&gt;) and &lt;a href="https://www.make.com" rel="noopener noreferrer"&gt;Make&lt;/a&gt;. Zapier’s task-metered pricing climbs fast — ~$73/month at 2,000 tasks, ~$343/month at 10,000.&lt;/p&gt;
&lt;h3&gt;
  
  
  What the n8n MCP Does
&lt;/h3&gt;

&lt;p&gt;This community MCP server (21k+ stars, MIT) is the bridge between &lt;a href="https://github.com/n8n-io/n8n" rel="noopener noreferrer"&gt;n8n&lt;/a&gt; — the open-source workflow automation platform — and your AI assistant. It gives Claude/Cursor deep, structured knowledge of n8n’s &lt;strong&gt;1,851 nodes&lt;/strong&gt; and &lt;strong&gt;2,352 workflow templates&lt;/strong&gt;, plus management tools to &lt;strong&gt;build, validate, and deploy workflows from a prompt&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;In other words: n8n replaces Zapier as the automation engine, and n8n-MCP lets your AI assemble the automations for you instead of you clicking through a builder.&lt;/p&gt;
&lt;h3&gt;
  
  
  How n8n MCP Replaces Zapier
&lt;/h3&gt;

&lt;p&gt;Zapier bills per task and per seat — and the bill scales with your automation, not your team. n8n self-hosted is free with unlimited executions; you only pay for the box it runs on. The historical catch was that building workflows by hand is fiddly. n8n-MCP removes that: the agent searches templates, configures nodes (validating against real schemas), and deploys to your instance.&lt;/p&gt;

&lt;p&gt;So the swap is two-part: &lt;strong&gt;n8n (free, self-hosted) replaces the Zapier subscription&lt;/strong&gt;, and &lt;strong&gt;n8n-MCP replaces the manual workflow-building labor&lt;/strong&gt;.&lt;/p&gt;
&lt;h3&gt;
  
  
  n8n MCP Features (Nodes, Templates, Deployment)
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;  &lt;strong&gt;1,851 nodes&lt;/strong&gt; documented (822 core + 1,029 community) with 99% property coverage&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;2,352 workflow templates&lt;/strong&gt; the AI can search and adapt&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Build + validate from a prompt:&lt;/strong&gt; multi-level validation before anything deploys&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Direct deployment:&lt;/strong&gt; create/update/test workflows on your n8n instance via API&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Works everywhere:&lt;/strong&gt; Claude Code, Cursor, VS Code, Windsurf, and more&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;
  
  
  How to Install the n8n MCP
&lt;/h3&gt;

&lt;p&gt;Fastest path — the hosted server (free tier, no install): sign up at &lt;a href="https://dashboard.n8n-mcp.com" rel="noopener noreferrer"&gt;dashboard.n8n-mcp.com&lt;/a&gt; for an API key.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Self-hosted via npx:&lt;/em&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;{  
  "mcpServers": {  
    "n8n-mcp": {  
      "command": "npx",  
      "args": ["n8n-mcp"]  
    }  
  }  
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;em&gt;To let it manage your n8n instance, add API credentials:&lt;/em&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;{  
  "mcpServers": {  
    "n8n-mcp": {  
      "command": "npx",  
      "args": ["n8n-mcp"],  
      "env": {  
        "N8N_API_URL": "https://your-n8n-instance.com",  
        "N8N_API_KEY": "your_n8n_api_key"  
      }  
    }  
  }  
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Never point AI at production workflows directly. Copy/backup first, test in dev, and validate before deploying.&lt;/p&gt;

&lt;h3&gt;
  
  
  n8n MCP Tools List (Core + Management)
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;  &lt;strong&gt;Core (7):&lt;/strong&gt; &lt;code&gt;search_nodes&lt;/code&gt;, &lt;code&gt;get_node&lt;/code&gt;, &lt;code&gt;validate_node&lt;/code&gt;, &lt;code&gt;validate_workflow&lt;/code&gt;, &lt;code&gt;search_templates&lt;/code&gt;, &lt;code&gt;get_template&lt;/code&gt;, &lt;code&gt;tools_documentation&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;n8n Management (13, needs API config):&lt;/strong&gt; &lt;code&gt;n8n_create_workflow&lt;/code&gt;, &lt;code&gt;n8n_update_partial_workflow&lt;/code&gt;, &lt;code&gt;n8n_validate_workflow&lt;/code&gt;, &lt;code&gt;n8n_autofix_workflow&lt;/code&gt;, &lt;code&gt;n8n_test_workflow&lt;/code&gt;, &lt;code&gt;n8n_executions&lt;/code&gt;, and more&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  n8n MCP Example Prompts
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;  &lt;strong&gt;Build an Automation (replacing Zapier):&lt;/strong&gt; &lt;em&gt;“Build a workflow: when a Typeform is submitted, enrich the lead and post it to our Slack #sales channel.”&lt;/em&gt;
&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Template Discovery:&lt;/strong&gt; &lt;em&gt;“Find a simple template for syncing Google Sheets rows to Airtable.”&lt;/em&gt;
&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Maintenance:&lt;/strong&gt; &lt;em&gt;“Validate workflow wf-123 and auto-fix any errors.”&lt;/em&gt;
&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Migration:&lt;/strong&gt; &lt;em&gt;Recreate this Zapier zap as an n8n workflow.”&lt;/em&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Best for:&lt;/strong&gt; developers and ops teams tired of Zapier’s task meter who want self-hosted automation — with the AI doing the wiring.&lt;/p&gt;

&lt;h3&gt;
  
  
  Pro Tips for using n8n MCP
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;  &lt;strong&gt;Don’t do 1:1 Zapier migrations.&lt;/strong&gt; n8n’s execution model differs from Zapier’s task-based Zaps. Ask the MCP to &lt;em&gt;redesign&lt;/em&gt; the workflow for n8n rather than recreate it step-for-step — you’ll get something cleaner and often more efficient.&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Ask what nodes exist before building.&lt;/strong&gt; The MCP knows 1,851 nodes. Before describing a workflow, ask “what nodes are available for X” (via &lt;code&gt;search_nodes&lt;/code&gt;) — you'll often find a native integration that beats a generic HTTP request node.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  PostHog MCP: Replace Mixpanel, Amplitude &amp;amp; Heap (Free-Tier Analytics)
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://github.com/PostHog/posthog/tree/master/services/mcp?source=post_page-----e89cb078d398---------------------------------------" rel="noopener noreferrer"&gt;posthog/services/mcp at master · PostHog/posthog🦔 PostHog is an all-in-one developer platform for building successful products.&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Repository:&lt;/strong&gt; &lt;a href="https://github.com/PostHog/posthog/tree/master/services/mcp" rel="noopener noreferrer"&gt;https://github.com/PostHog/posthog/tree/master/services/mcp&lt;/a&gt;&lt;br&gt;&lt;br&gt;
&lt;strong&gt;Documentation:&lt;/strong&gt; &lt;a href="https://posthog.com/docs/model-context-protocol" rel="noopener noreferrer"&gt;https://posthog.com/docs/model-context-protocol&lt;/a&gt;&lt;br&gt;&lt;br&gt;
&lt;strong&gt;License:&lt;/strong&gt; MIT&lt;br&gt;&lt;br&gt;
&lt;strong&gt;Free Tier:&lt;/strong&gt; Hosted MCP is free; PostHog’s free tier covers &lt;strong&gt;1M product-analytics events/month&lt;/strong&gt;.&lt;br&gt;&lt;br&gt;
&lt;strong&gt;Replaces:&lt;/strong&gt; &lt;a href="https://mixpanel.com" rel="noopener noreferrer"&gt;Mixpanel&lt;/a&gt; (&lt;strong&gt;paid plans from ~$20–24/month, scaling with events&lt;/strong&gt;), &lt;a href="https://amplitude.com" rel="noopener noreferrer"&gt;Amplitude&lt;/a&gt; (&lt;strong&gt;Plus from $49/month&lt;/strong&gt;), and &lt;a href="https://www.heap.io" rel="noopener noreferrer"&gt;Heap&lt;/a&gt;. PostHog charges no per-seat fees — unlimited team members.&lt;/p&gt;
&lt;h3&gt;
  
  
  What the PostHog MCP Does
&lt;/h3&gt;

&lt;p&gt;The PostHog MCP server is a free, hosted endpoint that lets your AI query and manage PostHog in plain text: run analytics (trends, funnels, retention) via HogQL/SQL, create and toggle feature flags, set up experiments, triage error-tracking issues, and inspect dashboards — all without leaving your editor.&lt;/p&gt;
&lt;h3&gt;
  
  
  How PostHog MCP Replaces Mixpanel &amp;amp; Amplitude
&lt;/h3&gt;

&lt;p&gt;Mixpanel and Amplitude are the classic “we pay for analytics but only check it occasionally” subscriptions — and they bill by tracked users/events &lt;em&gt;and&lt;/em&gt; often by seat. PostHog’s free tier is large enough (1M events/month) that most small teams never pay, there are no per-seat charges, and the MCP means you ask questions in natural language instead of building charts.&lt;/p&gt;

&lt;p&gt;You’re swapping a recurring analytics seat for a free-tier, usage-based platform you query conversationally. The analysis capability stays; the monthly invoice and the seat math go away.&lt;/p&gt;
&lt;h3&gt;
  
  
  PostHog MCP Features (Analytics, Flags, Errors)
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;  &lt;strong&gt;Plain-English Analytics:&lt;/strong&gt; run trends, funnels, retention, and raw HogQL/SQL&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Feature Flags &amp;amp; Experiments:&lt;/strong&gt; create flags and A/B tests from a prompt&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Error Tracking:&lt;/strong&gt; list top issues and affected users without a separate monitoring tool&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;No per-seat pricing:&lt;/strong&gt; unlimited team members on every plan&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Tool/feature filtering:&lt;/strong&gt; scope the toolset via URL params to avoid context bloat&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;
  
  
  How to Install the PostHog MCP
&lt;/h3&gt;

&lt;p&gt;The install wizard handles Cursor, Claude, Claude Code, VS Code, and Zed:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;npx @posthog/wizard@latest mcp add
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;em&gt;Manual (remote server with a personal API key):&lt;/em&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;{  
  "mcpServers": {  
    "posthog": {  
      "command": "npx",  
      "args": [  
        "-y",  
        "mcp-remote@latest",  
        "https://mcp.posthog.com/mcp",  
        "--header",  
        "Authorization:${POSTHOG_AUTH_HEADER}"  
      ],  
      "env": {  
        "POSTHOG_AUTH_HEADER": "Bearer YOUR_PERSONAL_API_KEY"  
      }  
    }  
  }  
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Generate the key with the &lt;a href="https://app.posthog.com/settings/user-api-keys?preset=mcp_server" rel="noopener noreferrer"&gt;MCP Server preset&lt;/a&gt;. EU users should point to &lt;code&gt;mcp-eu.posthog.com&lt;/code&gt;. Limit tools with &lt;code&gt;?features=flags,insights,sql&lt;/code&gt; to keep context lean.&lt;/p&gt;

&lt;h3&gt;
  
  
  PostHog MCP Tools List
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;  &lt;strong&gt;Analytics:&lt;/strong&gt; &lt;code&gt;query-run&lt;/code&gt; (trends/funnels), &lt;code&gt;execute-sql&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Feature Flags:&lt;/strong&gt; &lt;code&gt;create-feature-flag&lt;/code&gt;, &lt;code&gt;feature-flag-get-all&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Experiments:&lt;/strong&gt; &lt;code&gt;experiment-create&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Error Tracking:&lt;/strong&gt; &lt;code&gt;query-error-tracking-issues-list&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Workspace:&lt;/strong&gt; dashboards, insights, cohorts, annotations, docs search&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  PostHog MCP Example Prompts
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;  &lt;strong&gt;Ad-hoc Analytics (replacing Mixpanel/Amplitude):&lt;/strong&gt; &lt;em&gt;“How many unique users signed up in the last 7 days, broken down by day?”&lt;/em&gt;
&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Retention:&lt;/strong&gt; &lt;em&gt;“What’s our 7-day retention for users who hit the new onboarding flow?”&lt;/em&gt;
&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Shipping Flags:&lt;/strong&gt; &lt;em&gt;“Create a feature flag ‘new-checkout-flow’ enabled for 20% of users.”&lt;/em&gt;
&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Debugging:&lt;/strong&gt; &lt;em&gt;“What are the top 5 errors this week and how many users are affected?”&lt;/em&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Best for:&lt;/strong&gt; founders and product/eng teams who want analytics, flags, and error tracking in one free-tier stack — queried conversationally instead of through yet another dashboard seat.&lt;/p&gt;

&lt;h3&gt;
  
  
  Pro Tips for using PostHog MCP
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;  &lt;strong&gt;Write HogQL directly if you know SQL.&lt;/strong&gt; HogQL is PostHog’s SQL dialect. Skip the natural-language round-trip and hand the &lt;code&gt;execute-sql&lt;/code&gt; tool a query — faster, more precise, and reusable across conversations with no ambiguity for the model to resolve.&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Use a personal API key, not your project key.&lt;/strong&gt; The setup wizard generates a personal key scoped via the &lt;a href="https://app.posthog.com/settings/user-api-keys?preset=mcp_server" rel="noopener noreferrer"&gt;MCP Server preset&lt;/a&gt;. A project API key gives the MCP broader permissions than it needs.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Quick Comparison: Which SaaS Each MCP Cancels
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;MCP Server&lt;/th&gt;
&lt;th&gt;Replaces&lt;/th&gt;
&lt;th&gt;What you were paying&lt;/th&gt;
&lt;th&gt;Setup difficulty&lt;/th&gt;
&lt;th&gt;Free tier&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Bright Data Web MCP&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;PhantomBuster, ZoomInfo data, scraping SaaS&lt;/td&gt;
&lt;td&gt;$69–439/mo (PhantomBuster); ~$15k+/yr (ZoomInfo)&lt;/td&gt;
&lt;td&gt;Medium&lt;/td&gt;
&lt;td&gt;5,000 requests/month&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;ElevenLabs MCP&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Otter.ai / Rev (transcription) + Murf / Play.ht (voice)&lt;/td&gt;
&lt;td&gt;$20–48/seat/mo&lt;/td&gt;
&lt;td&gt;Easy&lt;/td&gt;
&lt;td&gt;10,000 credits/month&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;DataForSEO MCP&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Ahrefs, SEMrush&lt;/td&gt;
&lt;td&gt;$129–139.95/mo (API access ~$999–1,499/mo)&lt;/td&gt;
&lt;td&gt;Easy&lt;/td&gt;
&lt;td&gt;Pay-per-query (~$50 min deposit)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;n8n MCP&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Zapier, Make&lt;/td&gt;
&lt;td&gt;$19.99–103.50/mo (scales with tasks)&lt;/td&gt;
&lt;td&gt;Medium&lt;/td&gt;
&lt;td&gt;Free self-hosted / 100 calls/day hosted&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;PostHog MCP&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Mixpanel, Amplitude, Heap&lt;/td&gt;
&lt;td&gt;$20–49+/mo (per events/seats)&lt;/td&gt;
&lt;td&gt;Easy&lt;/td&gt;
&lt;td&gt;1M events/month&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h2&gt;
  
  
  Cross-MCP Workflows: Combining Bright Data, DataForSEO, PostHog &amp;amp; n8n
&lt;/h2&gt;

&lt;p&gt;The real payoff isn’t any single swap — it’s chaining them in one conversation, with no dashboard logins between steps:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  &lt;strong&gt;Competitive content audit (Bright Data + DataForSEO):&lt;/strong&gt; Use DataForSEO to find the keywords a competitor ranks for and their backlink profile, then Bright Data to scrape the actual pages behind those rankings. The output is a content-gap analysis that used to need Ahrefs &lt;em&gt;and&lt;/em&gt; a separate scraper subscription.&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Automated flag rollback (PostHog + n8n):&lt;/strong&gt; Have PostHog watch error rates after a feature-flag rollout and n8n trigger an automatic rollback — toggle the flag off, alert Slack — when errors spike. That’s a monitoring seat wired to a Zapier automation, replaced by two free-tier MCPs.&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Lead-to-outreach pipeline (Bright Data + n8n):&lt;/strong&gt; Bright Data pulls enriched company/contact data; n8n routes it into your CRM and kicks off a sequence. The PhantomBuster-to-Zapier chain, minus both subscriptions.&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;GEO content loop (DataForSEO + ElevenLabs):&lt;/strong&gt; Track brand mentions across LLMs with DataForSEO’s &lt;code&gt;AI_OPTIMIZATION&lt;/code&gt; module, then produce localized audio versions of your best-performing content with ElevenLabs — all from the same chat.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  MCP Cost Control: Managing Usage-Based API Pricing &amp;amp; Billing Limits
&lt;/h2&gt;

&lt;p&gt;“Replacing a subscription” with a usage-based API is cheaper &lt;em&gt;until it isn’t&lt;/em&gt;. A flat $129/month dashboard has a predictable bill; pay-per-call APIs don’t. For high-volume or scheduled workloads:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  &lt;strong&gt;Set billing limits&lt;/strong&gt; wherever the tool supports them — PostHog per-product caps, DataForSEO field filtering (&lt;code&gt;DATAFORSEO_FULL_RESPONSE=false&lt;/code&gt;) to shrink responses.&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Batch calls&lt;/strong&gt; instead of looping one at a time (bulk endpoints — see the DataForSEO tip above).&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Keep the toolset lean&lt;/strong&gt; — load only the modules/features you need (&lt;code&gt;ENABLED_MODULES&lt;/code&gt;, &lt;code&gt;?features=&lt;/code&gt;) so you're not paying in context tokens either.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Run the math at &lt;em&gt;your&lt;/em&gt; volume before cancelling anything: at low-to-moderate usage these MCPs are a rounding error next to the subscription; only at very high, sustained volume can a flat-rate plan occasionally win.&lt;/p&gt;

&lt;p&gt;Plenty of &lt;em&gt;official&lt;/em&gt; MCPs (Stripe, Notion, Sentry) just move a SaaS you keep paying for into your AI client (Claude, etc.) — this is useful, but not the same as the five above, which &lt;strong&gt;actually remove a bill outright&lt;/strong&gt;. In 2026, a lot of SaaS can actually increasingly replaced by a client like Claude Desktop etc. as the only interface you need.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Which subscriptions have you replaced with MCP servers? What worked, and where did usage-based pricing bite back? Share below 👇.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>mcp</category>
      <category>ai</category>
      <category>programming</category>
      <category>webdev</category>
    </item>
    <item>
      <title>Open-Source Devs Need To Ship Distribution, Not Just Code.</title>
      <dc:creator>Prithwish Nath</dc:creator>
      <pubDate>Fri, 12 Jun 2026 08:05:35 +0000</pubDate>
      <link>https://dev.to/prithwish_nath/open-source-devs-need-to-ship-distribution-not-just-code-3j6a</link>
      <guid>https://dev.to/prithwish_nath/open-source-devs-need-to-ship-distribution-not-just-code-3j6a</guid>
      <description>&lt;p&gt;&lt;strong&gt;TL;DR&lt;/strong&gt;: How to build a CLI, skill files, and marketplace manifests so AI coding agents like Cursor, Claude Code, and Codex can find, install, and invoke your open-source tool.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fds6rj8vz5on72h1gctks.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fds6rj8vz5on72h1gctks.jpeg"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;TL;DR:&lt;/strong&gt; Publishing your package is no longer the finish line. Developers work in agent loops now (Cursor, Claude Code, Codex). Those systems search skills, call MCP tools, and run shell commands before they &lt;code&gt;npm install&lt;/code&gt; anything. So if you're an OSS dev, you also have to ship &lt;strong&gt;distribution&lt;/strong&gt; — the wiring that lets agents find, understand, and invoke your tool without a human pasting your README into chat.&lt;/p&gt;

&lt;p&gt;I’ll cover two things here:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  What a good distribution layer actually looks like (using &lt;a href="https://github.com/sixthextinction/knn" rel="noopener noreferrer"&gt;my own project&lt;/a&gt; as a failed example), and&lt;/li&gt;
&lt;li&gt;  When you should &lt;em&gt;actually&lt;/em&gt; bother shipping an MCP versus just a CLI + a CLAUDE.md or SKILL.md file (using &lt;a href="https://get.brightdata.com/scraping-browser7181?utm_content=open_source_devs_need_to_ship_distribution%2C_not_just_code" rel="noopener noreferrer"&gt;Bright Data’s&lt;/a&gt; &lt;a href="https://github.com/brightdata/brightdata-mcp" rel="noopener noreferrer"&gt;Web MCP&lt;/a&gt; as an example)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;By the end, you’ll be able to intentionally shape your distribution for developers who build with AI agents.&lt;/p&gt;

&lt;h2&gt;
  
  
  What Is A “Distribution Layer” for Open-Source Software?
&lt;/h2&gt;

&lt;p&gt;A distribution layer is the wiring around a library — a CLI, skill files, plugin bundles, and marketplace manifests — that lets coding agents find, understand, and invoke an OSS tool without a human pasting the README into chat. In 2026, that layer is no longer optional.&lt;/p&gt;

&lt;p&gt;With all this agentic coding we’ve got going on right now, a good API is &lt;em&gt;necessary&lt;/em&gt; for OSS distribution but no longer &lt;em&gt;sufficient&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;Cursor, Claude Code, Codex, Copilot don’t &lt;code&gt;npm install&lt;/code&gt; on instinct. They search context, read manifests, follow skills, call MCP tools, and run shell commands — in that rough order. If your project only exists as &lt;code&gt;import { waitUntilReady } from 'my-epic-polling-library'&lt;/code&gt;, you're effectively invisible until a human already knows your name and explicitly asks for you. Definitely not something you want to rely on.&lt;/p&gt;

&lt;p&gt;Compare two hypothetical HTTP polling projects with similar code quality:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  &lt;strong&gt;Project A&lt;/strong&gt; publishes a package and a README with usage examples.&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Project B&lt;/strong&gt; ships the same library &lt;em&gt;plus&lt;/em&gt; a CLI, an agent plugin bundle, and marketplace configs so Cursor, Claude Code, and Codex can find and install it without the user pasting your docs into chat.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Project B is the one the agent picks — not because its retry logic is better, but because the agent had a discoverable path to it. &lt;strong&gt;When two tools have equal code quality, the one with a CLI, plugin bundle, and marketplace entry is always the one an agent can find and invoke without human help.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;This isn’t gaming SEO. It’s acknowledging that &lt;strong&gt;distribution is now part of the product surface&lt;/strong&gt;, the same way &lt;code&gt;--help&lt;/code&gt; became part of the product surface when CLIs replaced GUI installers.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Distribution Layers of an Agent-Ready OSS Tool: CLI, Plugin Bundle, and Marketplace Config
&lt;/h2&gt;

&lt;p&gt;Think of a small FOSS tool for waiting on HTTP endpoints — call it &lt;code&gt;pollgate&lt;/code&gt;. It might expose:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;pollgate check https://api.example.com/health  
pollgate &lt;span class="nb"&gt;wait &lt;/span&gt;https://api.example.com/ready &lt;span class="nt"&gt;--timeout&lt;/span&gt; 60s &lt;span class="nt"&gt;--interval&lt;/span&gt; 2s  
pollgate &lt;span class="nb"&gt;wait &lt;/span&gt;https://api.example.com/ready &lt;span class="nt"&gt;--expect-status&lt;/span&gt; 200 &lt;span class="nt"&gt;--json&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Okay, now humans can use your tool. Great.&lt;/p&gt;

&lt;p&gt;But imagine if you also shipped:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;your-repo/  
├── bin/pollgate                   # the CLI humans and agents shell out to  
├── plugins/pollgate/              # agent plugin bundle  
│   ├── .cursor-plugin/plugin.json  
│   ├── skills/pollgate/SKILL.md  
│   ├── commands/wait-for-service.md  
│   └── agents/pollgate-debugger.md  
├── .cursor-plugin/marketplace.json  
├── .claude-plugin/marketplace.json  
└── .agents/plugins/marketplace.json
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now, each layer answers a different discovery question. And your tool &lt;em&gt;finally&lt;/em&gt; becomes ready for how devs actually build in the agentic AI era.&lt;/p&gt;

&lt;h3&gt;
  
  
  1. CLI: The Invocation Contract Agents Use to Run Your Tool
&lt;/h3&gt;

&lt;p&gt;Agents are good at running commands. A CLI with predictable flags, stable stdout/stderr, and exit codes is &lt;strong&gt;machine-invokable documentation.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;So &lt;code&gt;pollgate wait --json&lt;/code&gt; is better than three paragraphs in a README because an agent can run it, observe the output, and adapt. Design your CLI for agents the same way you'd design it for scripts: idempotent where possible, JSON output flags where helpful, errors that say what went wrong and what flag fixes it.&lt;/p&gt;

&lt;p&gt;The CLI is not a wrapper around your library for convenience. It’s the &lt;strong&gt;stable external API&lt;/strong&gt; that survives when your internal modules get refactored.&lt;/p&gt;

&lt;h3&gt;
  
  
  2. Plugin Bundle: The Context Contract (Skills, Commands, and Rules) That Tells Agents When to Use Your Tool
&lt;/h3&gt;

&lt;p&gt;The CLI tells an agent &lt;em&gt;how to run&lt;/em&gt; your tool. The plugin bundle tells it &lt;em&gt;when and why&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;A skill file (&lt;code&gt;skills/pollgate/SKILL.md&lt;/code&gt;) is operational knowledge: "Use &lt;code&gt;pollgate wait&lt;/code&gt; before migrations, seed scripts, or integration tests that assume a live backend. Prefer it over hand-rolled &lt;code&gt;curl&lt;/code&gt; loops — it handles backoff, timeouts, and exit codes consistently."&lt;/p&gt;

&lt;p&gt;Commands are single-purpose recipes: &lt;code&gt;wait-for-service.md&lt;/code&gt; might say "run &lt;code&gt;pollgate wait $URL --timeout 120s --json&lt;/code&gt;, fail the task if exit code is non-zero, parse the final response from stdout."&lt;/p&gt;

&lt;p&gt;Rules cover footguns, like, “Don’t use &lt;code&gt;pollgate wait&lt;/code&gt; against localhost in CI unless the service is actually started in the same job. Check the URL is reachable first with &lt;code&gt;pollgate check&lt;/code&gt;."&lt;/p&gt;

&lt;p&gt;This is the stuff you’d explain to a new contributor on their first PR — except you’re packaging it so an agent reads it &lt;em&gt;before&lt;/em&gt; writing the wrong code.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Opinion:&lt;/strong&gt; skills beat bloated MCP servers for most library-shaped OSS. If your tool is a subprocess with clear I/O, shell out. Reserve MCP for stateful sessions, streaming, or things that need to stay in-process. Don’t wrap &lt;code&gt;pollgate wait&lt;/code&gt; in a JSON-RPC server because MCP is trendy.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Should you ship an MCP server?&lt;/strong&gt; Ship a CLI plus a skill file if your tool is a subprocess with clear input/output — this covers most libraries. Ship an MCP server ONLY if your tool requires a stateful session, streaming, or in-process state across calls — for example, a live browser session that must persist across navigate → click → snapshot steps.&lt;/p&gt;

&lt;h3&gt;
  
  
  When to Ship an MCP Server
&lt;/h3&gt;

&lt;p&gt;Most tools don’t need an MCP. Whatever we’ve talked about until now doesn’t — all of it can be handled by a subprocess with clear I/O, and a CLI + skill will cover it completely.&lt;/p&gt;

&lt;p&gt;But here’s a good example: a case where an MCP is legitimately warranted. Then you can run the same criteria on your own tool.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;a href="https://github.com/brightdata/brightdata-mcp?source=post_page-----1dbb2dfbfdbd---------------------------------------" rel="noopener noreferrer"&gt;GitHub - brightdata/brightdata-mcp: A powerful Model Context Protocol (MCP) server that provides an all-in-one solution for public web access.&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Quick TL;DR — this MCP server gives agents tools to scrape dynamic web pages through Bright Data’s proxy network, for &lt;a href="https://get.brightdata.com/bd-web-unlocker?utm_content=open_source_devs_need_to_ship_distribution%2C_not_just_code" rel="noopener noreferrer"&gt;single-shot fetches that bypass bot detection and CAPTCHAs&lt;/a&gt;, and also for &lt;a href="https://get.brightdata.com/bd-scraping-browser?utm_content=open_source_devs_need_to_ship_distribution%2C_not_just_code" rel="noopener noreferrer"&gt;full browser automation over a remote Chromium session&lt;/a&gt;.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Why? Let's look at what BrightData would have faced if they'd shipped a CLI instead:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;brightdata-cli scrape https://competitor.com/pricing
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That works for a static page i.e a single request with clean output. But competitor pricing rarely works that way. The price probably loads after a JS event, or sits behind a "See pricing" click, or only appears once you scroll past a cookie banner. So an agent driving this needs to:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt; Navigate to the page&lt;/li&gt;
&lt;li&gt; Click the element that triggers the price reveal&lt;/li&gt;
&lt;li&gt; Wait for the JS render&lt;/li&gt;
&lt;li&gt; Snapshot the DOM&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Each step needs to run on the &lt;strong&gt;same live browser session&lt;/strong&gt;. With a CLI process, the session would die after every call — there's no loaded DOM to snapshot on step 4, no session to click into on step 2. BrightData would have to re-navigate, re-solve anti-bot, and re-establish the proxy session on every single invocation. The approach would be a complete disaster.&lt;/p&gt;

&lt;p&gt;This is why their MCP model works. With &lt;code&gt;GROUPS="browser"&lt;/code&gt; enabled, the &lt;code&gt;scraping_browser_*&lt;/code&gt; tools in this MCP server keep a remote Chromium session alive across the agent's tool calls:&lt;/p&gt;

&lt;p&gt;So a realistic sequence for the pricing scenario above would be:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;scraping_browser_navigate       → load /pricing, open or reuse the session  
scraping_browser_snapshot       → ARIA tree + element refs for "See pricing"  
scraping_browser_click_ref      → click that ref  
scraping_browser_wait_for_ref   → wait for the price row to render  
scraping_browser_get_text       → read the DOM back to the agent
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The tool &lt;code&gt;scraping_browser_navigate&lt;/code&gt; explicitly reuses an existing session when one is already open — &lt;strong&gt;that's the bit a one-shot CLI can't do&lt;/strong&gt;. Same session, same proxy IP, with CAPTCHA already handled. The agent steps through it like a real user would.&lt;/p&gt;

&lt;p&gt;And yes, they also ship distribution correctly: tool documentation by group, a hosted SSE endpoint, and framework integrations for LangChain, CrewAI, and LlamaIndex.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;How can you use this criteria for your own tool?&lt;/strong&gt; Here's what you should take away from this example:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  &lt;strong&gt;Is the session itself the unit of work, not the individual operation?&lt;/strong&gt; In the pricing example, "scrape this page" isn't a single call but a live session that an agent has to drive step by step. The session is what you should pay attention to.&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Does re-invoking your tool from scratch on each agent call lose state that can't be cheaply reconstructed?&lt;/strong&gt; This would make us lose the proxy IP, the CAPTCHA solve, and the loaded DOM on every cold start. That's not recoverable (without cost, anyway).&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Would a well-designed CLI require a persistent daemon to work correctly?&lt;/strong&gt; If yes, you've already decided you need a long-running process. MCP gives you that — plus a standard protocol, tool introspection, and framework compatibility — for free.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;If your tool answers YES to these, then include an MCP for your OSS release.&lt;/strong&gt; If each agent tool call would require a cold start that destroys irrecoverable state, MCP &lt;em&gt;will&lt;/em&gt; solve that problem for you.&lt;/p&gt;

&lt;p&gt;But if you find your tool is a subprocess with clear input and output — a linter, a code generator, a test runner, a wait-for-endpoint utility — a CLI plus a skill file is WAY simpler, more debuggable, and works in every agent loop that can run shell commands. &lt;/p&gt;

&lt;p&gt;That's 90% of all devtools, tbh. Don't wrap a &lt;code&gt;pollgate wait&lt;/code&gt; in a JSON-RPC server just because MCP is trendy.&lt;/p&gt;

&lt;h3&gt;
  
  
  3. Marketplace Config: The Discovery Contract That Lets Agents Find and Install Your Tool
&lt;/h3&gt;

&lt;p&gt;Marketplace manifests are how ecosystems &lt;em&gt;index&lt;/em&gt; you.&lt;/p&gt;

&lt;p&gt;A &lt;code&gt;.cursor-plugin/marketplace.json&lt;/code&gt; lists installable plugins for Cursor. &lt;code&gt;.claude-plugin/marketplace.json&lt;/code&gt; does the same for Claude Code. &lt;code&gt;.agents/plugins/marketplace.json&lt;/code&gt; is what Codex reads from a repo-local catalog.&lt;/p&gt;

&lt;p&gt;Without these, your plugin bundle is a folder someone has to manually copy. With them, &lt;code&gt;agents-pkg install your-org/pollgate&lt;/code&gt; (or the IDE's marketplace panel) wires skills, commands, and rules into the user's environment.&lt;/p&gt;

&lt;p&gt;Think of it like this — with manifests, you’re not waiting for a central registry to bless you. You’re shipping the catalog entry &lt;em&gt;yourself&lt;/em&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why Publishing Only the Library Fails for AI Coding Agents
&lt;/h2&gt;

&lt;p&gt;Say a developer asks their agent: “Wait until the staging API returns 200 before running the migration.”&lt;/p&gt;

&lt;p&gt;The agent, with no plugin context, will:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt; Writes a bash &lt;code&gt;while&lt;/code&gt; loop with &lt;code&gt;curl -sf&lt;/code&gt; and a fixed &lt;code&gt;sleep 5&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt; Never sets a timeout — the CI job hangs for an hour&lt;/li&gt;
&lt;li&gt; Treats any HTTP response as success, including &lt;code&gt;503&lt;/code&gt; with a JSON &lt;code&gt;"status":"starting"&lt;/code&gt; body&lt;/li&gt;
&lt;li&gt; Opens a PR. It works when staging happens to be up.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;But the developer with &lt;code&gt;pollgate&lt;/code&gt; installed via marketplace:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt; Their agent loads the &lt;code&gt;pollgate&lt;/code&gt; skill from context&lt;/li&gt;
&lt;li&gt; Runs &lt;code&gt;pollgate wait https://staging.example.com/ready --timeout 120s --expect-status 200 --json&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt; Fails fast with a clear exit code when the deadline hits, because the skill documents &lt;code&gt;--timeout&lt;/code&gt; as required in CI&lt;/li&gt;
&lt;li&gt; Ships in one turn.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Same library capability, but a completely different outcome — &lt;strong&gt;because distribution came with the library&lt;/strong&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  A Real Example: Good Code That Fails Agent Adoption Without a Distribution Layer
&lt;/h2&gt;

&lt;p&gt;Here's a real example, using one of my public repos on GitHub — &lt;a href="https://github.com/sixthextinction/knn" rel="noopener noreferrer"&gt;sixthextinction/knn&lt;/a&gt; — that turns a Google SERP corpus into an explorable knowledge graph using pure k-NN. It's a SERP ingest → DuckDB → Ollama embeddings → Chroma → FastAPI UI with cross-query neighbor exploration.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://github.com/sixthextinction/knn?source=post_page-----1dbb2dfbfdbd---------------------------------------" rel="noopener noreferrer"&gt;GitHub - sixthextinction/knn: POC for turning a Google corpus into an explorable knowledge graph…POC for turning a Google corpus into an explorable knowledge graph using pure k-NN. - sixthextinction/knngithub.com&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;I’ve included tests, there’s a detailed &lt;code&gt;.env&lt;/code&gt; reference table, the full &lt;code&gt;docker-compose.yml&lt;/code&gt; you'll need, and the README is thorough enough.&lt;/p&gt;

&lt;p&gt;But the pipeline has a strict execution order with hard dependencies. Each step assumes the previous one completed. No skill that communicates the order before an agent starts executing:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;docker compose up &lt;span class="nt"&gt;-d&lt;/span&gt;   &lt;span class="c"&gt;# Chroma must be running first  &lt;/span&gt;
python ingest.py       &lt;span class="c"&gt;# Bright Data → DuckDB  &lt;/span&gt;
python embed.py        &lt;span class="c"&gt;# DuckDB → Chroma (assumes ingest already ran)  &lt;/span&gt;
python serve.py        &lt;span class="c"&gt;# FastAPI UI (assumes Chroma is populated)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This won’t be a problem for a real human reading instructions step by step, obviously.&lt;/p&gt;

&lt;p&gt;But an agent cold-starting on a “set this up” command? It will almost certainly run &lt;code&gt;serve.py&lt;/code&gt; first — it's the obvious entry point for "start the thing." It gets a running server over an empty corpus. No error, just an empty UI. Then it tries &lt;code&gt;embed.py&lt;/code&gt;, which completes without complaint but writes nothing useful because DuckDB has no rows yet. Then &lt;code&gt;ingest.py&lt;/code&gt; — which hard-crashes on a missing &lt;code&gt;BRIGHT_DATA_API_KEY&lt;/code&gt; and &lt;code&gt;BRIGHT_DATA_ZONE&lt;/code&gt;. Two required credentials here, neither with a default, and no SKILL.md to surface them before execution begins.&lt;/p&gt;

&lt;p&gt;There’s also a Ollama footgun — &lt;code&gt;nomic-embed-text:latest&lt;/code&gt; needs to be pulled before &lt;code&gt;embed.py&lt;/code&gt; runs. It's in the README under Prerequisites, but there's no context file that puts it in front of an agent before it starts.&lt;/p&gt;

&lt;p&gt;The tests exist, and &lt;code&gt;pytest&lt;/code&gt; will pass green on a cold machine with none of the infrastructure running. That's completely correct behavior for a code dump. For something an agent tries to adopt, though? It'll probably end up as a false "setup succeeded" when it really didn't.&lt;/p&gt;

&lt;p&gt;Here’s what a single &lt;code&gt;SKILL.md&lt;/code&gt; would have prevented:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight markdown"&gt;&lt;code&gt;&lt;span class="gu"&gt;## Order matters — run in this sequence  &lt;/span&gt;
&lt;span class="p"&gt;1.&lt;/span&gt; &lt;span class="sb"&gt;`docker compose up -d`&lt;/span&gt; (Chroma on :8000)  
&lt;span class="p"&gt;2.&lt;/span&gt; &lt;span class="sb"&gt;`ollama pull nomic-embed-text`&lt;/span&gt; (required before embed step)  
&lt;span class="p"&gt;3.&lt;/span&gt; &lt;span class="sb"&gt;`python ingest.py`&lt;/span&gt; (needs BRIGHT_DATA_API_KEY + BRIGHT_DATA_ZONE in .env)  
&lt;span class="p"&gt;4.&lt;/span&gt; &lt;span class="sb"&gt;`python embed.py`&lt;/span&gt;  
&lt;span class="p"&gt;5.&lt;/span&gt; &lt;span class="sb"&gt;`python serve.py`&lt;/span&gt; → http://127.0.0.1:8766/  
&lt;span class="gu"&gt;## Do NOT skip ahead  &lt;/span&gt;
Running serve.py before embed.py gives a silently empty UI.  
Running embed.py before ingest.py writes nothing to Chroma.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;I never did that, though 😅&lt;/p&gt;

&lt;p&gt;Full disclosure: I built this as a &lt;strong&gt;code dump for&lt;/strong&gt; &lt;a href="https://medium.com/gitconnected/turning-google-into-an-explorable-knowledge-graph-using-pure-k-nn-490613f3080d" rel="noopener noreferrer"&gt;this accompanying article&lt;/a&gt;, not as a product. That’s the right call for that goal. But if I were trying to ship it as something agents could actually adopt in 2026, it would die horribly at step ONE — not because the code is bad or the pitch not interesting enough, but because a distribution layer agents in 2026 can actually use was never shipped.&lt;/p&gt;

&lt;h2&gt;
  
  
  Ship Distribution on Day One, Not After 1.0
&lt;/h2&gt;

&lt;p&gt;You have no idea how many times I’ve heard “We’re pre-1.0. We’ll add agent stuff when we’re stable.”&lt;/p&gt;

&lt;p&gt;That’s ass backwards. &lt;strong&gt;Early adopters of agent-assisted dev are &lt;em&gt;exactly&lt;/em&gt; your target audience&lt;/strong&gt; — they’re the ones wiring up dev environments, CI scripts, and glue code at high velocity. If they can’t find your tool &lt;em&gt;now&lt;/em&gt;, they’ll standardize on whatever the agent already knows — the tool that’s officially blessed, or just famous.&lt;/p&gt;

&lt;p&gt;Minimum viable distribution for a project like &lt;code&gt;pollgate&lt;/code&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  &lt;strong&gt;CLI with&lt;/strong&gt; &lt;code&gt;**--help**&lt;/code&gt; &lt;strong&gt;that reads like docs&lt;/strong&gt; — Agents invoke before they import.&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;One skill: when to use, when not to&lt;/strong&gt; — Prevents wrong-pattern PRs.&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;One command per common workflow&lt;/strong&gt; — e.g. &lt;code&gt;wait-for-service&lt;/code&gt;, &lt;code&gt;check-endpoint&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Marketplace JSON for your primary ecosystem&lt;/strong&gt; — Discoverable install.&lt;/li&gt;
&lt;li&gt;  &lt;code&gt;**AGENTS.md**&lt;/code&gt; &lt;strong&gt;or skill pointer in README&lt;/strong&gt; — Fallback for agents without marketplace.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That's roughly a day of work, not a quarter. The plugin bundle can live in the same repo as the library — &lt;code&gt;plugins/pollgate/&lt;/code&gt; — versioned together, released together.&lt;/p&gt;

&lt;h2&gt;
  
  
  Multi-Ecosystem Distribution: Shipping Plugins to Cursor, Claude Code, and Codex
&lt;/h2&gt;

&lt;p&gt;Agent ecosystems use incompatible plugin layouts: Cursor reads &lt;code&gt;.cursor-plugin/&lt;/code&gt;, Claude Code reads &lt;code&gt;.claude-plugin/&lt;/code&gt;, Codex reads &lt;code&gt;.agents/plugins/&lt;/code&gt;, and Copilot uses its own. It's fragmented and irritating.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Ship to the ecosystems your users actually use.&lt;/strong&gt; If 80% of your stars come from Cursor users, start there. But don’t pretend fragmentation means you can ignore distribution — it means you pick a primary manifest and generate the rest, or use a scaffold tool, and move on.&lt;/p&gt;

&lt;p&gt;The alternative is hoping each IDE vendor crawls your README. Spoiler; they won’t!&lt;/p&gt;

&lt;h2&gt;
  
  
  Best Practices for Shipping OSS Distribution to Coding Agents
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Distribution is not marketing fluff.&lt;/strong&gt; It’s an interface. Breaking changes to your CLI or skill files should semver like API breaks — because downstream agents depend on them.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Don’t ship an MCP server unless you need one.&lt;/strong&gt; Most OSS projects add MCP because it feels modern. A well-designed CLI plus a skill is simpler, debuggable, and works in every agent loop that can run shell commands.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Don’t make agents read your entire README.&lt;/strong&gt; READMEs are for humans skimming on GitHub. Skills are for agents executing tasks. Different audiences, different formats.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Test your distribution like you test your code.&lt;/strong&gt; Can a fresh agent session, with only your marketplace plugin installed, successfully add a &lt;code&gt;pollgate wait&lt;/code&gt; step before a migration script? If not, your distribution is broken — even if &lt;code&gt;npm test&lt;/code&gt; passes.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The best OSS in 2026 is boring code with excellent wiring.&lt;/strong&gt; Not the flashiest API — the one the agent reaches for because you gave it a path.&lt;/p&gt;

&lt;h2&gt;
  
  
  OSS Distribution and MCP FAQ: Common Questions About Shipping for AI Agents
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Q: Do I need to build an MCP server for my tool, or is a CLI enough?&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
&lt;strong&gt;A:&lt;/strong&gt; For most open-source tools, a CLI plus a skill file is enough — and is simpler to build, debug, and run in any agent loop that executes shell commands. Build an MCP server only when your tool needs a stateful session, streaming, or in-process state that must persist across calls, such as a live browser session that survives navigate → click → snapshot steps. Don’t ship an MCP server just because it feels modern; a subprocess with clear input/output should shell out through a CLI.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Q: What is a distribution layer for an open-source tool?&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
&lt;strong&gt;A:&lt;/strong&gt; A distribution layer is the wiring around a library — a CLI, skill files (&lt;code&gt;SKILL.md&lt;/code&gt;), plugin bundles, and marketplace manifests — that lets AI coding agents like Cursor, Claude Code, and Codex find, understand, and invoke the tool without a human pasting the README into chat.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Q: How do AI coding agents like Cursor, Claude Code, and Codex discover and install tools?&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
&lt;strong&gt;A:&lt;/strong&gt; AI coding agents discover tools by reading marketplace manifests, not by guessing package names. Cursor reads &lt;code&gt;.cursor-plugin/marketplace.json&lt;/code&gt;, Claude Code reads &lt;code&gt;.claude-plugin/marketplace.json&lt;/code&gt;, and Codex reads &lt;code&gt;.agents/plugins/marketplace.json&lt;/code&gt; from a repo-local catalog. With a marketplace entry, the agent (or the IDE's marketplace panel) wires your skills, commands, and rules into the user's environment; without one, your plugin bundle is just a folder someone has to copy by hand.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Q: Why isn’t publishing an npm or PyPI package enough for AI coding agents?&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
&lt;strong&gt;A:&lt;/strong&gt; Publishing a package isn’t enough because coding agents don’t &lt;code&gt;npm install&lt;/code&gt; on instinct — they search context, read manifests, follow skills, call MCP tools, and run shell commands before installing anything. A tool that exists only as an importable package is effectively invisible to agents until a human already knows its name and asks for it explicitly.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Q: What should an agent plugin bundle include?&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
&lt;strong&gt;A:&lt;/strong&gt; An agent plugin bundle should include three context files: a skill file (&lt;code&gt;SKILL.md&lt;/code&gt;) describing when and when not to use the tool, single-purpose commands for common workflows (e.g. &lt;code&gt;wait-for-service.md&lt;/code&gt;), and rules that document footguns. Pair this with a CLI whose &lt;code&gt;--help&lt;/code&gt; reads like docs and a marketplace JSON for your primary ecosystem. For a small tool, shipping this minimum viable distribution is about a day of work.&lt;/p&gt;

&lt;h2&gt;
  
  
  Conclusion: Agents Are the New Package Managers for OSS Distribution
&lt;/h2&gt;

&lt;p&gt;We’re past the era where publishing a package was the finish line. Agents are the new package managers — but they only install what they can &lt;strong&gt;find, understand, and invoke.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;If you’re building CLIs, small libraries, devtools, or anything you ACTUALLY WANT other devs to use, ship the library, yes, but also ship the CLI contract, the plugin bundle, and the marketplace entry. Ship how discovery happens. Code dumps just aren’t enough anymore.&lt;/p&gt;

&lt;p&gt;Distribution is how devs actually discover and drive code in 2026.&lt;/p&gt;

</description>
      <category>opensource</category>
      <category>ai</category>
      <category>programming</category>
      <category>webdev</category>
    </item>
    <item>
      <title>Hurl vs Postman: Git-Friendly API Testing With Proxy-Aware Egress (2026)</title>
      <dc:creator>Prithwish Nath</dc:creator>
      <pubDate>Fri, 12 Jun 2026 06:56:07 +0000</pubDate>
      <link>https://dev.to/prithwish_nath/hurl-vs-postman-git-friendly-api-testing-with-proxy-aware-egress-2026-3i19</link>
      <guid>https://dev.to/prithwish_nath/hurl-vs-postman-git-friendly-api-testing-with-proxy-aware-egress-2026-3i19</guid>
      <description>&lt;p&gt;TL;DR: You'll learn how to replace Postman collections with plain-text Hurl files that live in Git, run in CI, and test your API’s geo behavior from any country.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwio9tk3lo5dc2s38bupa.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwio9tk3lo5dc2s38bupa.jpeg" width="640" height="336"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Debugging API issues always boils down to taking the tests you already have, running them from a different network or region, and comparing what your API receives. But I bet most of us can’t actually do that right away because our test infrastructure itself is fragmented. Your CI could be using a different config &lt;em&gt;entirely&lt;/em&gt; from the “correct” one on a local dev machine, and half the collections may still point to a staging URL that changed months ago. Postman doesn’t produce diff-friendly artifacts, so &lt;em&gt;even figuring out what changed&lt;/em&gt; is a full-on investigation.&lt;/p&gt;

&lt;p&gt;If any of that sounds familiar, I’d recommend finally taking the big step and replacing your Postman collection with &lt;a href="https://hurl.dev" rel="noopener noreferrer"&gt;Hurl&lt;/a&gt; — a command-line HTTP test runner &lt;strong&gt;where tests are plain&lt;/strong&gt; &lt;code&gt;**.hurl**&lt;/code&gt; &lt;strong&gt;text files that live in Git.&lt;/strong&gt; For this tutorial, I’ll also walk you through multi-region testing — &lt;a href="https://get.brightdata.com/bd-proxy-network?utm_content=hurl_vs_postman_git_friendly_api_testing_with_proxy_aware_egress_2026" rel="noopener noreferrer"&gt;how to run those tests using a proxy&lt;/a&gt; to get a German egress IP, and see what the API actually returns from there.&lt;/p&gt;

&lt;p&gt;Everything below is self-contained, you’ll only need a new Git repo, then create the tests (three plaintext files.)&lt;/p&gt;

&lt;h2&gt;
  
  
  The Test Suite at a Glance
&lt;/h2&gt;

&lt;p&gt;Create a folder (name it whatever you like — &lt;code&gt;hurl-api-tests&lt;/code&gt; works) and these are the files we're going to be creating at its &lt;strong&gt;root&lt;/strong&gt;. That folder is your Git repo root; paths in commands and CI are relative to it.&lt;/p&gt;

&lt;p&gt;Note how we’re using multiple &lt;code&gt;.env&lt;/code&gt; files. If you're new to API testing, you'll find out why in a bit.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;hurl-api-tests/  
├── tests/  
│   ├── health.hurl           # smoke test -- Makes sure Hurl works, basic asserts. Skip if you want  
│   ├── auth-flow.hurl        # Tests chaining + jsonpath capture + Bearer header  
│   └── geo-detail.hurl       # Tests two egress profiles, real country/ASN diff (direct or proxied)  
├── .env.local.example  
├── .env.ci.example  
├── .env.proxy-de.example  
├── run-tests.mjs             # cross-platform runner  
└── .github/workflows/api-tests.yml
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;And of course, a &lt;code&gt;.gitignore&lt;/code&gt; and &lt;code&gt;package.json&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;You’ll &lt;code&gt;npm install&lt;/code&gt; once and use two commands to run the whole thing:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;npm test                      # direct, ISP exit  
npm run test:proxy-de         # same tests but via a Germany exit
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Stage 1: Install and the First Green Run
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://github.com/Orange-OpenSource/hurl/releases/latest" rel="noopener noreferrer"&gt;Hurl ships as a native binary&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://github.com/Orange-OpenSource/hurl?source=post_page-----3d48a8a9cd3a---------------------------------------" rel="noopener noreferrer"&gt;GitHub - Orange-OpenSource/hurl: Hurl, run and test HTTP requests with plain text.Hurl, run and test HTTP requests with plain text. Contribute to Orange-OpenSource/hurl development by creating an…github.com&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;If you live in Node land, the npm package downloads the right binary for your platform and exposes &lt;code&gt;hurl&lt;/code&gt; under &lt;code&gt;node_modules/.bin/&lt;/code&gt;. Afterwards, &lt;code&gt;npx hurl&lt;/code&gt;resolves it from there.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;npm install --save-dev @orangeopensource/hurl
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;And then you can get started.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;mkdir hurl-api-tests &amp;amp;&amp;amp; cd hurl-api-tests  
mkdir tests  
git init  
npm init -y
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The first step is to create a &lt;code&gt;tests/health.hurl&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Smoke test — public status endpoint (uses no auth, no proxy).  
GET https://httpbin.org/status/200  
HTTP 200  
Content-Type: text/html; charset=utf-8
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The &lt;code&gt;GET &amp;lt;url&amp;gt;&lt;/code&gt; followed by &lt;code&gt;HTTP 200&lt;/code&gt; is an implicit status assert, and the &lt;code&gt;Content-Type:&lt;/code&gt; line is an implicit response-header assert. No separate &lt;code&gt;[Asserts]&lt;/code&gt; block needed for either.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;💡 &lt;a href="https://httpbin.org" rel="noopener noreferrer"&gt;httpbin.org&lt;/a&gt; is convenient for tutorials but has had intermittent availability issues over the years — it’s a community-maintained project not a dedicated SLA endpoint. If you see unexpected failures on the smoke check, &lt;a href="https://httpstat.us/200" rel="noopener noreferrer"&gt;httpstat.us/200&lt;/a&gt; works as a URL swap — but update the_ &lt;code&gt;Content-Type:&lt;/code&gt; line to match whatever that endpoint actually returns. For your own projects, point this at a &lt;code&gt;/health&lt;/code&gt; or &lt;code&gt;/status&lt;/code&gt; on your actual staging API.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;After you’ve installed hurl, you can run the smoke test like so:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# leave out 'npx' if you installed the binary  
npx hurl --test tests/health.hurl
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;For reference, you’ll find docs for Hurl’s &lt;code&gt;--test&lt;/code&gt; option &lt;a href="https://hurl.dev/docs/running-tests.html" rel="noopener noreferrer"&gt;here&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;This step just proves Hurl works before you add the runner.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Success tests/health.hurl (1 request(s) in 986 ms)  
--------------------------------------------------------------------------------  
Executed files:    1  
Executed requests: 1  
Succeeded files:   1 (100.0%)  
Failed files:      0 (0.0%)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That’s pretty much the typical pattern for writing .hurl files: A URL, some sort of status assert, and a header assert — in a file you can &lt;code&gt;git diff&lt;/code&gt; easily. The simplicity &lt;em&gt;is&lt;/em&gt; the point of Hurl.&lt;/p&gt;

&lt;h2&gt;
  
  
  Stage 2: The Postman-Style Flow, Without Postman
&lt;/h2&gt;

&lt;p&gt;You know the shape every API test suite needs? Log in, grab a token, hit an authenticated endpoint? Postman handles it with environment variables, pre-request scripts, and a bloated UI that hides where the token came from. Hurl does it in a single text file.&lt;/p&gt;

&lt;p&gt;Let’s do it. Create a &lt;code&gt;tests/auth-flow.hurl&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# tests/auth-flow.hurl  
# Uses DummyJSON (https://dummyjson.com/docs/auth) — free, no API key.  
POST {{dummyjson_base}}/user/login  
Content-Type: application/json  
{  
  "username": "{{dummyjson_username}}",  
  "password": "{{dummyjson_password}}"  
}  
HTTP 200  
[Captures]  
token: jsonpath "$.accessToken"  
[Asserts]  
jsonpath "$.username" == "{{dummyjson_username}}"  

GET {{dummyjson_base}}/auth/me  
Authorization: Bearer {{token}}  
HTTP 200  
[Asserts]  
jsonpath "$.username" == "{{dummyjson_username}}"  
jsonpath "$.email" exists
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Lots to explain here:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  That &lt;code&gt;{{dummyjson_base}}&lt;/code&gt; and friends within curly braces are template variables. In Hurl, you can inject variables from a &lt;a href="https://hurl.dev/docs/templates.html" rel="noopener noreferrer"&gt;--variables-file&lt;/a&gt; command line option that we'll talk about in the next section.&lt;/li&gt;
&lt;li&gt;  Hurl’s &lt;strong&gt;Captures&lt;/strong&gt; &lt;a href="https://hurl.dev/docs/capturing-response.html" rel="noopener noreferrer"&gt;(read about that here)&lt;/a&gt; read &lt;code&gt;$.accessToken&lt;/code&gt; from the response JSON and binds it to &lt;code&gt;{{token}}&lt;/code&gt; for the rest of the file. (DummyJSON specifically returns the field as &lt;code&gt;accessToken&lt;/code&gt;, not &lt;code&gt;token&lt;/code&gt; — when you adapt this to your own API, check what key your auth endpoint actually uses in the response body.)&lt;/li&gt;
&lt;li&gt;  The second request uses &lt;code&gt;Authorization: Bearer {{token}}&lt;/code&gt; — that's the exact same syntax you'd write by hand with &lt;code&gt;curl&lt;/code&gt;, but the token is now dynamic per run.&lt;/li&gt;
&lt;li&gt;  Hurl’s &lt;strong&gt;Asserts&lt;/strong&gt; &lt;a href="https://hurl.dev/docs/asserting-response.html" rel="noopener noreferrer"&gt;(read about that here)&lt;/a&gt; use standard &lt;a href="https://goessner.net/articles/JsonPath/" rel="noopener noreferrer"&gt;JSONPath&lt;/a&gt;. &lt;code&gt;exists&lt;/code&gt; is one of several Hurl predicates; there are also &lt;code&gt;==&lt;/code&gt;, &lt;code&gt;contains&lt;/code&gt;, &lt;code&gt;matches&lt;/code&gt;, &lt;code&gt;count&lt;/code&gt;, type checks, and numeric comparisons.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;So what makes this reviewable?&lt;/strong&gt; When the contract for &lt;code&gt;/auth/me&lt;/code&gt; ever changes — say, the user's plan tier flips from &lt;code&gt;"free"&lt;/code&gt; to &lt;code&gt;"pro"&lt;/code&gt; and the test should follow, you can easily inspect it via a simple git diff:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;- jsonpath "$.plan" == "free"  
+ jsonpath "$.plan" == "pro"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;The Postman equivalent for this would be a several-hundred-line minified JSON export with embedded GUIDs that rotate on every save.&lt;/strong&gt; Even when the &lt;em&gt;behavioral&lt;/em&gt; change is one line, the diff never is — it’s just all noise across the whole collection, and whoever is reviewing this will probably stop reading after the third unrelated chunk.&lt;/p&gt;

&lt;p&gt;The Hurl diff above, though? Reviewable in &lt;em&gt;five seconds&lt;/em&gt; tops. The Postman one on the otehr hand is the reason your team quietly gave up on PR-reviewing API tests two quarters ago. 😅&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;💡 DummyJSON is used as the test backend here precisely because it’s the kind of API you’d actually want to mock around: real login, real Bearer token, real &lt;code&gt;/me&lt;/code&gt;. Free, no key needed, no rate limits. Swap it for your own staging URL when you adopt the pattern; the &lt;code&gt;.hurl&lt;/code&gt; file barely changes. DummyJSON's &lt;a href="https://dummyjson.com/docs/auth#login-user-and-get-tokens" rel="noopener noreferrer"&gt;current login docs&lt;/a&gt; show &lt;code&gt;/auth/login&lt;/code&gt;; this file uses &lt;code&gt;/user/login&lt;/code&gt;, which still works — if yours 404s, try &lt;code&gt;/auth/login&lt;/code&gt; instead.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2&gt;
  
  
  Stage 3: Environments That Don’t Hide State
&lt;/h2&gt;

&lt;p&gt;The Postman version of this section would be a series of dropdown screenshots and a slack thread asking “which environment are you in?” The Hurl version is simply three text files, one per profile, with an enforced contract.&lt;/p&gt;

&lt;p&gt;Here's&lt;code&gt;.env.local.example&lt;/code&gt; for local / direct runs:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;dummyjson_base=https://dummyjson.com  
dummyjson_username=emilys  
dummyjson_password=emilyspass  
expected_country=XX
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Where &lt;code&gt;XX&lt;/code&gt; is a placeholder for your two-letter ISO code (&lt;code&gt;US&lt;/code&gt;, &lt;code&gt;NZ&lt;/code&gt;, &lt;code&gt;IN&lt;/code&gt;, etc.) matching your ISP egress.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;💡 Quick Tip:&lt;/strong&gt; Run &lt;code&gt;curl -s https://geo.brdtest.com/mygeo.json --insecure | grep country&lt;/code&gt; (macOS/Linux) to find yours, then set it in_ &lt;code&gt;.env.local&lt;/code&gt; _before running &lt;code&gt;npm test&lt;/code&gt;. On Windows, open the JSON in a browser or use &lt;code&gt;curl.exe -sk …&lt;/code&gt; and read the "country" field manually. A mismatch here is the single most common gotchas on tutorials like this one.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Create &lt;code&gt;.env.ci.example&lt;/code&gt; by copying &lt;code&gt;.env.local.example&lt;/code&gt; unchanged -- same four keys -- only &lt;code&gt;.env.local&lt;/code&gt; needs a real country code. The only functional difference is that the &lt;code&gt;ci&lt;/code&gt; profile skips &lt;code&gt;geo-detail&lt;/code&gt; entirely, so &lt;code&gt;expected_country&lt;/code&gt; is never read or asserted in CI.&lt;/p&gt;

&lt;p&gt;Similarly, we’ll create our proxy profile — let’s say, for Germany (DE). Create a &lt;code&gt;.env.proxy-de.example&lt;/code&gt; at the root:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;dummyjson_base=https://dummyjson.com  
dummyjson_username=emilys  
dummyjson_password=emilyspass  
expected_country=DE  
BRIGHT_DATA_PROXY_HOST=brd.superproxy.io  
BRIGHT_DATA_PROXY_PORT=33335  
BRIGHT_DATA_PROXY_USERNAME=  
BRIGHT_DATA_PROXY_PASSWORD=
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;For the proxies, get your credentials from your control panel/dashboard. If you’re following along, &lt;a href="https://get.brightdata.com/lp-proxy-network-acf?utm_content=hurl_vs_postman_git_friendly_api_testing_with_proxy_aware_egress_2026" rel="noopener noreferrer"&gt;sign up here&lt;/a&gt; to get them first.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;a href="https://get.brightdata.com/bd7914?utm_content=hurl_vs_postman_git_friendly_api_testing_with_proxy_aware_egress_2026&amp;amp;source=post_page-----3d48a8a9cd3a---------------------------------------" rel="noopener noreferrer"&gt;Bright Data - All in One Platform for Proxies and Web Scraping&lt;/a&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Copy each to its matching gitignored file (&lt;code&gt;.env.local&lt;/code&gt;, &lt;code&gt;.env.ci&lt;/code&gt;, &lt;code&gt;.env.proxy-de&lt;/code&gt;). In &lt;code&gt;.env.local&lt;/code&gt;, replace &lt;code&gt;XX&lt;/code&gt; with your real ISP country code.&lt;/p&gt;

&lt;h2&gt;
  
  
  The &lt;code&gt;.example&lt;/code&gt; contract
&lt;/h2&gt;

&lt;p&gt;The idea is you create separate files per profile, with strict roles:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  &lt;code&gt;**.env.&lt;/code&gt;&lt;code&gt;.example**&lt;/code&gt; — committed to Git. Defines the &lt;em&gt;schema&lt;/em&gt; of the environment: every variable name, with safe placeholder values. &lt;strong&gt;For all intents and purposes, this will be the documentation.&lt;/strong&gt; New team members read it, and PR reviewers diff it when fields are added or removed.&lt;/li&gt;
&lt;li&gt;  &lt;code&gt;**.env.&lt;/code&gt;&lt;code&gt;**&lt;/code&gt; — Obviously, this will be &lt;strong&gt;gitignored&lt;/strong&gt;. Contains the real values: your password, your Bright Data credentials, the &lt;code&gt;expected_country&lt;/code&gt; you actually live in. Should never be tracked.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The runner refuses to start if the real file is missing — it tells you to copy from &lt;code&gt;.example&lt;/code&gt;. That refusal &lt;em&gt;is&lt;/em&gt; the contract (the runner enforces it; see below).&lt;/p&gt;

&lt;p&gt;While we’re at it, create the &lt;code&gt;.gitignore&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;.env.local  
.env.ci  
.env.proxy-de  
node_modules/  
.venv/  
__pycache__/  
*.pyc
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Every other env file is a &lt;code&gt;.example&lt;/code&gt; template. The schema is always reviewable, and the secrets are never kept in Git.&lt;/p&gt;

&lt;h2&gt;
  
  
  Add Hurl test scripts to package.json
&lt;/h2&gt;

&lt;p&gt;You already have one from &lt;code&gt;npm init -y&lt;/code&gt; in Stage 1, with &lt;code&gt;@orangeopensource/hurl&lt;/code&gt; in &lt;code&gt;devDependencies&lt;/code&gt;. Paste the block below — or replace the whole file if you prefer a clean slate:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;{  
  "name": "hurl-api-tests",  
  "private": true,  
  "scripts": {  
    "test": "node run-tests.mjs",  
    "test:ci": "node run-tests.mjs ci",  
    "test:proxy-de": "node run-tests.mjs proxy-de"  
  },  
  "devDependencies": {  
    "@orangeopensource/hurl": "^8.0.1"  
  }  
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This should be self explanatory — after &lt;code&gt;npm install&lt;/code&gt;, &lt;code&gt;npm test&lt;/code&gt; runs the direct profile; &lt;code&gt;npm run test:ci&lt;/code&gt; and &lt;code&gt;npm run test:proxy-de&lt;/code&gt; select the other env files.&lt;/p&gt;

&lt;h2&gt;
  
  
  A Simple Test Runner
&lt;/h2&gt;

&lt;p&gt;You’ll need a way to actually run these tests — a simple shell script .sh for bash, and an .mjs file for cross-platform (get this one if you’re on Windows). But this is pure plumbing, and not part of the lesson I’m trying to teach here.&lt;/p&gt;

&lt;p&gt;So just copy the full source from my GitHub Gist, put it at the repo root:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  &lt;a href="https://gist.github.com/sixthextinction/3d036516b5589efe18faf85bfc9d3663" rel="noopener noreferrer"&gt;This one for cross-platform&lt;/a&gt; (.mjs)&lt;/li&gt;
&lt;li&gt;  &lt;a href="https://gist.github.com/sixthextinction/1a11d9bd2e5b96c69561c7f8b17109d7" rel="noopener noreferrer"&gt;This one if you prefer bash&lt;/a&gt; (.sh)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;You can write it yourself, actually — a test runner simply loads the right &lt;code&gt;.env&lt;/code&gt; file per profile (&lt;code&gt;direct&lt;/code&gt; → &lt;code&gt;.env.local&lt;/code&gt;, &lt;code&gt;ci&lt;/code&gt; → &lt;code&gt;.env.ci&lt;/code&gt;, &lt;code&gt;proxy-de&lt;/code&gt; → &lt;code&gt;.env.proxy-de&lt;/code&gt;), refuses to start if that file is missing, and on &lt;code&gt;ci&lt;/code&gt; skips &lt;code&gt;geo-detail&lt;/code&gt; because GitHub runner egress isn't pinned to a country.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;NOTE&lt;/strong&gt;: on &lt;code&gt;proxy-de&lt;/code&gt;, &lt;code&gt;http_proxy&lt;/code&gt; and &lt;code&gt;https_proxy&lt;/code&gt; are set &lt;strong&gt;only&lt;/strong&gt; for the &lt;code&gt;geo-detail.hurl&lt;/code&gt; run — &lt;code&gt;health&lt;/code&gt; and &lt;code&gt;auth-flow&lt;/code&gt; stay direct. Proxied auth against DummyJSON would just test whether Bright Data can reach DummyJSON, not whether your API behaves correctly for German traffic. 😅&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;🚨 Don’t run &lt;code&gt;npm test&lt;/code&gt; yet! The direct profile also runs &lt;code&gt;tests/geo-detail.hurl&lt;/code&gt;, which you'll create in Stage 4._&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2&gt;
  
  
  The Type of Bug This Stops
&lt;/h2&gt;

&lt;p&gt;How many times have you been exasperated by a test that passes locally, fails in CI, and three hours later you discover that it happened because a local dev laptop had a stale &lt;code&gt;base_url&lt;/code&gt; someone forgot to update after the staging URL moved?&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The Postman version of this bug is invisible&lt;/strong&gt; — your environments live inside the GUI, and exports of them are huge. No PR will ever review “the environment file.”&lt;/p&gt;

&lt;p&gt;With Hurl’s committed &lt;code&gt;.example&lt;/code&gt; templates, &lt;code&gt;base_url&lt;/code&gt; is literally just a tracked field in a tracked file. When it changes, the diff WILL show up in a PR, and the fix will be one line. That entire class of bug stops annoying you overnight.&lt;/p&gt;

&lt;p&gt;For CI specifically, &lt;code&gt;.env.ci.example&lt;/code&gt; ships the &lt;em&gt;shape&lt;/em&gt;; the direct job copies it to &lt;code&gt;.env.ci&lt;/code&gt; at runtime — no secrets needed for DummyJSON's public test users. When you swap in your own staging API, put real URLs and credentials in &lt;a href="https://docs.github.com/en/actions/security-guides/encrypted-secrets" rel="noopener noreferrer"&gt;GitHub Actions secrets&lt;/a&gt; and inject them in the workflow the same way the Bright Data job does below. &lt;strong&gt;The committed&lt;/strong&gt; &lt;code&gt;**.example**&lt;/code&gt; &lt;strong&gt;file is your schema, so production values never land in Git.&lt;/strong&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Stage 4: Proxies as Scriptable Egress
&lt;/h2&gt;

&lt;p&gt;What if we could use the same Hurl file, same assertions, &lt;strong&gt;but check multiple regions/countries with it&lt;/strong&gt;?&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Why would you want that?&lt;/strong&gt; Let’s say you wrote those Hurl tests and all your PRs run it. Everything’s green and both the local laptop and the CI runner pass. Then you ship a feature that shows EUR pricing for users in Germany, and three days later a customer in Hamburg reports they’re being charged USD.&lt;/p&gt;

&lt;p&gt;Uh oh.&lt;/p&gt;

&lt;p&gt;You check staging — it’s fine. You check the admin — pricing is correctly set. You re-run the test suite — nothing pops, everything’s still green.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What went wrong?&lt;/strong&gt; The bug actually isn’t in your code. It’s in the fact that &lt;strong&gt;every machine that ever ran that test suite exited the internet from somewhere your API thinks is the United States.&lt;/strong&gt; The German behavior was broken the entire time, because nobody could see it, because nobody’s test traffic ever &lt;em&gt;looked&lt;/em&gt; German.&lt;/p&gt;

&lt;p&gt;Unless you have team members &lt;em&gt;actually living in Germany&lt;/em&gt;, the instinct here is to grab a VPN, switch to a German exit, and re-run manually. &lt;strong&gt;That will work as a one-off, but it doesn’t run on every PR.&lt;/strong&gt; It doesn’t run at 2am when someone merges a pricing change, and it &lt;em&gt;certainly&lt;/em&gt; doesn’t live in the repo next to the code it’s testing. A VPN you open manually is always going to be a one-off check. What we want is a &lt;em&gt;test&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;This is exactly what &lt;strong&gt;scriptable egress&lt;/strong&gt; means:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  Proxy configuration that can be put in an env file,&lt;/li&gt;
&lt;li&gt;  Applied selectively to the requests where geography matters,&lt;/li&gt;
&lt;li&gt;  Is repeatable by anyone with the credentials.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This is exactly why we use Bright Data proxies for this. The implementation itself can be as simple as a flag you pass to the runner like &lt;code&gt;npm run test:proxy-de&lt;/code&gt; from your laptop, or triggered from a GitHub Actions &lt;code&gt;workflow_dispatch&lt;/code&gt;. &lt;strong&gt;And it'll work for any country Bright Data has an exit for.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The endpoint we use to demonstrate this is &lt;a href="https://geo.brdtest.com/mygeo.json" rel="noopener noreferrer"&gt;Bright Data’s geo test JSON&lt;/a&gt; — it returns the country, ASN, and city Bright Data sees for whatever IP hits it.&lt;/p&gt;

&lt;p&gt;Create &lt;code&gt;tests/geo-detail.hurl&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# tests/geo-detail.hurl  
# Geo fingerprint via Bright Data's test endpoint — same URL, direct or proxied.  
GET https://geo.brdtest.com/mygeo.json  
[Options]  
insecure: true   # brdtest.com uses a demo cert — never use this against production endpoints  
HTTP 200  
[Asserts]  
jsonpath "$.country" == "{{expected_country}}"  
jsonpath "$.asn.org_name" exists  
jsonpath "$.asn.asnum" &amp;gt; 0  
jsonpath "$.geo.city" exists
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The response you’ll get back is something like:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;{  
  "country": "DE",  
  "asn": { "asnum": 3320, "org_name": "Deutsche Telekom AG" },  
  "geo": { "city": "Wetzlar", "region": "HE", "tz": "Europe/Berlin", ... }  
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The point of this file is to see if the endpoint sees a different IP, ASN, and city depending on whether your traffic exits from your office or through a proxy that pins a country.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;em&gt;💡&lt;/em&gt; The assertions in this Hurl test have to be deliberately tight on country (the thing you control) and loose on ASN/city (Bright Data’s exit nodes rotate within a country).&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;To pin a country, Bright Data accepts a country suffix on the proxy username:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;brd-customer-hl_XXXXXXXX-zone-YOURZONE-country-de
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Fill &lt;code&gt;BRIGHT_DATA_PROXY_USERNAME&lt;/code&gt; and &lt;code&gt;BRIGHT_DATA_PROXY_PASSWORD&lt;/code&gt; in &lt;code&gt;.env.proxy-de&lt;/code&gt; (copied from &lt;code&gt;.env.proxy-de.example&lt;/code&gt;), then:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;npm run test:proxy-de
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The runner exports &lt;code&gt;http_proxy&lt;/code&gt; and &lt;code&gt;https_proxy&lt;/code&gt; from your env vars &lt;em&gt;only&lt;/em&gt; for &lt;code&gt;geo-detail.hurl&lt;/code&gt; — DummyJSON and httpbin keep running direct (as described in Stage 3).&lt;/p&gt;

&lt;p&gt;The before/after:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Profile&lt;/th&gt;
&lt;th&gt;Egress&lt;/th&gt;
&lt;th&gt;&lt;code&gt;country&lt;/code&gt;&lt;/th&gt;
&lt;th&gt;ASN&lt;/th&gt;
&lt;th&gt;City&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;npm test&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Local ISP&lt;/td&gt;
&lt;td&gt;your code &lt;em&gt;(not &lt;code&gt;XX&lt;/code&gt; - see Stage 3)&lt;/em&gt;
&lt;/td&gt;
&lt;td&gt;(your ISP)&lt;/td&gt;
&lt;td&gt;(your ISP)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;npm run test:proxy-de&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Bright Data DE&lt;/td&gt;
&lt;td&gt;DE&lt;/td&gt;
&lt;td&gt;Deutsche Telekom AG (AS3320)&lt;/td&gt;
&lt;td&gt;Wetzlar&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;No custom infrastructure needed.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;NOW you can run the full direct suite&lt;/strong&gt; (because now, all three &lt;code&gt;.hurl&lt;/code&gt; files exist):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;cp .env.local.example .env.local   # Git Bash / macOS / Linux; on Windows, copy the file manually  
# replace XX with your ISP country code (e.g. US, IN, DE)  
npm test
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Where the geo determination actually happens (read this before you copy the pattern)
&lt;/h2&gt;

&lt;p&gt;In this project, &lt;strong&gt;the real geo signal is your egress IP&lt;/strong&gt;, not anything the client sends. Bright Data routes through a German exit; the endpoint inspects the IP it received — i.e. the response says &lt;code&gt;DE&lt;/code&gt;. The client doesn't tell the API where it is; the network does.&lt;/p&gt;

&lt;p&gt;A real region-aware production API works the same way. It reads the source IP off the connection, optionally consults an API gateway / WAF / GeoIP layer, and decides what to return.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;em&gt;💡&lt;/em&gt; When you point these tests at your own staging API, drop any client-side geo headers and let proxy egress be the signal.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2&gt;
  
  
  Stage 5: Using GitHub CI Jobs for API Testing
&lt;/h2&gt;

&lt;p&gt;Create &lt;code&gt;.github/workflows/api-tests.yml&lt;/code&gt; at the &lt;strong&gt;same level&lt;/strong&gt; as &lt;code&gt;package.json&lt;/code&gt; (repo root). The workflow assumes every file from this article lives there — no subfolder, no extra checkout steps.&lt;/p&gt;

&lt;p&gt;Before pushing, run &lt;code&gt;npm install&lt;/code&gt; locally once and &lt;strong&gt;commit&lt;/strong&gt; &lt;code&gt;**package-lock.json**&lt;/code&gt;. The workflow uses &lt;a href="https://docs.npmjs.com/cli/v10/commands/npm-ci" rel="noopener noreferrer"&gt;npm ci&lt;/a&gt;, which requires that lockfile.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;name: api-tests  
on:  
  push:  
  pull_request:  
  workflow_dispatch:  
jobs:  
  api-tests-direct:  
    runs-on: ubuntu-latest  
    steps:  
      - uses: actions/checkout@v4  
      - uses: actions/setup-node@v4  
        with:  
          node-version: "20"  
          cache: npm  
          cache-dependency-path: package-lock.json  
      - name: Install dependencies  
        run: npm ci  
      - name: Configure env  
        run: cp .env.ci.example .env.ci  
      - name: Run direct tests  
        run: npm run test:ci  
  api-tests-brightdata:  
    runs-on: ubuntu-latest  
    if: github.event_name == 'workflow_dispatch'  
    steps:  
      - uses: actions/checkout@v4  
      - uses: actions/setup-node@v4  
        with:  
          node-version: "20"  
          cache: npm  
          cache-dependency-path: package-lock.json  
      - name: Install dependencies  
        run: npm ci  
      - name: Configure env  
        run: cp .env.proxy-de.example .env.proxy-de  
      - name: Run Bright Data tests  
        env:  
          BRIGHT_DATA_PROXY_USERNAME: ${{ secrets.BRIGHT_DATA_PROXY_USERNAME }}  
          BRIGHT_DATA_PROXY_PASSWORD: ${{ secrets.BRIGHT_DATA_PROXY_PASSWORD }}  
        run: |  
          if [[ -n "$BRIGHT_DATA_PROXY_USERNAME" ]]; then  
            echo "BRIGHT_DATA_PROXY_USERNAME=$BRIGHT_DATA_PROXY_USERNAME" &amp;gt;&amp;gt; .env.proxy-de  
            echo "BRIGHT_DATA_PROXY_PASSWORD=$BRIGHT_DATA_PROXY_PASSWORD" &amp;gt;&amp;gt; .env.proxy-de  
          fi  
          npm run test:proxy-de
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Remember:&lt;/strong&gt; &lt;code&gt;api-tests-direct&lt;/code&gt; runs on every push and PR — &lt;code&gt;health&lt;/code&gt; + &lt;code&gt;auth-flow&lt;/code&gt; only. &lt;strong&gt;It deliberately does not run&lt;/strong&gt; &lt;code&gt;**geo-detail**&lt;/code&gt;&lt;strong&gt;.&lt;/strong&gt; GitHub-hosted runners exit through Azure ranges that rotate by region and over time; asserting &lt;code&gt;country == "US"&lt;/code&gt; from CI would pass on Monday and fail on Wednesday. &lt;strong&gt;We're not in the business of asserting on infrastructure we don't control.&lt;/strong&gt; The &lt;code&gt;ci&lt;/code&gt; profile in &lt;code&gt;run-tests.mjs&lt;/code&gt; skips &lt;code&gt;geo-detail&lt;/code&gt; for that reason.&lt;/p&gt;

&lt;p&gt;Also note how &lt;code&gt;api-tests-brightdata&lt;/code&gt; is &lt;code&gt;workflow_dispatch&lt;/code&gt; &lt;a href="https://docs.github.com/en/actions/using-workflows/manually-running-a-workflow" rel="noopener noreferrer"&gt;(GitHub docs here)&lt;/a&gt; only — trigger it manually from the Actions tab (or add a cron later). It runs the full suite &lt;em&gt;including&lt;/em&gt; &lt;code&gt;geo-detail&lt;/code&gt; through a pinned DE exit.&lt;/p&gt;

&lt;p&gt;For that job, add two GitHub repo secrets before your first manual run: &lt;code&gt;BRIGHT_DATA_PROXY_USERNAME&lt;/code&gt; (with &lt;code&gt;-country-de&lt;/code&gt; appended) and &lt;code&gt;BRIGHT_DATA_PROXY_PASSWORD&lt;/code&gt;. The workflow appends them to &lt;code&gt;.env.proxy-de&lt;/code&gt; at runtime, then throws the runner away. Nothing sensitive will ever be in source control.&lt;/p&gt;

&lt;p&gt;The lesson you should take away from all of this is that &lt;strong&gt;only the egress you explicitly control is worth asserting on&lt;/strong&gt;. CI direct verifies that auth and HTTP plumbing work, while CI proxy verifies that &lt;em&gt;geographic&lt;/em&gt; behavior works.&lt;/p&gt;

&lt;p&gt;Mixing the two would give you a flaky suite that’s worse than no test at all.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Gotchas
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;  &lt;strong&gt;HTTP proxy env vars are lowercase.&lt;/strong&gt; Hurl reads &lt;code&gt;http_proxy&lt;/code&gt; and &lt;code&gt;https_proxy&lt;/code&gt;. Not &lt;code&gt;HTTP_PROXY&lt;/code&gt;. PowerShell will helpfully suggest the uppercase versions; ignore that suggestion. 😅&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Localhost is not reachable through Bright Data.&lt;/strong&gt; If you have a local mock server, set &lt;code&gt;no_proxy=127.0.0.1,localhost&lt;/code&gt; before running Hurl, or unset the proxy vars before that specific test. The runner does both.&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Free public geo APIs do not love datacenter traffic.&lt;/strong&gt; My first instinct was to use &lt;a href="https://ip-api.com" rel="noopener noreferrer"&gt;ip-api.com&lt;/a&gt; — it’s the go-to for “what country is this IP” lookups and has a generous free tier. Hooked it up, ran the direct profile, got a clean response. Then I ran it through the Bright Data proxies I had (&lt;a href="https://get.brightdata.com/proxy-types-datacenter-proxies" rel="noopener noreferrer"&gt;these&lt;/a&gt;) and got a HTTP 402. Turns out ip-api.com’s free tier detects any datacenter egress and refuses to serve it. The proxy was working perfectly; the endpoint just didn’t want to talk to it. This is why I switched to using the &lt;code&gt;geo.brdtest.com/mygeo.json&lt;/code&gt; URL.&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;However,&lt;/strong&gt; &lt;code&gt;**geo.brdtest.com**&lt;/code&gt; &lt;strong&gt;uses a demo TLS cert.&lt;/strong&gt; This is fine for now, Hurl's &lt;code&gt;[Options] insecure: true&lt;/code&gt; per-request flag handles this without weakening security globally.
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;GET https://geo.brdtest.com/mygeo.json  
[Options]  
insecure: true   # brdtest.com uses a demo cert — never use this against production endpoints  
HTTP 200
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ul&gt;
&lt;li&gt;  &lt;strong&gt;ASN and city rotate within a country.&lt;/strong&gt; Bright Data’s German exit IPs are not pinned to one ASN or city. This is why your assertion has to be &lt;code&gt;country == "DE"&lt;/code&gt;, &lt;em&gt;not&lt;/em&gt; &lt;code&gt;city == "Wetzlar"&lt;/code&gt;. Assert on the thing you control (country pin), not rotating values, and verify presence/shape on everything else (&lt;code&gt;asn.org_name exists&lt;/code&gt;, &lt;code&gt;asn.asnum &amp;gt; 0&lt;/code&gt;).&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  What the Pattern Is Actually Good For
&lt;/h2&gt;

&lt;p&gt;This project demonstrates &lt;strong&gt;three&lt;/strong&gt; uses of proxy-aware API testing:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt; &lt;strong&gt;Country-specific behavior&lt;/strong&gt; — same endpoint, two egress profiles, asserts on &lt;code&gt;country&lt;/code&gt; differ. (&lt;code&gt;geo-detail.hurl&lt;/code&gt; direct vs proxy.)&lt;/li&gt;
&lt;li&gt; &lt;strong&gt;IP reputation as a signal&lt;/strong&gt; — the ASN of your test traffic matters. &lt;code&gt;geo-detail.hurl&lt;/code&gt; asserts on &lt;code&gt;asn.org_name&lt;/code&gt;, which is exactly what fraud and risk systems look at to score requests. (Bright Data residential vs datacenter exits will produce different ASN orgs.)&lt;/li&gt;
&lt;li&gt; &lt;strong&gt;Routing or gateway behavior by region&lt;/strong&gt; — the geo fingerprint Bright Data sees is the same fingerprint your CDN, API gateway, or partner integration sees. If your test suite always exits from one place, you’re not testing the routing layer; you’re testing the routing layer’s behavior &lt;em&gt;for traffic from your office&lt;/em&gt;.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;But there are things this pattern &lt;em&gt;enables&lt;/em&gt; that this walkthrough does not fully demonstrate:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  &lt;strong&gt;Localized pricing or currency&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Tax / VAT eligibility&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Compliance blocks and allowed-market checks&lt;/strong&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Important use cases, but also things you can’t legitimately demonstrate without your own staging API. 😅 So I’m not covering these — swap DummyJSON and &lt;code&gt;geo.brdtest.com&lt;/code&gt; for your endpoints and keep the same &lt;code&gt;.hurl&lt;/code&gt; + proxy profile shape.&lt;/p&gt;

&lt;p&gt;Biggest lesson you should take away here is that &lt;strong&gt;not every API test needs a proxy.&lt;/strong&gt; Use proxy-aware tests when behavior depends on geography, IP type, routing, or compliance — that’s the population. Adding Bright Data to every smoke test would be silly. &lt;strong&gt;Adding it to the tests where regional behavior matters is the entire point, naturally.&lt;/strong&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Running It Yourself
&lt;/h2&gt;

&lt;p&gt;This is everything:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;cd hurl-api-tests  
npm install                    # creates package-lock.json — commit it for CI  
cp .env.local.example .env.local   # or copy manually on Windows  
# Replace XX in .env.local with your real ISP country code (see Stage 3)  
npm test                       # direct: health + auth + geo-detail  
cp .env.proxy-de.example .env.proxy-de  
# Fill BRIGHT_DATA_PROXY_USERNAME (with -country-de) and BRIGHT_DATA_PROXY_PASSWORD  
npm run test:proxy-de
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Our &lt;code&gt;npm test&lt;/code&gt; runs &lt;code&gt;geo-detail&lt;/code&gt; against your own ISP egress — you set &lt;code&gt;expected_country&lt;/code&gt; in &lt;code&gt;.env.local&lt;/code&gt;, so the country assertion is pinned to something you control. CI skips it because GitHub runner IP ranges rotate; geography is asserted only in the manual Bright Data job.&lt;/p&gt;

&lt;p&gt;Sample output from a clean run on both profiles (and the CI profile):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;== profile: direct ==  
Success tests/auth-flow.hurl (2 request(s) in 727 ms)  
Success tests/health.hurl   (1 request(s) in 986 ms)  
-- geo-detail (direct) --  
Success tests/geo-detail.hurl (1 request(s) in 1375 ms)  
All tests passed.  

== profile: ci ==  
Success tests/health.hurl   (1 request(s) in 968 ms)  
Success tests/auth-flow.hurl (2 request(s) in 1016 ms)  
-- geo-detail skipped (CI direct egress is not pinned) --  
All tests passed.  

== profile: proxy-de ==  
Success tests/auth-flow.hurl (2 request(s) in 763 ms)  
Success tests/health.hurl   (1 request(s) in 1001 ms)  
-- geo-detail (via Bright Data DE) --  
Success tests/geo-detail.hurl (1 request(s) in 5924 ms)  
All tests passed.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That 5924ms on &lt;code&gt;geo-detail&lt;/code&gt; in the Bright Data profile is expected — routing through a proxy exit in Germany from wherever you're sitting will obviously add latency. It's not a bug, and I'd actually argue it's useful: if your regional endpoint is asserting on the right response, knowing it takes ~6 seconds from a German IP versus ~1.4 seconds direct, tells you something about your CDN routing or API gateway performance that you'd otherwise never observe from CI.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why This Beats the Postman Status Quo
&lt;/h2&gt;

&lt;p&gt;There’s a tempting version of this article that just says “Hurl is better than Postman!” But it’s not the interesting version in my experience. Hurl wins on the workflow axis — text files, Git diffs, CI integration, no GUI — but Postman has &lt;a href="https://learning.postman.com/docs/collections/using-newman-cli/command-line-integration-with-newman/" rel="noopener noreferrer"&gt;Newman&lt;/a&gt;, Newman runs in CI, and you can hold most of these arguments at arm’s length if you’ve invested in tooling.&lt;/p&gt;

&lt;p&gt;If your API’s behavior depends on where the request comes from (and for any business that operates in multiple countries, it usually does) &lt;strong&gt;your test suite has literally never tested the thing that readily breaks.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;API engineering is hard enough, please don’t make things harder for yourself. 😅&lt;/p&gt;

&lt;p&gt;Hurl and Bright Data are your dev force multipliers here. &lt;strong&gt;With this combination, you get tests you can actually review like code, running from network origins that actually look like your customers.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The Postman equivalent of this project would have fought me every step of the way. Specifically for my use case, routing &lt;em&gt;some&lt;/em&gt; requests through a proxy while keeping auth direct would have meant pre-request scripts or duplicated folders — definitely not a one-line env change.&lt;/p&gt;

</description>
      <category>programming</category>
      <category>webdev</category>
      <category>python</category>
      <category>api</category>
    </item>
    <item>
      <title>A Practical Guide To Entity Resolution in Python (No Database, No Machine Learning)</title>
      <dc:creator>Prithwish Nath</dc:creator>
      <pubDate>Tue, 26 May 2026 07:36:44 +0000</pubDate>
      <link>https://dev.to/prithwish_nath/a-practical-guide-to-entity-resolution-in-python-no-database-no-machine-learning-3pnl</link>
      <guid>https://dev.to/prithwish_nath/a-practical-guide-to-entity-resolution-in-python-no-database-no-machine-learning-3pnl</guid>
      <description>&lt;p&gt;&lt;strong&gt;TL;DR:&lt;/strong&gt; Learn a very simple way to normalize, dedupe, and fuzzy-match records that refer to the same real-world entity in Python, without a database or any ML pipelines.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffrntl4il70o67lkilfz7.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffrntl4il70o67lkilfz7.jpeg" alt="Image" width="800" height="568"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;I was working on a Crunchbase dataset last Friday. I joined it against our CRM, and got 56 hits out of 96. The other 40 were sitting &lt;em&gt;right there&lt;/em&gt; in both tables — &lt;code&gt;Necker FinTech&lt;/code&gt; in the extracted data was&lt;code&gt;Necker FinTech Holdings Inc.&lt;/code&gt; in the CRM; &lt;code&gt;Investing.com&lt;/code&gt; in the data was&lt;code&gt;Fusion Media Limited&lt;/code&gt; in the CRM — but &lt;code&gt;JOIN ... ON name = name&lt;/code&gt; obviously doesn't care, it will shrug and return nothing. If I'd shipped that, some sales rep would end up cold-pitching an existing customer because of it. 😅&lt;/p&gt;

&lt;p&gt;&lt;a href="https://en.wikipedia.org/wiki/Record_linkage" rel="noopener noreferrer"&gt;This is the core problem of entity resolution&lt;/a&gt;: the same real-world entity wearing different names in different systems. Naive text equality checks are borderline useless in the real world. I’d been meaning to do something less embarrassing than a raw &lt;code&gt;==&lt;/code&gt; for a while, so I spent the rest of the weekend on a simple pipeline — scrape company names from Crunchbase hubs via &lt;a href="https://get.brightdata.com/bd7914?utm_content=a_practical_guide_to_entity_resolution_in_python_no_database_no_machine_learning" rel="noopener noreferrer"&gt;Bright Data&lt;/a&gt;, normalize, deduplicate, and fuzzy-match against the CRM list using &lt;a href="https://github.com/rapidfuzz/RapidFuzz" rel="noopener noreferrer"&gt;RapidFuzz&lt;/a&gt; (&lt;code&gt;fuzz.WRatio&lt;/code&gt;). &lt;strong&gt;Deliberately choosing to NOT use ML, vector embeddings, or a database.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The join rate on this dataset jumped from &lt;strong&gt;~58% to 100%&lt;/strong&gt;.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Metric&lt;/th&gt;
&lt;th&gt;Exact (normalized string)&lt;/th&gt;
&lt;th&gt;Fuzzy (WRatio ≥ 90)&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Scraped hub rows → CRM&lt;/td&gt;
&lt;td&gt;58.3% (56 / 96)&lt;/td&gt;
&lt;td&gt;
&lt;strong&gt;100%&lt;/strong&gt; (96 / 96)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;CRM rows → scraped data&lt;/td&gt;
&lt;td&gt;34.8% (48 / 138)&lt;/td&gt;
&lt;td&gt;
&lt;strong&gt;100%&lt;/strong&gt; (138 / 138)&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The reason exact matching loses so badly is that &lt;em&gt;any&lt;/em&gt; real CRM list you’re handed will almost always have multiple legal-name variants per company — I had three different &lt;em&gt;Necker&lt;/em&gt; spellings pointing at one hub listing alone. Fuzzy matching earns its keep by collapsing those variants back into a single canonical cluster, and that’s most of what the rest of this post is about.&lt;/p&gt;

&lt;p&gt;I’ll walk through it; I hope it’s useful for anyone starting with fuzzy algorithms!&lt;/p&gt;

&lt;h2&gt;
  
  
  What Is Entity Resolution vs Fuzzy Matching?
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Entity resolution&lt;/strong&gt; matches records that describe the same company under different surface strings.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;If you use exact matching, you ask&lt;/strong&gt;: &lt;em&gt;are these two strings identical?&lt;/em&gt; After you lowercase and strip punctuation, &lt;code&gt;"Necker FinTech"&lt;/code&gt; and &lt;code&gt;"Necker FinTech Holdings Inc."&lt;/code&gt; are still different strings — so a SQL &lt;code&gt;JOIN&lt;/code&gt; or a Python &lt;code&gt;==&lt;/code&gt; check will incorrectly say no match.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="c1"&gt;-- Exact join on raw names returns no row when spellings differ  &lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;h&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;company_name&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;hub_name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;c&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;company_name&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;crm_name&lt;/span&gt;  
&lt;span class="k"&gt;FROM&lt;/span&gt;   &lt;span class="n"&gt;hub_scrape&lt;/span&gt; &lt;span class="n"&gt;h&lt;/span&gt;  
&lt;span class="k"&gt;JOIN&lt;/span&gt;   &lt;span class="n"&gt;crm_accounts&lt;/span&gt; &lt;span class="k"&gt;c&lt;/span&gt; &lt;span class="k"&gt;ON&lt;/span&gt; &lt;span class="k"&gt;c&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;company_name&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;h&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;company_name&lt;/span&gt;  
&lt;span class="k"&gt;WHERE&lt;/span&gt;  &lt;span class="n"&gt;h&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;company_name&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;'Necker FinTech'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;  
&lt;span class="c1"&gt;-- This will return 0 rows   &lt;/span&gt;
&lt;span class="c1"&gt;-- Remember, CRM has "Necker FinTech Holdings Inc.", not the Crunchbase title&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;This is why you use Fuzzy matching.&lt;/strong&gt; That asks a looser question: &lt;em&gt;how similar are these two strings&lt;/em&gt;? You get a score — usually 0 to 100 — instead of &lt;code&gt;true&lt;/code&gt;or &lt;code&gt;false&lt;/code&gt;. Names that are clearly the same company but spelled differently (&lt;code&gt;Necker FinTech&lt;/code&gt; vs &lt;code&gt;Necker FinTech Holdings Inc.&lt;/code&gt;) will score high, while unrelated names will score low. You pick a &lt;strong&gt;threshold&lt;/strong&gt; (we use 90): if the score is at or above it, you treat the pair as a match; otherwise you don't.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;rapidfuzz&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;fuzz&lt;/span&gt;  
&lt;span class="n"&gt;THRESHOLD&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;90&lt;/span&gt;  
&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;is_match&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;a&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;b&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;bool&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;  
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;fuzz&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;WRatio&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;a&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;b&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;=&lt;/span&gt; &lt;span class="n"&gt;THRESHOLD&lt;/span&gt;  
&lt;span class="n"&gt;pairs&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;  
    &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Necker FinTech&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Necker FinTech Holdings Inc.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;   &lt;span class="c1"&gt;# same company, legal suffix  
&lt;/span&gt;    &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;PointsKash&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Points Kash&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;                        &lt;span class="c1"&gt;# same company, spacing  
&lt;/span&gt;    &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Investing.com&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Fusion Media Limited&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;            &lt;span class="c1"&gt;# brand vs legal entity  
&lt;/span&gt;    &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Stripe&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Climate Corp&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;                           &lt;span class="c1"&gt;# different companies  
&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;  
&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;a&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;b&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;pairs&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;  
    &lt;span class="n"&gt;score&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;fuzz&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;WRatio&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;a&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;b&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;score&lt;/span&gt;&lt;span class="si"&gt;:&lt;/span&gt;&lt;span class="mf"&gt;5.1&lt;/span&gt;&lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;  match=&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;score&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;=&lt;/span&gt; &lt;span class="n"&gt;THRESHOLD&lt;/span&gt;&lt;span class="si"&gt;!s:&lt;/span&gt;&lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;  &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;a&lt;/span&gt;&lt;span class="si"&gt;!r}&lt;/span&gt;&lt;span class="s"&gt;  vs  &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;b&lt;/span&gt;&lt;span class="si"&gt;!r}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This is the same scoring logic we’ll use for the rest of the tutorial, so &lt;code&gt;pip install rapidfuzz&lt;/code&gt; is all you need to follow along.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://github.com/rapidfuzz/RapidFuzz?source=post_page-----89d55badaeac---------------------------------------" rel="noopener noreferrer"&gt;GitHub - rapidfuzz/RapidFuzz: Rapid fuzzy string matching in Python using various string metricsRapid fuzzy string matching in Python using various string metrics - rapidfuzz/RapidFuzzgithub.com&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Running the demo pairs above with &lt;code&gt;fuzz.WRatio&lt;/code&gt; and &lt;strong&gt;WRatio threshold 90&lt;/strong&gt; yields:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Pair&lt;/th&gt;
&lt;th&gt;WRatio&lt;/th&gt;
&lt;th&gt;Match at ≥ 90?&lt;/th&gt;
&lt;th&gt;Drift type&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;
&lt;code&gt;Necker FinTech&lt;/code&gt; vs &lt;code&gt;Necker FinTech Holdings Inc.&lt;/code&gt;
&lt;/td&gt;
&lt;td&gt;90.0&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;Legal suffix&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;
&lt;code&gt;PointsKash&lt;/code&gt; vs &lt;code&gt;Points Kash&lt;/code&gt;
&lt;/td&gt;
&lt;td&gt;95.2&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;Token spacing&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;
&lt;code&gt;Investing.com&lt;/code&gt; vs &lt;code&gt;Fusion Media Limited&lt;/code&gt;
&lt;/td&gt;
&lt;td&gt;30.0&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;td&gt;Brand vs legal entity&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;
&lt;code&gt;Stripe&lt;/code&gt; vs &lt;code&gt;Climate Corp&lt;/code&gt;
&lt;/td&gt;
&lt;td&gt;45.0&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;td&gt;Unrelated companies&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Think of it like a strict spell-check or a “did you mean X?” suggestion, but for whole company names. &lt;strong&gt;It is not machine learning — no model is trained on your data.&lt;/strong&gt; The library compares characters and words using fixed rules: how many edits to turn one string into another, whether one name is contained in the other, whether the same words appear in a different order. That’s why it’s fast, easy to audit, and good enough for a large class of real-world messiness — extra words, &lt;code&gt;Inc.&lt;/code&gt; vs &lt;code&gt;LLC&lt;/code&gt;, odd spacing, punctuation.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;em&gt;💡&lt;/em&gt; If two names share almost no letters — &lt;em&gt;Investing.com&lt;/em&gt; and &lt;em&gt;Fusion Media Limited&lt;/em&gt; for example — the score stays low and fuzzy matching correctly refuses to merge them. Those cases need a real identifier (domain, LEI, enrichment API, some sort of ML pipeline etc.), not smarter string math.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2&gt;
  
  
  Fuzzy matching vs Lookup table vs ML
&lt;/h2&gt;

&lt;p&gt;Here’s a quick summary.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Approach&lt;/th&gt;
&lt;th&gt;Best when&lt;/th&gt;
&lt;th&gt;Used in this pipeline?&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;
&lt;strong&gt;Fuzzy matching&lt;/strong&gt; (RapidFuzz WRatio)&lt;/td&gt;
&lt;td&gt;Same entity, stylistic drift — legal suffixes, spacing, punctuation&lt;/td&gt;
&lt;td&gt;Yes — primary method&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Lookup table / enrichment API&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Brand vs legal name; names share almost no tokens&lt;/td&gt;
&lt;td&gt;Partial — &lt;code&gt;RESEARCHED&lt;/code&gt; dict in &lt;code&gt;build_sample_crm.py&lt;/code&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;(GLEIF, Clearbit, domain)&lt;/td&gt;
&lt;td&gt;&lt;/td&gt;
&lt;td&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;
&lt;strong&gt;ML record linkage&lt;/strong&gt; (Dedupe, Splink)&lt;/td&gt;
&lt;td&gt;Large-scale probabilistic linkage, many fields beyond name&lt;/td&gt;
&lt;td&gt;No — names-only, no training step&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Basically, choose fuzzy matching when two name strings likely describe the same company but spell it differently.&lt;/p&gt;

&lt;p&gt;Only choose a lookup or enrichment layer when the strings are &lt;em&gt;related&lt;/em&gt; entities (brand vs operator) rather than variants of one name.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Pipeline at a Glance
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Entity resolution in this pipeline&lt;/strong&gt; is a fetch → extract → normalize → fuzzy-cluster → join loop on &lt;code&gt;canonical_id&lt;/code&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;hub_urls.json  
      │  
      ▼  
fetch_hubs.py ──calls──► bright_data_unlocker.py     Bright Data POST → page body (markdown/HTML)  
      │                           │  
      └──calls──► parse_hubs.py ◄─┘                  regex → org slug + display name  
      │  
      ▼  
hub_snapshot.json                                    (+ cached bodies in data/hub_responses/)  

extract.py ──► raw_records.json                      flat table  

reconcile.py ──► reconciled.json                     canonical clusters + aliases  

run_fuzzy.py                                         CLI part. This just runs extract + reconcile   

── optional eval ──  
post_fuzzy_eval.py                                   All done, so run a real-world test, calc metrics, then print to stdout
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Each stage is a pure transform: JSON in, JSON out.&lt;/strong&gt; Nothing stateful, nothing that requires a running service, and nothing I can't &lt;code&gt;git diff&lt;/code&gt; between runs.&lt;/p&gt;

&lt;h2&gt;
  
  
  Stage 1: Fetching Data from Crunchbase Hubs
&lt;/h2&gt;

&lt;p&gt;I’m scraping four Crunchbase hub leaderboard pages, defined in a &lt;code&gt;hub_urls.json&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="w"&gt;  
  &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"category"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"fintech"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;                &lt;/span&gt;&lt;span class="nl"&gt;"url"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"https://www.crunchbase.com/hub/fintech-companies-seed-funding"&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;  
  &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"category"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"cybersecurity"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;          &lt;/span&gt;&lt;span class="nl"&gt;"url"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"https://www.crunchbase.com/hub/cyber-security-startups"&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;  
  &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"category"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"saas"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;                   &lt;/span&gt;&lt;span class="nl"&gt;"url"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"https://www.crunchbase.com/hub/saas-companies-seed-funding"&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;  
  &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"category"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"artificial_intelligence"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="nl"&gt;"url"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"https://www.crunchbase.com/hub/artificial-intelligence-companies-early-stage-venture-funding"&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;  
&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Replace with your own, obviously.&lt;/p&gt;

&lt;p&gt;Crunchbase is a JavaScript-heavy SPA — it won’t respond to a plain  &lt;code&gt;requests.get&lt;/code&gt;. So before we fetch, I use  &lt;a href="https://get.brightdata.com/bd-web-unlocker?utm_content=a_practical_guide_to_entity_resolution_in_python_no_database_no_machine_learning" rel="noopener noreferrer"&gt;Bright Data's Web Unlocker&lt;/a&gt;, which handles JS rendering and anti-bot for me.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://get.brightdata.com/bd-web-unlocker?utm_content=a_practical_guide_to_entity_resolution_in_python_no_database_no_machine_learning&amp;amp;source=post_page-----89d55badaeac---------------------------------------" rel="noopener noreferrer"&gt;Sign up here -- Automated Web Unblocker&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;I set up a reusable client for this, and this is just a thin wrapper around their single  &lt;code&gt;POST&lt;/code&gt; endpoint  &lt;code&gt;https://api.brightdata.com/request&lt;/code&gt;. Make sure you’ve signed up, and have these set in your .env file first:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight ini"&gt;&lt;code&gt;&lt;span class="py"&gt;BRIGHTDATA_API_TOKEN&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="s"&gt;your_api_token  &lt;/span&gt;
&lt;span class="py"&gt;BRIGHTDATA_ZONE&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="s"&gt;your_web_unlocker_zone_name&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;bright_data_unlocker.py&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Fetch hub/listing pages as HTML or markdown.&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;__future__&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;annotations&lt;/span&gt;

&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;os&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;time&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;typing&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Any&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Dict&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Literal&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Optional&lt;/span&gt;

&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;requests&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;dotenv&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;load_dotenv&lt;/span&gt;

&lt;span class="nf"&gt;load_dotenv&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;span class="n"&gt;ContentFormat&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;Literal&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;html&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;markdown&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;


&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;BrightDataUnlockerClient&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;POST https://api.brightdata.com/request (Web Unlocker zone).&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;

    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;__init__&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;api_key&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;Optional&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;zone&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;Optional&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;country&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;Optional&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;api_key&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;api_key&lt;/span&gt; &lt;span class="ow"&gt;or&lt;/span&gt; &lt;span class="n"&gt;os&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;getenv&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;BRIGHT_DATA_API_KEY&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;zone&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;zone&lt;/span&gt; &lt;span class="ow"&gt;or&lt;/span&gt; &lt;span class="n"&gt;os&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;getenv&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;BRIGHT_DATA_UNLOCKER_ZONE&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;country&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;country&lt;/span&gt; &lt;span class="ow"&gt;or&lt;/span&gt; &lt;span class="n"&gt;os&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;getenv&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;BRIGHT_DATA_COUNTRY&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="c1"&gt;# optional
&lt;/span&gt;        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;api_endpoint&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;https://api.brightdata.com/request&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;

        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;api_key&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="k"&gt;raise&lt;/span&gt; &lt;span class="nc"&gt;ValueError&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;BRIGHT_DATA_API_KEY is required.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;zone&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="k"&gt;raise&lt;/span&gt; &lt;span class="nc"&gt;ValueError&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
                &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;BRIGHT_DATA_UNLOCKER_ZONE is required. &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
                &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Create a Web Unlocker API zone in Bright Data.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
            &lt;span class="p"&gt;)&lt;/span&gt;

        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;session&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;requests&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Session&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;session&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;headers&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;update&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="p"&gt;{&lt;/span&gt;
                &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Content-Type&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;application/json&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Authorization&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Bearer &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;api_key&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="p"&gt;}&lt;/span&gt;
        &lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;fetch&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;url&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;content_format&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;ContentFormat&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;markdown&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;max_retries&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;int&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Fetch page body. markdown =&amp;gt; format=raw + data_format=markdown (Bright Data).&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
        &lt;span class="n"&gt;last_err&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;Optional&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;Exception&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;
        &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;attempt&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nf"&gt;range&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;max_retries&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
            &lt;span class="k"&gt;try&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
                &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;_do_fetch&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;url&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;content_format&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;content_format&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="k"&gt;except&lt;/span&gt; &lt;span class="nb"&gt;Exception&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
                &lt;span class="n"&gt;last_err&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;e&lt;/span&gt;
                &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;attempt&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="n"&gt;max_retries&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
                    &lt;span class="n"&gt;time&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;sleep&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mf"&gt;0.5&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;attempt&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
        &lt;span class="k"&gt;assert&lt;/span&gt; &lt;span class="n"&gt;last_err&lt;/span&gt; &lt;span class="ow"&gt;is&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;
        &lt;span class="k"&gt;raise&lt;/span&gt; &lt;span class="n"&gt;last_err&lt;/span&gt;

    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;fetch_markdown&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;url&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;max_retries&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;int&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;fetch&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;url&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;content_format&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;markdown&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;max_retries&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;max_retries&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;fetch_html&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;url&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;max_retries&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;int&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;fetch&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;url&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;content_format&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;html&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;max_retries&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;max_retries&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;_do_fetch&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;url&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;content_format&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;ContentFormat&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;payload&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;Dict&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Any&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;zone&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;zone&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;url&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;url&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;format&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;raw&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="p"&gt;}&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;content_format&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;markdown&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="n"&gt;payload&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;data_format&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;markdown&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;country&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="n"&gt;payload&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;country&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;country&lt;/span&gt;

        &lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;session&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;post&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;api_endpoint&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;payload&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;timeout&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;120&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;raise_for_status&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

        &lt;span class="k"&gt;try&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;json&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
        &lt;span class="k"&gt;except&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;JSONDecodeError&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="c1"&gt;# data_format=markdown often returns the page body directly, not a JSON envelope
&lt;/span&gt;            &lt;span class="n"&gt;text&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;text&lt;/span&gt;
            &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;strip&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
                &lt;span class="k"&gt;raise&lt;/span&gt; &lt;span class="nc"&gt;RuntimeError&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Bright Data Unlocker empty response body&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;text&lt;/span&gt;

        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="nf"&gt;isinstance&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nb"&gt;dict&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
            &lt;span class="k"&gt;raise&lt;/span&gt; &lt;span class="nc"&gt;RuntimeError&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Bright Data unexpected response type: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="nf"&gt;type&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

        &lt;span class="n"&gt;inner_status&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;status_code&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;inner_status&lt;/span&gt; &lt;span class="ow"&gt;is&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt; &lt;span class="ow"&gt;and&lt;/span&gt; &lt;span class="n"&gt;inner_status&lt;/span&gt; &lt;span class="o"&gt;!=&lt;/span&gt; &lt;span class="mi"&gt;200&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="k"&gt;raise&lt;/span&gt; &lt;span class="nc"&gt;RuntimeError&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Bright Data Unlocker status_code=&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;inner_status&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

        &lt;span class="n"&gt;body&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;body&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;body&lt;/span&gt; &lt;span class="ow"&gt;is&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;status_code&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="ow"&gt;and&lt;/span&gt; &lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;status_code&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="mi"&gt;200&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
                &lt;span class="k"&gt;raise&lt;/span&gt; &lt;span class="nc"&gt;RuntimeError&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Bright Data Unlocker empty body&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="k"&gt;raise&lt;/span&gt; &lt;span class="nc"&gt;RuntimeError&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Bright Data Unlocker missing body: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="nf"&gt;list&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;keys&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="nf"&gt;isinstance&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;body&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
            &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;body&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;strip&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="nf"&gt;startswith&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
                &lt;span class="k"&gt;try&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
                    &lt;span class="n"&gt;nested&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;loads&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;body&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
                    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="nf"&gt;isinstance&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;nested&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nb"&gt;dict&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="ow"&gt;and&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;body&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;nested&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
                        &lt;span class="n"&gt;body&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;nested&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;body&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
                &lt;span class="k"&gt;except&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;JSONDecodeError&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
                    &lt;span class="k"&gt;pass&lt;/span&gt;
            &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="nf"&gt;str&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;body&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;strip&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
                &lt;span class="k"&gt;raise&lt;/span&gt; &lt;span class="nc"&gt;RuntimeError&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Bright Data Unlocker empty body string&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nf"&gt;str&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;body&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="nf"&gt;isinstance&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;body&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nb"&gt;dict&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
            &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;dumps&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;body&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nf"&gt;str&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;body&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Note how we can request&lt;code&gt;data_format=markdown&lt;/code&gt;. Using this param, Bright Data returns a sanitized markdown rendering of the page, which is  &lt;em&gt;much&lt;/em&gt;  easier to parse with regex than raw HTML.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;💡 If markdown still yields zero orgs for a hub, fetch_hubs.py --fallback-html can fetch or use cached HTML and run the HTML parser instead.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;With that in place, here’s our actual fetch script —  &lt;code&gt;fetch_hubs.py&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;fetch_hubs.py&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Fetch Crunchbase hub pages via Bright Data Web Unlocker; write hub_snapshot.json.&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;

&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;__future__&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;annotations&lt;/span&gt;

&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;argparse&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;time&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;datetime&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;datetime&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;timezone&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;pathlib&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Path&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;typing&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Any&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Dict&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;List&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Optional&lt;/span&gt;

&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;dotenv&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;load_dotenv&lt;/span&gt;

&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;bright_data_unlocker&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;BrightDataUnlockerClient&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;ContentFormat&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;parse_hubs&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;parse_organizations&lt;/span&gt;

&lt;span class="nf"&gt;load_dotenv&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;span class="n"&gt;_ROOT&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Path&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;__file__&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;resolve&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="n"&gt;parent&lt;/span&gt;
&lt;span class="n"&gt;_DEFAULT_RESPONSES_DIR&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;_ROOT&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;data&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;hub_responses&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;


&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;load_hub_urls&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;path&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;Path&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;List&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;Dict&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;]]:&lt;/span&gt;
    &lt;span class="n"&gt;raw&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;loads&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;path&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;read_text&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;encoding&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;utf-8&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="nf"&gt;isinstance&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;raw&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nb"&gt;list&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="k"&gt;raise&lt;/span&gt; &lt;span class="nc"&gt;ValueError&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;hub_urls.json must be a JSON array&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;out&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;List&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;Dict&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;]]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;item&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;raw&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="nf"&gt;isinstance&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;item&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nb"&gt;dict&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
            &lt;span class="k"&gt;continue&lt;/span&gt;
        &lt;span class="n"&gt;url&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;item&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;url&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="ow"&gt;or&lt;/span&gt; &lt;span class="sh"&gt;""&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;strip&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
        &lt;span class="n"&gt;category&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;item&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;category&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="ow"&gt;or&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;unknown&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;strip&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;url&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="n"&gt;out&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;category&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;category&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;url&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;url&lt;/span&gt;&lt;span class="p"&gt;})&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;out&lt;/span&gt;


&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;_response_file&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;category&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;content_format&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;ContentFormat&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;ext&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;md&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt; &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;content_format&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;markdown&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt; &lt;span class="k"&gt;else&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;html&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="n"&gt;safe&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;""&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;join&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;c&lt;/span&gt; &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;c&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;isalnum&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="ow"&gt;or&lt;/span&gt; &lt;span class="n"&gt;c&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;-_&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt; &lt;span class="k"&gt;else&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;_&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;c&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;category&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;safe&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;.&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;ext&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;


&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;_response_path&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;responses_dir&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;Path&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;category&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;content_format&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;ContentFormat&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;Path&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;responses_dir&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="nf"&gt;_response_file&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;category&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;content_format&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;


&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;load_cached_body&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;responses_dir&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;Path&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;category&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;content_format&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;ContentFormat&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;Optional&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;]:&lt;/span&gt;
    &lt;span class="n"&gt;path&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;_response_path&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;responses_dir&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;category&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;content_format&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="n"&gt;path&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;is_file&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="ow"&gt;or&lt;/span&gt; &lt;span class="n"&gt;path&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;stat&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="n"&gt;st_size&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;path&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;read_text&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;encoding&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;utf-8&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;


&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;save_response_body&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;responses_dir&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;Path&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;category&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;hub_url&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;content_format&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;ContentFormat&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;body&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;Path&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;responses_dir&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;mkdir&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;parents&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;exist_ok&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;path&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;_response_path&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;responses_dir&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;category&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;content_format&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;path&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;write_text&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;body&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;encoding&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;utf-8&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;path&lt;/span&gt;


&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;_manifest_path&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;responses_dir&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;Path&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;Path&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;responses_dir&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;manifest.json&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;


&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;_load_manifest&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;responses_dir&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;Path&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;Dict&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Any&lt;/span&gt;&lt;span class="p"&gt;]:&lt;/span&gt;
    &lt;span class="n"&gt;path&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;_manifest_path&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;responses_dir&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="n"&gt;path&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;is_file&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;hubs&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[]}&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;loads&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;path&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;read_text&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;encoding&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;utf-8&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;


&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;_upsert_manifest_entry&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;responses_dir&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;Path&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;category&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;hub_url&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;content_format&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;ContentFormat&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;response_path&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;Path&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;fetched_at&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;entry&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;category&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;category&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;hub_url&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;hub_url&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;content_format&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;content_format&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;response_file&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;response_path&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;fetched_at&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;fetched_at&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="n"&gt;manifest&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;_load_manifest&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;responses_dir&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;hubs&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;h&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;h&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;manifest&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;hubs&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="ow"&gt;or&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt; &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;h&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;category&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;!=&lt;/span&gt; &lt;span class="n"&gt;category&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="n"&gt;hubs&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;entry&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;manifest&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;hubs&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;hubs&lt;/span&gt;
    &lt;span class="n"&gt;manifest&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;updated_at&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;datetime&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;now&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;timezone&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;utc&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;isoformat&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="nf"&gt;_manifest_path&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;responses_dir&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;write_text&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;dumps&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;manifest&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;indent&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;ensure_ascii&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;False&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;encoding&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;utf-8&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;


&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;_parse_body&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;body&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;hub_url&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;content_format&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;ContentFormat&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;max_orgs&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;int&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;List&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;Dict&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Any&lt;/span&gt;&lt;span class="p"&gt;]]:&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nf"&gt;parse_organizations&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;body&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;hub_url&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;content_format&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;content_format&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;max_orgs&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;max_orgs&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;


&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;main&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;ap&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;argparse&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;ArgumentParser&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;description&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Fetch Crunchbase hub pages (Web Unlocker) and extract organization URLs.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;ap&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;add_argument&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;--hubs-json&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nb"&gt;type&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;Path&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;default&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;_ROOT&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;hub_urls.json&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;ap&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;add_argument&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;--out&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nb"&gt;type&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;Path&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;default&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;_ROOT&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;data&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;hub_snapshot.json&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;ap&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;add_argument&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;--format&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;choices&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;markdown&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;html&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
        &lt;span class="n"&gt;default&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;markdown&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;ap&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;add_argument&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;--max-orgs-per-hub&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nb"&gt;type&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nb"&gt;int&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;default&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;80&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;ap&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;add_argument&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;--delay&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nb"&gt;type&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nb"&gt;float&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;default&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;1.0&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;ap&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;add_argument&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;--responses-dir&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="nb"&gt;type&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;Path&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;default&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;_DEFAULT_RESPONSES_DIR&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;help&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Directory for cached raw hub page bodies (default: data/hub_responses).&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;ap&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;add_argument&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;--refetch&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;action&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;store_true&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;help&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Call Bright Data even if a cached response file exists.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;ap&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;add_argument&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;--parse-only&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;action&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;store_true&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;help&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Parse cached responses only; never call Bright Data.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;ap&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;add_argument&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;--fallback-html&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;action&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;store_true&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;help&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;If markdown parse finds 0 orgs, try cached or fetched HTML.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;args&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;ap&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;parse_args&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

    &lt;span class="n"&gt;responses_dir&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;args&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;responses_dir&lt;/span&gt;

    &lt;span class="n"&gt;hubs&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;load_hub_urls&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;args&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;hubs_json&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="n"&gt;hubs&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;raise&lt;/span&gt; &lt;span class="nc"&gt;SystemExit&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;No hubs in hub_urls.json&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="n"&gt;args&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;out&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;parent&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;mkdir&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;parents&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;exist_ok&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;Optional&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;BrightDataUnlockerClient&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="n"&gt;args&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;parse_only&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;client&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;BrightDataUnlockerClient&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

    &lt;span class="n"&gt;content_format&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;ContentFormat&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;args&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nb"&gt;format&lt;/span&gt;

    &lt;span class="n"&gt;payload&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;Dict&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Any&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;fetched_at&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;datetime&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;now&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;timezone&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;utc&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;isoformat&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;source&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;bright_data_web_unlocker&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;content_format&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;content_format&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;responses_dir&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nf"&gt;str&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;responses_dir&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;hubs&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[],&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;

    &lt;span class="n"&gt;n_hubs&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;hubs&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;hub&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nf"&gt;enumerate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;hubs&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;start&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="n"&gt;category&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;hub&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;category&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
        &lt;span class="n"&gt;url&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;hub&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;url&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
        &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s"&gt;[&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;/&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;n_hubs&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;] hub [&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;category&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;]: starting...&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;flush&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;block&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;Dict&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Any&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;category&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;category&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;hub_url&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;url&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;error&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;organic_count&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;rows&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[],&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;response_file&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nf"&gt;_response_file&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;category&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;content_format&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
        &lt;span class="p"&gt;}&lt;/span&gt;
        &lt;span class="n"&gt;parse_format&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;ContentFormat&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;content_format&lt;/span&gt;

        &lt;span class="k"&gt;try&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="n"&gt;body&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;Optional&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;
            &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="n"&gt;args&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;refetch&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
                &lt;span class="n"&gt;body&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;load_cached_body&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;responses_dir&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;category&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;content_format&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

            &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;body&lt;/span&gt; &lt;span class="ow"&gt;is&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
                &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;args&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;parse_only&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
                    &lt;span class="k"&gt;raise&lt;/span&gt; &lt;span class="nc"&gt;FileNotFoundError&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
                        &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;no cached response at &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="nf"&gt;_response_path&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;responses_dir&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;category&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;content_format&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
                        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;(run without --parse-only to fetch)&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
                    &lt;span class="p"&gt;)&lt;/span&gt;
                &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
                    &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;[&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;/&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;n_hubs&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;] hub [&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;category&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;]: fetching (&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;content_format&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;)...&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                    &lt;span class="n"&gt;flush&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                &lt;span class="p"&gt;)&lt;/span&gt;
                &lt;span class="k"&gt;assert&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt; &lt;span class="ow"&gt;is&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;
                &lt;span class="n"&gt;body&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;fetch&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;url&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;content_format&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;content_format&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
                &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
                    &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;[&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;/&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;n_hubs&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;] hub [&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;category&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;]: fetch done &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
                    &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;(&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;body&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="si"&gt;:&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; chars)&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                    &lt;span class="n"&gt;flush&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                &lt;span class="p"&gt;)&lt;/span&gt;
                &lt;span class="n"&gt;saved&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;save_response_body&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
                    &lt;span class="n"&gt;responses_dir&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;category&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;url&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;content_format&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;body&lt;/span&gt;
                &lt;span class="p"&gt;)&lt;/span&gt;
                &lt;span class="nf"&gt;_upsert_manifest_entry&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
                    &lt;span class="n"&gt;responses_dir&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                    &lt;span class="n"&gt;category&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                    &lt;span class="n"&gt;url&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                    &lt;span class="n"&gt;content_format&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                    &lt;span class="n"&gt;saved&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                    &lt;span class="n"&gt;fetched_at&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;datetime&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;now&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;timezone&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;utc&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;isoformat&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt;
                &lt;span class="p"&gt;)&lt;/span&gt;
                &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;[&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;/&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;n_hubs&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;] hub [&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;category&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;]: saved &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;saved&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;flush&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="k"&gt;else&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
                &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
                    &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;[&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;/&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;n_hubs&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;] hub [&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;category&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;]: using cache &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
                    &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="nf"&gt;_response_path&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;responses_dir&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;category&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;content_format&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                    &lt;span class="n"&gt;flush&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                &lt;span class="p"&gt;)&lt;/span&gt;

            &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;[&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;/&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;n_hubs&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;] hub [&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;category&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;]: parsing...&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;flush&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="n"&gt;rows&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;_parse_body&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;body&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;url&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;parse_format&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;args&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;max_orgs_per_hub&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

            &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="n"&gt;rows&lt;/span&gt; &lt;span class="ow"&gt;and&lt;/span&gt; &lt;span class="n"&gt;args&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;fallback_html&lt;/span&gt; &lt;span class="ow"&gt;and&lt;/span&gt; &lt;span class="n"&gt;parse_format&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;markdown&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
                &lt;span class="n"&gt;html_body&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;load_cached_body&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;responses_dir&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;category&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;html&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
                &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;html_body&lt;/span&gt; &lt;span class="ow"&gt;is&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt; &lt;span class="ow"&gt;and&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="n"&gt;args&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;parse_only&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
                    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
                        &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;[&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;/&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;n_hubs&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;] hub [&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;category&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;]: markdown had 0 orgs, &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
                        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;fetching HTML...&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                        &lt;span class="n"&gt;flush&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                    &lt;span class="p"&gt;)&lt;/span&gt;
                    &lt;span class="k"&gt;assert&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt; &lt;span class="ow"&gt;is&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;
                    &lt;span class="n"&gt;html_body&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;fetch&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;url&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;content_format&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;html&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
                    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
                        &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;[&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;/&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;n_hubs&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;] hub [&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;category&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;]: HTML fetch done &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
                        &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;(&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;html_body&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="si"&gt;:&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; chars)&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                        &lt;span class="n"&gt;flush&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                    &lt;span class="p"&gt;)&lt;/span&gt;
                    &lt;span class="n"&gt;saved&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;save_response_body&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
                        &lt;span class="n"&gt;responses_dir&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;category&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;url&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;html&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;html_body&lt;/span&gt;
                    &lt;span class="p"&gt;)&lt;/span&gt;
                    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;[&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;/&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;n_hubs&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;] hub [&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;category&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;]: saved &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;saved&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;flush&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
                &lt;span class="k"&gt;elif&lt;/span&gt; &lt;span class="n"&gt;html_body&lt;/span&gt; &lt;span class="ow"&gt;is&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
                    &lt;span class="k"&gt;raise&lt;/span&gt; &lt;span class="nc"&gt;FileNotFoundError&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
                        &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;no cached HTML at &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="nf"&gt;_response_path&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;responses_dir&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;category&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;html&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
                    &lt;span class="p"&gt;)&lt;/span&gt;
                &lt;span class="k"&gt;else&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
                    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
                        &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;[&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;/&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;n_hubs&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;] hub [&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;category&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;]: markdown had 0 orgs, &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
                        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;using cached HTML...&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                        &lt;span class="n"&gt;flush&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                    &lt;span class="p"&gt;)&lt;/span&gt;
                &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;[&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;/&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;n_hubs&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;] hub [&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;category&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;]: parsing HTML...&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;flush&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
                &lt;span class="n"&gt;rows&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;_parse_body&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;html_body&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;url&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;html&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;args&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;max_orgs_per_hub&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
                &lt;span class="n"&gt;parse_format&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;html&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
                &lt;span class="n"&gt;block&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;response_file&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;_response_file&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;category&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;html&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

            &lt;span class="n"&gt;block&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;content_format&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;parse_format&lt;/span&gt;
            &lt;span class="n"&gt;block&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;organic_count&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;rows&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="n"&gt;block&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;rows&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;rows&lt;/span&gt;
            &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
                &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;[&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;/&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;n_hubs&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;] hub [&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;category&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;]: done - &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
                &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;rows&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; organizations&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                &lt;span class="n"&gt;flush&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="p"&gt;)&lt;/span&gt;

        &lt;span class="k"&gt;except&lt;/span&gt; &lt;span class="nb"&gt;Exception&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;[&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;/&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;n_hubs&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;] hub [&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;category&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;]: failed - &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;flush&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="n"&gt;block&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;error&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;str&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

        &lt;span class="n"&gt;payload&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;hubs&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;block&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="n"&gt;args&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;parse_only&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="n"&gt;time&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;sleep&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;args&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;delay&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="n"&gt;args&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;out&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;write_text&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;dumps&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;payload&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;indent&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;ensure_ascii&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;False&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;encoding&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;utf-8&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;total&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;sum&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;h&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;organic_count&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="ow"&gt;or&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;h&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;payload&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;hubs&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s"&gt;All hubs processed. Wrote &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;args&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;out&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
        &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;(&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;total&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; organizations across &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;n_hubs&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; hubs).&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;flush&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;


&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;__name__&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;__main__&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="nf"&gt;main&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Note how I’m caching the raw bodies under &lt;code&gt;data/hub_responses/&lt;/code&gt; so re-runs with &lt;code&gt;--parse-only&lt;/code&gt; don't burn any API credits.&lt;/p&gt;

&lt;h2&gt;
  
  
  Stage 2: Parsing from Markdown into Organization Rows
&lt;/h2&gt;

&lt;p&gt;Our &lt;code&gt;parse_hubs.py&lt;/code&gt; pulls organization slugs and display names out of the cached page bodies from the previous step. It runs three regex patterns in priority order:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# parse_hubs.py  
&lt;/span&gt;
&lt;span class="c1"&gt;# Priority 1: Bright Data relative markdown links
# Matches: ](/organization/slug "Display Name")
&lt;/span&gt;&lt;span class="n"&gt;_ORG_REL_LINK&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;re&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;compile&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="sa"&gt;r&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;\]\(/organization/([a-z0-9_-]+)(?:\s+\"([^\"]*)\")?\s*\)&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;re&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;I&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Priority 2: Standard absolute markdown links
# Matches: [Company Name](https://www.crunchbase.com/organization/slug)
&lt;/span&gt;&lt;span class="n"&gt;_ORG_MD_LINK&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;re&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;compile&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="sa"&gt;r&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;\[([^\]]+)\]\(\s*&amp;lt;?https?://[^&amp;gt;\s)]*crunchbase.com/organization/([a-z0-9_-]+)/?&amp;gt;?\s*\)&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;re&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;I&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Fallback: bare /organization/slug anywhere in text
&lt;/span&gt;&lt;span class="n"&gt;_ORG_IN_TEXT&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;re&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;compile&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="sa"&gt;r&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;(?:https?://[^/\s]*crunchbase.com)?/organization/([a-z0-9_-]+)&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;re&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;I&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Each hub gets parsed into rows like:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"url"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"https://www.crunchbase.com/organization/lovable"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"slug"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"lovable"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"title"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Lovable"&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Here’s the full code for &lt;code&gt;parse_hubs.py&lt;/code&gt;. Note that I also keep a blocklist of well-known VCs and accelerators (&lt;code&gt;y-combinator&lt;/code&gt;, &lt;code&gt;techstars&lt;/code&gt;, &lt;code&gt;andreessen-horowitz&lt;/code&gt;, etc.) that show up on hub pages but are the &lt;em&gt;investors&lt;/em&gt;, not the companies being listed. Without this, you get YC ranked #1 on every hub it's ever touched, which is obviously not what we want.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;parse_hubs.py&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Parse Crunchbase hub pages (markdown or HTML) for /organization/ links.&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;

&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;__future__&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;annotations&lt;/span&gt;

&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;re&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;typing&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Any&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Dict&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;List&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Literal&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Set&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;urllib.parse&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;urljoin&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;urlparse&lt;/span&gt;

&lt;span class="n"&gt;ContentFormat&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;Literal&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;html&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;markdown&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;

&lt;span class="n"&gt;_ORG_IN_TEXT&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;re&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;compile&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="sa"&gt;r&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;(?:https?://[^/\s]*crunchbase\.com)?/organization/([a-z0-9_-]+)&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;re&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;I&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="c1"&gt;# [Company Name](https://www.crunchbase.com/organization/slug)
&lt;/span&gt;&lt;span class="n"&gt;_ORG_MD_LINK&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;re&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;compile&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="sa"&gt;r&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;\[([^\]]+)\]\(\s*&amp;lt;?https?://[^&amp;gt;\s)]*crunchbase\.com/organization/([a-z0-9_-]+)/?&amp;gt;?\s*\)&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;re&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;I&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="c1"&gt;# Bright Data markdown: multi-line link ending with ](/organization/slug "Display Name")
&lt;/span&gt;&lt;span class="n"&gt;_ORG_REL_LINK&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;re&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;compile&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="sa"&gt;r&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;\]\(/organization/([a-z0-9_-]+)(?:\s+\"([^\"]*)\")?\s*\)&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;re&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;I&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;_ORG_BLOCKLIST&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;frozenset&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;y-combinator&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;techstars&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;national-science-foundation&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;masschallenge&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;easme&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;andreessen-horowitz&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;sequoia-capital&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;accel&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;


&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;slug_to_display_name&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;slug&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;slug&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;replace&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;-&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt; &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;title&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;


&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;_append_org&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;rows&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;List&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;Dict&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Any&lt;/span&gt;&lt;span class="p"&gt;]],&lt;/span&gt;
    &lt;span class="n"&gt;seen_slugs&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;Set&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
    &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;slug&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;title&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;hub_url&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;max_orgs&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;int&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;rows&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;=&lt;/span&gt; &lt;span class="n"&gt;max_orgs&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt;
    &lt;span class="n"&gt;slug&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;slug&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;lower&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;slug&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;_ORG_BLOCKLIST&lt;/span&gt; &lt;span class="ow"&gt;or&lt;/span&gt; &lt;span class="n"&gt;slug&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;seen_slugs&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt;
    &lt;span class="n"&gt;seen_slugs&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;add&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;slug&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;base&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="nf"&gt;urlparse&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;hub_url&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="n"&gt;scheme&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;://&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="nf"&gt;urlparse&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;hub_url&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="n"&gt;netloc&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="n"&gt;name&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;title&lt;/span&gt; &lt;span class="ow"&gt;or&lt;/span&gt; &lt;span class="sh"&gt;""&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;strip&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="ow"&gt;or&lt;/span&gt; &lt;span class="nf"&gt;slug_to_display_name&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;slug&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;rows&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;url&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nf"&gt;urljoin&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;base&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;/organization/&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;slug&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;slug&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;slug&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;title&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;


&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;parse_organizations_from_markdown&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;markdown&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;hub_url&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;max_orgs&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;int&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;80&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;List&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;Dict&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Any&lt;/span&gt;&lt;span class="p"&gt;]]:&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Extract orgs from markdown links; fall back to bare organization URLs.&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="n"&gt;seen_slugs&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;Set&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;set&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="n"&gt;rows&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;List&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;Dict&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Any&lt;/span&gt;&lt;span class="p"&gt;]]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;

    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;match&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;_ORG_REL_LINK&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;finditer&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;markdown&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="n"&gt;slug&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;match&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;group&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;title&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;match&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;group&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="ow"&gt;or&lt;/span&gt; &lt;span class="sh"&gt;""&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;strip&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
        &lt;span class="nf"&gt;_append_org&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;rows&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;seen_slugs&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;slug&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;slug&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;title&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;title&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;hub_url&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;hub_url&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;max_orgs&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;max_orgs&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;rows&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;=&lt;/span&gt; &lt;span class="n"&gt;max_orgs&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;rows&lt;/span&gt;

    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;match&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;_ORG_MD_LINK&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;finditer&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;markdown&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="n"&gt;title&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;slug&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;match&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;group&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;strip&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt; &lt;span class="n"&gt;match&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;group&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="nf"&gt;_append_org&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;rows&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;seen_slugs&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;slug&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;slug&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;title&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;title&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;hub_url&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;hub_url&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;max_orgs&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;max_orgs&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;rows&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;=&lt;/span&gt; &lt;span class="n"&gt;max_orgs&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;rows&lt;/span&gt;

    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;rows&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;rows&lt;/span&gt;

    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;match&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;_ORG_IN_TEXT&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;finditer&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;markdown&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="nf"&gt;_append_org&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="n"&gt;rows&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;seen_slugs&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;slug&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;match&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;group&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
            &lt;span class="n"&gt;title&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;""&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;hub_url&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;hub_url&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;max_orgs&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;max_orgs&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;rows&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;=&lt;/span&gt; &lt;span class="n"&gt;max_orgs&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="k"&gt;break&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;rows&lt;/span&gt;


&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;parse_organizations_from_html&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;html&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;hub_url&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;max_orgs&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;int&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;80&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;List&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;Dict&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Any&lt;/span&gt;&lt;span class="p"&gt;]]:&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Extract unique organization rows from hub page HTML.&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="n"&gt;seen_slugs&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;Set&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;set&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="n"&gt;rows&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;List&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;Dict&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Any&lt;/span&gt;&lt;span class="p"&gt;]]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;

    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;match&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;_ORG_IN_TEXT&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;finditer&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;html&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="nf"&gt;_append_org&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="n"&gt;rows&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;seen_slugs&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;slug&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;match&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;group&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
            &lt;span class="n"&gt;title&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;""&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;hub_url&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;hub_url&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;max_orgs&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;max_orgs&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;rows&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;=&lt;/span&gt; &lt;span class="n"&gt;max_orgs&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="k"&gt;break&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;rows&lt;/span&gt;


&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;parse_organizations&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;body&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;hub_url&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;content_format&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;ContentFormat&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;markdown&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;max_orgs&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;int&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;80&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;List&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;Dict&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Any&lt;/span&gt;&lt;span class="p"&gt;]]:&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;content_format&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;markdown&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nf"&gt;parse_organizations_from_markdown&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;body&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;hub_url&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;max_orgs&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;max_orgs&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nf"&gt;parse_organizations_from_html&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;body&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;hub_url&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;max_orgs&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;max_orgs&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;First-run gotcha I hit was a classic.&lt;/strong&gt; My original parser expected absolute URLs (&lt;code&gt;https://www.crunchbase.com/organization/...&lt;/code&gt;), but Bright Data's markdown renderer produces &lt;em&gt;relative&lt;/em&gt; links (&lt;code&gt;/organization/slug "Display Name"&lt;/code&gt;) 🙃. So zero companies extracted on the first run — &lt;em&gt;simply because the regex didn't match&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;So I just added&lt;code&gt;_ORG_REL_LINK&lt;/code&gt; to the parser and re-ran Stage 1 with &lt;code&gt;--parse-only&lt;/code&gt;, fixing it at no additional API cost. &lt;strong&gt;This is why we cached our raw response bodies.&lt;/strong&gt; Your parser will probably need trial-and-erroring more than once, and you don’t want to actually re-fetch the data for that.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Output of this stage:&lt;/strong&gt; A &lt;code&gt;hub_snapshot.json&lt;/code&gt; — 96 organizations across 4 hubs (Fintech produced 26, Cybersecurity: 24, SaaS: 22, AI: 24). Note that these are &lt;em&gt;hub leaderboard&lt;/em&gt; entries, not full Crunchbase exports.&lt;/p&gt;

&lt;p&gt;Because the full Crunchbase lists run to &lt;em&gt;thousands&lt;/em&gt;; I'm taking the curated top slice on purpose, because the cleaner my source is, the more clearly the fuzzy lift shows up against it.&lt;/p&gt;

&lt;h2&gt;
  
  
  Stage 3: Turn Scraped Records into a Flat Table
&lt;/h2&gt;

&lt;p&gt;Before clustering, I flatten the nested snapshot into one uniform record per company appearance. &lt;code&gt;extract.py&lt;/code&gt; handles this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;From hub_snapshot.json to raw_records.json with company_name per organization.&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;    

&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;__future__&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;annotations&lt;/span&gt;    

&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt;    
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;datetime&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;datetime&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;timezone&lt;/span&gt;    
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;pathlib&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Path&lt;/span&gt;    
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;typing&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Any&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Dict&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;List&lt;/span&gt;    


&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;records_from_hub_snapshot&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;Dict&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Any&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;List&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;Dict&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Any&lt;/span&gt;&lt;span class="p"&gt;]]:&lt;/span&gt;    
    &lt;span class="n"&gt;records&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;List&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;Dict&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Any&lt;/span&gt;&lt;span class="p"&gt;]]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;    
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;hi&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;block&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nf"&gt;enumerate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;hubs&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="ow"&gt;or&lt;/span&gt; &lt;span class="p"&gt;[]):&lt;/span&gt;    
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;block&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;error&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;    
            &lt;span class="k"&gt;continue&lt;/span&gt;    
        &lt;span class="n"&gt;category&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;block&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;category&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="ow"&gt;or&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;unknown&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;strip&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;    
        &lt;span class="n"&gt;hub_url&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;block&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;hub_url&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="ow"&gt;or&lt;/span&gt; &lt;span class="sh"&gt;""&lt;/span&gt;    
        &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;ri&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;row&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nf"&gt;enumerate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;block&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;rows&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="ow"&gt;or&lt;/span&gt; &lt;span class="p"&gt;[]):&lt;/span&gt;    
            &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="nf"&gt;isinstance&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;row&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nb"&gt;dict&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;    
                &lt;span class="k"&gt;continue&lt;/span&gt;    
            &lt;span class="n"&gt;url&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;row&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;url&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="ow"&gt;or&lt;/span&gt; &lt;span class="sh"&gt;""&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;strip&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;    
            &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="n"&gt;url&lt;/span&gt; &lt;span class="ow"&gt;or&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;/organization/&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;url&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;lower&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;    
                &lt;span class="k"&gt;continue&lt;/span&gt;    
            &lt;span class="n"&gt;title&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;row&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;title&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="ow"&gt;or&lt;/span&gt; &lt;span class="sh"&gt;""&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;strip&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;    
            &lt;span class="n"&gt;slug&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;row&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;slug&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="ow"&gt;or&lt;/span&gt; &lt;span class="sh"&gt;""&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;strip&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;    
            &lt;span class="n"&gt;company_name&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;title&lt;/span&gt; &lt;span class="ow"&gt;or&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;slug&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;replace&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;-&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt; &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;title&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;slug&lt;/span&gt; &lt;span class="k"&gt;else&lt;/span&gt; &lt;span class="sh"&gt;""&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;    
            &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="n"&gt;company_name&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;    
                &lt;span class="k"&gt;continue&lt;/span&gt;    
            &lt;span class="n"&gt;records&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;    
                &lt;span class="p"&gt;{&lt;/span&gt;    
                    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;id&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;hub:&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;hi&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;:&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;ri&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;    
                    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;source&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;crunchbase_hub&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;    
                    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;category&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;category&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;    
                    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;company_name&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;company_name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;    
                    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;raw_name&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;title&lt;/span&gt; &lt;span class="ow"&gt;or&lt;/span&gt; &lt;span class="n"&gt;company_name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;    
                    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;url&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;url&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;    
                    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;domain&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;www.crunchbase.com&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;    
                    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;hub_url&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;hub_url&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;    
                    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;position&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;ri&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;    
                &lt;span class="p"&gt;}&lt;/span&gt;    
            &lt;span class="p"&gt;)&lt;/span&gt;    
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;records&lt;/span&gt;    


&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;build_raw_payload&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;snapshot_path&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;Path&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;Dict&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Any&lt;/span&gt;&lt;span class="p"&gt;]:&lt;/span&gt;    
    &lt;span class="n"&gt;raw&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;loads&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;snapshot_path&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;read_text&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;encoding&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;utf-8&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;    
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="nf"&gt;isinstance&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;raw&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;hubs&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="nb"&gt;list&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;    
        &lt;span class="k"&gt;raise&lt;/span&gt; &lt;span class="nc"&gt;ValueError&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;snapshot_path&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;: expected hub snapshot with &lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;hubs&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt; array&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;    
    &lt;span class="n"&gt;records&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;records_from_hub_snapshot&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;raw&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;    
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;    
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;extracted_at&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;datetime&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;now&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;timezone&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;utc&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;isoformat&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt;    
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;snapshot&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nf"&gt;str&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;snapshot_path&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;    
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;record_count&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;records&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;    
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;records&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;records&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;    
    &lt;span class="p"&gt;}&lt;/span&gt;    


&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;write_raw_records&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;snapshot_path&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;Path&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;out_path&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;Path&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;Dict&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Any&lt;/span&gt;&lt;span class="p"&gt;]:&lt;/span&gt;    
    &lt;span class="n"&gt;payload&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;build_raw_payload&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;snapshot_path&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;    
    &lt;span class="n"&gt;out_path&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;parent&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;mkdir&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;parents&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;exist_ok&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;    
    &lt;span class="n"&gt;out_path&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;write_text&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;    
        &lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;dumps&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;payload&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;indent&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;ensure_ascii&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;False&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;n&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;    
        &lt;span class="n"&gt;encoding&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;utf-8&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;    
    &lt;span class="p"&gt;)&lt;/span&gt;    
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;payload&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The &lt;code&gt;id&lt;/code&gt; field (&lt;code&gt;hub:0:3&lt;/code&gt;, &lt;code&gt;hub:2:11&lt;/code&gt;, etc.) is our stable key that links each raw record to its canonical cluster in Stage 4. Deterministic, derivable from position, and most importantly, easy to debug.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Output:&lt;/strong&gt; &lt;code&gt;raw_records.json&lt;/code&gt; — 96 rows, all &lt;code&gt;source: "crunchbase_hub"&lt;/code&gt;fields, tagged by category.&lt;/p&gt;

&lt;h2&gt;
  
  
  Stage 4: Reconciliation — Normalize + Dedupe + Fuzzy Cluster
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Entity resolution reconciliation&lt;/strong&gt; (Stage 4) collapses duplicate company names into canonical clusters. In this dataset, 96 scraped rows become &lt;strong&gt;88 canonical companies&lt;/strong&gt; after normalization and fuzzy clustering. Four names show up on more than one hub — &lt;strong&gt;Callaghan Innovation&lt;/strong&gt; and &lt;strong&gt;EISMEA&lt;/strong&gt; on all four leaderboards, &lt;strong&gt;PayTic&lt;/strong&gt; and &lt;strong&gt;SixThirty&lt;/strong&gt; on two — which gives duplicate rows before clustering.&lt;/p&gt;

&lt;p&gt;After exact normalization there are &lt;strong&gt;88&lt;/strong&gt; distinct normalized names, which happens to be the same count as final clusters at WRatio threshold 90 — meaning &lt;em&gt;no additional fuzzy merges were needed beyond collapsing the cross-hub duplicates&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;I run reconciliation in two passes.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;See full code here for &lt;strong&gt;reconcile.py:&lt;/strong&gt; &lt;a href="https://gist.github.com/sixthextinction/5c711e48353f4f7765e13cc4bb1b25de" rel="noopener noreferrer"&gt;https://gist.github.com/sixthextinction/5c711e48353f4f7765e13cc4bb1b25de&lt;/a&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h3&gt;
  
  
  Pass 1: Exact normalization
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# reconcile.py  
&lt;/span&gt;&lt;span class="n"&gt;_LEGAL&lt;/span&gt;     &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;re&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;compile&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;  
    &lt;span class="sa"&gt;r&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;b(inc.?|llc.?|ltd.?|plc.?|corp.?|corporation|co.?|company|limited)b&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  
    &lt;span class="n"&gt;re&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;I&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  
&lt;span class="p"&gt;)&lt;/span&gt;  
&lt;span class="n"&gt;_NON_ALNUM&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;re&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;compile&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;r&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;[^ws]&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;re&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;UNICODE&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;normalize_company_name&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;s&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;  
    &lt;span class="n"&gt;s&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;s&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;lower&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="nf"&gt;strip&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;  
    &lt;span class="n"&gt;s&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;_NON_ALNUM&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;sub&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt; &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;s&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;   &lt;span class="c1"&gt;# strip punctuation  
&lt;/span&gt;    &lt;span class="n"&gt;s&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;_LEGAL&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;sub&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt; &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;s&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;       &lt;span class="c1"&gt;# drop legal suffixes  
&lt;/span&gt;    &lt;span class="n"&gt;s&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;re&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;sub&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;r&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;s+&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt; &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;s&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;strip&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;  
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;s&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;After normalization, I group records by their normalized string. &lt;code&gt;"Lovable"&lt;/code&gt;, &lt;code&gt;"lovable"&lt;/code&gt;, and &lt;code&gt;"Lovable."&lt;/code&gt; all collapse into the same group. This removes trivial duplicates &lt;em&gt;before&lt;/em&gt; the more expensive fuzzy-matching pass. &lt;strong&gt;TL;DR: Do the cheap pass first, expensive pass second&lt;/strong&gt; — same reason you’d put a &lt;code&gt;WHERE&lt;/code&gt; clause &lt;em&gt;before&lt;/em&gt; a &lt;code&gt;JOIN&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;For this dataset that’s &lt;strong&gt;96 hash inserts&lt;/strong&gt; — one &lt;code&gt;normalize_company_name()&lt;/code&gt; + one dict lookup per row — roughly &lt;strong&gt;~O(n)&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;The important optimization is that normalization shrinks the search space &lt;em&gt;before&lt;/em&gt; the quadratic fuzzy pass runs. Without Pass 1, naïve all-pairs fuzzy matching over &lt;code&gt;n = 10,000&lt;/code&gt; unique names would require:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;n(n−1)/2 ≈ 50 million comparisons&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;It takes my laptop ~&lt;strong&gt;1.3 µs&lt;/strong&gt; per RapidFuzz &lt;code&gt;WRatio&lt;/code&gt; call on ~30-character names, so that pushes our runtime toward ~60 seconds instead of milliseconds. Not ideal — which is exactly why Pass 1 exists, to reduce &lt;code&gt;n&lt;/code&gt; before the O(n²) step becomes expensive.&lt;/p&gt;

&lt;h3&gt;
  
  
  Pass 2: Fuzzy merge
&lt;/h3&gt;

&lt;p&gt;I then compare each exact group against existing clusters using &lt;strong&gt;WRatio&lt;/strong&gt; from RapidFuzz.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# reconcile.py  
&lt;/span&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;rapidfuzz&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;fuzz&lt;/span&gt;  
&lt;span class="n"&gt;_FUZZY_SCORER&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;fuzz&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;WRatio&lt;/span&gt;  

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;_fuzzy_merge_groups&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;    &lt;span class="n"&gt;groups&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;List&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;List&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;Dict&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Any&lt;/span&gt;&lt;span class="p"&gt;]]],&lt;/span&gt;  
    &lt;span class="n"&gt;threshold&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;float&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;           &lt;span class="c1"&gt;# default: 90.0) -&amp;gt; List[Cluster]:  
&lt;/span&gt;    &lt;span class="n"&gt;clusters&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;List&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;Cluster&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;  
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;group&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nf"&gt;sorted&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;groups&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;key&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="k"&gt;lambda&lt;/span&gt; &lt;span class="n"&gt;g&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;  
        &lt;span class="nf"&gt;min&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;_source_rank&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;m&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;source&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="ow"&gt;or&lt;/span&gt; &lt;span class="sh"&gt;""&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;m&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;g&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;  
        &lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;g&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;  
    &lt;span class="p"&gt;)):&lt;/span&gt;  
        &lt;span class="n"&gt;rep&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;pick_canonical_name&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;group&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  
        &lt;span class="n"&gt;placed&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="bp"&gt;False&lt;/span&gt;  
        &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;cluster&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;clusters&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;  
            &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="nf"&gt;_FUZZY_SCORER&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;rep&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;cluster&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;canonical_name&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;=&lt;/span&gt; &lt;span class="n"&gt;threshold&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;  
                &lt;span class="n"&gt;cluster&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;members&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;extend&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;group&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  
                &lt;span class="n"&gt;cluster&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;canonical_name&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;pick_canonical_name&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;cluster&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;members&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  
                &lt;span class="n"&gt;cluster&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;canonical_id&lt;/span&gt;   &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;make_canonical_id&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;cluster&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;canonical_name&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  
                &lt;span class="n"&gt;placed&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="bp"&gt;True&lt;/span&gt;  
                &lt;span class="k"&gt;break&lt;/span&gt;  
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="n"&gt;placed&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;  
            &lt;span class="n"&gt;clusters&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nc"&gt;Cluster&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;  
                &lt;span class="n"&gt;canonical_id&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nf"&gt;make_canonical_id&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;rep&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;  
                &lt;span class="n"&gt;canonical_name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;rep&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  
                &lt;span class="n"&gt;members&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nf"&gt;list&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;group&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;  
            &lt;span class="p"&gt;))&lt;/span&gt;  
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;clusters&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Here, we have to compare each group’s representative against existing cluster canonicals. So the worst case with &lt;code&gt;g = 88&lt;/code&gt; exact groups would be&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;0 + 1 + 2 + … + 87 = 3,828 comparisons&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;That’s roughly ~O(g²).&lt;/p&gt;

&lt;h3&gt;
  
  
  What is WRatio and Why Use It?
&lt;/h3&gt;

&lt;p&gt;RapidFuzz ships several scorers — see the &lt;a href="https://rapidfuzz.github.io/RapidFuzz/Usage/fuzz.html" rel="noopener noreferrer"&gt;rapidfuzz.fuzz docs&lt;/a&gt; for the full list. We use &lt;a href="https://rapidfuzz.github.io/RapidFuzz/Usage/fuzz.html#wratio" rel="noopener noreferrer"&gt;fuzz.WRatio&lt;/a&gt; (weighted ratio; same algorithm family as &lt;a href="https://github.com/seatgeek/fuzzywuzzy" rel="noopener noreferrer"&gt;FuzzyWuzzy’s WRatio&lt;/a&gt;) because company names drift in different ways and no single metric covers all of them.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://rapidfuzz.github.io/RapidFuzz/Usage/fuzz.html#wratio" rel="noopener noreferrer"&gt;WRatio&lt;/a&gt; is a &lt;strong&gt;meta-scorer&lt;/strong&gt;: for each pair of strings it runs several ratio algorithms internally (with length-based weighting) and returns the best score. It combines:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  &lt;a href="https://rapidfuzz.github.io/RapidFuzz/Usage/fuzz.html#ratio" rel="noopener noreferrer"&gt;ratio&lt;/a&gt; — character-level edit distance. Good for typos; bad when one name is much longer (&lt;code&gt;Necker FinTech&lt;/code&gt; vs &lt;code&gt;Necker FinTech Holdings Inc.&lt;/code&gt; looks like a poor match).&lt;/li&gt;
&lt;li&gt;  &lt;a href="https://rapidfuzz.github.io/RapidFuzz/Usage/fuzz.html#partial-ratio" rel="noopener noreferrer"&gt;partial_ratio&lt;/a&gt; — treats the shorter string as a substring of the longer one. Catches display names contained in legal names.&lt;/li&gt;
&lt;li&gt;  &lt;a href="https://rapidfuzz.github.io/RapidFuzz/Usage/fuzz.html#token-sort-ratio" rel="noopener noreferrer"&gt;token_sort_ratio&lt;/a&gt; — splits on words, sorts them, then compares. Handles reordered tokens.&lt;/li&gt;
&lt;li&gt;  &lt;a href="https://rapidfuzz.github.io/RapidFuzz/Usage/fuzz.html#token-set-ratio" rel="noopener noreferrer"&gt;token_set_ratio&lt;/a&gt; — compares word &lt;em&gt;sets&lt;/em&gt;, ignoring duplicates and extras like &lt;code&gt;Inc.&lt;/code&gt; or &lt;code&gt;Group&lt;/code&gt;.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;You rarely know in advance &lt;em&gt;which&lt;/em&gt; kind of drift a CRM row will have — suffix appended, spacing changed, words reordered. &lt;strong&gt;WRatio picks the strategy that scores highest for that specific pair&lt;/strong&gt;, which is exactly what you want for entity resolution on names alone.&lt;/p&gt;

&lt;p&gt;We default to &lt;strong&gt;threshold 90&lt;/strong&gt;: strict enough that unrelated pairs (&lt;code&gt;Stripe&lt;/code&gt; vs &lt;code&gt;Climate Corp&lt;/code&gt;) stay out, loose enough that real variants (&lt;code&gt;PointsKash&lt;/code&gt; vs &lt;code&gt;Points Kash&lt;/code&gt;) merge. Tune it on your data.&lt;/p&gt;

&lt;p&gt;On this dataset specifically, WRatio handles the drift patterns we actually see in company names (or historically have, anyway):&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Hub / scraped name&lt;/th&gt;
&lt;th&gt;CRM variant (in &lt;code&gt;sample_crm.json&lt;/code&gt;)&lt;/th&gt;
&lt;th&gt;Drift type&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;Necker FinTech&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;Necker FinTech Holdings Inc.&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Legal suffix + spacing (&lt;code&gt;Fin Tech&lt;/code&gt; vs &lt;code&gt;FinTech&lt;/code&gt;)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;PANTA&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;PANTA Group&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Type descriptor appended&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;Physical Intelligence&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;Physical Intelligence (Pi), Inc.&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Parenthetical + legal suffix&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;PointsKash&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;Points Kash&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Token spacing&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;qBotica&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;q Botica&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Token spacing&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Pure &lt;code&gt;ratio&lt;/code&gt; (edit distance) would heavily penalize &lt;code&gt;Necker FinTech&lt;/code&gt; vs &lt;code&gt;Necker FinTech Holdings Inc.&lt;/code&gt; because three extra words add significant distance. So&lt;code&gt;partial_ratio&lt;/code&gt; handles containment and&lt;code&gt;token_set_ratio&lt;/code&gt; handles reordering. &lt;strong&gt;WRatio picks the strategy that produces the best score for each specific pair&lt;/strong&gt; — which is exactly the behavior you want when you don't know in advance &lt;em&gt;how&lt;/em&gt; a name is going to drift.&lt;/p&gt;

&lt;p&gt;Display names with &lt;strong&gt;no shared tokens&lt;/strong&gt; to the legal entity — e.g. hub title &lt;code&gt;Investing.com&lt;/code&gt; vs operator &lt;code&gt;Fusion Media Limited&lt;/code&gt;, or &lt;code&gt;Lyrie.ai&lt;/code&gt; vs &lt;code&gt;OTT Cybersecurity Inc.&lt;/code&gt; — stay below threshold. WRatio correctly refuses to merge them. That’s a good thing — those belong in a lookup table or enrichment API, not in a string-similarity pass (see Caveats).&lt;/p&gt;

&lt;p&gt;One last thing before we move on to the demo — every time a cluster gains new members, I re-evaluate its canonical name. The source ranking (&lt;code&gt;crunchbase_hub&lt;/code&gt; = 0, anything else = 99) ensures that short, clean display names win over longer legal variants:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;pick_canonical_name&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;members&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;Sequence&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;Dict&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Any&lt;/span&gt;&lt;span class="p"&gt;]])&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;  
    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;sort_key&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;m&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;  
        &lt;span class="n"&gt;name&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;m&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;company_name&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="ow"&gt;or&lt;/span&gt; &lt;span class="sh"&gt;""&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;strip&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;  
        &lt;span class="nf"&gt;return &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;_source_rank&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;m&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;source&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="ow"&gt;or&lt;/span&gt; &lt;span class="sh"&gt;""&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;lower&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;  
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nf"&gt;min&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;members&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;key&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;sort_key&lt;/span&gt;&lt;span class="p"&gt;)[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;company_name&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="nf"&gt;strip&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;A Crunchbase display name like &lt;code&gt;"Lovable"&lt;/code&gt; will always beat &lt;code&gt;"Lovable Technologies Inc."&lt;/code&gt; as the canonical — it's from a trusted source &lt;em&gt;and&lt;/em&gt; it's shorter. The legal variant ends up as an alias, which is exactly the right relationship.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Output of this stage:&lt;/strong&gt; &lt;code&gt;reconciled.json&lt;/code&gt; — 88 canonical clusters, alias mappings with WRatio scores, and CRM join metrics.&lt;/p&gt;

&lt;p&gt;That’s it, we’re all done with the fuzzy pipeline. Let’s see if that improved things.&lt;/p&gt;

&lt;h2&gt;
  
  
  Stage 5: Performing a Real CRM Join
&lt;/h2&gt;

&lt;p&gt;Our &lt;code&gt;sample_crm.json&lt;/code&gt; simulates the data you’d get from a real CRM — I simply researched legal names and known alternate spellings online for the companies I had, and put it in a JSON file. &lt;strong&gt;This gave me 138 rows representing the same 88 canonical companies.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Some companies had one exact-match entry — these are easy for us to handle. Others had three or four variants that I’d name like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"id"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"crm:necker_fintech_0"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"company_name"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Necker Fin Tech"&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="err"&gt;,&lt;/span&gt;&lt;span class="w"&gt;  
&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"id"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"crm:necker_fintech_1"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"company_name"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Necker FinTech Group"&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="err"&gt;,&lt;/span&gt;&lt;span class="w"&gt;  
&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"id"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"crm:necker_fintech_2"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"company_name"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Necker FinTech Holdings Inc."&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Our join logic in the&lt;code&gt;post_fuzzy_eval.py&lt;/code&gt; demo runs exact normalization first, then falls back to fuzzy — note how this is the same “cheap pass first” pattern as the cluster builder:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;post_fuzzy_eval.py&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Optional CRM join evaluation — exact vs fuzzy match rates (not part of core reconcile).&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;

&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;__future__&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;annotations&lt;/span&gt;

&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;pathlib&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Path&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;typing&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Any&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Dict&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;List&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Optional&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Sequence&lt;/span&gt;

&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;rapidfuzz&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;fuzz&lt;/span&gt;

&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;reconcile&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;Cluster&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;DEFAULT_THRESHOLD&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;_exact_groups&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;normalize_company_name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;_FUZZY_SCORER&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;fuzz&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;WRatio&lt;/span&gt;


&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;load_crm&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;path&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;Path&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;List&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;Dict&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Any&lt;/span&gt;&lt;span class="p"&gt;]]:&lt;/span&gt;
    &lt;span class="n"&gt;raw&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;loads&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;path&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;read_text&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;encoding&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;utf-8&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="nf"&gt;isinstance&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;raw&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nb"&gt;list&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="n"&gt;rows&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;raw&lt;/span&gt;
    &lt;span class="k"&gt;elif&lt;/span&gt; &lt;span class="nf"&gt;isinstance&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;raw&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nb"&gt;dict&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="ow"&gt;and&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;companies&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;raw&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;rows&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;raw&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;companies&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="k"&gt;else&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;raise&lt;/span&gt; &lt;span class="nc"&gt;ValueError&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;path&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;: expected list or {{&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;companies&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;: [...]}}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;out&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;List&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;Dict&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Any&lt;/span&gt;&lt;span class="p"&gt;]]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;row&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nf"&gt;enumerate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;rows&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="nf"&gt;isinstance&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;row&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nb"&gt;dict&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
            &lt;span class="k"&gt;continue&lt;/span&gt;
        &lt;span class="n"&gt;name&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;row&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;company_name&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="ow"&gt;or&lt;/span&gt; &lt;span class="sh"&gt;""&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;strip&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="k"&gt;continue&lt;/span&gt;
        &lt;span class="n"&gt;out&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="p"&gt;{&lt;/span&gt;
                &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;id&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;row&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;id&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="ow"&gt;or&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;crm:&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;company_name&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="p"&gt;}&lt;/span&gt;
        &lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;out&lt;/span&gt;


&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;_record_to_cluster_map&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;clusters&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;Sequence&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;Cluster&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;Dict&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;]:&lt;/span&gt;
    &lt;span class="n"&gt;out&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;Dict&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{}&lt;/span&gt;
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;cluster&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;clusters&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;m&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;cluster&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;members&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="n"&gt;out&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;m&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;id&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;cluster&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;canonical_id&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;out&lt;/span&gt;


&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;crm_to_canonical&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;crm_rows&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;Sequence&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;Dict&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Any&lt;/span&gt;&lt;span class="p"&gt;]],&lt;/span&gt;
    &lt;span class="n"&gt;clusters&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;Sequence&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;Cluster&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
    &lt;span class="n"&gt;threshold&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;float&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;Dict&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Optional&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;]]:&lt;/span&gt;
    &lt;span class="n"&gt;out&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;Dict&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Optional&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;]]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{}&lt;/span&gt;
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;row&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;crm_rows&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;key&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;str&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;row&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;id&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="ow"&gt;or&lt;/span&gt; &lt;span class="n"&gt;row&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;company_name&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
        &lt;span class="n"&gt;name&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;row&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;company_name&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="ow"&gt;or&lt;/span&gt; &lt;span class="sh"&gt;""&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;strip&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="n"&gt;out&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;key&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;
            &lt;span class="k"&gt;continue&lt;/span&gt;
        &lt;span class="n"&gt;norm&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;normalize_company_name&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;matched&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;Optional&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;
        &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;cluster&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;clusters&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="nf"&gt;any&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
                &lt;span class="nf"&gt;normalize_company_name&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;m&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;company_name&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="ow"&gt;or&lt;/span&gt; &lt;span class="sh"&gt;""&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="n"&gt;norm&lt;/span&gt;
                &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;m&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;cluster&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;members&lt;/span&gt;
            &lt;span class="p"&gt;):&lt;/span&gt;
                &lt;span class="n"&gt;matched&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;cluster&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;canonical_id&lt;/span&gt;
                &lt;span class="k"&gt;break&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="n"&gt;matched&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="n"&gt;best_score&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mf"&gt;0.0&lt;/span&gt;
            &lt;span class="n"&gt;best_id&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;Optional&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;
            &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;cluster&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;clusters&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
                &lt;span class="n"&gt;score&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;_FUZZY_SCORER&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;cluster&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;canonical_name&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
                &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;score&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;best_score&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
                    &lt;span class="n"&gt;best_score&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;score&lt;/span&gt;
                    &lt;span class="n"&gt;best_id&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;cluster&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;canonical_id&lt;/span&gt;
            &lt;span class="n"&gt;matched&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;best_id&lt;/span&gt; &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;best_score&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;=&lt;/span&gt; &lt;span class="n"&gt;threshold&lt;/span&gt; &lt;span class="k"&gt;else&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;
        &lt;span class="n"&gt;out&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;key&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;matched&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;out&lt;/span&gt;


&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;join_metrics&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;records&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;Sequence&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;Dict&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Any&lt;/span&gt;&lt;span class="p"&gt;]],&lt;/span&gt;
    &lt;span class="n"&gt;crm_rows&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;Sequence&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;Dict&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Any&lt;/span&gt;&lt;span class="p"&gt;]],&lt;/span&gt;
    &lt;span class="n"&gt;clusters&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;Sequence&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;Cluster&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
    &lt;span class="n"&gt;threshold&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;float&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;Dict&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Any&lt;/span&gt;&lt;span class="p"&gt;]:&lt;/span&gt;
    &lt;span class="n"&gt;record_to_cid&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;_record_to_cluster_map&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;clusters&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;crm_to_cid&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;crm_to_canonical&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;crm_rows&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;clusters&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;threshold&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="n"&gt;crm_norms&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="nf"&gt;normalize_company_name&lt;/span&gt;&lt;span class="p"&gt;((&lt;/span&gt;&lt;span class="n"&gt;r&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;company_name&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="ow"&gt;or&lt;/span&gt; &lt;span class="sh"&gt;""&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
        &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;r&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;crm_rows&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="nf"&gt;normalize_company_name&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;r&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;company_name&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="ow"&gt;or&lt;/span&gt; &lt;span class="sh"&gt;""&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="n"&gt;crm_mapped_cids&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="n"&gt;v&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;v&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;crm_to_cid&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;values&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;v&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;

    &lt;span class="n"&gt;scraped_exact&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;
    &lt;span class="n"&gt;scraped_fuzzy&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;r&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;records&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;norm&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;normalize_company_name&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;r&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;company_name&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="ow"&gt;or&lt;/span&gt; &lt;span class="sh"&gt;""&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;norm&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;crm_norms&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="n"&gt;scraped_exact&lt;/span&gt; &lt;span class="o"&gt;+=&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;
        &lt;span class="n"&gt;cid&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;record_to_cid&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;r&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;id&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;cid&lt;/span&gt; &lt;span class="ow"&gt;and&lt;/span&gt; &lt;span class="n"&gt;cid&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;crm_mapped_cids&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="n"&gt;scraped_fuzzy&lt;/span&gt; &lt;span class="o"&gt;+=&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;

    &lt;span class="n"&gt;crm_exact&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;
    &lt;span class="n"&gt;crm_fuzzy&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;
    &lt;span class="n"&gt;scraped_norms&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="nf"&gt;normalize_company_name&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;r&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;company_name&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="ow"&gt;or&lt;/span&gt; &lt;span class="sh"&gt;""&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;r&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;records&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="n"&gt;scraped_cids&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;set&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;record_to_cid&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;values&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;row&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;crm_rows&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;norm&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;normalize_company_name&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;row&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;company_name&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="ow"&gt;or&lt;/span&gt; &lt;span class="sh"&gt;""&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;norm&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;scraped_norms&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="n"&gt;crm_exact&lt;/span&gt; &lt;span class="o"&gt;+=&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;
        &lt;span class="n"&gt;cid_key&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;str&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;row&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;id&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="ow"&gt;or&lt;/span&gt; &lt;span class="n"&gt;row&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;company_name&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
        &lt;span class="n"&gt;cid&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;crm_to_cid&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;cid_key&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;cid&lt;/span&gt; &lt;span class="ow"&gt;and&lt;/span&gt; &lt;span class="n"&gt;cid&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;scraped_cids&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="n"&gt;crm_fuzzy&lt;/span&gt; &lt;span class="o"&gt;+=&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;

    &lt;span class="n"&gt;n_scraped&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;records&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="ow"&gt;or&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;
    &lt;span class="n"&gt;n_crm&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;crm_rows&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="ow"&gt;or&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;scraped_rows&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;records&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;crm_rows&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;crm_rows&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;canonical_clusters&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;clusters&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;exact_normalized_unique&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;_exact_groups&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;records&lt;/span&gt;&lt;span class="p"&gt;)),&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;scraped_exact_join_pct&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nf"&gt;round&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mf"&gt;100.0&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;scraped_exact&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="n"&gt;n_scraped&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;scraped_fuzzy_join_pct&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nf"&gt;round&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mf"&gt;100.0&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;scraped_fuzzy&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="n"&gt;n_scraped&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;crm_exact_join_pct&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nf"&gt;round&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mf"&gt;100.0&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;crm_exact&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="n"&gt;n_crm&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;crm_fuzzy_join_pct&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nf"&gt;round&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mf"&gt;100.0&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;crm_fuzzy&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="n"&gt;n_crm&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;


&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;eval_crm_join&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;records&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;Sequence&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;Dict&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Any&lt;/span&gt;&lt;span class="p"&gt;]],&lt;/span&gt;
    &lt;span class="n"&gt;clusters&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;Sequence&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;Cluster&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
    &lt;span class="n"&gt;crm_path&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;Path&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;threshold&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;float&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;DEFAULT_THRESHOLD&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;Dict&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Any&lt;/span&gt;&lt;span class="p"&gt;]:&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Load CRM file and compute join metrics against existing clusters.&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="n"&gt;crm_rows&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;load_crm&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;crm_path&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nf"&gt;join_metrics&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;records&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;crm_rows&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;clusters&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;threshold&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Here’s how we measure this JOIN operation&lt;/strong&gt; (&lt;code&gt;join_metrics&lt;/code&gt; in &lt;code&gt;post_fuzzy_eval.py&lt;/code&gt;):&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  &lt;strong&gt;Scraped data to CRM, exact:&lt;/strong&gt; the scraped row’s &lt;em&gt;normalized&lt;/em&gt; &lt;code&gt;company_name&lt;/code&gt; appears in the set of normalized CRM names.&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Scraped data to CRM, fuzzy:&lt;/strong&gt; the scraped row’s canonical cluster id is also reached by at least one CRM row (exact norm match on a cluster member, or WRatio ≥ threshold on the cluster’s canonical name).&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;CRM to scraped data:&lt;/strong&gt; the symmetric checks from the CRM row’s perspective.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Post-Fuzzy Matching Results
&lt;/h2&gt;

&lt;p&gt;So how did we do?&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Question&lt;/th&gt;
&lt;th&gt;Exact match&lt;/th&gt;
&lt;th&gt;Fuzzy (WRatio ≥ 90)&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Of &lt;strong&gt;96 scraped rows&lt;/strong&gt;, how many link to a CRM row?&lt;/td&gt;
&lt;td&gt;58.3% (56 rows)&lt;/td&gt;
&lt;td&gt;
&lt;strong&gt;100%&lt;/strong&gt; (96 rows)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Of &lt;strong&gt;138 CRM rows&lt;/strong&gt;, how many link back to scraped data?&lt;/td&gt;
&lt;td&gt;34.8% (48 rows)&lt;/td&gt;
&lt;td&gt;
&lt;strong&gt;100%&lt;/strong&gt; (138 rows)&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The 58.3% exact baseline isn’t actually bad — over half of raw hub titles normalize to a CRM string exactly. The other 41.7% however, absolutely need fuzzy matching via WRatio because the CRM holds legal or alternate spellings (&lt;code&gt;Necker FinTech Holdings Inc.&lt;/code&gt; vs hub &lt;code&gt;Necker FinTech&lt;/code&gt;, etc.) that no amount of lowercasing or other normalization will save you from.&lt;/p&gt;

&lt;p&gt;The fuzzy pass closes the gap on this dataset at WRatio threshold 90. &lt;strong&gt;WRatio is strict enough to avoid merging unrelated names while still picking up suffix and token drift&lt;/strong&gt; — which is fantastic — just what we want!&lt;/p&gt;

&lt;h2&gt;
  
  
  Running It
&lt;/h2&gt;

&lt;p&gt;Commands below assume Python 3.10+ and a venv. All of this runs locally; the only network calls are to Bright Data during the initial fetch.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Install deps  &lt;/span&gt;
pip &lt;span class="nb"&gt;install &lt;/span&gt;rapidfuzz requests python-dotenv  

&lt;span class="c"&gt;# Fetch all 4 hubs (costs API credits)  &lt;/span&gt;
python fetch_hubs.py  

&lt;span class="c"&gt;# Already have cached responses? Re-parse for free  &lt;/span&gt;
python fetch_hubs.py &lt;span class="nt"&gt;--parse-only&lt;/span&gt;  

&lt;span class="c"&gt;# Extract + reconcile + print CRM metrics (default: both stages)  &lt;/span&gt;
python run_fuzzy.py  

&lt;span class="c"&gt;# Regenerate sample_crm.json from raw_records (optional)  &lt;/span&gt;
python build_sample_crm.py  

&lt;span class="c"&gt;# Tune the threshold (try 85 for more aggressive merging)  &lt;/span&gt;
python run_fuzzy.py &lt;span class="nt"&gt;--threshold&lt;/span&gt; 85  

&lt;span class="c"&gt;# Run individual stages  &lt;/span&gt;
python run_fuzzy.py &lt;span class="nt"&gt;--extract&lt;/span&gt;  
python run_fuzzy.py &lt;span class="nt"&gt;--reconcile&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Sample CLI output after a full run:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;wrote data/raw_records.json  
  records: 96  
  category artificial_intelligence: 24  
  category cybersecurity: 24  
  category fintech: 26  
  category saas: 22  

wrote data/reconciled.json  

&lt;span class="nt"&gt;--&lt;/span&gt; &lt;span class="nb"&gt;join &lt;/span&gt;metrics &lt;span class="o"&gt;(&lt;/span&gt;CRM&lt;span class="o"&gt;)&lt;/span&gt; &lt;span class="nt"&gt;--&lt;/span&gt;  
  scraped rows: 96 | exact-normalized unique: 88 | canonical clusters: 88  
  scraped -&amp;gt; CRM  exact: 58.3% | fuzzy: 100.0%  
  CRM -&amp;gt; scraped   exact: 34.8% | fuzzy: 100.0%  

&lt;span class="nt"&gt;--&lt;/span&gt; top 10 canonicals &lt;span class="o"&gt;(&lt;/span&gt;by &lt;span class="nb"&gt;alias &lt;/span&gt;count&lt;span class="o"&gt;)&lt;/span&gt; &lt;span class="nt"&gt;--&lt;/span&gt;  
  Callaghan Innovation  &lt;span class="o"&gt;(&lt;/span&gt;4 aliases, sources: crunchbase_hub&lt;span class="o"&gt;)&lt;/span&gt;  
  EISMEA  &lt;span class="o"&gt;(&lt;/span&gt;4 aliases, sources: crunchbase_hub&lt;span class="o"&gt;)&lt;/span&gt;  
  PayTic  &lt;span class="o"&gt;(&lt;/span&gt;2 aliases, sources: crunchbase_hub&lt;span class="o"&gt;)&lt;/span&gt;  
  SixThirty  &lt;span class="o"&gt;(&lt;/span&gt;2 aliases, sources: crunchbase_hub&lt;span class="o"&gt;)&lt;/span&gt;  
  ...more
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  A Quick Note: The Review Queue
&lt;/h2&gt;

&lt;p&gt;I’ve also added a diagnostic queue into the pipeline for low-confidence alias assignments — records whose WRatio against their cluster’s canonical falls &lt;em&gt;below&lt;/em&gt; the threshold. This will show us merges that look suspicious and deserve a human eye:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# reconcile.py  
&lt;/span&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;review_queue&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;  
    &lt;span class="n"&gt;records&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;Sequence&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;Dict&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Any&lt;/span&gt;&lt;span class="p"&gt;]],&lt;/span&gt;  
    &lt;span class="n"&gt;clusters&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;Sequence&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;Cluster&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;  
    &lt;span class="n"&gt;threshold&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;float&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  
    &lt;span class="n"&gt;limit&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;int&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;8&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  
&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;List&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;Tuple&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;float&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;]]:&lt;/span&gt;  
    &lt;span class="n"&gt;rid_to_cluster&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="n"&gt;m&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;id&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]:&lt;/span&gt; &lt;span class="n"&gt;c&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;c&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;clusters&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;m&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;c&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;members&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;  
    &lt;span class="n"&gt;lows&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;  
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;r&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;records&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;  
        &lt;span class="n"&gt;c&lt;/span&gt;     &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;rid_to_cluster&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;r&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;id&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;  
        &lt;span class="n"&gt;name&lt;/span&gt;  &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;r&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;company_name&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="ow"&gt;or&lt;/span&gt; &lt;span class="sh"&gt;""&lt;/span&gt;  
        &lt;span class="n"&gt;score&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;_FUZZY_SCORER&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;c&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;canonical_name&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;score&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="n"&gt;threshold&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;  
            &lt;span class="n"&gt;lows&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;((&lt;/span&gt;&lt;span class="n"&gt;score&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;c&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;canonical_name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;c&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;canonical_id&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;  
    &lt;span class="n"&gt;lows&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;sort&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;key&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="k"&gt;lambda&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;  
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;lows&lt;/span&gt;&lt;span class="p"&gt;[:&lt;/span&gt;&lt;span class="n"&gt;limit&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;In production this would feed a human-review UI or write to a &lt;code&gt;needs_review&lt;/code&gt; table. Here it just prints to stdout — but my point stands: &lt;strong&gt;fuzzy matching isn't a black box. You can always surface the borderline decisions and let a human confirm them.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;That’s everything, thanks for reading!&lt;/p&gt;

&lt;h2&gt;
  
  
  Frequently Asked Questions
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Q: Do you need ML or vector embeddings for company name matching?&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
&lt;strong&gt;A:&lt;/strong&gt; No, not for stylistic drift (legal suffixes, spacing, punctuation). Our pipeline uses RapidFuzz &lt;a href="https://rapidfuzz.github.io/RapidFuzz/Usage/fuzz.html#wratio" rel="noopener noreferrer"&gt;fuzz.WRatio&lt;/a&gt; —which is a rule-based string similarity, not a trained model.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Q: What similarity threshold should you use with WRatio?&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
&lt;strong&gt;A:&lt;/strong&gt; Start at &lt;strong&gt;WRatio threshold 90&lt;/strong&gt;. At 90, unrelated pairs like &lt;code&gt;Stripe&lt;/code&gt; vs &lt;code&gt;Climate Corp&lt;/code&gt; score 45.0 and stay out, while suffix/spacing variants like &lt;code&gt;Necker FinTech&lt;/code&gt; vs &lt;code&gt;Necker FinTech Holdings Inc.&lt;/code&gt; score 90.0+ and merge. See the &lt;a href="https://rapidfuzz.github.io/RapidFuzz/Usage/fuzz.html#wratio" rel="noopener noreferrer"&gt;score_cutoff&lt;/a&gt; parameter in the docs if you want early-exit optimization.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Q: When does fuzzy matching fail for company names?&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
&lt;strong&gt;A:&lt;/strong&gt; When names share almost no tokens — e.g. brand &lt;code&gt;Investing.com&lt;/code&gt; vs legal entity &lt;code&gt;Fusion Media Limited&lt;/code&gt; (WRatio 30.0). Use a lookup table, domain, LEI, or enrichment API instead.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Q: Why not join on&lt;/strong&gt; &lt;code&gt;company_name&lt;/code&gt; &lt;strong&gt;in SQL?&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
&lt;strong&gt;A:&lt;/strong&gt; Because raw name joins will often miss legal variants. Resolve each row to a &lt;code&gt;canonical_id&lt;/code&gt; in Python, load clusters into Postgres, and only then can you safely do a &lt;code&gt;JOIN ... USING (canonical_id)&lt;/code&gt;.&lt;/p&gt;
&lt;h2&gt;
  
  
  Caveats
&lt;/h2&gt;

&lt;p&gt;I should clear some things up about this tutorial.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  &lt;strong&gt;The sample is leaderboard-only.&lt;/strong&gt; These are the &lt;strong&gt;top-ranked companies on each hub&lt;/strong&gt;, not a random draw from Crunchbase’s full 6,000+ company lists. Leaderboards are always curated. Noisier source data would push the exact-match baseline &lt;em&gt;down&lt;/em&gt;, making the fuzzy lift even bigger.&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;The CRM examples are handmade.&lt;/strong&gt; I researched exact hub titles plus known legal variants like &lt;code&gt;ARYZE ApS&lt;/code&gt;, &lt;code&gt;Count Finance LTD&lt;/code&gt;, &lt;code&gt;PANTA Group&lt;/code&gt;. A real CRM would actually be dirtier: misspellings, stale names, entries from multiple import sources with inconsistent formatting. In practice the fuzzy pass may not hit 100%, but it'll still get you much closer than exact matching does.&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Some display names need a lookup table, not fuzzy strings.&lt;/strong&gt; Pairs like &lt;code&gt;Investing.com&lt;/code&gt; / &lt;code&gt;Fusion Media Limited&lt;/code&gt; or &lt;code&gt;Lyrie.ai&lt;/code&gt; / &lt;code&gt;OTT Cybersecurity Inc.&lt;/code&gt; share almost no tokens, so WRatio stays low and that's &lt;em&gt;correct behavior&lt;/em&gt;. For irreconcilable aliases like that you still want &lt;a href="https://www.gleif.org/en" rel="noopener noreferrer"&gt;GLEIF&lt;/a&gt;, &lt;a href="https://clearbit.com/" rel="noopener noreferrer"&gt;Clearbit&lt;/a&gt;, or simply a maintained &lt;code&gt;slug → legal_name&lt;/code&gt; map. Fuzzy matching handles stylistic drift on the &lt;em&gt;same&lt;/em&gt; name; it can’t handle unrelated brand vs legal entity pairs.&lt;/li&gt;
&lt;/ul&gt;
&lt;h2&gt;
  
  
  Use Cases for Entity Resolution in Python
&lt;/h2&gt;

&lt;p&gt;The normalize → exact-group → fuzzy-cluster → CRM join pattern I’ve described here applies directly to:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  &lt;strong&gt;CRM deduplication:&lt;/strong&gt; merge &lt;code&gt;Acme Corp&lt;/code&gt;, &lt;code&gt;Acme Corporation&lt;/code&gt;, and &lt;code&gt;ACME&lt;/code&gt; before they become three separate accounts in your sales pipeline.&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Lead enrichment:&lt;/strong&gt; match inbound form submissions against existing accounts without requiring exact name entry from the user.&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;M&amp;amp;A / investor research:&lt;/strong&gt; reconcile company lists from multiple data vendors (Crunchbase, PitchBook, LinkedIn) that use different display names for the same entity.&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Changelog and release tracking:&lt;/strong&gt; match product names across sources (GitHub repo name, npm package name, marketing site name) that follow different conventions.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;That WRatio threshold is something you should play around with.&lt;/strong&gt; At WRatio threshold 90 (the default in this pipeline), clearly unrelated pairs stay out (&lt;code&gt;Stripe&lt;/code&gt; vs &lt;code&gt;Climate Corp&lt;/code&gt; scores 45.0) while suffix and spacing drift gets in. Drop to 80 and you'll catch more variants but start seeing false positives. This will differ based on your dataset, obviously, and the review queue is your safety net either way.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Next step in production:&lt;/strong&gt; load &lt;code&gt;reconciled.json&lt;/code&gt; into Postgres, resolve each CRM row to a &lt;code&gt;canonical_id&lt;/code&gt; (same logic as &lt;code&gt;_crm_to_canonical&lt;/code&gt; in Python), then join on that key instead of &lt;code&gt;company_name&lt;/code&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="c1"&gt;-- Tables loaded from pipeline output (reconciled.json + raw_records + sample_crm)  &lt;/span&gt;
&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;TABLE&lt;/span&gt; &lt;span class="n"&gt;canonicals&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;  
  &lt;span class="n"&gt;canonical_id&lt;/span&gt;   &lt;span class="nb"&gt;TEXT&lt;/span&gt; &lt;span class="k"&gt;PRIMARY&lt;/span&gt; &lt;span class="k"&gt;KEY&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  
  &lt;span class="n"&gt;canonical_name&lt;/span&gt; &lt;span class="nb"&gt;TEXT&lt;/span&gt; &lt;span class="k"&gt;NOT&lt;/span&gt; &lt;span class="k"&gt;NULL&lt;/span&gt;  
&lt;span class="p"&gt;);&lt;/span&gt;  

&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;TABLE&lt;/span&gt; &lt;span class="n"&gt;entity_aliases&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;  
  &lt;span class="n"&gt;canonical_id&lt;/span&gt; &lt;span class="nb"&gt;TEXT&lt;/span&gt; &lt;span class="k"&gt;NOT&lt;/span&gt; &lt;span class="k"&gt;NULL&lt;/span&gt; &lt;span class="k"&gt;REFERENCES&lt;/span&gt; &lt;span class="n"&gt;canonicals&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;canonical_id&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;  
  &lt;span class="n"&gt;alias_name&lt;/span&gt;   &lt;span class="nb"&gt;TEXT&lt;/span&gt; &lt;span class="k"&gt;NOT&lt;/span&gt; &lt;span class="k"&gt;NULL&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  
  &lt;span class="k"&gt;source&lt;/span&gt;       &lt;span class="nb"&gt;TEXT&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  
  &lt;span class="n"&gt;match_score&lt;/span&gt;  &lt;span class="nb"&gt;NUMERIC&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  
  &lt;span class="k"&gt;PRIMARY&lt;/span&gt; &lt;span class="k"&gt;KEY&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;canonical_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;alias_name&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  
&lt;span class="p"&gt;);&lt;/span&gt;  

&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;TABLE&lt;/span&gt; &lt;span class="n"&gt;hub_scrape&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;  
  &lt;span class="n"&gt;id&lt;/span&gt;            &lt;span class="nb"&gt;TEXT&lt;/span&gt; &lt;span class="k"&gt;PRIMARY&lt;/span&gt; &lt;span class="k"&gt;KEY&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  
  &lt;span class="n"&gt;company_name&lt;/span&gt;  &lt;span class="nb"&gt;TEXT&lt;/span&gt; &lt;span class="k"&gt;NOT&lt;/span&gt; &lt;span class="k"&gt;NULL&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  
  &lt;span class="n"&gt;canonical_id&lt;/span&gt;  &lt;span class="nb"&gt;TEXT&lt;/span&gt; &lt;span class="k"&gt;REFERENCES&lt;/span&gt; &lt;span class="n"&gt;canonicals&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;canonical_id&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;  
  &lt;span class="n"&gt;category&lt;/span&gt;      &lt;span class="nb"&gt;TEXT&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  
  &lt;span class="n"&gt;url&lt;/span&gt;           &lt;span class="nb"&gt;TEXT&lt;/span&gt;  
&lt;span class="p"&gt;);&lt;/span&gt;  

&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;TABLE&lt;/span&gt; &lt;span class="n"&gt;crm_accounts&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;  
  &lt;span class="n"&gt;id&lt;/span&gt;            &lt;span class="nb"&gt;TEXT&lt;/span&gt; &lt;span class="k"&gt;PRIMARY&lt;/span&gt; &lt;span class="k"&gt;KEY&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  
  &lt;span class="n"&gt;company_name&lt;/span&gt;  &lt;span class="nb"&gt;TEXT&lt;/span&gt; &lt;span class="k"&gt;NOT&lt;/span&gt; &lt;span class="k"&gt;NULL&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  
  &lt;span class="n"&gt;canonical_id&lt;/span&gt;  &lt;span class="nb"&gt;TEXT&lt;/span&gt; &lt;span class="k"&gt;REFERENCES&lt;/span&gt; &lt;span class="n"&gt;canonicals&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;canonical_id&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  &lt;span class="c1"&gt;-- from Python CRM mapping  &lt;/span&gt;
&lt;span class="p"&gt;);&lt;/span&gt;  

&lt;span class="c1"&gt;-- Broken: join on raw company_name  &lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="k"&gt;COUNT&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;matched_rows&lt;/span&gt;  
&lt;span class="k"&gt;FROM&lt;/span&gt;   &lt;span class="n"&gt;hub_scrape&lt;/span&gt; &lt;span class="n"&gt;h&lt;/span&gt;  
&lt;span class="k"&gt;JOIN&lt;/span&gt;   &lt;span class="n"&gt;crm_accounts&lt;/span&gt; &lt;span class="k"&gt;c&lt;/span&gt; &lt;span class="k"&gt;ON&lt;/span&gt; &lt;span class="k"&gt;c&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;company_name&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;h&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;company_name&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;  
&lt;span class="c1"&gt;-- 56 / 96 (~58%) on this dataset  &lt;/span&gt;

&lt;span class="c1"&gt;-- Fixed: join on canonical_id (assigned during ETL from reconciled.json)  &lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;h&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;company_name&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;hub_name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  
       &lt;span class="k"&gt;c&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;company_name&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;crm_name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  
       &lt;span class="n"&gt;h&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;canonical_id&lt;/span&gt;  
&lt;span class="k"&gt;FROM&lt;/span&gt;   &lt;span class="n"&gt;hub_scrape&lt;/span&gt; &lt;span class="n"&gt;h&lt;/span&gt;  
&lt;span class="k"&gt;JOIN&lt;/span&gt;   &lt;span class="n"&gt;crm_accounts&lt;/span&gt; &lt;span class="k"&gt;c&lt;/span&gt; &lt;span class="k"&gt;USING&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;canonical_id&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  
&lt;span class="k"&gt;WHERE&lt;/span&gt;  &lt;span class="n"&gt;h&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;company_name&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;'Necker FinTech'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;  
&lt;span class="c1"&gt;-- hub_name: Necker FinTech  &lt;/span&gt;
&lt;span class="c1"&gt;-- crm_name: Necker FinTech Holdings Inc.  (or Necker Fin Tech, etc.)  &lt;/span&gt;
&lt;span class="c1"&gt;-- canonical_id: c_necker_fintech&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ul&gt;
&lt;li&gt;  Load &lt;code&gt;canonicals&lt;/code&gt; and &lt;code&gt;entity_aliases&lt;/code&gt; from &lt;code&gt;reconciled.json&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;  Set &lt;code&gt;hub_scrape.canonical_id&lt;/code&gt; from the &lt;code&gt;aliases&lt;/code&gt; array (&lt;code&gt;id&lt;/code&gt; → &lt;code&gt;canonical_id&lt;/code&gt;).&lt;/li&gt;
&lt;li&gt;  Set &lt;code&gt;crm_accounts.canonical_id&lt;/code&gt; with the same &lt;code&gt;_crm_to_canonical&lt;/code&gt; logic you already run in Python (exact norm match, then WRatio ≥ 90).&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;After that, SQL stays a plain equi-join — &lt;strong&gt;fuzzy matching happens once upstream, and not inside the database.&lt;/strong&gt; I won’t cover that though; the &lt;em&gt;pattern&lt;/em&gt; is the point, not the warehouse you choose to use.&lt;/p&gt;

&lt;p&gt;None of this is new — entity resolution is a well-studied problem with industrial-strength tools (Dedupe, &lt;a href="https://moj-analytical-services.github.io/splink/demos/examples/duckdb/deterministic_dedupe.html" rel="noopener noreferrer"&gt;Splink&lt;/a&gt;, various record linkage toolkits) when you need them. &lt;strong&gt;But for the common case of &lt;em&gt;“I have two lists of company names and I need to join them,&lt;/em&gt;” you really don’t.&lt;/strong&gt; A normalization pass and a WRatio threshold gets you most of the way there in an afternoon, in pure Python, with zero infrastructure.&lt;/p&gt;

</description>
      <category>programming</category>
      <category>tutorial</category>
      <category>datascience</category>
      <category>python</category>
    </item>
    <item>
      <title>5 Production Stacks for Live Data Ingestion at Scale (Without Getting Blocked)</title>
      <dc:creator>Prithwish Nath</dc:creator>
      <pubDate>Tue, 19 May 2026 09:53:42 +0000</pubDate>
      <link>https://dev.to/prithwish_nath/5-production-stacks-for-live-data-ingestion-at-scale-without-getting-blocked-25k9</link>
      <guid>https://dev.to/prithwish_nath/5-production-stacks-for-live-data-ingestion-at-scale-without-getting-blocked-25k9</guid>
      <description>&lt;p&gt;&lt;strong&gt;TL;DR:&lt;/strong&gt; Most teams over-engineer data ingestion. They use Kafka before they’ve hit their first rate limit, or Playwright before they’ve checked the network tab. This guide shows five production-tested stacks for live data ingestion from minimal fetch + cron up to LLM agents calling Model Context Protocol (MCP) tools — with the specific failure mode each one solves.&lt;/p&gt;

&lt;h2&gt;
  
  
  What You’ll Learn
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;  When a plain fetch loop with a cron job is truly all you need&lt;/li&gt;
&lt;li&gt;  How &lt;a href="https://get.brightdata.com/blog-never-get-blocked-again5132?utm_content=five_production_stacks_for_live_data_ingestion_at_scale_without_getting_blocked" rel="noopener noreferrer"&gt;an agent calling MCP tools&lt;/a&gt; adapts where other methods can’t&lt;/li&gt;
&lt;li&gt;  The &lt;a href="https://github.com/cloudflare/workers-sdk" rel="noopener noreferrer"&gt;serverless&lt;/a&gt; + &lt;a href="https://developers.cloudflare.com/r2/" rel="noopener noreferrer"&gt;object storage&lt;/a&gt; pattern that handles high fan-out volume&lt;/li&gt;
&lt;li&gt;  How to add retries, idempotency, and replay to any I/O layer without rewriting it&lt;/li&gt;
&lt;li&gt;  When (and &lt;em&gt;only&lt;/em&gt; when) to reach for a headless browser&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The right ingestion stack is the one with the fewest moving parts that still handles your specific failure mode. Not someone else’s failure mode. &lt;em&gt;Yours&lt;/em&gt;.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Stack&lt;/th&gt;
&lt;th&gt;Failure mode it solves&lt;/th&gt;
&lt;th&gt;Vendor surface&lt;/th&gt;
&lt;th&gt;Complexity&lt;/th&gt;
&lt;th&gt;Cost floor&lt;/th&gt;
&lt;th&gt;Real ceiling&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;1. Bun/Node fetch + allowlist&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Stable public APIs, no anti-bot story&lt;/td&gt;
&lt;td&gt;None&lt;/td&gt;
&lt;td&gt;Minimal&lt;/td&gt;
&lt;td&gt;Free&lt;/td&gt;
&lt;td&gt;Upstream rate limits, not your hardware. Most public APIs cap at 60–6,000 req/min.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;2. Agent + Bright Data MCP&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;High-complexity targets: anti-bot, JS rendering, multi-step flows, adaptive extraction — at moderate volume&lt;/td&gt;
&lt;td&gt;Bright Data (MCP tooling w/ free tier)&lt;/td&gt;
&lt;td&gt;Low&lt;/td&gt;
&lt;td&gt;Bright Data free tier&lt;/td&gt;
&lt;td&gt;Rapid: 5K req/mo free. Pro tools and &lt;code&gt;web_data_*&lt;/code&gt; extractors bill separately — check Bright Data pricing before you rely on them.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;3. Serverless cron → object storage&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Bursty ingest, fan-out, raw payload durability&lt;/td&gt;
&lt;td&gt;Cloud provider only&lt;/td&gt;
&lt;td&gt;Low-medium&lt;/td&gt;
&lt;td&gt;~$0&lt;/td&gt;
&lt;td&gt;Sub-request limit per invocation (50 Free / 1K Paid on CF Workers). Bypass with Queues.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;4. Durable workflow engine + swappable I/O&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Flaky upstreams, retries, idempotency, replay&lt;/td&gt;
&lt;td&gt;Workflow engine + optional proxy&lt;/td&gt;
&lt;td&gt;Medium&lt;/td&gt;
&lt;td&gt;Varies by engine&lt;/td&gt;
&lt;td&gt;Concurrency, history size, replay behavior, and hosted usage limits vary — model fan-out before you scale.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;5. Minimal Playwright headless&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;JS-rendered pages, SPAs, click-to-render flows&lt;/td&gt;
&lt;td&gt;Optional proxy vendor&lt;/td&gt;
&lt;td&gt;Medium-high&lt;/td&gt;
&lt;td&gt;Compute cost&lt;/td&gt;
&lt;td&gt;Memory-bound. Each Chromium context ~200–500 MB. Parallelize based on your instance's RAM, not intuition.&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h2&gt;
  
  
  1. Bun/Node &lt;code&gt;fetch&lt;/code&gt; + Allowlist — The Boring Baseline That Works
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Documentation:&lt;/strong&gt; &lt;a href="https://developer.mozilla.org/en-US/docs/Web/API/Fetch_API" rel="noopener noreferrer"&gt;MDN: Fetch API&lt;/a&gt; · &lt;a href="https://bun.sh/docs/api/fetch" rel="noopener noreferrer"&gt;Bun HTTP&lt;/a&gt;&lt;br&gt;&lt;br&gt;
&lt;strong&gt;License:&lt;/strong&gt; N/A&lt;br&gt;&lt;br&gt;
&lt;strong&gt;Free Tier:&lt;/strong&gt; N/A&lt;br&gt;&lt;br&gt;
&lt;strong&gt;Best for:&lt;/strong&gt; Stable public APIs, RSS feeds, open datasets, internal tooling that hits your own endpoints, anything where the anti-bot story is &lt;em&gt;there is no anti-bot story.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://bun.com/?source=post_page-----63f63ac4667f---------------------------------------" rel="noopener noreferrer"&gt;Bun - A fast all-in-one JavaScript runtimeBundle, install, and run JavaScript &amp;amp; TypeScript - all in Bun. Bun is a new JavaScript runtime with a native bundler…bun.com&lt;/a&gt;&lt;/p&gt;
&lt;h3&gt;
  
  
  What is the fetch + cron stack?
&lt;/h3&gt;

&lt;p&gt;Plain &lt;code&gt;fetch&lt;/code&gt; against a list of known-good URLs, on a timer, writing results to flat files or SQLite. No framework, queue, or service dependencies. A script you can read start to finish in five minutes.&lt;/p&gt;
&lt;h3&gt;
  
  
  Why use fetch + cron for live data ingestion?
&lt;/h3&gt;

&lt;p&gt;I want to be honest about how often this is &lt;em&gt;all&lt;/em&gt; you need. If you’re hitting stable public APIs — government data portals, RSS feeds, well-behaved JSON endpoints, open datasets that publish on a schedule — there is no failure mode that demands anything more than this.&lt;/p&gt;

&lt;p&gt;This is the stack everything else is measured against. Before you add a workflow engine like Temporal/Trigger, agents, a proxy, or &lt;em&gt;anything&lt;/em&gt; else — see if this one is enough. It usually is.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="c1"&gt;// bun run ingest.ts   &lt;/span&gt;

&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nx"&gt;Database&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;bun:sqlite&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;  

&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;ALLOWLIST&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;  
  &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;https://api.github.com/repos/vercel/next.js/releases&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  
  &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;https://registry.npmjs.org/react&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  
  &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;https://data.gov/some-dataset.json&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  
  &lt;span class="c1"&gt;// whatever else  &lt;/span&gt;
&lt;span class="p"&gt;];&lt;/span&gt;  

&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;db&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;Database&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;./data.sqlite&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;  

&lt;span class="nx"&gt;db&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;run&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;`  
  CREATE TABLE IF NOT EXISTS raw_payloads (  
    id INTEGER PRIMARY KEY AUTOINCREMENT,  
    url TEXT,  
    fetched_at TEXT,  
    payload TEXT  
  )  
`&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;  

&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="kd"&gt;function&lt;/span&gt; &lt;span class="nf"&gt;ingest&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;  
  &lt;span class="k"&gt;for &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;url&lt;/span&gt; &lt;span class="k"&gt;of&lt;/span&gt; &lt;span class="nx"&gt;ALLOWLIST&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;  
    &lt;span class="k"&gt;try&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;  
      &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;res&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nf"&gt;fetch&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;url&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;  
        &lt;span class="na"&gt;headers&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;User-Agent&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;my-ingest-bot/1.0&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt; &lt;span class="p"&gt;},&lt;/span&gt;  
        &lt;span class="na"&gt;signal&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;AbortSignal&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;timeout&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="nx"&gt;_000&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;  
      &lt;span class="p"&gt;});&lt;/span&gt;  

      &lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;!&lt;/span&gt;&lt;span class="nx"&gt;res&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;ok&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;  
        &lt;span class="nx"&gt;console&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;warn&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;`[&lt;/span&gt;&lt;span class="p"&gt;${&lt;/span&gt;&lt;span class="nx"&gt;res&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;status&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;] &lt;/span&gt;&lt;span class="p"&gt;${&lt;/span&gt;&lt;span class="nx"&gt;url&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;`&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;  
        &lt;span class="k"&gt;continue&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;  
      &lt;span class="p"&gt;}&lt;/span&gt;  

      &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;payload&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;res&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;text&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;  

      &lt;span class="nx"&gt;db&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;run&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;  
        &lt;span class="s2"&gt;`INSERT INTO raw_payloads (url, fetched_at, payload) VALUES (?, ?, ?)`&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  
        &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nx"&gt;url&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;Date&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="nf"&gt;toISOString&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt; &lt;span class="nx"&gt;payload&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;  
      &lt;span class="p"&gt;);&lt;/span&gt;  
    &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="k"&gt;catch &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;err&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;  
      &lt;span class="nx"&gt;console&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;error&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;`Failed: &lt;/span&gt;&lt;span class="p"&gt;${&lt;/span&gt;&lt;span class="nx"&gt;url&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;`&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;err&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;  
    &lt;span class="p"&gt;}&lt;/span&gt;  
  &lt;span class="p"&gt;}&lt;/span&gt;  
&lt;span class="p"&gt;}&lt;/span&gt;  

&lt;span class="nf"&gt;ingest&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Run it with a system cron, a GitHub Actions schedule, or a simple &lt;code&gt;setInterval&lt;/code&gt;. That's literally the whole stack.&lt;/p&gt;

&lt;h3&gt;
  
  
  How to Handle Pagination
&lt;/h3&gt;

&lt;p&gt;Most real APIs paginate. A very simple, intuitive pattern is to keep one loop, store each payload, and stop only when the API gives no next pointer.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="kd"&gt;type&lt;/span&gt; &lt;span class="nx"&gt;Page&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nx"&gt;items&lt;/span&gt;&lt;span class="p"&gt;?:&lt;/span&gt; &lt;span class="nx"&gt;unknown&lt;/span&gt;&lt;span class="p"&gt;[];&lt;/span&gt; &lt;span class="nl"&gt;next_cursor&lt;/span&gt;&lt;span class="p"&gt;?:&lt;/span&gt; &lt;span class="kr"&gt;string&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="nl"&gt;next_url&lt;/span&gt;&lt;span class="p"&gt;?:&lt;/span&gt; &lt;span class="kr"&gt;string&lt;/span&gt; &lt;span class="p"&gt;};&lt;/span&gt;  

&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="kd"&gt;function&lt;/span&gt; &lt;span class="nf"&gt;ingestPaginated&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;baseUrl&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kr"&gt;string&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;  
  &lt;span class="kd"&gt;let&lt;/span&gt; &lt;span class="nx"&gt;url&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kr"&gt;string&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="kc"&gt;null&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;baseUrl&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;  
  &lt;span class="kd"&gt;let&lt;/span&gt; &lt;span class="nx"&gt;page&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;  

  &lt;span class="k"&gt;while &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;url&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;  
    &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;res&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nf"&gt;fetch&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;url&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;  
      &lt;span class="na"&gt;headers&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;User-Agent&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;my-ingest-bot/1.0&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt; &lt;span class="p"&gt;},&lt;/span&gt;  
      &lt;span class="na"&gt;signal&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;AbortSignal&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;timeout&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="nx"&gt;_000&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;  
    &lt;span class="p"&gt;});&lt;/span&gt;  
    &lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;!&lt;/span&gt;&lt;span class="nx"&gt;res&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;ok&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;  
      &lt;span class="nx"&gt;console&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;warn&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;`[&lt;/span&gt;&lt;span class="p"&gt;${&lt;/span&gt;&lt;span class="nx"&gt;res&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;status&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;] page &lt;/span&gt;&lt;span class="p"&gt;${&lt;/span&gt;&lt;span class="nx"&gt;page&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="s2"&gt; — stopping`&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;  
      &lt;span class="k"&gt;break&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;  
    &lt;span class="p"&gt;}&lt;/span&gt;  

    &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;json&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;res&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;json&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="nx"&gt;Page&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;  

    &lt;span class="nx"&gt;db&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;run&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;  
      &lt;span class="s2"&gt;`INSERT INTO raw_payloads (url, fetched_at, payload) VALUES (?, ?, ?)`&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  
      &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nx"&gt;url&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;Date&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="nf"&gt;toISOString&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt; &lt;span class="nx"&gt;JSON&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;stringify&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;json&lt;/span&gt;&lt;span class="p"&gt;)],&lt;/span&gt;  
    &lt;span class="p"&gt;);&lt;/span&gt;  

    &lt;span class="nx"&gt;url&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt;  
      &lt;span class="nx"&gt;json&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;next_url&lt;/span&gt; &lt;span class="o"&gt;??&lt;/span&gt;  
      &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;json&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;next_cursor&lt;/span&gt; &lt;span class="p"&gt;?&lt;/span&gt; &lt;span class="s2"&gt;`&lt;/span&gt;&lt;span class="p"&gt;${&lt;/span&gt;&lt;span class="nx"&gt;baseUrl&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;?cursor=&lt;/span&gt;&lt;span class="p"&gt;${&lt;/span&gt;&lt;span class="nx"&gt;json&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;next_cursor&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;`&lt;/span&gt; &lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;null&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;  
    &lt;span class="nx"&gt;page&lt;/span&gt;&lt;span class="o"&gt;++&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;  
    &lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;url&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;Promise&lt;/span&gt;&lt;span class="p"&gt;((&lt;/span&gt;&lt;span class="nx"&gt;r&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="nf"&gt;setTimeout&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;r&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;250&lt;/span&gt;&lt;span class="p"&gt;));&lt;/span&gt; &lt;span class="c1"&gt;// polite pacing  &lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;  

  &lt;span class="nx"&gt;console&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;log&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;`Ingested &lt;/span&gt;&lt;span class="p"&gt;${&lt;/span&gt;&lt;span class="nx"&gt;page&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="s2"&gt; pages from &lt;/span&gt;&lt;span class="p"&gt;${&lt;/span&gt;&lt;span class="nx"&gt;baseUrl&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;`&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;  
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;For offset-based APIs (&lt;code&gt;?page=1&amp;amp;per_page=100&lt;/code&gt;), increment &lt;code&gt;page&lt;/code&gt; and stop when the response is empty. For link-header APIs (&lt;code&gt;Link:&lt;/code&gt;&lt;code&gt;; rel="next"&lt;/code&gt;), parse &lt;code&gt;res.headers.get("link")&lt;/code&gt; or similar for the next URL.&lt;/p&gt;

&lt;h3&gt;
  
  
  When fetch + cron isn’t enough
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;  You start getting 403s or rate-limited responses → Stack 2 (Agent + Bright Data MCP) or Stack 5 (Playwright)&lt;/li&gt;
&lt;li&gt;  You need retries with backoff and idempotency → Stack 4 (durable workflow engine)&lt;/li&gt;
&lt;li&gt;  Your list of URLs grows to thousands and you need fanout → Stack 3 (serverless)&lt;/li&gt;
&lt;li&gt;  The pages require JS execution → Stack 5 (Playwright)&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  What I got wrong
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Always store the raw payload, just in case!&lt;/strong&gt; I built a pipeline that extracted five fields and stored them in a normalized SQLite table — didn’t see the point of keeping the raw response. Three weeks later I needed a sixth field, one that had been in every response the whole time. &lt;strong&gt;Re-fetching that government dataset took four days because of their rate limits.&lt;/strong&gt; So, yeah. The disk cost of storing &lt;code&gt;res.text()&lt;/code&gt; verbatim is trivial. The cost of finding out your schema was wrong after the fact and wasting time and money re-ingesting, is decidedly not. 😅 You can parse as you want downstream, separately, later.&lt;/p&gt;

&lt;h2&gt;
  
  
  2. Agent + Bright Data MCP — Complexity Scale Without the Infra Tax
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Repository:&lt;/strong&gt; &lt;a href="https://github.com/brightdata/brightdata-mcp" rel="noopener noreferrer"&gt;https://github.com/brightdata/brightdata-mcp&lt;/a&gt;&lt;br&gt;&lt;br&gt;
&lt;strong&gt;License&lt;/strong&gt;: MIT&lt;br&gt;&lt;br&gt;
&lt;strong&gt;Free Tier:&lt;/strong&gt; 5,000 requests/month&lt;br&gt;&lt;br&gt;
&lt;strong&gt;Best for:&lt;/strong&gt; High-complexity, moderate-volume targets (competitive pricing, job market analysis, funding data, SERP monitoring) where the hard problem is anti-bot, adaptive navigation, or frequent site changes; not throughput.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://get.brightdata.com/scraping-browser7181?utm_content=five_production_stacks_for_live_data_ingestion_at_scale_without_getting_blocked&amp;amp;source=post_page-----63f63ac4667f---------------------------------------" rel="noopener noreferrer"&gt;The Web MCP by Bright Data - Start with a Free PlanConnect LLMs and AI agents to real-time web data with Bright Data MCP Server. Search, crawl, and automate web tasks at…brightdata.com&lt;/a&gt;&lt;/p&gt;
&lt;h3&gt;
  
  
  What is the agent + Bright Data MCP stack?
&lt;/h3&gt;

&lt;p&gt;An agent loop — running in the IDE, as a headless script, or on a schedule — wired to Bright Data’s MCP (Model Context Protocol) server as its acquisition layer. The agent calls MCP tools, gets structured data back, and writes results to whatever sink fits your pipeline — files, NDJSON, a database, object storage — without fixed selectors, without a scraping framework, and without proxy infra you operate yourself.&lt;/p&gt;
&lt;h3&gt;
  
  
  Why use Bright Data MCP for agentic data extraction?
&lt;/h3&gt;

&lt;p&gt;Most guides treat “scale” as a throughput problem. This stack solves a different one: &lt;em&gt;complexity scale&lt;/em&gt; — targets where the hard problem is defeating defenses, not managing volume.&lt;/p&gt;

&lt;p&gt;A traditional scraper against a heavily defended site is a &lt;strong&gt;maintenance contract&lt;/strong&gt;. Every DOM restructure breaks a selector, every bot detection upgrade breaks your fingerprint, and every geo-block breaks your IP pool. &lt;strong&gt;You spend more time maintaining the scraper than using the data&lt;/strong&gt;. An agent in the loop sidesteps this — it reads what’s on the page and derives the extraction schema from the current DOM, the same way a person would. When the site changes, the agent adapts — autonomously configuring + calling into Bright Data primitives like its proxy network for bot bypass, &lt;a href="https://get.brightdata.com/bd-scraping-browser?utm_content=five_production_stacks_for_live_data_ingestion_at_scale_without_getting_blocked" rel="noopener noreferrer"&gt;Scraping Browser&lt;/a&gt; when a real browser session is required, and the &lt;a href="https://get.brightdata.com/bd-serp-api?utm_content=five_production_stacks_for_live_data_ingestion_at_scale_without_getting_blocked" rel="noopener noreferrer"&gt;SERP API&lt;/a&gt; when you need to hit Google/Bing etc.&lt;/p&gt;

&lt;p&gt;This is also the right stack when you need &lt;em&gt;adaptive acquisition&lt;/em&gt; — targets where the data you want depends on what you find at each step. Navigating a site to a specific product variant, following a pagination trail that changes shape, clicking through a login flow — these aren’t hard for a browser-capable agent and are genuinely painful to script deterministically. The proven production pattern here is &lt;strong&gt;LLM as orchestrator, pre-built tools as the acquisition layer&lt;/strong&gt; — which is exactly what MCP provides.&lt;/p&gt;

&lt;p&gt;A MCP Client like Claude Desktop, Cursor, etc. is just the easiest entry point:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Basic setup:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;  
  &lt;/span&gt;&lt;span class="nl"&gt;"mcpServers"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;  
    &lt;/span&gt;&lt;span class="nl"&gt;"Bright Data"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;  
      &lt;/span&gt;&lt;span class="nl"&gt;"command"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"npx"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;  
      &lt;/span&gt;&lt;span class="nl"&gt;"args"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="w"&gt;  
        &lt;/span&gt;&lt;span class="s2"&gt;"mcp-remote"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;  
        &lt;/span&gt;&lt;span class="s2"&gt;"https://mcp.brightdata.com/mcp?token=YOUR_API_TOKEN"&lt;/span&gt;&lt;span class="w"&gt;  
      &lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt;  
    &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;  
  &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;  
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;For local + advanced config:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;  
  &lt;/span&gt;&lt;span class="nl"&gt;"mcpServers"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;  
    &lt;/span&gt;&lt;span class="nl"&gt;"Bright Data"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;  
      &lt;/span&gt;&lt;span class="nl"&gt;"command"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"npx"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;  
      &lt;/span&gt;&lt;span class="nl"&gt;"args"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"@brightdata/mcp"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;&lt;span class="w"&gt;  
      &lt;/span&gt;&lt;span class="nl"&gt;"env"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;  
        &lt;/span&gt;&lt;span class="nl"&gt;"API_TOKEN"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"YOUR_API_TOKEN"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;  
        &lt;/span&gt;&lt;span class="nl"&gt;"PRO_MODE"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"true"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;  
        &lt;/span&gt;&lt;span class="nl"&gt;"WEB_UNLOCKER_ZONE"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"custom"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;  
        &lt;/span&gt;&lt;span class="nl"&gt;"BROWSER_ZONE"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"custom_browser"&lt;/span&gt;&lt;span class="w"&gt;  
      &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;  
    &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;  
  &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;  
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;But the same MCP wiring can also run headlessly from a script, triggered by cron, or invoked from any orchestrator. The agent loop is not coupled to a GUI.&lt;/p&gt;

&lt;h3&gt;
  
  
  How to run Bright Data MCP headlessly (without Cursor or Claude Desktop)
&lt;/h3&gt;

&lt;p&gt;You don’t actually need Cursor, Claude Desktop, or any hosted client. The MCP TypeScript SDK gives you a reference &lt;code&gt;Client&lt;/code&gt; — install &lt;a href="https://www.npmjs.com/package/@modelcontextprotocol/client" rel="noopener noreferrer"&gt;@modelcontextprotocol/client&lt;/a&gt; (or use the umbrella &lt;a href="https://www.npmjs.com/package/@modelcontextprotocol/sdk" rel="noopener noreferrer"&gt;@modelcontextprotocol/sdk&lt;/a&gt; with the &lt;code&gt;.../client/*.js&lt;/code&gt; paths.) That client can spawn &lt;code&gt;@brightdata/mcp&lt;/code&gt; as a subprocess, negotiate the MCP handshake, and expose &lt;code&gt;listTools()&lt;/code&gt; / &lt;code&gt;callTool()&lt;/code&gt; the same way an IDE-hosted MCP client does.&lt;/p&gt;

&lt;p&gt;It looks something like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nx"&gt;Client&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;@modelcontextprotocol/client&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;  
&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nx"&gt;StdioClientTransport&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;@modelcontextprotocol/client/stdio&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;  

&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;client&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;Client&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;ingest-client&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="na"&gt;version&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;1.0.0&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt; &lt;span class="p"&gt;});&lt;/span&gt;  
&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;transport&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;StdioClientTransport&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;  
  &lt;span class="na"&gt;command&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;npx&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  
  &lt;span class="na"&gt;args&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;@brightdata/mcp&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;  
  &lt;span class="na"&gt;env&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="p"&gt;...&lt;/span&gt;&lt;span class="nx"&gt;process&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;env&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="na"&gt;API_TOKEN&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;process&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;env&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;API_TOKEN&lt;/span&gt;&lt;span class="o"&gt;!&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="na"&gt;PRO_MODE&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;true&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt; &lt;span class="p"&gt;},&lt;/span&gt;  
&lt;span class="p"&gt;});&lt;/span&gt;  

&lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;connect&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;transport&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;  

&lt;span class="c1"&gt;// Call any Bright Data MCP tool (SDK: distinguish result.isError from thrown ProtocolError/SdkError)  &lt;/span&gt;
&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;callTool&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;  
  &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;scrape_as_markdown&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  
  &lt;span class="na"&gt;arguments&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;url&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;https://example.com&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt; &lt;span class="p"&gt;},&lt;/span&gt;  
&lt;span class="p"&gt;});&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;For the hosted endpoint instead of stdio, swap &lt;code&gt;StdioClientTransport&lt;/code&gt; for &lt;code&gt;StreamableHTTPClientTransport&lt;/code&gt; and point it at &lt;code&gt;https://mcp.brightdata.com/mcp?token=…&lt;/code&gt;. The &lt;a href="https://ts.sdk.modelcontextprotocol.io/v2/documents/Documents.Client_Guide.html" rel="noopener noreferrer"&gt;MCP TypeScript SDK client guide&lt;/a&gt; covers transports and error handling.&lt;/p&gt;

&lt;h3&gt;
  
  
  What I got wrong
&lt;/h3&gt;

&lt;p&gt;To explain what I did wrong, first, here are the tools included in the Bright Data MCP:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  &lt;code&gt;**scrape_as_markdown**&lt;/code&gt; &lt;strong&gt;/&lt;/strong&gt; &lt;code&gt;**scrape_as_html**&lt;/code&gt; — General-purpose scraping with bot bypass&lt;/li&gt;
&lt;li&gt;  &lt;code&gt;**search_engine**&lt;/code&gt; — SERP (search engine results page) data without writing a scraper&lt;/li&gt;
&lt;li&gt;  &lt;code&gt;**navigate**&lt;/code&gt; &lt;strong&gt;/&lt;/strong&gt; &lt;code&gt;**click**&lt;/code&gt; &lt;strong&gt;/&lt;/strong&gt; &lt;code&gt;**type**&lt;/code&gt; — Full browser automation for flows that require interaction&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;60+ specialized extractors&lt;/strong&gt; for Amazon, LinkedIn, Crunchbase, Yahoo Finance, and more&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The first lesson was a tier/billing one. The default &lt;strong&gt;Rapid&lt;/strong&gt; mode gives you search and Web Unlocker-backed scraping — not the &lt;strong&gt;60+ Pro tools&lt;/strong&gt;, &lt;strong&gt;browser automation&lt;/strong&gt;, or the &lt;code&gt;**web_data_***&lt;/code&gt; &lt;strong&gt;APIs&lt;/strong&gt;. Those require &lt;code&gt;PRO_MODE=true&lt;/code&gt;, which is &lt;strong&gt;pay-as-you-go on top of the free tier&lt;/strong&gt;. I'd skimmed the marketing copy and missed that footnote entirely.&lt;/p&gt;

&lt;p&gt;The second lesson was an API semantics one. I had lowered &lt;code&gt;POLLING_TIMEOUT&lt;/code&gt; thinking it was a standard request timeout. It isn't — those &lt;code&gt;web_data_*&lt;/code&gt; tools submit a background data-collection job and then poll for the result, and &lt;code&gt;POLLING_TIMEOUT&lt;/code&gt; controls how long that polling is allowed to run. Slow extractions just need more time. &lt;code&gt;BASE_TIMEOUT&lt;/code&gt; and &lt;code&gt;BASE_MAX_RETRIES&lt;/code&gt; are what you actually want for the base tools (&lt;code&gt;search_engine&lt;/code&gt;, &lt;code&gt;scrape_as_markdown&lt;/code&gt;) — they don't affect the polling path at all.&lt;/p&gt;

&lt;h2&gt;
  
  
  3. Serverless Cron + Object Storage — Disposable Compute, Durable Data
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Repository:&lt;/strong&gt; &lt;a href="https://github.com/cloudflare/workers-sdk" rel="noopener noreferrer"&gt;https://github.com/cloudflare/workers-sdk&lt;/a&gt;&lt;br&gt;&lt;br&gt;
&lt;strong&gt;Documentation:&lt;/strong&gt; &lt;a href="https://developers.cloudflare.com/workers/" rel="noopener noreferrer"&gt;Cloudflare Workers&lt;/a&gt; · &lt;a href="https://developers.cloudflare.com/r2/" rel="noopener noreferrer"&gt;Cloudflare R2&lt;/a&gt; · &lt;a href="https://docs.aws.amazon.com/lambda/latest/dg/welcome.html" rel="noopener noreferrer"&gt;AWS Lambda&lt;/a&gt;&lt;br&gt;&lt;br&gt;
&lt;strong&gt;License:&lt;/strong&gt; MIT&lt;br&gt;&lt;br&gt;
&lt;strong&gt;Free Tier:&lt;/strong&gt; Cloud provider free tiers (limits apply — e.g. &lt;a href="https://developers.cloudflare.com/workers/platform/limits" rel="noopener noreferrer"&gt;Workers limits&lt;/a&gt;)&lt;br&gt;&lt;br&gt;
&lt;strong&gt;Best for:&lt;/strong&gt; High-volume URL lists, fan-out workloads, raw payload archiving.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://developers.cloudflare.com/workers/?source=post_page-----63f63ac4667f---------------------------------------" rel="noopener noreferrer"&gt;Overview · Cloudflare Workers docsBuild and deploy serverless applications across Cloudflare's global network with Workers.developers.cloudflare.com&lt;/a&gt;&lt;/p&gt;
&lt;h3&gt;
  
  
  What is the serverless cron + object storage pattern?
&lt;/h3&gt;

&lt;p&gt;Basically, Cloudflare Workers / AWS Lambda + Cloudflare R2/AWS S3 + optional manifest (DynamoDB, Postgres, or a key in R2 itself). A short-lived worker that fires on a schedule, fetches one or more payloads, and lands them in object storage as raw files. The compute is fully disposable. The storage is durable. A small manifest (optional but useful) tracks what’s been fetched and when.&lt;/p&gt;
&lt;h3&gt;
  
  
  Why use serverless workers + R2/S3 for fan-out ingestion?
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;This is the pattern I reach for when I need fan-out.&lt;/strong&gt; If Stack 1 is “one script, one machine, one process,” this is “N concurrent workers, each responsible for a slice of the work, all landing to the same durable sink.” You can ingest a thousand URLs in parallel without managing a server, and the raw payloads survive whatever happens to the compute.&lt;/p&gt;

&lt;p&gt;The interesting design decision is the separation: &lt;em&gt;when&lt;/em&gt; you run (cron) is completely decoupled from &lt;em&gt;what&lt;/em&gt; does the fetching (the worker). That worker is a pure function — give it a URL, it gives you a payload in storage. You can swap the acquisition layer (direct fetch today, proxy tomorrow) without touching the scheduling or the storage format.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="c1"&gt;// Cloudflare Worker (wrangler.toml has crons configured)  &lt;/span&gt;

&lt;span class="k"&gt;export&lt;/span&gt; &lt;span class="k"&gt;default&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;  
  &lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="nf"&gt;scheduled&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="na"&gt;event&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;ScheduledEvent&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="na"&gt;env&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;Env&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="na"&gt;ctx&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;ExecutionContext&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;  
    &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;urls&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nf"&gt;getUrlBatch&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;env&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt; &lt;span class="c1"&gt;// from KV, D1, or hardcoded slice  &lt;/span&gt;

    &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nb"&gt;Promise&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;allSettled&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;  
      &lt;span class="nx"&gt;urls&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;map&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;async &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;url&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;  
        &lt;span class="k"&gt;try&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;  
          &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;res&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nf"&gt;fetch&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;url&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;  
            &lt;span class="c1"&gt;// Cap per-request wait so one bad host doesn't stall the batch (good hygiene, not CF's max wall time)  &lt;/span&gt;

            &lt;span class="na"&gt;signal&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;AbortSignal&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;timeout&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;25&lt;/span&gt;&lt;span class="nx"&gt;_000&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;  
          &lt;span class="p"&gt;});&lt;/span&gt;  

          &lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;!&lt;/span&gt;&lt;span class="nx"&gt;res&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;ok&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;return&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;  

          &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;key&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;`raw/&lt;/span&gt;&lt;span class="p"&gt;${&lt;/span&gt;&lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;Date&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="nf"&gt;toISOString&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="nf"&gt;slice&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;)}&lt;/span&gt;&lt;span class="s2"&gt;/&lt;/span&gt;&lt;span class="p"&gt;${&lt;/span&gt;&lt;span class="nf"&gt;encodeURIComponent&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;url&lt;/span&gt;&lt;span class="p"&gt;)}&lt;/span&gt;&lt;span class="s2"&gt;.json`&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;  

          &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;env&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;BUCKET&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;put&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;key&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;res&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;body&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;  
            &lt;span class="na"&gt;httpMetadata&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;contentType&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;application/json&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt; &lt;span class="p"&gt;},&lt;/span&gt;  

            &lt;span class="na"&gt;customMetadata&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;  
              &lt;span class="na"&gt;source_url&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;url&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  

              &lt;span class="na"&gt;fetched_at&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;Date&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="nf"&gt;toISOString&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt;  

              &lt;span class="na"&gt;status&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nc"&gt;String&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;res&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;status&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;  
            &lt;span class="p"&gt;},&lt;/span&gt;  
          &lt;span class="p"&gt;});&lt;/span&gt;  
        &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="k"&gt;catch &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;err&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;  
          &lt;span class="nx"&gt;console&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;error&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;`Failed: &lt;/span&gt;&lt;span class="p"&gt;${&lt;/span&gt;&lt;span class="nx"&gt;url&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;`&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;err&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;  
        &lt;span class="p"&gt;}&lt;/span&gt;  
      &lt;span class="p"&gt;}),&lt;/span&gt;  
    &lt;span class="p"&gt;);&lt;/span&gt;  
  &lt;span class="p"&gt;},&lt;/span&gt;  
&lt;span class="p"&gt;};&lt;/span&gt;  

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;wrangler.toml&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight toml"&gt;&lt;code&gt;&lt;span class="nn"&gt;[[r2_buckets]]&lt;/span&gt;  
&lt;span class="py"&gt;binding&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"BUCKET"&lt;/span&gt;  
&lt;span class="py"&gt;bucket_name&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"my-ingest-bucket"&lt;/span&gt;  

&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="err"&gt;triggers&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;  
&lt;span class="py"&gt;crons&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s"&gt;"0 * * * *"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Do I need a manifest for serverless ingest?
&lt;/h3&gt;

&lt;p&gt;You don’t always need one. But if you need to know “did I already ingest this URL today?” or “which keys are new since the last pipeline run?”, a manifest pays for itself fast. Cheapest would be a JSON file in R2 itself, keyed by date. Need more? Try a DynamoDB table or a single Postgres table with &lt;code&gt;(url, date, key)&lt;/code&gt; rows.&lt;/p&gt;

&lt;h3&gt;
  
  
  What I got wrong
&lt;/h3&gt;

&lt;p&gt;The biggest gotcha is that Cloudflare Workers silently caps outbound &lt;code&gt;fetch&lt;/code&gt; calls at 50 sub-requests on the Free plan per invocation — the excess doesn't error, &lt;strong&gt;it simply doesn't fire&lt;/strong&gt;. I learned this from the logs that weren't there. For any batch larger than that cap, you need to dispatch to Cloudflare Queues and process in smaller chunks.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The other thing is that Workers has a wall-clock time limit&lt;/strong&gt; — 30 seconds on the Free plan. When you fan out to 40 URLs and three of them are a slow government portal that takes 28 seconds to respond, &lt;strong&gt;those three &lt;em&gt;will&lt;/em&gt; get cut off at the execution limit&lt;/strong&gt;, and the logs will show nothing wrong. The only way I caught it was tracking expected vs. actually-written object counts in the manifest — when those numbers diverged, something had timed out quietly. Per-request &lt;code&gt;AbortSignal.timeout&lt;/code&gt; helps, but a manifest count is the only reliable canary.&lt;/p&gt;

&lt;h2&gt;
  
  
  4. A Durable Workflow Engine + Swappable I/O — The Stable Orchestration Layer
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Repository:&lt;/strong&gt; &lt;a href="https://github.com/triggerdotdev/trigger.dev" rel="noopener noreferrer"&gt;Trigger.dev&lt;/a&gt; · &lt;a href="https://github.com/temporalio/temporal" rel="noopener noreferrer"&gt;Temporal&lt;/a&gt;&lt;br&gt;&lt;br&gt;
&lt;strong&gt;Documentation:&lt;/strong&gt; &lt;a href="https://trigger.dev/docs" rel="noopener noreferrer"&gt;Trigger.dev docs&lt;/a&gt; · &lt;a href="https://docs.temporal.io/" rel="noopener noreferrer"&gt;Temporal docs&lt;/a&gt;&lt;br&gt;&lt;br&gt;
&lt;strong&gt;License:&lt;/strong&gt; Varies by engine (Trigger.dev: Apache 2.0; Temporal: MIT)&lt;br&gt;&lt;br&gt;
&lt;strong&gt;Free Tier:&lt;/strong&gt; Varies by engine — hosted usage tiers, self-hosted deployments, and managed-cloud limits all differ&lt;br&gt;&lt;br&gt;
&lt;strong&gt;Best for:&lt;/strong&gt; Any ingest workload where “what failed and why” needs to be answerable, upstreams are flaky, or you’re running at a scale where silent failures are unacceptable.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://trigger.dev/docs/introduction?source=post_page-----63f63ac4667f---------------------------------------" rel="noopener noreferrer"&gt;Welcome to the Trigger.dev docs - Trigger.devFind all the resources and guides you need to get startedtrigger.dev&lt;/a&gt;&lt;/p&gt;
&lt;h3&gt;
  
  
  What is the workflow engine stack?
&lt;/h3&gt;

&lt;p&gt;A workflow engine (Trigger.dev, Temporal, Inngest, AWS Step Functions, etc.) handles retries, idempotency, scheduling, replay, and observability — while the actual acquisition is a swappable I/O step inside the workflow.&lt;/p&gt;
&lt;h3&gt;
  
  
  Why use Temporal, Inngest, AWS Step Functions for resilient ingestion?
&lt;/h3&gt;

&lt;p&gt;The counterintuitive argument here is that this stack isn’t heavier than Stack 3 in any meaningful sense — it just makes the complexity &lt;em&gt;visible&lt;/em&gt; instead of hiding it in ad-hoc retry logic and &lt;code&gt;try/catch&lt;/code&gt; soup.&lt;/p&gt;

&lt;p&gt;Every serious ingestion system eventually needs automatic retries with backoff, deduplication of runs, visibility into what failed and why, and the ability to replay a failed run without re-running the whole pipeline. Trigger.dev, Temporal, Inngest, and Step Functions all live in this category. The APIs differ but the job is the same.&lt;/p&gt;

&lt;p&gt;So with durable orchestration stabilized, your I/O step — direct fetch, proxy, browser job — is the interchangeable part. When your upstream starts blocking you, you change one function. The retries, scheduling, idempotency, and replay story stay intact.&lt;/p&gt;

&lt;p&gt;Here’s a Trigger.dev example for this stack:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="c1"&gt;// trigger.dev task (see current SDK import path for your version — often `@trigger.dev/sdk`)  &lt;/span&gt;
&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nx"&gt;task&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;idempotencyKeys&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;@trigger.dev/sdk&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;  

&lt;span class="k"&gt;export&lt;/span&gt; &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;ingestUrl&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;task&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;  
  &lt;span class="na"&gt;id&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;ingest-url&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  
  &lt;span class="na"&gt;retry&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;  
    &lt;span class="na"&gt;maxAttempts&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  
    &lt;span class="na"&gt;factor&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  
    &lt;span class="na"&gt;minTimeoutInMs&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;1000&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  
    &lt;span class="na"&gt;maxTimeoutInMs&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;30&lt;/span&gt;&lt;span class="nx"&gt;_000&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  
  &lt;span class="p"&gt;},&lt;/span&gt;  
  &lt;span class="na"&gt;run&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="k"&gt;async &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="na"&gt;payload&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;url&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kr"&gt;string&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="nl"&gt;date&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kr"&gt;string&lt;/span&gt; &lt;span class="p"&gt;})&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;  
    &lt;span class="c1"&gt;// Swap this block for proxy / Web Unlocker / browser job when direct fetch isn't enough.  &lt;/span&gt;
    &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;res&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nf"&gt;fetch&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;payload&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;url&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;  
      &lt;span class="na"&gt;signal&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;AbortSignal&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;timeout&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;15&lt;/span&gt;&lt;span class="nx"&gt;_000&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;  
    &lt;span class="p"&gt;});&lt;/span&gt;  
    &lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;!&lt;/span&gt;&lt;span class="nx"&gt;res&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;ok&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;throw&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;Error&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;`HTTP &lt;/span&gt;&lt;span class="p"&gt;${&lt;/span&gt;&lt;span class="nx"&gt;res&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;status&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;`&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;  
    &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;data&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;res&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;json&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;  
&lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nf"&gt;writeToStorage&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;data&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;payload&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;url&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;  
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;success&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="na"&gt;url&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;payload&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;url&lt;/span&gt; &lt;span class="p"&gt;};&lt;/span&gt;  
  &lt;span class="p"&gt;},&lt;/span&gt;  
&lt;span class="p"&gt;});&lt;/span&gt;  
&lt;span class="c1"&gt;// Trigger a batch from a cron or an API route  &lt;/span&gt;
&lt;span class="k"&gt;export&lt;/span&gt; &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;ingestBatch&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;task&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;  
  &lt;span class="na"&gt;id&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;ingest-batch&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  
  &lt;span class="na"&gt;cron&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;0 */6 * * *&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  
  &lt;span class="na"&gt;run&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="k"&gt;async &lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;  
    &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;urls&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nf"&gt;getTargetUrls&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;  
    &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;date&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;Date&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="nf"&gt;toISOString&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="nf"&gt;slice&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;  
    &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;items&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nb"&gt;Promise&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;all&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;  
      &lt;span class="nx"&gt;urls&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;map&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;async &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;url&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;({&lt;/span&gt;  
        &lt;span class="na"&gt;payload&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nx"&gt;url&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;date&lt;/span&gt; &lt;span class="p"&gt;},&lt;/span&gt;  
        &lt;span class="na"&gt;options&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;  
          &lt;span class="na"&gt;idempotencyKey&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;idempotencyKeys&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;create&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;`ingest:&lt;/span&gt;&lt;span class="p"&gt;${&lt;/span&gt;&lt;span class="nx"&gt;url&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;:&lt;/span&gt;&lt;span class="p"&gt;${&lt;/span&gt;&lt;span class="nx"&gt;date&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;`&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;  
            &lt;span class="na"&gt;scope&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;global&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  
          &lt;span class="p"&gt;}),&lt;/span&gt;  
        &lt;span class="p"&gt;},&lt;/span&gt;  
      &lt;span class="p"&gt;}))&lt;/span&gt;  
    &lt;span class="p"&gt;);&lt;/span&gt;  
    &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;ingestUrl&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;batchTriggerAndWait&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;items&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;  
  &lt;span class="p"&gt;},&lt;/span&gt;  
&lt;span class="p"&gt;});&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  When to swap the I/O step in a workflow
&lt;/h3&gt;

&lt;p&gt;When the I/O step starts failing — consistent 403s, CAPTCHAs, geo-blocks — you replace &lt;code&gt;fetch(url)&lt;/code&gt; with a call through a proxy or unlocker API. The retry logic, the scheduling, the idempotency — none of it changes. You changed one line. That's the payoff.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="c1"&gt;// Before  &lt;/span&gt;
&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;res&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nf"&gt;fetch&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;url&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;  
&lt;span class="c1"&gt;// After — swap the I/O layer for a proxy or unlocker when direct fetch isn't enough  &lt;/span&gt;
&lt;span class="c1"&gt;// const res = await proxyClient.fetch(url);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  What I got wrong
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Don’t neglect idempotency keys!&lt;/strong&gt; They feel optional until the first time you need to replay something, which is &lt;em&gt;always&lt;/em&gt;, eventually.&lt;/p&gt;

&lt;h2&gt;
  
  
  5. Minimal Playwright Headless — The Last Resort
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Repository:&lt;/strong&gt; &lt;a href="https://github.com/microsoft/playwright" rel="noopener noreferrer"&gt;https://github.com/microsoft/playwright&lt;/a&gt;&lt;br&gt;&lt;br&gt;
&lt;strong&gt;Documentation:&lt;/strong&gt; &lt;a href="https://playwright.dev/docs/intro" rel="noopener noreferrer"&gt;https://playwright.dev/docs/intro&lt;/a&gt;&lt;br&gt;&lt;br&gt;
&lt;strong&gt;License:&lt;/strong&gt; Apache 2.0&lt;br&gt;&lt;br&gt;
&lt;strong&gt;Free Tier:&lt;/strong&gt; Unlimited (open source; you pay for compute/hosting)&lt;br&gt;&lt;br&gt;
&lt;strong&gt;Best for:&lt;/strong&gt; SPAs, client-rendered dashboards, sites that gate content behind click interactions, any page that simply doesn’t exist until JavaScript runs.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://github.com/microsoft/playwright?source=post_page-----63f63ac4667f---------------------------------------" rel="noopener noreferrer"&gt;GitHub - microsoft/playwright: Playwright is a framework for Web Testing and Automation. It allows…Playwright is a framework for Web Testing and Automation. It allows testing Chromium, Firefox and WebKit with a single…github.com&lt;/a&gt;&lt;/p&gt;
&lt;h3&gt;
  
  
  What is the minimal Playwright headless stack?
&lt;/h3&gt;

&lt;p&gt;One Playwright browser context, strict timeouts, screenshot and/or HTML captured to storag, with optional residential/datacenter proxy. No parallel contexts unless you’ve actually measured you need them. No fancy orchestration unless you’ve already tried Stack 4 with a browser job.&lt;/p&gt;
&lt;h3&gt;
  
  
  When to use Playwright for scraping (and why it’s a last resort)
&lt;/h3&gt;

&lt;p&gt;Headless browsers are the right tool for exactly &lt;em&gt;one&lt;/em&gt; specific failure mode: the page doesn’t exist until JavaScript executes it. SPAs, client-side rendered dashboards, pages that require a click to reveal pricing — these can’t be handled by any of the previous stacks without adding a browser layer.&lt;/p&gt;

&lt;p&gt;But headless is expensive. CPU, memory, time. A Playwright context consumes dramatically more resources than a &lt;code&gt;fetch&lt;/code&gt;. You serialize concurrency. Cold starts on serverless are brutal. If you're reaching for Playwright because you &lt;em&gt;might&lt;/em&gt; need it, &lt;strong&gt;DON'T&lt;/strong&gt;. Try one of the above stacks first.&lt;/p&gt;

&lt;p&gt;The minimal version of this stack will be familiar to most readers: one browser context, one page, strict timeouts, then write to disk. Proxies are optional.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nx"&gt;chromium&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;playwright&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;  
&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nx"&gt;writeFile&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;fs/promises&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;  

&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="kd"&gt;function&lt;/span&gt; &lt;span class="nf"&gt;scrapeWithBrowser&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;url&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kr"&gt;string&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;outputDir&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kr"&gt;string&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;  
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;browser&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;chromium&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;launch&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;  
    &lt;span class="na"&gt;headless&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  
    &lt;span class="na"&gt;args&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;  
      &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;--no-sandbox&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  
      &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;--disable-setuid-sandbox&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  
      &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;--disable-dev-shm-usage&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="c1"&gt;// critical for containerized environments  &lt;/span&gt;
    &lt;span class="p"&gt;],&lt;/span&gt;  
  &lt;span class="p"&gt;});&lt;/span&gt;  

  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;context&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;browser&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;newContext&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;  
    &lt;span class="c1"&gt;// Proxy config goes here when you need it:  &lt;/span&gt;
    &lt;span class="c1"&gt;// proxy: { server: "http://proxy.brightdata.com:22225", username: "...", password: "..." },  &lt;/span&gt;
    &lt;span class="na"&gt;userAgent&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;  
      &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  
    &lt;span class="na"&gt;viewport&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;width&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;1280&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="na"&gt;height&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;800&lt;/span&gt; &lt;span class="p"&gt;},&lt;/span&gt;  
  &lt;span class="p"&gt;});&lt;/span&gt;  

  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;page&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;context&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;newPage&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;  

  &lt;span class="k"&gt;try&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;  
    &lt;span class="c1"&gt;// Hard timeout on navigation — don't wait for every lazy-loaded asset  &lt;/span&gt;
    &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;page&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;goto&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;url&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;  
      &lt;span class="na"&gt;waitUntil&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;domcontentloaded&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="c1"&gt;// not 'networkidle' — too slow  &lt;/span&gt;
      &lt;span class="na"&gt;timeout&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;20&lt;/span&gt;&lt;span class="nx"&gt;_000&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  
    &lt;span class="p"&gt;});&lt;/span&gt;  

    &lt;span class="c1"&gt;// Wait for the element you actually want, not the whole page  &lt;/span&gt;
    &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;page&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;waitForSelector&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;[data-testid='price']&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;timeout&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;8&lt;/span&gt;&lt;span class="nx"&gt;_000&lt;/span&gt; &lt;span class="p"&gt;});&lt;/span&gt;  
    &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;html&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;page&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;content&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;  
    &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;slug&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;encodeURIComponent&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;url&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;slice&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;80&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;  
    &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;ts&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nb"&gt;Date&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;now&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;  
    &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nb"&gt;Promise&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;all&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;  
      &lt;span class="nf"&gt;writeFile&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;`&lt;/span&gt;&lt;span class="p"&gt;${&lt;/span&gt;&lt;span class="nx"&gt;outputDir&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;/&lt;/span&gt;&lt;span class="p"&gt;${&lt;/span&gt;&lt;span class="nx"&gt;slug&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;-&lt;/span&gt;&lt;span class="p"&gt;${&lt;/span&gt;&lt;span class="nx"&gt;ts&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;.html`&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;html&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;  
      &lt;span class="nx"&gt;page&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;screenshot&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;  
        &lt;span class="na"&gt;path&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;`&lt;/span&gt;&lt;span class="p"&gt;${&lt;/span&gt;&lt;span class="nx"&gt;outputDir&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;/&lt;/span&gt;&lt;span class="p"&gt;${&lt;/span&gt;&lt;span class="nx"&gt;slug&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;-&lt;/span&gt;&lt;span class="p"&gt;${&lt;/span&gt;&lt;span class="nx"&gt;ts&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;.png`&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  
        &lt;span class="na"&gt;fullPage&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;false&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  
      &lt;span class="p"&gt;}),&lt;/span&gt;  
    &lt;span class="p"&gt;]);&lt;/span&gt;  

    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;success&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;url&lt;/span&gt; &lt;span class="p"&gt;};&lt;/span&gt;  
  &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="k"&gt;finally&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;  
    &lt;span class="c1"&gt;// Always close — leaked contexts accumulate fast  &lt;/span&gt;
    &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;context&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;close&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;  
    &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;browser&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;close&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;  
  &lt;span class="p"&gt;}&lt;/span&gt;  
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  When to add a proxy to your Playwright stack
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Only when you’ve been blocked, definitively.&lt;/strong&gt; Not before. The minimal version without a proxy will work on the vast majority of public pages. When you start seeing CAPTCHAs, bot challenges, or suspiciously empty responses, that’s when you plug in residential proxies or a scraping browser service (which handles fingerprinting and unblocking at the browser level).&lt;/p&gt;

&lt;h3&gt;
  
  
  What I got wrong
&lt;/h3&gt;

&lt;p&gt;Of course, the one gotcha that catches everyone on the containerization side: &lt;code&gt;--disable-dev-shm-usage&lt;/code&gt;. I deployed without it. Container crashed with exit code 1 and no useful error. Two hours of debugging: Chromium was exhausting Docker's default 64MB &lt;code&gt;/dev/shm&lt;/code&gt;. The flag routes Chromium's shared memory to &lt;code&gt;/tmp&lt;/code&gt; instead. It is mandatory in any containerized environment. It is never in the first tutorial you read. It is always in the second incident report.&lt;/p&gt;

&lt;h2&gt;
  
  
  Decision Tree: Choosing Your Ingestion Stack
&lt;/h2&gt;

&lt;p&gt;Start with the simplest viable ingestion layer, then add capabilities only as you actually need more complexity:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  Stack 1 → Simple public APIs and small URL sets&lt;/li&gt;
&lt;li&gt;  Stack 2 → Agentic navigation, anti-bot handling, or validation&lt;/li&gt;
&lt;li&gt;  Stack 3 → Large-scale fan-out across many URLs&lt;/li&gt;
&lt;li&gt;  Stack 4 → Durable orchestration, retries, and observability&lt;/li&gt;
&lt;li&gt;  Stack 5 → JavaScript-rendered pages via headless browsers&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Remember: these stacks are composable layers, not mutually exclusive choices.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4hmr9c6qw97x8z1i9sjv.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4hmr9c6qw97x8z1i9sjv.png" width="640" height="970"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The most common production pattern I’ve seen is Stack 4 (orchestration) wrapping Stack 5 (browser) or Stack 3 (serverless fan-out) as the I/O step.&lt;/p&gt;

&lt;h2&gt;
  
  
  Frequently Asked Questions (FAQ)
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Q: What is the simplest data ingestion stack for a small project?&lt;br&gt;&lt;br&gt;
A:&lt;/strong&gt; A plain &lt;code&gt;fetch&lt;/code&gt; loop on a cron schedule writing to SQLite or flat files. No framework, no queue, no service dependencies — a script you can read end-to-end in five minutes. This works for the majority of stable public APIs, RSS feeds, and open datasets.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Q: Can I run Bright Data MCP without Claude Desktop or Cursor?&lt;br&gt;&lt;br&gt;
A:&lt;/strong&gt; Yes. The official &lt;a href="https://www.npmjs.com/package/@modelcontextprotocol/sdk" rel="noopener noreferrer"&gt;@modelcontextprotocol/sdk&lt;/a&gt; for TypeScript ships an MCP client (&lt;code&gt;Client&lt;/code&gt; + &lt;code&gt;StdioClientTransport&lt;/code&gt;) that spawns &lt;code&gt;@brightdata/mcp&lt;/code&gt; as a subprocess from any Node script. From there, bridge the tool loop to whatever LLM you prefer: Anthropic's Messages API, Ollama's &lt;code&gt;/api/chat&lt;/code&gt; with a local tool-calling model, or any OpenAI-compatible endpoint. No proprietary client install required.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Q: Should I store raw API payloads or only the parsed fields I need?&lt;br&gt;&lt;br&gt;
A:&lt;/strong&gt; Always store the raw payload first. Schemas evolve, fields you didn’t think you needed become important, and re-fetching upstream data — especially rate-limited public APIs — can take days.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Q: When is it worth paying for a proxy or unblocking service?&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
&lt;strong&gt;A:&lt;/strong&gt; When you’re seeing 403s, 429s, CAPTCHAs, or geo-blocks on your target. Not before. Vanilla &lt;code&gt;fetch&lt;/code&gt; and a minimal Playwright context work on the vast majority of public pages. Add a paid proxy or unblocking layer only when you've measured the specific failure mode you're solving. The same applies to LLM-agent stacks: validate the data exists and has the shape you need before committing to paid infrastructure.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Q: How do I run an MCP server programmatically from a backend service?&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
A: Use the MCP TypeScript SDK’s &lt;code&gt;Client&lt;/code&gt; with the transport that matches your deployment: &lt;code&gt;StdioClientTransport&lt;/code&gt; for local subprocesses (e.g. &lt;code&gt;npx @brightdata/mcp&lt;/code&gt;), or &lt;code&gt;StreamableHTTPClientTransport&lt;/code&gt; / &lt;code&gt;SSEClientTransport&lt;/code&gt; for a remote MCP URL like &lt;code&gt;https://mcp.brightdata.com/mcp?token=…&lt;/code&gt;. Call &lt;code&gt;mcp.connect()&lt;/code&gt;, then &lt;code&gt;listTools()&lt;/code&gt; / &lt;code&gt;callTool()&lt;/code&gt; exactly like an IDE would. The &lt;a href="https://ts.sdk.modelcontextprotocol.io/v2/documents/Documents.Client_Guide.html" rel="noopener noreferrer"&gt;MCP TypeScript SDK client guide&lt;/a&gt; covers all transports.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Q: When should I combine stacks rather than pick one?&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
&lt;strong&gt;A:&lt;/strong&gt; Almost always! These patterns are primitives, not solo architectures. The most common production shape is Stack 4 (durable orchestration and retries) wrapping Stack 5 (Playwright as the browser I/O step) or Stack 3 (serverless fan-out for parallel fetching). Stack 1 or 2 for ad-hoc validation before you commit to building any of it. Treat the decision tree as "which I/O layer do I need?" not "which complete system should I adopt?"&lt;/p&gt;

&lt;p&gt;&lt;em&gt;What stack are you running for live data ingestion? Have you ran into walls I didn’t cover here? Drop it in the comments. 👇&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>programming</category>
      <category>javascript</category>
      <category>webdev</category>
    </item>
    <item>
      <title>Turning Google into an Explorable Knowledge Graph Using Pure k-NN</title>
      <dc:creator>Prithwish Nath</dc:creator>
      <pubDate>Fri, 15 May 2026 11:09:58 +0000</pubDate>
      <link>https://dev.to/prithwish_nath/turning-google-into-an-explorable-knowledge-graph-using-pure-k-nn-2878</link>
      <guid>https://dev.to/prithwish_nath/turning-google-into-an-explorable-knowledge-graph-using-pure-k-nn-2878</guid>
      <description>&lt;h2&gt;
  
  
  &lt;strong&gt;TL;DR:&lt;/strong&gt; I ran K-Nearest Neighbors (KNN) over a Google search corpus to find cross-query connections no single search can ever surface.
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fnki4xluo7r9abyd2jlau.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fnki4xluo7r9abyd2jlau.png" alt="Image" width="800" height="567"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Human learning is all about building &lt;em&gt;connections&lt;/em&gt; in your head. Like last week, I read &lt;a href="https://arxiv.org/html/2411.02530v1" rel="noopener noreferrer"&gt;an ArXiv paper on quantization&lt;/a&gt;, which prompted me to do some Google-fu for a &lt;a href="https://forums.developer.nvidia.com/t/same-inference-speed-for-int8-and-fp16/66971" rel="noopener noreferrer"&gt;FP16 vs INT8 comparison&lt;/a&gt; on NVIDIA’s forums, and then make a &lt;code&gt;site:github.com&lt;/code&gt; search for &lt;a href="https://github.com/ikawrakow/ik_llama.cpp" rel="noopener noreferrer"&gt;a Llama.cpp fork with optimized kernels&lt;/a&gt; to try it myself. &lt;strong&gt;This takes time.&lt;/strong&gt; Google — or an LLM — can’t make these mental hops for you.&lt;/p&gt;

&lt;p&gt;So I wanted to see if I could speed this up by &lt;strong&gt;programmatically finding and shortlisting these connections for me to review later&lt;/strong&gt;, using &lt;a href="https://en.wikipedia.org/wiki/K-nearest_neighbors_algorithm" rel="noopener noreferrer"&gt;a classic algorithm from 1951&lt;/a&gt;. To collect the raw material, I used my &lt;a href="https://get.brightdata.com/bd-serp-api?utm_content=turning_google_into_an_explorable_knowledge_graph_using_pure_k_nn" rel="noopener noreferrer"&gt;SERP API&lt;/a&gt; to run 100 varied Google searches on a specific topic — then merged the ~800 results into one corpus, embedded every row, and ran cosine k-NN over the whole thing.&lt;/p&gt;

&lt;p&gt;From that new data, I could click any result in my UI and see its nearest semantic neighbors — &lt;strong&gt;not just from the same search, but &lt;em&gt;anywhere&lt;/em&gt; in the dataset, across all 100 searches&lt;/strong&gt; — fully explorable.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fohlxbckhwcwzgpl93ymw.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fohlxbckhwcwzgpl93ymw.png" width="640" height="365"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Highlighted links in the Related section mean they were from different queries.&lt;/p&gt;

&lt;p&gt;This worked &lt;em&gt;exceptionally&lt;/em&gt; well. &lt;strong&gt;A whopping 42.2% of all neighbor links crossed query boundaries&lt;/strong&gt;, and &lt;strong&gt;&lt;em&gt;every one&lt;/em&gt; of the 797 documents in my corpus had at least one cross-search connection in its top 8&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;I’ll present my approach and findings here, and &lt;a href="https://github.com/sixthextinction/knn" rel="noopener noreferrer"&gt;the full code is available on GitHub&lt;/a&gt; to review.&lt;/p&gt;

&lt;h2&gt;
  
  
  What is the K-Nearest Neighbors Algorithm (KNN) ?
&lt;/h2&gt;

&lt;p&gt;Similar things tend to be near each other. The k-nearest neighbors algorithm (k-NN) formalizes this:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;em&gt;Given a point in space, find the k closest points to it using some distance metric (here, cosine similarity over embeddings).&lt;/em&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;I treat each Google result as a point in a shared semantic space. That changes the question from &lt;em&gt;“what ranks for this query?”&lt;/em&gt; to &lt;em&gt;“what lives near this document?”&lt;/em&gt; Going from Google’s ranking to &lt;strong&gt;proximity&lt;/strong&gt; is what makes connections show up across queries, domains, and levels of abstraction.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Why k-NN?&lt;/strong&gt; Because it is local and doesn’t need training. It simply operates over the structure already present in the embeddings, and because it runs over the entire merged corpus, neighbors can come from &lt;em&gt;anywhere&lt;/em&gt; in the data.&lt;/p&gt;

&lt;h2&gt;
  
  
  Architecture
&lt;/h2&gt;

&lt;p&gt;I’m too lazy for full orchestration layers or distributed systems—so this project is just a sequence of steps that add structure, progressively.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fsbu1t4p5ph5g976yyqjf.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fsbu1t4p5ph5g976yyqjf.png" width="516" height="940"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Each stage does &lt;em&gt;one&lt;/em&gt; thing. &lt;code&gt;ingest.py&lt;/code&gt; collects results across many queries into a single &lt;a href="https://duckdb.org/docs/current/" rel="noopener noreferrer"&gt;DuckDB&lt;/a&gt; table, preserving &lt;code&gt;(url, query)&lt;/code&gt; pairs as distinct rows so context isn’t lost. Then &lt;code&gt;embed.py&lt;/code&gt; converts each row into a vector &lt;code&gt;(title + snippet + domain + query)&lt;/code&gt; and stores it in &lt;a href="https://docs.trychroma.com/docs/overview/introduction" rel="noopener noreferrer"&gt;Chroma&lt;/a&gt;. Next, &lt;code&gt;neighbors.py&lt;/code&gt; runs cosine k-NN over that global space and hydrates results back from DuckDB. Finally, &lt;code&gt;serve.py&lt;/code&gt; exposes this through a minimal API and HTML/JS/CSS UI, where we can click any result, and see its nearest neighbors from anywhere in the corpus.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;em&gt;💡&lt;/em&gt; I could have stored everything in Chroma with metadata fields and skipped DuckDB entirely. I didn’t, because &lt;strong&gt;Chroma does not make for a very good source of truth.&lt;/strong&gt; Metadata in it is harder to query, to inspect ad hoc, and to rebuild from.&lt;/p&gt;

&lt;p&gt;DuckDB on the other hand, is a single portable file, queryable with standard SQL, trivially exportable, and completely replaceable without touching the vector layer.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2&gt;
  
  
  Prerequisites
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Install:&lt;/strong&gt; &lt;a href="https://www.python.org/downloads/" rel="noopener noreferrer"&gt;Python&lt;/a&gt; 3.10+, &lt;a href="https://docs.astral.sh/uv/" rel="noopener noreferrer"&gt;uv&lt;/a&gt; (or another venv workflow), &lt;a href="https://ollama.com/" rel="noopener noreferrer"&gt;Ollama&lt;/a&gt; with &lt;a href="https://ollama.com/library/nomic-embed-text" rel="noopener noreferrer"&gt;nomic-embed-text:latest&lt;/a&gt; pulled, and &lt;a href="https://docs.docker.com/get-docker/" rel="noopener noreferrer"&gt;Docker&lt;/a&gt; (or any Chroma HTTP server) — e.g. &lt;code&gt;docker compose up -d&lt;/code&gt; in this folder so Chroma listens on &lt;code&gt;localhost:8000&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Python dependencies&lt;/strong&gt; (&lt;code&gt;requirements.txt&lt;/code&gt;):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;python-dotenv&amp;gt;=1.0.0  
requests&amp;gt;=2.28.0  
chromadb&amp;gt;=0.5.0  
duckdb&amp;gt;=1.0.0  
fastapi&amp;gt;=0.115.0  
uvicorn[standard]&amp;gt;=0.32.0
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Environment Variables:&lt;/strong&gt; Set at least &lt;code&gt;**BRIGHT_DATA_API_KEY**&lt;/code&gt; and &lt;code&gt;**BRIGHT_DATA_ZONE**&lt;/code&gt; (required for &lt;code&gt;ingest.py&lt;/code&gt;). You get them from your Bright Data dashboard after &lt;a href="https://get.brightdata.com/bd7914?utm_content=turning_google_into_an_explorable_knowledge_graph_using_pure_k_nn" rel="noopener noreferrer"&gt;signing up here&lt;/a&gt;; replace with your own if using some other SERP API. Everything else is optional, &lt;a href="https://github.com/sixthextinction/knn/blob/main/README.md" rel="noopener noreferrer"&gt;documented in README.md&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Run order:&lt;/strong&gt; create/activate a venv, &lt;code&gt;uv pip install -r requirements.txt&lt;/code&gt;, start Chroma, then do &lt;code&gt;python ingest.py&lt;/code&gt; → &lt;code&gt;python embed.py&lt;/code&gt; → &lt;code&gt;python serve.py&lt;/code&gt; and open the app URL (default &lt;code&gt;[http://127.0.0.1:8766/](http://127.0.0.1:8766/).)&lt;/code&gt;&lt;a href="http://127.0.0.1:8766/" rel="noopener noreferrer"&gt;).&lt;/a&gt;.)&lt;/p&gt;

&lt;h2&gt;
  
  
  Step 1: Building a Multi-Angle Query Set for Your Research Topic
&lt;/h2&gt;

&lt;p&gt;Our list of Google queries can live in a&lt;code&gt;queries.json&lt;/code&gt;. For best results, try to cover as many angles as you can think of. Example: I wanted to research ML on edge devices (as of 2026), so I included strings covering hardware, software, models/compression, web/WASM, research and benchmarks, and more.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Full code here: &lt;a href="https://github.com/sixthextinction/knn/blob/main/queries.json" rel="noopener noreferrer"&gt;queries.json&lt;/a&gt;&lt;br&gt;
&lt;/p&gt;
&lt;/blockquote&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;[  
  "how to run a neural network on a microcontroller",  
  "edge AI chips compared Coral TPU vs Jetson Nano",  
  "TensorFlow Lite Micro supported operators",  
  "ONNX Runtime Web WebGPU backend",  
  "site:arxiv.org efficient LLM survey",  
  "MLPerf Tiny benchmark results"  
  // ...see queries.json for the full list  
]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;We’ll read this file once at ingest time and never again.&lt;/p&gt;

&lt;h2&gt;
  
  
  Step 2: Setting Up a Bright Data SERP API Client in Python
&lt;/h2&gt;

&lt;p&gt;Here, we’re gonna wrap the Bright Data SERP API in a thin, defensive client that fails loudly on bad responses instead of silently passing garbage downstream.&lt;/p&gt;

&lt;p&gt;Some quick gotchas:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  &lt;strong&gt;Don’t use the&lt;/strong&gt; &lt;code&gt;**&amp;amp;num=**&lt;/code&gt; &lt;strong&gt;parameter.&lt;/strong&gt; Google deprecated that parameter in September 2025. Now you get ~10 organics regardless. Cap rows by slicing after the response, which is what &lt;code&gt;limit_organic()&lt;/code&gt; does.&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Use a retry loop, but keep it simple.&lt;/strong&gt; Rather than pulling in a specialized retry library, &lt;code&gt;search()&lt;/code&gt; makes three attempts with a linear backoff of &lt;code&gt;0.5s × (attempt + 1)&lt;/code&gt;. Good enough.&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Don’t forget to unwrap the response.&lt;/strong&gt; The &lt;code&gt;"format": "json"&lt;/code&gt; parameter brings in a response that is an envelope with its own &lt;code&gt;status_code&lt;/code&gt;, &lt;code&gt;headers&lt;/code&gt;, and &lt;code&gt;body&lt;/code&gt; — and the actual SERP payload lives inside &lt;code&gt;body&lt;/code&gt;, so you need a second &lt;code&gt;json.loads&lt;/code&gt;.&lt;/li&gt;
&lt;/ul&gt;

&lt;blockquote&gt;
&lt;p&gt;Full code here: &lt;a href="https://github.com/sixthextinction/knn/blob/main/bright_data_serp.py" rel="noopener noreferrer"&gt;bright_data_serp.py&lt;/a&gt;&lt;br&gt;
&lt;/p&gt;
&lt;/blockquote&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# a util function really  
def limit_organic(data: Dict[str, Any], max_results: int) -&amp;gt; Dict[str, Any]:  
    # Keep at most `max_results` organic rows.  
    if max_results `&amp;lt;= 0:  
        return data  
    organic = data.get("organic")  
    if isinstance(organic, list) and len(organic) &amp;gt;` max_results:  
        return {**data, "organic": organic[:max_results]}  
    return data  


class BrightDataSERPClient:  
    def __init__(        self,  
        api_key: Optional[str] = None,  
        zone: Optional[str] = None,  
        country: Optional[str] = None,    ):  
        self.api_key = api_key or os.getenv("BRIGHT_DATA_API_KEY")  
        self.zone = zone or os.getenv("BRIGHT_DATA_ZONE")  
        self.country = country or os.getenv("BRIGHT_DATA_COUNTRY")  
        self.api_endpoint = "https://api.brightdata.com/request"  

        if not self.api_key:  
            raise ValueError("BRIGHT_DATA_API_KEY is required.")  
        if not self.zone:  
            raise ValueError("BRIGHT_DATA_ZONE is required.")  

        self.session = requests.Session()  
        self.session.headers.update(  
            {  
                "Content-Type": "application/json",  
                "Authorization": f"Bearer {self.api_key}",  
            }  
        )  

    def search(        self,  
        query: str,  
        num_results: int = 10,  
        language: Optional[str] = None,  
        country: Optional[str] = None,  
        max_retries: int = 2,    ) -&amp;gt; Dict[str, Any]:  
        last_err: Optional[Exception] = None  
        for attempt in range(max_retries + 1):  
            try:  
                return self._do_search(query, num_results, language, country)  
            except Exception as e:  
                last_err = e  
                if attempt `&amp;lt; max_retries:  
                    # simple linear backoff  
                    time.sleep(0.5 * (attempt + 1))  
        assert last_err is not None  
        raise last_err  

    def _do_search(        self,  
        query: str,  
        num_results: int,  
        language: Optional[str],  
        country: Optional[str],    ) -&amp;gt;` Dict[str, Any]:  
        search_url = (  
            f"https://www.google.com/search"  
            f"?q={requests.utils.quote(query)}"  
            f"&amp;amp;brd_json=1"  
        )  
        if language:  
            search_url += f"&amp;amp;hl={language}&amp;amp;lr=lang_{language}"  
        target_country = country or self.country  
        payload: Dict[str, Any] = {  
            "zone": self.zone,  
            "url": search_url,  
            "format": "json",  
        }  
        if target_country:  
            payload["country"] = target_country  

        response = self.session.post(self.api_endpoint, json=payload, timeout=60)  
        response.raise_for_status()  
        result = response.json()  
        if not isinstance(result, dict):  
            raise RuntimeError(f"Bright Data unexpected response type: {type(result)}")  
        inner_status = result.get("status_code")  
        if inner_status is not None and inner_status != 200:  
            raise RuntimeError(f"Bright Data SERP status_code={inner_status}")  
        if "body" in result:  
            body = result["body"]  
            if isinstance(body, str):  
                if not body.strip():  
                    raise RuntimeError("Bright Data SERP empty body")  
                result = json.loads(body)  
            else:  
                result = body  
        elif "organic" not in result:  
            raise RuntimeError("Bright Data response missing 'body' and 'organic'")  
        return limit_organic(result, num_results)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;None of this is &lt;em&gt;glamorous&lt;/em&gt;, really. But a pipeline that silently ingests empty responses is worse than one that crashes loudly. So I intentionally fail fast here so the rest of the pipeline can trust its input.&lt;/p&gt;

&lt;h2&gt;
  
  
  Step 3: Ingesting Multi-Query Search Results Into DuckDB
&lt;/h2&gt;

&lt;p&gt;With a reliable client in place, &lt;code&gt;ingest.py&lt;/code&gt; has just one job: &lt;strong&gt;loop over every query in&lt;/strong&gt; &lt;code&gt;queries.json&lt;/code&gt;&lt;strong&gt;, fetch Google’s organic results, and write them into a single DuckDB table&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;For the primary key, I decided on a SHA-256 hash of &lt;code&gt;url + source_query&lt;/code&gt;. This gives us three things for free:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  The &lt;em&gt;same&lt;/em&gt; URL retrieved by two different queries becomes two distinct rows, with different &lt;code&gt;source_query&lt;/code&gt; values. We don't lose that provenance.&lt;/li&gt;
&lt;li&gt;  Re-ingesting a query produces the same IDs deterministically, so &lt;code&gt;DELETE WHERE source_query = ?&lt;/code&gt; followed by re-insert is safe to run as many times as you like.&lt;/li&gt;
&lt;li&gt;  And finally, you won’t need an autoincrement sequence or UUID generation — the ID is fully derivable from the content itself.
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;def row_id(url: str, source_query: str) -&amp;gt; str:  
    return hashlib.sha256(f"{url}t{source_query}".encode()).hexdigest()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;I've added a &lt;code&gt;--refresh&lt;/code&gt; option too; this wipes the table and re-fetches all queries from scratch (which you'd use when you want a completely clean corpus).&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Full code here: &lt;a href="https://github.com/sixthextinction/knn/blob/main/ingest.py" rel="noopener noreferrer"&gt;ingest.py&lt;/a&gt;&lt;br&gt;
&lt;/p&gt;
&lt;/blockquote&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Fetch Google SERP via Bright Data and write a single DuckDB table.  

# Default: merge — skip queries that already have rows (same ``source_query`` string).  
# With ``--refresh``: delete all rows, then re-fetch every query in the file.  

import time  
from bright_data_serp import BrightDataSERPClient  

# resolve DuckDB + queries file and ensure data directory exists.  
def db_path() -&amp;gt; str:  
    return os.getenv("DUCKDB_PATH", str(_DIR / "data" / "serp.duckdb"))  
def queries_path() -&amp;gt; str:  
    return os.getenv("QUERIES_JSON", str(_DIR / "queries.json"))  
def ensure_data_dir() -&amp;gt; None:  
    Path(db_path()).parent.mkdir(parents=True, exist_ok=True)  

# Default table name; this can also be an env var if you have a multi-table database  
TABLE = "serp_results"  

# Deterministic primary key with SHA256. Re-fetching a query overwrites the same id rows.  
def row_id(url: str, source_query: str) -&amp;gt; str:  
    return hashlib.sha256(f"{url}t{source_query}".encode()).hexdigest()  

# Flatten SERP `organic` into DB-ready dicts  
def organic_to_rows(data: Dict[str, Any], source_query: str) -&amp;gt; List[Dict[str, Any]]:  
    organic = data.get("organic")  
    if not isinstance(organic, list):  
        return []  
    out: List[Dict[str, Any]] = []  
    for i, row in enumerate(organic):  
        if not isinstance(row, dict):  
            continue  
        # link vs url, description vs snippet — handle both  
        url = row.get("link") or row.get("url") or ""  
        if not url:  
            continue  
        title = (row.get("title") or "")[:8000]  
        snippet = (row.get("description") or row.get("snippet") or "") or ""  
        snippet = snippet[:16000]  
        pos = row.get("rank") or row.get("position") or (i + 1)  
        try:  
            position = int(pos)  
        except (TypeError, ValueError):  
            position = i + 1  
        domain = urlparse(url).netloc or ""  
        rid = row_id(url, source_query)  
        out.append(  
            {  
                "id": rid,  
                "source_query": source_query,  
                "url": url,  
                "title": title,  
                "snippet": snippet,  
                "domain": domain,  
                "position": position,  
            }  
        )  
    return out  


# Read query strings: either a top-level JSON array or passing in a full object ``{"queries": [...]}``.  
def load_queries(path: str) -&amp;gt; List[str]:  
    p = Path(path)  
    raw = json.loads(p.read_text(encoding="utf-8"))  
    if isinstance(raw, list):  
        return [str(x).strip() for x in raw if str(x).strip()]  
    if isinstance(raw, dict) and "queries" in raw:  
        return [str(x).strip() for x in raw["queries"] if str(x).strip()]  
    raise ValueError("queries.json must be a JSON array or {"queries": [...]}")  


# One-time table definition. Obviously, safe to run every ingest.  
def ensure_schema(con: duckdb.DuckDBPyConnection) -&amp;gt; None:  
    con.execute(  
        f"""  
        CREATE TABLE IF NOT EXISTS {TABLE} (  
            id VARCHAR PRIMARY KEY,  
            source_query VARCHAR NOT NULL,  
            url VARCHAR NOT NULL,  
            title VARCHAR,  
            snippet VARCHAR,  
            domain VARCHAR,  
            position INTEGER NOT NULL  
        )  
        """  
    )  

# 1) Connect and ensure the schema.   
# 2) If ``--refresh``, delete every row.   
# 3) For each query: in default merge mode, skip if that ``source_query`` already has rows, else call Bright Data.  
def main() -&amp;gt; None:  
    con = duckdb.connect(dpath)  
    ensure_schema(con)  
    if args.refresh:  
        con.execute(f"DELETE FROM {TABLE}")  # full wipe, then every query is fetched again  

    bd = BrightDataSERPClient()  
    query_list = load_queries(qpath)  
    for q in query_list:  
        if not args.refresh and con.execute(  
            f"SELECT COUNT(*) FROM {TABLE} WHERE source_query = ?",  
            [q],  
        ).fetchone()[0]:  
            continue  # merge: already have this ``source_query``  
        raw = bd.search(q, num_results=args.num_results)  
        rows = organic_to_rows(raw, q)  
        con.execute(f"DELETE FROM {TABLE} WHERE source_query = ?", [q])  
        if not rows:  
            time.sleep(args.delay)  # rate limit: full script paces on empty organics too  
            continue  
        con.executemany(  
            f"INSERT INTO {TABLE} (id, source_query, url, title, snippet, domain, position) VALUES (?, ?, ?, ?, ?, ?, ?)",  
            [  
                (r["id"], r["source_query"], r["url"], r["title"], r["snippet"], r["domain"], r["position"])  
                for r in rows  
            ],  
        )  
        time.sleep(args.delay)  # pace the script between calls. `ingest.py` also sleeps after a failed `search. Dropping this invites throttling risk.  

    # All done! now count rows, close the connection, and maybe stdout a quick summary
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Step 4: Embedding and Indexing Google Results in ChromaDB
&lt;/h2&gt;

&lt;p&gt;For embedding, I picked four fields, concatenated: &lt;code&gt;title&lt;/code&gt;, &lt;code&gt;snippet&lt;/code&gt;, &lt;code&gt;domain&lt;/code&gt;, &lt;code&gt;source query&lt;/code&gt;. I’m including &lt;code&gt;domain&lt;/code&gt;here because it carries implicit topical weight — &lt;code&gt;arxiv.org&lt;/code&gt; and &lt;code&gt;thinkrobotics.com&lt;/code&gt; &lt;em&gt;should&lt;/em&gt; mean something different even with identical text.&lt;/p&gt;

&lt;p&gt;Why add &lt;code&gt;source query&lt;/code&gt; as well? Because the same URL retrieved by two different searches &lt;em&gt;should&lt;/em&gt; embed slightly differently — it captures why the result was surfaced, not just what it says.&lt;/p&gt;

&lt;p&gt;We don’t need to go over the full embedding logic, so let’s just cover the most important bits.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Full code here: &lt;a href="https://github.com/sixthextinction/knn/blob/main/embed.py" rel="noopener noreferrer"&gt;embed.py&lt;/a&gt;&lt;br&gt;
&lt;/p&gt;
&lt;/blockquote&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# String fed to the embedding model  
# (title → snippet → domain → source query).  
def embedding_text(    title: "str, snippet: str, domain: str, source_query: str) -&amp;gt; str:  "
    t = (title or "").strip()  
    s = (snippet or "").strip()  
    d = (domain or "").strip()  
    q = (source_query or "").strip()  
    return f"{t}n{s}n{d}n{q}".strip()  


# POST /api/embed. New API returns "embeddings" (list of one vector per input)  
def ollama_embed_one(    host: str,  
    model: str,  
    text: str,  
    session: requests.Session,) -&amp;gt; List[float]:  
    url = host.rstrip("/") + "/api/embed"  
    r = session.post(  
        url,  
        json={"model": model, "input": text},  
        timeout=120,  
    )  
    r.raise_for_status()  
    data = r.json()  
    embs = data.get("embeddings")  
    if isinstance(embs, list) and embs and isinstance(embs[0], list):  
        return [float(x) for x in embs[0]]  
    one = data.get("embedding")  
    if isinstance(one, list):  
        return [float(x) for x in one]  
    raise RuntimeError(f"Ollama embed response missing embeddings: {data!r}")  

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Next, an excerpt of &lt;code&gt;main()&lt;/code&gt; in &lt;code&gt;embed.py&lt;/code&gt;. &lt;strong&gt;Notice that it deletes and recreates the Chroma collection every run&lt;/strong&gt;. That's deliberate: DuckDB is the source of truth, and rebuilding the vector collection keeps Chroma in sync without writing diff/upsert logic. The tradeoff is that embedding is all-or-nothing; this script is not doing incremental vector maintenance. 😅&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Read all DuckDB rows, drop/recreate the Chroma collection, embed, and add in batches.  
    con = duckdb.connect(dpath, read_only=True)  
    rows = con.execute(  
        f"SELECT id, source_query, title, snippet, domain FROM {TABLE} ORDER BY id"  
    ).fetchall()  
    con.close()  
    if not rows:  
        raise SystemExit(f"No rows in {TABLE}; run ingest first.")  

    client = chroma_client()  # CHROMA_HOST / CHROMA_PORT / CHROMA_SSL  
    name = args.collection  
    try:  
        client.delete_collection(name)  
    except Exception:  
        pass  # no collection yet on first run  
    collection = client.create_collection(  
        name=name,  
        metadata={"hnsw:space": "cosine"},  # cosine in Chroma matches query style in serve  
    )  

    session = requests.Session()  
    ids: List[str] = []  
    embeddings: List[List[float]] = []  
    batch_size = 32  
    for i, (rid, source_query, title, snippet, domain) in enumerate(rows):  
        text = embedding_text(  
            str(title or ""),  
            str(snippet or ""),  
            str(domain or ""),  
            str(source_query or ""),  
        )  
        if not text:  
            text = str(rid)  # last resort so Ollama never sees an empty string  
        emb = ollama_embed_one(args.ollama_host, args.model, text, session)  
        ids.append(str(rid))  
        embeddings.append(emb)  
        if len(ids) &amp;gt;= batch_size or i == len(rows) - 1:  
            collection.add(ids=ids, embeddings=embeddings)  
            print(f"Added {len(ids)} vectors (row {i + 1}/{len(rows)})")  
            ids = []  
            embeddings = []
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Step 5: Running Cosine k-NN Over a Merged Corpus
&lt;/h2&gt;

&lt;p&gt;At this point, our two data stores have distinct jobs:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  &lt;strong&gt;Chroma&lt;/strong&gt; knows vectors and row ids;&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;DuckDB&lt;/strong&gt; holds everything else — &lt;code&gt;title&lt;/code&gt;, &lt;code&gt;snippet&lt;/code&gt;, &lt;code&gt;URL&lt;/code&gt;, &lt;code&gt;domain&lt;/code&gt;, &lt;code&gt;position&lt;/code&gt;, &lt;code&gt;source_query&lt;/code&gt;.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Our k-NN implementation, therefore, is simple: look up the anchor in DuckDB → fetch its vector from Chroma → query for nearby ids → hydrate back from DuckDB. The Chroma layer can stay thin, all display fields come from one place.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Full code here: &lt;a href="https://github.com/sixthextinction/knn/blob/main/neighbors.py" rel="noopener noreferrer"&gt;neighbors.py&lt;/a&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Two implementation details you should know about:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  &lt;strong&gt;The numpy guard:&lt;/strong&gt; Chroma may return &lt;code&gt;embeddings&lt;/code&gt; as a nested list or an &lt;a href="https://numpy.org/doc/2.4/reference/generated/numpy.ndarray.html" rel="noopener noreferrer"&gt;ndarray&lt;/a&gt;. The usual &lt;code&gt;if not embs&lt;/code&gt; breaks on arrays, so &lt;code&gt;first_embedding_for_query&lt;/code&gt; normalizes the first vector without relying on truthiness.
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;def first_embedding_for_query(embs: Any) -&amp;gt; Optional[List[float]]:  
    # bc Chroma may return `embeddings` as a nested list or `ndarray`   
    # This avoids `if not embs` on arrays.  
    if embs is None:  
        return None  
    if isinstance(embs, np.ndarray):  
        if embs.size == 0:  
            return None  
        v = embs[0] if embs.ndim &amp;gt; 1 else embs  
        return v.tolist()  
    if isinstance(embs, (list, tuple)):  
        if len(embs) == 0 or embs[0] is None:  
            return None  
        v = embs[0]  
        return v.tolist() if hasattr(v, "tolist") else list(v)  
    return None
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ul&gt;
&lt;li&gt;  &lt;strong&gt;Hydration:&lt;/strong&gt; Chroma returns ids and distances, not the rows themselves. So &lt;code&gt;rows_by_ids&lt;/code&gt; fetches the DuckDB records for those ids, keyed by &lt;code&gt;id&lt;/code&gt;so Chroma's ranked order is preserved when distances are stitched back on.
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;def rows_by_ids(con: duckdb.DuckDBPyConnection, ids: list[str]) -&amp;gt; dict[str, dict]:  
    if not ids:  
        return {}  
    placeholders = ",".join(["?"] * len(ids))  
    out: dict[str, dict] = {}  
    for row in con.execute(  
        f"""  
        SELECT id, source_query, url, title, snippet, domain, position  
        FROM {TABLE} WHERE id IN ({placeholders})  
        """,  
        ids,  
    ).fetchall():  
        rid = str(row[0])  
        out[rid] = {  
            "id": rid,  
            "source_query": row[1],  
            "url": row[2],  
            "title": row[3],  
            "snippet": row[4],  
            "domain": row[5],  
            "position": row[6],  
        }  
    return out
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Note that our default path is always &lt;strong&gt;pure k-NN&lt;/strong&gt;: ask Chroma for &lt;code&gt;k+1&lt;/code&gt; nearest row ids, drop the anchor itself, hydrate from DuckDB.&lt;/p&gt;

&lt;p&gt;But I’ve also added a &lt;code&gt;cross_query_only&lt;/code&gt; flag — a UI filter toggled via a "Cross-query neighbors only" checkbox. Not pure k-NN, but useful to end users.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5fp69ui75y95k5xvqkmc.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5fp69ui75y95k5xvqkmc.png" width="640" height="324"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Results with cross_query_only switched on via UI&lt;/p&gt;

&lt;p&gt;When this is on, &lt;code&gt;compute_neighbors&lt;/code&gt; drops any candidate whose &lt;code&gt;source_query&lt;/code&gt; matches the anchor's. The nearest neighbors in vector space are often siblings from the same original search, so this lets you ask "show me related results from &lt;em&gt;other&lt;/em&gt; queries" without touching the index.&lt;/p&gt;

&lt;p&gt;Regardless, &lt;code&gt;compute_neighbors&lt;/code&gt; wires the two stores together.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;def compute_neighbors(anchor: str, k: int = DEFAULT_K, *, cross_query_only: bool = False) -&amp;gt; dict:  
    k = max(1, min(int(k), 50))  
    dpath = db_path()  

    # 1) DuckDB validates the anchor and gives us the row metadata, including  
    # ``source_query`` for cross-query filtering.  
    con = duckdb.connect(dpath, read_only=True)  
    try:  
        anchor_row = row_by_id(con, anchor)  
    finally:  
        con.close()  
    if not anchor_row:  
        return _neighbor_error("unknown id", k, cross_query_only=cross_query_only)  

    # 2) Chroma stores the vector under the same row id.  
    coll = chroma_client().get_collection(collection_name())  
    got = coll.get(ids=[anchor], include=["embeddings"])  
    vector = first_embedding_for_query(got.get("embeddings"))  
    if vector is None:  
        return _neighbor_error("no embedding for id (re-run embed.py)", k, cross_query_only=cross_query_only)  

    max_n = max(1, min(int(coll.count()), 5000))  
    neighbors: list[dict] = []  

    if not cross_query_only:  
        # Normal mode: ask for k + 1 because the nearest result is usually the anchor itself.  
        qres = coll.query(query_embeddings=[vector], n_results=min(k + 1, max_n), include=["distances"])  
        out_ids, out_dist = _ids_distances_from_query(qres, anchor)  
        out_ids, out_dist = out_ids[:k], out_dist[:k]  

        con = duckdb.connect(dpath, read_only=True)  
        try:  
            by_id = rows_by_ids(con, out_ids)  
        finally:  
            con.close()  
        for nid, dist in zip(out_ids, out_dist):  
            if nid in by_id:  
                neighbors.append({**by_id[nid], "distance": dist})  

    else:  
        # Cross-query mode: nearest neighbors often share the same Google query,  
        # so over-fetch, filter by ``source_query``, and widen if we still need more.  
        anchor_seed = str(anchor_row.get("source_query") or "").strip()  
        n_results = min(max(k * 4 + 1, k + 12, 24), max_n)  
        while len(neighbors) `&amp;lt; k and n_results &amp;lt;= max_n:  
            qres = coll.query(query_embeddings=[vector], n_results=n_results, include=["distances"])  
            out_ids, out_dist = _ids_distances_from_query(qres, anchor)  

            con = duckdb.connect(dpath, read_only=True)  
            try:  
                by_id = rows_by_ids(con, out_ids)  
            finally:  
                con.close()  

            neighbors = []  
            for nid, dist in zip(out_ids, out_dist):  
                row = by_id.get(nid)  
                if not row:  
                    continue  
                if str(row.get("source_query") or "").strip() == anchor_seed:  
                    continue  
                neighbors.append({**row, "distance": dist})  
                if len(neighbors) &amp;gt;`= k:  
                    break  

            if len(neighbors) &amp;gt;= k or n_results &amp;gt;= max_n:  
                break  
            n_results = min(max(n_results * 2, k + 1), max_n)  

    return _neighbor_ok(anchor_row, neighbors, k, cross_query_only)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Step 6: Serving ChromaDB Vectors with FastAPI
&lt;/h2&gt;

&lt;p&gt;The backend is prime gotcha territory. Two of them, specifically:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  &lt;strong&gt;Route order is load-bearing.&lt;/strong&gt; FastAPI evaluates routes in declaration order. &lt;a href="https://www.starlette.io/staticfiles/" rel="noopener noreferrer"&gt;StaticFiles&lt;/a&gt; with &lt;code&gt;html=True&lt;/code&gt; is a catch-all — it will attempt to serve any path that isn't already handled as a file. If you mount it before registering the API routes, every request to &lt;code&gt;/api/rows&lt;/code&gt; tries to find a file named &lt;code&gt;api/rows&lt;/code&gt; in the static directory and returns 404. Make sure you register the API routes first.&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Disable docs, redoc, and openapi.&lt;/strong&gt; For a local tool you’re restarting constantly, FastAPI’s schema introspection at startup is just noise.
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;PORT = int(os.getenv("SERVE_PORT", "8766"))  # override with env if needed  

app = FastAPI(  
    title="k-NN SERP",  
    docs_url=None,  
    redoc_url=None,  
    openapi_url=None,  
)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The actual API surface is minimal. We’ll only need two endpoints:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  First, &lt;code&gt;/api/rows&lt;/code&gt; dumps the full DuckDB corpus ordered by query then position — this is what populates the main table on load.&lt;/li&gt;
&lt;li&gt;  Next, &lt;code&gt;/api/neighbors&lt;/code&gt; takes an &lt;code&gt;id&lt;/code&gt; and &lt;code&gt;k&lt;/code&gt;, calls &lt;code&gt;compute_neighbors&lt;/code&gt;, and routes errors to the appropriate HTTP status codes via &lt;code&gt;_neighbors_http_response&lt;/code&gt;.&lt;/li&gt;
&lt;/ul&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;em&gt;💡&lt;/em&gt; The full project has a third endpoint, &lt;code&gt;/api/metrics&lt;/code&gt;, serving a precomputed &lt;code&gt;knn_metrics.json&lt;/code&gt; from &lt;code&gt;internal/knn_metrics.py&lt;/code&gt;. See the full code for that one.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;strong&gt;We have to make each failure mode distinct because the the frontend decides what to show based on it&lt;/strong&gt;. So, 404 for an unknown id, 400 for a missing embedding (re-run &lt;code&gt;embed.py&lt;/code&gt;), 503 for Chroma unreachable (Docker daemon not running etc.)&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Full code here: &lt;a href="https://github.com/sixthextinction/knn/blob/main/serve.py" rel="noopener noreferrer"&gt;serve.py&lt;/a&gt;&lt;br&gt;
&lt;/p&gt;
&lt;/blockquote&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;def _neighbors_http_response(result: dict) -&amp;gt; JSONResponse | dict:  
    err = result.get("error")  
    if not err:  
        return {  
            "anchor": result["anchor"],  
            "neighbors": result["neighbors"],  
            "k": result["k"],  
            "cross_query_only": result.get("cross_query_only", False),  
        }  
    if err == "unknown id":  
        return JSONResponse({...}, status_code=404)  
    if "DuckDB not found" in err:  
        return JSONResponse({...}, status_code=500)  
    if err.startswith("Chroma") or "Chroma" in err:  
        return JSONResponse({...}, status_code=503)  # dependency down  
    if "no embedding" in err:  
        return JSONResponse({...}, status_code=400)  # re-run embed.py  
    return JSONResponse({...}, status_code=500)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;@app.get("/api/rows", response_model=None)  
def api_rows() -&amp;gt; JSONResponse | dict[str, Any]:  
    con = duckdb.connect(db_path(), read_only=True)  
    try:  
        rows = con.execute(  
            f"SELECT id, source_query, url, title, snippet, domain, position"  
            f" FROM {TABLE} ORDER BY source_query, position, id"  
        ).fetchall()  
    finally:  
        con.close()  
    payload = [  
        {  
            "id": r[0],  
            "source_query": r[1],  
            "url": r[2],  
            "title": r[3],  
            "snippet": r[4],  
            "domain": r[5],  
            "position": r[6],  
        }  
        for r in rows  
    ]  
    return {"rows": payload}  


@app.get("/api/neighbors", response_model=None)  
def api_neighbors(    id: str | None = Query(default=None),  
    k: int = DEFAULT_K,  
    cross_query: str = "0",  # query params are strings; convert below) -&amp;gt; JSONResponse | dict[str, Any]:  
    anchor = (id or "").strip()  
    if not anchor:  
        return JSONResponse({"error": "missing id", ...}, status_code=400)  
    cross_query_only = cross_query.strip().lower() in ("1", "true", "yes", "on")  
    return _neighbors_http_response(compute_neighbors(anchor, k, cross_query_only=cross_query_only))  


app.mount(  
    "/",  
    StaticFiles(directory=str(STATIC), html=True),  
    name="ui",  
)  


def main() -&amp;gt; None:  
    print(f"k-NN SERP UI: http://127.0.0.1:{PORT}/")  
    uvicorn.run(app, host="127.0.0.1", port=PORT)  


# Must come last — catches all unhandled paths as static files  
app.mount("/", StaticFiles(directory=str(STATIC), html=True), name="ui")
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Step 7 — Serving a Neighbor Explorer UI with JavaScript
&lt;/h2&gt;

&lt;p&gt;I won’t go into UI design — this isn’t a frontend tutorial. Just use whatever rendering approach makes sense for you.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1qlceo8zrvcowlmpplhc.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1qlceo8zrvcowlmpplhc.png" width="640" height="364"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Filtered corpus view by "llama.cpp"&lt;/p&gt;

&lt;p&gt;So let’s just talk about the core JS we need — a &lt;code&gt;loadNeighbors&lt;/code&gt; function that hits&lt;code&gt;/api/neighbors&lt;/code&gt;, builds the anchor card, maps neighbors into a table row each with rank, cosine distance, title, domain, seed query, and snippet.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Full code here: &lt;a href="https://github.com/sixthextinction/knn/blob/main/static/index.html" rel="noopener noreferrer"&gt;/static/index.html&lt;/a&gt;&lt;br&gt;
&lt;/p&gt;
&lt;/blockquote&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;  async function loadNeighbors(id, tr) {  
  setRowActive(tr);  
  focusId = id;  
  const seq = ++loadSeq;  // capture sequence before any await  
  nPanel.hidden = false;  
  anchorBox.innerHTML = "`&amp;lt;p class="meta loading-cell"&amp;gt;`Loading…`&amp;lt;/p&amp;gt;`";  
  neighborsBox.innerHTML = "";  

  try {  
    const res = await fetch("/api/neighbors?id=" + encodeURIComponent(id) + "&amp;amp;k=8");  
    const data = await res.json();  
    if (seq !== loadSeq) return;  // a newer click landed first, discard this result  

    if (!res.ok) {  
      anchorBox.innerHTML = "`&amp;lt;p class="err"&amp;gt;`" + escapeHtml(data.error || res.statusText) + "`&amp;lt;/p&amp;gt;`";  
      return;  
    }  

    const a = data.anchor;  
    anchorBox.innerHTML = /* anchor card HTML */;  

    const n = data.neighbors || [];  
    neighborsBox.innerHTML = n.length === 0  
      ? "`&amp;lt;p class="meta"&amp;gt;`No neighbors (check embed.py / Chroma).`&amp;lt;/p&amp;gt;`"  
      : buildNeighborTable(n);  

  } catch (e) {  
    if (seq !== loadSeq) return;  
    anchorBox.innerHTML = "`&amp;lt;p class="err"&amp;gt;`" + escapeHtml(String(e)) + "`&amp;lt;/p&amp;gt;`";  
  }  
}  

(async function init() {  
  try {  
    const res = await fetch("/api/rows");  
    const data = await res.json();  
    allRows = data.rows || [];  
    renderTable(allRows);  
  } catch (e) {  
    rowMeta.textContent = "Failed to load /api/rows: " + e;  
  }  
})();
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;And that’s everything for code! Let’s look at some interesting results I found.&lt;/p&gt;

&lt;h2&gt;
  
  
  Results: What Cosine k-NN Reveals Across 100 Google Searches
&lt;/h2&gt;

&lt;p&gt;I ran five metrics over every document in the corpus to verify the pipeline was &lt;em&gt;actually&lt;/em&gt; bridging queries (rather than just clustering within them.) The answer was a resounding &lt;em&gt;yes&lt;/em&gt; — &lt;strong&gt;42.2% of all neighbor links crossed query boundaries&lt;/strong&gt;, and &lt;strong&gt;every one of the 797 documents had at least one cross-query neighbor in its top 8&lt;/strong&gt; — including the niche ones!&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;You can see the full raw data here: &lt;a href="https://github.com/sixthextinction/knn/blob/main/metrics.json" rel="noopener noreferrer"&gt;metrics.json&lt;/a&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;The per-query breakdown is where it gets &lt;em&gt;really&lt;/em&gt; interesting.&lt;/p&gt;

&lt;h3&gt;
  
  
  Hub Queries vs. Island Queries: How Semantic Density Varies by Topic
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fahwmccm6r3bp2ee4z5gg.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fahwmccm6r3bp2ee4z5gg.png" width="640" height="261"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqskjuyr56otsmqvc91eu.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqskjuyr56otsmqvc91eu.png" width="640" height="1424"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Cross-query neighbor rate by source query — 5 best examples of each on the left, the full list on the right (click to expand). Images created by author via D3.js.&lt;/p&gt;

&lt;p&gt;The &lt;code&gt;tinyML getting started Arduino&lt;/code&gt; query scores 95.3% — the highest in the corpus, and on the surface, a beginner tutorial query. But its documents live at a vocabulary crossroads: the k-NN pulled neighbors from 17 different source queries, spanning hardware datasheets, ArXiv surveys, RTOS scheduling guides, and mobile deployment docs. Without the merged corpus you'd only have a list of tutorials. With it, we see that this query sits at the center of the whole topic space. Turns out, a specific chip or runtime can be a crossroads of &lt;em&gt;multiple&lt;/em&gt; topics.&lt;/p&gt;

&lt;p&gt;The “island” end is just as revealing. &lt;code&gt;pruning vs quantization&lt;/code&gt;, &lt;code&gt;weight clustering&lt;/code&gt;, &lt;code&gt;knowledge distillation&lt;/code&gt; — all 12–19%. Clearly, model compression theory forms a tight, self-contained cluster that talks to itself fluently and barely touches the rest of the corpus. &lt;strong&gt;If you're researching compression techniques, you're in a separate conversation from the people researching deployment and hardware&lt;/strong&gt; — even though most would assume those worlds overlap.&lt;/p&gt;

&lt;h3&gt;
  
  
  How Query-to-Query Edges Reveal Hidden Connections
&lt;/h3&gt;

&lt;p&gt;The &lt;strong&gt;query-to-query edge count&lt;/strong&gt; measures how many neighbor links flow between each pair of source queries across the whole corpus:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F26oncwv4n6ardygayx7w.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F26oncwv4n6ardygayx7w.png" width="640" height="327"&gt;&lt;/a&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;| Rank | Query A | A→B | B→A | Total | Query B |  
| ---: | --- | --: | --: | --: | --- |  
| 1 | `site:pytorch.org mobile deployment` | 31 | 26 | 57 | `PyTorch ExecuTorch mobile deployment guide` |  
| 2 | `WebAssembly machine learning inference browser` | 27 | 24 | 51 | `WebGPU machine learning inference browser` |  
| 3 | `site:arxiv.org tinyML survey 2024` | 21 | 16 | 37 | `site:arxiv.org efficient LLM survey` |  
| 4 | `llama.cpp performance ARM CPU benchmark` | 18 | 17 | 35 | `llama.cpp vs MLC LLM phone comparison` |  
| 5 | `ONNX Runtime vs TensorFlow Lite 2025` | 18 | 13 | 31 | `TensorFlow Lite vs ONNX Runtime for edge deployment` |  
| 6 | `INT8 vs INT4 accuracy loss LLM` | 15 | 14 | 29 | `INT4 quantization large language model accuracy` |  
| 7 | `Whisper tiny on-device speech recognition` | 13 | 10 | 23 | `offline speech recognition Android` |  
| 8 | `MediaPipe on-device LLM inference` | 14 | 8 | 22 | `Hugging Face on-device inference blog` |  
| 9 | `memory footprint LLM quantization MB` | 11 | 10 | 21 | `KV cache quantization LLM` |  
| 10 | `WebNN API machine learning browser native` | 13 | 6 | 19 | `WebAssembly machine learning inference browser` |
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The site-scoped pairs at ranks 1 and 3 are worth a second look: &lt;code&gt;site:pytorch.org&lt;/code&gt; pairs tightly with the broader ExecuTorch guide; &lt;code&gt;site:arxiv.org&lt;/code&gt; pairs with the wider LLM survey. &lt;strong&gt;Our pipeline is detecting that a scoped search is a zoom-in on a broader topic — without being told.&lt;/strong&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Query Boundaries Barely Exist in Embedding Space
&lt;/h3&gt;

&lt;p&gt;For each document, I measured the cosine distance difference between its nearest same-query neighbor and its nearest cross-query neighbor. A cross-query neighbor at distance 0.246 was nearly as semantically close as a same-query neighbor at 0.202.&lt;/p&gt;

&lt;p&gt;Also, on average, such cross-query neighbors sit only a ~0.06 cosine distance away than a same-query one. That’s definitely not a loose thematic connection — in fact, it’s nearly as tight as the results Google already ranked together. &lt;strong&gt;Our pipeline is actually finding close ones that were never in the same search to begin with.&lt;/strong&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Conclusion: A Proximity-Based Knowledge Graph
&lt;/h2&gt;

&lt;p&gt;Going &lt;strong&gt;proximity-based&lt;/strong&gt; instead of Google’s traditional &lt;strong&gt;relevance-ranked&lt;/strong&gt;, and &lt;strong&gt;cross-query&lt;/strong&gt; instead of &lt;strong&gt;query-bound&lt;/strong&gt;, gives us something Google does not. The 42.2% cross-query rate and the 3.52 average unique queries per neighborhood are evidence that the semantic space over a merged corpus has structure that rewards exploration.&lt;/p&gt;

&lt;p&gt;This is a fine research tool, but also a setup for something even MORE useful.&lt;/p&gt;

&lt;p&gt;You could always swap &lt;a href="https://www.nomic.ai/" rel="noopener noreferrer"&gt;Nomic&lt;/a&gt; for a larger embedding model, add reranking, build a graph visualization over the query-to-query edges, or pipe this into a RAG system as a retrieval layer. If you take a shot at this, let me know in the comments, or just reach out on &lt;a href="https://www.linkedin.com/in/prithwish-nath-04b873a7/" rel="noopener noreferrer"&gt;LinkedIn.&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Some links in this article are tracking links used for analytics purposes only. I do not receive any commission or compensation from them.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>programming</category>
      <category>python</category>
      <category>opensource</category>
    </item>
    <item>
      <title>How Failing at Fantasy Baseball Made Me Fix My Cron Jobs with Temporal</title>
      <dc:creator>Prithwish Nath</dc:creator>
      <pubDate>Tue, 05 May 2026 16:03:31 +0000</pubDate>
      <link>https://dev.to/prithwish_nath/how-failing-at-fantasy-baseball-made-me-fix-my-cron-jobs-with-temporal-537c</link>
      <guid>https://dev.to/prithwish_nath/how-failing-at-fantasy-baseball-made-me-fix-my-cron-jobs-with-temporal-537c</guid>
      <description>&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8ouu18icd99hd9zntuew.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8ouu18icd99hd9zntuew.png" alt="Image" width="800" height="567"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;So I made a bad trade in my fantasy baseball league. Dropped Kaz Okamoto because — according to my data — he’d been cold for two weeks. In reality, he’s been on a tear for the last 9 days. 😅 &lt;strong&gt;This was a bad decision made because of bad data&lt;/strong&gt; — my stats cron job had hit a rate limit, exited with no errors, and my FastAPI backend kept serving a stale JSON snapshot.&lt;/p&gt;

&lt;p&gt;Well, I’d been meaning to fix that setup anyway. This time I did — and instead of patching the script, I tried out &lt;a href="https://github.com/temporalio/sdk-python" rel="noopener noreferrer"&gt;Temporal&lt;/a&gt; and…it worked &lt;em&gt;embarrassingly well&lt;/em&gt;. Retries, backoff, execution history — things I’d normally bolt on manually were just… there. And if the network layer itself was flaky — rate limits, geo blocks — I could just add a &lt;a href="https://get.brightdata.com/bd-what-is-a-residential-proxy?utm_content=how_failing_at_fantasy_baseball_made_me_fix_my_cron_jobs_with_temporal" rel="noopener noreferrer"&gt;proxy&lt;/a&gt; as a hardening layer.&lt;/p&gt;

&lt;p&gt;This actually prompted me to go look at some of our production ingest jobs at work, and I thought: &lt;em&gt;these are the same pattern, just with more surface area!&lt;/em&gt; I ended up swapping out one of them, tentatively, then another. Same pattern, just more scale.&lt;/p&gt;

&lt;p&gt;This is my (admittedly very casual) write-up of what I learned. I hope it’s useful!&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;💡&lt;/strong&gt; I use &lt;a href="https://docs.temporal.io/develop/python/" rel="noopener noreferrer"&gt;Temporal’s Python SDK&lt;/a&gt; here, but they have one for &lt;a href="https://docs.temporal.io/develop/typescript/" rel="noopener noreferrer"&gt;TypeScript&lt;/a&gt; too — if that’s your thing.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2&gt;
  
  
  How Cron Jobs Can Burn You
&lt;/h2&gt;

&lt;p&gt;Here’s the brittle script I was using:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# One-shot MLB.com player fetch + write  
&lt;/span&gt;
&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;main&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;player_url&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;os&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;getenv&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;PLAYER_URL&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;https://www.mlb.com/player/kazuma-okamoto-672960&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;out_dir&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Path&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;os&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;getenv&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;OUTPUT_DIR&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;./data/runs&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
    &lt;span class="n"&gt;out&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;out_dir&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;latest.json&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;

    &lt;span class="k"&gt;try&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;r&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;requests&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;player_url&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;timeout&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;60&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;r&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;status_code&lt;/span&gt; &lt;span class="o"&gt;!=&lt;/span&gt; &lt;span class="mi"&gt;200&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;WARN: HTTP &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;r&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;status_code&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;, leaving &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;out&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; unchanged&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="k"&gt;return&lt;/span&gt;
        &lt;span class="n"&gt;stats&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;extract_stats_datatable&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;r&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;out_dir&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;mkdir&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;parents&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;exist_ok&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;out&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;write_text&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;dumps&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;stats&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="n"&gt;encoding&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;utf-8&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Wrote stats to &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;out&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;except&lt;/span&gt; &lt;span class="nb"&gt;Exception&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;WARN: ingest failed (&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="si"&gt;!r}&lt;/span&gt;&lt;span class="s"&gt;), leaving &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;out&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; unchanged&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;__name__&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;__main__&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="nf"&gt;main&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That script lived behind a super simple crontab line — once a night, fixed schedule, stdout/stderr logs:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# crontab -l (excerpt)&lt;/span&gt;
0 2 &lt;span class="k"&gt;*&lt;/span&gt; &lt;span class="k"&gt;*&lt;/span&gt; &lt;span class="k"&gt;*&lt;/span&gt; &lt;span class="nb"&gt;cd&lt;/span&gt; /home/me/fantasy-stats &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; .venv/bin/python scripts/fetch_player_stats.py &lt;span class="o"&gt;&amp;gt;&amp;gt;&lt;/span&gt; /var/log/mlb_fetch.log 2&amp;gt;&amp;amp;1
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The script is about ~20 lines of Python plus one line of schedule. How many potential failures can you spot here?&lt;/p&gt;

&lt;p&gt;The first is the thing I &lt;em&gt;thought&lt;/em&gt; was prudent: &lt;strong&gt;if the fetch looks wrong, don’t overwrite the snapshot.&lt;/strong&gt; So any &lt;strong&gt;429&lt;/strong&gt;, &lt;strong&gt;timeout&lt;/strong&gt;, or &lt;strong&gt;200 with a layout that no longer contains the marker&lt;/strong&gt; &lt;code&gt;extract_stats_datatable&lt;/code&gt; expects becomes a printed &lt;code&gt;WARN&lt;/code&gt;, a no-op, and &lt;strong&gt;&lt;code&gt;main()&lt;/code&gt; returns — exit code 0.&lt;/strong&gt; No &lt;a href="https://requests.readthedocs.io/en/latest/api/#requests.Response.raise_for_status" rel="noopener noreferrer"&gt;&lt;code&gt;raise_for_status()&lt;/code&gt;&lt;/a&gt;; no &lt;code&gt;sys.exit(1)&lt;/code&gt;. Cron is “happy”; the one-line warning vanishes in a log I wasn’t tailing; &lt;code&gt;latest.json&lt;/code&gt; never updates. &lt;strong&gt;Nine “successful” runs later…I made a bad decision because I had bad data.&lt;/strong&gt; (Flip it to &lt;code&gt;raise_for_status()&lt;/code&gt; and you get the opposite smell: a non-zero exit, still no retry, still stale data until someone fixes the feed — pick your poison 😬)&lt;/p&gt;

&lt;p&gt;The other two are subtler — and I didn’t personally run into them — but upon review, they were just as likely to have burnt me.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  &lt;strong&gt;Fixed output path.&lt;/strong&gt; Every run writes to &lt;code&gt;latest.json&lt;/code&gt;. If two runs overlap — which happens the moment a run is slow and the next cron tick fires — it’s a race condition. One overwrites the other mid-write. You might read a corrupted file, or never know which run's data you actually have.&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Non-atomic write.&lt;/strong&gt; &lt;code&gt;out.write_text()&lt;/code&gt; is not atomic. If the process dies mid-write — OOM, signal, anything — you get a partial JSON file. The next reader gets a parse error and now has to figure out if the file is corrupted or just empty. This is the &lt;em&gt;exactly&lt;/em&gt; kind of bug that shows up at 2am on a production system.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The real problem isn’t any ONE of these — it’s really that &lt;strong&gt;cron gives you exactly one bit of feedback: exit zero or exit non-zero&lt;/strong&gt; — and as this script shows, &lt;strong&gt;exit zero can lie.&lt;/strong&gt; It can’t give you a retry policy, overlap protection, or artifact history. No way to answer “what did this job actually do at 3am last Tuesday?”&lt;/p&gt;

&lt;p&gt;And yes, you can try to patch around that. Add a retry loop and exponential backoff, ship logs somewhere. But now your retry state lives in the process memory that disappears on crash, your backoff is hand-rolled, and your observability is still pretty much log spelunking. At that point you’re not using cron anymore. &lt;strong&gt;You’re rebuilding a tiny, worse workflow engine around cron.&lt;/strong&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  What is Temporal?
&lt;/h2&gt;

&lt;p&gt;Temporal is a “durable execution platform”. What that really means in practice is that you’ll write ordinary functions — a &lt;a href="https://docs.temporal.io/workflows" rel="noopener noreferrer"&gt;Workflow&lt;/a&gt; that orchestrates things, and &lt;a href="https://docs.temporal.io/activities" rel="noopener noreferrer"&gt;Activities&lt;/a&gt; that do the actual work — and Temporal will:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  Make the execution survive process crashes,&lt;/li&gt;
&lt;li&gt;  Retry failed steps with backoff,&lt;/li&gt;
&lt;li&gt;  Prevent overlapping runs, and&lt;/li&gt;
&lt;li&gt;  Record the full history of every execution.&lt;/li&gt;
&lt;/ul&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;💡&lt;/strong&gt; &lt;a href="https://docs.temporal.io/evaluate/why-temporal#reliable-execution" rel="noopener noreferrer"&gt;Durable execution&lt;/a&gt; is the simple idea that your code should keep running to completion even if the machine running it doesn’t. The mental model is that your workflow is a function call that cannot be interrupted, even if the worker reboots halfway through. State lives in Temporal’s history, not in the worker’s memory.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;The architecture for our project will look like this (by default you get the &lt;a href="https://docs.temporal.io/web-ui" rel="noopener noreferrer"&gt;Temporal Web UI&lt;/a&gt; on &lt;code&gt;localhost:8233&lt;/code&gt;):&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6j2i3ebz9n7hyq44mzvd.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6j2i3ebz9n7hyq44mzvd.png" width="640" height="626"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;To grok Temporal properly, understand that &lt;strong&gt;the Workflow owns &lt;em&gt;when&lt;/em&gt; things happen, while the Activity owns &lt;em&gt;what&lt;/em&gt; happens&lt;/strong&gt; — i.e the page fetch, the stats extraction, the file write. &lt;strong&gt;Workflows&lt;/strong&gt; must stay deterministic; all side effects belong in &lt;strong&gt;Activities&lt;/strong&gt;. Why does that matter? We’ll come back to that in a second.&lt;/p&gt;

&lt;h2&gt;
  
  
  Getting player data from MLB.com
&lt;/h2&gt;

&lt;p&gt;Before any of the Temporal machinery, you need a reliable data ingestion. For us, that means loading an MLB.com player page and extracting the stats blob embedded in the initial HTML.&lt;/p&gt;

&lt;p&gt;MLB.com currently renders a player page with a JavaScript object that starts with &lt;code&gt;stats: {"statsDatatable"...}&lt;/code&gt;. That's convenient: no browser, no Playwright, no screenshot automation needed.&lt;/p&gt;

&lt;p&gt;Critical to understand that this does &lt;strong&gt;not&lt;/strong&gt; tell you that the stats blob is still there or that the row you care about was parsed correctly.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;mlb_player_stats.py&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;  
MLB.com player page: HTTP fetch + embedded JSON extraction.  
&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;  
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;__future__&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;annotations&lt;/span&gt;  
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;os&lt;/span&gt;  
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;re&lt;/span&gt;  
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;JSONDecoder&lt;/span&gt;  
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;typing&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Any&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Dict&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;List&lt;/span&gt;  
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;urllib.parse&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;quote&lt;/span&gt;  
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;requests&lt;/span&gt;  

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;_strip_tags&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;s&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;Any&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;Any&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;  
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="nf"&gt;isinstance&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;s&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;  
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;s&lt;/span&gt;  
    &lt;span class="n"&gt;s&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;re&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;sub&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;r&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;`&amp;lt;[^&amp;gt;`]+&amp;gt;&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;""&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;s&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;s&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;strip&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;  

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;_sanitize_row&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;row&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;Dict&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Any&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;Dict&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Any&lt;/span&gt;&lt;span class="p"&gt;]:&lt;/span&gt;  
    &lt;span class="n"&gt;out&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;Dict&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Any&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{}&lt;/span&gt;  
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;k&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;v&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;row&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;items&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;  
        &lt;span class="n"&gt;out&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;k&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;_strip_tags&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;v&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;out&lt;/span&gt;  

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;extract_stats_datatable&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;html&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;Dict&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Any&lt;/span&gt;&lt;span class="p"&gt;]:&lt;/span&gt;  
    &lt;span class="n"&gt;needle&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;stats: {&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;statsDatatable&lt;/span&gt;&lt;span class="sh"&gt;"'&lt;/span&gt;  
    &lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;html&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;find&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;needle&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;  
        &lt;span class="k"&gt;raise&lt;/span&gt; &lt;span class="nc"&gt;ValueError&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;  
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Could not find stats JSON marker (page layout may have changed).&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;  
        &lt;span class="p"&gt;)&lt;/span&gt;  
    &lt;span class="n"&gt;start&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;stats: &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  
    &lt;span class="n"&gt;obj&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;_&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;JSONDecoder&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="nf"&gt;raw_decode&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;html&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;start&lt;/span&gt;&lt;span class="p"&gt;:])&lt;/span&gt;  
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;obj&lt;/span&gt;  

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;pick_current_season_row&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;rows&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;List&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;Dict&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Any&lt;/span&gt;&lt;span class="p"&gt;]])&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;Dict&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Any&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;  
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;row&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;rows&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;  
        &lt;span class="n"&gt;h&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;row&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;header&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;""&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="nf"&gt;isinstance&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;h&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="ow"&gt;and&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Regular Season&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;h&lt;/span&gt; &lt;span class="ow"&gt;and&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Career&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;h&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;  
            &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;row&lt;/span&gt;  
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;rows&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;rows&lt;/span&gt; &lt;span class="k"&gt;else&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;  

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;build_requests_proxies&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;Dict&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;  
    &lt;span class="c1"&gt;# Bright Data super proxy; returns None if credentials unset  
&lt;/span&gt;    &lt;span class="n"&gt;explicit&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;os&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;getenv&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;BRIGHT_DATA_PROXY_URL&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;""&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;strip&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;  
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;explicit&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;  
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;http&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;explicit&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;https&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;explicit&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;  
    &lt;span class="n"&gt;host&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;os&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;getenv&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;BRIGHT_DATA_PROXY_HOST&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;brd.superproxy.io&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;strip&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;  
    &lt;span class="n"&gt;port&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;os&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;getenv&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;BRIGHT_DATA_PROXY_PORT&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;33335&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;strip&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;  
    &lt;span class="n"&gt;username&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;os&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;getenv&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;BRIGHT_DATA_PROXY_USERNAME&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;""&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;strip&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;  
    &lt;span class="n"&gt;password&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;os&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;getenv&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;BRIGHT_DATA_PROXY_PASSWORD&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;""&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;strip&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;  
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="n"&gt;username&lt;/span&gt; &lt;span class="ow"&gt;or&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="n"&gt;password&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;  
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;  
    &lt;span class="n"&gt;user_enc&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;quote&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;username&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;safe&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;""&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  
    &lt;span class="n"&gt;pass_enc&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;quote&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;password&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;safe&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;""&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  
    &lt;span class="n"&gt;proxy_url&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;http://&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;user_enc&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;:&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;pass_enc&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;@&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;host&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;:&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;port&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;  
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;http&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;proxy_url&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;https&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;proxy_url&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;  

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;fetch_player_page&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;player_url&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;timeout&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;int&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;60&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;requests&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Response&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;  
    &lt;span class="n"&gt;proxies&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;build_requests_proxies&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;  
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;requests&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;  
        &lt;span class="n"&gt;player_url&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  
        &lt;span class="n"&gt;timeout&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;timeout&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  
        &lt;span class="n"&gt;proxies&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;proxies&lt;/span&gt;  
    &lt;span class="p"&gt;)&lt;/span&gt;  

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;build_stats_payload&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;player_url&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;timeout&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;int&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;60&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;Dict&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Any&lt;/span&gt;&lt;span class="p"&gt;]:&lt;/span&gt;  
    &lt;span class="c1"&gt;# Fetch page, parse embedded hitting summary rows (sanitized)  
&lt;/span&gt;    &lt;span class="n"&gt;r&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;fetch_player_page&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;player_url&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;timeout&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;timeout&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  
    &lt;span class="n"&gt;r&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;raise_for_status&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;  
    &lt;span class="n"&gt;blob&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;extract_stats_datatable&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;r&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  
    &lt;span class="n"&gt;hitting_large&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;blob&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;statsDatatable&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;hitting&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;large&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;  
    &lt;span class="n"&gt;block&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;hitting_large&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="nf"&gt;isinstance&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;hitting_large&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nb"&gt;list&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;else&lt;/span&gt; &lt;span class="n"&gt;hitting_large&lt;/span&gt;  
    &lt;span class="n"&gt;filtered&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;block&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;filteredRows&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;  
    &lt;span class="n"&gt;current&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;pick_current_season_row&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;filtered&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  
    &lt;span class="n"&gt;career&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;next&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;  
        &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;row&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;row&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;filtered&lt;/span&gt; &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;row&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;header&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Career Regular Season&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;  
        &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  
    &lt;span class="p"&gt;)&lt;/span&gt;  
    &lt;span class="n"&gt;via_proxy&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;build_requests_proxies&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="ow"&gt;is&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;  
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;  
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;source_url&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;player_url&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;http_status&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;r&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;status_code&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;via_bright_data_proxy&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;via_proxy&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;current_regular_season&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nf"&gt;_sanitize_row&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;current&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;current&lt;/span&gt; &lt;span class="k"&gt;else&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;career_regular_season_row&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nf"&gt;_sanitize_row&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;career&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;career&lt;/span&gt; &lt;span class="k"&gt;else&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;all_summary_rows&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nf"&gt;_sanitize_row&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;row&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;row&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;filtered&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;  
    &lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Keeping the fetch inside an activity means the execution model stays unchanged while the &lt;strong&gt;network path can evolve independently&lt;/strong&gt;. The same &lt;code&gt;fetch_player_page&lt;/code&gt; call can run directly or be routed through a proxy layer without touching the workflow logic.&lt;/p&gt;

&lt;p&gt;Note that Temporal can only give you execution reliability: retries, timeouts, and visibility. &lt;strong&gt;It can not make a failing network succeed.&lt;/strong&gt; If every attempt returns 429 from the same IP, Temporal will reliably retry a failing request until the policy is exhausted — you won’t know how valuable our proxy layer is until you &lt;em&gt;really&lt;/em&gt; need it (if you’re following along, &lt;a href="https://get.brightdata.com/bd7914?utm_content=how_failing_at_fantasy_baseball_made_me_fix_my_cron_jobs_with_temporal" rel="noopener noreferrer"&gt;get it here&lt;/a&gt;.)&lt;/p&gt;

&lt;h2&gt;
  
  
  Data Extraction with Temporal.io
&lt;/h2&gt;

&lt;h3&gt;
  
  
  The Temporal Workflow
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="nd"&gt;@workflow.defn&lt;/span&gt;  
&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;StatsCollectionWorkflow&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;  
    &lt;span class="nd"&gt;@workflow.run&lt;/span&gt;  
    &lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;run&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;job&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;StatsJob&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;CollectStatsResult&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;  
        &lt;span class="n"&gt;info&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;workflow&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;info&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;  
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;workflow&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;execute_activity&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;  
            &lt;span class="n"&gt;collect_stats&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  
            &lt;span class="nc"&gt;CollectStatsInput&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;  
                &lt;span class="n"&gt;player_url&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;job&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;player_url&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  
                &lt;span class="n"&gt;workflow_id&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;info&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;workflow_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  
                &lt;span class="n"&gt;run_id&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;info&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;run_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  
                &lt;span class="n"&gt;output_dir&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;job&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;output_dir&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  
            &lt;span class="p"&gt;),&lt;/span&gt;  
            &lt;span class="n"&gt;start_to_close_timeout&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nf"&gt;timedelta&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;minutes&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;  
            &lt;span class="n"&gt;retry_policy&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nc"&gt;RetryPolicy&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;  
                &lt;span class="n"&gt;initial_interval&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nf"&gt;timedelta&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;seconds&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;  
                &lt;span class="n"&gt;backoff_coefficient&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;2.0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  
                &lt;span class="n"&gt;maximum_interval&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nf"&gt;timedelta&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;minutes&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;  
                &lt;span class="n"&gt;maximum_attempts&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;8&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  
            &lt;span class="p"&gt;),&lt;/span&gt;  
        &lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That &lt;a href="https://docs.temporal.io/retry-policies" rel="noopener noreferrer"&gt;RetryPolicy&lt;/a&gt; block is the part most of us have written manually at some point — a &lt;code&gt;while&lt;/code&gt; loop, a &lt;code&gt;try/except&lt;/code&gt;, a &lt;code&gt;time.sleep&lt;/code&gt;, a counter, hopefully a max attempts check. Here it's declared once, lives outside the business logic, and &lt;em&gt;survives worker crashes&lt;/em&gt;. If the worker process dies on attempt 3 of 8, the next worker that comes up picks up at attempt 4. The state is in Temporal, not in memory.&lt;/p&gt;

&lt;p&gt;The &lt;a href="https://docs.temporal.io/encyclopedia/detecting-activity-failures#start-to-close-timeout" rel="noopener noreferrer"&gt;start_to_close_timeout&lt;/a&gt; is the hard limit on how long a single activity attempt can run. Without it, a stalled HTTP request holds a worker slot indefinitely. Decidedly &lt;em&gt;not&lt;/em&gt; what we want.&lt;/p&gt;

&lt;h3&gt;
  
  
  The Temporal Activity
&lt;/h3&gt;

&lt;p&gt;Here’s &lt;code&gt;activities.py&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Activities: MLB.com fetch + stats extraction + artifact write (all side effects here).&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;  
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;__future__&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;annotations&lt;/span&gt;  
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt;  
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;os&lt;/span&gt;  
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;dataclasses&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;dataclass&lt;/span&gt;  
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;pathlib&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Path&lt;/span&gt;  
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;typing&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Any&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Dict&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Union&lt;/span&gt;  
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;temporalio&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;activity&lt;/span&gt;  
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;temporal_cron.mlb_player_stats&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;build_stats_payload&lt;/span&gt;  

&lt;span class="nd"&gt;@dataclass&lt;/span&gt;  
&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;CollectStatsInput&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;  
    &lt;span class="n"&gt;player_url&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;  
    &lt;span class="n"&gt;workflow_id&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;  
    &lt;span class="n"&gt;run_id&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;  
    &lt;span class="n"&gt;output_dir&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;  

&lt;span class="nd"&gt;@dataclass&lt;/span&gt;  
&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;CollectStatsResult&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;  
    &lt;span class="n"&gt;artifact_path&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;  
    &lt;span class="n"&gt;home_runs&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;Union&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nb"&gt;int&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;  
    &lt;span class="n"&gt;player_url&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;  

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;_atomic_write_json&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;path&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;Path&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;Dict&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Any&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;  
    &lt;span class="n"&gt;path&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;parent&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;mkdir&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;parents&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;exist_ok&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  
    &lt;span class="n"&gt;tmp&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;path&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;with_suffix&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;path&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;suffix&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;.tmp&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  
    &lt;span class="n"&gt;tmp&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;write_text&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;dumps&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;indent&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="n"&gt;encoding&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;utf-8&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  
    &lt;span class="n"&gt;tmp&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;replace&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;path&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  

&lt;span class="nd"&gt;@activity.defn&lt;/span&gt;  
&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;collect_stats&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;input&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;CollectStatsInput&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;CollectStatsResult&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;  
    &lt;span class="c1"&gt;# One HTTP try per activity attempt; workflow RetryPolicy owns backoff/attempts.  
&lt;/span&gt;    &lt;span class="n"&gt;data&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;build_stats_payload&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;input&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;player_url&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;timeout&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;60&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  
    &lt;span class="n"&gt;current&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;current_regular_season&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="ow"&gt;or&lt;/span&gt; &lt;span class="p"&gt;{}&lt;/span&gt;  
    &lt;span class="n"&gt;hr&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;current&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;homeRuns&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  
    &lt;span class="n"&gt;base&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Path&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;input&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;output_dir&lt;/span&gt; &lt;span class="ow"&gt;or&lt;/span&gt; &lt;span class="n"&gt;os&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;getenv&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;OUTPUT_DIR&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;./data/runs&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;  
    &lt;span class="n"&gt;safe_wid&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nb"&gt;input&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;workflow_id&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;replace&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;os&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;sep&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;_&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;replace&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;:&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;_&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  
    &lt;span class="n"&gt;safe_rid&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nb"&gt;input&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;run_id&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;replace&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;os&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;sep&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;_&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;replace&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;:&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;_&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  
    &lt;span class="n"&gt;out_path&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;base&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;safe_wid&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;__&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;safe_rid&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;.json&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;  
    &lt;span class="n"&gt;payload&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;  
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;workflow_id&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;input&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;workflow_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;run_id&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;input&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;run_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;player_url&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;input&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;player_url&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;data&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  
    &lt;span class="p"&gt;}&lt;/span&gt;  
    &lt;span class="nf"&gt;_atomic_write_json&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;out_path&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;payload&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nc"&gt;CollectStatsResult&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;  
        &lt;span class="n"&gt;artifact_path&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nf"&gt;str&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;out_path&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;resolve&lt;/span&gt;&lt;span class="p"&gt;()),&lt;/span&gt;  
        &lt;span class="n"&gt;home_runs&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;hr&lt;/span&gt; &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;hr&lt;/span&gt; &lt;span class="ow"&gt;is&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt; &lt;span class="k"&gt;else&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  
        &lt;span class="n"&gt;player_url&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nb"&gt;input&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;player_url&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  
    &lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;The output path uses the run ID, not a fixed filename.&lt;/strong&gt; Every execution gets its own artifact — &lt;code&gt;stats-manual-abc123__run456.json&lt;/code&gt;. No races, no overwrites, and you have a full history of every run. You can diff two runs. You can see exactly what data you had on any given night. This alone would have saved me.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The file write is atomic:&lt;/strong&gt; &lt;code&gt;_atomic_write_json&lt;/code&gt; (in the excerpt above) writes to a &lt;code&gt;.tmp&lt;/code&gt; file first, then &lt;code&gt;replace()&lt;/code&gt; on the same filesystem. A reader either sees the old file or the new file — never a partial write. The brittle script calls &lt;code&gt;write_text()&lt;/code&gt; directly; if the process died mid-write, you got corrupted JSON and a confusing parse error at the worst possible time.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;💡 Note that &lt;code&gt;collect_stats&lt;/code&gt; is a sync function, not async. That's intentional — sync activities run in a thread pool, so blocking I/O doesn't block the event loop. The &lt;a href="https://docs.temporal.io/develop/python/best-practices/python-sdk-sync-vs-async" rel="noopener noreferrer"&gt;Temporal SDK supports both&lt;/a&gt;; sync is the right call when your activity is mostly waiting on a network request.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h3&gt;
  
  
  The Temporal Worker
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;_main&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;  
    &lt;span class="n"&gt;client&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;Client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;connect&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;host&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;namespace&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;namespace&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  
    &lt;span class="n"&gt;worker&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Worker&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;  
        &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  
        &lt;span class="n"&gt;task_queue&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;task_queue&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  
        &lt;span class="n"&gt;workflows&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;StatsCollectionWorkflow&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;  
        &lt;span class="n"&gt;activities&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;collect_stats&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;  
    &lt;span class="p"&gt;)&lt;/span&gt;  
    &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;worker&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;run&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The &lt;a href="https://docs.temporal.io/develop/python/workers/" rel="noopener noreferrer"&gt;Worker&lt;/a&gt; is the only process that ever touches MLB.com. The workflow and scheduler are pure orchestration — they tell Temporal what to do, they don’t do any work themselves. You can scale workers horizontally without touching the scheduling layer. You’ll want that when you take this pattern to production.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5dqc11o8b785y6os5b48.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5dqc11o8b785y6os5b48.png" width="640" height="495"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The worker view shows the stats-pipeline worker polling alongside Temporal's own system worker.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Temporal Schedule
&lt;/h2&gt;

&lt;p&gt;Putting this on a schedule is one function call:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;schedule&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Schedule&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;  
    &lt;span class="n"&gt;action&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nc"&gt;ScheduleActionStartWorkflow&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;  
        &lt;span class="n"&gt;StatsCollectionWorkflow&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;run&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  
        &lt;span class="n"&gt;job&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  
        &lt;span class="nb"&gt;id&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;workflow_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  
        &lt;span class="n"&gt;task_queue&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;task_queue&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  
    &lt;span class="p"&gt;),&lt;/span&gt;  
    &lt;span class="n"&gt;spec&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nc"&gt;ScheduleSpec&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;cron_expressions&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;cron&lt;/span&gt;&lt;span class="p"&gt;]),&lt;/span&gt;  
    &lt;span class="n"&gt;policy&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nc"&gt;SchedulePolicy&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;overlap&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;ScheduleOverlapPolicy&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;SKIP&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;  
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://docs.temporal.io/schedule" rel="noopener noreferrer"&gt;SKIP&lt;/a&gt; &lt;strong&gt;is the thing cron simply cannot do.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;So, if MLB.com is slow today, or your fetches are getting rate-limited harder than usual, and your run takes 12 minutes. Your schedule fires every 15. Eventually the slow run bleeds into the next tick. With cron, you now have two instances running simultaneously, both writing to &lt;code&gt;latest.json&lt;/code&gt;, racing each other. With &lt;code&gt;SKIP&lt;/code&gt;, the new scheduled run sees the previous one is still active and does nothing. When the previous run finishes, the schedule resumes normally at the next tick. That's an entire class of bug you stop thinking about.&lt;/p&gt;

&lt;p&gt;The schedule script also handles create-or-update correctly — &lt;code&gt;describe()&lt;/code&gt; first, catch &lt;code&gt;NOT_FOUND&lt;/code&gt;, then either create or update:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;try&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;  
    &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;handle&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;describe&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;  
&lt;span class="k"&gt;except&lt;/span&gt; &lt;span class="n"&gt;RPCError&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;err&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;  
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;err&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;status&lt;/span&gt; &lt;span class="o"&gt;!=&lt;/span&gt; &lt;span class="n"&gt;RPCStatusCode&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;NOT_FOUND&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;  
        &lt;span class="k"&gt;raise&lt;/span&gt;  
    &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;create_schedule&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;schedule_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;schedule&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  
&lt;span class="k"&gt;else&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;  
    &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;handle&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;update&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;lambda&lt;/span&gt; &lt;span class="n"&gt;_input&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nc"&gt;ScheduleUpdate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;schedule&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;schedule&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Run it once to create the schedule. Run it again to change the cron expression or job parameters. Same command either way.&lt;/p&gt;

&lt;h2&gt;
  
  
  What you actually see in the UI
&lt;/h2&gt;

&lt;p&gt;Pop open &lt;code&gt;http://localhost:8233&lt;/code&gt; while a workflow is running. Every step is there — which activity ran, how many attempts it took, what the retry intervals were, what came back. If it failed on attempt 2 and succeeded on attempt 5, you can see that. You can see the exact input that went in and the exact output that came out. You can see how long each attempt took.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Frc8fyr3esqobmhx20oqx.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Frc8fyr3esqobmhx20oqx.png" width="640" height="372"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The workflow list gives you the first-level answer cron never gives cleanly: what ran, when, and whether it completed.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fh1raipxtwzvbi3uxam2o.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fh1raipxtwzvbi3uxam2o.png" width="640" height="495"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjtvbghcqe8smghlckpw5.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjtvbghcqe8smghlckpw5.png" width="640" height="372"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The timeline connects the workflow input, activity execution, and output artifact in one place. The event history is the audit trail: scheduled tasks, activity start/completion, workflow task transitions, and final result.&lt;/p&gt;

&lt;p&gt;The activity details will also show the successful third attempt and preserve the previous failure: &lt;code&gt;429 Too Many Requests&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6zid43pr4lsiqv7jwfpl.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6zid43pr4lsiqv7jwfpl.png" width="640" height="372"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Temporal activity event showing attempt #3 with previous HTTP 429 failure details&lt;/p&gt;

&lt;p&gt;Compare that to the cron version — just a process exit code and whatever you &lt;code&gt;print()&lt;/code&gt; to stdout. If the job failed at 3am and you weren't tailing logs, that information is gone.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;This observability is honestly why teams end up on Temporal even for jobs that aren’t that complicated.&lt;/strong&gt; This removes entire categories of debugging work. No log spelunking to figure out whether a retry happened. No guessing how many attempts ran. No reconstructing a timeline from scattered stdout. You won’t have to infer from fragments; all this is something you can just look up.&lt;/p&gt;

&lt;h2&gt;
  
  
  Running it yourself
&lt;/h2&gt;

&lt;p&gt;Commands below assume &lt;a href="https://docs.astral.sh/uv/" rel="noopener noreferrer"&gt;uv&lt;/a&gt; is available with a venv set up.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Start the local Temporal server  &lt;/span&gt;
temporal server start-dev  
&lt;span class="c"&gt;# In another terminal, start the worker  &lt;/span&gt;
uv run temporal-cron-worker  
&lt;span class="c"&gt;# Trigger a single run  &lt;/span&gt;
uv run temporal-cron-start  
&lt;span class="c"&gt;# Or put it on a schedule (defaults to every 15 minutes, set SCHEDULE_CRON in .env to change)  &lt;/span&gt;
uv run temporal-cron-schedule
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Open &lt;code&gt;http://localhost:8233&lt;/code&gt; and you'll see the workflow execution. Artifacts land under &lt;code&gt;data/runs/&lt;/code&gt;, one file per run, named by workflow and run ID.&lt;/p&gt;

&lt;h2&gt;
  
  
  A Note on Production
&lt;/h2&gt;

&lt;p&gt;This just runs against a local Temporal dev server with a single worker. Taking it to production means picking &lt;a href="https://temporal.io/cloud" rel="noopener noreferrer"&gt;Temporal Cloud&lt;/a&gt; or running your own cluster, adding proper secrets management, structured logging, metrics, and deployment automation.&lt;/p&gt;

&lt;p&gt;Cron is all you need for simple, isolated tasks. Probably not so much when you have to introduce retries, external dependencies, or jobs that can overlap or run longer than their schedule.&lt;/p&gt;

&lt;p&gt;When you get into that zone, Temporal gives you a durable execution layer around ingest: retries, timeouts, overlap control, and a history you can audit. And then when that layer needs to scale up or if you start hitting IP-based (or geo-based, even) friction, Bright Data’s proxies give you the hardened network path you’ll need. It’s been a pretty natural pairing, in my experience. You can route the same &lt;code&gt;requests.get()&lt;/code&gt; call through a proxy and let Temporal keep owning retries, timeouts, and audit history.&lt;/p&gt;

&lt;p&gt;Regardless, this &lt;em&gt;pattern&lt;/em&gt; is a cheat code right here for Temporal: &lt;strong&gt;Workflows own orchestration, and Activities own side effects.&lt;/strong&gt;&lt;/p&gt;

</description>
      <category>python</category>
      <category>programming</category>
      <category>devops</category>
      <category>automation</category>
    </item>
    <item>
      <title>I Built a $0 Search Engine on Real Web Data (No Algolia or Elasticsearch)</title>
      <dc:creator>Prithwish Nath</dc:creator>
      <pubDate>Tue, 21 Apr 2026 06:43:00 +0000</pubDate>
      <link>https://dev.to/prithwish_nath/i-built-a-0-search-engine-on-real-web-data-no-algolia-or-elasticsearch-26fl</link>
      <guid>https://dev.to/prithwish_nath/i-built-a-0-search-engine-on-real-web-data-no-algolia-or-elasticsearch-26fl</guid>
      <description>&lt;p&gt;I’ve been reviewing some old RAG code I wrote a year ago and boy has it &lt;em&gt;not&lt;/em&gt; aged well. For context: frequently, I’ll need to do what I call a fast “literature” pass. &lt;em&gt;What’s the latest opinion on long context vs. RAG? What’s new for agentic retrieval? Where does hybrid search fit in 2026?&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;I depend heavily on &lt;a href="https://arxiv.org/" rel="noopener noreferrer"&gt;arxiv&lt;/a&gt; papers for this. Using a SERP API (&lt;a href="https://get.brightdata.com/bd-serp-api?utm_content=i_built_a_zero_dollar_search_engine_on_real_web_data_no_algolia_or_elasticsearch" rel="noopener noreferrer"&gt;Bright Data&lt;/a&gt;) for Google, running &lt;code&gt;site:arxiv.org&lt;/code&gt; plus a well-framed query gets me recent, relevant papers. I’d run four or five of these, collate the results…and then inevitably get bogged down scrolling JSON, opening new tabs, running &lt;code&gt;grep&lt;/code&gt;. Just godawful UX for what is, at its core, a &lt;em&gt;search&lt;/em&gt; problem (not data).&lt;/p&gt;

&lt;p&gt;So I spent this past week refactoring it all, and eventually ended up building a local faceted search surface over real web data.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  I fetch organic Google results (&lt;code&gt;site:arxiv.org + query&lt;/code&gt;)&lt;/li&gt;
&lt;li&gt;  Index them with &lt;a href="https://typesense.org/" rel="noopener noreferrer"&gt;Typesense&lt;/a&gt; (a FOSS, lightning-fast, local-first Algolia alternative)&lt;/li&gt;
&lt;li&gt;  A few lines of Python server code to proxy queries from the browser.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This gives me a live, filterable index where I now quickly search across all my query runs at once, see which papers surfaced under which query angle, and spot overlaps in seconds instead of minutes.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fg8a4h83fbnd46cwdl0cv.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fg8a4h83fbnd46cwdl0cv.jpeg" alt="screenshot showing Faceted search over indexed SERP rows: keyword query, seed-query chips, domain chip, and result cards with provenance" width="640" height="640"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This pattern should work for &lt;em&gt;any&lt;/em&gt; research domain where Google is a better discovery layer than the source’s own search, so &lt;a href="https://github.com/sixthextinction/typesense" rel="noopener noreferrer"&gt;I’m open sourcing it&lt;/a&gt; and writing this up. I hope it’s useful!&lt;/p&gt;

&lt;p&gt;Find the full code here:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://github.com/sixthextinction/typesense" rel="noopener noreferrer"&gt;GitHub - sixthextinction/typesense: POC for local-first Algolia-style search but FOSS. Ingests…POC for local-first Algolia-style search but FOSS.&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Prerequisites
&lt;/h2&gt;




&lt;p&gt;I wasn’t kidding about keeping the Python side minimal. We’ll only need Docker, and three Python packages (no frameworks):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;requests&amp;gt;=2.28.0  
python-dotenv&amp;gt;=1.0.0  
typesense&amp;gt;=0.21.0
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The only one worth mentioning is the &lt;a href="https://github.com/typesense/typesense-python" rel="noopener noreferrer"&gt;Typesense Python client&lt;/a&gt; — it handles schema creation, JSONL import, and search. The other two are bog-standard &lt;a href="https://requests.readthedocs.io/en/latest/" rel="noopener noreferrer"&gt;Requests&lt;/a&gt; and &lt;a href="https://pypi.org/project/python-dotenv/" rel="noopener noreferrer"&gt;python-dotenv&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Typesense itself runs in &lt;a href="https://docs.docker.com/compose/" rel="noopener noreferrer"&gt;Docker Compose&lt;/a&gt;. Let’s make it one container with a persistent volume that survives restarts:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;docker-compose.yml&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;services:  
  typesense:  
    image: typesense/typesense:26.0  
    restart: unless-stopped  
    ports:  
      - "8108:8108"  
    volumes:  
      - typesense-data:/data  
    command: &amp;gt;  
      --data-dir /data  
      --api-key devtypesense  
      --listen-port 8108  
      --enable-cors  

volumes:  
  typesense-data:
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# and then you can do  
docker compose up -d
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;.env&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;BRIGHT_DATA_API_KEY=your_api_key  
BRIGHT_DATA_ZONE=serp  
BRIGHT_DATA_COUNTRY=us  
TYPESENSE_API_KEY=devtypesense
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The &lt;code&gt;TYPESENSE_API_KEY&lt;/code&gt; can be anything really — it just has to match the &lt;code&gt;--api-key&lt;/code&gt; flag in the compose file. I'll explain why the browser never sees it when we get to &lt;code&gt;serve.py&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://get.brightdata.com/lp-scraping-browser-acf1964?utm_content=i_built_a_zero_dollar_search_engine_on_real_web_data_no_algolia_or_elasticsearch" rel="noopener noreferrer"&gt;Bright Data credentials come from your account.&lt;/a&gt; If you’re swapping in another SERP API, this is the only file you’d change.&lt;/p&gt;

&lt;h2&gt;
  
  
  How the pieces fit together
&lt;/h2&gt;




&lt;p&gt;We have four files:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;bright_data_serp.py   # Bright Data SERP client  
ingest.py             # fetch → transform → upsert into Typesense  
serve.py              # /api/search proxy + static file server  
static/index.html     # search UI
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;code&gt;ingest.py&lt;/code&gt; sends queries to Bright Data, maps each organic result to a Typesense document, and bulk-imports the batch. After that, &lt;code&gt;serve.py&lt;/code&gt; sits between the browser and Typesense — authenticated calls go out, plain JSON comes back. The browser never talks to Typesense directly.&lt;/p&gt;

&lt;p&gt;Let’s go through each.&lt;/p&gt;

&lt;h2&gt;
  
  
  How to get structured SERP data from Bright Data
&lt;/h2&gt;




&lt;p&gt;Our client will just &lt;code&gt;POST&lt;/code&gt; to &lt;code&gt;https://api.brightdata.com/request&lt;/code&gt; with a Bearer token, a zone name, and a Google URL string.&lt;/p&gt;

&lt;p&gt;Critically, you have to include &lt;code&gt;brd_json=1&lt;/code&gt;. Without it you get raw HTML. With it, you get a parsed &lt;code&gt;organic&lt;/code&gt; JSON array — each row has &lt;code&gt;title&lt;/code&gt;, &lt;code&gt;link&lt;/code&gt;, &lt;code&gt;description&lt;/code&gt;, &lt;code&gt;rank&lt;/code&gt;, and usually more.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;bright_data_serp.py&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt;  
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;os&lt;/span&gt;  
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;time&lt;/span&gt;  
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;typing&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Any&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Dict&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Optional&lt;/span&gt;  

&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;requests&lt;/span&gt;  
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;dotenv&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;load_dotenv&lt;/span&gt;  

&lt;span class="nf"&gt;load_dotenv&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;  


&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;limit_organic&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;Dict&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Any&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;max_results&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;int&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;Dict&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Any&lt;/span&gt;&lt;span class="p"&gt;]:&lt;/span&gt;  
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Keep at most ``max_results`` organic rows. Google/Bright Data often ignore ``&amp;amp;num=``; slice client-side.&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;  
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;max_results&lt;/span&gt; &lt;span class="err"&gt;`&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;  
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;data&lt;/span&gt;  
    &lt;span class="n"&gt;organic&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;organic&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="nf"&gt;isinstance&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;organic&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nb"&gt;list&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="ow"&gt;and&lt;/span&gt; &lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;organic&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt;&lt;span class="err"&gt;`&lt;/span&gt; &lt;span class="n"&gt;max_results&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;  
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;\&lt;span class="o"&gt;*&lt;/span&gt;\&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;organic&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;organic&lt;/span&gt;&lt;span class="p"&gt;[:&lt;/span&gt;&lt;span class="n"&gt;max_results&lt;/span&gt;&lt;span class="p"&gt;]}&lt;/span&gt;  
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;data&lt;/span&gt;  


&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;BrightDataSERPClient&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;  
    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;__init__&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;  
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  
        &lt;span class="n"&gt;api_key&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;Optional&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  
        &lt;span class="n"&gt;zone&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;Optional&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  
        &lt;span class="n"&gt;country&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;Optional&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  
    &lt;span class="p"&gt;):&lt;/span&gt;  
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;api_key&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;api_key&lt;/span&gt; &lt;span class="ow"&gt;or&lt;/span&gt; &lt;span class="n"&gt;os&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;getenv&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;BRIGHT_DATA_API_KEY&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;zone&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;zone&lt;/span&gt; &lt;span class="ow"&gt;or&lt;/span&gt; &lt;span class="n"&gt;os&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;getenv&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;BRIGHT_DATA_ZONE&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;country&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;country&lt;/span&gt; &lt;span class="ow"&gt;or&lt;/span&gt; &lt;span class="n"&gt;os&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;getenv&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;BRIGHT_DATA_COUNTRY&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;api_endpoint&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;https://api.brightdata.com/request&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;  

        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;api_key&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;  
            &lt;span class="k"&gt;raise&lt;/span&gt; &lt;span class="nc"&gt;ValueError&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;BRIGHT_DATA_API_KEY is required.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;zone&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;  
            &lt;span class="k"&gt;raise&lt;/span&gt; &lt;span class="nc"&gt;ValueError&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;BRIGHT_DATA_ZONE is required.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  

        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;session&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;requests&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Session&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;  
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;session&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;headers&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;update&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;  
            &lt;span class="p"&gt;{&lt;/span&gt;  
                &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Content-Type&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;application/json&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  
                &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Authorization&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Bearer &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;api_key&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  
            &lt;span class="p"&gt;}&lt;/span&gt;  
        &lt;span class="p"&gt;)&lt;/span&gt;  

    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;search&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;  
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  
        &lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  
        &lt;span class="n"&gt;num_results&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;int&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  
        &lt;span class="n"&gt;language&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;Optional&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  
        &lt;span class="n"&gt;country&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;Optional&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  
        &lt;span class="n"&gt;max_retries&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;int&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  
    &lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;Dict&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Any&lt;/span&gt;&lt;span class="p"&gt;]:&lt;/span&gt;  
        &lt;span class="n"&gt;last_err&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;Optional&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;Exception&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;  
        &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;attempt&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nf"&gt;range&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;max_retries&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;  
            &lt;span class="k"&gt;try&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;  
                &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;_do_search&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;num_results&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;language&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;country&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  
            &lt;span class="k"&gt;except&lt;/span&gt; &lt;span class="nb"&gt;Exception&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;  
                &lt;span class="n"&gt;last_err&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;e&lt;/span&gt;  
                &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;attempt&lt;/span&gt; &lt;span class="err"&gt;`&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="n"&gt;max_retries&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;  
                    &lt;span class="n"&gt;time&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;sleep&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mf"&gt;0.5&lt;/span&gt; \&lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;attempt&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;  
        &lt;span class="k"&gt;assert&lt;/span&gt; &lt;span class="n"&gt;last_err&lt;/span&gt; &lt;span class="ow"&gt;is&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;  
        &lt;span class="k"&gt;raise&lt;/span&gt; &lt;span class="n"&gt;last_err&lt;/span&gt;  

    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;_do_search&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;  
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  
        &lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  
        &lt;span class="n"&gt;num_results&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;int&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  
        &lt;span class="n"&gt;language&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;Optional&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;  
        &lt;span class="n"&gt;country&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;Optional&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;  
    &lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt;&lt;span class="err"&gt;`&lt;/span&gt; &lt;span class="n"&gt;Dict&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Any&lt;/span&gt;&lt;span class="p"&gt;]:&lt;/span&gt;  
        &lt;span class="c1"&gt;# Omit &amp;amp;num=: deprecated by Google (Bright Data strips it); use limit_organic after fetch.  
&lt;/span&gt;        &lt;span class="n"&gt;search_url&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;  
            &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;https://www.google.com/search&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;  
            &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;?q=&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;requests&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;utils&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;quote&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;  
            &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;&amp;amp;brd_json=1&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;  
        &lt;span class="p"&gt;)&lt;/span&gt;  
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;language&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;  
            &lt;span class="n"&gt;search_url&lt;/span&gt; &lt;span class="o"&gt;+=&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;&amp;amp;hl=&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;language&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;&amp;amp;lr=lang_&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;language&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;  
        &lt;span class="n"&gt;target_country&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;country&lt;/span&gt; &lt;span class="ow"&gt;or&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;country&lt;/span&gt;  
        &lt;span class="n"&gt;payload&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;Dict&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Any&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;  
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;zone&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;zone&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;url&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;search_url&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;format&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;json&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  
        &lt;span class="p"&gt;}&lt;/span&gt;  
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;target_country&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;  
            &lt;span class="n"&gt;payload&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;country&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;target_country&lt;/span&gt;  

        &lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;session&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;post&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;api_endpoint&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;payload&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;timeout&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;60&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  
        &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;raise_for_status&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;  
        &lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;json&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;  
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="nf"&gt;isinstance&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nb"&gt;dict&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;  
            &lt;span class="k"&gt;raise&lt;/span&gt; &lt;span class="nc"&gt;RuntimeError&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Bright Data unexpected response type: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="nf"&gt;type&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  
        &lt;span class="n"&gt;inner_status&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;status_code&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;inner_status&lt;/span&gt; &lt;span class="ow"&gt;is&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt; &lt;span class="ow"&gt;and&lt;/span&gt; &lt;span class="n"&gt;inner_status&lt;/span&gt; &lt;span class="o"&gt;!=&lt;/span&gt; &lt;span class="mi"&gt;200&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;  
            &lt;span class="k"&gt;raise&lt;/span&gt; &lt;span class="nc"&gt;RuntimeError&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Bright Data SERP status_code=&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;inner_status&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;body&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;  
            &lt;span class="n"&gt;body&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;body&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;  
            &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="nf"&gt;isinstance&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;body&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;  
                &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="n"&gt;body&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;strip&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;  
                    &lt;span class="k"&gt;raise&lt;/span&gt; &lt;span class="nc"&gt;RuntimeError&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Bright Data SERP empty body&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  
                &lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;loads&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;body&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  
            &lt;span class="k"&gt;else&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;  
                &lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;body&lt;/span&gt;  
        &lt;span class="k"&gt;elif&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;organic&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;  
            &lt;span class="k"&gt;raise&lt;/span&gt; &lt;span class="nc"&gt;RuntimeError&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Bright Data response missing &lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;body&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt; and &lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;organic&lt;/span&gt;&lt;span class="sh"&gt;'"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nf"&gt;limit_organic&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;num_results&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;A very common gotcha: no matter how intuitive it might feel, do not put &lt;code&gt;&amp;amp;num=&lt;/code&gt; on that search URL to request N results like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;search_url = (  
    f"https://www.google.com/search"  
    f"?q={requests.utils.quote(query)}"  
    f"&amp;amp;num=50"  
    f"&amp;amp;brd_json=1"  
)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Google deprecated the &lt;code&gt;num&lt;/code&gt; parameter for ordinary web search back in September 2025. Now, you typically get about one page of organics (~10). So we'll have to cap rows in code with &lt;code&gt;limit_organic(..., num_results)&lt;/code&gt; — slice &lt;code&gt;organic&lt;/code&gt; after the response, not via the URL.&lt;/p&gt;

&lt;p&gt;With &lt;code&gt;"format": "json"&lt;/code&gt;, the JSON you parse from the HTTP response is an envelope, not the SERP object itself: &lt;code&gt;status_code&lt;/code&gt;, &lt;code&gt;headers&lt;/code&gt;, and &lt;code&gt;body&lt;/code&gt;. The real SERP payload is inside &lt;code&gt;body&lt;/code&gt;, usually as a JSON string you must &lt;code&gt;json.loads&lt;/code&gt; again.&lt;/p&gt;

&lt;p&gt;That means only a 200 from &lt;code&gt;api.brightdata.com&lt;/code&gt; is not enough: check the inner &lt;code&gt;status_code&lt;/code&gt; (e.g. 401 → empty &lt;code&gt;body&lt;/code&gt;). The client rejects non-200 inner status, empty &lt;code&gt;body&lt;/code&gt;, and missing &lt;code&gt;organic&lt;/code&gt; after unwrap so ingest doesn’t silently index nothing.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;result = response.json()  
inner = result.get("status_code")  
if inner is not None and inner != 200:  
    raise RuntimeError(f"Bright Data SERP status_code={inner}")  
if "body" in result:  
    body = result["body"]  
    if isinstance(body, str):  
        if not body.strip():  
            raise RuntimeError("Bright Data SERP empty body")  
        result = json.loads(body)  
    else:  
        result = body  
# ...  
return limit_organic(result, num_results)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If you skip the unwrap and pass the top-level dict to &lt;code&gt;organic_to_documents&lt;/code&gt;, there's no &lt;code&gt;organic&lt;/code&gt; key — and with no check, you get an empty index and no error message. It just silently indexes &lt;em&gt;nothing&lt;/em&gt;. (Ask me how I know.🙃)&lt;/p&gt;

&lt;p&gt;Finally, our client retries with a short backoff — &lt;code&gt;0.5s * (attempt + 1)&lt;/code&gt; — so a transient failure on one query doesn't kill the whole run.&lt;/p&gt;

&lt;h2&gt;
  
  
  How to design the Typesense Schema
&lt;/h2&gt;




&lt;p&gt;Typesense needs a &lt;a href="https://typesense.org/docs/0.25.0/api/collections.html" rel="noopener noreferrer"&gt;collection&lt;/a&gt; before anything can go in. The schema maps directly to the shape of an organic SERP result — I didn’t add any fields I wasn’t already getting for free:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;ingest.py&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Fetches Google SERP via Bright Data THEN indexes organic results into Typesense.  
# Use --append to upsert into an existing index.   
# Use --query and/or --queries-file to override the built-in demo query list.  
&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;argparse&lt;/span&gt;  
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;hashlib&lt;/span&gt;  
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt;  
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;os&lt;/span&gt;  
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;time&lt;/span&gt;  
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;pathlib&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Path&lt;/span&gt;  
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;typing&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Any&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Dict&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;List&lt;/span&gt;  
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;urllib.parse&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;urlparse&lt;/span&gt;  

&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;typesense&lt;/span&gt;  
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;dotenv&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;load_dotenv&lt;/span&gt;  
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;typesense.exceptions&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;ObjectNotFound&lt;/span&gt;  

&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;bright_data_serp&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;BrightDataSERPClient&lt;/span&gt;  

&lt;span class="nf"&gt;load_dotenv&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;  

&lt;span class="n"&gt;COLLECTION&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;serp_results&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;  

&lt;span class="c1"&gt;# Some obvious "RAG and retrieval" topics  
&lt;/span&gt;&lt;span class="n"&gt;DEFAULT_QUERIES&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;  
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;site:arxiv.org retrieval augmented generation 2026&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;site:arxiv.org hybrid search reranking 2026&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;site:arxiv.org agentic RAG 2026&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;site:arxiv.org long context vs RAG 2026&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  
&lt;span class="p"&gt;]&lt;/span&gt;  


&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;typesense_client&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;typesense&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Client&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;  
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;typesense&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Client&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;  
        &lt;span class="p"&gt;{&lt;/span&gt;  
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;nodes&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;  
                &lt;span class="p"&gt;{&lt;/span&gt;  
                    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;host&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;os&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;getenv&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;TYPESENSE_HOST&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;localhost&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;  
                    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;port&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;os&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;getenv&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;TYPESENSE_PORT&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;8108&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;  
                    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;protocol&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;os&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;getenv&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;TYPESENSE_PROTOCOL&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;http&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;  
                &lt;span class="p"&gt;}&lt;/span&gt;  
            &lt;span class="p"&gt;],&lt;/span&gt;  
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;api_key&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;os&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;environ&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;TYPESENSE_API_KEY&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;  
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;connection_timeout_seconds&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;30&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  
        &lt;span class="p"&gt;}&lt;/span&gt;  
    &lt;span class="p"&gt;)&lt;/span&gt;  


&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;collection_schema&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;Dict&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Any&lt;/span&gt;&lt;span class="p"&gt;]:&lt;/span&gt;  
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;  
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;name&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;COLLECTION&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;fields&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;  
            &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;name&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;title&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;type&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;string&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;  
            &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;name&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;url&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;type&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;string&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;  
            &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;name&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;snippet&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;type&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;string&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;optional&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;  
            &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;name&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;source_query&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;type&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;string&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;facet&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;  
            &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;name&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;domain&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;type&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;string&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;facet&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;  
            &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;name&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;position&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;type&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;int32&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;  
        &lt;span class="p"&gt;],&lt;/span&gt;  
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;default_sorting_field&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;position&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  
    &lt;span class="p"&gt;}&lt;/span&gt;  


&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;organic_to_documents&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;    &lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;Dict&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Any&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;source_query&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;List&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;Dict&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Any&lt;/span&gt;&lt;span class="p"&gt;]]:&lt;/span&gt;  
    &lt;span class="n"&gt;organic&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;organic&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="nf"&gt;isinstance&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;organic&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nb"&gt;list&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;  
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;  
    &lt;span class="n"&gt;out&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;List&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;Dict&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Any&lt;/span&gt;&lt;span class="p"&gt;]]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;  
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;row&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nf"&gt;enumerate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;organic&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;  
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="nf"&gt;isinstance&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;row&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nb"&gt;dict&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;  
            &lt;span class="k"&gt;continue&lt;/span&gt;  
        &lt;span class="n"&gt;url&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;row&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;link&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="ow"&gt;or&lt;/span&gt; &lt;span class="n"&gt;row&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;url&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="ow"&gt;or&lt;/span&gt; &lt;span class="sh"&gt;""&lt;/span&gt;  
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="n"&gt;url&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;  
            &lt;span class="k"&gt;continue&lt;/span&gt;  
        &lt;span class="n"&gt;title&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;row&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;title&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="ow"&gt;or&lt;/span&gt; &lt;span class="sh"&gt;""&lt;/span&gt;&lt;span class="p"&gt;)[:&lt;/span&gt;&lt;span class="mi"&gt;8000&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;  
        &lt;span class="n"&gt;snippet&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;row&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;description&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="ow"&gt;or&lt;/span&gt; &lt;span class="n"&gt;row&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;snippet&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="ow"&gt;or&lt;/span&gt; &lt;span class="sh"&gt;""&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="ow"&gt;or&lt;/span&gt; &lt;span class="sh"&gt;""&lt;/span&gt;  
        &lt;span class="n"&gt;snippet&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;snippet&lt;/span&gt;&lt;span class="p"&gt;[:&lt;/span&gt;&lt;span class="mi"&gt;16000&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;  
        &lt;span class="n"&gt;pos&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;row&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;rank&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="ow"&gt;or&lt;/span&gt; &lt;span class="n"&gt;row&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;position&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="ow"&gt;or&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  
        &lt;span class="k"&gt;try&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;  
            &lt;span class="n"&gt;position&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;int&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;pos&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  
        &lt;span class="nf"&gt;except &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;TypeError&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nb"&gt;ValueError&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;  
            &lt;span class="n"&gt;position&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;  
        &lt;span class="n"&gt;domain&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;urlparse&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;url&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="n"&gt;netloc&lt;/span&gt; &lt;span class="ow"&gt;or&lt;/span&gt; &lt;span class="sh"&gt;""&lt;/span&gt;  
        &lt;span class="n"&gt;doc_id&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;hashlib&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;sha256&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;url&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="se"&gt;\t&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;source_query&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;encode&lt;/span&gt;&lt;span class="p"&gt;()).&lt;/span&gt;&lt;span class="nf"&gt;hexdigest&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;  
        &lt;span class="n"&gt;out&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;  
            &lt;span class="p"&gt;{&lt;/span&gt;  
                &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;id&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;doc_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  
                &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;title&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;title&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  
                &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;url&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;url&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  
                &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;snippet&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;snippet&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  
                &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;source_query&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;source_query&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  
                &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;domain&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;domain&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  
                &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;position&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;position&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  
            &lt;span class="p"&gt;}&lt;/span&gt;  
        &lt;span class="p"&gt;)&lt;/span&gt;  
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;out&lt;/span&gt;  


&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;ensure_collection&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;typesense&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Client&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; \&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;recreate&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;bool&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;  
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;recreate&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;  
        &lt;span class="k"&gt;try&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;  
            &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;collections&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;COLLECTION&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="nf"&gt;delete&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;  
        &lt;span class="k"&gt;except&lt;/span&gt; &lt;span class="n"&gt;ObjectNotFound&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;  
            &lt;span class="k"&gt;pass&lt;/span&gt;  
        &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;collections&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;create&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;collection_schema&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;  
        &lt;span class="k"&gt;return&lt;/span&gt;  
    &lt;span class="k"&gt;try&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;  
        &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;collections&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;COLLECTION&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="nf"&gt;retrieve&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;  
    &lt;span class="k"&gt;except&lt;/span&gt; &lt;span class="n"&gt;ObjectNotFound&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;  
        &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;collections&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;create&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;collection_schema&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;  


&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;load_queries&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;args&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;argparse&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Namespace&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;List&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;]:&lt;/span&gt;  
    &lt;span class="n"&gt;queries&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;List&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;  
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;args&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;queries_file&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;  
        &lt;span class="n"&gt;text&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Path&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;args&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;queries_file&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;read_text&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;encoding&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;utf-8&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  
        &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;line&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;splitlines&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;  
            &lt;span class="n"&gt;line&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;line&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;strip&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;  
            &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="n"&gt;line&lt;/span&gt; &lt;span class="ow"&gt;or&lt;/span&gt; &lt;span class="n"&gt;line&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;startswith&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;#&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;  
                &lt;span class="k"&gt;continue&lt;/span&gt;  
            &lt;span class="n"&gt;queries&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;line&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  
    &lt;span class="n"&gt;extra&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;args&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;queries&lt;/span&gt; &lt;span class="ow"&gt;or&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;  
    &lt;span class="n"&gt;queries&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;extend&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;extra&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="n"&gt;queries&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;  
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nf"&gt;list&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;DEFAULT_QUERIES&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;queries&lt;/span&gt;  


&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;main&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;  
    &lt;span class="n"&gt;p&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;argparse&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;ArgumentParser&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;description&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Ingest Bright Data SERP into Typesense.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  
    &lt;span class="n"&gt;p&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;add_argument&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;  
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;--num-results&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  
        &lt;span class="nb"&gt;type&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nb"&gt;int&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  
        &lt;span class="n"&gt;default&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;8&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  
        &lt;span class="n"&gt;help&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Max organic rows to index per query after fetch (Google ignores &amp;amp;num=; we slice client-side).&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  
    &lt;span class="p"&gt;)&lt;/span&gt;  
    &lt;span class="n"&gt;p&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;add_argument&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;  
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;--delay&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  
        &lt;span class="nb"&gt;type&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nb"&gt;float&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  
        &lt;span class="n"&gt;default&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;0.6&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  
        &lt;span class="n"&gt;help&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Seconds between Bright Data requests.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  
    &lt;span class="p"&gt;)&lt;/span&gt;  
    &lt;span class="n"&gt;p&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;add_argument&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;  
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;--append&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  
        &lt;span class="n"&gt;action&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;store_true&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  
        &lt;span class="n"&gt;help&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Do not drop the collection; create it only if missing. Use for multiple ingest runs into one index.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  
    &lt;span class="p"&gt;)&lt;/span&gt;  
    &lt;span class="n"&gt;p&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;add_argument&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;  
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;--query&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  
        &lt;span class="n"&gt;action&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;append&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  
        &lt;span class="n"&gt;dest&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;queries&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  
        &lt;span class="n"&gt;metavar&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Q&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  
        &lt;span class="n"&gt;help&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;SERP query string (repeatable). Default: built-in demo queries if no --queries-file/--query.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  
    &lt;span class="p"&gt;)&lt;/span&gt;  
    &lt;span class="n"&gt;p&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;add_argument&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;  
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;--queries-file&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  
        &lt;span class="nb"&gt;type&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  
        &lt;span class="n"&gt;default&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  
        &lt;span class="n"&gt;help&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Path to a file with one query per line (# and blank lines ignored).&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  
    &lt;span class="p"&gt;)&lt;/span&gt;  
    &lt;span class="n"&gt;args&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;p&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;parse_args&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;  

    &lt;span class="n"&gt;client&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;typesense_client&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;  
    &lt;span class="nf"&gt;ensure_collection&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;recreate&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="n"&gt;args&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;append&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  

    &lt;span class="n"&gt;bd&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;BrightDataSERPClient&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;  
    &lt;span class="n"&gt;all_docs&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;List&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;Dict&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Any&lt;/span&gt;&lt;span class="p"&gt;]]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;  
    &lt;span class="n"&gt;query_list&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;load_queries&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;args&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  

    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;q&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;query_list&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;  
        &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Query: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;q&lt;/span&gt;&lt;span class="si"&gt;!r}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  
        &lt;span class="k"&gt;try&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;  
            &lt;span class="n"&gt;raw&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;bd&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;search&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;q&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;num_results&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;args&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;num_results&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  
        &lt;span class="k"&gt;except&lt;/span&gt; &lt;span class="nb"&gt;Exception&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;  
            &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;  error: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  
            &lt;span class="k"&gt;continue&lt;/span&gt;  
        &lt;span class="n"&gt;docs&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;organic_to_documents&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;raw&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;q&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  
        &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;  indexed &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;docs&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; organic rows&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  
        &lt;span class="n"&gt;all_docs&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;extend&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;docs&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  
        &lt;span class="n"&gt;time&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;sleep&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;args&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;delay&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  

    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="n"&gt;all_docs&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;  
        &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;No documents to import. Check Bright Data credentials and SERP response.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  
        &lt;span class="k"&gt;return&lt;/span&gt;  

    &lt;span class="n"&gt;jsonl&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;join&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;dumps&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;d&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;ensure_ascii&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;False&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;d&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;all_docs&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  
    &lt;span class="n"&gt;imp&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;collections&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;COLLECTION&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="n"&gt;documents&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;import_&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;jsonl&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;action&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;upsert&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;})&lt;/span&gt;  
    &lt;span class="c1"&gt;# import_ returns one JSON object per line  
&lt;/span&gt;    &lt;span class="n"&gt;errors&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;line&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;line&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;imp&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;split&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;line&lt;/span&gt; &lt;span class="ow"&gt;and&lt;/span&gt; &lt;span class="sh"&gt;'"&lt;/span&gt;&lt;span class="s"&gt;success&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;:false&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;line&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;  
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;errors&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;  
        &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Import reported errors (first few):&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;errors&lt;/span&gt;&lt;span class="p"&gt;[:&lt;/span&gt;&lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;  
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Done. Total documents: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;all_docs&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  


&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;__name__&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;__main__&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;  
    &lt;span class="nf"&gt;main&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Two fields have &lt;code&gt;facet: True&lt;/code&gt;: &lt;code&gt;source_query&lt;/code&gt; and &lt;code&gt;domain&lt;/code&gt;. These are what the filter chips in the UI are built on. &lt;code&gt;source_query&lt;/code&gt; is the exact string sent to the SERP API — i.e. not a label you add later, the &lt;em&gt;actual query&lt;/em&gt;. &lt;code&gt;domain&lt;/code&gt; is extracted from the URL at ingest time.&lt;/p&gt;

&lt;p&gt;Both become filterable for free here, which is a huge win for us.&lt;/p&gt;

&lt;p&gt;Also, &lt;code&gt;default_sorting_field: "position"&lt;/code&gt; means results come back in the same order Google returned them. I &lt;em&gt;do&lt;/em&gt; want that as a default — it's the ranking signal I'm using Bright Data to get in the first place.&lt;/p&gt;

&lt;h2&gt;
  
  
  Some Common Gotchas
&lt;/h2&gt;




&lt;p&gt;When you’re mapping organic results to documents, the first question is how to generate document IDs. The move that &lt;em&gt;feels&lt;/em&gt; right is to simply hash the URL — deduplicate on URL, one document per link.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Don’t listen to that instinct.&lt;/strong&gt; Don’t do this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;doc_id&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;hashlib&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;sha256&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;url&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;encode&lt;/span&gt;&lt;span class="p"&gt;()).&lt;/span&gt;&lt;span class="nf"&gt;hexdigest&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;What you should do is bake the query into the ID&lt;/strong&gt; so the same link under two Bright Data runs is two documents, each tagged with the query that surfaced it:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;doc_id&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;hashlib&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;sha256&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;url&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="se"&gt;\t&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;source_query&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;encode&lt;/span&gt;&lt;span class="p"&gt;()).&lt;/span&gt;&lt;span class="nf"&gt;hexdigest&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The ID is &lt;code&gt;sha256(url + source_query)&lt;/code&gt;, so the same paper appearing under two different queries becomes two separate documents. Search for a paper title and both facet chips show up — you can see exactly &lt;em&gt;which&lt;/em&gt; of your Bright Data runs found it. If you hash on URL alone, that's gone permanently. The index &lt;em&gt;looks&lt;/em&gt; cleaner but you've thrown away the only thing that makes the &lt;code&gt;source_query&lt;/code&gt; facet meaningful.&lt;/p&gt;

&lt;p&gt;One more thing that will ruin your day if you miss it: Bright Data returns &lt;code&gt;link&lt;/code&gt; in most payloads but &lt;code&gt;url&lt;/code&gt; in some. &lt;code&gt;description&lt;/code&gt; and &lt;code&gt;snippet&lt;/code&gt; both map to the snippet field depending on the response. Handle both, else some batches might index with blank snippets, no errors or warnings:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;url&lt;/span&gt;     &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;row&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;link&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="ow"&gt;or&lt;/span&gt; &lt;span class="n"&gt;row&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;url&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="ow"&gt;or&lt;/span&gt; &lt;span class="sh"&gt;""&lt;/span&gt;  
&lt;span class="n"&gt;snippet&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;row&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;description&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="ow"&gt;or&lt;/span&gt; &lt;span class="n"&gt;row&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;snippet&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="ow"&gt;or&lt;/span&gt; &lt;span class="sh"&gt;""&lt;/span&gt;&lt;span class="p"&gt;)[:&lt;/span&gt;&lt;span class="mi"&gt;16000&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  The difference between a snapshot vs. a corpus
&lt;/h2&gt;




&lt;p&gt;&lt;code&gt;ingest.py&lt;/code&gt; runs in two modes:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;python ingest.py            # drops and recreates the collection  
python ingest.py --append   # creates only if missing, then upserts
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Running without &lt;code&gt;--append&lt;/code&gt; wipes and recreates the collection every time — that's probably fine for exploration, throwaway by design. &lt;code&gt;--append&lt;/code&gt; creates the collection only if it doesn't exist, then upserts into it.&lt;/p&gt;

&lt;p&gt;That matters because let’s say I have a scenario where I ran the default four queries on Monday. Thursday I wanted to add &lt;code&gt;site:arxiv.org graph RAG 2026&lt;/code&gt; to the same index — compare it against what I'd already collected rather than start over. With &lt;code&gt;--append&lt;/code&gt;, the new results land alongside the originals and the new seed query shows up as a chip immediately. &lt;em&gt;Without&lt;/em&gt; it, I'd be choosing between Monday's index and Thursday's.&lt;/p&gt;

&lt;p&gt;That’s what I meant by “collect once, query many times” — the index &lt;em&gt;accumulates&lt;/em&gt; and doesn’t reset or get overwritten each time.&lt;/p&gt;

&lt;p&gt;Custom queries work inline or from a file:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;python ingest.py --append --query "site:arxiv.org graph RAG 2026"  
python ingest.py --append --queries-file my_queries.txt
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Keeping the Typesense API key server-side
&lt;/h2&gt;




&lt;p&gt;You could point the browser straight at Typesense and skip &lt;code&gt;serve.py&lt;/code&gt; entirely. The problem is that Typesense's API key is an admin key — the same one that can drop your collection. Put it in client-side JS and anyone who opens devtools has it.&lt;/p&gt;

&lt;p&gt;So &lt;code&gt;serve.py&lt;/code&gt; is just a proxy. The browser calls &lt;code&gt;/api/search&lt;/code&gt;, the server makes the authenticated Typesense request and JSON comes back.&lt;/p&gt;

&lt;p&gt;I kept it as &lt;a href="https://docs.python.org/3/library/http.server.html" rel="noopener noreferrer"&gt;stdlib&lt;/a&gt; &lt;code&gt;[http.server](https://docs.python.org/3/library/http.server.html)&lt;/code&gt; — no &lt;a href="https://flask.palletsprojects.com/" rel="noopener noreferrer"&gt;Flask&lt;/a&gt; or &lt;a href="https://fastapi.tiangolo.com/" rel="noopener noreferrer"&gt;FastAPI&lt;/a&gt;. Adding a framework to wrap thirty lines of routing is honestly just adding a dependency for the sake of having a dependency. If you want to build on top of this, swapping in your preferred framework takes an hour.&lt;/p&gt;

&lt;p&gt;The &lt;a href="https://typesense.org/docs/0.25.0/api/documents.html#search" rel="noopener noreferrer"&gt;search parameters&lt;/a&gt; passed to Typesense are set once.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;serve.py&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt;  
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;os&lt;/span&gt;  
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;urllib.parse&lt;/span&gt;  
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;http.server&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;BaseHTTPRequestHandler&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;HTTPServer&lt;/span&gt;  
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;pathlib&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Path&lt;/span&gt;  

&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;typesense&lt;/span&gt;  
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;dotenv&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;load_dotenv&lt;/span&gt;  

&lt;span class="nf"&gt;load_dotenv&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;  

&lt;span class="n"&gt;STATIC&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Path&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;__file__&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;resolve&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="n"&gt;parent&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;static&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;  
&lt;span class="n"&gt;COLLECTION&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;serp_results&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;  
&lt;span class="n"&gt;PORT&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;int&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;os&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;getenv&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;SERVE_PORT&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;8765&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;  


&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;client&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;typesense&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Client&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;  
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;typesense&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Client&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;  
        &lt;span class="p"&gt;{&lt;/span&gt;  
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;nodes&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;  
                &lt;span class="p"&gt;{&lt;/span&gt;  
                    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;host&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;os&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;getenv&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;TYPESENSE_HOST&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;localhost&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;  
                    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;port&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;os&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;getenv&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;TYPESENSE_PORT&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;8108&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;  
                    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;protocol&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;os&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;getenv&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;TYPESENSE_PROTOCOL&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;http&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;  
                &lt;span class="p"&gt;}&lt;/span&gt;  
            &lt;span class="p"&gt;],&lt;/span&gt;  
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;api_key&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;os&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;environ&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;TYPESENSE_API_KEY&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;  
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;connection_timeout_seconds&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  
        &lt;span class="p"&gt;}&lt;/span&gt;  
    &lt;span class="p"&gt;)&lt;/span&gt;  


&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;Handler&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;BaseHTTPRequestHandler&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;  
    &lt;span class="n"&gt;_ts&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;typesense&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Client&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;  

    &lt;span class="nd"&gt;@classmethod&lt;/span&gt;  
    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;typesense&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;cls&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;typesense&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Client&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;  
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;cls&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;_ts&lt;/span&gt; &lt;span class="ow"&gt;is&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;  
            &lt;span class="n"&gt;cls&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;_ts&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;client&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;  
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;cls&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;_ts&lt;/span&gt;  

    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;log_message&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;fmt&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; \&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;args&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;object&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;  
        &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;[&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;address_string&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;] &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;fmt&lt;/span&gt; &lt;span class="o"&gt;%&lt;/span&gt; &lt;span class="n"&gt;args&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  

    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;do_GET&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;  
        &lt;span class="n"&gt;parsed&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;urllib&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;parse&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;urlparse&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;path&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;parsed&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;path&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;/api/search&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;  
            &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;_search&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;parsed&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  
            &lt;span class="k"&gt;return&lt;/span&gt;  
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;parsed&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;path&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;/&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt; &lt;span class="ow"&gt;or&lt;/span&gt; &lt;span class="n"&gt;parsed&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;path&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;/index.html&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;  
            &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;_file&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;STATIC&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;index.html&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;text/html; charset=utf-8&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  
            &lt;span class="k"&gt;return&lt;/span&gt;  
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;send_error&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;404&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Not found&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  

    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;_file&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;path&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;Path&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;content_type&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;  
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="n"&gt;path&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;is_file&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;  
            &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;send_error&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;404&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Not found&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  
            &lt;span class="k"&gt;return&lt;/span&gt;  
        &lt;span class="n"&gt;data&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;path&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;read_bytes&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;  
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;send_response&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;200&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;send_header&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Content-Type&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;content_type&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;send_header&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Content-Length&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nf"&gt;str&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;)))&lt;/span&gt;  
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;end_headers&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;  
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;wfile&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;write&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  

    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;_search&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;  
        &lt;span class="n"&gt;qs&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;urllib&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;parse&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;parse_qs&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  
        &lt;span class="n"&gt;q&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;qs&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;q&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="ow"&gt;or&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;""&lt;/span&gt;&lt;span class="p"&gt;])[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="nf"&gt;strip&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;  
        &lt;span class="n"&gt;fq&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;qs&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;filter_by&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="ow"&gt;or&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;""&lt;/span&gt;&lt;span class="p"&gt;])[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="nf"&gt;strip&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;  

        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="n"&gt;q&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;  
            &lt;span class="n"&gt;payload&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;  
                &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;hits&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[],&lt;/span&gt;  
                &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;found&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  
                &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;facet_counts&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[],&lt;/span&gt;  
                &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;q&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;q&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  
            &lt;span class="p"&gt;}&lt;/span&gt;  
            &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;_json&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;payload&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  
            &lt;span class="k"&gt;return&lt;/span&gt;  

        &lt;span class="c1"&gt;# Text search spans four stored fields (see ingest schema). Weights tune BM25-style  
&lt;/span&gt;        &lt;span class="c1"&gt;# ranking: a term in the title should matter more than the same term buried in the  
&lt;/span&gt;        &lt;span class="c1"&gt;# snippet, and more than an incidental match in the URL or domain string.  
&lt;/span&gt;        &lt;span class="c1"&gt;# Order MUST match query_by — Typesense applies weights positionally.  
&lt;/span&gt;        &lt;span class="n"&gt;query_by&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;title,snippet,url,domain&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;  
        &lt;span class="n"&gt;query_by_weights&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;4,3,1,1&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt; &lt;span class="c1"&gt;# so titles are more important than snippets, which are more important than urls, which are more important than domains  
&lt;/span&gt;
        &lt;span class="n"&gt;params&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;dict&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;  
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;q&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;q&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;query_by&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;query_by&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;query_by_weights&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;query_by_weights&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;facet_by&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;source_query,domain&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;max_facet_values&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;40&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;per_page&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;25&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  
        &lt;span class="p"&gt;}&lt;/span&gt;  
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;fq&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;  
            &lt;span class="n"&gt;params&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;filter_by&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;fq&lt;/span&gt;  

        &lt;span class="k"&gt;try&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;  
            &lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;typesense&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="n"&gt;collections&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;COLLECTION&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="n"&gt;documents&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;search&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;params&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  
        &lt;span class="k"&gt;except&lt;/span&gt; &lt;span class="nb"&gt;Exception&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;  
            &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;send_response&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;500&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  
            &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;send_header&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Content-Type&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;application/json&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  
            &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;end_headers&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;  
            &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;wfile&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;write&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;dumps&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;error&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nf"&gt;str&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="p"&gt;)}).&lt;/span&gt;&lt;span class="nf"&gt;encode&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;  
            &lt;span class="k"&gt;return&lt;/span&gt;  

        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;_json&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  

    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;_json&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;obj&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;object&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;  
        &lt;span class="n"&gt;data&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;dumps&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;obj&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;ensure_ascii&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;False&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;encode&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;utf-8&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;send_response&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;200&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;send_header&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Content-Type&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;application/json; charset=utf-8&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;send_header&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Content-Length&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nf"&gt;str&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;)))&lt;/span&gt;  
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;end_headers&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;  
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;wfile&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;write&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  


&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;main&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;  
    &lt;span class="n"&gt;server&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;HTTPServer&lt;/span&gt;&lt;span class="p"&gt;((&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;127.0.0.1&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;PORT&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="n"&gt;Handler&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;SERP demo UI: http://127.0.0.1:&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;PORT&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;/&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  
    &lt;span class="n"&gt;server&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;serve_forever&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;  


&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;__name__&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;__main__&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;  
    &lt;span class="nf"&gt;main&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;code&gt;query_by_weights&lt;/code&gt; runs in the same order as &lt;code&gt;query_by&lt;/code&gt;. A match in &lt;code&gt;title&lt;/code&gt; outscores the same match in &lt;code&gt;snippet&lt;/code&gt;, which outscores a match in &lt;code&gt;url&lt;/code&gt; or &lt;code&gt;domain&lt;/code&gt;. That nudges ranking toward "this is what the page is about" rather than "this word appears somewhere in the metadata" — no embeddings, no extra service, just the standard keyword-search lever.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;domain&lt;/code&gt; being in &lt;code&gt;query_by&lt;/code&gt; is a small trick: searching &lt;code&gt;arxiv.org&lt;/code&gt; directly returns everything from that domain. Useful when you've mixed sources in one index, and costs nothing.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;facet_by&lt;/code&gt; returns counts alongside every search response — the UI builds chips from those without a second request.&lt;/p&gt;

&lt;p&gt;One UX detail I cared about is if a facet filter produces zero results, the UI reruns the query &lt;em&gt;without&lt;/em&gt; &lt;code&gt;filter_by&lt;/code&gt;, keeps the chips populated from those broader counts, and tells you that your filters might be hiding matches. You don’t want a blank screen with zero explanation, do you? 🙃&lt;/p&gt;

&lt;h2&gt;
  
  
  Typesense Facets in Vanilla JavaScript
&lt;/h2&gt;




&lt;p&gt;The UI can just be regular JavaScript/CSS. I don’t need to go into too much detail, frontend UI design isn’t the point of this post. All you need is some sort of JS logic that hits &lt;code&gt;/api/search&lt;/code&gt;, renders hits, and builds chips from &lt;code&gt;facet_counts&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;Facet state is two variables:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="kd"&gt;let&lt;/span&gt; &lt;span class="nx"&gt;filterSq&lt;/span&gt;  &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="dl"&gt;""&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;  &lt;span class="c1"&gt;// active source_query filter  &lt;/span&gt;
&lt;span class="kd"&gt;let&lt;/span&gt; &lt;span class="nx"&gt;filterDom&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="dl"&gt;""&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;  &lt;span class="c1"&gt;// active domain filter&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Clicking a chip toggles the relevant variable and re-runs the search. Multiple active filters compose with &lt;code&gt;&amp;amp;&amp;amp;&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="kd"&gt;function&lt;/span&gt; &lt;span class="nf"&gt;buildFilterBy&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;  
  &lt;span class="kd"&gt;var&lt;/span&gt; &lt;span class="nx"&gt;parts&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[];&lt;/span&gt;  
  &lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;filterSq&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  &lt;span class="nx"&gt;parts&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;push&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;source_query:=`&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="nx"&gt;filterSq&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;`&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;  
  &lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;filterDom&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="nx"&gt;parts&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;push&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;domain:=`&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="nx"&gt;filterDom&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;`&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;  
  &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nx"&gt;parts&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;join&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt; &amp;amp;&amp;amp; &lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;  
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That string goes straight into Typesense’s &lt;code&gt;filter_by&lt;/code&gt; — the UI is just a thin layer over native filter syntax. Nothing to maintain on the client side.&lt;/p&gt;

&lt;p&gt;Each result card shows the title, snippet, domain, and the seed query that produced it. That last tag is the thing. You can see at a glance which Bright Data run each result came from — i.e. which question you were asking when you made the query.&lt;/p&gt;

&lt;h2&gt;
  
  
  Running it
&lt;/h2&gt;






&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# 1. Start Typesense  &lt;/span&gt;
docker compose up &lt;span class="nt"&gt;-d&lt;/span&gt;  

&lt;span class="c"&gt;# 2. Install Python deps  &lt;/span&gt;
pip &lt;span class="nb"&gt;install&lt;/span&gt; &lt;span class="nt"&gt;-r&lt;/span&gt; requirements.txt  

&lt;span class="c"&gt;# 3. Ingest the demo queries  &lt;/span&gt;
python ingest.py
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;You'll see:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;Query: &lt;span class="s1"&gt;'site:arxiv.org retrieval augmented generation 2026'&lt;/span&gt;  
  indexed 8 organic rows  
Query: &lt;span class="s1"&gt;'site:arxiv.org hybrid search reranking 2026'&lt;/span&gt;  
  indexed 8 organic rows  
Query: &lt;span class="s1"&gt;'site:arxiv.org agentic RAG 2026'&lt;/span&gt;  
  indexed 8 organic rows  
Query: &lt;span class="s1"&gt;'site:arxiv.org long context vs RAG 2026'&lt;/span&gt;  
  indexed 8 organic rows  
Done. Total documents: 32
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# 4. Start the UI  &lt;/span&gt;
python serve.py
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Open &lt;code&gt;http://127.0.0.1:8765/&lt;/code&gt; (or whatever you set with &lt;code&gt;SERVE_PORT&lt;/code&gt;). You should see the empty search shell first:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Flhn5ipm9yxby3ztu66gb.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Flhn5ipm9yxby3ztu66gb.jpeg" alt="Landing state. The search box and short explainer before you run a query." width="640" height="360"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Search for &lt;code&gt;memory&lt;/code&gt;, &lt;code&gt;chunk&lt;/code&gt;, &lt;code&gt;graph&lt;/code&gt;, &lt;code&gt;RAG&lt;/code&gt;. Click a seed query chip to isolate a single SERP run. If you've mixed domains, the domain chips filter those too.&lt;/p&gt;

&lt;p&gt;Second pass, same index:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;python ingest.py &lt;span class="nt"&gt;--append&lt;/span&gt; &lt;span class="nt"&gt;--query&lt;/span&gt; &lt;span class="s2"&gt;"site:arxiv.org graph RAG 2026"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;New seed query appears as a chip immediately. Everything you indexed before is still there.&lt;/p&gt;

&lt;h2&gt;
  
  
  What query "provenance" actually means
&lt;/h2&gt;




&lt;p&gt;The default run collects ~32 arxiv results tagged across four seed queries. Search for &lt;code&gt;RAG&lt;/code&gt; or &lt;code&gt;memory&lt;/code&gt;and you get hits from all four runs mixed together.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6y8m7ka6jjlotn60usrq.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6y8m7ka6jjlotn60usrq.png" alt="Same keywords, narrowed to one seed query via a facet chip — shortlist is the papers that surfaced under that SERP API run." width="640" height="360"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Now the interesting question is this: are the results under “agentic RAG 2026” the same papers as under “long context vs RAG 2026”?&lt;/p&gt;

&lt;p&gt;We can verify this quickly.&lt;/p&gt;

&lt;p&gt;Click the &lt;code&gt;site:arxiv.org agentic RAG 2026&lt;/code&gt; chip — that’s one shortlist. Clear it, then click &lt;code&gt;site:arxiv.org long context vs RAG 2026&lt;/code&gt; — another. Some papers appear in both, and you quickly inspect them this way. Those are the ones Google considers relevant regardless of how you framed the question. The ones in only one list are specific to that framing.&lt;/p&gt;

&lt;p&gt;This is what I mean by &lt;strong&gt;provenance&lt;/strong&gt;. The &lt;code&gt;source_query&lt;/code&gt; facet isn't a topic label, but can be considered a record of &lt;em&gt;which question you were asking when you collected the data&lt;/em&gt;. Meaning a paper showing up under multiple seeds is telling you something, and not a deduplication problem.&lt;/p&gt;

&lt;p&gt;One honest caveat, though: this is navigation over SERP metadata — titles, snippets, URLs. It can’t search &lt;em&gt;inside&lt;/em&gt; the PDFs. What it does is let me triage thirty papers in two minutes instead of twenty, which is the problem I actually had.&lt;/p&gt;

&lt;h2&gt;
  
  
  Frequently Asked Questions (FAQ)
&lt;/h2&gt;




&lt;p&gt;&lt;strong&gt;&lt;em&gt;Q:&lt;/em&gt;&lt;/strong&gt; &lt;em&gt;How do I get Google search results as JSON with Bright Data?&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;A:&lt;/strong&gt; &lt;code&gt;POST&lt;/code&gt; to &lt;code&gt;https://api.brightdata.com/request&lt;/code&gt; with a Bearer token, your zone name, and a Google URL that includes &lt;code&gt;&amp;amp;brd_json=1&lt;/code&gt;. That flag is what flips the response from raw HTML to a parsed &lt;code&gt;organic&lt;/code&gt; array (each row has &lt;code&gt;title&lt;/code&gt;, &lt;code&gt;link&lt;/code&gt;, &lt;code&gt;description&lt;/code&gt;, &lt;code&gt;rank&lt;/code&gt;). The JSON you get back is an &lt;em&gt;envelope&lt;/em&gt; — the SERP payload is inside &lt;code&gt;body&lt;/code&gt;, usually as a JSON string you have to &lt;code&gt;json.loads&lt;/code&gt; a second time.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;em&gt;Q:&lt;/em&gt;&lt;/strong&gt; &lt;em&gt;Typesense vs Meilisearch vs Elasticsearch — which should I pick for a local search index?&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;A:&lt;/strong&gt; For this kind of workload (a small, local, faceted index over web data) Typesense and Meilisearch are both reasonable but Elasticsearch is overkill. Typesense is in-memory C++, sub-millisecond latency, facets and typo tolerance on by default, one Docker container, no JVM. Meilisearch is Rust, disk-backed (LMDB), handles larger corpora on less RAM, and has arguably nicer defaults for developer UX. Elasticsearch is what you use when you have a dedicated ops team, billions of documents, or log-analytics workloads.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;em&gt;Q:&lt;/em&gt;&lt;/strong&gt; &lt;em&gt;Why is the same URL indexed twice if it appears under two queries?&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;A:&lt;/strong&gt; Because I want it that way. The document ID is &lt;code&gt;sha256(url + source_query)&lt;/code&gt;, so the same paper surfacing under &lt;code&gt;"agentic RAG 2026"&lt;/code&gt; and under &lt;code&gt;"long context vs RAG 2026"&lt;/code&gt; becomes two documents — each tagged with the query that found it. Searching for the title shows both facet chips, which is how you see &lt;em&gt;which&lt;/em&gt; Bright Data run produced each hit. Hash on URL alone and that provenance is gone permanently.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;em&gt;Q:&lt;/em&gt;&lt;/strong&gt; &lt;em&gt;Does this actually search inside the papers, or just the search-result metadata?&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;A:&lt;/strong&gt; Just metadata — titles, snippets, URLs, domains, and the seed query. It’s navigation over SERP rows, not full-text search over PDFs. If you need to search inside the papers, you’d add a second stage — download the PDFs, chunk, embed — on top of this index, using the URLs it surfaces as the candidate set.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;em&gt;Q:&lt;/em&gt;&lt;/strong&gt; &lt;em&gt;Can I use this pipeline for non-arxiv sources?&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;A:&lt;/strong&gt; Yes. The pipeline has no opinion about what the queries are. &lt;code&gt;site:arxiv.org&lt;/code&gt; is just the scenario I needed; swap in &lt;code&gt;site:github.com&lt;/code&gt;, &lt;code&gt;site:news.ycombinator.com&lt;/code&gt;, mix &lt;code&gt;site:&lt;/code&gt; operators, or drop the filter entirely. The &lt;code&gt;domain&lt;/code&gt; field is extracted from the URL at ingest time, so mixed-domain runs get a second facet chip for free.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;em&gt;Q:&lt;/em&gt;&lt;/strong&gt; &lt;em&gt;Why stdlib&lt;/em&gt; &lt;code&gt;_http.server_&lt;/code&gt; &lt;em&gt;instead of Flask or FastAPI?&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;A:&lt;/strong&gt; Because the proxy is small enough that a framework import would be bigger than the logic it wraps. One handler, two routes (&lt;code&gt;/&lt;/code&gt; and &lt;code&gt;/api/search&lt;/code&gt;), no middleware, no router — stdlib is enough. If you're building on top of this, swapping in FastAPI or your preferred framework takes an hour; I just didn't want to pay the dependency tax for a demo.&lt;/p&gt;

&lt;h2&gt;
  
  
  Key Takeaways
&lt;/h2&gt;




&lt;p&gt;Bright Data solves the hard part of web data collection — proxy rotation, bot detection, structured extraction. Yadda, yadda.&lt;/p&gt;

&lt;p&gt;What you &lt;em&gt;do&lt;/em&gt; with that JSON is a different question. Export it and it answers the questions you had when you wrote the query. Or, index it, and it answers questions you haven’t even thought of yet.&lt;/p&gt;

&lt;p&gt;From a collection as an endpoint to a collection as the start of something you can actually explore as you’re researching — is what I was going for here. It took me a week to refactor something I’d been doing badly for a year, and about twenty minutes to run once it was done. This can scale quite well so, more queries, more domains, more &lt;code&gt;--append&lt;/code&gt; runs — and you have the option to make the Typesense index grow with the research instead of resetting every time.&lt;/p&gt;

</description>
      <category>python</category>
      <category>programming</category>
      <category>webdev</category>
      <category>javascript</category>
    </item>
    <item>
      <title>How To Validate Any API Response with Great Expectations (GX)</title>
      <dc:creator>Prithwish Nath</dc:creator>
      <pubDate>Wed, 15 Apr 2026 05:59:38 +0000</pubDate>
      <link>https://dev.to/prithwish_nath/how-to-validate-any-api-response-with-great-expectations-gx-2deb</link>
      <guid>https://dev.to/prithwish_nath/how-to-validate-any-api-response-with-great-expectations-gx-2deb</guid>
      <description>&lt;p&gt;A lot of my data analysis work is forensic (example &lt;a href="https://javascript.plainenglish.io/reverse-engineering-virality-what-1k-reddit-posts-reveal-about-the-internets-attention-economy-5a1f913ffc4d" rel="noopener noreferrer"&gt;here&lt;/a&gt;). I’ll pull repeated snapshots — content, traffic, SERPs, metadata — and look at patterns across &lt;em&gt;hundreds&lt;/em&gt; of these ingestions. So naturally, bad batches will skew results in ways that are &lt;em&gt;just convincing enough&lt;/em&gt; so I don’t catch it. Disastrous.&lt;/p&gt;

&lt;p&gt;You can add try-catches, but a 200 OK from your API with empty titles and duplicate rows is still considered a &lt;em&gt;success&lt;/em&gt; on the wire unless you model validation as its own failure path. So is there a better way?&lt;/p&gt;

&lt;p&gt;This is exactly the gap &lt;a href="https://docs.greatexpectations.io/docs/" rel="noopener noreferrer"&gt;Great Expectations&lt;/a&gt; (GX Core) fills. It’s an open-source Python library for defining declarative quality rules on your data — &lt;strong&gt;“quality gates”&lt;/strong&gt;, explicit rules your data must pass before it’s trusted &lt;strong&gt;—&lt;/strong&gt; things like “this field must never be null”, “these IDs must be unique”, or “this value must fall within a known range”. You basically codify what “good” looks like, run it as a validation step in your pipeline, and get a clear pass/fail on every batch — before bad data ever touches your analysis.&lt;/p&gt;

&lt;p&gt;Let’s turn that idea into code — we’ll get data at scale via &lt;a href="https://get.brightdata.com/bd-serp-api?utm_content=how_to_validate_any_api_response_with_great_expectations_gx" rel="noopener noreferrer"&gt;Bright Data&lt;/a&gt; (a SERP API), run a GX suite over each batch, then store it in &lt;a href="https://duckdb.org/docs/stable/" rel="noopener noreferrer"&gt;DuckDB&lt;/a&gt; (a file-backed analytical SQL database that’s just a &lt;code&gt;.duckdb&lt;/code&gt; file on disk) clean rows to the main table, failed batches to quarantine with enough context to debug.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Setup
&lt;/h2&gt;

&lt;p&gt;Here’s what you need:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;requests&amp;gt;&lt;span class="o"&gt;=&lt;/span&gt;2.28.0  
python-dotenv&amp;gt;&lt;span class="o"&gt;=&lt;/span&gt;1.0.0  
duckdb&amp;gt;&lt;span class="o"&gt;=&lt;/span&gt;1.0.0  
pandas&amp;gt;&lt;span class="o"&gt;=&lt;/span&gt;2.0.0  
psutil&amp;gt;&lt;span class="o"&gt;=&lt;/span&gt;5.9.0  
great-expectations&amp;gt;&lt;span class="o"&gt;=&lt;/span&gt;1.0.0
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Note that &lt;strong&gt;Python 3.10+&lt;/strong&gt; is the floor for Great Expectations 1.x with current pandas / duckdb wheels.&lt;/p&gt;

&lt;p&gt;The important ones:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://pypi.org/project/great-expectations/" rel="noopener noreferrer"&gt;great-expectations&lt;/a&gt; — the quality gates: expectations on each batch, clear pass/fail and reports. All we need is GX Core.&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://pypi.org/project/duckdb/" rel="noopener noreferrer"&gt;duckdb&lt;/a&gt; — Doesn't need a server to run.&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://pypi.org/project/requests/" rel="noopener noreferrer"&gt;requests&lt;/a&gt; —For HTTP to Bright Data’s &lt;a href="https://get.brightdata.com/lp-scraping-browser-acf1964?utm_content=how_to_validate_any_api_response_with_great_expectations_gx" rel="noopener noreferrer"&gt;POST /request&lt;/a&gt; endpoint (your zone must be a &lt;strong&gt;SERP&lt;/strong&gt; zone).&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Before running &lt;code&gt;pip install -r requirements.txt&lt;/code&gt;, create a .env file with:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;BRIGHT_DATA_API_KEY=your_api_key  
BRIGHT_DATA_ZONE=serp # or your SERP zone name from the Bright Data dashboard  
BRIGHT_DATA_COUNTRY=us # optional
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The Bright Data client reads these when you run the pipeline. You need a SERP-capable zone — without one, POST /request will not match what this code expects. Proxy rotation and unblocking stay on Bright Data’s side.&lt;/p&gt;

&lt;h2&gt;
  
  
  How Does the Pipeline Actually Work?
&lt;/h2&gt;

&lt;p&gt;This diagram should make it clear.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7jvlwv7rp2ibk6jmqhw2.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7jvlwv7rp2ibk6jmqhw2.png" width="640" height="777"&gt;&lt;/a&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;bright_data.py &lt;span class="c"&gt;# Bright Data API client  &lt;/span&gt;
serp_expectations.py &lt;span class="c"&gt;# Great Expectations validation suite  &lt;/span&gt;
duckdb_store.py &lt;span class="c"&gt;# DuckDB schema + insert/quarantine logic  &lt;/span&gt;
ingest.py &lt;span class="c"&gt;# Pipeline: fetch → validate → store or quarantine&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The flow is straightforward: one query equals one batch. For each batch, we check things in a strict order before anything touches the database.&lt;/p&gt;

&lt;h2&gt;
  
  
  Step 1: Fetching Structured SERP Data with Bright Data
&lt;/h2&gt;

&lt;p&gt;Our API client will just be a thin wrapper around Bright Data’s &lt;code&gt;POST /request&lt;/code&gt; endpoint.&lt;/p&gt;

&lt;h3&gt;
  
  
  bright_data.py
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt;  
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;os&lt;/span&gt;  
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;dataclasses&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;dataclass&lt;/span&gt;  
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;pathlib&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Path&lt;/span&gt;  

&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;requests&lt;/span&gt;  
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;dotenv&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;load_dotenv&lt;/span&gt;  

&lt;span class="n"&gt;_GX_ROOT&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Path&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;__file__&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;resolve&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="n"&gt;parent&lt;/span&gt;  
&lt;span class="nf"&gt;load_dotenv&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;_GX_ROOT&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;.env&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  


&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;normalize_serp_payload&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;raw&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;  
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Bright Data may wrap JSON in a string body.&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;  
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="nf"&gt;isinstance&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;raw&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nb"&gt;dict&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="ow"&gt;and&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;body&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;raw&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;  
        &lt;span class="n"&gt;body&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;raw&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;body&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;  
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="nf"&gt;isinstance&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;body&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;  
            &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;loads&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;body&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="nf"&gt;isinstance&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;body&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nb"&gt;dict&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;  
            &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;body&lt;/span&gt;  
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;raw&lt;/span&gt;  


&lt;span class="nd"&gt;@dataclass&lt;/span&gt;  
&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;SerpApiResponse&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;  
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Bright Data POST /request: HTTP status plus parsed JSON body (if any).&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;  

    &lt;span class="n"&gt;status_code&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;int&lt;/span&gt;  
    &lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;dict&lt;/span&gt;  


&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;BrightDataClient&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;  
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;SERP API zone client; uses https://api.brightdata.com/request&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;  

    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;__init__&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;api_key&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;zone&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;country&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;  
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;api_key&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;api_key&lt;/span&gt; &lt;span class="ow"&gt;or&lt;/span&gt; &lt;span class="n"&gt;os&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;getenv&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;BRIGHT_DATA_API_KEY&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;zone&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;zone&lt;/span&gt; &lt;span class="ow"&gt;or&lt;/span&gt; &lt;span class="n"&gt;os&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;getenv&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;BRIGHT_DATA_ZONE&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;country&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;country&lt;/span&gt; &lt;span class="ow"&gt;or&lt;/span&gt; &lt;span class="n"&gt;os&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;getenv&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;BRIGHT_DATA_COUNTRY&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;api_endpoint&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;https://api.brightdata.com/request&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;  

        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;api_key&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;  
            &lt;span class="k"&gt;raise&lt;/span&gt; &lt;span class="nc"&gt;ValueError&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;BRIGHT_DATA_API_KEY must be set in the environment or constructor.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;zone&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;  
            &lt;span class="k"&gt;raise&lt;/span&gt; &lt;span class="nc"&gt;ValueError&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;BRIGHT_DATA_ZONE must be set in the environment or constructor.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  

        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;session&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;requests&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Session&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;  
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;session&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;headers&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;update&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;  
            &lt;span class="p"&gt;{&lt;/span&gt;  
                &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Content-Type&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;application/json&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  
                &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Authorization&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Bearer &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;api_key&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  
            &lt;span class="p"&gt;}&lt;/span&gt;  
        &lt;span class="p"&gt;)&lt;/span&gt;  

    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;search&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;num_results&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;language&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;country&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;  
        &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Raises on non-200 or network error.&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;  
        &lt;span class="n"&gt;serp&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;search_with_status&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;num_results&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;language&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;country&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;serp&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;status_code&lt;/span&gt; &lt;span class="o"&gt;!=&lt;/span&gt; &lt;span class="mi"&gt;200&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;  
            &lt;span class="k"&gt;raise&lt;/span&gt; &lt;span class="nc"&gt;RuntimeError&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;  
                &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Search request failed with HTTP &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;serp&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;status_code&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;serp&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="si"&gt;!r}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;[:&lt;/span&gt;&lt;span class="mi"&gt;500&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;  
            &lt;span class="p"&gt;)&lt;/span&gt;  
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nf"&gt;normalize_serp_payload&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;serp&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  

    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;search_with_status&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;num_results&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;language&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;country&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;  
        &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Does not raise on non-200 — for ingest + quarantine routing.&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;  
        &lt;span class="n"&gt;search_url&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;  
            &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;https://www.google.com/search&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;  
            &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;?q=&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;requests&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;utils&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;quote&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;  
            &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;&amp;amp;num=&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;num_results&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;  
            &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;&amp;amp;brd_json=1&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;  
        &lt;span class="p"&gt;)&lt;/span&gt;  
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;language&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;  
            &lt;span class="n"&gt;search_url&lt;/span&gt; &lt;span class="o"&gt;+=&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;&amp;amp;hl=&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;language&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;&amp;amp;lr=lang_&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;language&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;  

        &lt;span class="n"&gt;target_country&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;country&lt;/span&gt; &lt;span class="ow"&gt;or&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;country&lt;/span&gt;  
        &lt;span class="n"&gt;payload&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;zone&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;zone&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;url&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;search_url&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;format&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;json&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;  
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;target_country&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;  
            &lt;span class="n"&gt;payload&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;country&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;target_country&lt;/span&gt;  

        &lt;span class="k"&gt;try&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;  
            &lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;session&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;post&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;api_endpoint&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;payload&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;timeout&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;30&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  
        &lt;span class="k"&gt;except&lt;/span&gt; &lt;span class="n"&gt;requests&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;exceptions&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;RequestException&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;  
            &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nc"&gt;SerpApiResponse&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;status_code&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;_error&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nf"&gt;str&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="p"&gt;)})&lt;/span&gt;  

        &lt;span class="n"&gt;data&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;_parse_response_body&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nc"&gt;SerpApiResponse&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;status_code&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;status_code&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  


&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;_parse_response_body&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;  
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;  
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;{}&lt;/span&gt;  
    &lt;span class="k"&gt;try&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;  
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;json&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;  
    &lt;span class="k"&gt;except&lt;/span&gt; &lt;span class="nb"&gt;ValueError&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;  
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;_non_json_body&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="p"&gt;[:&lt;/span&gt;&lt;span class="mi"&gt;8000&lt;/span&gt;&lt;span class="p"&gt;]}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Some things to note here:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The &lt;code&gt;search_with_status&lt;/code&gt; does not raise on non-&lt;code&gt;200&lt;/code&gt; so ingest can quarantine API failures&lt;/li&gt;
&lt;li&gt;Transport errors are caught so you get &lt;code&gt;status_code=0&lt;/code&gt; and the same &lt;code&gt;api_error&lt;/code&gt; path as HTTP failures.&lt;/li&gt;
&lt;li&gt;Also, &lt;code&gt;search()&lt;/code&gt; raises on failure for callers that want exceptions.&lt;/li&gt;
&lt;li&gt;And finally,&lt;code&gt;normalize_serp_payload&lt;/code&gt; unwraps responses just in case your API puts JSON under a string &lt;code&gt;body&lt;/code&gt;.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Step 2: What Happens to Data That Doesn’t Pass?
&lt;/h2&gt;

&lt;p&gt;Every batch in this pipeline has two possible destinations in DuckDB:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;If it clears all our defined quality gates, it goes to &lt;code&gt;serp_results&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;If it doesn't — wrong HTTP status, empty organic list, or a failed GX expectation — it lands in &lt;code&gt;serp_quarantine&lt;/code&gt; instead, with enough context to understand why.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Let's build that store before we wire up the gates.&lt;/p&gt;

&lt;h3&gt;
  
  
  duckdb_store.py
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;hashlib&lt;/span&gt;  
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt;  
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;os&lt;/span&gt;  
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;datetime&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;datetime&lt;/span&gt;  

&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;duckdb&lt;/span&gt;  
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;pandas&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;  
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;psutil&lt;/span&gt;  


&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;SerpStore&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;  
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;serp_results + serp_quarantine&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;  

    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;__init__&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;db_path&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;memory_limit&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;  
        &lt;span class="n"&gt;parent&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;os&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;path&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;dirname&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;db_path&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;parent&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;  
            &lt;span class="n"&gt;os&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;makedirs&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;parent&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;exist_ok&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  

        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;db_path&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;db_path&lt;/span&gt;  
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;conn&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;duckdb&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;connect&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;db_path&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  

        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;memory_limit&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;  
            &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;conn&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;execute&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;SET memory_limit=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;memory_limit&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;'"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  
        &lt;span class="k"&gt;else&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;  
            &lt;span class="n"&gt;available_memory&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;psutil&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;virtual_memory&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="n"&gt;available&lt;/span&gt;  
            &lt;span class="n"&gt;memory_gb&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;int&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;available_memory&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1024&lt;/span&gt;&lt;span class="o"&gt;**&lt;/span&gt;&lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="mf"&gt;0.8&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  
            &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;conn&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;execute&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;SET memory_limit=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;memory_gb&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;GB&lt;/span&gt;&lt;span class="sh"&gt;'"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  

        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;_create_schema&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;  

    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;_create_schema&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;  
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;conn&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;execute&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;  
            CREATE TABLE IF NOT EXISTS serp_results (  
                id BIGINT PRIMARY KEY,  
                query TEXT NOT NULL,  
                timestamp TIMESTAMP NOT NULL,  
                result_position INTEGER NOT NULL,  
                title TEXT,  
                url TEXT,  
                snippet TEXT,  
                domain TEXT,  
                rank INTEGER,  
                previous_rank INTEGER,  
                rank_delta INTEGER  
            )  
        &lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;conn&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;execute&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;CREATE INDEX IF NOT EXISTS idx_query ON serp_results(query)&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;conn&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;execute&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;CREATE INDEX IF NOT EXISTS idx_domain ON serp_results(domain)&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  

        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;conn&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;execute&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;  
            CREATE TABLE IF NOT EXISTS serp_quarantine (  
                query TEXT,  
                timestamp TIMESTAMP,  
                reason TEXT,  
                organic_count INTEGER,  
                payload_hash TEXT,  
                http_status INTEGER,  
                raw_json JSON  
            )  
        &lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  

    &lt;span class="nd"&gt;@staticmethod&lt;/span&gt;  
    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;_payload_hash&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;payload&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;  
        &lt;span class="n"&gt;canonical&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;dumps&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;payload&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;sort_keys&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;default&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;hashlib&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;sha256&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;canonical&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;encode&lt;/span&gt;&lt;span class="p"&gt;()).&lt;/span&gt;&lt;span class="nf"&gt;hexdigest&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;  

    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;insert_quarantine&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  
        &lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  
        &lt;span class="n"&gt;reason&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  
        &lt;span class="n"&gt;organic_count&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  
        &lt;span class="n"&gt;raw_json&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  
        &lt;span class="n"&gt;http_status&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  
        &lt;span class="n"&gt;timestamp&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;    &lt;span class="p"&gt;):&lt;/span&gt;  
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;timestamp&lt;/span&gt; &lt;span class="ow"&gt;is&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;  
            &lt;span class="n"&gt;timestamp&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;datetime&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;now&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;  
        &lt;span class="n"&gt;payload_hash&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;_payload_hash&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;raw_json&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;conn&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;execute&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;  
            &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;  
            INSERT INTO serp_quarantine  
                (query, timestamp, reason, organic_count, payload_hash, http_status, raw_json)  
            VALUES (?, ?, ?, ?, ?, ?, ?)  
            &lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  
            &lt;span class="p"&gt;[&lt;/span&gt; 
                &lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  
                &lt;span class="n"&gt;timestamp&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  
                &lt;span class="n"&gt;reason&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  
                &lt;span class="n"&gt;organic_count&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  
                &lt;span class="n"&gt;payload_hash&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  
                &lt;span class="n"&gt;http_status&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  
                &lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;dumps&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;raw_json&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;default&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;  
            &lt;span class="p"&gt;],&lt;/span&gt;  
        &lt;span class="p"&gt;)&lt;/span&gt;  

    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;insert_batch&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;vdf&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;timestamp&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;  
        &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Insert from the GX validation frame (output of organic_to_validation_df).  

        Store what we validated: normalized url, derived domain, snippet, and positional rank —  
        not a second parse from raw organic dicts (avoids drift vs Great Expectations).  
        &lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;  
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;timestamp&lt;/span&gt; &lt;span class="ow"&gt;is&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;  
            &lt;span class="n"&gt;timestamp&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;datetime&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;now&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;  
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;vdf&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;empty&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;  
            &lt;span class="k"&gt;return&lt;/span&gt;  

        &lt;span class="n"&gt;max_id_result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;conn&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;execute&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;SELECT COALESCE(MAX(id), 0) FROM serp_results&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;fetchone&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;  
        &lt;span class="n"&gt;next_id&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;max_id_result&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;max_id_result&lt;/span&gt; &lt;span class="k"&gt;else&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;  

        &lt;span class="n"&gt;rows&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;  
        &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;idx&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;_&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;row&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nf"&gt;enumerate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;vdf&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;iterrows&lt;/span&gt;&lt;span class="p"&gt;()):&lt;/span&gt;  
            &lt;span class="n"&gt;title&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;""&lt;/span&gt; &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;isna&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;row&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;title&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt; &lt;span class="k"&gt;else&lt;/span&gt; &lt;span class="nf"&gt;str&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;row&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;title&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;  
            &lt;span class="n"&gt;url&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;""&lt;/span&gt; &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;isna&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;row&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;url&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt; &lt;span class="k"&gt;else&lt;/span&gt; &lt;span class="nf"&gt;str&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;row&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;url&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;  
            &lt;span class="n"&gt;snippet&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;""&lt;/span&gt; &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;isna&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;row&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;snippet&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt; &lt;span class="k"&gt;else&lt;/span&gt; &lt;span class="nf"&gt;str&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;row&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;snippet&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;  
            &lt;span class="n"&gt;domain&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;""&lt;/span&gt; &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;isna&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;row&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;domain&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt; &lt;span class="k"&gt;else&lt;/span&gt; &lt;span class="nf"&gt;str&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;row&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;domain&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;  
            &lt;span class="n"&gt;rank_val&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;int&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;row&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;rank&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;  
            &lt;span class="n"&gt;rows&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;  
                &lt;span class="p"&gt;{&lt;/span&gt;  
                    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;id&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;next_id&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;idx&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  
                    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;query&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  
                    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;timestamp&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;timestamp&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  
                    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;result_position&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;idx&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  
                    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;title&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;title&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  
                    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;url&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;url&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  
                    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;snippet&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;snippet&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  
                    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;domain&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;domain&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  
                    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;rank&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;rank_val&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  
                    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;previous_rank&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  
                    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;rank_delta&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  
                &lt;span class="p"&gt;}&lt;/span&gt;  
            &lt;span class="p"&gt;)&lt;/span&gt;  

        &lt;span class="n"&gt;df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;DataFrame&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;rows&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;conn&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;execute&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;  
            INSERT INTO serp_results (id, query, timestamp, result_position, title, url, snippet, domain, rank, previous_rank, rank_delta)  
            SELECT id, query, timestamp, result_position, title, url, snippet, domain, rank, previous_rank, rank_delta FROM df  
        &lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  

    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;get_row_count&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;  
        &lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;conn&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;execute&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;SELECT COUNT(*) FROM serp_results&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;fetchone&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;  
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="k"&gt;else&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;  

    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;close&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;  
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;conn&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;close&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;  

    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;__enter__&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;  
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;  

    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;__exit__&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;args&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;  
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;close&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;What’s happening here?&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The &lt;code&gt;reason&lt;/code&gt; field is an enum-like string: &lt;code&gt;api_error&lt;/code&gt;, &lt;code&gt;organic_empty&lt;/code&gt;, or &lt;code&gt;validation_failed&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;The &lt;code&gt;payload_hash&lt;/code&gt; is a SHA-256 of the canonical JSON — so you can tell if the same bad payload keeps coming back (very useful!)&lt;/li&gt;
&lt;li&gt;Each row stores a JSON blob in &lt;code&gt;raw_json&lt;/code&gt;: for &lt;code&gt;api_error&lt;/code&gt; and &lt;code&gt;organic_empty&lt;/code&gt; it is the normalized SERP dict; for &lt;code&gt;validation_failed&lt;/code&gt; it is &lt;code&gt;{"serp":&lt;/code&gt;[normalized serp here]&lt;code&gt;, "gx_validation":&lt;/code&gt;[GX result dict here]&lt;code&gt;}&lt;/code&gt; so you still have both the SERP payload &lt;em&gt;and&lt;/em&gt; the failing expectation details.
Every bad batch lands in &lt;code&gt;serp_quarantine&lt;/code&gt; with a paper trail instead of silently disappearing or, worse, silently making it into &lt;code&gt;serp_results&lt;/code&gt;. When a batch &lt;strong&gt;passes&lt;/strong&gt; all gates, &lt;code&gt;insert_batch&lt;/code&gt; writes one row per organic result from the validated DataFrame — same normalized fields GX validated. &lt;code&gt;previous_rank&lt;/code&gt; / &lt;code&gt;rank_delta&lt;/code&gt; are reserved for later rank-tracking; they stay null on first ingest.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Step 3: The Gate Order — Why Sequence Matters
&lt;/h2&gt;

&lt;p&gt;Bright Data's parsed SERP JSON path (request URL includes &lt;code&gt;brd_json=1&lt;/code&gt; or, in their docs, &lt;code&gt;brd_json=json&lt;/code&gt;) returns the same kind of structured object you see in their examples and in real dumps — &lt;code&gt;organic&lt;/code&gt;, &lt;code&gt;general&lt;/code&gt;, &lt;code&gt;input&lt;/code&gt;, and so on — not a page of HTML you parse yourself.&lt;/p&gt;

&lt;p&gt;So when Google &lt;em&gt;can't&lt;/em&gt; be reached or nothing usable is extracted, that tends to show up as a &lt;strong&gt;non-200&lt;/strong&gt; from &lt;code&gt;POST /request&lt;/code&gt;, a network error (we treat as &lt;code&gt;api_error&lt;/code&gt;), or an empty &lt;code&gt;organic&lt;/code&gt; array —and not as a captcha HTML document mistaken for a normal organic list. So the gates to implement are exactly three:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;non-&lt;code&gt;200&lt;/code&gt;,&lt;/li&gt;
&lt;li&gt;empty &lt;code&gt;organic&lt;/code&gt;,&lt;/li&gt;
&lt;li&gt;then GX on whatever is left.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  ingest.py
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt;  
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;datetime&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;datetime&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;timezone&lt;/span&gt;  
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;pathlib&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Path&lt;/span&gt;  

&lt;span class="n"&gt;_ROOT&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Path&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;__file__&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;resolve&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="n"&gt;parent&lt;/span&gt;  

&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;bright_data&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;BrightDataClient&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;normalize_serp_payload&lt;/span&gt;  
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;duckdb_store&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;SerpStore&lt;/span&gt;  
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;serp_expectations&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;organic_to_validation_df&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;validate_organic_batch&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;_write_validation_report&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;report_entries&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;  
    &lt;span class="n"&gt;reports_dir&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;_ROOT&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;data&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;reports&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;  
    &lt;span class="n"&gt;reports_dir&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;mkdir&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;parents&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;exist_ok&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  
    &lt;span class="n"&gt;ts&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;datetime&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;now&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;timezone&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;utc&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;strftime&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;%Y%m%dT%H%M%SZ&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  
    &lt;span class="n"&gt;path&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;reports_dir&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;validation_&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;ts&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;.json&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;  
    &lt;span class="n"&gt;doc&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;  
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;generated_at_utc&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;ts&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;batches&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;report_entries&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  
    &lt;span class="p"&gt;}&lt;/span&gt;  
    &lt;span class="n"&gt;path&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;write_text&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;dumps&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;doc&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;indent&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;default&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="n"&gt;encoding&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;utf-8&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;path&lt;/span&gt;  


&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;ingest_live&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;    &lt;span class="n"&gt;queries&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  
    &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  
    &lt;span class="n"&gt;db_path&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  
    &lt;span class="n"&gt;num_results&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  
    &lt;span class="n"&gt;memory_limit&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;2GB&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  
    &lt;span class="n"&gt;serp_json_out&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;,):&lt;/span&gt;  
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;  
    One batch = one query → one SERP response.  
    Order: api_error → organic_empty → GX validate → insert or validation_failed quarantine.  
    &lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;  
    &lt;span class="n"&gt;path&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;db_path&lt;/span&gt; &lt;span class="ow"&gt;or&lt;/span&gt; &lt;span class="nf"&gt;str&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;_ROOT&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;data&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;serp_gx.duckdb&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  
    &lt;span class="n"&gt;client&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;BrightDataClient&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;  
    &lt;span class="n"&gt;total&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;  
    &lt;span class="n"&gt;batches_total&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;  
    &lt;span class="n"&gt;empty_batches&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;  
    &lt;span class="n"&gt;api_error_batches&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;  
    &lt;span class="n"&gt;validation_failed_batches&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;  
    &lt;span class="n"&gt;report_entries&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;  
    &lt;span class="n"&gt;serp_dump_batches&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;  

    &lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="nc"&gt;SerpStore&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;path&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;memory_limit&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;memory_limit&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;store&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;  
        &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;q&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;queries&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;  
            &lt;span class="n"&gt;batches_total&lt;/span&gt; &lt;span class="o"&gt;+=&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;  
            &lt;span class="n"&gt;serp&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;search_with_status&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;q&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;num_results&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;num_results&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  
            &lt;span class="n"&gt;raw&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;normalize_serp_payload&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;serp&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  
            &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;serp_json_out&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;  
                &lt;span class="n"&gt;serp_dump_batches&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;  
                    &lt;span class="p"&gt;{&lt;/span&gt;  
                        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;query&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;q&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  
                        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;http_status&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;serp&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;status_code&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  
                        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;normalized_serp&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;raw&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  
                    &lt;span class="p"&gt;}&lt;/span&gt;  
                &lt;span class="p"&gt;)&lt;/span&gt;  
            &lt;span class="n"&gt;organic&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;raw&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;organic&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="ow"&gt;or&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;  
            &lt;span class="n"&gt;organic_count&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;organic&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  

            &lt;span class="c1"&gt;# Optional (not implemented): best-effort catch-all for an explicit blocked API response  
&lt;/span&gt;            &lt;span class="c1"&gt;# (e.g. substring-match json.dumps(raw) for "captcha" / "unusual traffic") — just in case.  
&lt;/span&gt;            &lt;span class="c1"&gt;# Not needed here, only include if the API you're using can error out like that  
&lt;/span&gt;
            &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;serp&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;status_code&lt;/span&gt; &lt;span class="o"&gt;!=&lt;/span&gt; &lt;span class="mi"&gt;200&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;  
                &lt;span class="n"&gt;api_error_batches&lt;/span&gt; &lt;span class="o"&gt;+=&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;  
                &lt;span class="n"&gt;store&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;insert_quarantine&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;q&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;api_error&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;organic_count&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;raw&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;serp&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;status_code&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  
                &lt;span class="n"&gt;report_entries&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;  
                    &lt;span class="p"&gt;{&lt;/span&gt;  
                        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;query&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;q&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  
                        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;outcome&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;api_error&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  
                        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;http_status&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;serp&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;status_code&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  
                        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;organic_count&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;organic_count&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  
                        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;gx&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  
                    &lt;span class="p"&gt;}&lt;/span&gt;  
                &lt;span class="p"&gt;)&lt;/span&gt;  
                &lt;span class="k"&gt;continue&lt;/span&gt;  
            &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="n"&gt;organic&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;  
                &lt;span class="n"&gt;empty_batches&lt;/span&gt; &lt;span class="o"&gt;+=&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;  
                &lt;span class="n"&gt;store&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;insert_quarantine&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;q&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;organic_empty&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;raw&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;serp&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;status_code&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  
                &lt;span class="n"&gt;report_entries&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;  
                    &lt;span class="p"&gt;{&lt;/span&gt;  
                        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;query&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;q&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  
                        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;outcome&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;organic_empty&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  
                        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;http_status&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;serp&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;status_code&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  
                        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;organic_count&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  
                        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;gx&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  
                    &lt;span class="p"&gt;}&lt;/span&gt;  
                &lt;span class="p"&gt;)&lt;/span&gt;  
                &lt;span class="k"&gt;continue&lt;/span&gt;  

            &lt;span class="n"&gt;vdf&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;organic_to_validation_df&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;organic&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  
            &lt;span class="n"&gt;gx_ok&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;gx_payload&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;validate_organic_batch&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;vdf&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;num_results&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;num_results&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  
            &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="n"&gt;gx_ok&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;  
                &lt;span class="n"&gt;validation_failed_batches&lt;/span&gt; &lt;span class="o"&gt;+=&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;  
                &lt;span class="n"&gt;store&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;insert_quarantine&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;  
                    &lt;span class="n"&gt;q&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  
                    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;validation_failed&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  
                    &lt;span class="n"&gt;organic_count&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  
                    &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;serp&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;raw&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;gx_validation&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;gx_payload&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;  
                    &lt;span class="n"&gt;serp&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;status_code&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  
                &lt;span class="p"&gt;)&lt;/span&gt;  
                &lt;span class="n"&gt;report_entries&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;  
                    &lt;span class="p"&gt;{&lt;/span&gt;  
                        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;query&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;q&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  
                        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;outcome&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;validation_failed&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  
                        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;http_status&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;serp&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;status_code&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  
                        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;organic_count&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;organic_count&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  
                        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;gx&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;gx_payload&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  
                    &lt;span class="p"&gt;}&lt;/span&gt;  
                &lt;span class="p"&gt;)&lt;/span&gt;  
                &lt;span class="k"&gt;continue&lt;/span&gt;  

            &lt;span class="c1"&gt;# Persist the validated DataFrame so DuckDB gets the same normalized url/domain/rank GX checked.  
&lt;/span&gt;            &lt;span class="n"&gt;store&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;insert_batch&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;vdf&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;q&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  
            &lt;span class="n"&gt;total&lt;/span&gt; &lt;span class="o"&gt;+=&lt;/span&gt; &lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;vdf&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  
            &lt;span class="n"&gt;report_entries&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;  
                &lt;span class="p"&gt;{&lt;/span&gt;  
                    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;query&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;q&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  
                    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;outcome&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;inserted&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  
                    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;http_status&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;serp&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;status_code&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  
                    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;organic_count&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;organic_count&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  
                    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;gx&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;gx_payload&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  
                &lt;span class="p"&gt;}&lt;/span&gt;  
            &lt;span class="p"&gt;)&lt;/span&gt;  

        &lt;span class="n"&gt;row_count&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;store&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get_row_count&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;  

    &lt;span class="n"&gt;report_path&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;_write_validation_report&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;report_entries&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  

    &lt;span class="n"&gt;stats&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;  
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;batches_total&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;batches_total&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;empty_batches&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;empty_batches&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;api_error_batches&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;api_error_batches&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;validation_failed_batches&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;validation_failed_batches&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;empty_rate&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;empty_batches&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="n"&gt;batches_total&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;batches_total&lt;/span&gt; &lt;span class="k"&gt;else&lt;/span&gt; &lt;span class="mf"&gt;0.0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;api_error_rate&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;api_error_batches&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="n"&gt;batches_total&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;batches_total&lt;/span&gt; &lt;span class="k"&gt;else&lt;/span&gt; &lt;span class="mf"&gt;0.0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;validation_failed_rate&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;validation_failed_batches&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="n"&gt;batches_total&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;batches_total&lt;/span&gt;  
        &lt;span class="k"&gt;else&lt;/span&gt; &lt;span class="mf"&gt;0.0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;rows_ingested&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;total&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;serp_results_row_count&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;row_count&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;db_path&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;path&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;validation_report_path&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nf"&gt;str&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;report_path&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;  
    &lt;span class="p"&gt;}&lt;/span&gt;  
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;serp_json_out&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;  
        &lt;span class="n"&gt;out_path&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Path&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;serp_json_out&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  
        &lt;span class="n"&gt;out_path&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;parent&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;mkdir&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;parents&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;exist_ok&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  
        &lt;span class="n"&gt;ts&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;datetime&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;now&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;timezone&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;utc&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;strftime&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;%Y%m%dT%H%M%SZ&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  
        &lt;span class="n"&gt;doc&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;  
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;generated_at_utc&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;ts&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;num_results&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;num_results&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;batches&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;serp_dump_batches&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  
        &lt;span class="p"&gt;}&lt;/span&gt;  
        &lt;span class="n"&gt;out_path&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;write_text&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;dumps&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;doc&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;indent&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;default&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="n"&gt;encoding&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;utf-8&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  
        &lt;span class="n"&gt;stats&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;serp_json_path&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;str&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;out_path&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;resolve&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;  

    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;stats&lt;/span&gt;  


&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;main&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;  
    &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;argparse&lt;/span&gt;  

    &lt;span class="n"&gt;p&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;argparse&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;ArgumentParser&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;  
        &lt;span class="n"&gt;description&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;gx: Bright Data SERP → Great Expectations → DuckDB (fail fast on GX failure)&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  
    &lt;span class="p"&gt;)&lt;/span&gt;  
    &lt;span class="n"&gt;p&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;add_argument&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;  
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;queries&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  
        &lt;span class="n"&gt;nargs&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;*&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  
        &lt;span class="n"&gt;default&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt; 
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;inference engineering&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;python duckdb&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Great Expectations data validation&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;machine learning&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;opentelemetry how to&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  
        &lt;span class="p"&gt;],&lt;/span&gt;  
        &lt;span class="n"&gt;help&lt;/span&gt;\&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Search queries (default: five example queries if none given)&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  
    &lt;span class="p"&gt;)&lt;/span&gt;  
    &lt;span class="n"&gt;p&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;add_argument&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;--num-results&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nb"&gt;type&lt;/span&gt;\&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nb"&gt;int&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;default&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  
    &lt;span class="n"&gt;p&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;add_argument&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;--db&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nb"&gt;type&lt;/span&gt;\&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;default&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;help&lt;/span&gt;\&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;DuckDB file path&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  
    &lt;span class="n"&gt;p&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;add_argument&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;  
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;--save-serp-json&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  
        &lt;span class="nb"&gt;type&lt;/span&gt;\&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  
        &lt;span class="n"&gt;default&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  
        &lt;span class="n"&gt;metavar&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;PATH&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  
        &lt;span class="n"&gt;help&lt;/span&gt;\&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Write normalized SERP (+ query, http_status) per batch to this JSON file.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  
    &lt;span class="p"&gt;)&lt;/span&gt;  
    &lt;span class="n"&gt;args&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;p&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;parse_args&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;  
    &lt;span class="n"&gt;out&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;ingest_live&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;  
        &lt;span class="n"&gt;args&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;queries&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  
        &lt;span class="n"&gt;db_path&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;args&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;db&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  
        &lt;span class="n"&gt;num_results&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;args&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;num_results&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  
        &lt;span class="n"&gt;serp_json_out&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;args&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;save_serp_json&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  
    &lt;span class="p"&gt;)&lt;/span&gt;  
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;dumps&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;out&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;indent&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;  


&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;__name__&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;__main__&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;  
    &lt;span class="nf"&gt;main&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The main loop enforces that order we talked about: &lt;code&gt;api_error&lt;/code&gt; and &lt;code&gt;organic_empty&lt;/code&gt; are cheap pre-checks that catch the most common failure modes without spinning up GX at all.&lt;/p&gt;

&lt;p&gt;In fact, GX only runs when there's actual data to validate. &lt;code&gt;store.insert_batch(vdf, q)&lt;/code&gt; passes the &lt;em&gt;validated DataFrame&lt;/em&gt;, not the original raw &lt;code&gt;organic&lt;/code&gt; list — what goes into DuckDB is exactly what GX checked, with the same URL normalization applied.&lt;/p&gt;

&lt;h2&gt;
  
  
  Step 4: What Quality Gates Should You Put on SERP Data?
&lt;/h2&gt;

&lt;p&gt;To reiterate, when we say “Quality gates”, we mean explicit Great Expectations rules on the DataFrame we validate: each “expectation” is simply one condition the batch must satisfy before you trust it for analytics.&lt;/p&gt;

&lt;p&gt;For this pipeline we normalize the &lt;code&gt;organic&lt;/code&gt;list in our API response into a small schema — &lt;code&gt;url&lt;/code&gt;, &lt;code&gt;title&lt;/code&gt;, &lt;code&gt;snippet&lt;/code&gt;, &lt;code&gt;domain&lt;/code&gt;, and &lt;code&gt;rank&lt;/code&gt; — and run the suite on that frame only.&lt;/p&gt;

&lt;h3&gt;
  
  
  serp_expectations.py
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;uuid&lt;/span&gt;  
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;urllib.parse&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;urlparse&lt;/span&gt;  

&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;great_expectations&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;gx&lt;/span&gt;  
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;pandas&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;  
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;great_expectations.data_context.types.base&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;ProgressBarsConfig&lt;/span&gt;  


&lt;span class="c1"&gt;# helpers   
&lt;/span&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;_extract_domain&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;url&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;  
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="n"&gt;url&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;  
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="sh"&gt;""&lt;/span&gt;  
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nf"&gt;urlparse&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;url&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="n"&gt;netloc&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;replace&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;www.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;""&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  


&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;_coerce_rank&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;raw_rank&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;fallback&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;  
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;raw_rank&lt;/span&gt; &lt;span class="ow"&gt;is&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;  
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;fallback&lt;/span&gt;  
    &lt;span class="k"&gt;try&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;  
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nf"&gt;int&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;float&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;raw_rank&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;  
    &lt;span class="nf"&gt;except &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;TypeError&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nb"&gt;ValueError&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;  
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;fallback&lt;/span&gt;  


&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;_normalize_url&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;url&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;  
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="n"&gt;url&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;  
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;url&lt;/span&gt;  
    &lt;span class="k"&gt;try&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;  
        &lt;span class="n"&gt;parsed&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;urlparse&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;url&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  
        &lt;span class="n"&gt;normalised&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;parsed&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;_replace&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;  
            &lt;span class="n"&gt;scheme&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;parsed&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;scheme&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;lower&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt;  
            &lt;span class="n"&gt;fragment&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;""&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  
        &lt;span class="p"&gt;)&lt;/span&gt;  
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;normalised&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;geturl&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;  
    &lt;span class="k"&gt;except&lt;/span&gt; &lt;span class="nb"&gt;Exception&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;  
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;url&lt;/span&gt;  

&lt;span class="c1"&gt;# before anything, normalize the data  
&lt;/span&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;organic_to_validation_df&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;organic&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;  
    &lt;span class="n"&gt;rows&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;  
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;r&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nf"&gt;enumerate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;organic&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;  
        &lt;span class="n"&gt;raw_url&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;r&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;url&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="ow"&gt;or&lt;/span&gt; &lt;span class="n"&gt;r&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;link&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="ow"&gt;or&lt;/span&gt; &lt;span class="sh"&gt;""&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;strip&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;  
        &lt;span class="n"&gt;url&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;_normalize_url&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;raw_url&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  
        &lt;span class="n"&gt;title&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;r&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;title&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="ow"&gt;or&lt;/span&gt; &lt;span class="sh"&gt;""&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;strip&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;  
        &lt;span class="n"&gt;snippet&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;r&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;snippet&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="ow"&gt;or&lt;/span&gt; &lt;span class="n"&gt;r&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;description&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="ow"&gt;or&lt;/span&gt; &lt;span class="sh"&gt;""&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;strip&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;  
        &lt;span class="n"&gt;domain&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;_extract_domain&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;url&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  
        &lt;span class="n"&gt;raw_rank&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;r&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;rank&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  
        &lt;span class="n"&gt;rank&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;_coerce_rank&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;raw_rank&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;fallback&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  
        &lt;span class="n"&gt;rows&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;  
            &lt;span class="p"&gt;{&lt;/span&gt;  
                &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;url&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;url&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  
                &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;title&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;title&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  
                &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;snippet&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;snippet&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  
                &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;domain&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;domain&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  
                &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;rank&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;rank&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  
            &lt;span class="p"&gt;}&lt;/span&gt;  
        &lt;span class="p"&gt;)&lt;/span&gt;  
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;DataFrame&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;rows&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  

&lt;span class="c1"&gt;# now add the actual expectations  
&lt;/span&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;build_serp_expectation_suite&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;num_results&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;  
    &lt;span class="n"&gt;_meta&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;notes&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;SERP organic batch gates for Bright Data brd_json=1 payloads.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;  

    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;gx&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;ExpectationSuite&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;  
        &lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;serp_organic_batch&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  
        &lt;span class="n"&gt;expectations&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt; 
            &lt;span class="n"&gt;gx&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;expectations&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;ExpectColumnValuesToNotBeNull&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;  
                &lt;span class="n"&gt;column&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;url&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  
                &lt;span class="n"&gt;meta&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;_meta&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  
            &lt;span class="p"&gt;),&lt;/span&gt;  
            &lt;span class="n"&gt;gx&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;expectations&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;ExpectColumnValuesToNotBeNull&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;  
                &lt;span class="n"&gt;column&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;title&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  
                &lt;span class="n"&gt;meta&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;_meta&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  
            &lt;span class="p"&gt;),&lt;/span&gt;  
            &lt;span class="n"&gt;gx&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;expectations&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;ExpectColumnValuesToMatchRegex&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;  
                &lt;span class="n"&gt;column&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;url&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  
                &lt;span class="n"&gt;regex&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sa"&gt;r&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;^https?://\\S+&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  
                &lt;span class="n"&gt;meta&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;_meta&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  
            &lt;span class="p"&gt;),&lt;/span&gt;  
            &lt;span class="n"&gt;gx&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;expectations&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;ExpectColumnValuesToBeUnique&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;  
                &lt;span class="n"&gt;column&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;url&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  
                &lt;span class="n"&gt;meta&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;_meta&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  
            &lt;span class="p"&gt;),&lt;/span&gt;  
            &lt;span class="n"&gt;gx&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;expectations&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;ExpectTableRowCountToBeBetween&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;  
                &lt;span class="n"&gt;min_value&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  
                &lt;span class="n"&gt;max_value&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;num_results&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  
                &lt;span class="n"&gt;meta&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;_meta&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  
            &lt;span class="p"&gt;),&lt;/span&gt;  
            &lt;span class="n"&gt;gx&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;expectations&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;ExpectColumnValueLengthsToBeBetween&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;  
                &lt;span class="n"&gt;column&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;title&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  
                &lt;span class="n"&gt;min_value&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  
                &lt;span class="n"&gt;max_value&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;500&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  
                &lt;span class="n"&gt;meta&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;_meta&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  
            &lt;span class="p"&gt;),&lt;/span&gt;  
            &lt;span class="n"&gt;gx&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;expectations&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;ExpectColumnValuesToMatchRegex&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;  
                &lt;span class="n"&gt;column&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;title&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  
                &lt;span class="n"&gt;regex&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sa"&gt;r&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;\\S&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  
                &lt;span class="n"&gt;meta&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;_meta&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  
            &lt;span class="p"&gt;),&lt;/span&gt;  
            &lt;span class="n"&gt;gx&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;expectations&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;ExpectColumnValueLengthsToBeBetween&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;  
                &lt;span class="n"&gt;column&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;domain&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  
                &lt;span class="n"&gt;min_value&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  
                &lt;span class="n"&gt;max_value&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;253&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  
                &lt;span class="n"&gt;meta&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;_meta&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  
            &lt;span class="p"&gt;),&lt;/span&gt;  
            &lt;span class="n"&gt;gx&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;expectations&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;ExpectColumnValuesToBeBetween&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;  
                &lt;span class="n"&gt;column&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;rank&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  
                &lt;span class="n"&gt;min_value&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  
                &lt;span class="n"&gt;max_value&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nf"&gt;float&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;num_results&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;  
                &lt;span class="n"&gt;meta&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;_meta&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  
            &lt;span class="p"&gt;),&lt;/span&gt;  
            &lt;span class="n"&gt;gx&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;expectations&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;ExpectColumnValuesToNotBeNull&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;  
                &lt;span class="n"&gt;column&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;snippet&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  
                &lt;span class="n"&gt;mostly&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;0.8&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  
                &lt;span class="n"&gt;meta&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;_meta&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  
            &lt;span class="p"&gt;),&lt;/span&gt;  
        &lt;span class="p"&gt;],&lt;/span&gt;  
    &lt;span class="p"&gt;)&lt;/span&gt;  

&lt;span class="c1"&gt;# gx in ephemeral mode  
&lt;/span&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;validate_organic_batch&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;num_results&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;  
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;empty&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;  
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="bp"&gt;False&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;  
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;success&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="bp"&gt;False&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;statistics&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;  
                &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;evaluated_expectations&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  
                &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;successful_expectations&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  
                &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;unsuccessful_expectations&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  
            &lt;span class="p"&gt;},&lt;/span&gt;  
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;results&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[],&lt;/span&gt;  
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;exception_message&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;validation_frame_empty&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  
        &lt;span class="p"&gt;}&lt;/span&gt;  

    &lt;span class="n"&gt;context&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;gx&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get_context&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;mode&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;ephemeral&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  
    &lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;variables&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;progress_bars&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;ProgressBarsConfig&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;  
        &lt;span class="n"&gt;globally&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;False&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  
        &lt;span class="n"&gt;metric_calculations&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;False&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  
    &lt;span class="p"&gt;)&lt;/span&gt;  

    &lt;span class="n"&gt;data_source&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;data_sources&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;add_pandas&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;serp_&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;uuid&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;uuid4&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="nb"&gt;hex&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="si"&gt;:&lt;/span&gt;&lt;span class="mi"&gt;12&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  
    &lt;span class="n"&gt;asset&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;data_source&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;add_dataframe_asset&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;organic_batch&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  
    &lt;span class="n"&gt;batch_definition&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;asset&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;add_batch_definition_whole_dataframe&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;whole&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  
    &lt;span class="n"&gt;batch&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;batch_definition&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get_batch&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;batch_parameters&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;dataframe&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;})&lt;/span&gt;  
    &lt;span class="n"&gt;suite&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;build_serp_expectation_suite&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;num_results&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;num_results&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  
    &lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;batch&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;validate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;suite&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  
    &lt;span class="n"&gt;payload&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;to_json_dict&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;  
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nf"&gt;bool&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;success&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="n"&gt;payload&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Let’s walk through the decisions we made here, because they’re not arbitrary.&lt;/p&gt;

&lt;h3&gt;
  
  
  Normalize Before You Validate — Never Inside the Validation Layer
&lt;/h3&gt;

&lt;p&gt;Before anything in GX runs, &lt;code&gt;organic_to_validation_df&lt;/code&gt; normalizes the data:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;_normalize_url&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;url&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;  
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Lowercase scheme, strip fragment.&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;  
    &lt;span class="n"&gt;parsed&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;urlparse&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;url&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;parsed&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;_replace&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;scheme&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;parsed&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;scheme&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;lower&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt; &lt;span class="n"&gt;fragment&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;""&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;geturl&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;  
&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;organic_to_validation_df&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;organic&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;List&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;Dict&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Any&lt;/span&gt;&lt;span class="p"&gt;]])&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;DataFrame&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;  
    &lt;span class="n"&gt;rows&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;  
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;r&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nf"&gt;enumerate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;organic&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;  
        &lt;span class="n"&gt;raw_url&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;r&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;url&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="ow"&gt;or&lt;/span&gt; &lt;span class="n"&gt;r&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;link&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="ow"&gt;or&lt;/span&gt; &lt;span class="sh"&gt;""&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;strip&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;  
        &lt;span class="n"&gt;url&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;_normalize_url&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;raw_url&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  
        &lt;span class="n"&gt;title&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;r&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;title&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="ow"&gt;or&lt;/span&gt; &lt;span class="sh"&gt;""&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;strip&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;  
        &lt;span class="n"&gt;snippet&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;r&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;snippet&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="ow"&gt;or&lt;/span&gt; &lt;span class="n"&gt;r&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;description&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="ow"&gt;or&lt;/span&gt; &lt;span class="sh"&gt;""&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;strip&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;  
        &lt;span class="n"&gt;domain&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;_extract_domain&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;url&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  
        &lt;span class="n"&gt;rank&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;_coerce_rank&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;r&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;rank&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="n"&gt;fallback&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  
        &lt;span class="n"&gt;rows&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;url&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;url&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;title&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;title&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;snippet&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;snippet&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;domain&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;domain&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;rank&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;rank&lt;/span&gt;&lt;span class="p"&gt;})&lt;/span&gt;  
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;DataFrame&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;rows&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;A few things happening here:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Bright Data’s &lt;code&gt;brd_json=1&lt;/code&gt; responses use &lt;code&gt;link&lt;/code&gt; as the key, not &lt;code&gt;url&lt;/code&gt;. We normalize this before GX sees it. You’ve probably used such remappings plenty of times, for most API responses.&lt;/li&gt;
&lt;li&gt;Fragments are stripped and scheme is lowercased before the uniqueness check — so &lt;code&gt;https://example.com/page#section&lt;/code&gt; and &lt;code&gt;https://example.com/page&lt;/code&gt; don't pass as different URLs.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;_coerce_rank&lt;/code&gt; handles the fact that rank values from APIs can come back as &lt;code&gt;int&lt;/code&gt;, &lt;code&gt;float&lt;/code&gt;, numpy scalars, or even strings. Coerce to a consistent type &lt;em&gt;before&lt;/em&gt; GX, not inside an expectation.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Ephemeral mode
&lt;/h3&gt;

&lt;p&gt;The &lt;a href="https://docs.greatexpectations.io/docs/reference/api/data_context/ephemeraldatacontext_class/" rel="noopener noreferrer"&gt;[mode="ephemeral"]&lt;/a&gt; simply means no config files, no persisted context directory, and no &lt;code&gt;great_expectations.yml&lt;/code&gt; — a pure in-process validator you spin up per batch. Progress bars are turned off in code when batching many runs. We do this to make GX lightweight and easier to get started with.&lt;/p&gt;

&lt;h3&gt;
  
  
  Null Checks Must Come Before Regex Gate
&lt;/h3&gt;

&lt;p&gt;&lt;code&gt;ExpectColumnValuesToNotBeNull&lt;/code&gt; on &lt;code&gt;url&lt;/code&gt; and &lt;code&gt;title&lt;/code&gt; must come before the regex and length gates. That's because GX's &lt;code&gt;ExpectColumnValuesToMatchRegex&lt;/code&gt; silently skips null values by default — it only evaluates non-null rows. Without an explicit null check, a null URL would slip through the regex gate undetected.&lt;/p&gt;

&lt;h3&gt;
  
  
  Why the URL Regex Is Intentionally Loose
&lt;/h3&gt;

&lt;p&gt;&lt;code&gt;^https?://\S+&lt;/code&gt; is deliberately not a strict RFC 3986 URL validator. Overly strict URL validation generates false positives on legitimately unusual URLs — CDN URLs, tracking URLs, URLs with encoded characters. The goal here is to catch the two actual failure modes: an empty string and a non-URL value like &lt;code&gt;"N/A"&lt;/code&gt; or a relative path.&lt;/p&gt;

&lt;h3&gt;
  
  
  URL Uniqueness Catches Parse Artifacts
&lt;/h3&gt;

&lt;p&gt;&lt;code&gt;ExpectColumnValuesToBeUnique&lt;/code&gt; on &lt;code&gt;url&lt;/code&gt; catches cases where the same URL appears twice in a single batch. That's a real thing that can happen with pagination artifacts or certain response formats. In a ranking pipeline, a duplicate URL means your rank counts are wrong.&lt;/p&gt;

&lt;h3&gt;
  
  
  Use a Fixed Ceiling for Rank and Row Count, Not a Self-Adjusting One
&lt;/h3&gt;

&lt;p&gt;Notice that &lt;code&gt;max_value=float(num_results)&lt;/code&gt; for the rank gate, and &lt;code&gt;max_value=num_results&lt;/code&gt; for the row count, both use the same value — the same number we used for our SERP API. This is intentional and it matters.&lt;/p&gt;

&lt;p&gt;A common mistake is to derive the ceiling from the batch itself:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Don't do this  
&lt;/span&gt;&lt;span class="n"&gt;max_rank&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;max&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;num_results&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nf"&gt;int&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;float&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;vdf&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;rank&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="nf"&gt;max&lt;/span&gt;&lt;span class="p"&gt;())))&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Why? Well, if the API returns a result with &lt;code&gt;rank=47&lt;/code&gt; for a 10-result query, &lt;code&gt;max_rank&lt;/code&gt; becomes 47 and the gate passes. You've made the gate self-adjust to whatever the data says, which means the gate can &lt;em&gt;never&lt;/em&gt; actually fail on an out-of-range rank. Using a fixed &lt;code&gt;num_results&lt;/code&gt; means an unexpected rank value will actually trigger a quarantine.&lt;/p&gt;

&lt;h3&gt;
  
  
  Optional Fields Need Soft Gates, Not Hard Ones
&lt;/h3&gt;

&lt;p&gt;&lt;code&gt;snippet&lt;/code&gt; uses &lt;code&gt;mostly=0.8&lt;/code&gt; rather than a hard non-null gate. That's because Google legitimately suppresses snippets for some result types — videos, certain knowledge panel entries, sitelinks. A hard non-null gate on snippet would quarantine perfectly valid batches. The &lt;code&gt;mostly&lt;/code&gt; parameter lets you say "80% of rows must pass this check" — if more than 20% of rows have no snippet, that's a signal the response structure has changed, not normal Google behaviour.&lt;/p&gt;

&lt;h2&gt;
  
  
  What Does a Failed GX Validation Actually Look Like?
&lt;/h2&gt;

&lt;p&gt;When a batch fails validation, the &lt;code&gt;gx&lt;/code&gt; block in the report tells you exactly which expectation failed and what the unexpected values were. Here's a representative example — two organic rows end up with the &lt;strong&gt;same&lt;/strong&gt; normalized URL (duplicate rows in the response, or two URLs that collapse to one after fragment stripping), so the uniqueness gate fires:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;  
  &lt;/span&gt;&lt;span class="nl"&gt;"query"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"inference engineering"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;  
  &lt;/span&gt;&lt;span class="nl"&gt;"outcome"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"validation_failed"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;  
  &lt;/span&gt;&lt;span class="nl"&gt;"http_status"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;200&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;  
  &lt;/span&gt;&lt;span class="nl"&gt;"organic_count"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;  
  &lt;/span&gt;&lt;span class="nl"&gt;"gx"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;  
    &lt;/span&gt;&lt;span class="nl"&gt;"success"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kc"&gt;false&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;  
    &lt;/span&gt;&lt;span class="nl"&gt;"statistics"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;  
      &lt;/span&gt;&lt;span class="nl"&gt;"evaluated_expectations"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;  
      &lt;/span&gt;&lt;span class="nl"&gt;"successful_expectations"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;9&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;  
      &lt;/span&gt;&lt;span class="nl"&gt;"unsuccessful_expectations"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="w"&gt;  
    &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;  
    &lt;/span&gt;&lt;span class="nl"&gt;"results"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="w"&gt; 
      &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;  
        &lt;/span&gt;&lt;span class="nl"&gt;"success"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kc"&gt;false&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;  
        &lt;/span&gt;&lt;span class="nl"&gt;"expectation_config"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;  
          &lt;/span&gt;&lt;span class="nl"&gt;"type"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"expect_column_values_to_be_unique"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;  
          &lt;/span&gt;&lt;span class="nl"&gt;"kwargs"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"column"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"url"&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;  
        &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;  
        &lt;/span&gt;&lt;span class="nl"&gt;"result"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;  
          &lt;/span&gt;&lt;span class="nl"&gt;"element_count"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;  
          &lt;/span&gt;&lt;span class="nl"&gt;"unexpected_count"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;  
          &lt;/span&gt;&lt;span class="nl"&gt;"unexpected_percent"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mf"&gt;100.0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;  
          &lt;/span&gt;&lt;span class="nl"&gt;"partial_unexpected_list"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="w"&gt; 
            &lt;/span&gt;&lt;span class="s2"&gt;"https://example.com/page?q=1"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;  
            &lt;/span&gt;&lt;span class="s2"&gt;"https://example.com/page?q=1"&lt;/span&gt;&lt;span class="w"&gt;  
          &lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt;  
        &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;  
      &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;  
    &lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt;  
  &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;  
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The &lt;code&gt;http_status&lt;/code&gt; is &lt;code&gt;200&lt;/code&gt;. Without the quality gate, this batch would have been inserted. The downstream join on URL would have silently doubled a result's apparent frequency in your analytics.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Validation Report
&lt;/h2&gt;

&lt;p&gt;After every run, we emit a timestamped JSON report with one entry per query:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;  
  &lt;/span&gt;&lt;span class="nl"&gt;"generated_at_utc"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"20260326T102005Z"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;  
  &lt;/span&gt;&lt;span class="nl"&gt;"batches"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="w"&gt; 
    &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;  
      &lt;/span&gt;&lt;span class="nl"&gt;"query"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"inference engineering"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;  
      &lt;/span&gt;&lt;span class="nl"&gt;"outcome"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"inserted"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;  
      &lt;/span&gt;&lt;span class="nl"&gt;"http_status"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;200&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;  
      &lt;/span&gt;&lt;span class="nl"&gt;"organic_count"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;8&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;  
      &lt;/span&gt;&lt;span class="nl"&gt;"gx"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"success"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kc"&gt;true&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"statistics"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"evaluated_expectations"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;...&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;  
    &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;  
    &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;  
      &lt;/span&gt;&lt;span class="nl"&gt;"query"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"some broken query"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;  
      &lt;/span&gt;&lt;span class="nl"&gt;"outcome"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"organic_empty"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;  
      &lt;/span&gt;&lt;span class="nl"&gt;"http_status"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;200&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;  
      &lt;/span&gt;&lt;span class="nl"&gt;"organic_count"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;  
      &lt;/span&gt;&lt;span class="nl"&gt;"gx"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kc"&gt;null&lt;/span&gt;&lt;span class="w"&gt;  
    &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;  
  &lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt;  
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The possible &lt;code&gt;outcome&lt;/code&gt; values are &lt;code&gt;inserted&lt;/code&gt;, &lt;code&gt;api_error&lt;/code&gt;, &lt;code&gt;organic_empty&lt;/code&gt;, and &lt;code&gt;validation_failed&lt;/code&gt;. Over time, watching the rate of each outcome per query is how you'd catch gradual degradation — a rising &lt;code&gt;organic_empty&lt;/code&gt; rate is a signal before it becomes a crisis.&lt;/p&gt;

&lt;h2&gt;
  
  
  How to Run the Pipeline
&lt;/h2&gt;

&lt;p&gt;Put the four modules in one package or folder on your &lt;code&gt;PYTHONPATH&lt;/code&gt;, install dependencies (&lt;code&gt;pip install -r requirements.txt&lt;/code&gt;), and set Bright Data credentials (&lt;code&gt;BRIGHT_DATA_API_KEY&lt;/code&gt;, &lt;code&gt;BRIGHT_DATA_ZONE&lt;/code&gt;, plus optional &lt;code&gt;BRIGHT_DATA_COUNTRY&lt;/code&gt;) in a &lt;code&gt;.env&lt;/code&gt; file next to the code or in the environment.&lt;/p&gt;

&lt;p&gt;Our orchestrator uses &lt;code&gt;argparse&lt;/code&gt;. So here’s some useful flags worth including, mostly for quality-of-life:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;--num-results&lt;/code&gt; (default &lt;code&gt;10&lt;/code&gt;),&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;--db&lt;/code&gt; (override the DuckDB file path),&lt;/li&gt;
&lt;li&gt;and optionally, I’ve found using a &lt;code&gt;--save-serp-json PATH&lt;/code&gt; to dump normalized SERP payloads while tuning your rules &lt;em&gt;really&lt;/em&gt; helps.
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# One query  &lt;/span&gt;
python ingest.py &lt;span class="s2"&gt;"inference engineering"&lt;/span&gt;  

&lt;span class="c"&gt;# Multiple queries (defaults to five example queries if you pass none)  &lt;/span&gt;
python ingest.py  

&lt;span class="c"&gt;# Example: 20 results, custom DB path  &lt;/span&gt;
python ingest.py &lt;span class="s2"&gt;"python asyncio"&lt;/span&gt; &lt;span class="nt"&gt;--num-results&lt;/span&gt; 20 &lt;span class="nt"&gt;--db&lt;/span&gt; ./data/my_run.duckdb
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;On success, the script prints a small JSON summary — batch counts, rows written, paths to the DuckDB file and the validation report:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;  
  &lt;/span&gt;&lt;span class="nl"&gt;"batches_total"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;  
  &lt;/span&gt;&lt;span class="nl"&gt;"empty_batches"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;  
  &lt;/span&gt;&lt;span class="nl"&gt;"api_error_batches"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;  
  &lt;/span&gt;&lt;span class="nl"&gt;"validation_failed_batches"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;  
  &lt;/span&gt;&lt;span class="nl"&gt;"empty_rate"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mf"&gt;0.0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;  
  &lt;/span&gt;&lt;span class="nl"&gt;"api_error_rate"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mf"&gt;0.0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;  
  &lt;/span&gt;&lt;span class="nl"&gt;"validation_failed_rate"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mf"&gt;0.0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;  
  &lt;/span&gt;&lt;span class="nl"&gt;"rows_ingested"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;8&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;  
  &lt;/span&gt;&lt;span class="nl"&gt;"serp_results_row_count"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;8&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;  
  &lt;/span&gt;&lt;span class="nl"&gt;"db_path"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"./data/serp_gx.duckdb"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;  
  &lt;/span&gt;&lt;span class="nl"&gt;"validation_report_path"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"./data/reports/validation_20260326T102005Z.json"&lt;/span&gt;&lt;span class="w"&gt;  
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That’s pretty much everything. I’ve walked you through each file. Just know &lt;code&gt;ingest.py&lt;/code&gt; is the entry point.&lt;/p&gt;

&lt;h2&gt;
  
  
  Frequently Asked Questions (FAQ)
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Q: Why not just write the validation logic myself?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;A:&lt;/strong&gt; You can, and for one or two checks, you probably should! The case for GX is not that something like&lt;code&gt;if not url or not url.startswith("http")&lt;/code&gt; is hard to write — it's that &lt;em&gt;twenty&lt;/em&gt; of those checks scattered across your pipeline become hard to read, hard to audit, and easy to accidentally skip when you're in a hurry. GX gives you a single place where &lt;em&gt;all&lt;/em&gt; your rules live, a consistent result structure across every check, and a report that tells you exactly which rule failed and on which values. It saves a ton of trouble in the long run.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Q: Does running GX on every batch slow the pipeline down?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;A:&lt;/strong&gt; In practice, no — not at the batch sizes typical of API calls (let’s say 10–50 rows per query). To make double sure, this is why I run GX in Ephemeral context mode (mode=”ephemeral”) — that avoids any file I/O or context persistence overhead. The validation itself is pandas operations under the hood anyway.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Q: What happens to bad data in a pipeline without validation?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;A:&lt;/strong&gt; It gets inserted and you don’t find out until something downstream breaks — or worse, until it produces wrong answers that are just plausible enough to go unnoticed. A &lt;code&gt;200 OK&lt;/code&gt; with duplicate rows, empty titles, or out-of-range values looks like a success on the wire. Without an explicit quality gate, that batch goes straight into your database and silently skews every query that touches it.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Q: How do I know what rules to write for my own data?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;A:&lt;/strong&gt; Sample first, then write rules. Pull a handful of real responses from your source API, inspect the fields and edge cases, then encode what “good” looks like. Rules written speculatively against an imagined schema generate false positives; rules written against real samples catch actual failure modes. Start with the fields you’d join on or aggregate in your analytics, and &lt;em&gt;only&lt;/em&gt; add gates for everything else if you’ve seen it break.&lt;/p&gt;

&lt;h2&gt;
  
  
  What GX + Bright Data Gives You
&lt;/h2&gt;

&lt;p&gt;So why does this pattern work reliably?&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Bright Data handles the upstream layer.&lt;/strong&gt; Proxy rotation, bot detection, structured extraction — all abstracted behind a single API call that returns structured JSON.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Great Expectations handles downstream trust.&lt;/strong&gt; It doesn’t fix bad data. It measures whether incoming data meets your rules before you trust it with your analytics.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;The quarantine table (in DuckDB) gives you an audit trail.&lt;/strong&gt; Not just “something went wrong,” but what the payload was, which expectations failed, and when. You can query the quarantine table to understand your pipeline’s health over time.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Web pipelines that don’t have explicit quality gates just gradually produce wrong answers. Catching &lt;em&gt;that&lt;/em&gt; before it reaches your database is the whole point.&lt;/p&gt;

</description>
      <category>python</category>
      <category>programming</category>
      <category>api</category>
      <category>tutorial</category>
    </item>
    <item>
      <title>Why You Should Add Observability to Your Data Extraction with OpenTelemetry</title>
      <dc:creator>Prithwish Nath</dc:creator>
      <pubDate>Mon, 06 Apr 2026 03:54:52 +0000</pubDate>
      <link>https://dev.to/prithwish_nath/why-you-should-add-observability-to-your-data-extraction-with-opentelemetry-p1k</link>
      <guid>https://dev.to/prithwish_nath/why-you-should-add-observability-to-your-data-extraction-with-opentelemetry-p1k</guid>
      <description>&lt;p&gt;&lt;strong&gt;&lt;em&gt;TL;DR: This is a step-by-step tutorial on the quickest way to add observability to any data ingestion pipeline — whether you’re scraping or using an API.&lt;/em&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Anything that fetches data at scale has a class of failure that error handling won’t catch. Not because your error handling code is &lt;em&gt;bad&lt;/em&gt; (it probably isn’t) but because retries that &lt;em&gt;eventually&lt;/em&gt; succeed, queries that take 10x longer than average, and domains that silently time out — &lt;strong&gt;don’t throw exceptions because they’re not technically errors.&lt;/strong&gt; And you’ll never know. The solution is actually adding proper &lt;a href="https://www.redhat.com/en/topics/devops/what-is-observability" rel="noopener noreferrer"&gt;observability&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Overkill? Not at all. Because a data pipeline — &lt;em&gt;any&lt;/em&gt; data pipeline — with network calls, retries, timeouts, and wildly variable latency across different queries and domains is a textbook &lt;a href="https://www.atlassian.com/microservices/microservices-architecture/distributed-architecture" rel="noopener noreferrer"&gt;distributed system&lt;/a&gt;. It has all the same failure modes, and so it deserves the same tooling.&lt;/p&gt;

&lt;p&gt;In this post, we’ll build a SERP pipeline on top of &lt;a href="https://get.brightdata.com/bd7914?utm_content=why_you_should_add_observability_to_your_data_extraction_with_opentelemetry" rel="noopener noreferrer"&gt;Bright Data&lt;/a&gt;’s API and instrument it with &lt;a href="https://opentelemetry.io/" rel="noopener noreferrer"&gt;OpenTelemetry&lt;/a&gt; (See: &lt;a href="https://opentelemetry.io/docs/languages/python" rel="noopener noreferrer"&gt;Python docs&lt;/a&gt;), the open-source standard for distributed tracing. Bright Data reduces blocks and proxy headaches out of the box — but proper Otel tracing shows you exactly where risk remains.&lt;/p&gt;

&lt;p&gt;By the end, you’ll be able to see what each call costs you in time, where retries are hiding, and which queries are slow.&lt;/p&gt;

&lt;h2&gt;
  
  
  What This Actually Gets You
&lt;/h2&gt;

&lt;p&gt;What I’m trying to do is surface problems you’ll probably run into, and otherwise just silently pay for. These patterns map nearly 1:1 to wherever your data ingest pipelines look like in production.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Retry storm detection.&lt;/strong&gt; If a domain starts blocking aggressively, you won’t see it as hard errors anymore, but a creeping spike in &lt;code&gt;scraper.retries &amp;gt; 0&lt;/code&gt; &lt;a href="https://opentelemetry.io/docs/concepts/signals/traces/#spans" rel="noopener noreferrer"&gt;spans&lt;/a&gt;. That’s your early warning before you trigger a full ban or blow past your proxy quota for the month.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Actual cost visibility.&lt;/strong&gt; Every retry is another proxy request. If you’re paying per request or per GB, scraper.retries on your spans maps directly to a line item on your invoice. You can aggregate this and alert on it — I haven’t been doing this before adding OTel, and most likely, neither have you. 😅&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Per-query latency profiling.&lt;/strong&gt; Some queries are just structurally slower — more competitive terms, heavier result pages, more contention in the proxy pool. Traces let you see this per-query instead of as a blended average that makes everything look fine. Once you can see the outliers, you can do something about them.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Basically, if you take ONE thing away from this read, let it be this: &lt;strong&gt;data pipelines have exactly the same failure modes as any distributed system&lt;/strong&gt; — timeouts, partial failures, retry amplification, silent degradation — whether data is obtained via an API call or just scraping.&lt;/p&gt;

&lt;p&gt;So let’s build a data collection stack you can reason about. And, as it turns out, the tooling you’d use for microservices works perfectly well here too.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Setup
&lt;/h2&gt;

&lt;p&gt;Here’s what you need:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;opentelemetry-api&amp;gt;=1.20.0  
opentelemetry-sdk&amp;gt;=1.20.0  
opentelemetry-instrumentation-requests&amp;gt;=0.41b0  
opentelemetry-exporter-otlp-proto-http&amp;gt;=1.20.0  
requests&amp;gt;=2.28.0  
python-dotenv&amp;gt;=1.0.0
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The important ones:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://pypi.org/project/opentelemetry-instrumentation-requests/" rel="noopener noreferrer"&gt;opentelemetry-instrumentation-requests&lt;/a&gt; — this gives us automatic HTTP tracing. Zero manual work.&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://pypi.org/project/opentelemetry-exporter-otlp-proto-http/" rel="noopener noreferrer"&gt;opentelemetry-exporter-otlp-proto-http&lt;/a&gt; — for when we want to send traces somewhere real, like &lt;a href="https://www.jaegertracing.io/docs/" rel="noopener noreferrer"&gt;Jaeger&lt;/a&gt;.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Before running &lt;code&gt;pip install -r requirements.txt&lt;/code&gt;, create a &lt;code&gt;.env&lt;/code&gt; file with:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;BRIGHT_DATA_API_KEY=your_api_key  
BRIGHT_DATA_ZONE=serp # or your SERP zone name from Bright Data dashboard  
BRIGHT_DATA_COUNTRY=us # optional  
OTEL_EXPORTER=console   # set to "jaeger" to send traces to Jaeger (must be running)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The client reads these on instantiation. Replace with your own API credentials if you need them, but don’t forget &lt;code&gt;OTEL_EXPORTER&lt;/code&gt; — it controls where traces go.&lt;/p&gt;

&lt;h2&gt;
  
  
  Initializing OpenTelemetry
&lt;/h2&gt;

&lt;p&gt;We want two modes: a console exporter for development where traces print right in the terminal, and an &lt;a href="https://opentelemetry.io/docs/specs/otlp/" rel="noopener noreferrer"&gt;OTLP&lt;/a&gt; exporter for production. A single env var switches between them:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;os&lt;/span&gt;  
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;opentelemetry&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;trace&lt;/span&gt;  
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;opentelemetry.instrumentation.requests&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;RequestsInstrumentor&lt;/span&gt;  
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;opentelemetry.sdk.trace&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;TracerProvider&lt;/span&gt;  
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;opentelemetry.sdk.trace.export&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;BatchSpanProcessor&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;ConsoleSpanExporter&lt;/span&gt;  
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;opentelemetry.sdk.resources&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;SERVICE_NAME&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Resource&lt;/span&gt;  


&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;init_otel&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;exporter&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;console&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;service_name&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;bd-scraper&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;  
    &lt;span class="n"&gt;resource&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;Resource&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;create&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;attributes&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="n"&gt;SERVICE_NAME&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;service_name&lt;/span&gt;&lt;span class="p"&gt;})&lt;/span&gt;  
    &lt;span class="n"&gt;provider&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;TracerProvider&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;resource&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;resource&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  

    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;exporter&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;jaeger&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;  
        &lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;opentelemetry.exporter.otlp.proto.http.trace_exporter&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;OTLPSpanExporter&lt;/span&gt;  
        &lt;span class="n"&gt;endpoint&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;os&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;getenv&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;OTEL_EXPORTER_OTLP_ENDPOINT&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;http://localhost:4318&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  
        &lt;span class="n"&gt;processor&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;BatchSpanProcessor&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nc"&gt;OTLPSpanExporter&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;endpoint&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;endpoint&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;/v1/traces&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;  
    &lt;span class="k"&gt;else&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;  
        &lt;span class="n"&gt;processor&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;BatchSpanProcessor&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nc"&gt;ConsoleSpanExporter&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;  

    &lt;span class="n"&gt;provider&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;add_span_processor&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;processor&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  
    &lt;span class="n"&gt;trace&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;set_tracer_provider&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;provider&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  
    &lt;span class="nc"&gt;RequestsInstrumentor&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="nf"&gt;instrument&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;  &lt;span class="c1"&gt;# this is where the magic happens!  
&lt;/span&gt;    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;trace&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get_tracer&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;service_name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;1.0.0&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That one line — &lt;a href="https://opentelemetry-python-contrib.readthedocs.io/en/latest/instrumentation/requests/requests.html" rel="noopener noreferrer"&gt;RequestsInstrumentor().instrument()&lt;/a&gt; — hooks into the requests library globally. Every HTTP call your code makes from this point forward gets a &lt;a href="https://opentelemetry.io/docs/concepts/signals/traces/#spans" rel="noopener noreferrer"&gt;trace span&lt;/a&gt;, including the ones in third-party code you didn’t write. You get that for free.&lt;/p&gt;

&lt;p&gt;One thing that’ll catch you out if you’re not careful: &lt;code&gt;init_otel&lt;/code&gt; must run &lt;em&gt;before&lt;/em&gt; any &lt;code&gt;requests.Session&lt;/code&gt; is created. That means calling it before importing &lt;code&gt;BrightDataClient&lt;/code&gt; in your entrypoint. Yes, the import order matters here.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Client with Custom Spans
&lt;/h2&gt;

&lt;p&gt;Automatic HTTP tracing is great, but it only tells you about the transport layer. It has no idea this call was for the query “&lt;code&gt;machine learning&lt;/code&gt;”, or that it targeted&lt;code&gt;google.com&lt;/code&gt;, or that it had to retry once before it worked. That context is what custom spans are for.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt;  
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;os&lt;/span&gt;  
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;time&lt;/span&gt;  
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;requests&lt;/span&gt;  
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;typing&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Optional&lt;/span&gt;  

&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;dotenv&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;load_dotenv&lt;/span&gt;  
&lt;span class="nf"&gt;load_dotenv&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;  


&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;BrightDataClient&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;  
    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;__init__&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  
        &lt;span class="n"&gt;api_key&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;Optional&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  
        &lt;span class="n"&gt;zone&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;Optional&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  
        &lt;span class="n"&gt;country&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;Optional&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;    &lt;span class="p"&gt;):&lt;/span&gt;  
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;api_key&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;api_key&lt;/span&gt; &lt;span class="ow"&gt;or&lt;/span&gt; &lt;span class="n"&gt;os&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;getenv&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;BRIGHT_DATA_API_KEY&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;zone&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;zone&lt;/span&gt; &lt;span class="ow"&gt;or&lt;/span&gt; &lt;span class="n"&gt;os&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;getenv&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;BRIGHT_DATA_ZONE&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;country&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;country&lt;/span&gt; &lt;span class="ow"&gt;or&lt;/span&gt; &lt;span class="n"&gt;os&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;getenv&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;BRIGHT_DATA_COUNTRY&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;api_endpoint&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;https://api.brightdata.com/request&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;  

        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;api_key&lt;/span&gt; &lt;span class="ow"&gt;or&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;zone&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;  
            &lt;span class="k"&gt;raise&lt;/span&gt; &lt;span class="nc"&gt;ValueError&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;BRIGHT_DATA_API_KEY and BRIGHT_DATA_ZONE required (env or constructor)&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  

        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;session&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;requests&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Session&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;  
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;session&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;headers&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;update&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;  
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Content-Type&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;application/json&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Authorization&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Bearer &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;api_key&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  
        &lt;span class="p"&gt;})&lt;/span&gt;  

    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;search&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  
        &lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  
        &lt;span class="n"&gt;num_results&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;int&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  
        &lt;span class="n"&gt;language&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;Optional&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  
        &lt;span class="n"&gt;country&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;Optional&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  
        &lt;span class="n"&gt;max_retries&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;int&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;    &lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;dict&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;  
        &lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;opentelemetry&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;trace&lt;/span&gt;  
        &lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;opentelemetry.trace&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;StatusCode&lt;/span&gt;  

        &lt;span class="n"&gt;tracer&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;trace&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get_tracer&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;__name__&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;1.0.0&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  
        &lt;span class="n"&gt;target_domain&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;google.com&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;  

        &lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="n"&gt;tracer&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;start_as_current_span&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;bright_data.search&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;span&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;  
            &lt;span class="n"&gt;span&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;set_attribute&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;scraper.query&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  
            &lt;span class="n"&gt;span&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;set_attribute&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;scraper.target_domain&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;target_domain&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  
            &lt;span class="n"&gt;span&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;set_attribute&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;scraper.num_results&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;num_results&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  

            &lt;span class="n"&gt;start&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;time&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;perf_counter&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;  
            &lt;span class="n"&gt;last_err&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;  

            &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;attempt&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nf"&gt;range&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;max_retries&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;  
                &lt;span class="k"&gt;try&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;  
                    &lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;_do_search&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;num_results&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;language&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;country&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  
                    &lt;span class="n"&gt;latency_ms&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;time&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;perf_counter&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;start&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="mi"&gt;1000&lt;/span&gt;  
                    &lt;span class="n"&gt;span&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;set_attribute&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;scraper.latency_ms&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nf"&gt;round&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;latency_ms&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;  
                    &lt;span class="n"&gt;span&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;set_attribute&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;scraper.retries&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;attempt&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  
                    &lt;span class="c1"&gt;# clean success — no retries needed  
&lt;/span&gt;                    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;attempt&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;  
                        &lt;span class="n"&gt;span&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;set_status&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;StatusCode&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;OK&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  
                    &lt;span class="k"&gt;else&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;  
                        &lt;span class="c1"&gt;# recovered, but we want this surfaced in Jaeger  
&lt;/span&gt;                        &lt;span class="n"&gt;span&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;set_status&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;StatusCode&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ERROR&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Recovered after retry&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  
                    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;result&lt;/span&gt;  
                &lt;span class="k"&gt;except&lt;/span&gt; &lt;span class="nb"&gt;Exception&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;  
                    &lt;span class="n"&gt;last_err&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;e&lt;/span&gt;  
                    &lt;span class="n"&gt;span&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;set_attribute&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;scraper.retries&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;attempt&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  
                    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;attempt&lt;/span&gt; &lt;span class="err"&gt;`&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="n"&gt;max_retries&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;  
                        &lt;span class="n"&gt;time&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;sleep&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mf"&gt;0.5&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;attempt&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;  

            &lt;span class="c1"&gt;# all retries exhausted  
&lt;/span&gt;            &lt;span class="n"&gt;span&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;set_attribute&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;scraper.error&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nf"&gt;str&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;last_err&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;  
            &lt;span class="n"&gt;span&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;set_status&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;StatusCode&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ERROR&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nf"&gt;str&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;last_err&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;  
            &lt;span class="n"&gt;span&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;record_exception&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;last_err&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  
            &lt;span class="k"&gt;raise&lt;/span&gt; &lt;span class="n"&gt;last_err&lt;/span&gt;  

    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;_do_search&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  
        &lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  
        &lt;span class="n"&gt;num_results&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;int&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  
        &lt;span class="n"&gt;language&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;Optional&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;  
        &lt;span class="n"&gt;country&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;Optional&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;    &lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt;&lt;span class="err"&gt;`&lt;/span&gt; &lt;span class="nb"&gt;dict&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;  
        &lt;span class="n"&gt;search_url&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;  
            &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;https://www.google.com/search&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;  
            &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;?q=&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;requests&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;utils&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;quote&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;&amp;amp;num=&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;num_results&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;&amp;amp;brd_json=1&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;  
        &lt;span class="p"&gt;)&lt;/span&gt;  
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;language&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;  
            &lt;span class="n"&gt;search_url&lt;/span&gt; &lt;span class="o"&gt;+=&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;&amp;amp;hl=&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;language&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;&amp;amp;lr=lang_&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;language&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;  
        &lt;span class="n"&gt;target_country&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;country&lt;/span&gt; &lt;span class="ow"&gt;or&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;country&lt;/span&gt;  
        &lt;span class="n"&gt;payload&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;zone&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;zone&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;url&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;search_url&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;format&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;json&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;  
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;target_country&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;  
            &lt;span class="n"&gt;payload&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;country&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;target_country&lt;/span&gt;  

        &lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;session&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;post&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;api_endpoint&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;payload&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;timeout&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;30&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  
        &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;raise_for_status&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;  
        &lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;json&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;  
        &lt;span class="c1"&gt;# Bright Data may return body as JSON string — unpack it  
&lt;/span&gt;        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="nf"&gt;isinstance&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nb"&gt;dict&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="ow"&gt;and&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;body&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;  
            &lt;span class="n"&gt;body&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;body&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;  
            &lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;loads&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;body&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="nf"&gt;isinstance&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;body&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;else&lt;/span&gt; &lt;span class="n"&gt;body&lt;/span&gt;  
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;result&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;I’m using a &lt;a href="https://get.brightdata.com/bd-serp-api?utm_content=why_you_should_add_observability_to_your_data_extraction_with_opentelemetry" rel="noopener noreferrer"&gt;SERP API&lt;/a&gt; for data, but swap it out with whatever you’re using. The concepts apply to anything. Also, a couple of things worth understanding here:&lt;/p&gt;

&lt;h3&gt;
  
  
  The Parent-Child Relationship
&lt;/h3&gt;

&lt;p&gt;The &lt;code&gt;_do_search&lt;/code&gt; helper builds the target URL and POSTs it to our API endpoint (&lt;code&gt;api.brightdata.com/request&lt;/code&gt;). When that call runs, the &lt;code&gt;RequestsInstrumentor&lt;/code&gt;auto-creates a child POST span &lt;em&gt;inside&lt;/em&gt; our &lt;code&gt;bright_data.search&lt;/code&gt; parent span. They share the same &lt;code&gt;trace_id&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;In &lt;a href="https://www.jaegertracing.io/docs/" rel="noopener noreferrer"&gt;Jaeger&lt;/a&gt;, you’ll get a proper timeline: the outer business operation wrapping the inner HTTP call. That nesting is what makes traces actually useful — you see the whole story, not just individual events.&lt;/p&gt;

&lt;h3&gt;
  
  
  You Have to Set Span Status Yourself
&lt;/h3&gt;

&lt;p&gt;This one surprised me. OTel records all the data you throw at it, but it won’t decide what &lt;em&gt;matters&lt;/em&gt; on your behalf. If you don’t explicitly call &lt;code&gt;[span.set_status(…)](https://opentelemetry.io/docs/languages/python/instrumentation/#set-span-status)&lt;/code&gt;, every span stays &lt;code&gt;UNSET&lt;/code&gt;— even when a retry happened underneath. A query that timed out, retried, and recovered would be completely invisible to a &lt;a href="https://www.jaegertracing.io/docs/" rel="noopener noreferrer"&gt;Jaeger&lt;/a&gt; filter like &lt;code&gt;status=ERROR&lt;/code&gt;. You’d never find it.&lt;/p&gt;

&lt;p&gt;So there’s a deliberate tradeoff we’re making in the code above: recovered retries are marked &lt;code&gt;ERROR&lt;/code&gt;so they show up in dashboards. Some teams prefer to use &lt;code&gt;OK&lt;/code&gt;and add a &lt;code&gt;scraper.recovered = true&lt;/code&gt; attribute instead, keeping error rate metrics clean.&lt;/p&gt;

&lt;p&gt;Honestly, both are fine 🤷‍♂️ It just depends on whether you want alerting to treat “degraded success” as a failure. The important thing is to choose consciously, and not fall through to &lt;code&gt;UNSET&lt;/code&gt;by accident.&lt;/p&gt;

&lt;h2&gt;
  
  
  Putting It All Together
&lt;/h2&gt;

&lt;p&gt;Let’s do this in a file, call it something like &lt;code&gt;scraper.py&lt;/code&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;argparse&lt;/span&gt;  
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;os&lt;/span&gt;  
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;time&lt;/span&gt;  

&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;dotenv&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;load_dotenv&lt;/span&gt;  
&lt;span class="nf"&gt;load_dotenv&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;  
&lt;span class="n"&gt;_exporter&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;os&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;getenv&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;OTEL_EXPORTER&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;console&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="c1"&gt;# console fallback as a default  
&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;otel_config&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;init_otel&lt;/span&gt;  
&lt;span class="nf"&gt;init_otel&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;exporter&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;_exporter&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  &lt;span class="c1"&gt;# must come before BrightDataClient import  
&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;bright_data_otel&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;BrightDataClient&lt;/span&gt;  


&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;run&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;calls&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;int&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;delay&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;float&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mf"&gt;0.5&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;  
    &lt;span class="n"&gt;queries&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;  
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;python programming&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;machine learning&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;web development&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;data science&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;cloud computing&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  
    &lt;span class="p"&gt;]&lt;/span&gt;  
    &lt;span class="n"&gt;client&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;BrightDataClient&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;  
    &lt;span class="n"&gt;start&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;time&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;time&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;  
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nf"&gt;range&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;calls&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;  
        &lt;span class="n"&gt;q&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;queries&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="o"&gt;%&lt;/span&gt; &lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;queries&lt;/span&gt;&lt;span class="p"&gt;)]&lt;/span&gt;  
        &lt;span class="k"&gt;try&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;  
            &lt;span class="n"&gt;data&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;search&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;q&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;num_results&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  
            &lt;span class="n"&gt;n&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;organic&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;[]))&lt;/span&gt; &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="nf"&gt;isinstance&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nb"&gt;dict&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;else&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;  
            &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;  [&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="o"&gt;+&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;/&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;calls&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;] &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;q&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;n&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; results&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  
        &lt;span class="k"&gt;except&lt;/span&gt; &lt;span class="nb"&gt;Exception&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;  
            &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;  [&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="o"&gt;+&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;/&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;calls&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;] &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;q&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;: error — &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="err"&gt;`&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="n"&gt;calls&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;  
            &lt;span class="n"&gt;time&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;sleep&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;delay&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Done in &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;time&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;time&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;start&lt;/span&gt;&lt;span class="si"&gt;:&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;s&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  


&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;__name__&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;__main__&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;  
    &lt;span class="n"&gt;p&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;argparse&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;ArgumentParser&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;  
    &lt;span class="n"&gt;p&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;add_argument&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;--count&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nb"&gt;type&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nb"&gt;int&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;default&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  
    &lt;span class="n"&gt;p&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;add_argument&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;--delay&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nb"&gt;type&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nb"&gt;float&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;default&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;0.5&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  
    &lt;span class="n"&gt;args&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;p&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;parse_args&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;  
    &lt;span class="nf"&gt;run&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;calls&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;args&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;count&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;delay&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;args&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;delay&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Run it in console mode:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; python scraper.py &lt;span class="nt"&gt;--count&lt;/span&gt; 5
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Or point it at &lt;a href="https://www.jaegertracing.io/docs/" rel="noopener noreferrer"&gt;Jaeger&lt;/a&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# to be honest, just set OTEL_EXPORTER in .env  &lt;/span&gt;
&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="nv"&gt;OTEL_EXPORTER&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;jaeger python scraper.py &lt;span class="nt"&gt;--count&lt;/span&gt; 5
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  What the Traces Actually Show
&lt;/h2&gt;

&lt;p&gt;Here’s what the terminal printed for my five-query run:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="o"&gt;[&lt;/span&gt;1/5] python programming: 6 results  
&lt;span class="o"&gt;[&lt;/span&gt;2/5] machine learning: 8 results  
&lt;span class="o"&gt;[&lt;/span&gt;3/5] web development: 9 results  
&lt;span class="o"&gt;[&lt;/span&gt;4/5] data science: 9 results  
&lt;span class="o"&gt;[&lt;/span&gt;5/5] cloud computing: 9 results  
Done &lt;span class="k"&gt;in &lt;/span&gt;75.1s
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Five queries, all with results, no errors printed. Looks perfectly healthy….right? Let’s see what the traces say.&lt;/p&gt;

&lt;h3&gt;
  
  
  The clean calls…
&lt;/h3&gt;

&lt;p&gt;For &lt;code&gt;python programming&lt;/code&gt;, the &lt;code&gt;bright_data.search&lt;/code&gt;span looks exactly as expected:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;  
  &lt;/span&gt;&lt;span class="nl"&gt;"name"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"bright_data.search"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;  
  &lt;/span&gt;&lt;span class="nl"&gt;"attributes"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;  
    &lt;/span&gt;&lt;span class="nl"&gt;"scraper.query"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"python programming"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;  
    &lt;/span&gt;&lt;span class="nl"&gt;"scraper.target_domain"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"google.com"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;  
    &lt;/span&gt;&lt;span class="nl"&gt;"scraper.latency_ms"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mf"&gt;3686.01&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;  
    &lt;/span&gt;&lt;span class="nl"&gt;"scraper.retries"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="w"&gt;  
  &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;3.7 seconds, zero retries, one nested POST span confirming the HTTP round-trip happened exactly once. Looks good! Moving on.&lt;/p&gt;

&lt;h3&gt;
  
  
  …and the ones that weren’t so clean.
&lt;/h3&gt;

&lt;p&gt;The query &lt;code&gt;data science&lt;/code&gt; printed 9 results. Except the traces show &lt;em&gt;three&lt;/em&gt; spans for that single call:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;  
  &lt;/span&gt;&lt;span class="nl"&gt;"name"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"POST"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;  
  &lt;/span&gt;&lt;span class="nl"&gt;"status"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;  
    &lt;/span&gt;&lt;span class="nl"&gt;"status_code"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"ERROR"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;  
    &lt;/span&gt;&lt;span class="nl"&gt;"description"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"ReadTimeout: ...Read timed out. (read timeout=30)"&lt;/span&gt;&lt;span class="w"&gt;  
  &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;  
  &lt;/span&gt;&lt;span class="nl"&gt;"start_time"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"2026-03-20T21:18:24.986273Z"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;  
  &lt;/span&gt;&lt;span class="nl"&gt;"end_time"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"2026-03-20T21:18:54.999505Z"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;  
  &lt;/span&gt;&lt;span class="nl"&gt;"events"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[{&lt;/span&gt;&lt;span class="w"&gt;  
    &lt;/span&gt;&lt;span class="nl"&gt;"name"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"exception"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;  
    &lt;/span&gt;&lt;span class="nl"&gt;"attributes"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;  
      &lt;/span&gt;&lt;span class="nl"&gt;"exception.type"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"requests.exceptions.ReadTimeout"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;  
      &lt;/span&gt;&lt;span class="nl"&gt;"exception.stacktrace"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"..."&lt;/span&gt;&lt;span class="w"&gt;  
    &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;  
  &lt;/span&gt;&lt;span class="p"&gt;}]&lt;/span&gt;&lt;span class="w"&gt;  
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;  
&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;  
  &lt;/span&gt;&lt;span class="nl"&gt;"name"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"POST"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;  
  &lt;/span&gt;&lt;span class="nl"&gt;"status"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"status_code"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"UNSET"&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;  
  &lt;/span&gt;&lt;span class="nl"&gt;"start_time"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"2026-03-20T21:18:55.505186Z"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;  
  &lt;/span&gt;&lt;span class="nl"&gt;"end_time"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"2026-03-20T21:19:20.097874Z"&lt;/span&gt;&lt;span class="w"&gt;  
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;  
&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;  
  &lt;/span&gt;&lt;span class="nl"&gt;"name"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"bright_data.search"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;  
  &lt;/span&gt;&lt;span class="nl"&gt;"attributes"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;  
    &lt;/span&gt;&lt;span class="nl"&gt;"scraper.query"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"data science"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;  
    &lt;/span&gt;&lt;span class="nl"&gt;"scraper.latency_ms"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mf"&gt;55113.46&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;  
    &lt;/span&gt;&lt;span class="nl"&gt;"scraper.retries"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="w"&gt;  
  &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;  
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Turns out, this query hit the 30-second read timeout, waited for the retry backoff, tried again, and finally came back with data — costing you two proxy requests and 55 seconds instead of one request and ~4 seconds.&lt;/p&gt;

&lt;p&gt;You‘d have absolutely no idea from the terminal output itself.&lt;/p&gt;

&lt;p&gt;This happens more often than you think — failures that silently blend in with the clean calls around it, that your pipeline still declare a success. &lt;strong&gt;That’s the whole argument for adding observability to your data ingest pipeline, right there.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;This failure was invisible. No exception or warning, nothing in your logs. OTel may not have &lt;em&gt;prevented&lt;/em&gt; this failure — it has nothing to do with networking or data ingestion — but it definitely made it impossible to miss.&lt;/p&gt;

&lt;h3&gt;
  
  
  The Latency Breakdown Across All Five Queries
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Query&lt;/th&gt;
&lt;th&gt;Latency&lt;/th&gt;
&lt;th&gt;Retries&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;python programming&lt;/td&gt;
&lt;td&gt;3,686ms&lt;/td&gt;
&lt;td&gt;0&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;machine learning&lt;/td&gt;
&lt;td&gt;6,558ms&lt;/td&gt;
&lt;td&gt;0&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;web development&lt;/td&gt;
&lt;td&gt;3,079ms&lt;/td&gt;
&lt;td&gt;0&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;data science&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;55,113ms&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;1&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;cloud computing&lt;/td&gt;
&lt;td&gt;4,600ms&lt;/td&gt;
&lt;td&gt;0&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The &lt;code&gt;scraper.target_domain&lt;/code&gt; attribute lets you aggregate this same breakdown per domain when you scrape multiple targets (e.g. google.com vs bing.com).&lt;/p&gt;

&lt;p&gt;Four of five calls were clean. One was 15x slower than average, and you only know that because you were looking at traces.&lt;/p&gt;

&lt;p&gt;If you’re running this at scale across hundreds of queries, that pattern — most queries fast, some consistently slow or timeout-prone — is &lt;em&gt;exactly&lt;/em&gt; the info you need to tune retry budgets, adjust per-query timeouts, or start asking why that&lt;code&gt;“data science”&lt;/code&gt; query keeps choking.&lt;/p&gt;

&lt;p&gt;You can’t turn knobs on what you can’t see, after all. Adding observability gives you visibility into things you may not even have thought of.&lt;/p&gt;

&lt;h2&gt;
  
  
  Going to Production with Jaeger
&lt;/h2&gt;

&lt;p&gt;The console exporter is great for development, but for anything actually running in production you want traces going somewhere persistent. The easiest starting point is &lt;a href="https://hub.docker.com/r/jaegertracing/all-in-one" rel="noopener noreferrer"&gt;Jaeger’s all-in-one Docker image&lt;/a&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;docker run &lt;span class="nt"&gt;-d&lt;/span&gt; &lt;span class="nt"&gt;--name&lt;/span&gt; jaeger &lt;span class="se"&gt;\ &lt;/span&gt; 
  &lt;span class="nt"&gt;-p&lt;/span&gt; 16686:16686 &lt;span class="se"&gt;\ &lt;/span&gt;
  &lt;span class="nt"&gt;-p&lt;/span&gt; 4318:4318 &lt;span class="se"&gt;\ &lt;/span&gt; 
  jaegertracing/all-in-one:latest
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Then, as before:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; python scraper.py &lt;span class="nt"&gt;--count&lt;/span&gt; 10
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;After this run, open &lt;code&gt;http://localhost:16686,&lt;/code&gt; search for &lt;code&gt;bd-scraper&lt;/code&gt;(or whatever you called yours) and you’ll see each &lt;code&gt;bright_data.search&lt;/code&gt; span as a row in the trace timeline with the nested POST spans inside.&lt;/p&gt;

&lt;p&gt;The &lt;code&gt;data science&lt;/code&gt; query was slow again, but hey, at least it didn’t fail? Small victories. 😅 It stands out immediately in the Jaeger UI — one that’s wider than everything else on the screen (~17 seconds.)&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fs4i81kyaukz303zr9mcm.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fs4i81kyaukz303zr9mcm.png" width="640" height="378"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This is what &lt;a href="http://localhost:16686/search" rel="noopener noreferrer"&gt;http://localhost:16686/search&lt;/a&gt; will look like after a run.&lt;/p&gt;

&lt;p&gt;Click through it for more info.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fp5d5mzn7aawt6fgxmjwn.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fp5d5mzn7aawt6fgxmjwn.png" width="640" height="422"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;And you can expand through traces here to as fine grained a detail as you need.&lt;/p&gt;

&lt;p&gt;For a real setup, swap Jaeger out for whatever backend you already run. &lt;a href="https://grafana.com/docs/tempo/latest" rel="noopener noreferrer"&gt;Grafana Tempo&lt;/a&gt;, &lt;a href="https://docs.honeycomb.io/send-data/traces/opentelemetry" rel="noopener noreferrer"&gt;Honeycomb&lt;/a&gt;, &lt;a href="https://docs.datadoghq.com/opentelemetry/setup" rel="noopener noreferrer"&gt;Datadog&lt;/a&gt; — the &lt;a href="https://opentelemetry.io/docs/specs/otlp/" rel="noopener noreferrer"&gt;OTLP&lt;/a&gt; exporter speaks the same protocol to all of them.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;That’s everything! Feel free to reach out on&lt;/em&gt; &lt;a href="https://www.linkedin.com/in/prithwish-nath-04b873a7/" rel="noopener noreferrer"&gt;LinkedIn&lt;/a&gt; &lt;em&gt;if you have questions, or leave a comment below.👋&lt;/em&gt;&lt;/p&gt;

</description>
      <category>devops</category>
      <category>python</category>
      <category>programming</category>
    </item>
    <item>
      <title>Building a Local Data Analytics Pipeline with dbt Core and DuckDB</title>
      <dc:creator>Prithwish Nath</dc:creator>
      <pubDate>Wed, 18 Mar 2026 16:49:14 +0000</pubDate>
      <link>https://dev.to/prithwish_nath/building-a-local-data-analytics-pipeline-with-dbt-core-and-duckdb-4e4</link>
      <guid>https://dev.to/prithwish_nath/building-a-local-data-analytics-pipeline-with-dbt-core-and-duckdb-4e4</guid>
      <description>&lt;p&gt;&lt;strong&gt;&lt;em&gt;TL;DR:&lt;/em&gt;&lt;/strong&gt; &lt;em&gt;This pipeline uses dbt Core + DuckDB locally — no infrastructure — to normalize domains, deduplicate URLs, enforce data contracts via tests, and materialize four analyst-ready mart tables from raw SERP API output.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F49xhyi77pufd3km52e77.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F49xhyi77pufd3km52e77.png" alt="cover image with article title and logos for dbt and duck DB" width="560" height="294"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;After web ingestion, you’ll have inconsistent domains, duplicate URLs across collection runs, null titles, and more. This is not  &lt;em&gt;wrong&lt;/em&gt; data, per se, just unprocessed data. The gap between “data in a table” and “data you can trust in a query” is bigger than you think.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://docs.getdbt.com/" rel="noopener noreferrer"&gt;dbt&lt;/a&gt;  (data build tool) is an open-source transformation framework that can help us with exactly that problem: you write SQL models, it materializes them in dependency order, and it tracks lineage from raw source to final output. Paired with  &lt;a href="https://duckdb.org/docs/" rel="noopener noreferrer"&gt;DuckDB&lt;/a&gt;  via the community &lt;a href="https://docs.getdbt.com/reference/resource-configs/duckdb-configs" rel="noopener noreferrer"&gt;dbt-duckdb&lt;/a&gt;  adapter — no infrastructure needed, it’s all&lt;code&gt;.duckdb&lt;/code&gt;  files — it's a surprisingly capable local setup for closing that gap.&lt;/p&gt;

&lt;p&gt;I’ll walk you through the Python-based pipeline I use — one that takes SERP data and produces analytics ready tables.&lt;/p&gt;

&lt;h2&gt;
  
  
  Requirements
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;What you need:&lt;/strong&gt; Python 3.x, first of all. Then we can install our requirements like so:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;pip &lt;span class="nb"&gt;install &lt;/span&gt;dbt-core dbt-duckdb duckdb requests python-dotenv pandas
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;For ingestion, we’ll be using a SERP API — I’m using the one I have access to,  &lt;a href="https://get.brightdata.com/bd-serp-api?utm_content=building_a_local_serp_analytics_pipeline_with_dbt_core_and_duckdb" rel="noopener noreferrer"&gt;Bright Data&lt;/a&gt;. For this, you’ll need an account with a SERP zone (get API key and zone from its dashboard).&lt;/p&gt;

&lt;p&gt;Create a project directory with this layout:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  &lt;code&gt;ingest/&lt;/code&gt;  for the Python scripts (bright_data.py, duckdb_manager.py, scraper.py),&lt;/li&gt;
&lt;li&gt;  &lt;code&gt;models/&lt;/code&gt;  for dbt (with subfolders staging/, intermediate/, marts/),&lt;/li&gt;
&lt;li&gt;  &lt;code&gt;data/&lt;/code&gt;  for the .duckdb files, and&lt;/li&gt;
&lt;li&gt;  &lt;code&gt;profiles.yml&lt;/code&gt;, and &lt;code&gt;dbt_project.yml&lt;/code&gt;  at the root.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;So there are two main phases: a Python ingest layer that collects and streams results into DuckDB, and a dbt transformation layer with three tiers — staging, intermediate, and marts.&lt;/p&gt;

&lt;h2&gt;
  
  
  Phase 1: Ingesting SERP Data into DuckDB
&lt;/h2&gt;

&lt;p&gt;Before dbt has anything to work with, we need data in the database. The ingest layer is three Python files:  &lt;code&gt;bright_data.py&lt;/code&gt;  wraps the Bright Data SERP API,  &lt;code&gt;duckdb_manager.py&lt;/code&gt;  handles the DuckDB connection and schema, and  &lt;code&gt;scraper.py&lt;/code&gt;  orchestrates the collection loop. Place them in an  &lt;code&gt;ingest/&lt;/code&gt;  subdirectory.&lt;/p&gt;

&lt;h3&gt;
  
  
  The Bright Data Client
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://get.brightdata.com/lp-scraping-browser-acf1964?utm_content=building_a_local_serp_analytics_pipeline_with_dbt_core_and_duckdb" rel="noopener noreferrer"&gt;Bright Data’s SERP API&lt;/a&gt;  works differently from a proxy setup. Rather than routing requests through a proxy, you POST a target URL to their API endpoint and get back structured JSON.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;ingest/bright_data.py:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;  
Bright Data SERP API client for fetching search results  
&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;  

&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;os&lt;/span&gt;  
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;requests&lt;/span&gt;  
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;typing&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Dict&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Any&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Optional&lt;/span&gt;  
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;dotenv&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;load_dotenv&lt;/span&gt;  

&lt;span class="nf"&gt;load_dotenv&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;  


&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;BrightDataClient&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;  
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;  
    Client for Bright Data SERP API  
    Uses the SERP API endpoint (not proxy) for Google search access  
    &lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;  

    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;__init__&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;  
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  
        &lt;span class="n"&gt;api_key&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;Optional&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  
        &lt;span class="n"&gt;zone&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;Optional&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  
        &lt;span class="n"&gt;country&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;Optional&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;  
    &lt;span class="p"&gt;):&lt;/span&gt;  
        &lt;span class="n"&gt;env_api_key&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;os&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;getenv&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;BRIGHT_DATA_API_KEY&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  
        &lt;span class="n"&gt;env_zone&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;os&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;getenv&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;BRIGHT_DATA_ZONE&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  
        &lt;span class="n"&gt;env_country&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;os&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;getenv&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;BRIGHT_DATA_COUNTRY&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  

        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;api_key&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;api_key&lt;/span&gt; &lt;span class="ow"&gt;or&lt;/span&gt; &lt;span class="n"&gt;env_api_key&lt;/span&gt;  
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;zone&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;zone&lt;/span&gt; &lt;span class="ow"&gt;or&lt;/span&gt; &lt;span class="n"&gt;env_zone&lt;/span&gt;  
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;country&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;country&lt;/span&gt; &lt;span class="ow"&gt;or&lt;/span&gt; &lt;span class="n"&gt;env_country&lt;/span&gt;  
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;api_endpoint&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;https://api.brightdata.com/request&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;  

        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;api_key&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;  
            &lt;span class="k"&gt;raise&lt;/span&gt; &lt;span class="nc"&gt;ValueError&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;  
                &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;BRIGHT_DATA_API_KEY must be provided via constructor or environment variable. &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;  
                &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Get your API key from: https://brightdata.com/cp/setting/users&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;  
            &lt;span class="p"&gt;)&lt;/span&gt;  

        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;zone&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;  
            &lt;span class="k"&gt;raise&lt;/span&gt; &lt;span class="nc"&gt;ValueError&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;  
                &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;BRIGHT_DATA_ZONE must be provided via constructor or environment variable. &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;  
                &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Manage zones at: https://brightdata.com/cp/zones&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;  
            &lt;span class="p"&gt;)&lt;/span&gt;  

        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;session&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;requests&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Session&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;  
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;session&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;headers&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;update&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;  
            &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Content-Type&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;application/json&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  
            &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Authorization&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Bearer &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;api_key&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;  
        &lt;span class="p"&gt;})&lt;/span&gt;  

    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;search&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;  
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  
        &lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  
        &lt;span class="n"&gt;num_results&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;int&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  
        &lt;span class="n"&gt;language&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;Optional&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  
        &lt;span class="n"&gt;country&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;Optional&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;  
    &lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;Dict&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Any&lt;/span&gt;&lt;span class="p"&gt;]:&lt;/span&gt;  
        &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;  
        Execute a Google search via Bright Data SERP API  

        Args:  
            query: Search query string  
            num_results: Number of results to return (default: 10)  
            language: Language code (e.g., &lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;en&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;, &lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;es&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;, &lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;fr&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;)  
            country: Country code (e.g., &lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;us&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;, &lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;uk&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;, &lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;ca&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;)  

        Returns:  
            Dictionary containing search results in JSON format  
        &lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;  
        &lt;span class="n"&gt;search_url&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;  
            &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;https://www.google.com/search&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;  
            &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;?q=&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;requests&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;utils&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;quote&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;  
            &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;&amp;amp;num=&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;num_results&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;  
            &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;&amp;amp;brd_json=1&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;  
        &lt;span class="p"&gt;)&lt;/span&gt;  

        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;language&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;  
            &lt;span class="n"&gt;search_url&lt;/span&gt; &lt;span class="o"&gt;+=&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;&amp;amp;hl=&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;language&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;&amp;amp;lr=lang_&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;language&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;  

        &lt;span class="n"&gt;target_country&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;country&lt;/span&gt; &lt;span class="ow"&gt;or&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;country&lt;/span&gt;  

        &lt;span class="n"&gt;payload&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;  
            &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;zone&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;zone&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  
            &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;url&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;search_url&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  
            &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;format&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;json&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;  
        &lt;span class="p"&gt;}&lt;/span&gt;  

        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;target_country&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;  
            &lt;span class="n"&gt;payload&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;country&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;target_country&lt;/span&gt;  

        &lt;span class="k"&gt;try&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;  
            &lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;session&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;post&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;  
                &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;api_endpoint&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  
                &lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;payload&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  
                &lt;span class="n"&gt;timeout&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;30&lt;/span&gt;  
            &lt;span class="p"&gt;)&lt;/span&gt;  
            &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;raise_for_status&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;  
            &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;json&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;  

        &lt;span class="k"&gt;except&lt;/span&gt; &lt;span class="n"&gt;requests&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;exceptions&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;HTTPError&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;  
            &lt;span class="n"&gt;error_msg&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Search request failed with HTTP &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;status_code&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;  
            &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;  
                &lt;span class="n"&gt;error_msg&lt;/span&gt; &lt;span class="o"&gt;+=&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="si"&gt;:&lt;/span&gt;&lt;span class="mi"&gt;200&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;  
            &lt;span class="k"&gt;raise&lt;/span&gt; &lt;span class="nc"&gt;RuntimeError&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;error_msg&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="n"&gt;e&lt;/span&gt;  
        &lt;span class="k"&gt;except&lt;/span&gt; &lt;span class="n"&gt;requests&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;exceptions&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;RequestException&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;  
            &lt;span class="k"&gt;raise&lt;/span&gt; &lt;span class="nc"&gt;RuntimeError&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Search request failed: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="n"&gt;e&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Note the&lt;code&gt;brd_json=1&lt;/code&gt;  parameter appended to the Google search URL — that’s what tells Bright Data to parse the response and return structured data rather than raw HTML.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Configuration&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  API key, zone, country — read from environment variables with constructor overrides, so the client works both in scripts and in environments where secrets come in differently.&lt;/li&gt;
&lt;li&gt;  Without  &lt;code&gt;BRIGHT_DATA_API_KEY&lt;/code&gt; and  &lt;code&gt;BRIGHT_DATA_ZONE&lt;/code&gt;  set, it raises immediately with a message pointing to the right place in the Bright Data dashboard.&lt;/li&gt;
&lt;li&gt;  The client also supports a  &lt;code&gt;language&lt;/code&gt; parameter (e.g.  &lt;code&gt;hl=en&lt;/code&gt;, &lt;code&gt;lr=lang_en&lt;/code&gt;) for non-English or multi-region SERP analysis — pass it to  &lt;code&gt;search()&lt;/code&gt;  or set  &lt;code&gt;BRIGHT_DATA_COUNTRY&lt;/code&gt;  for geo-targeting.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  The DuckDB Manager
&lt;/h3&gt;

&lt;p&gt;We’ll use two databases. The  &lt;a href="https://duckdb.org/docs/api/python/overview" rel="noopener noreferrer"&gt;DuckDB Python API&lt;/a&gt;  lets us create files, run SQL, and insert from pandas DataFrames directly:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  &lt;code&gt;serp_data.duckdb&lt;/code&gt;  — the source DB that holds raw ingest output, and&lt;/li&gt;
&lt;li&gt;  &lt;code&gt;serp_analytics.duckdb&lt;/code&gt;  — our analytics DB, one that holds the transformed models.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;dbt attaches the source DB read-only and writes only to the analytics DB, so raw data stays untouched.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;💡 The keen among you may have noticed that I'm basically stealing the &lt;a href="https://learn.microsoft.com/en-us/azure/databricks/lakehouse/medallion" rel="noopener noreferrer"&gt;medallion architecture pattern&lt;/a&gt; from Databricks/BigQuery projects here. So "bronze" stays untouched, "silver/gold" tables are derived. Why do this? If a dbt model has a bug and you materialize garbage into analytics, your raw data is completely clean and you just re-run.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;For our source DB, the schema is simple: one  &lt;code&gt;serp_results&lt;/code&gt; table with indexes on query and domain — the two fields that get hit hardest in the dbt transformations downstream.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;ingest/duckdb_manager.py&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;  
DuckDB connection and schema management for SERP ingest  
&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;  

&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;duckdb&lt;/span&gt;  
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;os&lt;/span&gt;  
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;typing&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Optional&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;List&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Dict&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Any&lt;/span&gt;  
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;datetime&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;datetime&lt;/span&gt;  

&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;DuckDBManager&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;  
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Manages DuckDB connection, schema, and insert operations&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;  

    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;__init__&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;db_path&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;data/serp_data.duckdb&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;     
        &lt;span class="c1"&gt;# db_path = "serp_data.duckdb" gives dirname = "", which can cause issues on some setups   
&lt;/span&gt;        &lt;span class="c1"&gt;# So...safer to guard like so:  
&lt;/span&gt;        &lt;span class="n"&gt;parent&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;os&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;path&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;dirname&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;db_path&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;parent&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;  
            &lt;span class="n"&gt;os&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;makedirs&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;parent&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;exist_ok&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;    

        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;db_path&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;db_path&lt;/span&gt;  
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;conn&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;duckdb&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;connect&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;db_path&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;_create_schema&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;  

    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;_create_schema&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;  
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;conn&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;execute&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;  
            CREATE TABLE IF NOT EXISTS serp_results (  
                id BIGINT PRIMARY KEY,  
                query TEXT NOT NULL,  
                timestamp TIMESTAMP NOT NULL,  
                result_position INTEGER NOT NULL,  
                title TEXT,  
                url TEXT,  
                snippet TEXT,  
                domain TEXT,  
                rank INTEGER  
            )  
        &lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  


    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;insert_batch&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;results&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;List&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;Dict&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Any&lt;/span&gt;&lt;span class="p"&gt;]],&lt;/span&gt; &lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;timestamp&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;Optional&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;datetime&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;  
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;timestamp&lt;/span&gt; &lt;span class="ow"&gt;is&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;  
            &lt;span class="n"&gt;timestamp&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;datetime&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;now&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;  

        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="n"&gt;results&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;  
            &lt;span class="k"&gt;return&lt;/span&gt;  

        &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;extract_domain&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;url&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;  
            &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="n"&gt;url&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;  
                &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="sh"&gt;""&lt;/span&gt;  
            &lt;span class="k"&gt;try&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;  
                &lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;urllib.parse&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;urlparse&lt;/span&gt;  
                &lt;span class="n"&gt;parsed&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;urlparse&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;url&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  
                &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;parsed&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;netloc&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;replace&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;www.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;""&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  
            &lt;span class="k"&gt;except&lt;/span&gt; &lt;span class="nb"&gt;Exception&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;  
                &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="sh"&gt;""&lt;/span&gt;  

        &lt;span class="n"&gt;max_id_result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;conn&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;execute&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;SELECT COALESCE(MAX(id), 0) FROM serp_results&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;fetchone&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;  
        &lt;span class="n"&gt;next_id&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;max_id_result&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;max_id_result&lt;/span&gt; &lt;span class="k"&gt;else&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;  

        &lt;span class="n"&gt;rows&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;  
        &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;idx&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nf"&gt;enumerate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;results&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;  
            &lt;span class="n"&gt;url&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;url&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;link&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;''&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;  
            &lt;span class="n"&gt;domain&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;extract_domain&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;url&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  

            &lt;span class="n"&gt;rows&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;  
                &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;id&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;next_id&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;idx&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  
                &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;query&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  
                &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;timestamp&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;timestamp&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  
                &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;result_position&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;idx&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  
                &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;title&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;title&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;''&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;  
                &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;url&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;url&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  
                &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;snippet&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;snippet&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;description&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;''&lt;/span&gt;&lt;span class="p"&gt;)),&lt;/span&gt;  
                &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;domain&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;domain&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  
                &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;rank&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;idx&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;  
            &lt;span class="p"&gt;})&lt;/span&gt;  

        &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;pandas&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;  
        &lt;span class="n"&gt;df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;DataFrame&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;rows&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  
        &lt;span class="c1"&gt;# Best practice is to specify columns explicitly (like, INSERT INTO t (a,b,c) SELECT a,b,c FROM df)   
&lt;/span&gt;        &lt;span class="c1"&gt;# to avoid mismatch if table or DataFrame order changes  
&lt;/span&gt;        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;conn&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;execute&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;  
            INSERT INTO serp_results (id, query, timestamp, result_position, title, url, snippet, domain, rank)  
            SELECT id, query, timestamp, result_position, title, url, snippet, domain, rank FROM df  
        &lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;    

    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;get_row_count&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;int&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;  
        &lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;conn&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;execute&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;SELECT COUNT(*) FROM serp_results&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;fetchone&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;  
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="k"&gt;else&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;  

    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;close&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;  
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;conn&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;close&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;  

    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;__enter__&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;  
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;  

    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;__exit__&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;exc_type&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;exc_val&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;exc_tb&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;  
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;close&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Just in case, the  &lt;code&gt;insert_batch&lt;/code&gt;  method handles inconsistent SERP field names (&lt;code&gt;link&lt;/code&gt;  vs  &lt;code&gt;url&lt;/code&gt;,  &lt;code&gt;description&lt;/code&gt; vs  &lt;code&gt;snippet&lt;/code&gt;).&lt;/p&gt;

&lt;p&gt;Also, it’s worth being honest about what the domain extraction is and isn’t: I’ve accounted only for a best-effort ingest-time extraction, not the canonical domain value. The dbt staging model is where the real normalization happens — lowercasing, stripping  &lt;code&gt;www.&lt;/code&gt;, and falling back to regex extraction when the field is missing. The ingest layer just makes sure  &lt;em&gt;something&lt;/em&gt;  is in the column.&lt;/p&gt;

&lt;h3&gt;
  
  
  Making the Two Work Together to Ingest Web Data
&lt;/h3&gt;

&lt;p&gt;The scraper loops through a list of queries, calls the Bright Data client for each, and streams results into DuckDB in batches. Put a .env file in the project root with  &lt;code&gt;BRIGHT_DATA_API_KEY&lt;/code&gt;  and &lt;code&gt;BRIGHT_DATA_ZONE&lt;/code&gt;, then run from the project root:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;python ingest/scraper.py &lt;span class="nt"&gt;--count&lt;/span&gt; 50000
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Some other time saving flags to add:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;code&gt;--queries “foo” “bar”&lt;/code&gt;  for custom queries (Here, I’ve made it default to 10 tech keywords),&lt;/p&gt;

&lt;p&gt;&lt;code&gt;--batch-size 10&lt;/code&gt;  for results per API call,&lt;/p&gt;

&lt;p&gt;&lt;code&gt;--delay 1.0&lt;/code&gt; for seconds between calls, and&lt;/p&gt;

&lt;p&gt;&lt;code&gt;--db path/to/serp_data.duckdb&lt;/code&gt;  to override the output path (default:  &lt;code&gt;data/serp_data.duckdb&lt;/code&gt;).&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;  
SERP scraper that streams results to DuckDB (data/serp_data.duckdb).  
Run from project root. Then run dbt to build analytics.  
&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;  

&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;argparse&lt;/span&gt;    
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;time&lt;/span&gt;    
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;sys&lt;/span&gt;    
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;os&lt;/span&gt;    
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;pathlib&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Path&lt;/span&gt;    

&lt;span class="n"&gt;sys&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;path&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;insert&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;os&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;path&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;dirname&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;os&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;path&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;abspath&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;__file__&lt;/span&gt;&lt;span class="p"&gt;)))&lt;/span&gt;  

&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;bright_data&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;BrightDataClient&lt;/span&gt;    
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;duckdb_manager&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;DuckDBManager&lt;/span&gt;  

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;_default_db_path&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;    
    &lt;span class="n"&gt;script_dir&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Path&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;__file__&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;resolve&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="n"&gt;parent&lt;/span&gt;    
    &lt;span class="n"&gt;project_root&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;script_dir&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;parent&lt;/span&gt;    
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nf"&gt;str&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;project_root&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;data&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;serp_data.duckdb&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;scrape_and_insert&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt; &lt;span class="n"&gt;total_results&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;int&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;    
    &lt;span class="n"&gt;queries&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;list&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;    
    &lt;span class="n"&gt;batch_size&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;int&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;    
    &lt;span class="n"&gt;delay_seconds&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;float&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mf"&gt;1.0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;    
    &lt;span class="n"&gt;db_path&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt; &lt;span class="p"&gt;):&lt;/span&gt;    
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;queries&lt;/span&gt; &lt;span class="ow"&gt;is&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;    
        &lt;span class="n"&gt;queries&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;    
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;python programming&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;machine learning&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;web development&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;    
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;data science&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;cloud computing&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;javascript frameworks&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;    
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;database design&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;API development&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;devops tools&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;cybersecurity&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;    
        &lt;span class="p"&gt;]&lt;/span&gt;  

    &lt;span class="n"&gt;db_path&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;db_path&lt;/span&gt; &lt;span class="ow"&gt;or&lt;/span&gt; &lt;span class="nf"&gt;_default_db_path&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;    
    &lt;span class="n"&gt;client&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;BrightDataClient&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;  

    &lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="nc"&gt;DuckDBManager&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;db_path&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;db&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;    
        &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Starting scrape: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;total_results&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; total results&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;    
        &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Database: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;db&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;db_path&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;    
        &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Then run: dbt run --profiles-dir .&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  

        &lt;span class="n"&gt;results_scraped&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;    
        &lt;span class="n"&gt;query_idx&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;    
        &lt;span class="n"&gt;start_time&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;time&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;time&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;    

        &lt;span class="k"&gt;try&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;    
            &lt;span class="k"&gt;while&lt;/span&gt; &lt;span class="n"&gt;results_scraped&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="n"&gt;total_results&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;    
                &lt;span class="n"&gt;query&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;queries&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;query_idx&lt;/span&gt; &lt;span class="o"&gt;%&lt;/span&gt; &lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;queries&lt;/span&gt;&lt;span class="p"&gt;)]&lt;/span&gt;  

                &lt;span class="k"&gt;try&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;    
                    &lt;span class="n"&gt;serp_data&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;search&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;num_results&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;batch_size&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  

                    &lt;span class="n"&gt;organic_results&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;    
                    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="nf"&gt;isinstance&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;serp_data&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nb"&gt;dict&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;    
                        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;organic&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;serp_data&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;    
                            &lt;span class="n"&gt;organic_results&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;serp_data&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;organic&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;    
                        &lt;span class="k"&gt;elif&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;body&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;serp_data&lt;/span&gt; &lt;span class="ow"&gt;and&lt;/span&gt; &lt;span class="nf"&gt;isinstance&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;serp_data&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;body&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="nb"&gt;dict&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;    
                            &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;organic&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;serp_data&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;body&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]:&lt;/span&gt;    
                                &lt;span class="n"&gt;organic_results&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;serp_data&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;body&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;organic&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;  

                    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;organic_results&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;    
                        &lt;span class="n"&gt;db&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;insert_batch&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;organic_results&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;    
                        &lt;span class="n"&gt;results_scraped&lt;/span&gt; &lt;span class="o"&gt;+=&lt;/span&gt; &lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;organic_results&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;      
                        &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;[&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;results_scraped&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;/&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;total_results&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;] Query: &lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt; | Inserted: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;organic_results&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;    

                    &lt;span class="n"&gt;query_idx&lt;/span&gt; &lt;span class="o"&gt;+=&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;  

                    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;results_scraped&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="n"&gt;total_results&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;    
                        &lt;span class="n"&gt;time&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;sleep&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;delay_seconds&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  

                &lt;span class="k"&gt;except&lt;/span&gt; &lt;span class="nb"&gt;Exception&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;    
                    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Error scraping query &lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;    
                    &lt;span class="n"&gt;query_idx&lt;/span&gt; &lt;span class="o"&gt;+=&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;    
                    &lt;span class="k"&gt;continue&lt;/span&gt;  

        &lt;span class="k"&gt;except&lt;/span&gt; &lt;span class="nb"&gt;KeyboardInterrupt&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;    
            &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s"&gt;Scraping interrupted by user&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  

        &lt;span class="n"&gt;elapsed&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;time&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;time&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;start_time&lt;/span&gt;    
        &lt;span class="n"&gt;final_count&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;db&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get_row_count&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;    

        &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s"&gt;=== Scraping Complete ===&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;    
        &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Total rows in DB: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;final_count&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;    
        &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Time elapsed: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;elapsed&lt;/span&gt;&lt;span class="si"&gt;:&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;s&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;    
        &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Rate: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;final_count&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;elapsed&lt;/span&gt;&lt;span class="si"&gt;:&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; rows/sec&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;    

&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;__name__&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;__main__&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;    
    &lt;span class="n"&gt;parser&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;argparse&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;ArgumentParser&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;description&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Scrape SERP results to DuckDB for dbt&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;    
    &lt;span class="n"&gt;parser&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;add_argument&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;--count&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nb"&gt;type&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nb"&gt;int&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;default&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;50000&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;    
    &lt;span class="n"&gt;parser&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;add_argument&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;--batch-size&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nb"&gt;type&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nb"&gt;int&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;default&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;    
    &lt;span class="n"&gt;parser&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;add_argument&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;--delay&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nb"&gt;type&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nb"&gt;float&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;default&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;1.0&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;    
    &lt;span class="n"&gt;parser&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;add_argument&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;--queries&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;nargs&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;+&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;    
    &lt;span class="n"&gt;parser&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;add_argument&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;--db&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  

    &lt;span class="n"&gt;args&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;parser&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;parse_args&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;  

    &lt;span class="nf"&gt;scrape_and_insert&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;    
        &lt;span class="n"&gt;total_results&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;args&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;count&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;    
        &lt;span class="n"&gt;queries&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;args&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;queries&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;    
        &lt;span class="n"&gt;batch_size&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;args&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;batch_size&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;    
        &lt;span class="n"&gt;delay_seconds&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;args&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;delay&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;    
        &lt;span class="n"&gt;db_path&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;args&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;db&lt;/span&gt;    
    &lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;A few things in here that are easy to overlook:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  In case your SERP API response structure isn’t always consistent — organic results can live at  &lt;code&gt;serp_data[‘organic’]&lt;/code&gt;  or nested at  &lt;code&gt;serp_data[‘body’][‘organic’]&lt;/code&gt; depending on the response format and search engine (Bright Data supports Google as well as others), so the parser checks both.&lt;/li&gt;
&lt;li&gt;  There’s a configurable delay between API calls (  &lt;code&gt;--delay&lt;/code&gt;, default 1 second) to avoid hammering the API.&lt;/li&gt;
&lt;li&gt;  Progress is printed after each batch so you can track the run.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;When the scrape finishes, you have a populated  &lt;code&gt;data/serp_data.duckdb&lt;/code&gt;  file with a  &lt;code&gt;serp_results&lt;/code&gt;  table. That’s our raw data. We now have everything dbt needs to begin.&lt;/p&gt;

&lt;h2&gt;
  
  
  Phase 2: Transforming with dbt
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Connecting dbt to DuckDB
&lt;/h3&gt;

&lt;p&gt;The &lt;a href="https://github.com/duckdb/dbt-duckdb" rel="noopener noreferrer"&gt;dbt-duckdb community adapter&lt;/a&gt; supports attaching multiple database files, which is exactly what we need here: we’ll make dbt write transformed models to  &lt;code&gt;serp_analytics.duckdb&lt;/code&gt;  while treating  &lt;code&gt;serp_data.duckdb&lt;/code&gt;  as a read-only source. This separation is a good idea because, as a standard, you never want a transformation step to accidentally mutate the raw data it's reading from.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;a href="https://docs.getdbt.com/docs/local/connect-data-platform/duckdb-setup?version=1.12" rel="noopener noreferrer"&gt;Read more about configuring the dbt-duckdb adapter here  &lt;em&gt;👉&lt;/em&gt;&lt;/a&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;a href="https://docs.getdbt.com/docs/core/connect-data-platform/profiles.yml" rel="noopener noreferrer"&gt;profiles.yml&lt;/a&gt;  is where the connection lives. The  &lt;a href="https://docs.getdbt.com/reference/resource-configs/duckdb-configs#attaching-additional-databases" rel="noopener noreferrer"&gt;attach&lt;/a&gt;  block is the part that makes this work — it mounts the raw DB under the alias  &lt;code&gt;serp_source&lt;/code&gt;, which is the database name the source declaration will reference:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;profiles.yml:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;serp_analytics&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;  
  &lt;span class="na"&gt;target&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;dev&lt;/span&gt;  
  &lt;span class="na"&gt;outputs&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;  
    &lt;span class="na"&gt;dev&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;  
      &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;duckdb&lt;/span&gt;  
      &lt;span class="na"&gt;path&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;data/serp_analytics.duckdb&lt;/span&gt;  
      &lt;span class="na"&gt;threads&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;1&lt;/span&gt;  
      &lt;span class="na"&gt;settings&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;  
        &lt;span class="na"&gt;max_temp_directory_size&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;10GB'&lt;/span&gt;  
        &lt;span class="na"&gt;memory_limit&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;12GB'&lt;/span&gt;  
        &lt;span class="na"&gt;preserve_insertion_order&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;false&lt;/span&gt;  
      &lt;span class="na"&gt;attach&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;  
        &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;path&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;data/serp_data.duckdb&lt;/span&gt;  
          &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;duckdb&lt;/span&gt;  
          &lt;span class="na"&gt;alias&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;serp_source&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Tune  &lt;code&gt;memory_limit&lt;/code&gt;  and  &lt;code&gt;max_temp_directory_size&lt;/code&gt;  based on how much data you’re working with. I’ve optimized these values for a stress-free 50k-row scrape but you may want to lower them on smaller machines.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://docs.getdbt.com/reference/dbt_project.yml" rel="noopener noreferrer"&gt;dbt_project.yml&lt;/a&gt;  holds the project-level configuration — model paths,  &lt;a href="https://docs.getdbt.com/docs/build/materializations" rel="noopener noreferrer"&gt;materialization&lt;/a&gt;  defaults per layer, and the two variables that control pipeline behavior:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;dbt_project.yml&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;serp_analytics&lt;/span&gt;    
&lt;span class="na"&gt;version&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;1.0.0&lt;/span&gt;    
&lt;span class="na"&gt;config-version&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;2&lt;/span&gt;    
&lt;span class="na"&gt;profile&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;serp_analytics&lt;/span&gt;  

&lt;span class="na"&gt;model-paths&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;models"&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;    
&lt;span class="na"&gt;test-paths&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;tests"&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;    
&lt;span class="na"&gt;target-path&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;target"&lt;/span&gt;    
&lt;span class="na"&gt;clean-targets&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;    
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;target"&lt;/span&gt;    
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;dbt_packages"&lt;/span&gt;  

&lt;span class="na"&gt;models&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;    
  &lt;span class="na"&gt;serp_analytics&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;    
    &lt;span class="na"&gt;staging&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;    
      &lt;span class="na"&gt;+schema&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;staging&lt;/span&gt;    
    &lt;span class="na"&gt;intermediate&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;    
      &lt;span class="na"&gt;+schema&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;intermediate&lt;/span&gt;    
    &lt;span class="na"&gt;marts&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;    
      &lt;span class="na"&gt;+schema&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;marts&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Declaring the Source
&lt;/h3&gt;

&lt;p&gt;In dbt, you don’t reference raw tables directly in your models. You  &lt;a href="https://docs.getdbt.com/docs/build/sources" rel="noopener noreferrer"&gt;declare them as sources&lt;/a&gt;  first, and this is what makes lineage tracking possible.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;models/sources.yml&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;version&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;2&lt;/span&gt;  
&lt;span class="na"&gt;sources&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;  
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;serp_source&lt;/span&gt;  
    &lt;span class="na"&gt;description&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Raw SERP data from Bright Data collection&lt;/span&gt;  
    &lt;span class="na"&gt;database&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;serp_source&lt;/span&gt;  
    &lt;span class="na"&gt;schema&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;main&lt;/span&gt;  
    &lt;span class="na"&gt;tables&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;  
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;serp_results&lt;/span&gt;  
        &lt;span class="na"&gt;description&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Raw search engine result pages - id, query, url, domain, rank, position, etc.&lt;/span&gt;  
        &lt;span class="na"&gt;columns&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;  
          &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;id&lt;/span&gt;  
            &lt;span class="na"&gt;description&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Primary key&lt;/span&gt;  
          &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;query&lt;/span&gt;  
            &lt;span class="na"&gt;description&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Search query string&lt;/span&gt;  
          &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;url&lt;/span&gt;  
            &lt;span class="na"&gt;description&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Result URL&lt;/span&gt;  
          &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;domain&lt;/span&gt;  
            &lt;span class="na"&gt;description&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Extracted domain (e.g. example.com)&lt;/span&gt;  
          &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;result_position&lt;/span&gt;  
            &lt;span class="na"&gt;description&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Position on SERP (1-based)&lt;/span&gt;  
          &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;rank&lt;/span&gt;  
            &lt;span class="na"&gt;description&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Rank value (can differ from position)&lt;/span&gt;  
          &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;title&lt;/span&gt;  
            &lt;span class="na"&gt;description&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Result title&lt;/span&gt;  
          &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;timestamp&lt;/span&gt;  
            &lt;span class="na"&gt;description&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Scrape timestamp&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The  &lt;code&gt;database: serp_source&lt;/code&gt;  matches the attach alias — that’s how dbt finds the raw table. From here, every model references  &lt;a href="https://docs.getdbt.com/reference/dbt-jinja-functions/source" rel="noopener noreferrer"&gt;{{ source() }}&lt;/a&gt;  or  &lt;a href="https://docs.getdbt.com/reference/dbt-jinja-functions/ref" rel="noopener noreferrer"&gt;{{ ref() }}&lt;/a&gt;  rather than a raw table path. This doesn’t sound too important, but in fact, it’s what gives dbt the information it needs to build a  &lt;a href="https://www.getdbt.com/blog/getting-started-with-data-lineage" rel="noopener noreferrer"&gt;lineage graph&lt;/a&gt;  — a visual DAG of how every output table was derived from the source.&lt;/p&gt;

&lt;p&gt;It’s worth talking about what the lineage graph  &lt;em&gt;actually&lt;/em&gt; buys you, because it’s easy to dismiss as a visualization feature:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  When something looks wrong in a mart, you trace upstream through the graph to find where it came from.&lt;/li&gt;
&lt;li&gt;  When you’re about to refactor staging, you can see which intermediate models and marts depend on it before you touch anything.&lt;/li&gt;
&lt;li&gt;  When someone new joins the project, they get the full picture of how every output table was derived — in seconds, not by grepping through SQL files.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Run  &lt;code&gt;dbt docs generate &amp;amp;&amp;amp; dbt docs serve&lt;/code&gt;  to view it in the browser.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fivd8o932vmq4g14crn5p.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fivd8o932vmq4g14crn5p.png" alt="dbt lineage graph showing SERP data flow from source serp_results through staging, intermediate, and marts models." width="800" height="375"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Lineage graph from dbt docs serve, showing the dependency chain from raw SERP data to analytics-ready aggregations.&lt;/em&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;See these  &lt;a href="https://docs.getdbt.com/reference/commands/cmd-docs" rel="noopener noreferrer"&gt;dbt docs&lt;/a&gt;  for more details on these commands.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h3&gt;
  
  
  Staging: The Contract
&lt;/h3&gt;

&lt;p&gt;The staging model is where all the defensive work happens. Its job is a guarantee: by the time a row leaves staging, it is clean, normalized, and deduplicated. Every model downstream gets to assume that contract holds, which means none of them have to repeat the defensive logic.&lt;/p&gt;

&lt;p&gt;Staging and intermediate models use  &lt;a href="https://docs.getdbt.com/reference/resource-configs/materialized" rel="noopener noreferrer"&gt;materialized='view'&lt;/a&gt;; marts use  &lt;code&gt;[materialized='table']&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;models/staging/stg_serp_results.sql&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="p"&gt;{{&lt;/span&gt;  
  &lt;span class="n"&gt;config&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;  
    &lt;span class="n"&gt;materialized&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;'view'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  
  &lt;span class="p"&gt;)&lt;/span&gt;  
&lt;span class="p"&gt;}}&lt;/span&gt;  
&lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="n"&gt;raw&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;  
    &lt;span class="k"&gt;select&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="p"&gt;{{&lt;/span&gt; &lt;span class="k"&gt;source&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'serp_source'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'serp_results'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;}}&lt;/span&gt;  
&lt;span class="p"&gt;),&lt;/span&gt;  

&lt;span class="n"&gt;cleaned&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;  
    &lt;span class="k"&gt;select&lt;/span&gt;  
        &lt;span class="n"&gt;id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  
        &lt;span class="k"&gt;trim&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  
        &lt;span class="nb"&gt;timestamp&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  
        &lt;span class="n"&gt;result_position&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  
        &lt;span class="n"&gt;coalesce&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;nullif&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;trim&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;title&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="s1"&gt;''&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="s1"&gt;'Untitled'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;title&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  
        &lt;span class="n"&gt;url&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  
        &lt;span class="n"&gt;coalesce&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;nullif&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;trim&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;snippet&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="s1"&gt;''&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="s1"&gt;''&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;snippet&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  
        &lt;span class="k"&gt;case&lt;/span&gt;  
            &lt;span class="k"&gt;when&lt;/span&gt; &lt;span class="k"&gt;domain&lt;/span&gt; &lt;span class="k"&gt;is&lt;/span&gt; &lt;span class="k"&gt;null&lt;/span&gt; &lt;span class="k"&gt;or&lt;/span&gt; &lt;span class="k"&gt;trim&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;domain&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;''&lt;/span&gt; &lt;span class="k"&gt;then&lt;/span&gt; &lt;span class="n"&gt;coalesce&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;  
                &lt;span class="n"&gt;regexp_extract&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;url&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'^https?://(?:www&lt;/span&gt;&lt;span class="se"&gt;\.&lt;/span&gt;&lt;span class="s1"&gt;)?([^/]+)'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;  
                &lt;span class="s1"&gt;'unknown'&lt;/span&gt;  
            &lt;span class="p"&gt;)&lt;/span&gt;  
            &lt;span class="k"&gt;else&lt;/span&gt; &lt;span class="k"&gt;lower&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;regexp_replace&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;trim&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;domain&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="s1"&gt;'^www&lt;/span&gt;&lt;span class="se"&gt;\.&lt;/span&gt;&lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;''&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'i'&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;  
        &lt;span class="k"&gt;end&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="k"&gt;domain&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  
        &lt;span class="n"&gt;rank&lt;/span&gt;    
    &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="n"&gt;raw&lt;/span&gt;    
&lt;span class="p"&gt;),&lt;/span&gt;  
&lt;span class="n"&gt;deduplicated&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;  
    &lt;span class="k"&gt;select&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;  
    &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;  
        &lt;span class="k"&gt;select&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  
            &lt;span class="n"&gt;row_number&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="n"&gt;over&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;partition&lt;/span&gt; &lt;span class="k"&gt;by&lt;/span&gt; &lt;span class="n"&gt;url&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;query&lt;/span&gt; &lt;span class="k"&gt;order&lt;/span&gt; &lt;span class="k"&gt;by&lt;/span&gt; &lt;span class="n"&gt;id&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;rn&lt;/span&gt;  
        &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="n"&gt;cleaned&lt;/span&gt;  
    &lt;span class="p"&gt;)&lt;/span&gt;  
    &lt;span class="k"&gt;where&lt;/span&gt; &lt;span class="n"&gt;rn&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;  
&lt;span class="p"&gt;)&lt;/span&gt;  
&lt;span class="k"&gt;select&lt;/span&gt;  
    &lt;span class="n"&gt;id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nb"&gt;timestamp&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;result_position&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;title&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  
    &lt;span class="n"&gt;url&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;snippet&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;domain&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;rank&lt;/span&gt;    
&lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="n"&gt;deduplicated&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now, the domain normalization block is the part worth paying attention to.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  The scraper provides a  &lt;code&gt;domain&lt;/code&gt;  field, but it isn't always populated, and even when it is, capitalization and  &lt;code&gt;www.&lt;/code&gt;  prefixes create false cardinality in downstream aggregations.&lt;/li&gt;
&lt;li&gt;  The  &lt;code&gt;CASE&lt;/code&gt;  expression handles both paths: if the field is present, lowercase it and strip  &lt;code&gt;www.&lt;/code&gt;; if it's missing, fall back to extracting the domain from the URL via  &lt;a href="https://duckdb.org/docs/sql/functions/char#regular-expressions" rel="noopener noreferrer"&gt;DuckDB's&lt;/a&gt; &lt;a href="https://duckdb.org/docs/sql/functions/char#regular-expressions" rel="noopener noreferrer"&gt;regexp_extract&lt;/a&gt; /  &lt;a href="https://duckdb.org/docs/sql/functions/char#regular-expressions" rel="noopener noreferrer"&gt;regexp_replace&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;  Without this,  &lt;code&gt;www.Example.com&lt;/code&gt;,  &lt;code&gt;example.com&lt;/code&gt;, and a row where  &lt;code&gt;domain&lt;/code&gt;  is null but the URL is  &lt;code&gt;https://www.example.com/...&lt;/code&gt;  all count as different domains in every GROUP BY. That kind of silent cardinality inflation is exactly what staging is supposed to catch.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Deduplication happens last. A window function partitions by  &lt;code&gt;(url, query)&lt;/code&gt;  and keeps the row with the lowest  &lt;code&gt;id&lt;/code&gt;  — the earliest record if a URL was collected more than once for a given query.&lt;/p&gt;

&lt;h3&gt;
  
  
  What about tests?
&lt;/h3&gt;

&lt;p&gt;Add  &lt;code&gt;models/staging/stg_serp_results.yml&lt;/code&gt;  next to your staging model with column-level tests — straightforward  &lt;a href="https://docs.getdbt.com/reference/resource-properties/data-tests#not_null" rel="noopener noreferrer"&gt;not_null&lt;/a&gt;  checks on the four fields that everything downstream depends on:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;version&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;2&lt;/span&gt;  
&lt;span class="na"&gt;models&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;  
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;stg_serp_results&lt;/span&gt;  
    &lt;span class="na"&gt;description&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Cleaned and deduplicated SERP results; domain normalized&lt;/span&gt;  
    &lt;span class="na"&gt;columns&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;  
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;url&lt;/span&gt;  
        &lt;span class="na"&gt;description&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Result URL&lt;/span&gt;  
        &lt;span class="na"&gt;tests&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;  
          &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;not_null&lt;/span&gt;  
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;query&lt;/span&gt;  
        &lt;span class="na"&gt;description&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Search query&lt;/span&gt;  
        &lt;span class="na"&gt;tests&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;  
          &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;not_null&lt;/span&gt;  
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;domain&lt;/span&gt;  
        &lt;span class="na"&gt;description&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Normalized domain&lt;/span&gt;  
        &lt;span class="na"&gt;tests&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;  
          &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;not_null&lt;/span&gt;  
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;result_position&lt;/span&gt;  
        &lt;span class="na"&gt;description&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;SERP position (1-based)&lt;/span&gt;  
        &lt;span class="na"&gt;tests&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;  
          &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;not_null&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;There’s also a custom test in &lt;code&gt;tests/unique_url_query.sql&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="c1"&gt;-- Custom test because (url, query) should be unique in staging  &lt;/span&gt;
&lt;span class="k"&gt;select&lt;/span&gt; &lt;span class="n"&gt;url&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;count&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;cnt&lt;/span&gt;  
&lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="p"&gt;{{&lt;/span&gt; &lt;span class="k"&gt;ref&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'stg_serp_results'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;}}&lt;/span&gt;  
&lt;span class="k"&gt;group&lt;/span&gt; &lt;span class="k"&gt;by&lt;/span&gt; &lt;span class="n"&gt;url&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;query&lt;/span&gt;  
&lt;span class="k"&gt;having&lt;/span&gt; &lt;span class="k"&gt;count&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Why? Because dbt’s built-in  &lt;a href="https://docs.getdbt.com/reference/resource-properties/data-tests#unique" rel="noopener noreferrer"&gt;unique&lt;/a&gt;  test only checks a single column. Our custom test checks a  &lt;em&gt;combination&lt;/em&gt;  — a given URL should appear only once per query after deduplication.&lt;/p&gt;

&lt;p&gt;If the window function logic ever breaks due to a refactor, or if upstream data arrives in a shape the deduplication didn't account for, this test catches it before bad data reaches the marts. That's the value of encoding the business rule as a test: it runs every time, and it fails loudly.&lt;/p&gt;

&lt;h3&gt;
  
  
  Intermediate: Filter Once, Trust Everywhere
&lt;/h3&gt;

&lt;p&gt;The intermediate model is intentionally short (also a view).&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;models/intermediate/int_serp_results.sql&lt;/strong&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="p"&gt;{{&lt;/span&gt;  
  &lt;span class="n"&gt;config&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;  
    &lt;span class="n"&gt;materialized&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;'view'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  
  &lt;span class="p"&gt;)&lt;/span&gt;  
&lt;span class="p"&gt;}}&lt;/span&gt;  
&lt;span class="k"&gt;select&lt;/span&gt;  
    &lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  
    &lt;span class="k"&gt;domain&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  
    &lt;span class="n"&gt;url&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  
    &lt;span class="n"&gt;result_position&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  
    &lt;span class="nb"&gt;timestamp&lt;/span&gt;  
&lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="p"&gt;{{&lt;/span&gt; &lt;span class="k"&gt;ref&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'stg_serp_results'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;}}&lt;/span&gt;  
&lt;span class="k"&gt;where&lt;/span&gt; &lt;span class="n"&gt;result_position&lt;/span&gt; &lt;span class="k"&gt;is&lt;/span&gt; &lt;span class="k"&gt;not&lt;/span&gt; &lt;span class="k"&gt;null&lt;/span&gt;  
  &lt;span class="k"&gt;and&lt;/span&gt; &lt;span class="n"&gt;result_position&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;  
  &lt;span class="k"&gt;and&lt;/span&gt; &lt;span class="k"&gt;domain&lt;/span&gt; &lt;span class="k"&gt;is&lt;/span&gt; &lt;span class="k"&gt;not&lt;/span&gt; &lt;span class="k"&gt;null&lt;/span&gt;  
  &lt;span class="k"&gt;and&lt;/span&gt; &lt;span class="k"&gt;domain&lt;/span&gt; &lt;span class="o"&gt;!=&lt;/span&gt; &lt;span class="s1"&gt;'unknown'&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;You might wonder why this exists as its own layer rather than putting these filters in each mart. The reason is that all four mart models need the same filtered dataset.&lt;/p&gt;

&lt;p&gt;If the filter logic lived in each mart, changing the definition of a “valid” result would mean changing it in four places and hoping they stayed in sync — which they won’t, eventually. Intermediate models are dbt’s answer to that: define the analysis-ready dataset once, name it, and let everything downstream reference it.&lt;/p&gt;

&lt;h3&gt;
  
  
  Marts: Four Questions, Four Tables
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://docs.getdbt.com/best-practices/how-we-structure/4-marts?version=1.12" rel="noopener noreferrer"&gt;The mart layer&lt;/a&gt;  is where the pipeline produces something you’d actually hand to an analyst or wire to a dashboard.&lt;/p&gt;

&lt;p&gt;Staging and intermediate models are materialized as  &lt;em&gt;views&lt;/em&gt;  — they’re always fresh and don’t consume extra storage. Marts are materialized as  &lt;em&gt;tables&lt;/em&gt;  — written to disk on every run — so queries against them are fast regardless of upstream complexity.&lt;/p&gt;

&lt;p&gt;That split keeps the feedback loop fast during development while giving analysts precomputed tables to query.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Query Coverage&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;models/marts/agg_query_coverage.sql&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="p"&gt;{{&lt;/span&gt;  
  &lt;span class="n"&gt;config&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;  
    &lt;span class="n"&gt;materialized&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;'table'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  
  &lt;span class="p"&gt;)&lt;/span&gt;  
&lt;span class="p"&gt;}}&lt;/span&gt;  
&lt;span class="k"&gt;select&lt;/span&gt;  
    &lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  
    &lt;span class="k"&gt;count&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;distinct&lt;/span&gt; &lt;span class="n"&gt;url&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;unique_urls&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  
    &lt;span class="k"&gt;count&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;distinct&lt;/span&gt; &lt;span class="k"&gt;domain&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;unique_domains&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  
    &lt;span class="k"&gt;count&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;total_results&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  
    &lt;span class="k"&gt;min&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;result_position&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;best_position&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  
    &lt;span class="k"&gt;max&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;result_position&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;worst_position&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  
    &lt;span class="k"&gt;avg&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;result_position&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;avg_position&lt;/span&gt;  
&lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="p"&gt;{{&lt;/span&gt; &lt;span class="k"&gt;ref&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'int_serp_results'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;}}&lt;/span&gt;  
&lt;span class="k"&gt;group&lt;/span&gt; &lt;span class="k"&gt;by&lt;/span&gt; &lt;span class="n"&gt;query&lt;/span&gt;  
&lt;span class="k"&gt;order&lt;/span&gt; &lt;span class="k"&gt;by&lt;/span&gt; &lt;span class="n"&gt;unique_domains&lt;/span&gt; &lt;span class="k"&gt;desc&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;For each search query: how many unique URLs and domains appeared, and what was the spread of positions? This is where you’d start to understand which queries have competitive SERP landscapes — lots of domains spread across positions — versus which are dominated by a handful of sites.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Rank Distribution&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;models/marts/agg_rank_distribution.sql&lt;/strong&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="p"&gt;{{&lt;/span&gt;  
  &lt;span class="n"&gt;config&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;  
    &lt;span class="n"&gt;materialized&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;'table'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  
  &lt;span class="p"&gt;)&lt;/span&gt;  
&lt;span class="p"&gt;}}&lt;/span&gt;  
&lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="n"&gt;rank_buckets&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;  
    &lt;span class="k"&gt;select&lt;/span&gt;  
        &lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  
        &lt;span class="k"&gt;domain&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  
        &lt;span class="k"&gt;case&lt;/span&gt;  
            &lt;span class="k"&gt;when&lt;/span&gt; &lt;span class="n"&gt;result_position&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;=&lt;/span&gt; &lt;span class="mi"&gt;3&lt;/span&gt; &lt;span class="k"&gt;then&lt;/span&gt; &lt;span class="s1"&gt;'1-3'&lt;/span&gt;  
            &lt;span class="k"&gt;when&lt;/span&gt; &lt;span class="n"&gt;result_position&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;=&lt;/span&gt; &lt;span class="mi"&gt;10&lt;/span&gt; &lt;span class="k"&gt;then&lt;/span&gt; &lt;span class="s1"&gt;'4-10'&lt;/span&gt;  
            &lt;span class="k"&gt;when&lt;/span&gt; &lt;span class="n"&gt;result_position&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;=&lt;/span&gt; &lt;span class="mi"&gt;20&lt;/span&gt; &lt;span class="k"&gt;then&lt;/span&gt; &lt;span class="s1"&gt;'11-20'&lt;/span&gt;  
            &lt;span class="k"&gt;when&lt;/span&gt; &lt;span class="n"&gt;result_position&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;=&lt;/span&gt; &lt;span class="mi"&gt;50&lt;/span&gt; &lt;span class="k"&gt;then&lt;/span&gt; &lt;span class="s1"&gt;'21-50'&lt;/span&gt;  
            &lt;span class="k"&gt;else&lt;/span&gt; &lt;span class="s1"&gt;'50+'&lt;/span&gt;  
        &lt;span class="k"&gt;end&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;position_bucket&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  
        &lt;span class="n"&gt;result_position&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  
        &lt;span class="k"&gt;count&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;appearances&lt;/span&gt;  
    &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="p"&gt;{{&lt;/span&gt; &lt;span class="k"&gt;ref&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'int_serp_results'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;}}&lt;/span&gt;  
    &lt;span class="k"&gt;group&lt;/span&gt; &lt;span class="k"&gt;by&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;4&lt;/span&gt;  
&lt;span class="p"&gt;)&lt;/span&gt;  
&lt;span class="k"&gt;select&lt;/span&gt;  
    &lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  
    &lt;span class="n"&gt;position_bucket&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  
    &lt;span class="k"&gt;count&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;distinct&lt;/span&gt; &lt;span class="k"&gt;domain&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;unique_domains&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  
    &lt;span class="k"&gt;sum&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;appearances&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;total_appearances&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  
    &lt;span class="k"&gt;avg&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;result_position&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;avg_position&lt;/span&gt;  
&lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="n"&gt;rank_buckets&lt;/span&gt;  
&lt;span class="k"&gt;group&lt;/span&gt; &lt;span class="k"&gt;by&lt;/span&gt; &lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;position_bucket&lt;/span&gt;  
&lt;span class="k"&gt;order&lt;/span&gt; &lt;span class="k"&gt;by&lt;/span&gt; &lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;position_bucket&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The position buckets — 1–3, 4–10, 11–20, 21–50, 50+ — map to how SEO practitioners actually think about SERP real estate. Positions 1–3 capture the majority of clicks. Anything past 10 is largely invisible. By bucketing in the mart rather than the BI layer, you’re encoding that domain knowledge once, in version-controlled SQL, rather than relying on every analyst who touches the data to re-derive it correctly.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Domain Rank Summary&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;models/marts/agg_domain_rank_summary.sql&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="p"&gt;{{&lt;/span&gt;  
  &lt;span class="n"&gt;config&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;  
    &lt;span class="n"&gt;materialized&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;'table'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  
  &lt;span class="p"&gt;)&lt;/span&gt;  
&lt;span class="p"&gt;}}&lt;/span&gt;  
&lt;span class="k"&gt;select&lt;/span&gt;  
    &lt;span class="k"&gt;domain&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  
    &lt;span class="k"&gt;count&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;total_appearances&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  
    &lt;span class="k"&gt;count&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;distinct&lt;/span&gt; &lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;query_coverage&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  
    &lt;span class="k"&gt;count&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;distinct&lt;/span&gt; &lt;span class="n"&gt;url&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;unique_urls&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  
    &lt;span class="k"&gt;avg&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;result_position&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;avg_position&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  
    &lt;span class="k"&gt;min&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;result_position&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;best_position&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  
    &lt;span class="k"&gt;max&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;result_position&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;worst_position&lt;/span&gt;  
&lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="p"&gt;{{&lt;/span&gt; &lt;span class="k"&gt;ref&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'int_serp_results'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;}}&lt;/span&gt;  
&lt;span class="k"&gt;group&lt;/span&gt; &lt;span class="k"&gt;by&lt;/span&gt; &lt;span class="k"&gt;domain&lt;/span&gt;  
&lt;span class="k"&gt;order&lt;/span&gt; &lt;span class="k"&gt;by&lt;/span&gt; &lt;span class="n"&gt;total_appearances&lt;/span&gt; &lt;span class="k"&gt;desc&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This flips the perspective from query-centric to domain-centric.  &lt;code&gt;query_coverage&lt;/code&gt;  is the field I find most useful here — it tells you how many distinct queries a domain appeared in, which is a rough proxy for breadth of SERP presence. A domain with high  &lt;code&gt;total_appearances&lt;/code&gt;  but low  &lt;code&gt;query_coverage&lt;/code&gt;  is strong in a narrow area; a domain with both metrics high is broadly dominant.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Domain x Query Matrix&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;models/marts/agg_domain_query_matrix.sql&lt;/strong&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="p"&gt;{{&lt;/span&gt;  
  &lt;span class="n"&gt;config&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;  
    &lt;span class="n"&gt;materialized&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;'table'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  
  &lt;span class="p"&gt;)&lt;/span&gt;  
&lt;span class="p"&gt;}}&lt;/span&gt;  
&lt;span class="c1"&gt;-- Domain x query matrix: best position and appearances per (domain, query)  &lt;/span&gt;
&lt;span class="k"&gt;select&lt;/span&gt;  
    &lt;span class="k"&gt;domain&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  
    &lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  
    &lt;span class="k"&gt;min&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;result_position&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;best_position&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  
    &lt;span class="k"&gt;count&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;appearances&lt;/span&gt;  
&lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="p"&gt;{{&lt;/span&gt; &lt;span class="k"&gt;ref&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'int_serp_results'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;}}&lt;/span&gt;  
&lt;span class="k"&gt;group&lt;/span&gt; &lt;span class="k"&gt;by&lt;/span&gt; &lt;span class="k"&gt;domain&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;query&lt;/span&gt;  
&lt;span class="k"&gt;order&lt;/span&gt; &lt;span class="k"&gt;by&lt;/span&gt; &lt;span class="k"&gt;domain&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;query&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The most compact mart, but also the most useful for visualization. Each row is a domain–query pair: the best position that domain achieved for that query, and how many times it appeared. Pivot this by query and you have an SEO visibility matrix — at a glance, which domains rank consistently across your whole keyword set, and which are only competitive in specific corners of it.&lt;/p&gt;

&lt;h2&gt;
  
  
  Running the Pipeline
&lt;/h2&gt;

&lt;p&gt;From your project root (where  &lt;code&gt;dbt_project.yml&lt;/code&gt;  and  &lt;code&gt;profiles.yml&lt;/code&gt;  live):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Mind the " ."  &lt;/span&gt;
dbt run &lt;span class="nt"&gt;--profiles-dir&lt;/span&gt; &lt;span class="nb"&gt;.&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;dbt resolves the  &lt;code&gt;ref()&lt;/code&gt;  graph before executing anything, so models always run in the right dependency order — staging first, then intermediate, then the four marts. To exclude synthetic data across the whole pipeline:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;To run tests after materializing:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;dbt &lt;span class="nb"&gt;test&lt;/span&gt; &lt;span class="nt"&gt;--profiles-dir&lt;/span&gt; &lt;span class="nb"&gt;.&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;To view the lineage graph and model docs in the browser:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;dbt docs generate &lt;span class="nt"&gt;--profiles-dir&lt;/span&gt; &lt;span class="nb"&gt;.&lt;/span&gt;  

dbt docs serve
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If &lt;code&gt;profiles.yml&lt;/code&gt;  is in the project root,&lt;code&gt;--profiles-dir&lt;/code&gt;  . tells dbt to look there. If you use  &lt;code&gt;~/.dbt/profiles.yml&lt;/code&gt;, you can omit it.&lt;/p&gt;

&lt;p&gt;If the deduplication breaks, or if upstream data arrives with unexpected nulls in critical columns, the tests surface it before bad data reaches the marts. That’s the other thing the pipeline buys you beyond SQL organization: the tests run as a first-class step, not as a notebook someone wrote once and forgot to share.&lt;/p&gt;

&lt;p&gt;The raw web data you ingest is  &lt;em&gt;queryable,&lt;/em&gt;  certainly (and SERP APIs like Bright Data make that very convenient with their structured JSON response), but it isn’t  &lt;em&gt;trustworthy&lt;/em&gt;  in the way analytics requires — where “trustworthy” means that every analyst querying it gets the same answers, because the decisions about normalization, deduplication, and what constitutes a valid result were made once and encoded somewhere they can be found.&lt;/p&gt;

&lt;p&gt;Without a pipeline like this, those decisions happen ad hoc. Someone normalizes the domain in a notebook. Someone else doesn’t. The counts disagree, and it’s not clear why. dbt’s layered model is a response to that problem:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  the decisions are in version-controlled SQL,&lt;/li&gt;
&lt;li&gt;  the contracts are enforced by tests that run on every execution, and&lt;/li&gt;
&lt;li&gt;  when something changes upstream, the first place you’ll hear about it is a failing test rather than a confused analyst.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;That’s reproducible analytics&lt;/strong&gt;  — the same inputs and the same logic producing the same outputs, every time.&lt;/p&gt;

&lt;p&gt;That’s what the move from raw scrapes to a modeled pipeline with dbt actually buys you. Not fancier queries, but a system where the logic is visible, the contracts are testable, and the next person who touches the data doesn’t have to reverse-engineer what you were thinking.&lt;/p&gt;

</description>
      <category>database</category>
      <category>datascience</category>
      <category>python</category>
      <category>programming</category>
    </item>
    <item>
      <title>The Practical Limits of DuckDB on Commodity Hardware</title>
      <dc:creator>Prithwish Nath</dc:creator>
      <pubDate>Tue, 10 Mar 2026 08:21:25 +0000</pubDate>
      <link>https://dev.to/prithwish_nath/the-practical-limits-of-duckdb-on-commodity-hardware-f76</link>
      <guid>https://dev.to/prithwish_nath/the-practical-limits-of-duckdb-on-commodity-hardware-f76</guid>
      <description>&lt;p&gt;DuckDB shouldn’t work this well. It’s a single embedded library that needs no server, config, or cloud bill — yet it handles warehouse-scale columnar analytics with surprising ease.&lt;/p&gt;

&lt;p&gt;Plenty of benchmarks already show that DuckDB can process large datasets. The more useful question, to me, is narrower: &lt;strong&gt;how far does it scale on low-end hardware before interactivity breaks down?&lt;/strong&gt; I do data forensics for a living, and the last thing I want is infrastructure getting in the way.&lt;/p&gt;

&lt;p&gt;To answer that, I ran a 50-million-row benchmark on a ~$500 Acer Aspire 5 (Raptor Lake i5, 16GB RAM, 1 TB SSD). Starting with 50,000 real-world search results, I generated a large synthetic dataset and executed increasingly complex analytical queries to identify where performance crossed practical thresholds.&lt;/p&gt;

&lt;p&gt;I'll present my findings here. The result wasn't a catastrophic failure point, but a series of predictable transitions (from instant, to tolerable, to better-scheduled-than-waited-on) depending on query shape.&lt;/p&gt;

&lt;h2&gt;
  
  
  1. The Four Performance Zones
&lt;/h2&gt;




&lt;p&gt;Before I dive into specific query types, understand that not all scales perform the same way. The same query that feels instant at 1 million rows might cross into “go grab a coffee” territory at 30 million — and different query types degrade at different rates.&lt;/p&gt;

&lt;p&gt;So there isn’t one big scale — but three. One for each &lt;em&gt;type&lt;/em&gt; of analytical query:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Percentile queries (like&lt;code&gt;GROUP BY&lt;/code&gt; + &lt;code&gt;PERCENTILE_CONT&lt;/code&gt;&lt;em&gt;) —&lt;/em&gt; when you’re trying to answer questions like “&lt;strong&gt;&lt;em&gt;For each domain, what’s its median, 25th, 75th, and 95th percentile search ranking?&lt;/em&gt;&lt;/strong&gt;” Or: “&lt;strong&gt;&lt;em&gt;For each customer tier, what is the session duration distribution?&lt;/em&gt;&lt;/strong&gt;” This is the kind of query you run when averages aren’t enough. You want to understand spread, skew, and outliers. It’s common in finance, product analytics, marketplace reporting — anywhere distributions matter more than single numbers.&lt;/li&gt;
&lt;li&gt;Window functions (like&lt;code&gt;LAG&lt;/code&gt; with &lt;code&gt;PARTITION BY&lt;/code&gt;) — e.g. “&lt;strong&gt;&lt;em&gt;For each user, how did their activity change vs. previous session?&lt;/em&gt;&lt;/strong&gt;” Or: “&lt;strong&gt;&lt;em&gt;For each account, how did monthly revenue change vs. last month?&lt;/em&gt;&lt;/strong&gt;” This is bread-and-butter analytics — a simple time-series comparison. It’s what powers growth dashboards, churn analysis. Amazon, for example, uses this extensively for anomaly detection. These are more expensive because they require partitioning and sorting — especially on large datasets like ours.&lt;/li&gt;
&lt;li&gt;Aggregations (like&lt;code&gt;COUNT&lt;/code&gt;, &lt;code&gt;AVG&lt;/code&gt;, &lt;code&gt;MIN&lt;/code&gt;, &lt;code&gt;MAX&lt;/code&gt;with &lt;code&gt;DISTINCT&lt;/code&gt;) — e.g. “&lt;strong&gt;&lt;em&gt;For each region, how many total orders did we ship, how many unique customers/products, what were the min/max purchase values?&lt;/em&gt;&lt;/strong&gt;” This is the classic summary view, the kind of query behind almost every dashboard card. Deceptively simple because &lt;code&gt;DISTINCT&lt;/code&gt;at scale forces the engine to work harder than you might expect.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Together, these three queries make up a massive chunk of all real-world analytics.&lt;/p&gt;

&lt;p&gt;Here’s how they perform going from 1,000 to 50,000,000 rows:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1ig0luie5bhzgikh5wkx.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1ig0luie5bhzgikh5wkx.png"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Query time (seconds) vs record count (millions). Image created via D3.js by Author.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Scale&lt;/th&gt;
&lt;th&gt;Percentile&lt;/th&gt;
&lt;th&gt;Window&lt;/th&gt;
&lt;th&gt;Aggregation&lt;/th&gt;
&lt;th&gt;Worst Case&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;1M&lt;/td&gt;
&lt;td&gt;0.11s&lt;/td&gt;
&lt;td&gt;0.80s&lt;/td&gt;
&lt;td&gt;0.12s&lt;/td&gt;
&lt;td&gt;0.8s&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2M&lt;/td&gt;
&lt;td&gt;0.28s&lt;/td&gt;
&lt;td&gt;1.78s&lt;/td&gt;
&lt;td&gt;0.50s&lt;/td&gt;
&lt;td&gt;1.8s&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;5M&lt;/td&gt;
&lt;td&gt;0.64s&lt;/td&gt;
&lt;td&gt;3.28s&lt;/td&gt;
&lt;td&gt;1.23s&lt;/td&gt;
&lt;td&gt;3.3s&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;10M&lt;/td&gt;
&lt;td&gt;1.65s&lt;/td&gt;
&lt;td&gt;6.26s&lt;/td&gt;
&lt;td&gt;2.17s&lt;/td&gt;
&lt;td&gt;6.3s&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;15M&lt;/td&gt;
&lt;td&gt;2.69s&lt;/td&gt;
&lt;td&gt;7.94s&lt;/td&gt;
&lt;td&gt;4.15s&lt;/td&gt;
&lt;td&gt;7.9s&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;20M&lt;/td&gt;
&lt;td&gt;3.69s&lt;/td&gt;
&lt;td&gt;15.26s&lt;/td&gt;
&lt;td&gt;6.58s&lt;/td&gt;
&lt;td&gt;15.3s&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;25M&lt;/td&gt;
&lt;td&gt;5.52s&lt;/td&gt;
&lt;td&gt;22.98s&lt;/td&gt;
&lt;td&gt;10.60s&lt;/td&gt;
&lt;td&gt;23.0s&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;30M&lt;/td&gt;
&lt;td&gt;6.67s&lt;/td&gt;
&lt;td&gt;31.41s&lt;/td&gt;
&lt;td&gt;15.46s&lt;/td&gt;
&lt;td&gt;31.4s&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;40M&lt;/td&gt;
&lt;td&gt;9.82s&lt;/td&gt;
&lt;td&gt;47.43s&lt;/td&gt;
&lt;td&gt;22.64s&lt;/td&gt;
&lt;td&gt;47.4s&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;50M&lt;/td&gt;
&lt;td&gt;12.95s&lt;/td&gt;
&lt;td&gt;67.24s&lt;/td&gt;
&lt;td&gt;32.42s&lt;/td&gt;
&lt;td&gt;67.2s&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Having plotted query times against human perception thresholds (I made some assumptions for those, since this is subjective), we have four distinct performance zones — with boundaries shifting depending on which query type you care about.&lt;/p&gt;

&lt;h3&gt;
  
  
  “Comfort Zone”: Up to 5M Records
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Query times:&lt;/strong&gt; Near instantaneous, `&amp;lt; 3 seconds for &lt;em&gt;everything.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;This is why DuckDB feels magical. For everything up to 5 million rows (and let’s be honest — more data than most organizations are querying interactively anyway) — &lt;em&gt;all queries&lt;/em&gt; are fast. Window functions complete in under a second at 1M. Even at 5M rows, the slowest query (you guessed it; a window function) takes just 3.3 seconds. Percentiles and aggregations are sub-second through 2M and barely crack 1 second at 5M.&lt;/p&gt;

&lt;p&gt;At this scale, DuckDB keeps everything comfortably in memory. Hash tables fit in RAM, partitioning operations don’t require temp files, and there’s zero disk spilling. This is the sweet spot for team analytics, dashboards, and interactive data exploration.&lt;/p&gt;

&lt;h3&gt;
  
  
  “Workable”: 5–20M Records
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Query times:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Window functions: 3–15 seconds. Noticeable latency. At 10M they take 6 seconds, and at 15M, about 8 seconds. They hit 15 seconds right at 20M — the upper edge of this zone.&lt;/li&gt;
&lt;li&gt;Percentiles and aggregations: still under 4 and 7 seconds respectively.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This is where window queries will make people Alt-Tab out to do something else — doomscroll, check their email, watch videos of cats — instead of waiting. Percentiles and aggregations are still firmly comfortable, though.&lt;/p&gt;

&lt;p&gt;This zone is still good for Data science work and one-off analyses where a 10–15 second wait is acceptable. Still no disk spilling. Memory usage stays under ~650 MB.&lt;/p&gt;

&lt;h3&gt;
  
  
  “Now You’re Pushing It”: 20–30M Records
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Query times:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Window functions: 15–31 seconds, always crossing the 15s “frustration threshold”.&lt;/li&gt;
&lt;li&gt;Aggregations: 7–15 seconds (10.6s at 25M, 15.5s at 30M), near-annoying latency now.&lt;/li&gt;
&lt;li&gt;Percentiles: 4–7 seconds (5.5s at 25M, 6.7s at 30M.)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This is the point where query types diverge dramatically. Percentiles are still firmly in the “interactive” category, but aggregations require patience. Window functions however, are firmly in the “go do something else” territory — at 30M records, they can reach a whopping 31 seconds — clearly the point where you should batch these, instead of waiting live.&lt;/p&gt;

&lt;p&gt;In general, this zone is best for automated/scheduled work where humans aren’t waiting on complex queries. Simple aggregations and percentiles are still interactive enough.&lt;/p&gt;

&lt;h3&gt;
  
  
  “Batch-Only”: 30M+ Records
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Query times:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Window functions: 31–67 seconds. Always past one minute at 50M.&lt;/li&gt;
&lt;li&gt;Aggregations: 15–32 seconds. Always past 30 seconds at 50M.&lt;/li&gt;
&lt;li&gt;Percentiles: 7–13 seconds.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;In a testament to how performant DuckDB can be — if you’re &lt;em&gt;only&lt;/em&gt; running percentiles, even 50M rows is still marginally interactive (tops out at 13 seconds.)&lt;/p&gt;

&lt;p&gt;Everything else at this scale, though, should be scheduled.&lt;/p&gt;

&lt;p&gt;If you need window functions, this (30M and above) is clearly the point where you should consider a cloud data warehouse — they cross a minute at 50M (~67 seconds). Aggregations taking 20–30 seconds on average is also beyond generous definitions of “interactive”.&lt;/p&gt;

&lt;p&gt;So this is where we’ve found our limit — albeit not a hard one. Performance degrades predictably, not catastrophically. &lt;strong&gt;Critically, there’s still zero disk spilling.&lt;/strong&gt; DuckDB never writes temp files to disk, even at 50M records. Memory usage peaked at ~1.2 GB.&lt;/p&gt;

&lt;p&gt;Here’s a practical cheatsheet, sorted by use case:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Use case&lt;/th&gt;
&lt;th&gt;Simple analytics ceiling&lt;/th&gt;
&lt;th&gt;Window function ceiling&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Live auto-refresh dashboard (less than 500ms)&lt;/td&gt;
&lt;td&gt;~2M rows&lt;/td&gt;
&lt;td&gt;~500K rows&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;API-backed service (less than 1s)&lt;/td&gt;
&lt;td&gt;~5M rows&lt;/td&gt;
&lt;td&gt;~1M rows&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Interactive BI (less than 3s)&lt;/td&gt;
&lt;td&gt;~15M rows&lt;/td&gt;
&lt;td&gt;~5M rows&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Notebook / exploration (less than 10s)&lt;/td&gt;
&lt;td&gt;~40M rows&lt;/td&gt;
&lt;td&gt;~15M rows&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Ad-hoc analyst SQL (less than 30s)&lt;/td&gt;
&lt;td&gt;50M+ rows&lt;/td&gt;
&lt;td&gt;~20–25M rows&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Batch / ETL (less than 2min)&lt;/td&gt;
&lt;td&gt;50M+ rows&lt;/td&gt;
&lt;td&gt;50M+ rows&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h2&gt;
  
  
  2. The Goldilocks Zone for Local Analytics
&lt;/h2&gt;




&lt;p&gt;If I had to draw a circle around the scale where DuckDB on a cheap laptop is just &lt;em&gt;right&lt;/em&gt; — fast enough to feel interactive, complex enough to handle real analytical work, without needing to think about infrastructure — it’s &lt;strong&gt;1M to 10M rows&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Here’s why that range is special.&lt;/p&gt;

&lt;p&gt;At 1M rows, every query type is effectively instant. Aggregations finish in 0.06 seconds. Even the most expensive query in the test — the &lt;code&gt;LAG&lt;/code&gt; window function — takes 0.47 seconds. The database is not part of your cognitive load at all at this scale.&lt;/p&gt;

&lt;p&gt;At 5M rows, that’s still largely true. Aggregations take half a second. Percentiles take 1.3 seconds. Window functions take 1.7 seconds. 5M rows of real data is a meaningful dataset — a year of event logs for a mid-sized SaaS product, a full crawl of a large website, several years of transaction history for a small business. And DuckDB chews through it without complaint.&lt;/p&gt;

&lt;p&gt;At 10M rows, you start paying a tax specifically on window functions — they cross 6 seconds here. But aggregations (0.79s) and percentiles (1.99s) are still fast enough that interactive use feels natural. If your workload skews toward &lt;code&gt;GROUP BY &lt;/code&gt;analytics and distribution queries rather than time-series comparisons, 10M rows is still firmly comfortable.&lt;/p&gt;

&lt;p&gt;So our “Goldilocks” Zone — so named after the fairytale of Goldilocks and the Three Bears; Goldilocks rejects porridges that are too hot &lt;em&gt;and&lt;/em&gt; too cold, until finding one that’s “just right” — is &lt;strong&gt;up to ~10M rows for any query type, and up to ~20M rows if you’re willing to accept that window functions will make you wait.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Below that ceiling, DuckDB on cheap/commodity hardware is no compromise at all.&lt;/p&gt;

&lt;h2&gt;
  
  
  3. Memory is Not the Bottleneck
&lt;/h2&gt;




&lt;p&gt;Across all scales tested (1K → 50M rows, averaged over 5 runs) DuckDB never exceeded ~&lt;strong&gt;1.2 GB&lt;/strong&gt; of memory — even at 50 million rows.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwjpiz4y36wj1q33onjhd.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwjpiz4y36wj1q33onjhd.png"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Peak memory (MB) vs record count (log scale, 1K–50M). Image created via D3.js by Author.&lt;/p&gt;

&lt;p&gt;At the upper bound (50 million rows), &lt;strong&gt;peak observed memory usage&lt;/strong&gt; was ~1,212 MB. On my 16GB laptop that’s ~7.5%. And that’s the worst case in this benchmark — with 50M rows and window-heavy queries in play.&lt;/p&gt;

&lt;p&gt;Memory growth is real — especially with window function (delta) queries — but it’s still fairly controlled, I’d say:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;10M rows → ~600 MB range&lt;/li&gt;
&lt;li&gt;20M rows → ~650 MB range&lt;/li&gt;
&lt;li&gt;30–50M rows → ~900 MB–1.2 GB range&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Time, on the other hand, accelerates sharply:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Window query: ~6.4s at 10M&lt;/li&gt;
&lt;li&gt;~15.8s at 20M&lt;/li&gt;
&lt;li&gt;~70s at 50M&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;So the &lt;em&gt;practical&lt;/em&gt; ceiling on a cheap laptop isn’t RAM. You’ll run out of patience before you run out of memory. It means for many local analytics use cases the scaling limit is UX, not hardware — that’s a very different kind of bottleneck, though.&lt;/p&gt;

&lt;p&gt;Before we move on, an interesting thing to note is that &lt;strong&gt;the aggregation query degrades faster than percentiles at scale.&lt;/strong&gt; At 5M they’re similar (~0.6s vs ~1.2s), but by 50M the aggregation (33.7s) is nearly 3x the percentile (12.5s). Multi-metric&lt;code&gt; GROUP BY&lt;/code&gt; with &lt;code&gt;HAVING&lt;/code&gt; and &lt;code&gt;DISTINCT &lt;/code&gt;counts WILL get expensive. If your dashboard runs heavy aggregations specifically, you should probably discount the “simple” ceiling by ~30–40%.&lt;/p&gt;

&lt;h2&gt;
  
  
  4. Window Functions Affect Scaling The Most
&lt;/h2&gt;




&lt;p&gt;If you’re embedding DuckDB in:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;A local analytics tool&lt;/li&gt;
&lt;li&gt;A CLI data processor&lt;/li&gt;
&lt;li&gt;A desktop SaaS product&lt;/li&gt;
&lt;li&gt;A developer-facing data tool&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Row count alone is not your sizing variable. &lt;strong&gt;Query shape is.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Across 18 dataset scales (1k → 50M rows, averaged over 5 runs, the single biggest performance divider wasn’t how much data existed — &lt;strong&gt;it was whether the workload used window functions, at all&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://duckdb.org/docs/stable/sql/functions/window_functions" rel="noopener noreferrer"&gt;Window functions&lt;/a&gt; power extremely common patterns:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Ranking results per group (&lt;code&gt;RANK()&lt;/code&gt;,&lt;code&gt; ROW_NUMBER()&lt;/code&gt;)&lt;/li&gt;
&lt;li&gt;Comparing rows over time (&lt;code&gt;LAG()&lt;/code&gt;, &lt;code&gt;LEAD()&lt;/code&gt;)&lt;/li&gt;
&lt;li&gt;Running totals&lt;/li&gt;
&lt;li&gt;Sessionization&lt;/li&gt;
&lt;li&gt;Deduplication&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;These are all operations you’ll run many, &lt;em&gt;many&lt;/em&gt; times for a real project.&lt;/p&gt;

&lt;p&gt;I found that on identical hardware, aggregation-heavy workloads comfortably scale to tens of millions of rows while staying interactive. These Window-heavy workloads, though, hit “coffee break” territory &lt;em&gt;much&lt;/em&gt; earlier.&lt;/p&gt;

&lt;h3&gt;
  
  
  At 10M rows
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Percentile query (aggregation-heavy):&lt;/strong&gt; ~1.7s, ~217 MB memory delta&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Window “delta” query (LAG-style):&lt;/strong&gt; ~6.4s, ~46 MB memory delta&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Group aggregation:&lt;/strong&gt; ~2.5s, ~200 MB memory delta&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Already, the window query is ~3–4x slower than pure aggregation.&lt;/p&gt;

&lt;h3&gt;
  
  
  At 20M rows
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Percentile:&lt;/strong&gt; ~4.1s&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Window (delta):&lt;/strong&gt; ~15.8s&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Aggregation:&lt;/strong&gt; ~7.3s&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The gap widens. The window query is now roughly 2–4x slower than aggregation-style queries.&lt;/p&gt;

&lt;h3&gt;
  
  
  At 50M rows
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Percentile:&lt;/strong&gt; ~16.5s, ~330 MB delta
*&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Window (delta):&lt;/strong&gt; ~70.3s, ~588 MB delta&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Aggregation:&lt;/strong&gt; ~33.3s, ~113 MB deltaThis is where the separation becomes untenable:&lt;/li&gt;
&lt;li&gt;The window query takes &lt;strong&gt;70 seconds&lt;/strong&gt;.&lt;/li&gt;
&lt;li&gt;Aggregation-heavy queries remain roughly half that.&lt;/li&gt;
&lt;li&gt;Memory growth for window logic accelerates much more aggressively at scale.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If your workload is mostly &lt;code&gt;GROUP BY&lt;/code&gt;, percentiles, and simple scans, then you have substantial headroom. If your workload relies heavily on&lt;code&gt; LAG()&lt;/code&gt;,&lt;code&gt; LEAD()&lt;/code&gt;, &lt;code&gt;RANK() OVER (PARTITION BY …)&lt;/code&gt; then your practical ceiling with DuckDB (running on such consumer grade local hardware, at least) arrives much sooner.&lt;/p&gt;

&lt;h2&gt;
  
  
  5. …But Make Sure you Checkpoint Enough
&lt;/h2&gt;

&lt;p&gt;Speaking of window functions, avery interesting recurring pattern I found was that at exactly 50,000 records, these hit a wall.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8frqsxwwbp62oln06rtu.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8frqsxwwbp62oln06rtu.png"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Delta query time (seconds) vs record count (log scale, 1K–5M) showing the spike at 50K. Image created via D3.js by Author&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;This pattern is consistent across all 5 runs:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;20K records: 0.24 seconds (fast)&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;50K records: 1.21 seconds (5x slower than expected)&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;100K records: 0.20 seconds (fast again)&lt;/li&gt;
&lt;/ul&gt;

&lt;blockquote&gt;
&lt;p&gt;Assuming expected: linear interpolation between 20K (0.241s) and 100K (0.204s) at 50K: 0.241 + (0.204 − 0.241) × (50 − 20) / (100 − 20) ≈ 0.225s&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;This specific spike appears in &lt;strong&gt;every single run&lt;/strong&gt; with CV (coefficient of variation) of only 10.3%. Expected time based on linear scaling was 0.23 seconds. Actual time? 1.21 seconds. &lt;strong&gt;That’s 430% slower than it should be!&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;It’s the only non-monotonic performance pattern in the entire dataset. Every other scale shows smooth, predictable scaling. The 50K spike for window/delta functions is the exception.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Why it happens:&lt;/strong&gt; &lt;a href="https://duckdb.org/docs/stable/sql/statements/checkpoint" rel="noopener noreferrer"&gt;Checkpoint more frequently.&lt;/a&gt; DuckDB has automatic checkpointing (&lt;code&gt;wal_autocheckpoint&lt;/code&gt;), but it doesn’t work reliably during bulk inserts from Python. (&lt;a href="https://github.com/duckdb/duckdb/issues/9721" rel="noopener noreferrer"&gt;Check this GitHub issue&lt;/a&gt;). Knowing this, I set checkpointing manually, but as it turned out I wasn’t checkpointing &lt;em&gt;enough&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;Data was inserted in 1,000-row batches across multiple database connections, but I had a LOT of records (50 million) so I thought I’d manually trigger CHECKPOINT only every 100K rows.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;em&gt;💡&lt;/em&gt; CHECKPOINT is a database operation that flushes pending writes to disk and optimizes storage layout. In DuckDB specifically:- Consolidates fragmented columnar segments created by many small inserts&lt;br&gt;
- Reorganizes data into optimized columnar format for faster reads&lt;br&gt;
- Flushes the write-ahead log (WAL) to the main database fileLike defragmenting a hard drive, CHECKPOINT reorganizes scattered data into contiguous, optimized blocks.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Because the scales were run sequentially (1K → 5K → 10K → 20K → 50K → 100K and so on), by the time the 50K test executed, dozens of small batch inserts had accumulated without a CHECKPOINT, leaving the write-ahead log fragmented across many small column segments.&lt;/p&gt;

&lt;p&gt;Now consider the query that spikes:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;LAG(rank) OVER (PARTITION BY url, query ORDER BY timestamp)&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;This requires a full sort of the relevant partitions. Sorting is exactly the kind of operation that is most sensitive to storage layout, and fragmented column segments mean less efficient scanning and sorting. So at 50K, the query pays the fragmentation penalty.&lt;/p&gt;

&lt;p&gt;By 50K I had 50 fragmented segments. The window function’s sort pays the cost of reading across all those fragments. At 100K, CHECKPOINT compacted everything, so queries got faster despite more data.&lt;/p&gt;

&lt;p&gt;That’s all the data I found. If you want to know about my methodology for this (admittedly) very niche benchmark, read on.&lt;/p&gt;

&lt;h2&gt;
  
  
  Methodology — How this Low-End Benchmark Was Built
&lt;/h2&gt;




&lt;p&gt;What I wanted to do was stress-test DuckDB with realistic analytical queries at scales most teams would consider “cloud warehouse only.” So if I wanted to benchmark how it performed on low-end hardware, I had to flip the entire approach on its head.&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 1 — Getting Search Data
&lt;/h3&gt;

&lt;p&gt;I started by fetching 50,000 actual Google SERP results using Bright Data’s &lt;a href="https://get.brightdata.com/bd-serp-api?utm_content=practical_limits_of_duckdb_on_commodity_hardware" rel="noopener noreferrer"&gt;SERP API&lt;/a&gt;. Why SERP data? Because search results have realistic cardinality — lots of unique URLs, domains, queries, timestamps — exactly the kind of data that stresses analytical databases.&lt;/p&gt;

&lt;p&gt;I built ~2,550 unique queries by combining 50 base topics with 51 suffix variations:&lt;/p&gt;

&lt;p&gt;`&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;_BASE_TOPICS&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;  
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;python programming&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;machine learning&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;web development&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  
    &lt;span class="c1"&gt;# ... 47 more  
&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;  

&lt;span class="n"&gt;_QUERY_SUFFIXES&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;  
    &lt;span class="sh"&gt;""&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt; tutorial&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt; 2024&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt; best practices&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  
    &lt;span class="c1"&gt;# ... 47 more  
&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;  

&lt;span class="n"&gt;SERP_QUERIES&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;  
    &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;base&lt;/span&gt;&lt;span class="si"&gt;}{&lt;/span&gt;&lt;span class="n"&gt;suffix&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;strip&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;  
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;base&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;_BASE_TOPICS&lt;/span&gt;  
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;suffix&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;_QUERY_SUFFIXES&lt;/span&gt;  
&lt;span class="p"&gt;]&lt;/span&gt;  
&lt;span class="n"&gt;SERP_QUERIES&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;list&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;dict&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;fromkeys&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;SERP_QUERIES&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;blockquote&gt;
&lt;p&gt;Full code here: &lt;a href="https://gist.github.com/sixthextinction/57c0cc9358eac133d07fdbbeadd832ff" rel="noopener noreferrer"&gt;serp_queries.py&lt;/a&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Each query fetched up to 20 organic results via the API, streamed directly into DuckDB as results came in:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;client&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;BrightDataClient&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;  

&lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="nc"&gt;DuckDBManager&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;db&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;  
    &lt;span class="k"&gt;while&lt;/span&gt; &lt;span class="n"&gt;results_obtained&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="n"&gt;total_results&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;  
        &lt;span class="n"&gt;query&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;queries&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;query_idx&lt;/span&gt; &lt;span class="o"&gt;%&lt;/span&gt; &lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;queries&lt;/span&gt;&lt;span class="p"&gt;)]&lt;/span&gt;  
        &lt;span class="n"&gt;serp_data&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;search&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;num_results&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;batch_size&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  

        &lt;span class="n"&gt;organic_results&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;  
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="nf"&gt;isinstance&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;serp_data&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nb"&gt;dict&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;  
            &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;organic&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;serp_data&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;  
                &lt;span class="n"&gt;organic_results&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;serp_data&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;organic&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;  
            &lt;span class="k"&gt;elif&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;body&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;serp_data&lt;/span&gt; &lt;span class="ow"&gt;and&lt;/span&gt; &lt;span class="nf"&gt;isinstance&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;serp_data&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;body&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="nb"&gt;dict&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;  
                &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;organic&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;serp_data&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;body&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]:&lt;/span&gt;  
                    &lt;span class="n"&gt;organic_results&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;serp_data&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;body&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;organic&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;  

        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;organic_results&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;  
            &lt;span class="n"&gt;db&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;insert_batch&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;organic_results&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  
            &lt;span class="n"&gt;results_obtained&lt;/span&gt; &lt;span class="o"&gt;+=&lt;/span&gt; &lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;organic_results&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  

        &lt;span class="n"&gt;query_idx&lt;/span&gt; &lt;span class="o"&gt;+=&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;  
        &lt;span class="n"&gt;time&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;sleep&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;delay_seconds&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;blockquote&gt;
&lt;p&gt;Full code here: &lt;a href="https://gist.github.com/sixthextinction/4e8ae4eefebc1c07587882b49ffaecdb" rel="noopener noreferrer"&gt;bright_data.py&lt;/a&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;My DuckDB schema mirrors the SERP data structure the API returns. (&lt;a href="https://get.brightdata.com/lp-scraping-browser-acf1964?utm_content=practical_limits_of_duckdb_on_commodity_hardware" rel="noopener noreferrer"&gt;Check their docs here for more info&lt;/a&gt;):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;conn&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="k"&gt;execute&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nv"&gt;"&lt;/span&gt;&lt;span class="se"&gt;""&lt;/span&gt;&lt;span class="nv"&gt;  
    CREATE TABLE IF NOT EXISTS serp_results (  
        id BIGINT PRIMARY KEY,  
        query TEXT NOT NULL,  
        timestamp TIMESTAMP NOT NULL,  
        result_position INTEGER NOT NULL,  
        title TEXT,  
        url TEXT,  
        snippet TEXT,  
        domain TEXT,  
        rank INTEGER,  
        previous_rank INTEGER,  
        rank_delta INTEGER  
    )  
&lt;/span&gt;&lt;span class="se"&gt;""&lt;/span&gt;&lt;span class="nv"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;blockquote&gt;
&lt;p&gt;Full code here: &lt;a href="https://gist.github.com/sixthextinction/4ac6bc086e31dd1f4a8e56bef51b8e1c" rel="noopener noreferrer"&gt;duckdb_manager.py&lt;/a&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;This adds up to 50K rows of real, varied data with natural patterns — not artificial test data…but that still wasn’t enough.&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 2 — Synthesize 50M Records from Real Patterns
&lt;/h3&gt;

&lt;p&gt;50K rows isn’t enough to stress-test at scale. But &lt;em&gt;completely&lt;/em&gt; random synthetic data loses the realistic patterns that make queries slow. So I extracted actual domains, queries, title structures, and snippets from the real data and used those to generate variations:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;extract_serp_patterns&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;db_path&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;  
    &lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="nc"&gt;DuckDBManager&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;db_path&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;db&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;  
        &lt;span class="n"&gt;queries&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;db&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;conn&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;execute&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;  
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;SELECT DISTINCT query FROM serp_results ORDER BY RANDOM() LIMIT 100&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;  
        &lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;fetchall&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;  
        &lt;span class="n"&gt;domains&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;db&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;conn&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;execute&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;  
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;SELECT DISTINCT domain FROM serp_results WHERE domain IS NOT NULL LIMIT 200&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;  
        &lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;fetchall&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;  
        &lt;span class="n"&gt;title_samples&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;db&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;conn&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;execute&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;  
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;SELECT title FROM serp_results WHERE title IS NOT NULL LIMIT 50&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;  
        &lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;fetchall&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;  
        &lt;span class="n"&gt;snippet_samples&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;db&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;conn&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;execute&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;  
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;SELECT snippet FROM serp_results WHERE snippet IS NOT NULL LIMIT 50&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;  
        &lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;fetchall&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;  
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;  
            &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;queries&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;r&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;r&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;queries&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;  
            &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;domains&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;r&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;r&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;domains&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;  
            &lt;span class="p"&gt;...&lt;/span&gt;  
        &lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Synthetic rows were generated in batches of 10,000 and inserted directly, mixed in with the 50k real records, until we had 50M total:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nf"&gt;range&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;needed&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;batch_size&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;  
    &lt;span class="n"&gt;batch&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[{&lt;/span&gt;  
        &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;id&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;current_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  
        &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;query&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;random&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;choice&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;queries&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;  
        &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;domain&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;random&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;choice&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;domains&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;  
        &lt;span class="p"&gt;...&lt;/span&gt;  
    &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;j&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nf"&gt;range&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;batch_size&lt;/span&gt;&lt;span class="p"&gt;)]&lt;/span&gt;  
    &lt;span class="n"&gt;df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;DataFrame&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;batch&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  
    &lt;span class="n"&gt;db&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;conn&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;execute&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;INSERT INTO serp_results SELECT * FROM df&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;blockquote&gt;
&lt;p&gt;Full code here: &lt;a href="https://gist.github.com/sixthextinction/d8e1057ddabaf98286279f281fa3734c" rel="noopener noreferrer"&gt;benchmark.py&lt;/a&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h3&gt;
  
  
  Step 3 — Design the Query Suite
&lt;/h3&gt;

&lt;p&gt;I chose three query types that represent common analytical workloads:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Percentile Queries:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="n"&gt;query&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="nv"&gt;"&lt;/span&gt;&lt;span class="se"&gt;""&lt;/span&gt;&lt;span class="nv"&gt;  
    SELECT   
        domain,  
        COUNT(&lt;/span&gt;&lt;span class="se"&gt;\*&lt;/span&gt;&lt;span class="nv"&gt;) as result_count,  
        AVG(rank) as avg_rank,  
        PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY rank) as median_rank,  
        PERCENTILE_CONT(0.25) WITHIN GROUP (ORDER BY rank) as p25_rank,  
        PERCENTILE_CONT(0.75) WITHIN GROUP (ORDER BY rank) as p75_rank,  
        PERCENTILE_CONT(0.95) WITHIN GROUP (ORDER BY rank) as p95_rank  
    FROM serp_results  
    {where_clause}  
    GROUP BY domain  
    ORDER BY avg_rank  
    LIMIT 100  
&lt;/span&gt;&lt;span class="se"&gt;""&lt;/span&gt;&lt;span class="nv"&gt;"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Window Functions:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="n"&gt;query&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="nv"&gt;"&lt;/span&gt;&lt;span class="se"&gt;""&lt;/span&gt;&lt;span class="nv"&gt;  
    WITH ranked AS (  
        SELECT   
            url,  
            query,  
            rank,  
            timestamp,  
            LAG(rank) OVER (PARTITION BY url, query ORDER BY timestamp) as previous_rank  
        FROM serp_results  
        {id_filter}  
    )  
    SELECT   
        url,  
        query,  
        rank,  
        previous_rank,  
        rank - previous_rank as rank_delta,  
        timestamp  
    FROM ranked  
    WHERE previous_rank IS NOT NULL  
    ORDER BY ABS(rank_delta) DESC  
    LIMIT 100  
&lt;/span&gt;&lt;span class="se"&gt;""&lt;/span&gt;&lt;span class="nv"&gt;"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Aggregation Queries:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="n"&gt;query&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="nv"&gt;"&lt;/span&gt;&lt;span class="se"&gt;""&lt;/span&gt;&lt;span class="nv"&gt;  
    SELECT   
        domain,  
        COUNT(*) as total_results,  
        COUNT(DISTINCT query) as unique_queries,  
        AVG(rank) as avg_rank,  
        MIN(rank) as best_rank,  
        MAX(rank) as worst_rank,  
        COUNT(DISTINCT url) as unique_urls  
    FROM serp_results  
    {where_clause}  
    GROUP BY domain  
    HAVING COUNT(*) &amp;gt; 10  
    ORDER BY total_results DESC  
    LIMIT 50  
&lt;/span&gt;&lt;span class="se"&gt;""&lt;/span&gt;&lt;span class="nv"&gt;"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;blockquote&gt;
&lt;p&gt;Full code (all queries) here: &lt;a href="https://gist.github.com/sixthextinction/533b2c022984272b5bbc071478ae9ebc" rel="noopener noreferrer"&gt;queries.py&lt;/a&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;These test different bottlenecks: percentiles stress sorting and statistical aggregation; window functions stress partitioning and stateful operations; aggregations stress hash tables and &lt;code&gt;DISTINCT&lt;/code&gt;operations.&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 4 — Run at 18 Different Scales
&lt;/h3&gt;

&lt;p&gt;Instead of regenerating data for each scale, I used a single optimization: filter by ID. One dataset, 18 different windows into it:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="n"&gt;max_id_result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;db&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;conn&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="k"&gt;execute&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="nv"&gt;"&lt;/span&gt;&lt;span class="se"&gt;""&lt;/span&gt;&lt;span class="nv"&gt;  
    SELECT id FROM (  
        SELECT id FROM serp_results ORDER BY id LIMIT {target_count}  
    ) ORDER BY id DESC LIMIT 1  
&lt;/span&gt;&lt;span class="se"&gt;""&lt;/span&gt;&lt;span class="nv"&gt;"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="n"&gt;fetchone&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;  

&lt;span class="n"&gt;max_id&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;max_id_result&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Every query then receives this &lt;code&gt;max_id&lt;/code&gt; as a filter:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;where_clause&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;WHERE id &amp;lt;= &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;max_id&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; AND domain IS NOT NULL AND domain != &lt;/span&gt;&lt;span class="sh"&gt;''"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;blockquote&gt;
&lt;p&gt;Full code here: &lt;a href="https://gist.github.com/sixthextinction/d8e1057ddabaf98286279f281fa3734c" rel="noopener noreferrer"&gt;benchmark.py&lt;/a&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;This tested 1K, 5K, 10K, 20K, 50K, 100K, 200K, 500K, 1M, 2M, 5M, 10M, 15M, 20M, 25M, 30M, 40M, and 50M records against identical data — isolating row count as the only variable. Each scale measured query execution time and peak memory usage via &lt;code&gt;psutil&lt;/code&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 5 — Run Five Complete Iterations
&lt;/h3&gt;

&lt;p&gt;To ensure patterns weren’t random variance, I ran the entire benchmark suite five times. This produced 90 data points per query type (5 runs x 18 scales).&lt;/p&gt;

&lt;p&gt;The result: performance was shockingly consistent. Most metrics had less than 10% coefficient of variation across runs — the 50M window function result, for instance, varied by less than 1 second across all five runs.&lt;/p&gt;

&lt;h2&gt;
  
  
  Limitations
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Uniform synthetic distribution:&lt;/strong&gt; The 50M records were generated by sampling uniformly from ~200 real domains and ~100 real queries. Real-world data follows a power-law distribution — a small number of domains dominate traffic heavily. This means the &lt;code&gt;PARTITION BY url, query&lt;/code&gt; window function hits roughly equal-sized partitions in this benchmark, which is best-case behavior. In production data with skewed distributions, window function performance at scale could be meaningfully worse than the numbers here suggest.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Single-table queries only:&lt;/strong&gt; Didn’t test multi-table joins or recursive CTEs.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Hardware-specific:&lt;/strong&gt; Tested on one machine (16GB RAM). Results may vary on different hardware.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Query-specific:&lt;/strong&gt; These three query types don’t represent all analytical workloads.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;DuckDB version:&lt;/strong&gt; Tested on DuckDB &amp;gt;=1.0.0. Newer versions may perform differently.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;No concurrent queries:&lt;/strong&gt; Single-threaded only, no concurrent workload tested.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Conclusion — Scaling Is a UX Ceiling.
&lt;/h2&gt;




&lt;p&gt;This experiment was more about finding out where local analytics on consumer grade hardware stops feeling interactive, rather than stress-testing DuckDB.&lt;/p&gt;

&lt;p&gt;Across all runs, DuckDB remained stable and memory usage stayed modest. There was no hard failure point, no dramatic collapse in performance. Instead, execution times increased predictably as data grew — and different query shapes crossed the interactivity threshold at different scales. Aggregation-heavy workloads retained responsiveness far longer, while window-heavy queries reached the UX ceiling much sooner.&lt;/p&gt;

&lt;p&gt;This, then, is our practical takeaway: &lt;strong&gt;the limiting factor on local analytics isn’t usually RAM or disk; it’s the point at which queries stop feeling fluid.&lt;/strong&gt; At that boundary, the decision you have to make is entirely about whether (for your use case) the user experience is still good enough to keep things local.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Some links in this article are tracking links used for analytics purposes only. I do not receive any commission or compensation from them.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>datascience</category>
      <category>python</category>
      <category>programming</category>
      <category>database</category>
    </item>
  </channel>
</rss>
