<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Prithwish Nath</title>
    <description>The latest articles on DEV Community by Prithwish Nath (@prithwish_nath).</description>
    <link>https://dev.to/prithwish_nath</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F2949644%2F62f159e9-294c-4c8f-a449-3857d7e5fbb6.jpg</url>
      <title>DEV Community: Prithwish Nath</title>
      <link>https://dev.to/prithwish_nath</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/prithwish_nath"/>
    <language>en</language>
    <item>
      <title>How To Validate Any API Response with Great Expectations (GX)</title>
      <dc:creator>Prithwish Nath</dc:creator>
      <pubDate>Wed, 15 Apr 2026 05:59:38 +0000</pubDate>
      <link>https://dev.to/prithwish_nath/how-to-validate-any-api-response-with-great-expectations-gx-2deb</link>
      <guid>https://dev.to/prithwish_nath/how-to-validate-any-api-response-with-great-expectations-gx-2deb</guid>
      <description>&lt;p&gt;A lot of my data analysis work is forensic (example &lt;a href="https://javascript.plainenglish.io/reverse-engineering-virality-what-1k-reddit-posts-reveal-about-the-internets-attention-economy-5a1f913ffc4d" rel="noopener noreferrer"&gt;here&lt;/a&gt;). I’ll pull repeated snapshots — content, traffic, SERPs, metadata — and look at patterns across &lt;em&gt;hundreds&lt;/em&gt; of these ingestions. So naturally, bad batches will skew results in ways that are &lt;em&gt;just convincing enough&lt;/em&gt; so I don’t catch it. Disastrous.&lt;/p&gt;

&lt;p&gt;You can add try-catches, but a 200 OK from your API with empty titles and duplicate rows is still considered a &lt;em&gt;success&lt;/em&gt; on the wire unless you model validation as its own failure path. So is there a better way?&lt;/p&gt;

&lt;p&gt;This is exactly the gap &lt;a href="https://docs.greatexpectations.io/docs/" rel="noopener noreferrer"&gt;Great Expectations&lt;/a&gt; (GX Core) fills. It’s an open-source Python library for defining declarative quality rules on your data — &lt;strong&gt;“quality gates”&lt;/strong&gt;, explicit rules your data must pass before it’s trusted: things like “this field must never be null”, “these IDs must be unique”, or “this value must fall within a known range”. You basically codify what “good” looks like, run it as a validation step in your pipeline, and get a clear pass/fail on every batch — before bad data ever touches your analysis.&lt;/p&gt;
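
&lt;p&gt;Here’s roughly what one of those rules looks like in code. This is a minimal sketch against a throwaway pandas DataFrame, following the GX Core 1.x quickstart flow; the column and values are invented purely to show the shape of the API, and the real suite for our SERP data comes later:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import great_expectations as gx
import pandas as pd

# A toy batch: one null title, which should trip the gate
df = pd.DataFrame({"title": ["Result A", None, "Result C"]})

# Ephemeral context + pandas data source (GX Core 1.x flow)
context = gx.get_context()
source = context.data_sources.add_pandas("toy_source")
asset = source.add_dataframe_asset(name="toy_asset")
batch_def = asset.add_batch_definition_whole_dataframe("toy_batch")
batch = batch_def.get_batch(batch_parameters={"dataframe": df})

# One declarative rule: "this field must never be null"
expectation = gx.expectations.ExpectColumnValuesToNotBeNull(column="title")
result = batch.validate(expectation)

print(result.success)  # False -- the null title fails the gate
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;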

&lt;p&gt;Let’s turn that idea into code — we’ll get data at scale via &lt;a href="https://get.brightdata.com/bd-serp-api?utm_content=how_to_validate_any_api_response_with_great_expectations_gx" rel="noopener noreferrer"&gt;Bright Data&lt;/a&gt; (a SERP API), run a GX suite over each batch, then store the results in &lt;a href="https://duckdb.org/docs/stable/" rel="noopener noreferrer"&gt;DuckDB&lt;/a&gt; (a file-backed analytical SQL database that’s just a &lt;code&gt;.duckdb&lt;/code&gt; file on disk): clean rows go to the main table, failed batches go to quarantine with enough context to debug.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Setup
&lt;/h2&gt;

&lt;p&gt;Here’s what you need:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;requests&amp;gt;&lt;span class="o"&gt;=&lt;/span&gt;2.28.0  
python-dotenv&amp;gt;&lt;span class="o"&gt;=&lt;/span&gt;1.0.0  
duckdb&amp;gt;&lt;span class="o"&gt;=&lt;/span&gt;1.0.0  
pandas&amp;gt;&lt;span class="o"&gt;=&lt;/span&gt;2.0.0  
psutil&amp;gt;&lt;span class="o"&gt;=&lt;/span&gt;5.9.0  
great-expectations&amp;gt;&lt;span class="o"&gt;=&lt;/span&gt;1.0.0
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Note that &lt;strong&gt;Python 3.10+&lt;/strong&gt; is the floor for Great Expectations 1.x with current pandas / duckdb wheels.&lt;/p&gt;
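
&lt;p&gt;Once everything is installed, a quick (optional) way to confirm your interpreter and library versions actually meet those floors:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import sys

import duckdb
import great_expectations as gx
import pandas as pd

# Should report Python 3.10+ and versions at or above the pins in requirements.txt
print("python:", ".".join(map(str, sys.version_info[:3])))
print("great-expectations:", gx.__version__)
print("duckdb:", duckdb.__version__)
print("pandas:", pd.__version__)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;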

&lt;p&gt;The important ones:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://pypi.org/project/great-expectations/" rel="noopener noreferrer"&gt;great-expectations&lt;/a&gt; — the quality gates: expectations on each batch, clear pass/fail and reports. All we need is GX Core.&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://pypi.org/project/duckdb/" rel="noopener noreferrer"&gt;duckdb&lt;/a&gt; — Doesn't need a server to run.&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://pypi.org/project/requests/" rel="noopener noreferrer"&gt;requests&lt;/a&gt; —For HTTP to Bright Data’s &lt;a href="https://get.brightdata.com/lp-scraping-browser-acf1964?utm_content=how_to_validate_any_api_response_with_great_expectations_gx" rel="noopener noreferrer"&gt;POST /request&lt;/a&gt; endpoint (your zone must be a &lt;strong&gt;SERP&lt;/strong&gt; zone).&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Before running &lt;code&gt;pip install -r requirements.txt&lt;/code&gt;, create a .env file with:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;BRIGHT_DATA_API_KEY=your_api_key  
BRIGHT_DATA_ZONE=serp # or your SERP zone name from the Bright Data dashboard  
BRIGHT_DATA_COUNTRY=us # optional
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The Bright Data client reads these when you run the pipeline. You need a SERP-capable zone — without one, the responses from &lt;code&gt;POST /request&lt;/code&gt; won’t match what this code expects. Proxy rotation and unblocking stay on Bright Data’s side.&lt;/p&gt;

&lt;h2&gt;
  
  
  How Does the Pipeline Actually Work?
&lt;/h2&gt;

&lt;p&gt;This diagram should make it clear.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7jvlwv7rp2ibk6jmqhw2.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7jvlwv7rp2ibk6jmqhw2.png" width="640" height="777"&gt;&lt;/a&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;bright_data.py &lt;span class="c"&gt;# Bright Data API client  &lt;/span&gt;
serp_expectations.py &lt;span class="c"&gt;# Great Expectations validation suite  &lt;/span&gt;
duckdb_store.py &lt;span class="c"&gt;# DuckDB schema + insert/quarantine logic  &lt;/span&gt;
ingest.py &lt;span class="c"&gt;# Pipeline: fetch → validate → store or quarantine&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The flow is straightforward: one query equals one batch. For each batch, we check things in a strict order before anything touches the database.&lt;/p&gt;
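
&lt;p&gt;To make that order concrete, here’s a rough sketch of the per-batch logic. It leans on the pieces we build in the next steps (the client, the store, and the GX suite in &lt;code&gt;serp_expectations.py&lt;/code&gt;); &lt;code&gt;validate_serp_batch&lt;/code&gt; is a placeholder name for that suite, and the real &lt;code&gt;ingest.py&lt;/code&gt; fleshes this out:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Rough sketch of the per-batch gate order; validate_serp_batch is a stand-in
# for the GX suite (serp_expectations.py) that we wire in later.
from bright_data import BrightDataClient, normalize_serp_payload
from duckdb_store import SerpStore


def ingest_one_query(client: BrightDataClient, store: SerpStore, query: str):
    serp = client.search_with_status(query)

    # Gate 1: the API call itself must succeed
    if serp.status_code != 200:
        store.insert_quarantine(query, "api_error", 0, serp.data, serp.status_code)
        return

    payload = normalize_serp_payload(serp.data)
    organic = payload.get("organic", []) if isinstance(payload, dict) else []

    # Gate 2: an empty organic list is not a usable batch
    if not organic:
        store.insert_quarantine(query, "organic_empty", 0, payload, serp.status_code)
        return

    # Gate 3: the GX expectations must pass
    vdf, gx_result = validate_serp_batch(organic)  # returns (validation DataFrame, result dict)
    if not gx_result["success"]:
        store.insert_quarantine(
            query, "validation_failed", len(organic),
            {"serp": payload, "gx_validation": gx_result}, serp.status_code,
        )
        return

    # All gates cleared: clean rows go to the main table
    store.insert_batch(vdf, query)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;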

&lt;h2&gt;
  
  
  Step 1: Fetching Structured SERP Data with Bright Data
&lt;/h2&gt;

&lt;p&gt;Our API client will just be a thin wrapper around Bright Data’s &lt;code&gt;POST /request&lt;/code&gt; endpoint.&lt;/p&gt;

&lt;h3&gt;
  
  
  bright_data.py
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt;  
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;os&lt;/span&gt;  
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;dataclasses&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;dataclass&lt;/span&gt;  
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;pathlib&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Path&lt;/span&gt;  

&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;requests&lt;/span&gt;  
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;dotenv&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;load_dotenv&lt;/span&gt;  

&lt;span class="n"&gt;_GX_ROOT&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Path&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;__file__&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;resolve&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="n"&gt;parent&lt;/span&gt;  
&lt;span class="nf"&gt;load_dotenv&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;_GX_ROOT&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;.env&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  


&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;normalize_serp_payload&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;raw&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;  
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Bright Data may wrap JSON in a string body.&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;  
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="nf"&gt;isinstance&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;raw&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nb"&gt;dict&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="ow"&gt;and&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;body&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;raw&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;  
        &lt;span class="n"&gt;body&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;raw&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;body&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;  
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="nf"&gt;isinstance&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;body&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;  
            &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;loads&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;body&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="nf"&gt;isinstance&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;body&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nb"&gt;dict&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;  
            &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;body&lt;/span&gt;  
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;raw&lt;/span&gt;  


&lt;span class="nd"&gt;@dataclass&lt;/span&gt;  
&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;SerpApiResponse&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;  
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Bright Data POST /request: HTTP status plus parsed JSON body (if any).&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;  

    &lt;span class="n"&gt;status_code&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;int&lt;/span&gt;  
    &lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;dict&lt;/span&gt;  


&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;BrightDataClient&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;  
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;SERP API zone client; uses https://api.brightdata.com/request&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;  

    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;__init__&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;api_key&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;zone&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;country&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;  
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;api_key&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;api_key&lt;/span&gt; &lt;span class="ow"&gt;or&lt;/span&gt; &lt;span class="n"&gt;os&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;getenv&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;BRIGHT_DATA_API_KEY&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;zone&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;zone&lt;/span&gt; &lt;span class="ow"&gt;or&lt;/span&gt; &lt;span class="n"&gt;os&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;getenv&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;BRIGHT_DATA_ZONE&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;country&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;country&lt;/span&gt; &lt;span class="ow"&gt;or&lt;/span&gt; &lt;span class="n"&gt;os&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;getenv&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;BRIGHT_DATA_COUNTRY&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;api_endpoint&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;https://api.brightdata.com/request&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;  

        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;api_key&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;  
            &lt;span class="k"&gt;raise&lt;/span&gt; &lt;span class="nc"&gt;ValueError&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;BRIGHT_DATA_API_KEY must be set in the environment or constructor.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;zone&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;  
            &lt;span class="k"&gt;raise&lt;/span&gt; &lt;span class="nc"&gt;ValueError&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;BRIGHT_DATA_ZONE must be set in the environment or constructor.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  

        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;session&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;requests&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Session&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;  
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;session&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;headers&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;update&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;  
            &lt;span class="p"&gt;{&lt;/span&gt;  
                &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Content-Type&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;application/json&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  
                &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Authorization&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Bearer &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;api_key&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  
            &lt;span class="p"&gt;}&lt;/span&gt;  
        &lt;span class="p"&gt;)&lt;/span&gt;  

    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;search&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;num_results&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;language&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;country&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;  
        &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Raises on non-200 or network error.&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;  
        &lt;span class="n"&gt;serp&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;search_with_status&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;num_results&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;language&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;country&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;serp&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;status_code&lt;/span&gt; &lt;span class="o"&gt;!=&lt;/span&gt; &lt;span class="mi"&gt;200&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;  
            &lt;span class="k"&gt;raise&lt;/span&gt; &lt;span class="nc"&gt;RuntimeError&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;  
                &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Search request failed with HTTP &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;serp&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;status_code&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;serp&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="si"&gt;!r}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;[:&lt;/span&gt;&lt;span class="mi"&gt;500&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;  
            &lt;span class="p"&gt;)&lt;/span&gt;  
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nf"&gt;normalize_serp_payload&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;serp&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  

    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;search_with_status&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;num_results&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;language&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;country&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;  
        &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Does not raise on non-200 — for ingest + quarantine routing.&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;  
        &lt;span class="n"&gt;search_url&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;  
            &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;https://www.google.com/search&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;  
            &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;?q=&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;requests&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;utils&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;quote&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;  
            &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;&amp;amp;num=&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;num_results&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;  
            &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;&amp;amp;brd_json=1&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;  
        &lt;span class="p"&gt;)&lt;/span&gt;  
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;language&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;  
            &lt;span class="n"&gt;search_url&lt;/span&gt; &lt;span class="o"&gt;+=&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;&amp;amp;hl=&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;language&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;&amp;amp;lr=lang_&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;language&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;  

        &lt;span class="n"&gt;target_country&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;country&lt;/span&gt; &lt;span class="ow"&gt;or&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;country&lt;/span&gt;  
        &lt;span class="n"&gt;payload&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;zone&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;zone&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;url&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;search_url&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;format&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;json&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;  
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;target_country&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;  
            &lt;span class="n"&gt;payload&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;country&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;target_country&lt;/span&gt;  

        &lt;span class="k"&gt;try&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;  
            &lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;session&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;post&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;api_endpoint&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;payload&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;timeout&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;30&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  
        &lt;span class="k"&gt;except&lt;/span&gt; &lt;span class="n"&gt;requests&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;exceptions&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;RequestException&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;  
            &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nc"&gt;SerpApiResponse&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;status_code&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;_error&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nf"&gt;str&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="p"&gt;)})&lt;/span&gt;  

        &lt;span class="n"&gt;data&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;_parse_response_body&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nc"&gt;SerpApiResponse&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;status_code&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;status_code&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  


&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;_parse_response_body&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;  
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;  
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;{}&lt;/span&gt;  
    &lt;span class="k"&gt;try&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;  
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;json&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;  
    &lt;span class="k"&gt;except&lt;/span&gt; &lt;span class="nb"&gt;ValueError&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;  
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;_non_json_body&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="p"&gt;[:&lt;/span&gt;&lt;span class="mi"&gt;8000&lt;/span&gt;&lt;span class="p"&gt;]}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Some things to note here:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;code&gt;search_with_status&lt;/code&gt; does not raise on non-&lt;code&gt;200&lt;/code&gt;, so ingest can quarantine API failures.&lt;/li&gt;
&lt;li&gt;Transport errors are caught so you get &lt;code&gt;status_code=0&lt;/code&gt; and the same &lt;code&gt;api_error&lt;/code&gt; path as HTTP failures.&lt;/li&gt;
&lt;li&gt;Also, &lt;code&gt;search()&lt;/code&gt; raises on failure for callers that want exceptions.&lt;/li&gt;
&lt;li&gt;And finally, &lt;code&gt;normalize_serp_payload&lt;/code&gt; unwraps responses just in case your API puts JSON under a string &lt;code&gt;body&lt;/code&gt;.&lt;/li&gt;
&lt;/ul&gt;
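
&lt;p&gt;To smoke-test the client on its own (assuming the module above is saved as &lt;code&gt;bright_data.py&lt;/code&gt; and your &lt;code&gt;.env&lt;/code&gt; is in place), something like this works; the exact field names inside each organic result depend on Bright Data’s parsed JSON, so inspect the keys rather than trusting any example verbatim:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;from bright_data import BrightDataClient, normalize_serp_payload

client = BrightDataClient()  # reads BRIGHT_DATA_API_KEY / BRIGHT_DATA_ZONE from .env

serp = client.search_with_status("great expectations data validation", num_results=10)
print("HTTP status:", serp.status_code)

payload = normalize_serp_payload(serp.data)
organic = payload.get("organic", []) if isinstance(payload, dict) else []
print("organic results:", len(organic))

for item in organic[:3]:
    # Field names vary with Bright Data's parser; look at what's actually there
    print(sorted(item.keys()))
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;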

&lt;h2&gt;
  
  
  Step 2: What Happens to Data That Doesn’t Pass?
&lt;/h2&gt;

&lt;p&gt;Every batch in this pipeline has two possible destinations in DuckDB:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;If it clears all our defined quality gates, it goes to &lt;code&gt;serp_results&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;If it doesn't — wrong HTTP status, empty organic list, or a failed GX expectation — it lands in &lt;code&gt;serp_quarantine&lt;/code&gt; instead, with enough context to understand why.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Let's build that store before we wire up the gates.&lt;/p&gt;

&lt;h3&gt;
  
  
  duckdb_store.py
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;hashlib&lt;/span&gt;  
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt;  
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;os&lt;/span&gt;  
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;datetime&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;datetime&lt;/span&gt;  

&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;duckdb&lt;/span&gt;  
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;pandas&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;  
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;psutil&lt;/span&gt;  


&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;SerpStore&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;  
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;serp_results + serp_quarantine&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;  

    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;__init__&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;db_path&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;memory_limit&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;  
        &lt;span class="n"&gt;parent&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;os&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;path&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;dirname&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;db_path&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;parent&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;  
            &lt;span class="n"&gt;os&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;makedirs&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;parent&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;exist_ok&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  

        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;db_path&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;db_path&lt;/span&gt;  
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;conn&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;duckdb&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;connect&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;db_path&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  

        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;memory_limit&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;  
            &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;conn&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;execute&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;SET memory_limit=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;memory_limit&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;'"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  
        &lt;span class="k"&gt;else&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;  
            &lt;span class="n"&gt;available_memory&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;psutil&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;virtual_memory&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="n"&gt;available&lt;/span&gt;  
            &lt;span class="n"&gt;memory_gb&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;int&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;available_memory&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1024&lt;/span&gt;&lt;span class="o"&gt;**&lt;/span&gt;&lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="mf"&gt;0.8&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  
            &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;conn&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;execute&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;SET memory_limit=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;memory_gb&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;GB&lt;/span&gt;&lt;span class="sh"&gt;'"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  

        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;_create_schema&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;  

    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;_create_schema&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;  
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;conn&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;execute&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;  
            CREATE TABLE IF NOT EXISTS serp_results (  
                id BIGINT PRIMARY KEY,  
                query TEXT NOT NULL,  
                timestamp TIMESTAMP NOT NULL,  
                result_position INTEGER NOT NULL,  
                title TEXT,  
                url TEXT,  
                snippet TEXT,  
                domain TEXT,  
                rank INTEGER,  
                previous_rank INTEGER,  
                rank_delta INTEGER  
            )  
        &lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;conn&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;execute&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;CREATE INDEX IF NOT EXISTS idx_query ON serp_results(query)&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;conn&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;execute&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;CREATE INDEX IF NOT EXISTS idx_domain ON serp_results(domain)&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  

        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;conn&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;execute&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;  
            CREATE TABLE IF NOT EXISTS serp_quarantine (  
                query TEXT,  
                timestamp TIMESTAMP,  
                reason TEXT,  
                organic_count INTEGER,  
                payload_hash TEXT,  
                http_status INTEGER,  
                raw_json JSON  
            )  
        &lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  

    &lt;span class="nd"&gt;@staticmethod&lt;/span&gt;  
    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;_payload_hash&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;payload&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;  
        &lt;span class="n"&gt;canonical&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;dumps&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;payload&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;sort_keys&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;default&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;hashlib&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;sha256&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;canonical&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;encode&lt;/span&gt;&lt;span class="p"&gt;()).&lt;/span&gt;&lt;span class="nf"&gt;hexdigest&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;  

    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;insert_quarantine&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  
        &lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  
        &lt;span class="n"&gt;reason&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  
        &lt;span class="n"&gt;organic_count&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  
        &lt;span class="n"&gt;raw_json&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  
        &lt;span class="n"&gt;http_status&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  
        &lt;span class="n"&gt;timestamp&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;    &lt;span class="p"&gt;):&lt;/span&gt;  
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;timestamp&lt;/span&gt; &lt;span class="ow"&gt;is&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;  
            &lt;span class="n"&gt;timestamp&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;datetime&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;now&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;  
        &lt;span class="n"&gt;payload_hash&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;_payload_hash&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;raw_json&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;conn&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;execute&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;  
            &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;  
            INSERT INTO serp_quarantine  
                (query, timestamp, reason, organic_count, payload_hash, http_status, raw_json)  
            VALUES (?, ?, ?, ?, ?, ?, ?)  
            &lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  
            &lt;span class="p"&gt;[&lt;/span&gt; 
                &lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  
                &lt;span class="n"&gt;timestamp&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  
                &lt;span class="n"&gt;reason&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  
                &lt;span class="n"&gt;organic_count&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  
                &lt;span class="n"&gt;payload_hash&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  
                &lt;span class="n"&gt;http_status&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  
                &lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;dumps&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;raw_json&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;default&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;  
            &lt;span class="p"&gt;],&lt;/span&gt;  
        &lt;span class="p"&gt;)&lt;/span&gt;  

    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;insert_batch&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;vdf&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;timestamp&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;  
        &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Insert from the GX validation frame (output of organic_to_validation_df).  

        Store what we validated: normalized url, derived domain, snippet, and positional rank —  
        not a second parse from raw organic dicts (avoids drift vs Great Expectations).  
        &lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;  
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;timestamp&lt;/span&gt; &lt;span class="ow"&gt;is&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;  
            &lt;span class="n"&gt;timestamp&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;datetime&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;now&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;  
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;vdf&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;empty&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;  
            &lt;span class="k"&gt;return&lt;/span&gt;  

        &lt;span class="n"&gt;max_id_result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;conn&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;execute&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;SELECT COALESCE(MAX(id), 0) FROM serp_results&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;fetchone&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;  
        &lt;span class="n"&gt;next_id&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;max_id_result&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;max_id_result&lt;/span&gt; &lt;span class="k"&gt;else&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;  

        &lt;span class="n"&gt;rows&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;  
        &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;idx&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;_&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;row&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nf"&gt;enumerate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;vdf&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;iterrows&lt;/span&gt;&lt;span class="p"&gt;()):&lt;/span&gt;  
            &lt;span class="n"&gt;title&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;""&lt;/span&gt; &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;isna&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;row&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;title&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt; &lt;span class="k"&gt;else&lt;/span&gt; &lt;span class="nf"&gt;str&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;row&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;title&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;  
            &lt;span class="n"&gt;url&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;""&lt;/span&gt; &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;isna&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;row&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;url&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt; &lt;span class="k"&gt;else&lt;/span&gt; &lt;span class="nf"&gt;str&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;row&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;url&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;  
            &lt;span class="n"&gt;snippet&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;""&lt;/span&gt; &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;isna&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;row&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;snippet&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt; &lt;span class="k"&gt;else&lt;/span&gt; &lt;span class="nf"&gt;str&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;row&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;snippet&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;  
            &lt;span class="n"&gt;domain&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;""&lt;/span&gt; &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;isna&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;row&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;domain&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt; &lt;span class="k"&gt;else&lt;/span&gt; &lt;span class="nf"&gt;str&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;row&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;domain&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;  
            &lt;span class="n"&gt;rank_val&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;int&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;row&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;rank&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;  
            &lt;span class="n"&gt;rows&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;  
                &lt;span class="p"&gt;{&lt;/span&gt;  
                    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;id&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;next_id&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;idx&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  
                    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;query&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  
                    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;timestamp&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;timestamp&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  
                    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;result_position&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;idx&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  
                    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;title&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;title&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  
                    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;url&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;url&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  
                    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;snippet&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;snippet&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  
                    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;domain&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;domain&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  
                    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;rank&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;rank_val&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  
                    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;previous_rank&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  
                    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;rank_delta&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  
                &lt;span class="p"&gt;}&lt;/span&gt;  
            &lt;span class="p"&gt;)&lt;/span&gt;  

        &lt;span class="n"&gt;df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;DataFrame&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;rows&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;conn&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;execute&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;  
            INSERT INTO serp_results (id, query, timestamp, result_position, title, url, snippet, domain, rank, previous_rank, rank_delta)  
            SELECT id, query, timestamp, result_position, title, url, snippet, domain, rank, previous_rank, rank_delta FROM df  
        &lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  

    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;get_row_count&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;  
        &lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;conn&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;execute&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;SELECT COUNT(*) FROM serp_results&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;fetchone&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;  
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="k"&gt;else&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;  

    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;close&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;  
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;conn&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;close&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;  

    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;__enter__&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;  
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;  

    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;__exit__&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;args&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;  
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;close&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;What’s happening here?&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The &lt;code&gt;reason&lt;/code&gt; field is an enum-like string: &lt;code&gt;api_error&lt;/code&gt;, &lt;code&gt;organic_empty&lt;/code&gt;, or &lt;code&gt;validation_failed&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;The &lt;code&gt;payload_hash&lt;/code&gt; is a SHA-256 of the canonical JSON — so you can tell if the same bad payload keeps coming back (very useful!)&lt;/li&gt;
&lt;li&gt;Each row stores a JSON blob in &lt;code&gt;raw_json&lt;/code&gt;: for &lt;code&gt;api_error&lt;/code&gt; and &lt;code&gt;organic_empty&lt;/code&gt; it is the normalized SERP dict; for &lt;code&gt;validation_failed&lt;/code&gt; it is &lt;code&gt;{"serp":&lt;/code&gt; [normalized serp here]&lt;code&gt;, "gx_validation":&lt;/code&gt; [GX result dict here]&lt;code&gt;}&lt;/code&gt;, so you still have both the SERP payload &lt;em&gt;and&lt;/em&gt; the failing expectation details.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Every bad batch lands in &lt;code&gt;serp_quarantine&lt;/code&gt; with a paper trail instead of silently disappearing or, worse, silently making it into &lt;code&gt;serp_results&lt;/code&gt;. When a batch &lt;strong&gt;passes&lt;/strong&gt; all gates, &lt;code&gt;insert_batch&lt;/code&gt; writes one row per organic result from the validated DataFrame — same normalized fields GX validated. &lt;code&gt;previous_rank&lt;/code&gt; / &lt;code&gt;rank_delta&lt;/code&gt; are reserved for later rank-tracking; they stay null on first ingest.&lt;/p&gt;
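&lt;p&gt;That &lt;code&gt;payload_hash&lt;/code&gt; is cheap to reproduce on your side, too. Here's a minimal sketch using only &lt;code&gt;json&lt;/code&gt; and &lt;code&gt;hashlib&lt;/code&gt; (the &lt;code&gt;canonical_payload_hash&lt;/code&gt; name is just for illustration; it's not a function from the pipeline):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import hashlib
import json

def canonical_payload_hash(payload):
    # Hypothetical helper: serialize with sorted keys so the same dict always
    # produces the same bytes, then hash those bytes with SHA-256.
    canonical = json.dumps(payload, sort_keys=True, separators=(",", ":"), default=str)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()

# Same bad payload, same hash: repeat offenders are easy to spot in quarantine.
print(canonical_payload_hash({"organic": [], "general": {"query": "python duckdb"}}))
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;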

&lt;h2&gt;
  
  
  Step 3: The Gate Order — Why Sequence Matters
&lt;/h2&gt;

&lt;p&gt;Bright Data's parsed SERP JSON path (request URL includes &lt;code&gt;brd_json=1&lt;/code&gt; or, in their docs, &lt;code&gt;brd_json=json&lt;/code&gt;) returns the same kind of structured object you see in their examples and in real dumps — &lt;code&gt;organic&lt;/code&gt;, &lt;code&gt;general&lt;/code&gt;, &lt;code&gt;input&lt;/code&gt;, and so on — not a page of HTML you parse yourself.&lt;/p&gt;
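&lt;p&gt;For orientation, here's a deliberately trimmed sketch of that parsed shape. The values are invented, and only the keys this pipeline actually reads are shown:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Illustrative shape only; real payloads carry many more keys and richer metadata.
serp_payload = {
    "general": {},  # engine/request metadata lives here
    "input": {},    # echo of what was asked for
    "organic": [
        {"rank": 1, "link": "https://duckdb.org/", "title": "DuckDB", "description": "In-process analytical SQL database"},
        {"rank": 2, "link": "https://pypi.org/project/duckdb/", "title": "duckdb - PyPI", "description": "DuckDB Python client"},
    ],
}

# An empty (or missing) "organic" list is exactly what the second gate below catches.
print(len(serp_payload.get("organic") or []))
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;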

&lt;p&gt;So when Google &lt;em&gt;can't&lt;/em&gt; be reached or nothing usable is extracted, that tends to show up as a &lt;strong&gt;non-200&lt;/strong&gt; from &lt;code&gt;POST /request&lt;/code&gt;, a network error (which we treat as &lt;code&gt;api_error&lt;/code&gt;), or an empty &lt;code&gt;organic&lt;/code&gt; array — not as a captcha HTML document mistaken for a normal organic list. That means the gates to implement are exactly three:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;non-&lt;code&gt;200&lt;/code&gt;,&lt;/li&gt;
&lt;li&gt;empty &lt;code&gt;organic&lt;/code&gt;,&lt;/li&gt;
&lt;li&gt;then GX on whatever is left.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  ingest.py
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt;  
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;datetime&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;datetime&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;timezone&lt;/span&gt;  
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;pathlib&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Path&lt;/span&gt;  

&lt;span class="n"&gt;_ROOT&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Path&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;__file__&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;resolve&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="n"&gt;parent&lt;/span&gt;  

&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;bright_data&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;BrightDataClient&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;normalize_serp_payload&lt;/span&gt;  
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;duckdb_store&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;SerpStore&lt;/span&gt;  
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;serp_expectations&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;organic_to_validation_df&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;validate_organic_batch&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;_write_validation_report&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;report_entries&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;  
    &lt;span class="n"&gt;reports_dir&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;_ROOT&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;data&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;reports&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;  
    &lt;span class="n"&gt;reports_dir&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;mkdir&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;parents&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;exist_ok&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  
    &lt;span class="n"&gt;ts&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;datetime&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;now&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;timezone&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;utc&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;strftime&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;%Y%m%dT%H%M%SZ&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  
    &lt;span class="n"&gt;path&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;reports_dir&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;validation_&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;ts&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;.json&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;  
    &lt;span class="n"&gt;doc&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;  
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;generated_at_utc&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;ts&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;batches&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;report_entries&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  
    &lt;span class="p"&gt;}&lt;/span&gt;  
    &lt;span class="n"&gt;path&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;write_text&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;dumps&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;doc&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;indent&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;default&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="n"&gt;encoding&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;utf-8&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;path&lt;/span&gt;  


&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;ingest_live&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;    &lt;span class="n"&gt;queries&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  
    &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  
    &lt;span class="n"&gt;db_path&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  
    &lt;span class="n"&gt;num_results&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  
    &lt;span class="n"&gt;memory_limit&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;2GB&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  
    &lt;span class="n"&gt;serp_json_out&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;,):&lt;/span&gt;  
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;  
    One batch = one query → one SERP response.  
    Order: api_error → organic_empty → GX validate → insert or validation_failed quarantine.  
    &lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;  
    &lt;span class="n"&gt;path&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;db_path&lt;/span&gt; &lt;span class="ow"&gt;or&lt;/span&gt; &lt;span class="nf"&gt;str&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;_ROOT&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;data&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;serp_gx.duckdb&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  
    &lt;span class="n"&gt;client&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;BrightDataClient&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;  
    &lt;span class="n"&gt;total&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;  
    &lt;span class="n"&gt;batches_total&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;  
    &lt;span class="n"&gt;empty_batches&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;  
    &lt;span class="n"&gt;api_error_batches&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;  
    &lt;span class="n"&gt;validation_failed_batches&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;  
    &lt;span class="n"&gt;report_entries&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;  
    &lt;span class="n"&gt;serp_dump_batches&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;  

    &lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="nc"&gt;SerpStore&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;path&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;memory_limit&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;memory_limit&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;store&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;  
        &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;q&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;queries&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;  
            &lt;span class="n"&gt;batches_total&lt;/span&gt; &lt;span class="o"&gt;+=&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;  
            &lt;span class="n"&gt;serp&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;search_with_status&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;q&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;num_results&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;num_results&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  
            &lt;span class="n"&gt;raw&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;normalize_serp_payload&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;serp&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  
            &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;serp_json_out&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;  
                &lt;span class="n"&gt;serp_dump_batches&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;  
                    &lt;span class="p"&gt;{&lt;/span&gt;  
                        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;query&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;q&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  
                        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;http_status&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;serp&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;status_code&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  
                        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;normalized_serp&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;raw&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  
                    &lt;span class="p"&gt;}&lt;/span&gt;  
                &lt;span class="p"&gt;)&lt;/span&gt;  
            &lt;span class="n"&gt;organic&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;raw&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;organic&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="ow"&gt;or&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;  
            &lt;span class="n"&gt;organic_count&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;organic&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  

            &lt;span class="c1"&gt;# Optional (not implemented): best-effort catch-all for an explicit blocked API response  
&lt;/span&gt;            &lt;span class="c1"&gt;# (e.g. substring-match json.dumps(raw) for "captcha" / "unusual traffic") — just in case.  
&lt;/span&gt;            &lt;span class="c1"&gt;# Not needed here, only include if the API you're using can error out like that  
&lt;/span&gt;
            &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;serp&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;status_code&lt;/span&gt; &lt;span class="o"&gt;!=&lt;/span&gt; &lt;span class="mi"&gt;200&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;  
                &lt;span class="n"&gt;api_error_batches&lt;/span&gt; &lt;span class="o"&gt;+=&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;  
                &lt;span class="n"&gt;store&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;insert_quarantine&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;q&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;api_error&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;organic_count&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;raw&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;serp&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;status_code&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  
                &lt;span class="n"&gt;report_entries&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;  
                    &lt;span class="p"&gt;{&lt;/span&gt;  
                        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;query&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;q&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  
                        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;outcome&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;api_error&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  
                        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;http_status&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;serp&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;status_code&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  
                        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;organic_count&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;organic_count&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  
                        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;gx&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  
                    &lt;span class="p"&gt;}&lt;/span&gt;  
                &lt;span class="p"&gt;)&lt;/span&gt;  
                &lt;span class="k"&gt;continue&lt;/span&gt;  
            &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="n"&gt;organic&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;  
                &lt;span class="n"&gt;empty_batches&lt;/span&gt; &lt;span class="o"&gt;+=&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;  
                &lt;span class="n"&gt;store&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;insert_quarantine&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;q&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;organic_empty&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;raw&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;serp&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;status_code&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  
                &lt;span class="n"&gt;report_entries&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;  
                    &lt;span class="p"&gt;{&lt;/span&gt;  
                        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;query&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;q&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  
                        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;outcome&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;organic_empty&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  
                        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;http_status&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;serp&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;status_code&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  
                        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;organic_count&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  
                        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;gx&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  
                    &lt;span class="p"&gt;}&lt;/span&gt;  
                &lt;span class="p"&gt;)&lt;/span&gt;  
                &lt;span class="k"&gt;continue&lt;/span&gt;  

            &lt;span class="n"&gt;vdf&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;organic_to_validation_df&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;organic&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  
            &lt;span class="n"&gt;gx_ok&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;gx_payload&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;validate_organic_batch&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;vdf&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;num_results&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;num_results&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  
            &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="n"&gt;gx_ok&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;  
                &lt;span class="n"&gt;validation_failed_batches&lt;/span&gt; &lt;span class="o"&gt;+=&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;  
                &lt;span class="n"&gt;store&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;insert_quarantine&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;  
                    &lt;span class="n"&gt;q&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  
                    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;validation_failed&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  
                    &lt;span class="n"&gt;organic_count&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  
                    &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;serp&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;raw&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;gx_validation&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;gx_payload&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;  
                    &lt;span class="n"&gt;serp&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;status_code&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  
                &lt;span class="p"&gt;)&lt;/span&gt;  
                &lt;span class="n"&gt;report_entries&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;  
                    &lt;span class="p"&gt;{&lt;/span&gt;  
                        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;query&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;q&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  
                        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;outcome&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;validation_failed&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  
                        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;http_status&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;serp&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;status_code&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  
                        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;organic_count&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;organic_count&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  
                        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;gx&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;gx_payload&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  
                    &lt;span class="p"&gt;}&lt;/span&gt;  
                &lt;span class="p"&gt;)&lt;/span&gt;  
                &lt;span class="k"&gt;continue&lt;/span&gt;  

            &lt;span class="c1"&gt;# Persist the validated DataFrame so DuckDB gets the same normalized url/domain/rank GX checked.  
&lt;/span&gt;            &lt;span class="n"&gt;store&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;insert_batch&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;vdf&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;q&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  
            &lt;span class="n"&gt;total&lt;/span&gt; &lt;span class="o"&gt;+=&lt;/span&gt; &lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;vdf&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  
            &lt;span class="n"&gt;report_entries&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;  
                &lt;span class="p"&gt;{&lt;/span&gt;  
                    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;query&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;q&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  
                    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;outcome&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;inserted&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  
                    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;http_status&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;serp&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;status_code&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  
                    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;organic_count&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;organic_count&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  
                    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;gx&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;gx_payload&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  
                &lt;span class="p"&gt;}&lt;/span&gt;  
            &lt;span class="p"&gt;)&lt;/span&gt;  

        &lt;span class="n"&gt;row_count&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;store&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get_row_count&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;  

    &lt;span class="n"&gt;report_path&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;_write_validation_report&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;report_entries&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  

    &lt;span class="n"&gt;stats&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;  
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;batches_total&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;batches_total&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;empty_batches&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;empty_batches&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;api_error_batches&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;api_error_batches&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;validation_failed_batches&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;validation_failed_batches&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;empty_rate&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;empty_batches&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="n"&gt;batches_total&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;batches_total&lt;/span&gt; &lt;span class="k"&gt;else&lt;/span&gt; &lt;span class="mf"&gt;0.0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;api_error_rate&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;api_error_batches&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="n"&gt;batches_total&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;batches_total&lt;/span&gt; &lt;span class="k"&gt;else&lt;/span&gt; &lt;span class="mf"&gt;0.0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;validation_failed_rate&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;validation_failed_batches&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="n"&gt;batches_total&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;batches_total&lt;/span&gt;  
        &lt;span class="k"&gt;else&lt;/span&gt; &lt;span class="mf"&gt;0.0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;rows_ingested&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;total&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;serp_results_row_count&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;row_count&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;db_path&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;path&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;validation_report_path&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nf"&gt;str&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;report_path&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;  
    &lt;span class="p"&gt;}&lt;/span&gt;  
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;serp_json_out&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;  
        &lt;span class="n"&gt;out_path&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Path&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;serp_json_out&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  
        &lt;span class="n"&gt;out_path&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;parent&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;mkdir&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;parents&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;exist_ok&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  
        &lt;span class="n"&gt;ts&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;datetime&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;now&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;timezone&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;utc&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;strftime&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;%Y%m%dT%H%M%SZ&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  
        &lt;span class="n"&gt;doc&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;  
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;generated_at_utc&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;ts&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;num_results&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;num_results&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;batches&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;serp_dump_batches&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  
        &lt;span class="p"&gt;}&lt;/span&gt;  
        &lt;span class="n"&gt;out_path&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;write_text&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;dumps&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;doc&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;indent&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;default&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="n"&gt;encoding&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;utf-8&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  
        &lt;span class="n"&gt;stats&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;serp_json_path&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;str&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;out_path&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;resolve&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;  

    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;stats&lt;/span&gt;  


&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;main&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;  
    &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;argparse&lt;/span&gt;  

    &lt;span class="n"&gt;p&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;argparse&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;ArgumentParser&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;  
        &lt;span class="n"&gt;description&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;gx: Bright Data SERP → Great Expectations → DuckDB (fail fast on GX failure)&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  
    &lt;span class="p"&gt;)&lt;/span&gt;  
    &lt;span class="n"&gt;p&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;add_argument&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;  
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;queries&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  
        &lt;span class="n"&gt;nargs&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;*&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  
        &lt;span class="n"&gt;default&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt; 
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;inference engineering&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;python duckdb&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Great Expectations data validation&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;machine learning&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;opentelemetry how to&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  
        &lt;span class="p"&gt;],&lt;/span&gt;  
        &lt;span class="n"&gt;help&lt;/span&gt;\&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Search queries (default: five example queries if none given)&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  
    &lt;span class="p"&gt;)&lt;/span&gt;  
    &lt;span class="n"&gt;p&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;add_argument&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;--num-results&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nb"&gt;type&lt;/span&gt;\&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nb"&gt;int&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;default&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  
    &lt;span class="n"&gt;p&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;add_argument&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;--db&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nb"&gt;type&lt;/span&gt;\&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;default&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;help&lt;/span&gt;\&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;DuckDB file path&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  
    &lt;span class="n"&gt;p&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;add_argument&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;  
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;--save-serp-json&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  
        &lt;span class="nb"&gt;type&lt;/span&gt;\&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  
        &lt;span class="n"&gt;default&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  
        &lt;span class="n"&gt;metavar&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;PATH&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  
        &lt;span class="n"&gt;help&lt;/span&gt;\&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Write normalized SERP (+ query, http_status) per batch to this JSON file.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  
    &lt;span class="p"&gt;)&lt;/span&gt;  
    &lt;span class="n"&gt;args&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;p&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;parse_args&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;  
    &lt;span class="n"&gt;out&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;ingest_live&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;  
        &lt;span class="n"&gt;args&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;queries&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  
        &lt;span class="n"&gt;db_path&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;args&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;db&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  
        &lt;span class="n"&gt;num_results&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;args&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;num_results&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  
        &lt;span class="n"&gt;serp_json_out&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;args&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;save_serp_json&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  
    &lt;span class="p"&gt;)&lt;/span&gt;  
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;dumps&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;out&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;indent&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;  


&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;__name__&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;__main__&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;  
    &lt;span class="nf"&gt;main&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The main loop enforces that order we talked about: &lt;code&gt;api_error&lt;/code&gt; and &lt;code&gt;organic_empty&lt;/code&gt; are cheap pre-checks that catch the most common failure modes without spinning up GX at all.&lt;/p&gt;

&lt;p&gt;In fact, GX only runs when there's actual data to validate. &lt;code&gt;store.insert_batch(vdf, q)&lt;/code&gt; passes the &lt;em&gt;validated DataFrame&lt;/em&gt;, not the original raw &lt;code&gt;organic&lt;/code&gt; list — what goes into DuckDB is exactly what GX checked, with the same URL normalization applied.&lt;/p&gt;
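&lt;p&gt;If you'd rather drive this from another script instead of the CLI, the same entry point works programmatically. A minimal sketch, assuming the modules above sit next to it on disk:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;from ingest import ingest_live

# Push two queries through the gates and print the summary stats.
stats = ingest_live(
    ["python duckdb", "Great Expectations data validation"],
    num_results=10,
    serp_json_out="data/serp_dump.json",  # optional raw-SERP dump, same as --save-serp-json
)
print(stats["rows_ingested"], "rows inserted,", stats["validation_failed_batches"], "batches quarantined by GX")
print("Report:", stats["validation_report_path"])
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;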

&lt;h2&gt;
  
  
  Step 4: What Quality Gates Should You Put on SERP Data?
&lt;/h2&gt;

&lt;p&gt;To reiterate, when we say “quality gates”, we mean explicit Great Expectations rules on the DataFrame we validate: each “expectation” is simply one condition the batch must satisfy before you trust it for analytics.&lt;/p&gt;

&lt;p&gt;For this pipeline we normalize the &lt;code&gt;organic&lt;/code&gt; list in our API response into a small schema — &lt;code&gt;url&lt;/code&gt;, &lt;code&gt;title&lt;/code&gt;, &lt;code&gt;snippet&lt;/code&gt;, &lt;code&gt;domain&lt;/code&gt;, and &lt;code&gt;rank&lt;/code&gt; — and run the suite on that frame only.&lt;/p&gt;
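&lt;p&gt;So the frame GX sees is deliberately boring. Something like this (rows invented purely for illustration):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import pandas as pd

# The exact five columns the suite below validates; the values here are made up.
vdf = pd.DataFrame(
    [
        {"url": "https://duckdb.org/", "title": "DuckDB", "snippet": "In-process SQL OLAP database", "domain": "duckdb.org", "rank": 1},
        {"url": "https://pypi.org/project/duckdb/", "title": "duckdb - PyPI", "snippet": "DuckDB Python client", "domain": "pypi.org", "rank": 2},
    ]
)
print(vdf.dtypes)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;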

&lt;h3&gt;
  
  
  serp_expectations.py
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;uuid&lt;/span&gt;  
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;urllib.parse&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;urlparse&lt;/span&gt;  

&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;great_expectations&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;gx&lt;/span&gt;  
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;pandas&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;  
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;great_expectations.data_context.types.base&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;ProgressBarsConfig&lt;/span&gt;  


&lt;span class="c1"&gt;# helpers   
&lt;/span&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;_extract_domain&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;url&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;  
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="n"&gt;url&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;  
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="sh"&gt;""&lt;/span&gt;  
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nf"&gt;urlparse&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;url&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="n"&gt;netloc&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;replace&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;www.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;""&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  


&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;_coerce_rank&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;raw_rank&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;fallback&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;  
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;raw_rank&lt;/span&gt; &lt;span class="ow"&gt;is&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;  
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;fallback&lt;/span&gt;  
    &lt;span class="k"&gt;try&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;  
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nf"&gt;int&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;float&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;raw_rank&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;  
    &lt;span class="nf"&gt;except &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;TypeError&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nb"&gt;ValueError&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;  
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;fallback&lt;/span&gt;  


&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;_normalize_url&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;url&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;  
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="n"&gt;url&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;  
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;url&lt;/span&gt;  
    &lt;span class="k"&gt;try&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;  
        &lt;span class="n"&gt;parsed&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;urlparse&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;url&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  
        &lt;span class="n"&gt;normalised&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;parsed&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;_replace&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;  
            &lt;span class="n"&gt;scheme&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;parsed&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;scheme&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;lower&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt;  
            &lt;span class="n"&gt;fragment&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;""&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  
        &lt;span class="p"&gt;)&lt;/span&gt;  
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;normalised&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;geturl&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;  
    &lt;span class="k"&gt;except&lt;/span&gt; &lt;span class="nb"&gt;Exception&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;  
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;url&lt;/span&gt;  

&lt;span class="c1"&gt;# before anything, normalize the data  
&lt;/span&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;organic_to_validation_df&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;organic&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;  
    &lt;span class="n"&gt;rows&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;  
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;r&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nf"&gt;enumerate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;organic&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;  
        &lt;span class="n"&gt;raw_url&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;r&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;url&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="ow"&gt;or&lt;/span&gt; &lt;span class="n"&gt;r&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;link&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="ow"&gt;or&lt;/span&gt; &lt;span class="sh"&gt;""&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;strip&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;  
        &lt;span class="n"&gt;url&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;_normalize_url&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;raw_url&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  
        &lt;span class="n"&gt;title&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;r&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;title&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="ow"&gt;or&lt;/span&gt; &lt;span class="sh"&gt;""&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;strip&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;  
        &lt;span class="n"&gt;snippet&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;r&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;snippet&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="ow"&gt;or&lt;/span&gt; &lt;span class="n"&gt;r&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;description&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="ow"&gt;or&lt;/span&gt; &lt;span class="sh"&gt;""&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;strip&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;  
        &lt;span class="n"&gt;domain&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;_extract_domain&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;url&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  
        &lt;span class="n"&gt;raw_rank&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;r&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;rank&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  
        &lt;span class="n"&gt;rank&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;_coerce_rank&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;raw_rank&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;fallback&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  
        &lt;span class="n"&gt;rows&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;  
            &lt;span class="p"&gt;{&lt;/span&gt;  
                &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;url&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;url&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  
                &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;title&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;title&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  
                &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;snippet&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;snippet&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  
                &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;domain&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;domain&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  
                &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;rank&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;rank&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  
            &lt;span class="p"&gt;}&lt;/span&gt;  
        &lt;span class="p"&gt;)&lt;/span&gt;  
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;DataFrame&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;rows&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  

&lt;span class="c1"&gt;# now add the actual expectations  
&lt;/span&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;build_serp_expectation_suite&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;num_results&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;  
    &lt;span class="n"&gt;_meta&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;notes&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;SERP organic batch gates for Bright Data brd_json=1 payloads.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;  

    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;gx&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;ExpectationSuite&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;  
        &lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;serp_organic_batch&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  
        &lt;span class="n"&gt;expectations&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt; 
            &lt;span class="n"&gt;gx&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;expectations&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;ExpectColumnValuesToNotBeNull&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;  
                &lt;span class="n"&gt;column&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;url&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  
                &lt;span class="n"&gt;meta&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;_meta&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  
            &lt;span class="p"&gt;),&lt;/span&gt;  
            &lt;span class="n"&gt;gx&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;expectations&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;ExpectColumnValuesToNotBeNull&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;  
                &lt;span class="n"&gt;column&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;title&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  
                &lt;span class="n"&gt;meta&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;_meta&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  
            &lt;span class="p"&gt;),&lt;/span&gt;  
            &lt;span class="n"&gt;gx&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;expectations&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;ExpectColumnValuesToMatchRegex&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;  
                &lt;span class="n"&gt;column&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;url&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  
                &lt;span class="n"&gt;regex&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sa"&gt;r&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;^https?://\\S+&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  
                &lt;span class="n"&gt;meta&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;_meta&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  
            &lt;span class="p"&gt;),&lt;/span&gt;  
            &lt;span class="n"&gt;gx&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;expectations&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;ExpectColumnValuesToBeUnique&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;  
                &lt;span class="n"&gt;column&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;url&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  
                &lt;span class="n"&gt;meta&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;_meta&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  
            &lt;span class="p"&gt;),&lt;/span&gt;  
            &lt;span class="n"&gt;gx&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;expectations&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;ExpectTableRowCountToBeBetween&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;  
                &lt;span class="n"&gt;min_value&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  
                &lt;span class="n"&gt;max_value&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;num_results&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  
                &lt;span class="n"&gt;meta&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;_meta&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  
            &lt;span class="p"&gt;),&lt;/span&gt;  
            &lt;span class="n"&gt;gx&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;expectations&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;ExpectColumnValueLengthsToBeBetween&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;  
                &lt;span class="n"&gt;column&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;title&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  
                &lt;span class="n"&gt;min_value&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  
                &lt;span class="n"&gt;max_value&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;500&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  
                &lt;span class="n"&gt;meta&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;_meta&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  
            &lt;span class="p"&gt;),&lt;/span&gt;  
            &lt;span class="n"&gt;gx&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;expectations&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;ExpectColumnValuesToMatchRegex&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;  
                &lt;span class="n"&gt;column&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;title&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  
                &lt;span class="n"&gt;regex&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sa"&gt;r&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;\\S&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  
                &lt;span class="n"&gt;meta&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;_meta&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  
            &lt;span class="p"&gt;),&lt;/span&gt;  
            &lt;span class="n"&gt;gx&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;expectations&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;ExpectColumnValueLengthsToBeBetween&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;  
                &lt;span class="n"&gt;column&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;domain&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  
                &lt;span class="n"&gt;min_value&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  
                &lt;span class="n"&gt;max_value&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;253&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  
                &lt;span class="n"&gt;meta&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;_meta&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  
            &lt;span class="p"&gt;),&lt;/span&gt;  
            &lt;span class="n"&gt;gx&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;expectations&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;ExpectColumnValuesToBeBetween&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;  
                &lt;span class="n"&gt;column&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;rank&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  
                &lt;span class="n"&gt;min_value&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  
                &lt;span class="n"&gt;max_value&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nf"&gt;float&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;num_results&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;  
                &lt;span class="n"&gt;meta&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;_meta&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  
            &lt;span class="p"&gt;),&lt;/span&gt;  
            &lt;span class="n"&gt;gx&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;expectations&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;ExpectColumnValuesToNotBeNull&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;  
                &lt;span class="n"&gt;column&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;snippet&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  
                &lt;span class="n"&gt;mostly&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;0.8&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  
                &lt;span class="n"&gt;meta&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;_meta&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  
            &lt;span class="p"&gt;),&lt;/span&gt;  
        &lt;span class="p"&gt;],&lt;/span&gt;  
    &lt;span class="p"&gt;)&lt;/span&gt;  

&lt;span class="c1"&gt;# gx in ephemeral mode  
&lt;/span&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;validate_organic_batch&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;num_results&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;  
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;empty&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;  
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="bp"&gt;False&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;  
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;success&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="bp"&gt;False&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;statistics&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;  
                &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;evaluated_expectations&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  
                &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;successful_expectations&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  
                &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;unsuccessful_expectations&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  
            &lt;span class="p"&gt;},&lt;/span&gt;  
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;results&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[],&lt;/span&gt;  
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;exception_message&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;validation_frame_empty&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  
        &lt;span class="p"&gt;}&lt;/span&gt;  

    &lt;span class="n"&gt;context&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;gx&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get_context&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;mode&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;ephemeral&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  
    &lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;variables&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;progress_bars&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;ProgressBarsConfig&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;  
        &lt;span class="n"&gt;globally&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;False&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  
        &lt;span class="n"&gt;metric_calculations&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;False&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  
    &lt;span class="p"&gt;)&lt;/span&gt;  

    &lt;span class="n"&gt;data_source&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;data_sources&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;add_pandas&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;serp_&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;uuid&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;uuid4&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="nb"&gt;hex&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="si"&gt;:&lt;/span&gt;&lt;span class="mi"&gt;12&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  
    &lt;span class="n"&gt;asset&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;data_source&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;add_dataframe_asset&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;organic_batch&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  
    &lt;span class="n"&gt;batch_definition&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;asset&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;add_batch_definition_whole_dataframe&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;whole&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  
    &lt;span class="n"&gt;batch&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;batch_definition&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get_batch&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;batch_parameters&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;dataframe&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;})&lt;/span&gt;  
    &lt;span class="n"&gt;suite&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;build_serp_expectation_suite&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;num_results&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;num_results&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  
    &lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;batch&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;validate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;suite&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  
    &lt;span class="n"&gt;payload&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;to_json_dict&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;  
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nf"&gt;bool&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;success&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="n"&gt;payload&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Let’s walk through the decisions we made here, because they’re not arbitrary.&lt;/p&gt;

&lt;h3&gt;
  
  
  Normalize Before You Validate — Never Inside the Validation Layer
&lt;/h3&gt;

&lt;p&gt;Before anything in GX runs, &lt;code&gt;organic_to_validation_df&lt;/code&gt; normalizes the data:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;_normalize_url&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;url&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;  
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Lowercase scheme, strip fragment.&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;  
    &lt;span class="n"&gt;parsed&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;urlparse&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;url&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;parsed&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;_replace&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;scheme&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;parsed&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;scheme&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;lower&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt; &lt;span class="n"&gt;fragment&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;""&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;geturl&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;  
&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;organic_to_validation_df&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;organic&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;List&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;Dict&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Any&lt;/span&gt;&lt;span class="p"&gt;]])&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;DataFrame&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;  
    &lt;span class="n"&gt;rows&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;  
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;r&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nf"&gt;enumerate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;organic&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;  
        &lt;span class="n"&gt;raw_url&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;r&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;url&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="ow"&gt;or&lt;/span&gt; &lt;span class="n"&gt;r&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;link&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="ow"&gt;or&lt;/span&gt; &lt;span class="sh"&gt;""&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;strip&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;  
        &lt;span class="n"&gt;url&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;_normalize_url&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;raw_url&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  
        &lt;span class="n"&gt;title&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;r&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;title&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="ow"&gt;or&lt;/span&gt; &lt;span class="sh"&gt;""&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;strip&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;  
        &lt;span class="n"&gt;snippet&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;r&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;snippet&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="ow"&gt;or&lt;/span&gt; &lt;span class="n"&gt;r&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;description&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="ow"&gt;or&lt;/span&gt; &lt;span class="sh"&gt;""&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;strip&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;  
        &lt;span class="n"&gt;domain&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;_extract_domain&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;url&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  
        &lt;span class="n"&gt;rank&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;_coerce_rank&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;r&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;rank&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="n"&gt;fallback&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  
        &lt;span class="n"&gt;rows&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;url&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;url&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;title&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;title&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;snippet&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;snippet&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;domain&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;domain&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;rank&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;rank&lt;/span&gt;&lt;span class="p"&gt;})&lt;/span&gt;  
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;DataFrame&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;rows&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;A few things happening here:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Bright Data’s &lt;code&gt;brd_json=1&lt;/code&gt; responses use &lt;code&gt;link&lt;/code&gt; as the key, not &lt;code&gt;url&lt;/code&gt;. We normalize this before GX sees it. You’ve probably written remappings like this plenty of times; most API responses need at least one.&lt;/li&gt;
&lt;li&gt;Fragments are stripped and scheme is lowercased before the uniqueness check — so &lt;code&gt;https://example.com/page#section&lt;/code&gt; and &lt;code&gt;https://example.com/page&lt;/code&gt; don't pass as different URLs.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;_coerce_rank&lt;/code&gt; handles the fact that rank values from APIs can come back as &lt;code&gt;int&lt;/code&gt;, &lt;code&gt;float&lt;/code&gt;, numpy scalars, or even strings. Coerce to a consistent type &lt;em&gt;before&lt;/em&gt; GX, not inside an expectation (see the sketch after this list).&lt;/li&gt;
&lt;/ul&gt;
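
&lt;p&gt;&lt;code&gt;_coerce_rank&lt;/code&gt; and &lt;code&gt;_extract_domain&lt;/code&gt; aren’t shown in full here. If you’re wiring this up yourself, here’s roughly what they can look like (a minimal sketch; your own implementations may differ):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;from urllib.parse import urlparse


def _extract_domain(url: str) -&amp;gt; str:
    """Bare hostname, lowercased, without a leading 'www.'."""
    host = (urlparse(url).netloc or "").lower()
    return host[4:] if host.startswith("www.") else host


def _coerce_rank(value, fallback: int) -&amp;gt; int:
    """Coerce int/float/numpy-scalar/string ranks to a plain int; else use the 1-based position."""
    try:
        return int(float(value))
    except (TypeError, ValueError):
        return fallback
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
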

&lt;h3&gt;
  
  
  Ephemeral mode
&lt;/h3&gt;

&lt;p&gt;Running GX with &lt;a href="https://docs.greatexpectations.io/docs/reference/api/data_context/ephemeraldatacontext_class/" rel="noopener noreferrer"&gt;&lt;code&gt;mode="ephemeral"&lt;/code&gt;&lt;/a&gt; simply means no config files, no persisted context directory, and no &lt;code&gt;great_expectations.yml&lt;/code&gt; — a pure in-process validator you spin up per batch. Progress bars are also turned off in code, since we’re batching many runs. Both choices keep GX lightweight and easy to get started with.&lt;/p&gt;

&lt;h3&gt;
  
  
  Null Checks Must Come Before Regex Gate
&lt;/h3&gt;

&lt;p&gt;&lt;code&gt;ExpectColumnValuesToNotBeNull&lt;/code&gt; on &lt;code&gt;url&lt;/code&gt; and &lt;code&gt;title&lt;/code&gt; must come before the regex and length gates. That's because GX's &lt;code&gt;ExpectColumnValuesToMatchRegex&lt;/code&gt; silently skips null values by default — it only evaluates non-null rows. Without an explicit null check, a null URL would slip through the regex gate undetected.&lt;/p&gt;
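
&lt;p&gt;You can see this for yourself with a tiny, self-contained check (a sketch assuming GX Core 1.x, reusing the ephemeral-context pattern from above). The null URL sails past the regex gate but gets caught by the explicit not-null gate:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import great_expectations as gx
import pandas as pd

# one good row, one row with a null URL
df = pd.DataFrame({"url": ["https://example.com/page", None]})

context = gx.get_context(mode="ephemeral")
source = context.data_sources.add_pandas("nulls_demo")
asset = source.add_dataframe_asset(name="rows")
batch = asset.add_batch_definition_whole_dataframe("whole").get_batch(
    batch_parameters={"dataframe": df}
)

# the regex expectation only evaluates non-null rows, so it passes
print(batch.validate(
    gx.expectations.ExpectColumnValuesToMatchRegex(column="url", regex=r"^https?://\S+")
).success)  # True

# the explicit null gate is what actually catches the bad row
print(batch.validate(
    gx.expectations.ExpectColumnValuesToNotBeNull(column="url")
).success)  # False
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
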

&lt;h3&gt;
  
  
  Why the URL Regex Is Intentionally Loose
&lt;/h3&gt;

&lt;p&gt;&lt;code&gt;^https?://\S+&lt;/code&gt; is deliberately not a strict RFC 3986 URL validator. Overly strict URL validation ends up rejecting legitimately unusual URLs — CDN URLs, tracking URLs, URLs with encoded characters. The goal here is to catch the two actual failure modes: an empty string, and a non-URL value like &lt;code&gt;"N/A"&lt;/code&gt; or a relative path.&lt;/p&gt;
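
&lt;p&gt;With plain &lt;code&gt;re&lt;/code&gt; you can see what the loose pattern does and doesn’t reject (the URLs are just illustrative):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import re

URL_RE = re.compile(r"^https?://\S+")

# unusual-but-legitimate URLs still pass
print(bool(URL_RE.match("https://cdn.example.net/a%20b?utm_source=x")))  # True

# the actual failure modes are rejected
print(bool(URL_RE.match("")))           # False - empty string
print(bool(URL_RE.match("N/A")))        # False - placeholder value
print(bool(URL_RE.match("/relative")))  # False - relative path
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
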

&lt;h3&gt;
  
  
  URL Uniqueness Catches Parse Artifacts
&lt;/h3&gt;

&lt;p&gt;&lt;code&gt;ExpectColumnValuesToBeUnique&lt;/code&gt; on &lt;code&gt;url&lt;/code&gt; catches cases where the same URL appears twice in a single batch. That's a real thing that can happen with pagination artifacts or certain response formats. In a ranking pipeline, a duplicate URL means your rank counts are wrong.&lt;/p&gt;

&lt;h3&gt;
  
  
  Use a Fixed Ceiling for Rank and Row Count, Not a Self-Adjusting One
&lt;/h3&gt;

&lt;p&gt;Notice that &lt;code&gt;max_value=float(num_results)&lt;/code&gt; for the rank gate, and &lt;code&gt;max_value=num_results&lt;/code&gt; for the row count, both use the same value — the &lt;code&gt;num_results&lt;/code&gt; we requested from the SERP API. This is intentional, and it matters.&lt;/p&gt;

&lt;p&gt;A common mistake is to derive the ceiling from the batch itself:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Don't do this  
&lt;/span&gt;&lt;span class="n"&gt;max_rank&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;max&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;num_results&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nf"&gt;int&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;float&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;vdf&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;rank&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="nf"&gt;max&lt;/span&gt;&lt;span class="p"&gt;())))&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Why? Well, if the API returns a result with &lt;code&gt;rank=47&lt;/code&gt; for a 10-result query, &lt;code&gt;max_rank&lt;/code&gt; becomes 47 and the gate passes. You've made the gate self-adjust to whatever the data says, which means the gate can &lt;em&gt;never&lt;/em&gt; actually fail on an out-of-range rank. Using a fixed &lt;code&gt;num_results&lt;/code&gt; means an unexpected rank value will actually trigger a quarantine.&lt;/p&gt;

&lt;h3&gt;
  
  
  Optional Fields Need Soft Gates, Not Hard Ones
&lt;/h3&gt;

&lt;p&gt;&lt;code&gt;snippet&lt;/code&gt; uses &lt;code&gt;mostly=0.8&lt;/code&gt; rather than a hard non-null gate. That's because Google legitimately suppresses snippets for some result types — videos, certain knowledge panel entries, sitelinks. A hard non-null gate on snippet would quarantine perfectly valid batches. The &lt;code&gt;mostly&lt;/code&gt; parameter lets you say "80% of rows must pass this check" — if more than 20% of rows have no snippet, that's a signal the response structure has changed, not normal Google behaviour.&lt;/p&gt;
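
&lt;p&gt;The threshold math is just a ratio over the batch. For a 10-row batch:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# 2 of 10 snippets missing -&amp;gt; 80% non-null, meets mostly=0.8, the gate passes
print(8 / 10 &amp;gt;= 0.8)  # True

# 3 of 10 snippets missing -&amp;gt; 70% non-null, below the threshold, the batch is quarantined
print(7 / 10 &amp;gt;= 0.8)  # False
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
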

&lt;h2&gt;
  
  
  What Does a Failed GX Validation Actually Look Like?
&lt;/h2&gt;

&lt;p&gt;When a batch fails validation, the &lt;code&gt;gx&lt;/code&gt; block in the report tells you exactly which expectation failed and what the unexpected values were. Here's a representative example — two organic rows end up with the &lt;strong&gt;same&lt;/strong&gt; normalized URL (duplicate rows in the response, or two URLs that collapse to one after fragment stripping), so the uniqueness gate fires:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;  
  &lt;/span&gt;&lt;span class="nl"&gt;"query"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"inference engineering"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;  
  &lt;/span&gt;&lt;span class="nl"&gt;"outcome"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"validation_failed"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;  
  &lt;/span&gt;&lt;span class="nl"&gt;"http_status"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;200&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;  
  &lt;/span&gt;&lt;span class="nl"&gt;"organic_count"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;  
  &lt;/span&gt;&lt;span class="nl"&gt;"gx"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;  
    &lt;/span&gt;&lt;span class="nl"&gt;"success"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kc"&gt;false&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;  
    &lt;/span&gt;&lt;span class="nl"&gt;"statistics"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;  
      &lt;/span&gt;&lt;span class="nl"&gt;"evaluated_expectations"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;  
      &lt;/span&gt;&lt;span class="nl"&gt;"successful_expectations"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;9&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;  
      &lt;/span&gt;&lt;span class="nl"&gt;"unsuccessful_expectations"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="w"&gt;  
    &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;  
    &lt;/span&gt;&lt;span class="nl"&gt;"results"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="w"&gt; 
      &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;  
        &lt;/span&gt;&lt;span class="nl"&gt;"success"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kc"&gt;false&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;  
        &lt;/span&gt;&lt;span class="nl"&gt;"expectation_config"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;  
          &lt;/span&gt;&lt;span class="nl"&gt;"type"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"expect_column_values_to_be_unique"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;  
          &lt;/span&gt;&lt;span class="nl"&gt;"kwargs"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"column"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"url"&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;  
        &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;  
        &lt;/span&gt;&lt;span class="nl"&gt;"result"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;  
          &lt;/span&gt;&lt;span class="nl"&gt;"element_count"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;  
          &lt;/span&gt;&lt;span class="nl"&gt;"unexpected_count"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;  
          &lt;/span&gt;&lt;span class="nl"&gt;"unexpected_percent"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mf"&gt;100.0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;  
          &lt;/span&gt;&lt;span class="nl"&gt;"partial_unexpected_list"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="w"&gt; 
            &lt;/span&gt;&lt;span class="s2"&gt;"https://example.com/page?q=1"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;  
            &lt;/span&gt;&lt;span class="s2"&gt;"https://example.com/page?q=1"&lt;/span&gt;&lt;span class="w"&gt;  
          &lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt;  
        &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;  
      &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;  
    &lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt;  
  &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;  
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The &lt;code&gt;http_status&lt;/code&gt; is &lt;code&gt;200&lt;/code&gt;. Without the quality gate, this batch would have been inserted. The downstream join on URL would have silently doubled a result's apparent frequency in your analytics.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Validation Report
&lt;/h2&gt;

&lt;p&gt;After every run, we emit a timestamped JSON report with one entry per query:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;  
  &lt;/span&gt;&lt;span class="nl"&gt;"generated_at_utc"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"20260326T102005Z"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;  
  &lt;/span&gt;&lt;span class="nl"&gt;"batches"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="w"&gt; 
    &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;  
      &lt;/span&gt;&lt;span class="nl"&gt;"query"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"inference engineering"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;  
      &lt;/span&gt;&lt;span class="nl"&gt;"outcome"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"inserted"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;  
      &lt;/span&gt;&lt;span class="nl"&gt;"http_status"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;200&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;  
      &lt;/span&gt;&lt;span class="nl"&gt;"organic_count"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;8&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;  
      &lt;/span&gt;&lt;span class="nl"&gt;"gx"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"success"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kc"&gt;true&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"statistics"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"evaluated_expectations"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;...&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;  
    &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;  
    &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;  
      &lt;/span&gt;&lt;span class="nl"&gt;"query"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"some broken query"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;  
      &lt;/span&gt;&lt;span class="nl"&gt;"outcome"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"organic_empty"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;  
      &lt;/span&gt;&lt;span class="nl"&gt;"http_status"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;200&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;  
      &lt;/span&gt;&lt;span class="nl"&gt;"organic_count"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;  
      &lt;/span&gt;&lt;span class="nl"&gt;"gx"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kc"&gt;null&lt;/span&gt;&lt;span class="w"&gt;  
    &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;  
  &lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt;  
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The possible &lt;code&gt;outcome&lt;/code&gt; values are &lt;code&gt;inserted&lt;/code&gt;, &lt;code&gt;api_error&lt;/code&gt;, &lt;code&gt;organic_empty&lt;/code&gt;, and &lt;code&gt;validation_failed&lt;/code&gt;. Over time, watching the rate of each outcome per query is how you'd catch gradual degradation — a rising &lt;code&gt;organic_empty&lt;/code&gt; rate is a signal before it becomes a crisis.&lt;/p&gt;
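
&lt;p&gt;To watch those rates over time, you can fold the reports into a quick tally. A sketch, assuming the reports are written under &lt;code&gt;./data/reports/&lt;/code&gt; as in the run summary below:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import glob
import json
from collections import Counter

outcomes = Counter()
for path in glob.glob("./data/reports/validation_*.json"):
    with open(path) as f:
        report = json.load(f)
    for batch in report.get("batches", []):
        outcomes[batch.get("outcome", "unknown")] += 1

total = sum(outcomes.values()) or 1
for outcome, count in sorted(outcomes.items()):
    print(f"{outcome}: {count} ({count / total:.1%})")
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
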

&lt;h2&gt;
  
  
  How to Run the Pipeline
&lt;/h2&gt;

&lt;p&gt;Put the four modules in one package or folder on your &lt;code&gt;PYTHONPATH&lt;/code&gt;, install dependencies (&lt;code&gt;pip install -r requirements.txt&lt;/code&gt;), and set Bright Data credentials (&lt;code&gt;BRIGHT_DATA_API_KEY&lt;/code&gt;, &lt;code&gt;BRIGHT_DATA_ZONE&lt;/code&gt;, plus optional &lt;code&gt;BRIGHT_DATA_COUNTRY&lt;/code&gt;) in a &lt;code&gt;.env&lt;/code&gt; file next to the code or in the environment.&lt;/p&gt;
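
&lt;p&gt;The &lt;code&gt;.env&lt;/code&gt; only needs the variables named above, for example (the zone name is a placeholder):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;BRIGHT_DATA_API_KEY=your_api_key
BRIGHT_DATA_ZONE=serp        # your SERP zone name from the Bright Data dashboard
BRIGHT_DATA_COUNTRY=us       # optional
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
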

&lt;p&gt;Our orchestrator uses &lt;code&gt;argparse&lt;/code&gt;. Here are some useful flags worth including, mostly for quality of life (a sketch of the wiring follows the list):&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;--num-results&lt;/code&gt; (default &lt;code&gt;10&lt;/code&gt;),&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;--db&lt;/code&gt; (override the DuckDB file path),&lt;/li&gt;
&lt;li&gt;and optionally, a &lt;code&gt;--save-serp-json PATH&lt;/code&gt; flag to dump normalized SERP payloads; I’ve found this &lt;em&gt;really&lt;/em&gt; helps while tuning your rules.
&lt;/li&gt;
&lt;/ul&gt;
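
&lt;p&gt;Here’s roughly what that wiring can look like in &lt;code&gt;ingest.py&lt;/code&gt; (a minimal sketch; &lt;code&gt;run_pipeline&lt;/code&gt; and the default query list are stand-ins for your own orchestration code):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import argparse

DEFAULT_QUERIES = ["inference engineering", "python asyncio"]  # example defaults


def parse_args():
    parser = argparse.ArgumentParser(description="SERP ingestion with GX quality gates")
    parser.add_argument("queries", nargs="*", default=DEFAULT_QUERIES,
                        help="search queries to ingest (falls back to example queries)")
    parser.add_argument("--num-results", type=int, default=10,
                        help="results per query; also the fixed ceiling for the GX gates")
    parser.add_argument("--db", default="./data/serp_gx.duckdb",
                        help="path to the DuckDB file")
    parser.add_argument("--save-serp-json", metavar="PATH", default=None,
                        help="optionally dump normalized SERP payloads while tuning rules")
    return parser.parse_args()


if __name__ == "__main__":
    args = parse_args()
    # run_pipeline(...) is whatever your orchestrator exposes; hypothetical here:
    # run_pipeline(args.queries, num_results=args.num_results,
    #              db_path=args.db, save_serp_json=args.save_serp_json)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
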

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# One query  &lt;/span&gt;
python ingest.py &lt;span class="s2"&gt;"inference engineering"&lt;/span&gt;  

&lt;span class="c"&gt;# Multiple queries (defaults to five example queries if you pass none)  &lt;/span&gt;
python ingest.py  

&lt;span class="c"&gt;# Example: 20 results, custom DB path  &lt;/span&gt;
python ingest.py &lt;span class="s2"&gt;"python asyncio"&lt;/span&gt; &lt;span class="nt"&gt;--num-results&lt;/span&gt; 20 &lt;span class="nt"&gt;--db&lt;/span&gt; ./data/my_run.duckdb
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;On success, the script prints a small JSON summary — batch counts, rows written, paths to the DuckDB file and the validation report:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;  
  &lt;/span&gt;&lt;span class="nl"&gt;"batches_total"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;  
  &lt;/span&gt;&lt;span class="nl"&gt;"empty_batches"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;  
  &lt;/span&gt;&lt;span class="nl"&gt;"api_error_batches"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;  
  &lt;/span&gt;&lt;span class="nl"&gt;"validation_failed_batches"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;  
  &lt;/span&gt;&lt;span class="nl"&gt;"empty_rate"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mf"&gt;0.0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;  
  &lt;/span&gt;&lt;span class="nl"&gt;"api_error_rate"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mf"&gt;0.0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;  
  &lt;/span&gt;&lt;span class="nl"&gt;"validation_failed_rate"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mf"&gt;0.0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;  
  &lt;/span&gt;&lt;span class="nl"&gt;"rows_ingested"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;8&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;  
  &lt;/span&gt;&lt;span class="nl"&gt;"serp_results_row_count"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;8&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;  
  &lt;/span&gt;&lt;span class="nl"&gt;"db_path"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"./data/serp_gx.duckdb"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;  
  &lt;/span&gt;&lt;span class="nl"&gt;"validation_report_path"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"./data/reports/validation_20260326T102005Z.json"&lt;/span&gt;&lt;span class="w"&gt;  
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That’s pretty much everything. I’ve walked you through each file. Just know &lt;code&gt;ingest.py&lt;/code&gt; is the entry point.&lt;/p&gt;

&lt;h2&gt;
  
  
  Frequently Asked Questions (FAQ)
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Q: Why not just write the validation logic myself?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;A:&lt;/strong&gt; You can, and for one or two checks, you probably should! The case for GX is not that something like &lt;code&gt;if not url or not url.startswith("http")&lt;/code&gt; is hard to write — it's that &lt;em&gt;twenty&lt;/em&gt; of those checks scattered across your pipeline become hard to read, hard to audit, and easy to accidentally skip when you're in a hurry. GX gives you a single place where &lt;em&gt;all&lt;/em&gt; your rules live, a consistent result structure across every check, and a report that tells you exactly which rule failed and on which values. It saves a ton of trouble in the long run.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Q: Does running GX on every batch slow the pipeline down?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;A:&lt;/strong&gt; In practice, no — not at the batch sizes typical of API calls (say, 10–50 rows per query). This is also why I run GX in ephemeral context mode (&lt;code&gt;mode="ephemeral"&lt;/code&gt;) — it avoids any file I/O or context-persistence overhead. The validation itself is pandas operations under the hood anyway.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Q: What happens to bad data in a pipeline without validation?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;A:&lt;/strong&gt; It gets inserted and you don’t find out until something downstream breaks — or worse, until it produces wrong answers that are just plausible enough to go unnoticed. A &lt;code&gt;200 OK&lt;/code&gt; with duplicate rows, empty titles, or out-of-range values looks like a success on the wire. Without an explicit quality gate, that batch goes straight into your database and silently skews every query that touches it.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Q: How do I know what rules to write for my own data?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;A:&lt;/strong&gt; Sample first, then write rules. Pull a handful of real responses from your source API, inspect the fields and edge cases, then encode what “good” looks like. Rules written speculatively against an imagined schema generate false positives; rules written against real samples catch actual failure modes. Start with the fields you’d join on or aggregate in your analytics, and &lt;em&gt;only&lt;/em&gt; add gates for other fields once you’ve seen them break.&lt;/p&gt;

&lt;h2&gt;
  
  
  What GX + Bright Data Gives You
&lt;/h2&gt;

&lt;p&gt;So why does this pattern work reliably?&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Bright Data handles the upstream layer.&lt;/strong&gt; Proxy rotation, bot detection, structured extraction — all abstracted behind a single API call that returns structured JSON.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Great Expectations handles downstream trust.&lt;/strong&gt; It doesn’t fix bad data. It measures whether incoming data meets your rules before you trust it with your analytics.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;The quarantine table (in DuckDB) gives you an audit trail.&lt;/strong&gt; Not just “something went wrong,” but what the payload was, which expectations failed, and when. You can query the quarantine table to understand your pipeline’s health over time (a sample query follows this list).&lt;/li&gt;
&lt;/ul&gt;
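
&lt;p&gt;For example, something like this (a sketch; the quarantine table and column names here are assumptions, so match them to whatever your storage module actually creates):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import duckdb

con = duckdb.connect("./data/serp_gx.duckdb", read_only=True)

# Hypothetical column names; adjust to your own quarantine schema.
print(con.sql("""
    SELECT date_trunc('day', quarantined_at) AS day,
           reason,
           count(*) AS batches
    FROM quarantine
    GROUP BY day, reason
    ORDER BY day, batches DESC
""").df())
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
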

&lt;p&gt;Web pipelines that don’t have explicit quality gates just gradually produce wrong answers. Catching &lt;em&gt;that&lt;/em&gt; before it reaches your database is the whole point.&lt;/p&gt;

</description>
      <category>python</category>
      <category>programming</category>
      <category>api</category>
      <category>tutorial</category>
    </item>
    <item>
      <title>Why You Should Add Observability to Your Data Extraction with OpenTelemetry</title>
      <dc:creator>Prithwish Nath</dc:creator>
      <pubDate>Mon, 06 Apr 2026 03:54:52 +0000</pubDate>
      <link>https://dev.to/prithwish_nath/why-you-should-add-observability-to-your-data-extraction-with-opentelemetry-p1k</link>
      <guid>https://dev.to/prithwish_nath/why-you-should-add-observability-to-your-data-extraction-with-opentelemetry-p1k</guid>
      <description>&lt;p&gt;&lt;strong&gt;&lt;em&gt;TL;DR: This is a step-by-step tutorial on the quickest way to add observability to any data ingestion pipeline — whether you’re scraping or using an API.&lt;/em&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Anything that fetches data at scale has a class of failure that error handling won’t catch. Not because your error handling code is &lt;em&gt;bad&lt;/em&gt; (it probably isn’t) but because retries that &lt;em&gt;eventually&lt;/em&gt; succeed, queries that take 10x longer than average, and domains that silently time out — &lt;strong&gt;don’t throw exceptions because they’re not technically errors.&lt;/strong&gt; And you’ll never know. The solution is actually adding proper &lt;a href="https://www.redhat.com/en/topics/devops/what-is-observability" rel="noopener noreferrer"&gt;observability&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Overkill? Not at all. Because a data pipeline — &lt;em&gt;any&lt;/em&gt; data pipeline — with network calls, retries, timeouts, and wildly variable latency across different queries and domains is a textbook &lt;a href="https://www.atlassian.com/microservices/microservices-architecture/distributed-architecture" rel="noopener noreferrer"&gt;distributed system&lt;/a&gt;. It has all the same failure modes, and so it deserves the same tooling.&lt;/p&gt;

&lt;p&gt;In this post, we’ll build a SERP pipeline on top of &lt;a href="https://get.brightdata.com/bd7914?utm_content=why_you_should_add_observability_to_your_data_extraction_with_opentelemetry" rel="noopener noreferrer"&gt;Bright Data&lt;/a&gt;’s API and instrument it with &lt;a href="https://opentelemetry.io/" rel="noopener noreferrer"&gt;OpenTelemetry&lt;/a&gt; (See: &lt;a href="https://opentelemetry.io/docs/languages/python" rel="noopener noreferrer"&gt;Python docs&lt;/a&gt;), the open-source standard for distributed tracing. Bright Data reduces blocks and proxy headaches out of the box — but proper Otel tracing shows you exactly where risk remains.&lt;/p&gt;

&lt;p&gt;By the end, you’ll be able to see what each call costs you in time, where retries are hiding, and which queries are slow.&lt;/p&gt;

&lt;h2&gt;
  
  
  What This Actually Gets You
&lt;/h2&gt;

&lt;p&gt;What I’m trying to do is surface the problems you’ll probably run into and would otherwise just silently pay for. These patterns map nearly 1:1 to whatever your data ingestion pipelines look like in production.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Retry storm detection.&lt;/strong&gt; If a domain starts blocking aggressively, you won’t see it as hard errors anymore, but as a creeping spike in &lt;code&gt;scraper.retries &amp;gt; 0&lt;/code&gt; &lt;a href="https://opentelemetry.io/docs/concepts/signals/traces/#spans" rel="noopener noreferrer"&gt;spans&lt;/a&gt;. That’s your early warning before you trigger a full ban or blow past your proxy quota for the month (a minimal sketch of this check follows the list).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Actual cost visibility.&lt;/strong&gt; Every retry is another proxy request. If you’re paying per request or per GB, &lt;code&gt;scraper.retries&lt;/code&gt; on your spans maps directly to a line item on your invoice. You can aggregate this and alert on it — I wasn’t doing this before adding OTel, and most likely, neither were you. 😅&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Per-query latency profiling.&lt;/strong&gt; Some queries are just structurally slower — more competitive terms, heavier result pages, more contention in the proxy pool. Traces let you see this per-query instead of as a blended average that makes everything look fine. Once you can see the outliers, you can do something about them.&lt;/li&gt;
&lt;/ul&gt;
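
&lt;p&gt;As promised, here’s a minimal sketch of that retry-storm check: a span processor that flags any finished span where the &lt;code&gt;scraper.retries&lt;/code&gt; attribute (which we’ll add to our spans later in this post) is non-zero. The class name and the &lt;code&gt;print&lt;/code&gt;-based alerting are placeholders for whatever alerting you actually use; the &lt;code&gt;SpanProcessor.on_end&lt;/code&gt; hook itself is standard OTel SDK:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;from opentelemetry.sdk.trace import ReadableSpan, SpanProcessor


class RetryAlertProcessor(SpanProcessor):
    """Flag finished spans that needed at least one retry."""

    def on_end(self, span: ReadableSpan) -&amp;gt; None:
        attrs = span.attributes or {}
        retries = attrs.get("scraper.retries", 0)
        if retries:
            # Swap this print for your metrics or alerting of choice.
            print(f"[retry-alert] {span.name} query={attrs.get('scraper.query')} retries={retries}")
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Once the tracer provider is set up (we’ll do that in a minute), registering it is one extra call: &lt;code&gt;provider.add_span_processor(RetryAlertProcessor())&lt;/code&gt;.&lt;/p&gt;
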

&lt;p&gt;Basically, if you take ONE thing away from this read, let it be this: &lt;strong&gt;data pipelines have exactly the same failure modes as any distributed system&lt;/strong&gt; — timeouts, partial failures, retry amplification, silent degradation — whether data is obtained via an API call or just scraping.&lt;/p&gt;

&lt;p&gt;So let’s build a data collection stack you can reason about. And, as it turns out, the tooling you’d use for microservices works perfectly well here too.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Setup
&lt;/h2&gt;

&lt;p&gt;Here’s what you need:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;opentelemetry-api&amp;gt;=1.20.0  
opentelemetry-sdk&amp;gt;=1.20.0  
opentelemetry-instrumentation-requests&amp;gt;=0.41b0  
opentelemetry-exporter-otlp-proto-http&amp;gt;=1.20.0  
requests&amp;gt;=2.28.0  
python-dotenv&amp;gt;=1.0.0
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The important ones:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://pypi.org/project/opentelemetry-instrumentation-requests/" rel="noopener noreferrer"&gt;opentelemetry-instrumentation-requests&lt;/a&gt; — this gives us automatic HTTP tracing. Zero manual work.&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://pypi.org/project/opentelemetry-exporter-otlp-proto-http/" rel="noopener noreferrer"&gt;opentelemetry-exporter-otlp-proto-http&lt;/a&gt; — for when we want to send traces somewhere real, like &lt;a href="https://www.jaegertracing.io/docs/" rel="noopener noreferrer"&gt;Jaeger&lt;/a&gt;.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Before running &lt;code&gt;pip install -r requirements.txt&lt;/code&gt;, create a &lt;code&gt;.env&lt;/code&gt; file with:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;BRIGHT_DATA_API_KEY=your_api_key  
BRIGHT_DATA_ZONE=serp # or your SERP zone name from Bright Data dashboard  
BRIGHT_DATA_COUNTRY=us # optional  
OTEL_EXPORTER=console   # set to "jaeger" to send traces to Jaeger (must be running)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The client reads these on instantiation. Swap in your own API credentials, and don’t forget &lt;code&gt;OTEL_EXPORTER&lt;/code&gt; — it controls where traces go.&lt;/p&gt;

&lt;h2&gt;
  
  
  Initializing OpenTelemetry
&lt;/h2&gt;

&lt;p&gt;We want two modes: a console exporter for development where traces print right in the terminal, and an &lt;a href="https://opentelemetry.io/docs/specs/otlp/" rel="noopener noreferrer"&gt;OTLP&lt;/a&gt; exporter for production. A single env var switches between them:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;os&lt;/span&gt;  
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;opentelemetry&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;trace&lt;/span&gt;  
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;opentelemetry.instrumentation.requests&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;RequestsInstrumentor&lt;/span&gt;  
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;opentelemetry.sdk.trace&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;TracerProvider&lt;/span&gt;  
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;opentelemetry.sdk.trace.export&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;BatchSpanProcessor&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;ConsoleSpanExporter&lt;/span&gt;  
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;opentelemetry.sdk.resources&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;SERVICE_NAME&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Resource&lt;/span&gt;  


&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;init_otel&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;exporter&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;console&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;service_name&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;bd-scraper&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;  
    &lt;span class="n"&gt;resource&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;Resource&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;create&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;attributes&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="n"&gt;SERVICE_NAME&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;service_name&lt;/span&gt;&lt;span class="p"&gt;})&lt;/span&gt;  
    &lt;span class="n"&gt;provider&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;TracerProvider&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;resource&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;resource&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  

    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;exporter&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;jaeger&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;  
        &lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;opentelemetry.exporter.otlp.proto.http.trace_exporter&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;OTLPSpanExporter&lt;/span&gt;  
        &lt;span class="n"&gt;endpoint&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;os&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;getenv&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;OTEL_EXPORTER_OTLP_ENDPOINT&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;http://localhost:4318&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  
        &lt;span class="n"&gt;processor&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;BatchSpanProcessor&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nc"&gt;OTLPSpanExporter&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;endpoint&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;endpoint&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;/v1/traces&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;  
    &lt;span class="k"&gt;else&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;  
        &lt;span class="n"&gt;processor&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;BatchSpanProcessor&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nc"&gt;ConsoleSpanExporter&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;  

    &lt;span class="n"&gt;provider&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;add_span_processor&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;processor&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  
    &lt;span class="n"&gt;trace&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;set_tracer_provider&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;provider&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  
    &lt;span class="nc"&gt;RequestsInstrumentor&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="nf"&gt;instrument&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;  &lt;span class="c1"&gt;# this is where the magic happens!  
&lt;/span&gt;    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;trace&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get_tracer&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;service_name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;1.0.0&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That one line — &lt;a href="https://opentelemetry-python-contrib.readthedocs.io/en/latest/instrumentation/requests/requests.html" rel="noopener noreferrer"&gt;RequestsInstrumentor().instrument()&lt;/a&gt; — hooks into the requests library globally. Every HTTP call your code makes from this point forward gets a &lt;a href="https://opentelemetry.io/docs/concepts/signals/traces/#spans" rel="noopener noreferrer"&gt;trace span&lt;/a&gt;, including the ones in third-party code you didn’t write. You get that for free.&lt;/p&gt;
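
&lt;p&gt;If you want to sanity-check that the hook is live, two lines are enough. After &lt;code&gt;init_otel()&lt;/code&gt;, any plain &lt;code&gt;requests&lt;/code&gt; call (the URL here is just an example) shows up as a span on the console once the batch processor flushes at exit:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;from otel_config import init_otel

init_otel()  # console exporter by default

import requests  # imported after init, mirroring the ordering caveat below

# No custom span anywhere -- the auto-instrumentation alone creates one.
requests.get("https://example.com", timeout=10)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
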

&lt;p&gt;One thing that’ll catch you out if you’re not careful: &lt;code&gt;init_otel&lt;/code&gt; must run &lt;em&gt;before&lt;/em&gt; any &lt;code&gt;requests.Session&lt;/code&gt; is created. That means calling it before importing &lt;code&gt;BrightDataClient&lt;/code&gt; in your entrypoint. Yes, the import order matters here.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Client with Custom Spans
&lt;/h2&gt;

&lt;p&gt;Automatic HTTP tracing is great, but it only tells you about the transport layer. It has no idea this call was for the query “&lt;code&gt;machine learning&lt;/code&gt;”, or that it targeted &lt;code&gt;google.com&lt;/code&gt;, or that it had to retry once before it worked. That context is what custom spans are for.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt;  
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;os&lt;/span&gt;  
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;time&lt;/span&gt;  
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;requests&lt;/span&gt;  
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;typing&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Optional&lt;/span&gt;  

&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;dotenv&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;load_dotenv&lt;/span&gt;  
&lt;span class="nf"&gt;load_dotenv&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;  


&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;BrightDataClient&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;  
    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;__init__&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  
        &lt;span class="n"&gt;api_key&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;Optional&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  
        &lt;span class="n"&gt;zone&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;Optional&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  
        &lt;span class="n"&gt;country&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;Optional&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;    &lt;span class="p"&gt;):&lt;/span&gt;  
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;api_key&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;api_key&lt;/span&gt; &lt;span class="ow"&gt;or&lt;/span&gt; &lt;span class="n"&gt;os&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;getenv&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;BRIGHT_DATA_API_KEY&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;zone&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;zone&lt;/span&gt; &lt;span class="ow"&gt;or&lt;/span&gt; &lt;span class="n"&gt;os&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;getenv&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;BRIGHT_DATA_ZONE&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;country&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;country&lt;/span&gt; &lt;span class="ow"&gt;or&lt;/span&gt; &lt;span class="n"&gt;os&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;getenv&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;BRIGHT_DATA_COUNTRY&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;api_endpoint&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;https://api.brightdata.com/request&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;  

        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;api_key&lt;/span&gt; &lt;span class="ow"&gt;or&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;zone&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;  
            &lt;span class="k"&gt;raise&lt;/span&gt; &lt;span class="nc"&gt;ValueError&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;BRIGHT_DATA_API_KEY and BRIGHT_DATA_ZONE required (env or constructor)&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  

        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;session&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;requests&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Session&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;  
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;session&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;headers&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;update&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;  
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Content-Type&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;application/json&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Authorization&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Bearer &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;api_key&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  
        &lt;span class="p"&gt;})&lt;/span&gt;  

    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;search&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  
        &lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  
        &lt;span class="n"&gt;num_results&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;int&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  
        &lt;span class="n"&gt;language&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;Optional&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  
        &lt;span class="n"&gt;country&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;Optional&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  
        &lt;span class="n"&gt;max_retries&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;int&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;    &lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;dict&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;  
        &lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;opentelemetry&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;trace&lt;/span&gt;  
        &lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;opentelemetry.trace&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;StatusCode&lt;/span&gt;  

        &lt;span class="n"&gt;tracer&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;trace&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get_tracer&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;__name__&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;1.0.0&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  
        &lt;span class="n"&gt;target_domain&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;google.com&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;  

        &lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="n"&gt;tracer&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;start_as_current_span&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;bright_data.search&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;span&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;  
            &lt;span class="n"&gt;span&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;set_attribute&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;scraper.query&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  
            &lt;span class="n"&gt;span&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;set_attribute&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;scraper.target_domain&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;target_domain&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  
            &lt;span class="n"&gt;span&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;set_attribute&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;scraper.num_results&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;num_results&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  

            &lt;span class="n"&gt;start&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;time&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;perf_counter&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;  
            &lt;span class="n"&gt;last_err&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;  

            &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;attempt&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nf"&gt;range&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;max_retries&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;  
                &lt;span class="k"&gt;try&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;  
                    &lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;_do_search&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;num_results&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;language&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;country&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  
                    &lt;span class="n"&gt;latency_ms&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;time&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;perf_counter&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;start&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="mi"&gt;1000&lt;/span&gt;  
                    &lt;span class="n"&gt;span&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;set_attribute&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;scraper.latency_ms&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nf"&gt;round&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;latency_ms&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;  
                    &lt;span class="n"&gt;span&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;set_attribute&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;scraper.retries&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;attempt&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  
                    &lt;span class="c1"&gt;# clean success — no retries needed  
&lt;/span&gt;                    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;attempt&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;  
                        &lt;span class="n"&gt;span&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;set_status&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;StatusCode&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;OK&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  
                    &lt;span class="k"&gt;else&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;  
                        &lt;span class="c1"&gt;# recovered, but we want this surfaced in Jaeger  
&lt;/span&gt;                        &lt;span class="n"&gt;span&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;set_status&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;StatusCode&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ERROR&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Recovered after retry&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  
                    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;result&lt;/span&gt;  
                &lt;span class="k"&gt;except&lt;/span&gt; &lt;span class="nb"&gt;Exception&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;  
                    &lt;span class="n"&gt;last_err&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;e&lt;/span&gt;  
                    &lt;span class="n"&gt;span&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;set_attribute&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;scraper.retries&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;attempt&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  
                    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;attempt&lt;/span&gt; &lt;span class="err"&gt;`&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="n"&gt;max_retries&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;  
                        &lt;span class="n"&gt;time&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;sleep&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mf"&gt;0.5&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;attempt&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;  

            &lt;span class="c1"&gt;# all retries exhausted  
&lt;/span&gt;            &lt;span class="n"&gt;span&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;set_attribute&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;scraper.error&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nf"&gt;str&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;last_err&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;  
            &lt;span class="n"&gt;span&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;set_status&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;StatusCode&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ERROR&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nf"&gt;str&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;last_err&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;  
            &lt;span class="n"&gt;span&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;record_exception&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;last_err&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  
            &lt;span class="k"&gt;raise&lt;/span&gt; &lt;span class="n"&gt;last_err&lt;/span&gt;  

    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;_do_search&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  
        &lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  
        &lt;span class="n"&gt;num_results&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;int&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  
        &lt;span class="n"&gt;language&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;Optional&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;  
        &lt;span class="n"&gt;country&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;Optional&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;    &lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt;&lt;span class="err"&gt;`&lt;/span&gt; &lt;span class="nb"&gt;dict&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;  
        &lt;span class="n"&gt;search_url&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;  
            &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;https://www.google.com/search&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;  
            &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;?q=&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;requests&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;utils&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;quote&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;&amp;amp;num=&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;num_results&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;&amp;amp;brd_json=1&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;  
        &lt;span class="p"&gt;)&lt;/span&gt;  
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;language&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;  
            &lt;span class="n"&gt;search_url&lt;/span&gt; &lt;span class="o"&gt;+=&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;&amp;amp;hl=&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;language&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;&amp;amp;lr=lang_&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;language&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;  
        &lt;span class="n"&gt;target_country&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;country&lt;/span&gt; &lt;span class="ow"&gt;or&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;country&lt;/span&gt;  
        &lt;span class="n"&gt;payload&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;zone&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;zone&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;url&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;search_url&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;format&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;json&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;  
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;target_country&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;  
            &lt;span class="n"&gt;payload&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;country&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;target_country&lt;/span&gt;  

        &lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;session&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;post&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;api_endpoint&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;payload&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;timeout&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;30&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  
        &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;raise_for_status&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;  
        &lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;json&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;  
        &lt;span class="c1"&gt;# Bright Data may return body as JSON string — unpack it  
&lt;/span&gt;        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="nf"&gt;isinstance&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nb"&gt;dict&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="ow"&gt;and&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;body&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;  
            &lt;span class="n"&gt;body&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;body&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;  
            &lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;loads&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;body&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="nf"&gt;isinstance&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;body&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;else&lt;/span&gt; &lt;span class="n"&gt;body&lt;/span&gt;  
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;result&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;I’m using a &lt;a href="https://get.brightdata.com/bd-serp-api?utm_content=why_you_should_add_observability_to_your_data_extraction_with_opentelemetry" rel="noopener noreferrer"&gt;SERP API&lt;/a&gt; for data, but swap it out for whatever you’re using. The concepts apply to anything. Also, a couple of things worth understanding here:&lt;/p&gt;

&lt;h3&gt;
  
  
  The Parent-Child Relationship
&lt;/h3&gt;

&lt;p&gt;The &lt;code&gt;_do_search&lt;/code&gt; helper builds the target URL and POSTs it to our API endpoint (&lt;code&gt;api.brightdata.com/request&lt;/code&gt;). When that call runs, the &lt;code&gt;RequestsInstrumentor&lt;/code&gt; auto-creates a child POST span &lt;em&gt;inside&lt;/em&gt; our &lt;code&gt;bright_data.search&lt;/code&gt; parent span. They share the same &lt;code&gt;trace_id&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;In &lt;a href="https://www.jaegertracing.io/docs/" rel="noopener noreferrer"&gt;Jaeger&lt;/a&gt;, you’ll get a proper timeline: the outer business operation wrapping the inner HTTP call. That nesting is what makes traces actually useful — you see the whole story, not just individual events.&lt;/p&gt;
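
&lt;p&gt;If you want to see the shared &lt;code&gt;trace_id&lt;/code&gt; for yourself, you can print it from inside the parent span; this is a standalone sketch, not part of the client code:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;from opentelemetry import trace

tracer = trace.get_tracer(__name__)

with tracer.start_as_current_span("bright_data.search") as span:
    # Any HTTP call made in here becomes a child span carrying this same ID.
    print(f"trace_id={span.get_span_context().trace_id:032x}")
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
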

&lt;h3&gt;
  
  
  You Have to Set Span Status Yourself
&lt;/h3&gt;

&lt;p&gt;This one surprised me. OTel records all the data you throw at it, but it won’t decide what &lt;em&gt;matters&lt;/em&gt; on your behalf. If you don’t explicitly call &lt;a href="https://opentelemetry.io/docs/languages/python/instrumentation/#set-span-status" rel="noopener noreferrer"&gt;&lt;code&gt;span.set_status(…)&lt;/code&gt;&lt;/a&gt;, every span stays &lt;code&gt;UNSET&lt;/code&gt; — even when a retry happened underneath. A query that timed out, retried, and recovered would be completely invisible to a &lt;a href="https://www.jaegertracing.io/docs/" rel="noopener noreferrer"&gt;Jaeger&lt;/a&gt; filter like &lt;code&gt;status=ERROR&lt;/code&gt;. You’d never find it.&lt;/p&gt;

&lt;p&gt;So there’s a deliberate tradeoff we’re making in the code above: recovered retries are marked &lt;code&gt;ERROR&lt;/code&gt; so they show up in dashboards. Some teams prefer to use &lt;code&gt;OK&lt;/code&gt; and add a &lt;code&gt;scraper.recovered = true&lt;/code&gt; attribute instead, keeping error rate metrics clean.&lt;/p&gt;

&lt;p&gt;Honestly, both are fine 🤷‍♂️ It just depends on whether you want alerting to treat “degraded success” as a failure. The important thing is to choose consciously, and not fall through to &lt;code&gt;UNSET&lt;/code&gt; by accident.&lt;/p&gt;
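
&lt;p&gt;For the record, the second option could look something like this; &lt;code&gt;mark_outcome&lt;/code&gt; is a hypothetical helper you’d call from the success path in &lt;code&gt;search()&lt;/code&gt; instead of the current if/else:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;from opentelemetry.trace import Span, StatusCode


def mark_outcome(span: Span, attempt: int) -&amp;gt; None:
    """Successful call: always OK, but tag recovered retries so they stay queryable."""
    span.set_status(StatusCode.OK)
    if attempt &amp;gt; 0:
        span.set_attribute("scraper.recovered", True)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Dashboards would then filter on &lt;code&gt;scraper.recovered=true&lt;/code&gt; instead of &lt;code&gt;status=ERROR&lt;/code&gt;.&lt;/p&gt;
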

&lt;h2&gt;
  
  
  Putting It All Together
&lt;/h2&gt;

&lt;p&gt;Let’s wire it all up in one file; call it something like &lt;code&gt;scraper.py&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;argparse&lt;/span&gt;  
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;os&lt;/span&gt;  
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;time&lt;/span&gt;  

&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;dotenv&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;load_dotenv&lt;/span&gt;  
&lt;span class="nf"&gt;load_dotenv&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;  
&lt;span class="n"&gt;_exporter&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;os&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;getenv&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;OTEL_EXPORTER&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;console&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="c1"&gt;# console fallback as a default  
&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;otel_config&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;init_otel&lt;/span&gt;  
&lt;span class="nf"&gt;init_otel&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;exporter&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;_exporter&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  &lt;span class="c1"&gt;# must come before BrightDataClient import  
&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;bright_data_otel&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;BrightDataClient&lt;/span&gt;  


&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;run&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;calls&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;int&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;delay&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;float&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mf"&gt;0.5&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;  
    &lt;span class="n"&gt;queries&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;  
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;python programming&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;machine learning&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;web development&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;data science&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;cloud computing&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  
    &lt;span class="p"&gt;]&lt;/span&gt;  
    &lt;span class="n"&gt;client&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;BrightDataClient&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;  
    &lt;span class="n"&gt;start&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;time&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;time&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;  
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nf"&gt;range&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;calls&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;  
        &lt;span class="n"&gt;q&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;queries&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="o"&gt;%&lt;/span&gt; &lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;queries&lt;/span&gt;&lt;span class="p"&gt;)]&lt;/span&gt;  
        &lt;span class="k"&gt;try&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;  
            &lt;span class="n"&gt;data&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;search&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;q&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;num_results&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  
            &lt;span class="n"&gt;n&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;organic&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;[]))&lt;/span&gt; &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="nf"&gt;isinstance&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nb"&gt;dict&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;else&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;  
            &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;  [&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="o"&gt;+&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;/&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;calls&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;] &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;q&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;n&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; results&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  
        &lt;span class="k"&gt;except&lt;/span&gt; &lt;span class="nb"&gt;Exception&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;  
            &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;  [&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="o"&gt;+&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;/&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;calls&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;] &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;q&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;: error — &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="err"&gt;`&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="n"&gt;calls&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;  
            &lt;span class="n"&gt;time&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;sleep&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;delay&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Done in &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;time&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;time&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;start&lt;/span&gt;&lt;span class="si"&gt;:&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;s&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  


&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;__name__&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;__main__&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;  
    &lt;span class="n"&gt;p&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;argparse&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;ArgumentParser&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;  
    &lt;span class="n"&gt;p&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;add_argument&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;--count&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nb"&gt;type&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nb"&gt;int&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;default&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  
    &lt;span class="n"&gt;p&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;add_argument&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;--delay&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nb"&gt;type&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nb"&gt;float&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;default&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;0.5&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  
    &lt;span class="n"&gt;args&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;p&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;parse_args&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;  
    &lt;span class="nf"&gt;run&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;calls&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;args&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;count&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;delay&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;args&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;delay&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Run it in console mode:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; python scraper.py &lt;span class="nt"&gt;--count&lt;/span&gt; 5
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Or point it at &lt;a href="https://www.jaegertracing.io/docs/" rel="noopener noreferrer"&gt;Jaeger&lt;/a&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# to be honest, just set OTEL_EXPORTER in .env  &lt;/span&gt;
&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="nv"&gt;OTEL_EXPORTER&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;jaeger python scraper.py &lt;span class="nt"&gt;--count&lt;/span&gt; 5
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  What the Traces Actually Show
&lt;/h2&gt;

&lt;p&gt;Here’s what the terminal printed for my five-query run:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="o"&gt;[&lt;/span&gt;1/5] python programming: 6 results  
&lt;span class="o"&gt;[&lt;/span&gt;2/5] machine learning: 8 results  
&lt;span class="o"&gt;[&lt;/span&gt;3/5] web development: 9 results  
&lt;span class="o"&gt;[&lt;/span&gt;4/5] data science: 9 results  
&lt;span class="o"&gt;[&lt;/span&gt;5/5] cloud computing: 9 results  
Done &lt;span class="k"&gt;in &lt;/span&gt;75.1s
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Five queries, all with results, no errors printed. Looks perfectly healthy… right? Let’s see what the traces say.&lt;/p&gt;

&lt;h3&gt;
  
  
  The clean calls…
&lt;/h3&gt;

&lt;p&gt;For &lt;code&gt;python programming&lt;/code&gt;, the &lt;code&gt;bright_data.search&lt;/code&gt; span looks exactly as expected:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;  
  &lt;/span&gt;&lt;span class="nl"&gt;"name"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"bright_data.search"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;  
  &lt;/span&gt;&lt;span class="nl"&gt;"attributes"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;  
    &lt;/span&gt;&lt;span class="nl"&gt;"scraper.query"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"python programming"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;  
    &lt;/span&gt;&lt;span class="nl"&gt;"scraper.target_domain"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"google.com"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;  
    &lt;/span&gt;&lt;span class="nl"&gt;"scraper.latency_ms"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mf"&gt;3686.01&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;  
    &lt;/span&gt;&lt;span class="nl"&gt;"scraper.retries"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="w"&gt;  
  &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;3.7 seconds, zero retries, one nested POST span confirming the HTTP round-trip happened exactly once. Looks good! Moving on.&lt;/p&gt;

&lt;h3&gt;
  
  
  …and the ones that weren’t so clean.
&lt;/h3&gt;

&lt;p&gt;The query &lt;code&gt;data science&lt;/code&gt; printed 9 results. Except the traces show &lt;em&gt;three&lt;/em&gt; spans for that single call:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;  
  &lt;/span&gt;&lt;span class="nl"&gt;"name"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"POST"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;  
  &lt;/span&gt;&lt;span class="nl"&gt;"status"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;  
    &lt;/span&gt;&lt;span class="nl"&gt;"status_code"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"ERROR"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;  
    &lt;/span&gt;&lt;span class="nl"&gt;"description"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"ReadTimeout: ...Read timed out. (read timeout=30)"&lt;/span&gt;&lt;span class="w"&gt;  
  &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;  
  &lt;/span&gt;&lt;span class="nl"&gt;"start_time"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"2026-03-20T21:18:24.986273Z"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;  
  &lt;/span&gt;&lt;span class="nl"&gt;"end_time"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"2026-03-20T21:18:54.999505Z"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;  
  &lt;/span&gt;&lt;span class="nl"&gt;"events"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[{&lt;/span&gt;&lt;span class="w"&gt;  
    &lt;/span&gt;&lt;span class="nl"&gt;"name"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"exception"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;  
    &lt;/span&gt;&lt;span class="nl"&gt;"attributes"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;  
      &lt;/span&gt;&lt;span class="nl"&gt;"exception.type"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"requests.exceptions.ReadTimeout"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;  
      &lt;/span&gt;&lt;span class="nl"&gt;"exception.stacktrace"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"..."&lt;/span&gt;&lt;span class="w"&gt;  
    &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;  
  &lt;/span&gt;&lt;span class="p"&gt;}]&lt;/span&gt;&lt;span class="w"&gt;  
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;  
&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;  
  &lt;/span&gt;&lt;span class="nl"&gt;"name"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"POST"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;  
  &lt;/span&gt;&lt;span class="nl"&gt;"status"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"status_code"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"UNSET"&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;  
  &lt;/span&gt;&lt;span class="nl"&gt;"start_time"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"2026-03-20T21:18:55.505186Z"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;  
  &lt;/span&gt;&lt;span class="nl"&gt;"end_time"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"2026-03-20T21:19:20.097874Z"&lt;/span&gt;&lt;span class="w"&gt;  
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;  
&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;  
  &lt;/span&gt;&lt;span class="nl"&gt;"name"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"bright_data.search"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;  
  &lt;/span&gt;&lt;span class="nl"&gt;"attributes"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;  
    &lt;/span&gt;&lt;span class="nl"&gt;"scraper.query"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"data science"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;  
    &lt;/span&gt;&lt;span class="nl"&gt;"scraper.latency_ms"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mf"&gt;55113.46&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;  
    &lt;/span&gt;&lt;span class="nl"&gt;"scraper.retries"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="w"&gt;  
  &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;  
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Turns out, this query hit the 30-second read timeout, waited for the retry backoff, tried again, and finally came back with data — costing you two proxy requests and 55 seconds instead of one request and ~4 seconds.&lt;/p&gt;

&lt;p&gt;You’d have absolutely no idea from the terminal output itself.&lt;/p&gt;

&lt;p&gt;This happens more often than you think — failures that silently blend in with the clean calls around them, and that your pipeline still declares a success. &lt;strong&gt;That’s the whole argument for adding observability to your data ingest pipeline, right there.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;This failure was invisible: no exception, no warning, nothing in your logs. OTel didn’t &lt;em&gt;prevent&lt;/em&gt; the failure — tracing has nothing to do with the networking or ingestion path — but it made it impossible to miss.&lt;/p&gt;

&lt;h3&gt;
  
  
  The Latency Breakdown Across All Five Queries
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Query&lt;/th&gt;
&lt;th&gt;Latency&lt;/th&gt;
&lt;th&gt;Retries&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;python programming&lt;/td&gt;
&lt;td&gt;3,686ms&lt;/td&gt;
&lt;td&gt;0&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;machine learning&lt;/td&gt;
&lt;td&gt;6,558ms&lt;/td&gt;
&lt;td&gt;0&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;web development&lt;/td&gt;
&lt;td&gt;3,079ms&lt;/td&gt;
&lt;td&gt;0&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;data science&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;55,113ms&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;1&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;cloud computing&lt;/td&gt;
&lt;td&gt;4,600ms&lt;/td&gt;
&lt;td&gt;0&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The &lt;code&gt;scraper.target_domain&lt;/code&gt; attribute lets you aggregate this same breakdown per domain when you scrape multiple targets (e.g. google.com vs bing.com).&lt;/p&gt;
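
&lt;p&gt;If you export spans anywhere you can query, or just dump them to JSON, that per-domain rollup is a few lines of pandas. A minimal sketch with hypothetical rows shaped like the span attributes above (the bing.com numbers are invented purely for illustration):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Hypothetical span rows; attribute names match the traces above,
# but the bing.com values are made up just to show the rollup.
import pandas as pd

spans = [
    {"scraper.target_domain": "google.com", "scraper.latency_ms": 3686, "scraper.retries": 0},
    {"scraper.target_domain": "google.com", "scraper.latency_ms": 55113, "scraper.retries": 1},
    {"scraper.target_domain": "bing.com", "scraper.latency_ms": 2901, "scraper.retries": 0},
]

df = pd.DataFrame(spans)
print(
    df.groupby("scraper.target_domain").agg(
        calls=("scraper.latency_ms", "count"),
        median_latency_ms=("scraper.latency_ms", "median"),
        total_retries=("scraper.retries", "sum"),
    )
)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;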

&lt;p&gt;Four of five calls were clean. One was 15x slower than average, and you only know that because you were looking at traces.&lt;/p&gt;

&lt;p&gt;If you’re running this at scale across hundreds of queries, that pattern — most queries fast, some consistently slow or timeout-prone — is &lt;em&gt;exactly&lt;/em&gt; the info you need to tune retry budgets, adjust per-query timeouts, or start asking why that &lt;code&gt;data science&lt;/code&gt; query keeps choking.&lt;/p&gt;

&lt;p&gt;You can’t turn knobs on what you can’t see, after all. Adding observability gives you visibility into things you may not even have thought of.&lt;/p&gt;

&lt;h2&gt;
  
  
  Going to Production with Jaeger
&lt;/h2&gt;

&lt;p&gt;The console exporter is great for development, but for anything actually running in production you want traces going somewhere persistent. The easiest starting point is &lt;a href="https://hub.docker.com/r/jaegertracing/all-in-one" rel="noopener noreferrer"&gt;Jaeger’s all-in-one Docker image&lt;/a&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;docker run &lt;span class="nt"&gt;-d&lt;/span&gt; &lt;span class="nt"&gt;--name&lt;/span&gt; jaeger &lt;span class="se"&gt;\ &lt;/span&gt; 
  &lt;span class="nt"&gt;-p&lt;/span&gt; 16686:16686 &lt;span class="se"&gt;\ &lt;/span&gt;
  &lt;span class="nt"&gt;-p&lt;/span&gt; 4318:4318 &lt;span class="se"&gt;\ &lt;/span&gt; 
  jaegertracing/all-in-one:latest
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
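
&lt;p&gt;If the Jaeger branch of your exporter switch isn’t wired up yet, here’s a minimal sketch of what it could look like. The only Jaeger-specific part is pointing the standard OTLP HTTP exporter at port 4318 (assumes &lt;code&gt;opentelemetry-sdk&lt;/code&gt; and &lt;code&gt;opentelemetry-exporter-otlp-proto-http&lt;/code&gt; are installed; &lt;code&gt;bd-scraper&lt;/code&gt; is just the service name I’m using here):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Minimal sketch: send spans to Jaeger via its OTLP HTTP endpoint (port 4318).
# Assumes: pip install opentelemetry-sdk opentelemetry-exporter-otlp-proto-http
from opentelemetry import trace
from opentelemetry.sdk.resources import Resource
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.http.trace_exporter import OTLPSpanExporter

def init_jaeger_tracing(service_name: str = "bd-scraper") -&amp;gt; trace.Tracer:
    # Jaeger all-in-one accepts OTLP over HTTP on the port we published above
    exporter = OTLPSpanExporter(endpoint="http://localhost:4318/v1/traces")
    provider = TracerProvider(resource=Resource.create({"service.name": service_name}))
    provider.add_span_processor(BatchSpanProcessor(exporter))
    trace.set_tracer_provider(provider)
    return trace.get_tracer(service_name)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Everything else (the span names and the &lt;code&gt;scraper.*&lt;/code&gt; attributes) stays exactly the same; only the exporter changes.&lt;/p&gt;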



&lt;p&gt;Then, as before (with &lt;code&gt;OTEL_EXPORTER=jaeger&lt;/code&gt; still set, on the command line or in &lt;code&gt;.env&lt;/code&gt;):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; python scraper.py &lt;span class="nt"&gt;--count&lt;/span&gt; 10
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;After this run, open &lt;code&gt;http://localhost:16686&lt;/code&gt;, search for &lt;code&gt;bd-scraper&lt;/code&gt; (or whatever you called yours), and you’ll see each &lt;code&gt;bright_data.search&lt;/code&gt; span as a row in the trace timeline with the nested POST spans inside.&lt;/p&gt;

&lt;p&gt;The &lt;code&gt;data science&lt;/code&gt; query was slow again, but hey, at least it didn’t fail? Small victories. 😅 It stands out immediately in the Jaeger UI — a span bar far wider than everything else on the screen (~17 seconds).&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fs4i81kyaukz303zr9mcm.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fs4i81kyaukz303zr9mcm.png" width="640" height="378"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This is what &lt;a href="http://localhost:16686/search" rel="noopener noreferrer"&gt;http://localhost:16686/search&lt;/a&gt; will look like after a run.&lt;/p&gt;

&lt;p&gt;Click into any trace for more detail.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fp5d5mzn7aawt6fgxmjwn.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fp5d5mzn7aawt6fgxmjwn.png" width="640" height="422"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;And from here you can expand each trace down to as fine-grained a level of detail as you need.&lt;/p&gt;

&lt;p&gt;For a real setup, swap Jaeger out for whatever backend you already run. &lt;a href="https://grafana.com/docs/tempo/latest" rel="noopener noreferrer"&gt;Grafana Tempo&lt;/a&gt;, &lt;a href="https://docs.honeycomb.io/send-data/traces/opentelemetry" rel="noopener noreferrer"&gt;Honeycomb&lt;/a&gt;, &lt;a href="https://docs.datadoghq.com/opentelemetry/setup" rel="noopener noreferrer"&gt;Datadog&lt;/a&gt; — the &lt;a href="https://opentelemetry.io/docs/specs/otlp/" rel="noopener noreferrer"&gt;OTLP&lt;/a&gt; exporter speaks the same protocol to all of them.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;That’s everything! Feel free to reach out on&lt;/em&gt; &lt;a href="https://www.linkedin.com/in/prithwish-nath-04b873a7/" rel="noopener noreferrer"&gt;LinkedIn&lt;/a&gt; &lt;em&gt;if you have questions, or leave a comment below.👋&lt;/em&gt;&lt;/p&gt;

</description>
      <category>devops</category>
      <category>python</category>
      <category>programming</category>
    </item>
    <item>
      <title>Building a Local Data Analytics Pipeline with dbt Core and DuckDB</title>
      <dc:creator>Prithwish Nath</dc:creator>
      <pubDate>Wed, 18 Mar 2026 16:49:14 +0000</pubDate>
      <link>https://dev.to/prithwish_nath/building-a-local-data-analytics-pipeline-with-dbt-core-and-duckdb-4e4</link>
      <guid>https://dev.to/prithwish_nath/building-a-local-data-analytics-pipeline-with-dbt-core-and-duckdb-4e4</guid>
      <description>&lt;p&gt;&lt;strong&gt;&lt;em&gt;TL;DR:&lt;/em&gt;&lt;/strong&gt; &lt;em&gt;This pipeline uses dbt Core + DuckDB locally — no infrastructure — to normalize domains, deduplicate URLs, enforce data contracts via tests, and materialize four analyst-ready mart tables from raw SERP API output.&lt;/em&gt;&lt;/p&gt;


&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F49xhyi77pufd3km52e77.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F49xhyi77pufd3km52e77.png" alt="cover image with article title and logos for dbt and duck DB" width="560" height="294"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;After web ingestion, you’ll have inconsistent domains, duplicate URLs across collection runs, null titles, and more. This is not  &lt;em&gt;wrong&lt;/em&gt; data, per se, just unprocessed data. The gap between “data in a table” and “data you can trust in a query” is bigger than you think.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://docs.getdbt.com/" rel="noopener noreferrer"&gt;dbt&lt;/a&gt;  (data build tool) is an open-source transformation framework that can help us with exactly that problem: you write SQL models, it materializes them in dependency order, and it tracks lineage from raw source to final output. Paired with  &lt;a href="https://duckdb.org/docs/" rel="noopener noreferrer"&gt;DuckDB&lt;/a&gt;  via the community &lt;a href="https://docs.getdbt.com/reference/resource-configs/duckdb-configs" rel="noopener noreferrer"&gt;dbt-duckdb&lt;/a&gt;  adapter — no infrastructure needed, it’s all&lt;code&gt;.duckdb&lt;/code&gt;  files — it's a surprisingly capable local setup for closing that gap.&lt;/p&gt;

&lt;p&gt;I’ll walk you through the Python-based pipeline I use — one that takes SERP data and produces analytics-ready tables.&lt;/p&gt;

&lt;h2&gt;
  
  
  Requirements
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;What you need:&lt;/strong&gt; Python 3.x, first of all. Then we can install our requirements like so:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;pip &lt;span class="nb"&gt;install &lt;/span&gt;dbt-core dbt-duckdb duckdb requests python-dotenv pandas
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;For ingestion, we’ll be using a SERP API — I’m using the one I have access to,  &lt;a href="https://get.brightdata.com/bd-serp-api?utm_content=building_a_local_serp_analytics_pipeline_with_dbt_core_and_duckdb" rel="noopener noreferrer"&gt;Bright Data&lt;/a&gt;. For this, you’ll need an account with a SERP zone (get API key and zone from its dashboard).&lt;/p&gt;

&lt;p&gt;Create a project directory with this layout:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  &lt;code&gt;ingest/&lt;/code&gt;  for the Python scripts (bright_data.py, duckdb_manager.py, scraper.py),&lt;/li&gt;
&lt;li&gt;  &lt;code&gt;models/&lt;/code&gt;  for dbt (with subfolders staging/, intermediate/, marts/),&lt;/li&gt;
&lt;li&gt;  &lt;code&gt;data/&lt;/code&gt;  for the .duckdb files, and&lt;/li&gt;
&lt;li&gt;  &lt;code&gt;profiles.yml&lt;/code&gt; and &lt;code&gt;dbt_project.yml&lt;/code&gt; at the root.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;So there are two main phases: a Python ingest layer that collects and streams results into DuckDB, and a dbt transformation layer with three tiers — staging, intermediate, and marts.&lt;/p&gt;

&lt;h2&gt;
  
  
  Phase 1: Ingesting SERP Data into DuckDB
&lt;/h2&gt;

&lt;p&gt;Before dbt has anything to work with, we need data in the database. The ingest layer is three Python files:  &lt;code&gt;bright_data.py&lt;/code&gt;  wraps the Bright Data SERP API,  &lt;code&gt;duckdb_manager.py&lt;/code&gt;  handles the DuckDB connection and schema, and  &lt;code&gt;scraper.py&lt;/code&gt;  orchestrates the collection loop. Place them in an  &lt;code&gt;ingest/&lt;/code&gt;  subdirectory.&lt;/p&gt;

&lt;h3&gt;
  
  
  The Bright Data Client
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://get.brightdata.com/lp-scraping-browser-acf1964?utm_content=building_a_local_serp_analytics_pipeline_with_dbt_core_and_duckdb" rel="noopener noreferrer"&gt;Bright Data’s SERP API&lt;/a&gt;  works differently from a proxy setup. Rather than routing requests through a proxy, you POST a target URL to their API endpoint and get back structured JSON.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;ingest/bright_data.py:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;  
Bright Data SERP API client for fetching search results  
&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;  

&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;os&lt;/span&gt;  
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;requests&lt;/span&gt;  
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;typing&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Dict&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Any&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Optional&lt;/span&gt;  
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;dotenv&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;load_dotenv&lt;/span&gt;  

&lt;span class="nf"&gt;load_dotenv&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;  


&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;BrightDataClient&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;  
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;  
    Client for Bright Data SERP API  
    Uses the SERP API endpoint (not proxy) for Google search access  
    &lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;  

    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;__init__&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;  
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  
        &lt;span class="n"&gt;api_key&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;Optional&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  
        &lt;span class="n"&gt;zone&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;Optional&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  
        &lt;span class="n"&gt;country&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;Optional&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;  
    &lt;span class="p"&gt;):&lt;/span&gt;  
        &lt;span class="n"&gt;env_api_key&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;os&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;getenv&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;BRIGHT_DATA_API_KEY&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  
        &lt;span class="n"&gt;env_zone&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;os&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;getenv&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;BRIGHT_DATA_ZONE&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  
        &lt;span class="n"&gt;env_country&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;os&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;getenv&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;BRIGHT_DATA_COUNTRY&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  

        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;api_key&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;api_key&lt;/span&gt; &lt;span class="ow"&gt;or&lt;/span&gt; &lt;span class="n"&gt;env_api_key&lt;/span&gt;  
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;zone&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;zone&lt;/span&gt; &lt;span class="ow"&gt;or&lt;/span&gt; &lt;span class="n"&gt;env_zone&lt;/span&gt;  
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;country&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;country&lt;/span&gt; &lt;span class="ow"&gt;or&lt;/span&gt; &lt;span class="n"&gt;env_country&lt;/span&gt;  
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;api_endpoint&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;https://api.brightdata.com/request&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;  

        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;api_key&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;  
            &lt;span class="k"&gt;raise&lt;/span&gt; &lt;span class="nc"&gt;ValueError&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;  
                &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;BRIGHT_DATA_API_KEY must be provided via constructor or environment variable. &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;  
                &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Get your API key from: https://brightdata.com/cp/setting/users&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;  
            &lt;span class="p"&gt;)&lt;/span&gt;  

        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;zone&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;  
            &lt;span class="k"&gt;raise&lt;/span&gt; &lt;span class="nc"&gt;ValueError&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;  
                &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;BRIGHT_DATA_ZONE must be provided via constructor or environment variable. &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;  
                &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Manage zones at: https://brightdata.com/cp/zones&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;  
            &lt;span class="p"&gt;)&lt;/span&gt;  

        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;session&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;requests&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Session&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;  
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;session&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;headers&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;update&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;  
            &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Content-Type&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;application/json&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  
            &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Authorization&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Bearer &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;api_key&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;  
        &lt;span class="p"&gt;})&lt;/span&gt;  

    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;search&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;  
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  
        &lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  
        &lt;span class="n"&gt;num_results&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;int&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  
        &lt;span class="n"&gt;language&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;Optional&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  
        &lt;span class="n"&gt;country&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;Optional&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;  
    &lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;Dict&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Any&lt;/span&gt;&lt;span class="p"&gt;]:&lt;/span&gt;  
        &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;  
        Execute a Google search via Bright Data SERP API  

        Args:  
            query: Search query string  
            num_results: Number of results to return (default: 10)  
            language: Language code (e.g., &lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;en&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;, &lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;es&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;, &lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;fr&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;)  
            country: Country code (e.g., &lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;us&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;, &lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;uk&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;, &lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;ca&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;)  

        Returns:  
            Dictionary containing search results in JSON format  
        &lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;  
        &lt;span class="n"&gt;search_url&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;  
            &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;https://www.google.com/search&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;  
            &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;?q=&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;requests&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;utils&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;quote&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;  
            &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;&amp;amp;num=&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;num_results&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;  
            &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;&amp;amp;brd_json=1&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;  
        &lt;span class="p"&gt;)&lt;/span&gt;  

        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;language&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;  
            &lt;span class="n"&gt;search_url&lt;/span&gt; &lt;span class="o"&gt;+=&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;&amp;amp;hl=&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;language&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;&amp;amp;lr=lang_&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;language&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;  

        &lt;span class="n"&gt;target_country&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;country&lt;/span&gt; &lt;span class="ow"&gt;or&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;country&lt;/span&gt;  

        &lt;span class="n"&gt;payload&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;  
            &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;zone&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;zone&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  
            &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;url&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;search_url&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  
            &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;format&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;json&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;  
        &lt;span class="p"&gt;}&lt;/span&gt;  

        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;target_country&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;  
            &lt;span class="n"&gt;payload&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;country&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;target_country&lt;/span&gt;  

        &lt;span class="k"&gt;try&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;  
            &lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;session&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;post&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;  
                &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;api_endpoint&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  
                &lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;payload&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  
                &lt;span class="n"&gt;timeout&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;30&lt;/span&gt;  
            &lt;span class="p"&gt;)&lt;/span&gt;  
            &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;raise_for_status&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;  
            &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;json&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;  

        &lt;span class="k"&gt;except&lt;/span&gt; &lt;span class="n"&gt;requests&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;exceptions&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;HTTPError&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;  
            &lt;span class="n"&gt;error_msg&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Search request failed with HTTP &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;status_code&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;  
            &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;  
                &lt;span class="n"&gt;error_msg&lt;/span&gt; &lt;span class="o"&gt;+=&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="si"&gt;:&lt;/span&gt;&lt;span class="mi"&gt;200&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;  
            &lt;span class="k"&gt;raise&lt;/span&gt; &lt;span class="nc"&gt;RuntimeError&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;error_msg&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="n"&gt;e&lt;/span&gt;  
        &lt;span class="k"&gt;except&lt;/span&gt; &lt;span class="n"&gt;requests&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;exceptions&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;RequestException&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;  
            &lt;span class="k"&gt;raise&lt;/span&gt; &lt;span class="nc"&gt;RuntimeError&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Search request failed: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="n"&gt;e&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Note the &lt;code&gt;brd_json=1&lt;/code&gt; parameter appended to the Google search URL — that’s what tells Bright Data to parse the response and return structured data rather than raw HTML.&lt;/p&gt;
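
&lt;p&gt;As a quick sanity check before wiring it into the scraper, you can exercise the client on its own. A minimal sketch, assuming your &lt;code&gt;.env&lt;/code&gt; is set up and you run it from the project root; the &lt;code&gt;organic&lt;/code&gt; key and per-result field names are what the parsed response typically contains, but inspect your own payload to confirm:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Standalone check of the client (run from the project root).
# "organic" and the per-result keys are assumptions about the parsed
# SERP JSON; print the raw response to confirm for your zone.
from ingest.bright_data import BrightDataClient

client = BrightDataClient()  # reads BRIGHT_DATA_API_KEY / BRIGHT_DATA_ZONE from .env
data = client.search("python programming", num_results=10)

results = data.get("organic", [])
print(f"{len(results)} organic results")
for r in results[:3]:
    print(r.get("title"), r.get("link") or r.get("url"))
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;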

&lt;p&gt;&lt;strong&gt;Configuration&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  API key, zone, country — read from environment variables with constructor overrides, so the client works both in scripts and in environments where secrets come in differently.&lt;/li&gt;
&lt;li&gt;  Without  &lt;code&gt;BRIGHT_DATA_API_KEY&lt;/code&gt; and  &lt;code&gt;BRIGHT_DATA_ZONE&lt;/code&gt;  set, it raises immediately with a message pointing to the right place in the Bright Data dashboard.&lt;/li&gt;
&lt;li&gt;  The client also supports a  &lt;code&gt;language&lt;/code&gt; parameter (e.g.  &lt;code&gt;hl=en&lt;/code&gt;, &lt;code&gt;lr=lang_en&lt;/code&gt;) for non-English or multi-region SERP analysis — pass it to  &lt;code&gt;search()&lt;/code&gt;  or set  &lt;code&gt;BRIGHT_DATA_COUNTRY&lt;/code&gt;  for geo-targeting.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  The DuckDB Manager
&lt;/h3&gt;

&lt;p&gt;We’ll use two databases (the &lt;a href="https://duckdb.org/docs/api/python/overview" rel="noopener noreferrer"&gt;DuckDB Python API&lt;/a&gt; lets us create the files, run SQL, and insert from pandas DataFrames directly):&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  &lt;code&gt;serp_data.duckdb&lt;/code&gt;  — the source DB that holds raw ingest output, and&lt;/li&gt;
&lt;li&gt;  &lt;code&gt;serp_analytics.duckdb&lt;/code&gt;  — our analytics DB, one that holds the transformed models.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;dbt attaches the source DB read-only and writes only to the analytics DB, so raw data stays untouched.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;💡 The keen among you may have noticed that I'm basically stealing the &lt;a href="https://learn.microsoft.com/en-us/azure/databricks/lakehouse/medallion" rel="noopener noreferrer"&gt;medallion architecture pattern&lt;/a&gt; from Databricks/BigQuery projects here. So "bronze" stays untouched, "silver/gold" tables are derived. Why do this? If a dbt model has a bug and you materialize garbage into analytics, your raw data is completely clean and you just re-run.&lt;/p&gt;
&lt;/blockquote&gt;
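
&lt;p&gt;Under the hood this is DuckDB’s own read-only &lt;code&gt;ATTACH&lt;/code&gt;; the dbt-duckdb adapter just configures it for you. A minimal sketch of the same idea straight from Python, with paths matching the layout above and &lt;code&gt;serp_source&lt;/code&gt; as an arbitrary alias:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Minimal sketch of the read-only attach pattern dbt-duckdb will set up for us.
# "serp_source" is an arbitrary alias; paths match the project layout above.
import duckdb

con = duckdb.connect("data/serp_analytics.duckdb")  # the writable analytics DB
con.execute("ATTACH 'data/serp_data.duckdb' AS serp_source (READ_ONLY)")

# Reads across the attachment work fine...
print(con.execute("SELECT COUNT(*) FROM serp_source.main.serp_results").fetchone())

# ...but any INSERT/UPDATE against serp_source raises an error, so a buggy
# transformation can never corrupt the raw "bronze" data.
con.close()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;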

&lt;p&gt;For our source DB, the schema is simple: one  &lt;code&gt;serp_results&lt;/code&gt; table with indexes on query and domain — the two fields that get hit hardest in the dbt transformations downstream.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;ingest/duckdb_manager.py&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;  
DuckDB connection and schema management for SERP ingest  
&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;  

&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;duckdb&lt;/span&gt;  
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;os&lt;/span&gt;  
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;typing&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Optional&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;List&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Dict&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Any&lt;/span&gt;  
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;datetime&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;datetime&lt;/span&gt;  

&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;DuckDBManager&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;  
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Manages DuckDB connection, schema, and insert operations&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;  

    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;__init__&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;db_path&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;data/serp_data.duckdb&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;     
        &lt;span class="c1"&gt;# db_path = "serp_data.duckdb" gives dirname = "", which can cause issues on some setups   
&lt;/span&gt;        &lt;span class="c1"&gt;# So...safer to guard like so:  
&lt;/span&gt;        &lt;span class="n"&gt;parent&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;os&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;path&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;dirname&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;db_path&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;parent&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;  
            &lt;span class="n"&gt;os&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;makedirs&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;parent&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;exist_ok&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;    

        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;db_path&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;db_path&lt;/span&gt;  
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;conn&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;duckdb&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;connect&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;db_path&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;_create_schema&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;  

    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;_create_schema&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;  
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;conn&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;execute&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;  
            CREATE TABLE IF NOT EXISTS serp_results (  
                id BIGINT PRIMARY KEY,  
                query TEXT NOT NULL,  
                timestamp TIMESTAMP NOT NULL,  
                result_position INTEGER NOT NULL,  
                title TEXT,  
                url TEXT,  
                snippet TEXT,  
                domain TEXT,  
                rank INTEGER  
            )  
        &lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  


    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;insert_batch&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;results&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;List&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;Dict&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Any&lt;/span&gt;&lt;span class="p"&gt;]],&lt;/span&gt; &lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;timestamp&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;Optional&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;datetime&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;  
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;timestamp&lt;/span&gt; &lt;span class="ow"&gt;is&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;  
            &lt;span class="n"&gt;timestamp&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;datetime&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;now&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;  

        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="n"&gt;results&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;  
            &lt;span class="k"&gt;return&lt;/span&gt;  

        &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;extract_domain&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;url&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;  
            &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="n"&gt;url&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;  
                &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="sh"&gt;""&lt;/span&gt;  
            &lt;span class="k"&gt;try&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;  
                &lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;urllib.parse&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;urlparse&lt;/span&gt;  
                &lt;span class="n"&gt;parsed&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;urlparse&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;url&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  
                &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;parsed&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;netloc&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;replace&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;www.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;""&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  
            &lt;span class="k"&gt;except&lt;/span&gt; &lt;span class="nb"&gt;Exception&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;  
                &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="sh"&gt;""&lt;/span&gt;  

        &lt;span class="n"&gt;max_id_result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;conn&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;execute&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;SELECT COALESCE(MAX(id), 0) FROM serp_results&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;fetchone&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;  
        &lt;span class="n"&gt;next_id&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;max_id_result&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;max_id_result&lt;/span&gt; &lt;span class="k"&gt;else&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;  

        &lt;span class="n"&gt;rows&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;  
        &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;idx&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nf"&gt;enumerate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;results&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;  
            &lt;span class="n"&gt;url&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;url&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;link&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;''&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;  
            &lt;span class="n"&gt;domain&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;extract_domain&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;url&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  

            &lt;span class="n"&gt;rows&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;  
                &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;id&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;next_id&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;idx&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  
                &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;query&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  
                &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;timestamp&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;timestamp&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  
                &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;result_position&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;idx&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  
                &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;title&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;title&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;''&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;  
                &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;url&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;url&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  
                &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;snippet&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;snippet&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;description&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;''&lt;/span&gt;&lt;span class="p"&gt;)),&lt;/span&gt;  
                &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;domain&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;domain&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  
                &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;rank&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;idx&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;  
            &lt;span class="p"&gt;})&lt;/span&gt;  

        &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;pandas&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;  
        &lt;span class="n"&gt;df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;DataFrame&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;rows&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  
        &lt;span class="c1"&gt;# Best practice is to specify columns explicitly (like, INSERT INTO t (a,b,c) SELECT a,b,c FROM df)   
&lt;/span&gt;        &lt;span class="c1"&gt;# to avoid mismatch if table or DataFrame order changes  
&lt;/span&gt;        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;conn&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;execute&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;  
            INSERT INTO serp_results (id, query, timestamp, result_position, title, url, snippet, domain, rank)  
            SELECT id, query, timestamp, result_position, title, url, snippet, domain, rank FROM df  
        &lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;    

    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;get_row_count&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;int&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;  
        &lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;conn&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;execute&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;SELECT COUNT(*) FROM serp_results&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;fetchone&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;  
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="k"&gt;else&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;  

    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;close&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;  
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;conn&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;close&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;  

    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;__enter__&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;  
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;  

    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;__exit__&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;exc_type&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;exc_val&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;exc_tb&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;  
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;close&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;As a safety net, the &lt;code&gt;insert_batch&lt;/code&gt; method also handles inconsistent SERP field names (&lt;code&gt;link&lt;/code&gt; vs &lt;code&gt;url&lt;/code&gt;, &lt;code&gt;description&lt;/code&gt; vs &lt;code&gt;snippet&lt;/code&gt;).&lt;/p&gt;

&lt;p&gt;Also, it’s worth being honest about what the domain extraction is and isn’t: this is only a best-effort, ingest-time extraction, not the canonical domain value. The dbt staging model is where the real normalization happens — lowercasing, stripping &lt;code&gt;www.&lt;/code&gt;, and falling back to regex extraction when the field is missing. The ingest layer just makes sure &lt;em&gt;something&lt;/em&gt; is in the column.&lt;/p&gt;
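&lt;p&gt;For reference, here’s a minimal sketch of what that best-effort, ingest-time extraction can look like, using Python’s standard &lt;code&gt;urllib.parse&lt;/code&gt; (illustrative only; the actual helper in the ingest code may differ):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Sketch only: put *something* usable in the domain column at ingest time.
# Canonical normalization (lowercasing, stripping www.) still happens in dbt staging.
from urllib.parse import urlparse

def best_effort_domain(result: dict) -&gt; str:
    domain = (result.get('domain') or '').strip()
    if domain:
        return domain
    url = result.get('url') or result.get('link') or ''
    return urlparse(url).netloc or 'unknown'
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;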

&lt;h3&gt;
  
  
  Making the Two Work Together to Ingest Web Data
&lt;/h3&gt;

&lt;p&gt;The scraper loops through a list of queries, calls the Bright Data client for each, and streams results into DuckDB in batches. Put a &lt;code&gt;.env&lt;/code&gt; file in the project root with  &lt;code&gt;BRIGHT_DATA_API_KEY&lt;/code&gt;  and &lt;code&gt;BRIGHT_DATA_ZONE&lt;/code&gt;, then run:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;python ingest/scraper.py &lt;span class="nt"&gt;--count&lt;/span&gt; 50000
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
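&lt;p&gt;The &lt;code&gt;.env&lt;/code&gt; file only needs those two lines (placeholder values shown here):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;# .env (placeholder values: substitute your own Bright Data credentials)
BRIGHT_DATA_API_KEY=your_api_key_here
BRIGHT_DATA_ZONE=your_serp_zone_name
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;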



&lt;p&gt;&lt;strong&gt;Some other time-saving flags to add:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  &lt;code&gt;--queries "foo" "bar"&lt;/code&gt; for custom queries (here, I’ve made it default to 10 tech keywords),&lt;/li&gt;
&lt;li&gt;  &lt;code&gt;--batch-size 10&lt;/code&gt; for results per API call,&lt;/li&gt;
&lt;li&gt;  &lt;code&gt;--delay 1.0&lt;/code&gt; for seconds between calls, and&lt;/li&gt;
&lt;li&gt;  &lt;code&gt;--db path/to/serp_data.duckdb&lt;/code&gt; to override the output path (default: &lt;code&gt;data/serp_data.duckdb&lt;/code&gt;).&lt;/li&gt;
&lt;/ul&gt;
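&lt;p&gt;Putting those together, a typical invocation might look like this (the query strings are just illustrative):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;python ingest/scraper.py --count 500 --batch-size 20 --delay 2.0 --queries "rust web frameworks" "vector databases"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;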

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;  
SERP scraper that streams results to DuckDB (data/serp_data.duckdb).  
Run from project root. Then run dbt to build analytics.  
&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;  

&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;argparse&lt;/span&gt;    
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;time&lt;/span&gt;    
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;sys&lt;/span&gt;    
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;os&lt;/span&gt;    
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;pathlib&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Path&lt;/span&gt;    

&lt;span class="n"&gt;sys&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;path&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;insert&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;os&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;path&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;dirname&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;os&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;path&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;abspath&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;__file__&lt;/span&gt;&lt;span class="p"&gt;)))&lt;/span&gt;  

&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;bright_data&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;BrightDataClient&lt;/span&gt;    
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;duckdb_manager&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;DuckDBManager&lt;/span&gt;  

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;_default_db_path&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;    
    &lt;span class="n"&gt;script_dir&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Path&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;__file__&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;resolve&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="n"&gt;parent&lt;/span&gt;    
    &lt;span class="n"&gt;project_root&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;script_dir&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;parent&lt;/span&gt;    
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nf"&gt;str&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;project_root&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;data&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;serp_data.duckdb&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;scrape_and_insert&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt; &lt;span class="n"&gt;total_results&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;int&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;    
    &lt;span class="n"&gt;queries&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;list&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;    
    &lt;span class="n"&gt;batch_size&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;int&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;    
    &lt;span class="n"&gt;delay_seconds&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;float&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mf"&gt;1.0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;    
    &lt;span class="n"&gt;db_path&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt; &lt;span class="p"&gt;):&lt;/span&gt;    
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;queries&lt;/span&gt; &lt;span class="ow"&gt;is&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;    
        &lt;span class="n"&gt;queries&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;    
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;python programming&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;machine learning&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;web development&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;    
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;data science&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;cloud computing&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;javascript frameworks&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;    
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;database design&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;API development&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;devops tools&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;cybersecurity&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;    
        &lt;span class="p"&gt;]&lt;/span&gt;  

    &lt;span class="n"&gt;db_path&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;db_path&lt;/span&gt; &lt;span class="ow"&gt;or&lt;/span&gt; &lt;span class="nf"&gt;_default_db_path&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;    
    &lt;span class="n"&gt;client&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;BrightDataClient&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;  

    &lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="nc"&gt;DuckDBManager&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;db_path&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;db&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;    
        &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Starting scrape: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;total_results&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; total results&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;    
        &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Database: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;db&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;db_path&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;    
        &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Then run: dbt run --profiles-dir .&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  

        &lt;span class="n"&gt;results_scraped&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;    
        &lt;span class="n"&gt;query_idx&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;    
        &lt;span class="n"&gt;start_time&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;time&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;time&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;    

        &lt;span class="k"&gt;try&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;    
            &lt;span class="k"&gt;while&lt;/span&gt; &lt;span class="n"&gt;results_scraped&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="n"&gt;total_results&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;    
                &lt;span class="n"&gt;query&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;queries&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;query_idx&lt;/span&gt; &lt;span class="o"&gt;%&lt;/span&gt; &lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;queries&lt;/span&gt;&lt;span class="p"&gt;)]&lt;/span&gt;  

                &lt;span class="k"&gt;try&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;    
                    &lt;span class="n"&gt;serp_data&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;search&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;num_results&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;batch_size&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  

                    &lt;span class="n"&gt;organic_results&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;    
                    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="nf"&gt;isinstance&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;serp_data&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nb"&gt;dict&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;    
                        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;organic&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;serp_data&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;    
                            &lt;span class="n"&gt;organic_results&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;serp_data&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;organic&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;    
                        &lt;span class="k"&gt;elif&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;body&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;serp_data&lt;/span&gt; &lt;span class="ow"&gt;and&lt;/span&gt; &lt;span class="nf"&gt;isinstance&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;serp_data&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;body&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="nb"&gt;dict&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;    
                            &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;organic&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;serp_data&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;body&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]:&lt;/span&gt;    
                                &lt;span class="n"&gt;organic_results&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;serp_data&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;body&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;organic&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;  

                    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;organic_results&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;    
                        &lt;span class="n"&gt;db&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;insert_batch&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;organic_results&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;    
                        &lt;span class="n"&gt;results_scraped&lt;/span&gt; &lt;span class="o"&gt;+=&lt;/span&gt; &lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;organic_results&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;      
                        &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;[&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;results_scraped&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;/&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;total_results&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;] Query: &lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt; | Inserted: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;organic_results&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;    

                    &lt;span class="n"&gt;query_idx&lt;/span&gt; &lt;span class="o"&gt;+=&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;  

                    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;results_scraped&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="n"&gt;total_results&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;    
                        &lt;span class="n"&gt;time&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;sleep&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;delay_seconds&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  

                &lt;span class="k"&gt;except&lt;/span&gt; &lt;span class="nb"&gt;Exception&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;    
                    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Error scraping query &lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;    
                    &lt;span class="n"&gt;query_idx&lt;/span&gt; &lt;span class="o"&gt;+=&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;    
                    &lt;span class="k"&gt;continue&lt;/span&gt;  

        &lt;span class="k"&gt;except&lt;/span&gt; &lt;span class="nb"&gt;KeyboardInterrupt&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;    
            &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s"&gt;Scraping interrupted by user&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  

        &lt;span class="n"&gt;elapsed&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;time&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;time&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;start_time&lt;/span&gt;    
        &lt;span class="n"&gt;final_count&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;db&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get_row_count&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;    

        &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s"&gt;=== Scraping Complete ===&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;    
        &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Total rows in DB: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;final_count&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;    
        &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Time elapsed: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;elapsed&lt;/span&gt;&lt;span class="si"&gt;:&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;s&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;    
        &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Rate: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;final_count&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;elapsed&lt;/span&gt;&lt;span class="si"&gt;:&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; rows/sec&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;    

&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;__name__&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;__main__&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;    
    &lt;span class="n"&gt;parser&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;argparse&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;ArgumentParser&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;description&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Scrape SERP results to DuckDB for dbt&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;    
    &lt;span class="n"&gt;parser&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;add_argument&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;--count&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nb"&gt;type&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nb"&gt;int&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;default&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;50000&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;    
    &lt;span class="n"&gt;parser&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;add_argument&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;--batch-size&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nb"&gt;type&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nb"&gt;int&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;default&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;    
    &lt;span class="n"&gt;parser&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;add_argument&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;--delay&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nb"&gt;type&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nb"&gt;float&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;default&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;1.0&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;    
    &lt;span class="n"&gt;parser&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;add_argument&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;--queries&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;nargs&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;+&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;    
    &lt;span class="n"&gt;parser&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;add_argument&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;--db&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  

    &lt;span class="n"&gt;args&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;parser&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;parse_args&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;  

    &lt;span class="nf"&gt;scrape_and_insert&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;    
        &lt;span class="n"&gt;total_results&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;args&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;count&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;    
        &lt;span class="n"&gt;queries&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;args&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;queries&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;    
        &lt;span class="n"&gt;batch_size&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;args&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;batch_size&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;    
        &lt;span class="n"&gt;delay_seconds&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;args&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;delay&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;    
        &lt;span class="n"&gt;db_path&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;args&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;db&lt;/span&gt;    
    &lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;A few things in here that are easy to overlook:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  The SERP API response structure isn’t always consistent — organic results can live at &lt;code&gt;serp_data['organic']&lt;/code&gt; or nested at &lt;code&gt;serp_data['body']['organic']&lt;/code&gt; depending on the response format and search engine (Bright Data supports Google as well as other engines), so the parser checks both.&lt;/li&gt;
&lt;li&gt;  There’s a configurable delay between API calls (&lt;code&gt;--delay&lt;/code&gt;, default 1 second) to avoid hammering the API.&lt;/li&gt;
&lt;li&gt;  Progress is printed after each batch so you can track the run.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;When the scrape finishes, you have a populated  &lt;code&gt;data/serp_data.duckdb&lt;/code&gt;  file with a  &lt;code&gt;serp_results&lt;/code&gt;  table. That’s our raw data. We now have everything dbt needs to begin.&lt;/p&gt;
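&lt;p&gt;Before handing things off to dbt, it’s worth a quick peek at the raw table to confirm the scrape actually landed. An ad-hoc check (not part of the pipeline) could look like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Quick sanity check on the raw database produced by the scraper
import duckdb

con = duckdb.connect("data/serp_data.duckdb", read_only=True)
print(con.execute("select count(*) from serp_results").fetchone()[0])
print(con.execute("select query, domain, title from serp_results limit 5").fetchdf())
con.close()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;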

&lt;h2&gt;
  
  
  Phase 2: Transforming with dbt
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Connecting dbt to DuckDB
&lt;/h3&gt;

&lt;p&gt;The &lt;a href="https://github.com/duckdb/dbt-duckdb" rel="noopener noreferrer"&gt;dbt-duckdb community adapter&lt;/a&gt; supports attaching multiple database files, which is exactly what we need here: we’ll make dbt write transformed models to &lt;code&gt;serp_analytics.duckdb&lt;/code&gt; while treating &lt;code&gt;serp_data.duckdb&lt;/code&gt; as a read-only source. This separation is a good idea because, as a rule, you never want a transformation step to accidentally mutate the raw data it's reading from.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;a href="https://docs.getdbt.com/docs/local/connect-data-platform/duckdb-setup?version=1.12" rel="noopener noreferrer"&gt;Read more about configuring the dbt-duckdb adapter here  &lt;em&gt;👉&lt;/em&gt;&lt;/a&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;a href="https://docs.getdbt.com/docs/core/connect-data-platform/profiles.yml" rel="noopener noreferrer"&gt;profiles.yml&lt;/a&gt;  is where the connection lives. The  &lt;a href="https://docs.getdbt.com/reference/resource-configs/duckdb-configs#attaching-additional-databases" rel="noopener noreferrer"&gt;attach&lt;/a&gt;  block is the part that makes this work — it mounts the raw DB under the alias  &lt;code&gt;serp_source&lt;/code&gt;, which is the database name the source declaration will reference:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;profiles.yml:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;serp_analytics&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;  
  &lt;span class="na"&gt;target&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;dev&lt;/span&gt;  
  &lt;span class="na"&gt;outputs&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;  
    &lt;span class="na"&gt;dev&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;  
      &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;duckdb&lt;/span&gt;  
      &lt;span class="na"&gt;path&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;data/serp_analytics.duckdb&lt;/span&gt;  
      &lt;span class="na"&gt;threads&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;1&lt;/span&gt;  
      &lt;span class="na"&gt;settings&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;  
        &lt;span class="na"&gt;max_temp_directory_size&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;10GB'&lt;/span&gt;  
        &lt;span class="na"&gt;memory_limit&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;12GB'&lt;/span&gt;  
        &lt;span class="na"&gt;preserve_insertion_order&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;false&lt;/span&gt;  
      &lt;span class="na"&gt;attach&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;  
        &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;path&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;data/serp_data.duckdb&lt;/span&gt;  
          &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;duckdb&lt;/span&gt;  
          &lt;span class="na"&gt;alias&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;serp_source&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Tune &lt;code&gt;memory_limit&lt;/code&gt; and &lt;code&gt;max_temp_directory_size&lt;/code&gt; based on how much data you’re working with. I’ve sized these values for a stress-free 50k-row scrape, but you may want to lower them on smaller machines.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://docs.getdbt.com/reference/dbt_project.yml" rel="noopener noreferrer"&gt;dbt_project.yml&lt;/a&gt;  holds the project-level configuration — model paths,  &lt;a href="https://docs.getdbt.com/docs/build/materializations" rel="noopener noreferrer"&gt;materialization&lt;/a&gt;  defaults per layer, and the two variables that control pipeline behavior:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;dbt_project.yml&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;serp_analytics&lt;/span&gt;    
&lt;span class="na"&gt;version&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;1.0.0&lt;/span&gt;    
&lt;span class="na"&gt;config-version&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;2&lt;/span&gt;    
&lt;span class="na"&gt;profile&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;serp_analytics&lt;/span&gt;  

&lt;span class="na"&gt;model-paths&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;models"&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;    
&lt;span class="na"&gt;test-paths&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;tests"&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;    
&lt;span class="na"&gt;target-path&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;target"&lt;/span&gt;    
&lt;span class="na"&gt;clean-targets&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;    
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;target"&lt;/span&gt;    
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;dbt_packages"&lt;/span&gt;  

&lt;span class="na"&gt;models&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;    
  &lt;span class="na"&gt;serp_analytics&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;    
    &lt;span class="na"&gt;staging&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;    
      &lt;span class="na"&gt;+schema&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;staging&lt;/span&gt;    
    &lt;span class="na"&gt;intermediate&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;    
      &lt;span class="na"&gt;+schema&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;intermediate&lt;/span&gt;    
    &lt;span class="na"&gt;marts&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;    
      &lt;span class="na"&gt;+schema&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;marts&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Declaring the Source
&lt;/h3&gt;

&lt;p&gt;In dbt, you don’t reference raw tables directly in your models. You  &lt;a href="https://docs.getdbt.com/docs/build/sources" rel="noopener noreferrer"&gt;declare them as sources&lt;/a&gt;  first, and this is what makes lineage tracking possible.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;models/sources.yml&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;version&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;2&lt;/span&gt;  
&lt;span class="na"&gt;sources&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;  
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;serp_source&lt;/span&gt;  
    &lt;span class="na"&gt;description&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Raw SERP data from Bright Data collection&lt;/span&gt;  
    &lt;span class="na"&gt;database&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;serp_source&lt;/span&gt;  
    &lt;span class="na"&gt;schema&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;main&lt;/span&gt;  
    &lt;span class="na"&gt;tables&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;  
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;serp_results&lt;/span&gt;  
        &lt;span class="na"&gt;description&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Raw search engine result pages - id, query, url, domain, rank, position, etc.&lt;/span&gt;  
        &lt;span class="na"&gt;columns&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;  
          &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;id&lt;/span&gt;  
            &lt;span class="na"&gt;description&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Primary key&lt;/span&gt;  
          &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;query&lt;/span&gt;  
            &lt;span class="na"&gt;description&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Search query string&lt;/span&gt;  
          &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;url&lt;/span&gt;  
            &lt;span class="na"&gt;description&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Result URL&lt;/span&gt;  
          &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;domain&lt;/span&gt;  
            &lt;span class="na"&gt;description&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Extracted domain (e.g. example.com)&lt;/span&gt;  
          &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;result_position&lt;/span&gt;  
            &lt;span class="na"&gt;description&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Position on SERP (1-based)&lt;/span&gt;  
          &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;rank&lt;/span&gt;  
            &lt;span class="na"&gt;description&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Rank value (can differ from position)&lt;/span&gt;  
          &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;title&lt;/span&gt;  
            &lt;span class="na"&gt;description&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Result title&lt;/span&gt;  
          &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;timestamp&lt;/span&gt;  
            &lt;span class="na"&gt;description&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Scrape timestamp&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The  &lt;code&gt;database: serp_source&lt;/code&gt;  matches the attach alias — that’s how dbt finds the raw table. From here, every model references  &lt;a href="https://docs.getdbt.com/reference/dbt-jinja-functions/source" rel="noopener noreferrer"&gt;{{ source() }}&lt;/a&gt;  or  &lt;a href="https://docs.getdbt.com/reference/dbt-jinja-functions/ref" rel="noopener noreferrer"&gt;{{ ref() }}&lt;/a&gt;  rather than a raw table path. This doesn’t sound too important, but in fact, it’s what gives dbt the information it needs to build a  &lt;a href="https://www.getdbt.com/blog/getting-started-with-data-lineage" rel="noopener noreferrer"&gt;lineage graph&lt;/a&gt;  — a visual DAG of how every output table was derived from the source.&lt;/p&gt;
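&lt;p&gt;Concretely, the difference is just in how a model references the table, and only the second form is visible to dbt’s lineage graph (a quick sketch; the staging model below uses the same pattern):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;-- Hard-coded reference: works, but dbt can't see the dependency
select * from serp_source.main.serp_results;

-- Declared source: same table, now tracked in the lineage graph
select * from {{ source('serp_source', 'serp_results') }}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;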

&lt;p&gt;It’s worth talking about what the lineage graph  &lt;em&gt;actually&lt;/em&gt; buys you, because it’s easy to dismiss as a visualization feature:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  When something looks wrong in a mart, you trace upstream through the graph to find where it came from.&lt;/li&gt;
&lt;li&gt;  When you’re about to refactor staging, you can see which intermediate models and marts depend on it before you touch anything.&lt;/li&gt;
&lt;li&gt;  When someone new joins the project, they get the full picture of how every output table was derived — in seconds, not by grepping through SQL files.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Run  &lt;code&gt;dbt docs generate &amp;amp;&amp;amp; dbt docs serve&lt;/code&gt;  to view it in the browser.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fivd8o932vmq4g14crn5p.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fivd8o932vmq4g14crn5p.png" alt="dbt lineage graph showing SERP data flow from source serp_results through staging, intermediate, and marts models." width="800" height="375"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Lineage graph from dbt docs serve, showing the dependency chain from raw SERP data to analytics-ready aggregations.&lt;/em&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;See the &lt;a href="https://docs.getdbt.com/reference/commands/cmd-docs" rel="noopener noreferrer"&gt;dbt docs&lt;/a&gt; for more details on these commands.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h3&gt;
  
  
  Staging: The Contract
&lt;/h3&gt;

&lt;p&gt;The staging model is where all the defensive work happens. Its job is a guarantee: by the time a row leaves staging, it is clean, normalized, and deduplicated. Every model downstream gets to assume that contract holds, which means none of them have to repeat the defensive logic.&lt;/p&gt;

&lt;p&gt;Staging and intermediate models use &lt;a href="https://docs.getdbt.com/reference/resource-configs/materialized" rel="noopener noreferrer"&gt;materialized='view'&lt;/a&gt;; marts use &lt;code&gt;materialized='table'&lt;/code&gt;.&lt;/p&gt;
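&lt;p&gt;Each model sets this via &lt;code&gt;{{ config(...) }}&lt;/code&gt; below, but the same defaults could also be declared once per layer in &lt;code&gt;dbt_project.yml&lt;/code&gt;; a sketch:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;# Optional alternative (sketch): per-layer materialization defaults,
# merged into the models block of dbt_project.yml shown earlier
models:
  serp_analytics:
    staging:
      +materialized: view
    intermediate:
      +materialized: view
    marts:
      +materialized: table
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;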

&lt;p&gt;&lt;strong&gt;models/staging/stg_serp_results.sql&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="p"&gt;{{&lt;/span&gt;  
  &lt;span class="n"&gt;config&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;  
    &lt;span class="n"&gt;materialized&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;'view'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  
  &lt;span class="p"&gt;)&lt;/span&gt;  
&lt;span class="p"&gt;}}&lt;/span&gt;  
&lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="n"&gt;raw&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;  
    &lt;span class="k"&gt;select&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="p"&gt;{{&lt;/span&gt; &lt;span class="k"&gt;source&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'serp_source'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'serp_results'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;}}&lt;/span&gt;  
&lt;span class="p"&gt;),&lt;/span&gt;  

&lt;span class="n"&gt;cleaned&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;  
    &lt;span class="k"&gt;select&lt;/span&gt;  
        &lt;span class="n"&gt;id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  
        &lt;span class="k"&gt;trim&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  
        &lt;span class="nb"&gt;timestamp&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  
        &lt;span class="n"&gt;result_position&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  
        &lt;span class="n"&gt;coalesce&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;nullif&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;trim&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;title&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="s1"&gt;''&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="s1"&gt;'Untitled'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;title&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  
        &lt;span class="n"&gt;url&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  
        &lt;span class="n"&gt;coalesce&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;nullif&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;trim&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;snippet&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="s1"&gt;''&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="s1"&gt;''&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;snippet&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  
        &lt;span class="k"&gt;case&lt;/span&gt;  
            &lt;span class="k"&gt;when&lt;/span&gt; &lt;span class="k"&gt;domain&lt;/span&gt; &lt;span class="k"&gt;is&lt;/span&gt; &lt;span class="k"&gt;null&lt;/span&gt; &lt;span class="k"&gt;or&lt;/span&gt; &lt;span class="k"&gt;trim&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;domain&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;''&lt;/span&gt; &lt;span class="k"&gt;then&lt;/span&gt; &lt;span class="n"&gt;coalesce&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;  
                &lt;span class="n"&gt;regexp_extract&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;url&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'^https?://(?:www&lt;/span&gt;&lt;span class="se"&gt;\.&lt;/span&gt;&lt;span class="s1"&gt;)?([^/]+)'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;  
                &lt;span class="s1"&gt;'unknown'&lt;/span&gt;  
            &lt;span class="p"&gt;)&lt;/span&gt;  
            &lt;span class="k"&gt;else&lt;/span&gt; &lt;span class="k"&gt;lower&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;regexp_replace&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;trim&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;domain&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="s1"&gt;'^www&lt;/span&gt;&lt;span class="se"&gt;\.&lt;/span&gt;&lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;''&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'i'&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;  
        &lt;span class="k"&gt;end&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="k"&gt;domain&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  
        &lt;span class="n"&gt;rank&lt;/span&gt;    
    &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="n"&gt;raw&lt;/span&gt;    
&lt;span class="p"&gt;),&lt;/span&gt;  
&lt;span class="n"&gt;deduplicated&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;  
    &lt;span class="k"&gt;select&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;  
    &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;  
        &lt;span class="k"&gt;select&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  
            &lt;span class="n"&gt;row_number&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="n"&gt;over&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;partition&lt;/span&gt; &lt;span class="k"&gt;by&lt;/span&gt; &lt;span class="n"&gt;url&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;query&lt;/span&gt; &lt;span class="k"&gt;order&lt;/span&gt; &lt;span class="k"&gt;by&lt;/span&gt; &lt;span class="n"&gt;id&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;rn&lt;/span&gt;  
        &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="n"&gt;cleaned&lt;/span&gt;  
    &lt;span class="p"&gt;)&lt;/span&gt;  
    &lt;span class="k"&gt;where&lt;/span&gt; &lt;span class="n"&gt;rn&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;  
&lt;span class="p"&gt;)&lt;/span&gt;  
&lt;span class="k"&gt;select&lt;/span&gt;  
    &lt;span class="n"&gt;id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nb"&gt;timestamp&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;result_position&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;title&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  
    &lt;span class="n"&gt;url&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;snippet&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;domain&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;rank&lt;/span&gt;    
&lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="n"&gt;deduplicated&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now, the domain normalization block is the part worth paying attention to.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  The scraper provides a  &lt;code&gt;domain&lt;/code&gt;  field, but it isn't always populated, and even when it is, capitalization and  &lt;code&gt;www.&lt;/code&gt;  prefixes create false cardinality in downstream aggregations.&lt;/li&gt;
&lt;li&gt;  The  &lt;code&gt;CASE&lt;/code&gt;  expression handles both paths: if the field is present, lowercase it and strip  &lt;code&gt;www.&lt;/code&gt;; if it's missing, fall back to extracting the domain from the URL via  &lt;a href="https://duckdb.org/docs/sql/functions/char#regular-expressions" rel="noopener noreferrer"&gt;DuckDB's&lt;/a&gt; &lt;a href="https://duckdb.org/docs/sql/functions/char#regular-expressions" rel="noopener noreferrer"&gt;regexp_extract&lt;/a&gt; /  &lt;a href="https://duckdb.org/docs/sql/functions/char#regular-expressions" rel="noopener noreferrer"&gt;regexp_replace&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;  Without this, &lt;code&gt;www.Example.com&lt;/code&gt;, &lt;code&gt;example.com&lt;/code&gt;, and a row where &lt;code&gt;domain&lt;/code&gt; is null but the URL is &lt;code&gt;https://www.example.com/...&lt;/code&gt; all count as different domains in every GROUP BY. That kind of silent cardinality inflation is exactly what staging is supposed to catch (a quick way to measure it is sketched right after this list).&lt;/li&gt;
&lt;/ul&gt;
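&lt;p&gt;A quick, ad-hoc way to see how much cardinality the normalization removes is to run an illustrative query like this directly against the raw DuckDB file (not part of the pipeline):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;-- Compare raw vs normalized distinct-domain counts in the raw table
select
    count(distinct domain) as raw_domains,
    count(distinct lower(regexp_replace(trim(domain), '^www\.', '', 'i'))) as normalized_domains
from serp_results;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;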

&lt;p&gt;Deduplication happens last. A window function partitions by  &lt;code&gt;(url, query)&lt;/code&gt;  and keeps the row with the lowest  &lt;code&gt;id&lt;/code&gt;  — the earliest record if a URL was collected more than once for a given query.&lt;/p&gt;

&lt;h3&gt;
  
  
  What about tests?
&lt;/h3&gt;

&lt;p&gt;Add  &lt;code&gt;models/staging/stg_serp_results.yml&lt;/code&gt;  next to your staging model with column-level tests — straightforward  &lt;a href="https://docs.getdbt.com/reference/resource-properties/data-tests#not_null" rel="noopener noreferrer"&gt;not_null&lt;/a&gt;  checks on the four fields that everything downstream depends on:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;version&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;2&lt;/span&gt;  
&lt;span class="na"&gt;models&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;  
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;stg_serp_results&lt;/span&gt;  
    &lt;span class="na"&gt;description&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Cleaned and deduplicated SERP results; domain normalized&lt;/span&gt;  
    &lt;span class="na"&gt;columns&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;  
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;url&lt;/span&gt;  
        &lt;span class="na"&gt;description&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Result URL&lt;/span&gt;  
        &lt;span class="na"&gt;tests&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;  
          &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;not_null&lt;/span&gt;  
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;query&lt;/span&gt;  
        &lt;span class="na"&gt;description&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Search query&lt;/span&gt;  
        &lt;span class="na"&gt;tests&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;  
          &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;not_null&lt;/span&gt;  
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;domain&lt;/span&gt;  
        &lt;span class="na"&gt;description&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Normalized domain&lt;/span&gt;  
        &lt;span class="na"&gt;tests&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;  
          &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;not_null&lt;/span&gt;  
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;result_position&lt;/span&gt;  
        &lt;span class="na"&gt;description&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;SERP position (1-based)&lt;/span&gt;  
        &lt;span class="na"&gt;tests&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;  
          &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;not_null&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;There’s also a custom test in &lt;code&gt;tests/unique_url_query.sql&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="c1"&gt;-- Custom test because (url, query) should be unique in staging  &lt;/span&gt;
&lt;span class="k"&gt;select&lt;/span&gt; &lt;span class="n"&gt;url&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;count&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;cnt&lt;/span&gt;  
&lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="p"&gt;{{&lt;/span&gt; &lt;span class="k"&gt;ref&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'stg_serp_results'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;}}&lt;/span&gt;  
&lt;span class="k"&gt;group&lt;/span&gt; &lt;span class="k"&gt;by&lt;/span&gt; &lt;span class="n"&gt;url&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;query&lt;/span&gt;  
&lt;span class="k"&gt;having&lt;/span&gt; &lt;span class="k"&gt;count&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Why? Because dbt’s built-in  &lt;a href="https://docs.getdbt.com/reference/resource-properties/data-tests#unique" rel="noopener noreferrer"&gt;unique&lt;/a&gt;  test only checks a single column. Our custom test checks a  &lt;em&gt;combination&lt;/em&gt;  — a given URL should appear only once per query after deduplication.&lt;/p&gt;
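&lt;p&gt;If you’d rather not maintain a standalone SQL file, the &lt;a href="https://github.com/dbt-labs/dbt-utils" rel="noopener noreferrer"&gt;dbt-utils&lt;/a&gt; package ships a generic test for exactly this case. Assuming you’ve added &lt;code&gt;dbt_utils&lt;/code&gt; to &lt;code&gt;packages.yml&lt;/code&gt; and run &lt;code&gt;dbt deps&lt;/code&gt;, the same rule can live in the model’s YAML instead:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;# Model-level equivalent of tests/unique_url_query.sql (requires the dbt_utils package)
models:
  - name: stg_serp_results
    tests:
      - dbt_utils.unique_combination_of_columns:
          combination_of_columns:
            - url
            - query
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Either way works; the custom SQL version just keeps the project dependency-free.&lt;/p&gt;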

&lt;p&gt;If the window function logic ever breaks due to a refactor, or if upstream data arrives in a shape the deduplication didn't account for, this test catches it before bad data reaches the marts. That's the value of encoding the business rule as a test: it runs every time, and it fails loudly.&lt;/p&gt;

&lt;h3&gt;
  
  
  Intermediate: Filter Once, Trust Everywhere
&lt;/h3&gt;

&lt;p&gt;The intermediate model is intentionally short (also a view).&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;models/intermediate/int_serp_results.sql&lt;/strong&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="p"&gt;{{&lt;/span&gt;  
  &lt;span class="n"&gt;config&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;  
    &lt;span class="n"&gt;materialized&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;'view'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  
  &lt;span class="p"&gt;)&lt;/span&gt;  
&lt;span class="p"&gt;}}&lt;/span&gt;  
&lt;span class="k"&gt;select&lt;/span&gt;  
    &lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  
    &lt;span class="k"&gt;domain&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  
    &lt;span class="n"&gt;url&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  
    &lt;span class="n"&gt;result_position&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  
    &lt;span class="nb"&gt;timestamp&lt;/span&gt;  
&lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="p"&gt;{{&lt;/span&gt; &lt;span class="k"&gt;ref&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'stg_serp_results'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;}}&lt;/span&gt;  
&lt;span class="k"&gt;where&lt;/span&gt; &lt;span class="n"&gt;result_position&lt;/span&gt; &lt;span class="k"&gt;is&lt;/span&gt; &lt;span class="k"&gt;not&lt;/span&gt; &lt;span class="k"&gt;null&lt;/span&gt;  
  &lt;span class="k"&gt;and&lt;/span&gt; &lt;span class="n"&gt;result_position&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;  
  &lt;span class="k"&gt;and&lt;/span&gt; &lt;span class="k"&gt;domain&lt;/span&gt; &lt;span class="k"&gt;is&lt;/span&gt; &lt;span class="k"&gt;not&lt;/span&gt; &lt;span class="k"&gt;null&lt;/span&gt;  
  &lt;span class="k"&gt;and&lt;/span&gt; &lt;span class="k"&gt;domain&lt;/span&gt; &lt;span class="o"&gt;!=&lt;/span&gt; &lt;span class="s1"&gt;'unknown'&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;You might wonder why this exists as its own layer rather than putting these filters in each mart. The reason is that all four mart models need the same filtered dataset.&lt;/p&gt;

&lt;p&gt;If the filter logic lived in each mart, changing the definition of a “valid” result would mean changing it in four places and hoping they stayed in sync — which they won’t, eventually. Intermediate models are dbt’s answer to that: define the analysis-ready dataset once, name it, and let everything downstream reference it.&lt;/p&gt;

&lt;h3&gt;
  
  
  Marts: Four Questions, Four Tables
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://docs.getdbt.com/best-practices/how-we-structure/4-marts?version=1.12" rel="noopener noreferrer"&gt;The mart layer&lt;/a&gt;  is where the pipeline produces something you’d actually hand to an analyst or wire to a dashboard.&lt;/p&gt;

&lt;p&gt;Staging and intermediate models are materialized as  &lt;em&gt;views&lt;/em&gt;  — they’re always fresh and don’t consume extra storage. Marts are materialized as  &lt;em&gt;tables&lt;/em&gt;  — written to disk on every run — so queries against them are fast regardless of upstream complexity.&lt;/p&gt;

&lt;p&gt;That split keeps the feedback loop fast during development while giving analysts precomputed tables to query.&lt;/p&gt;
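&lt;p&gt;Each model declares its materialization explicitly via &lt;code&gt;config()&lt;/code&gt;, but you can also set folder-level defaults in &lt;code&gt;dbt_project.yml&lt;/code&gt; so individual models only override when they need to. A minimal sketch, with the project name as a placeholder for your own:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;# dbt_project.yml (excerpt): folder-level materialization defaults
models:
  serp_pipeline:            # placeholder: your project name
    staging:
      +materialized: view
    intermediate:
      +materialized: view
    marts:
      +materialized: table
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;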

&lt;p&gt;&lt;strong&gt;Query Coverage&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;models/marts/agg_query_coverage.sql&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="p"&gt;{{&lt;/span&gt;  
  &lt;span class="n"&gt;config&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;  
    &lt;span class="n"&gt;materialized&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;'table'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  
  &lt;span class="p"&gt;)&lt;/span&gt;  
&lt;span class="p"&gt;}}&lt;/span&gt;  
&lt;span class="k"&gt;select&lt;/span&gt;  
    &lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  
    &lt;span class="k"&gt;count&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;distinct&lt;/span&gt; &lt;span class="n"&gt;url&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;unique_urls&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  
    &lt;span class="k"&gt;count&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;distinct&lt;/span&gt; &lt;span class="k"&gt;domain&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;unique_domains&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  
    &lt;span class="k"&gt;count&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;total_results&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  
    &lt;span class="k"&gt;min&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;result_position&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;best_position&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  
    &lt;span class="k"&gt;max&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;result_position&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;worst_position&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  
    &lt;span class="k"&gt;avg&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;result_position&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;avg_position&lt;/span&gt;  
&lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="p"&gt;{{&lt;/span&gt; &lt;span class="k"&gt;ref&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'int_serp_results'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;}}&lt;/span&gt;  
&lt;span class="k"&gt;group&lt;/span&gt; &lt;span class="k"&gt;by&lt;/span&gt; &lt;span class="n"&gt;query&lt;/span&gt;  
&lt;span class="k"&gt;order&lt;/span&gt; &lt;span class="k"&gt;by&lt;/span&gt; &lt;span class="n"&gt;unique_domains&lt;/span&gt; &lt;span class="k"&gt;desc&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;For each search query: how many unique URLs and domains appeared, and what was the spread of positions? This is where you’d start to understand which queries have competitive SERP landscapes — lots of domains spread across positions — versus which are dominated by a handful of sites.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Rank Distribution&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;models/marts/agg_rank_distribution.sql&lt;/strong&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="p"&gt;{{&lt;/span&gt;  
  &lt;span class="n"&gt;config&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;  
    &lt;span class="n"&gt;materialized&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;'table'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  
  &lt;span class="p"&gt;)&lt;/span&gt;  
&lt;span class="p"&gt;}}&lt;/span&gt;  
&lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="n"&gt;rank_buckets&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;  
    &lt;span class="k"&gt;select&lt;/span&gt;  
        &lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  
        &lt;span class="k"&gt;domain&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  
        &lt;span class="k"&gt;case&lt;/span&gt;  
            &lt;span class="k"&gt;when&lt;/span&gt; &lt;span class="n"&gt;result_position&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;=&lt;/span&gt; &lt;span class="mi"&gt;3&lt;/span&gt; &lt;span class="k"&gt;then&lt;/span&gt; &lt;span class="s1"&gt;'1-3'&lt;/span&gt;  
            &lt;span class="k"&gt;when&lt;/span&gt; &lt;span class="n"&gt;result_position&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;=&lt;/span&gt; &lt;span class="mi"&gt;10&lt;/span&gt; &lt;span class="k"&gt;then&lt;/span&gt; &lt;span class="s1"&gt;'4-10'&lt;/span&gt;  
            &lt;span class="k"&gt;when&lt;/span&gt; &lt;span class="n"&gt;result_position&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;=&lt;/span&gt; &lt;span class="mi"&gt;20&lt;/span&gt; &lt;span class="k"&gt;then&lt;/span&gt; &lt;span class="s1"&gt;'11-20'&lt;/span&gt;  
            &lt;span class="k"&gt;when&lt;/span&gt; &lt;span class="n"&gt;result_position&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;=&lt;/span&gt; &lt;span class="mi"&gt;50&lt;/span&gt; &lt;span class="k"&gt;then&lt;/span&gt; &lt;span class="s1"&gt;'21-50'&lt;/span&gt;  
            &lt;span class="k"&gt;else&lt;/span&gt; &lt;span class="s1"&gt;'50+'&lt;/span&gt;  
        &lt;span class="k"&gt;end&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;position_bucket&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  
        &lt;span class="n"&gt;result_position&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  
        &lt;span class="k"&gt;count&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;appearances&lt;/span&gt;  
    &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="p"&gt;{{&lt;/span&gt; &lt;span class="k"&gt;ref&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'int_serp_results'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;}}&lt;/span&gt;  
    &lt;span class="k"&gt;group&lt;/span&gt; &lt;span class="k"&gt;by&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;4&lt;/span&gt;  
&lt;span class="p"&gt;)&lt;/span&gt;  
&lt;span class="k"&gt;select&lt;/span&gt;  
    &lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  
    &lt;span class="n"&gt;position_bucket&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  
    &lt;span class="k"&gt;count&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;distinct&lt;/span&gt; &lt;span class="k"&gt;domain&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;unique_domains&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  
    &lt;span class="k"&gt;sum&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;appearances&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;total_appearances&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  
    &lt;span class="k"&gt;avg&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;result_position&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;avg_position&lt;/span&gt;  
&lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="n"&gt;rank_buckets&lt;/span&gt;  
&lt;span class="k"&gt;group&lt;/span&gt; &lt;span class="k"&gt;by&lt;/span&gt; &lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;position_bucket&lt;/span&gt;  
&lt;span class="k"&gt;order&lt;/span&gt; &lt;span class="k"&gt;by&lt;/span&gt; &lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;position_bucket&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The position buckets — 1–3, 4–10, 11–20, 21–50, 50+ — map to how SEO practitioners actually think about SERP real estate. Positions 1–3 capture the majority of clicks. Anything past 10 is largely invisible. By bucketing in the mart rather than the BI layer, you’re encoding that domain knowledge once, in version-controlled SQL, rather than relying on every analyst who touches the data to re-derive it correctly.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Domain Rank Summary&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;models/marts/agg_domain_rank_summary.sql&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="p"&gt;{{&lt;/span&gt;  
  &lt;span class="n"&gt;config&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;  
    &lt;span class="n"&gt;materialized&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;'table'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  
  &lt;span class="p"&gt;)&lt;/span&gt;  
&lt;span class="p"&gt;}}&lt;/span&gt;  
&lt;span class="k"&gt;select&lt;/span&gt;  
    &lt;span class="k"&gt;domain&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  
    &lt;span class="k"&gt;count&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;total_appearances&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  
    &lt;span class="k"&gt;count&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;distinct&lt;/span&gt; &lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;query_coverage&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  
    &lt;span class="k"&gt;count&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;distinct&lt;/span&gt; &lt;span class="n"&gt;url&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;unique_urls&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  
    &lt;span class="k"&gt;avg&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;result_position&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;avg_position&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  
    &lt;span class="k"&gt;min&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;result_position&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;best_position&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  
    &lt;span class="k"&gt;max&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;result_position&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;worst_position&lt;/span&gt;  
&lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="p"&gt;{{&lt;/span&gt; &lt;span class="k"&gt;ref&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'int_serp_results'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;}}&lt;/span&gt;  
&lt;span class="k"&gt;group&lt;/span&gt; &lt;span class="k"&gt;by&lt;/span&gt; &lt;span class="k"&gt;domain&lt;/span&gt;  
&lt;span class="k"&gt;order&lt;/span&gt; &lt;span class="k"&gt;by&lt;/span&gt; &lt;span class="n"&gt;total_appearances&lt;/span&gt; &lt;span class="k"&gt;desc&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This flips the perspective from query-centric to domain-centric.  &lt;code&gt;query_coverage&lt;/code&gt;  is the field I find most useful here — it tells you how many distinct queries a domain appeared in, which is a rough proxy for breadth of SERP presence. A domain with high  &lt;code&gt;total_appearances&lt;/code&gt;  but low  &lt;code&gt;query_coverage&lt;/code&gt;  is strong in a narrow area; a domain with both metrics high is broadly dominant.&lt;/p&gt;
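&lt;p&gt;For example, a quick query against this mart surfaces the broadly dominant domains. The threshold here is arbitrary; tune it to the size of your keyword set:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;-- Domains that rank for many different queries AND show up a lot overall
select domain, query_coverage, total_appearances, avg_position
from agg_domain_rank_summary
where query_coverage &amp;gt;= 10   -- arbitrary threshold
order by query_coverage desc, total_appearances desc
limit 20
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;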

&lt;p&gt;&lt;strong&gt;Domain x Query Matrix&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;models/marts/agg_domain_query_matrix.sql&lt;/strong&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="p"&gt;{{&lt;/span&gt;  
  &lt;span class="n"&gt;config&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;  
    &lt;span class="n"&gt;materialized&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;'table'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  
  &lt;span class="p"&gt;)&lt;/span&gt;  
&lt;span class="p"&gt;}}&lt;/span&gt;  
&lt;span class="c1"&gt;-- Domain x query matrix: best position and appearances per (domain, query)  &lt;/span&gt;
&lt;span class="k"&gt;select&lt;/span&gt;  
    &lt;span class="k"&gt;domain&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  
    &lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  
    &lt;span class="k"&gt;min&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;result_position&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;best_position&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  
    &lt;span class="k"&gt;count&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;appearances&lt;/span&gt;  
&lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="p"&gt;{{&lt;/span&gt; &lt;span class="k"&gt;ref&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'int_serp_results'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;}}&lt;/span&gt;  
&lt;span class="k"&gt;group&lt;/span&gt; &lt;span class="k"&gt;by&lt;/span&gt; &lt;span class="k"&gt;domain&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;query&lt;/span&gt;  
&lt;span class="k"&gt;order&lt;/span&gt; &lt;span class="k"&gt;by&lt;/span&gt; &lt;span class="k"&gt;domain&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;query&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The most compact mart, but also the most useful for visualization. Each row is a domain–query pair: the best position that domain achieved for that query, and how many times it appeared. Pivot this by query and you have an SEO visibility matrix — at a glance, which domains rank consistently across your whole keyword set, and which are only competitive in specific corners of it.&lt;/p&gt;
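&lt;p&gt;If you want that pivot in DuckDB itself rather than in a BI tool, DuckDB’s &lt;code&gt;PIVOT&lt;/code&gt; statement does it in one step. A sketch (this assumes a reasonably recent DuckDB, and a keyword set small enough to fan out into columns):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;-- One row per domain, one column per query, cell = best position achieved
PIVOT agg_domain_query_matrix
ON query
USING min(best_position)
GROUP BY domain;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;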

&lt;h2&gt;
  
  
  Running the Pipeline
&lt;/h2&gt;

&lt;p&gt;From your project root (where  &lt;code&gt;dbt_project.yml&lt;/code&gt;  and  &lt;code&gt;profiles.yml&lt;/code&gt;  live):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Mind the " ."  &lt;/span&gt;
dbt run &lt;span class="nt"&gt;--profiles-dir&lt;/span&gt; &lt;span class="nb"&gt;.&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;dbt resolves the &lt;code&gt;ref()&lt;/code&gt; graph before executing anything, so models always run in the right dependency order — staging first, then intermediate, then the four marts.&lt;/p&gt;
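&lt;p&gt;You also don’t have to rebuild everything every time. dbt’s node selection syntax lets you run a single branch of that graph:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;# Rebuild the staging model and everything downstream of it
dbt run --select stg_serp_results+ --profiles-dir .

# Or just the mart folder
dbt run --select path:models/marts --profiles-dir .
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;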

&lt;p&gt;&lt;strong&gt;To run tests after materializing:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;dbt &lt;span class="nb"&gt;test&lt;/span&gt; &lt;span class="nt"&gt;--profiles-dir&lt;/span&gt; &lt;span class="nb"&gt;.&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;To view the lineage graph and model docs in the browser:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;dbt docs generate &lt;span class="nt"&gt;--profiles-dir&lt;/span&gt; &lt;span class="nb"&gt;.&lt;/span&gt;  

dbt docs serve
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If &lt;code&gt;profiles.yml&lt;/code&gt; is in the project root, &lt;code&gt;--profiles-dir .&lt;/code&gt; tells dbt to look there. If you use &lt;code&gt;~/.dbt/profiles.yml&lt;/code&gt;, you can omit it.&lt;/p&gt;

&lt;p&gt;If the deduplication breaks, or if upstream data arrives with unexpected nulls in critical columns, the tests surface it before bad data reaches the marts. That’s the other thing the pipeline buys you beyond SQL organization: the tests run as a first-class step, not as a notebook someone wrote once and forgot to share.&lt;/p&gt;

&lt;p&gt;The raw web data you ingest is  &lt;em&gt;queryable,&lt;/em&gt;  certainly (and SERP APIs like Bright Data make that very convenient with their structured JSON response), but it isn’t  &lt;em&gt;trustworthy&lt;/em&gt;  in the way analytics requires — where “trustworthy” means that every analyst querying it gets the same answers, because the decisions about normalization, deduplication, and what constitutes a valid result were made once and encoded somewhere they can be found.&lt;/p&gt;

&lt;p&gt;Without a pipeline like this, those decisions happen ad hoc. Someone normalizes the domain in a notebook. Someone else doesn’t. The counts disagree, and it’s not clear why. dbt’s layered model is a response to that problem:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  the decisions are in version-controlled SQL,&lt;/li&gt;
&lt;li&gt;  the contracts are enforced by tests that run on every execution, and&lt;/li&gt;
&lt;li&gt;  when something changes upstream, the first place you’ll hear about it is a failing test rather than a confused analyst.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;That’s reproducible analytics&lt;/strong&gt;  — the same inputs and the same logic producing the same outputs, every time.&lt;/p&gt;

&lt;p&gt;That’s what the move from raw scrapes to a modeled pipeline with dbt actually buys you. Not fancier queries, but a system where the logic is visible, the contracts are testable, and the next person who touches the data doesn’t have to reverse-engineer what you were thinking.&lt;/p&gt;

</description>
      <category>database</category>
      <category>datascience</category>
      <category>python</category>
      <category>programming</category>
    </item>
    <item>
      <title>The Practical Limits of DuckDB on Commodity Hardware</title>
      <dc:creator>Prithwish Nath</dc:creator>
      <pubDate>Tue, 10 Mar 2026 08:21:25 +0000</pubDate>
      <link>https://dev.to/prithwish_nath/the-practical-limits-of-duckdb-on-commodity-hardware-f76</link>
      <guid>https://dev.to/prithwish_nath/the-practical-limits-of-duckdb-on-commodity-hardware-f76</guid>
      <description>&lt;p&gt;DuckDB shouldn’t work this well. It’s a single embedded library that needs no server, config, or cloud bill — yet it handles warehouse-scale columnar analytics with surprising ease.&lt;/p&gt;

&lt;p&gt;Plenty of benchmarks already show that DuckDB can process large datasets. The more useful question, to me, is narrower: &lt;strong&gt;how far does it scale on low-end hardware before interactivity breaks down?&lt;/strong&gt; I do data forensics for a living, and the last thing I want is infrastructure getting in the way.&lt;/p&gt;

&lt;p&gt;To answer that, I ran a 50-million-row benchmark on a ~$500 Acer Aspire 5 (Raptor Lake i5, 16GB RAM, 1 TB SSD). Starting with 50,000 real-world search results, I generated a large synthetic dataset and executed increasingly complex analytical queries to identify where performance crossed practical thresholds.&lt;/p&gt;

&lt;p&gt;I'll present my findings here. The result wasn't a catastrophic failure point, but a series of predictable transitions (from instant, to tolerable, to better-scheduled-than-waited-on) depending on query shape.&lt;/p&gt;

&lt;h2&gt;
  
  
  1. The Four Performance Zones
&lt;/h2&gt;




&lt;p&gt;Before I dive into specific query types, understand that not all scales perform the same way. The same query that feels instant at 1 million rows might cross into “go grab a coffee” territory at 30 million — and different query types degrade at different rates.&lt;/p&gt;

&lt;p&gt;So there isn’t one scaling curve — there are three, one for each &lt;em&gt;type&lt;/em&gt; of analytical query:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Percentile queries (like &lt;code&gt;GROUP BY&lt;/code&gt; + &lt;code&gt;PERCENTILE_CONT&lt;/code&gt;) — when you’re trying to answer questions like “&lt;strong&gt;&lt;em&gt;For each domain, what’s its median, 25th, 75th, and 95th percentile search ranking?&lt;/em&gt;&lt;/strong&gt;” Or: “&lt;strong&gt;&lt;em&gt;For each customer tier, what is the session duration distribution?&lt;/em&gt;&lt;/strong&gt;” This is the kind of query you run when averages aren’t enough. You want to understand spread, skew, and outliers. It’s common in finance, product analytics, marketplace reporting — anywhere distributions matter more than single numbers.&lt;/li&gt;
&lt;li&gt;Window functions (like &lt;code&gt;LAG&lt;/code&gt; with &lt;code&gt;PARTITION BY&lt;/code&gt;) — e.g. “&lt;strong&gt;&lt;em&gt;For each user, how did their activity change vs. previous session?&lt;/em&gt;&lt;/strong&gt;” Or: “&lt;strong&gt;&lt;em&gt;For each account, how did monthly revenue change vs. last month?&lt;/em&gt;&lt;/strong&gt;” This is bread-and-butter analytics — a simple time-series comparison. It’s what powers growth dashboards, churn analysis. Amazon, for example, uses this extensively for anomaly detection. These are more expensive because they require partitioning and sorting — especially on large datasets like ours.&lt;/li&gt;
&lt;li&gt;Aggregations (like &lt;code&gt;COUNT&lt;/code&gt;, &lt;code&gt;AVG&lt;/code&gt;, &lt;code&gt;MIN&lt;/code&gt;, &lt;code&gt;MAX&lt;/code&gt; with &lt;code&gt;DISTINCT&lt;/code&gt;) — e.g. “&lt;strong&gt;&lt;em&gt;For each region, how many total orders did we ship, how many unique customers/products, what were the min/max purchase values?&lt;/em&gt;&lt;/strong&gt;” This is the classic summary view, the kind of query behind almost every dashboard card. Deceptively simple because &lt;code&gt;DISTINCT&lt;/code&gt; at scale forces the engine to work harder than you might expect.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Together, these three queries make up a massive chunk of all real-world analytics.&lt;/p&gt;
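&lt;p&gt;To make those shapes concrete, here’s roughly what each looks like in SQL. These are illustrative sketches rather than the exact benchmark queries; the column names follow the &lt;code&gt;serp_results&lt;/code&gt; schema described in the methodology section at the end of this post:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;-- 1. Percentile query: ranking distribution per domain
select domain,
       percentile_cont(0.5)  within group (order by rank) as p50_rank,
       percentile_cont(0.95) within group (order by rank) as p95_rank
from serp_results
group by domain;

-- 2. Window function: rank change vs. the previous snapshot of the same result
select url, query, timestamp,
       rank - lag(rank) over (partition by url, query order by timestamp) as rank_delta
from serp_results;

-- 3. Aggregation with DISTINCT: the classic dashboard-card summary
select query,
       count(*)               as total_results,
       count(distinct url)    as unique_urls,
       count(distinct domain) as unique_domains,
       min(rank)              as best_rank,
       max(rank)              as worst_rank
from serp_results
group by query;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;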

&lt;p&gt;Here’s how they perform going from 1,000 to 50,000,000 rows:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1ig0luie5bhzgikh5wkx.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1ig0luie5bhzgikh5wkx.png"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Query time (seconds) vs record count (millions). Image created via D3.js by Author.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Scale&lt;/th&gt;
&lt;th&gt;Percentile&lt;/th&gt;
&lt;th&gt;Window&lt;/th&gt;
&lt;th&gt;Aggregation&lt;/th&gt;
&lt;th&gt;Worst Case&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;1M&lt;/td&gt;
&lt;td&gt;0.11s&lt;/td&gt;
&lt;td&gt;0.80s&lt;/td&gt;
&lt;td&gt;0.12s&lt;/td&gt;
&lt;td&gt;0.8s&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2M&lt;/td&gt;
&lt;td&gt;0.28s&lt;/td&gt;
&lt;td&gt;1.78s&lt;/td&gt;
&lt;td&gt;0.50s&lt;/td&gt;
&lt;td&gt;1.8s&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;5M&lt;/td&gt;
&lt;td&gt;0.64s&lt;/td&gt;
&lt;td&gt;3.28s&lt;/td&gt;
&lt;td&gt;1.23s&lt;/td&gt;
&lt;td&gt;3.3s&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;10M&lt;/td&gt;
&lt;td&gt;1.65s&lt;/td&gt;
&lt;td&gt;6.26s&lt;/td&gt;
&lt;td&gt;2.17s&lt;/td&gt;
&lt;td&gt;6.3s&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;15M&lt;/td&gt;
&lt;td&gt;2.69s&lt;/td&gt;
&lt;td&gt;7.94s&lt;/td&gt;
&lt;td&gt;4.15s&lt;/td&gt;
&lt;td&gt;7.9s&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;20M&lt;/td&gt;
&lt;td&gt;3.69s&lt;/td&gt;
&lt;td&gt;15.26s&lt;/td&gt;
&lt;td&gt;6.58s&lt;/td&gt;
&lt;td&gt;15.3s&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;25M&lt;/td&gt;
&lt;td&gt;5.52s&lt;/td&gt;
&lt;td&gt;22.98s&lt;/td&gt;
&lt;td&gt;10.60s&lt;/td&gt;
&lt;td&gt;23.0s&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;30M&lt;/td&gt;
&lt;td&gt;6.67s&lt;/td&gt;
&lt;td&gt;31.41s&lt;/td&gt;
&lt;td&gt;15.46s&lt;/td&gt;
&lt;td&gt;31.4s&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;40M&lt;/td&gt;
&lt;td&gt;9.82s&lt;/td&gt;
&lt;td&gt;47.43s&lt;/td&gt;
&lt;td&gt;22.64s&lt;/td&gt;
&lt;td&gt;47.4s&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;50M&lt;/td&gt;
&lt;td&gt;12.95s&lt;/td&gt;
&lt;td&gt;67.24s&lt;/td&gt;
&lt;td&gt;32.42s&lt;/td&gt;
&lt;td&gt;67.2s&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Having plotted query times against human perception thresholds (I made some assumptions for those, since this is subjective), we have four distinct performance zones — with boundaries shifting depending on which query type you care about.&lt;/p&gt;

&lt;h3&gt;
  
  
  “Comfort Zone”: Up to 5M Records
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Query times:&lt;/strong&gt; Near instantaneous, &amp;lt; 3 seconds for &lt;em&gt;everything&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;This is why DuckDB feels magical. For everything up to 5 million rows (and let’s be honest — more data than most organizations are querying interactively anyway) — &lt;em&gt;all queries&lt;/em&gt; are fast. Window functions complete in under a second at 1M. Even at 5M rows, the slowest query (you guessed it: a window function) takes just 3.3 seconds. Percentiles and aggregations are sub-second through 2M and barely crack 1 second at 5M.&lt;/p&gt;

&lt;p&gt;At this scale, DuckDB keeps everything comfortably in memory. Hash tables fit in RAM, partitioning operations don’t require temp files, and there’s zero disk spilling. This is the sweet spot for team analytics, dashboards, and interactive data exploration.&lt;/p&gt;
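&lt;p&gt;If you want to see this for yourself, DuckDB exposes the relevant knobs. You can cap memory, point any spill at a specific directory, and profile an individual query to see per-operator timings (the 4 GB cap and the temp path below are just examples):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;-- Cap DuckDB's memory and tell it where to spill if it ever needs to
SET memory_limit = '4GB';
SET temp_directory = '/tmp/duckdb_spill';

-- Profile a single query: per-operator timings make it obvious where time goes
EXPLAIN ANALYZE
select domain, avg(rank) as avg_rank
from serp_results
group by domain;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;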

&lt;h3&gt;
  
  
  “Workable”: 5–20M Records
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Query times:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Window functions: 3–15 seconds. Noticeable latency. At 10M they take 6 seconds, and at 15M, about 8 seconds. They hit 15 seconds right at 20M — the upper edge of this zone.&lt;/li&gt;
&lt;li&gt;Percentiles and aggregations: still under 4 and 7 seconds respectively.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This is where window queries will make people Alt-Tab out to do something else — doomscroll, check their email, watch videos of cats — instead of waiting. Percentiles and aggregations are still firmly comfortable, though.&lt;/p&gt;

&lt;p&gt;This zone is still good for data science work and one-off analyses where a 10–15 second wait is acceptable. Still no disk spilling. Memory usage stays under ~650 MB.&lt;/p&gt;

&lt;h3&gt;
  
  
  “Now You’re Pushing It”: 20–30M Records
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Query times:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Window functions: 15–31 seconds, always crossing the 15s “frustration threshold”.&lt;/li&gt;
&lt;li&gt;Aggregations: 7–15 seconds (10.6s at 25M, 15.5s at 30M), near-annoying latency now.&lt;/li&gt;
&lt;li&gt;Percentiles: 4–7 seconds (5.5s at 25M, 6.7s at 30M.)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This is the point where query types diverge dramatically. Percentiles are still firmly in the “interactive” category, but aggregations require patience. Window functions however, are firmly in the “go do something else” territory — at 30M records, they can reach a whopping 31 seconds — clearly the point where you should batch these, instead of waiting live.&lt;/p&gt;

&lt;p&gt;In general, this zone is best for automated/scheduled work where humans aren’t waiting on complex queries. Simple aggregations and percentiles are still interactive enough.&lt;/p&gt;

&lt;h3&gt;
  
  
  “Batch-Only”: 30M+ Records
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Query times:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Window functions: 31–67 seconds. Always past one minute at 50M.&lt;/li&gt;
&lt;li&gt;Aggregations: 15–32 seconds. Always past 30 seconds at 50M.&lt;/li&gt;
&lt;li&gt;Percentiles: 7–13 seconds.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;In a testament to how performant DuckDB can be — if you’re &lt;em&gt;only&lt;/em&gt; running percentiles, even 50M rows is still marginally interactive (tops out at 13 seconds.)&lt;/p&gt;

&lt;p&gt;Everything else at this scale, though, should be scheduled.&lt;/p&gt;

&lt;p&gt;If you need window functions, this (30M and above) is clearly the point where you should consider a cloud data warehouse — they cross a minute at 50M (~67 seconds). Aggregations taking 20–30 seconds on average is also beyond generous definitions of “interactive”.&lt;/p&gt;

&lt;p&gt;So this is where we’ve found our limit — albeit not a hard one. Performance degrades predictably, not catastrophically. &lt;strong&gt;Critically, there’s still zero disk spilling.&lt;/strong&gt; DuckDB never writes temp files to disk, even at 50M records. Memory usage peaked at ~1.2 GB.&lt;/p&gt;

&lt;p&gt;Here’s a practical cheatsheet, sorted by use case:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Use case&lt;/th&gt;
&lt;th&gt;Simple analytics ceiling&lt;/th&gt;
&lt;th&gt;Window function ceiling&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Live auto-refresh dashboard (less than 500ms)&lt;/td&gt;
&lt;td&gt;~2M rows&lt;/td&gt;
&lt;td&gt;~500K rows&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;API-backed service (less than 1s)&lt;/td&gt;
&lt;td&gt;~5M rows&lt;/td&gt;
&lt;td&gt;~1M rows&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Interactive BI (less than 3s)&lt;/td&gt;
&lt;td&gt;~15M rows&lt;/td&gt;
&lt;td&gt;~5M rows&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Notebook / exploration (less than 10s)&lt;/td&gt;
&lt;td&gt;~40M rows&lt;/td&gt;
&lt;td&gt;~15M rows&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Ad-hoc analyst SQL (less than 30s)&lt;/td&gt;
&lt;td&gt;50M+ rows&lt;/td&gt;
&lt;td&gt;~20–25M rows&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Batch / ETL (less than 2min)&lt;/td&gt;
&lt;td&gt;50M+ rows&lt;/td&gt;
&lt;td&gt;50M+ rows&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h2&gt;
  
  
  2. The Goldilocks Zone for Local Analytics
&lt;/h2&gt;




&lt;p&gt;If I had to draw a circle around the scale where DuckDB on a cheap laptop is just &lt;em&gt;right&lt;/em&gt; — fast enough to feel interactive, complex enough to handle real analytical work, without needing to think about infrastructure — it’s &lt;strong&gt;1M to 10M rows&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Here’s why that range is special.&lt;/p&gt;

&lt;p&gt;At 1M rows, every query type is effectively instant. Aggregations and percentiles finish in roughly a tenth of a second. Even the most expensive query in the test — the &lt;code&gt;LAG&lt;/code&gt; window function — takes 0.8 seconds. The database is not part of your cognitive load at all at this scale.&lt;/p&gt;

&lt;p&gt;At 5M rows, that’s still largely true. Percentiles take about 0.6 seconds. Aggregations take about 1.2 seconds. Window functions take about 3.3 seconds. 5M rows of real data is a meaningful dataset — a year of event logs for a mid-sized SaaS product, a full crawl of a large website, several years of transaction history for a small business. And DuckDB chews through it without complaint.&lt;/p&gt;

&lt;p&gt;At 10M rows, you start paying a tax specifically on window functions — they cross 6 seconds here. But aggregations (~2.2s) and percentiles (~1.7s) are still fast enough that interactive use feels natural. If your workload skews toward &lt;code&gt;GROUP BY&lt;/code&gt; analytics and distribution queries rather than time-series comparisons, 10M rows is still firmly comfortable.&lt;/p&gt;

&lt;p&gt;So our “Goldilocks” Zone — so named after the fairytale of Goldilocks and the Three Bears; Goldilocks rejects porridges that are too hot &lt;em&gt;and&lt;/em&gt; too cold, until finding one that’s “just right” — is &lt;strong&gt;up to ~10M rows for any query type, and up to ~20M rows if you’re willing to accept that window functions will make you wait.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Below that ceiling, DuckDB on cheap/commodity hardware is no compromise at all.&lt;/p&gt;

&lt;h2&gt;
  
  
  3. Memory is Not the Bottleneck
&lt;/h2&gt;




&lt;p&gt;Across all scales tested (1K → 50M rows, averaged over 5 runs) DuckDB never exceeded ~&lt;strong&gt;1.2 GB&lt;/strong&gt; of memory — even at 50 million rows.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwjpiz4y36wj1q33onjhd.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwjpiz4y36wj1q33onjhd.png"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Peak memory (MB) vs record count (log scale, 1K–50M). Image created via D3.js by Author.&lt;/p&gt;

&lt;p&gt;At the upper bound (50 million rows), &lt;strong&gt;peak observed memory usage&lt;/strong&gt; was ~1,212 MB. On my 16GB laptop that’s ~7.5%. And that’s the worst case in this benchmark — with 50M rows and window-heavy queries in play.&lt;/p&gt;
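&lt;p&gt;“Peak observed memory” here means the Python process’s resident set size sampled around each query. A minimal sketch of how you could measure that yourself with &lt;code&gt;psutil&lt;/code&gt; (not necessarily the exact instrumentation used for these numbers, and a single before/after sample rather than a true peak):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import duckdb
import psutil

def query_memory_delta_mb(db_path: str, sql: str) -&amp;gt; float:
    """Run a query and return the rough RSS delta (MB) around it."""
    proc = psutil.Process()
    before = proc.memory_info().rss
    con = duckdb.connect(db_path)
    con.execute(sql).fetchall()
    con.close()
    after = proc.memory_info().rss
    return (after - before) / (1024 * 1024)

# Usage (database file name is a placeholder):
# print(query_memory_delta_mb("serp.duckdb", "select domain, avg(rank) from serp_results group by domain"))
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;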

&lt;p&gt;Memory growth is real — especially with window function (delta) queries — but it’s still fairly controlled, I’d say:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;10M rows → ~600 MB range&lt;/li&gt;
&lt;li&gt;20M rows → ~650 MB range&lt;/li&gt;
&lt;li&gt;30–50M rows → ~900 MB–1.2 GB range&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Time, on the other hand, accelerates sharply:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Window query: ~6.4s at 10M&lt;/li&gt;
&lt;li&gt;~15.8s at 20M&lt;/li&gt;
&lt;li&gt;~70s at 50M&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;So the &lt;em&gt;practical&lt;/em&gt; ceiling on a cheap laptop isn’t RAM. You’ll run out of patience before you run out of memory. It means for many local analytics use cases the scaling limit is UX, not hardware — that’s a very different kind of bottleneck, though.&lt;/p&gt;

&lt;p&gt;Before we move on, an interesting thing to note is that &lt;strong&gt;the aggregation query degrades faster than percentiles at scale.&lt;/strong&gt; At 5M they’re similar (~0.6s vs ~1.2s), but by 50M the aggregation (~33s) is roughly 2.5x the percentile (~13s). Multi-metric &lt;code&gt;GROUP BY&lt;/code&gt; with &lt;code&gt;HAVING&lt;/code&gt; and &lt;code&gt;DISTINCT&lt;/code&gt; counts WILL get expensive. If your dashboard runs heavy aggregations specifically, you should probably discount the “simple” ceiling by ~30–40%.&lt;/p&gt;

&lt;h2&gt;
  
  
  4. Window Functions Affect Scaling The Most
&lt;/h2&gt;




&lt;p&gt;If you’re embedding DuckDB in:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;A local analytics tool&lt;/li&gt;
&lt;li&gt;A CLI data processor&lt;/li&gt;
&lt;li&gt;A desktop SaaS product&lt;/li&gt;
&lt;li&gt;A developer-facing data tool&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Row count alone is not your sizing variable. &lt;strong&gt;Query shape is.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Across 18 dataset scales (1K → 50M rows, averaged over 5 runs), the single biggest performance divider wasn’t how much data existed — &lt;strong&gt;it was whether the workload used window functions at all&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://duckdb.org/docs/stable/sql/functions/window_functions" rel="noopener noreferrer"&gt;Window functions&lt;/a&gt; power extremely common patterns:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Ranking results per group (&lt;code&gt;RANK()&lt;/code&gt;,&lt;code&gt; ROW_NUMBER()&lt;/code&gt;)&lt;/li&gt;
&lt;li&gt;Comparing rows over time (&lt;code&gt;LAG()&lt;/code&gt;, &lt;code&gt;LEAD()&lt;/code&gt;)&lt;/li&gt;
&lt;li&gt;Running totals&lt;/li&gt;
&lt;li&gt;Sessionization&lt;/li&gt;
&lt;li&gt;Deduplication&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;These are all operations you’ll run many, &lt;em&gt;many&lt;/em&gt; times for a real project.&lt;/p&gt;
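&lt;p&gt;Deduplication is a good example: keep only the most recent row per key, which is a &lt;code&gt;ROW_NUMBER&lt;/code&gt;-over-partition pattern. A sketch against the same SERP table:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;-- Keep only the latest snapshot per (url, query); QUALIFY filters on the window result directly
select *
from serp_results
qualify row_number() over (partition by url, query order by timestamp desc) = 1;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;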

&lt;p&gt;I found that on identical hardware, aggregation-heavy workloads comfortably scale to tens of millions of rows while staying interactive. Window-heavy workloads, though, hit “coffee break” territory &lt;em&gt;much&lt;/em&gt; earlier.&lt;/p&gt;

&lt;h3&gt;
  
  
  At 10M rows
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Percentile query (aggregation-heavy):&lt;/strong&gt; ~1.7s, ~217 MB memory delta&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Window “delta” query (LAG-style):&lt;/strong&gt; ~6.4s, ~46 MB memory delta&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Group aggregation:&lt;/strong&gt; ~2.5s, ~200 MB memory delta&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Already, the window query is ~3–4x slower than pure aggregation.&lt;/p&gt;

&lt;h3&gt;
  
  
  At 20M rows
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Percentile:&lt;/strong&gt; ~4.1s&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Window (delta):&lt;/strong&gt; ~15.8s&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Aggregation:&lt;/strong&gt; ~7.3s&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The gap widens. The window query is now roughly 2–4x slower than aggregation-style queries.&lt;/p&gt;

&lt;h3&gt;
  
  
  At 50M rows
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Percentile:&lt;/strong&gt; ~16.5s, ~330 MB delta&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Window (delta):&lt;/strong&gt; ~70.3s, ~588 MB delta&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Aggregation:&lt;/strong&gt; ~33.3s, ~113 MB delta&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This is where the separation becomes untenable:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The window query takes &lt;strong&gt;70 seconds&lt;/strong&gt;.&lt;/li&gt;
&lt;li&gt;Aggregation-heavy queries remain roughly half that.&lt;/li&gt;
&lt;li&gt;Memory growth for window logic accelerates much more aggressively at scale.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If your workload is mostly &lt;code&gt;GROUP BY&lt;/code&gt;, percentiles, and simple scans, then you have substantial headroom. If your workload relies heavily on &lt;code&gt;LAG()&lt;/code&gt;, &lt;code&gt;LEAD()&lt;/code&gt;, or &lt;code&gt;RANK() OVER (PARTITION BY …)&lt;/code&gt;, then your practical ceiling with DuckDB (on consumer-grade local hardware, at least) arrives much sooner.&lt;/p&gt;

&lt;h2&gt;
  
  
  5. …But Make Sure You Checkpoint Enough
&lt;/h2&gt;

&lt;p&gt;Speaking of window functions, a very interesting recurring pattern I found was that at exactly 50,000 records, these hit a wall.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8frqsxwwbp62oln06rtu.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8frqsxwwbp62oln06rtu.png"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Delta query time (seconds) vs record count (log scale, 1K–5M) showing the spike at 50K. Image created via D3.js by Author&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;This pattern is consistent across all 5 runs:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;20K records: 0.24 seconds (fast)&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;50K records: 1.21 seconds (5x slower than expected)&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;100K records: 0.20 seconds (fast again)&lt;/li&gt;
&lt;/ul&gt;

&lt;blockquote&gt;
&lt;p&gt;Assuming expected: linear interpolation between 20K (0.241s) and 100K (0.204s) at 50K: 0.241 + (0.204 − 0.241) × (50 − 20) / (100 − 20) ≈ 0.227s&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;This specific spike appears in &lt;strong&gt;every single run&lt;/strong&gt; with CV (coefficient of variation) of only 10.3%. Expected time based on linear scaling was 0.23 seconds. Actual time? 1.21 seconds. &lt;strong&gt;That’s 430% slower than it should be!&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;It’s the only non-monotonic performance pattern in the entire dataset. Every other scale shows smooth, predictable scaling. The 50K spike for window/delta functions is the exception.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Why it happens (and the fix):&lt;/strong&gt; you need to &lt;a href="https://duckdb.org/docs/stable/sql/statements/checkpoint" rel="noopener noreferrer"&gt;checkpoint&lt;/a&gt; more frequently. DuckDB has automatic checkpointing (&lt;code&gt;wal_autocheckpoint&lt;/code&gt;), but it doesn’t work reliably during bulk inserts from Python (&lt;a href="https://github.com/duckdb/duckdb/issues/9721" rel="noopener noreferrer"&gt;see this GitHub issue&lt;/a&gt;). Knowing this, I set checkpointing manually, but as it turned out I wasn’t checkpointing &lt;em&gt;enough&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;Data was inserted in 1,000-row batches across multiple database connections, but I had a LOT of records (50 million) so I thought I’d manually trigger CHECKPOINT only every 100K rows.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;em&gt;💡&lt;/em&gt; CHECKPOINT is a database operation that flushes pending writes to disk and optimizes storage layout. In DuckDB specifically, it:&lt;br&gt;
- Consolidates fragmented columnar segments created by many small inserts&lt;br&gt;
- Reorganizes data into optimized columnar format for faster reads&lt;br&gt;
- Flushes the write-ahead log (WAL) to the main database file&lt;br&gt;
Like defragmenting a hard drive, CHECKPOINT reorganizes scattered data into contiguous, optimized blocks.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Because the scales were run sequentially (1K → 5K → 10K → 20K → 50K → 100K and so on), by the time the 50K test executed, dozens of small batch inserts had accumulated without a CHECKPOINT, leaving the write-ahead log fragmented across many small column segments.&lt;/p&gt;

&lt;p&gt;Now consider the query that spikes:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;LAG(rank) OVER (PARTITION BY url, query ORDER BY timestamp)&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;This requires a full sort of the relevant partitions. Sorting is exactly the kind of operation that is most sensitive to storage layout, and fragmented column segments mean less efficient scanning and sorting. So at 50K, the query pays the fragmentation penalty.&lt;/p&gt;

&lt;p&gt;By 50K I had 50 fragmented segments. The window function’s sort pays the cost of reading across all those fragments. At 100K, CHECKPOINT compacted everything, so queries got faster despite more data.&lt;/p&gt;
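&lt;p&gt;The fix is mechanical: checkpoint on a cadence tied to how many rows you’ve inserted, not to where you are in the test plan. A minimal sketch of the insert loop (the batch size, checkpoint interval, and abbreviated column list are illustrative, not the exact benchmark code):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import duckdb

BATCH_SIZE = 1_000
CHECKPOINT_EVERY = 10_000  # rows between manual checkpoints; far more often than my original 100K

def insert_with_checkpoints(con: duckdb.DuckDBPyConnection, rows: list):
    """Insert rows in small batches, compacting WAL/column segments as we go."""
    since_checkpoint = 0
    for start in range(0, len(rows), BATCH_SIZE):
        batch = rows[start:start + BATCH_SIZE]
        con.executemany(
            "insert into serp_results (id, query, timestamp, result_position, url, domain, rank) "
            "values (?, ?, ?, ?, ?, ?, ?)",
            batch,
        )
        since_checkpoint += len(batch)
        if since_checkpoint &amp;gt;= CHECKPOINT_EVERY:
            con.execute("CHECKPOINT")  # flush the WAL and consolidate fragmented segments
            since_checkpoint = 0
    con.execute("CHECKPOINT")  # final compaction before any analytical queries
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;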

&lt;p&gt;That’s all the data I found. If you want to know about my methodology for this (admittedly) very niche benchmark, read on.&lt;/p&gt;

&lt;h2&gt;
  
  
  Methodology — How this Low-End Benchmark Was Built
&lt;/h2&gt;




&lt;p&gt;What I wanted to do was stress-test DuckDB with realistic analytical queries at scales most teams would consider “cloud warehouse only.” So if I wanted to benchmark how it performed on low-end hardware, I had to flip the entire approach on its head.&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 1 — Getting Search Data
&lt;/h3&gt;

&lt;p&gt;I started by fetching 50,000 actual Google SERP results using Bright Data’s &lt;a href="https://get.brightdata.com/bd-serp-api?utm_content=practical_limits_of_duckdb_on_commodity_hardware" rel="noopener noreferrer"&gt;SERP API&lt;/a&gt;. Why SERP data? Because search results have realistic cardinality — lots of unique URLs, domains, queries, timestamps — exactly the kind of data that stresses analytical databases.&lt;/p&gt;

&lt;p&gt;I built ~2,550 unique queries by combining 50 base topics with 51 suffix variations:&lt;/p&gt;


&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;_BASE_TOPICS&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;  
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;python programming&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;machine learning&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;web development&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  
    &lt;span class="c1"&gt;# ... 47 more  
&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;  

&lt;span class="n"&gt;_QUERY_SUFFIXES&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;  
    &lt;span class="sh"&gt;""&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt; tutorial&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt; 2024&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt; best practices&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  
    &lt;span class="c1"&gt;# ... 47 more  
&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;  

&lt;span class="n"&gt;SERP_QUERIES&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;  
    &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;base&lt;/span&gt;&lt;span class="si"&gt;}{&lt;/span&gt;&lt;span class="n"&gt;suffix&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;strip&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;  
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;base&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;_BASE_TOPICS&lt;/span&gt;  
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;suffix&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;_QUERY_SUFFIXES&lt;/span&gt;  
&lt;span class="p"&gt;]&lt;/span&gt;  
&lt;span class="n"&gt;SERP_QUERIES&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;list&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;dict&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;fromkeys&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;SERP_QUERIES&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;blockquote&gt;
&lt;p&gt;Full code here: &lt;a href="https://gist.github.com/sixthextinction/57c0cc9358eac133d07fdbbeadd832ff" rel="noopener noreferrer"&gt;serp_queries.py&lt;/a&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Each query fetched up to 20 organic results via the API, streamed directly into DuckDB as results came in:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;client&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;BrightDataClient&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;  

&lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="nc"&gt;DuckDBManager&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;db&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;  
    &lt;span class="k"&gt;while&lt;/span&gt; &lt;span class="n"&gt;results_obtained&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="n"&gt;total_results&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;  
        &lt;span class="n"&gt;query&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;queries&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;query_idx&lt;/span&gt; &lt;span class="o"&gt;%&lt;/span&gt; &lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;queries&lt;/span&gt;&lt;span class="p"&gt;)]&lt;/span&gt;  
        &lt;span class="n"&gt;serp_data&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;search&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;num_results&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;batch_size&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  

        &lt;span class="n"&gt;organic_results&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;  
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="nf"&gt;isinstance&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;serp_data&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nb"&gt;dict&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;  
            &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;organic&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;serp_data&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;  
                &lt;span class="n"&gt;organic_results&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;serp_data&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;organic&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;  
            &lt;span class="k"&gt;elif&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;body&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;serp_data&lt;/span&gt; &lt;span class="ow"&gt;and&lt;/span&gt; &lt;span class="nf"&gt;isinstance&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;serp_data&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;body&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="nb"&gt;dict&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;  
                &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;organic&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;serp_data&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;body&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]:&lt;/span&gt;  
                    &lt;span class="n"&gt;organic_results&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;serp_data&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;body&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;organic&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;  

        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;organic_results&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;  
            &lt;span class="n"&gt;db&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;insert_batch&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;organic_results&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  
            &lt;span class="n"&gt;results_obtained&lt;/span&gt; &lt;span class="o"&gt;+=&lt;/span&gt; &lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;organic_results&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  

        &lt;span class="n"&gt;query_idx&lt;/span&gt; &lt;span class="o"&gt;+=&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;  
        &lt;span class="n"&gt;time&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;sleep&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;delay_seconds&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;blockquote&gt;
&lt;p&gt;Full code here: &lt;a href="https://gist.github.com/sixthextinction/4e8ae4eefebc1c07587882b49ffaecdb" rel="noopener noreferrer"&gt;bright_data.py&lt;/a&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;My DuckDB schema mirrors the SERP data structure the API returns. (&lt;a href="https://get.brightdata.com/lp-scraping-browser-acf1964?utm_content=practical_limits_of_duckdb_on_commodity_hardware" rel="noopener noreferrer"&gt;Check their docs here for more info&lt;/a&gt;):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;conn&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="k"&gt;execute&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nv"&gt;"&lt;/span&gt;&lt;span class="se"&gt;""&lt;/span&gt;&lt;span class="nv"&gt;  
    CREATE TABLE IF NOT EXISTS serp_results (  
        id BIGINT PRIMARY KEY,  
        query TEXT NOT NULL,  
        timestamp TIMESTAMP NOT NULL,  
        result_position INTEGER NOT NULL,  
        title TEXT,  
        url TEXT,  
        snippet TEXT,  
        domain TEXT,  
        rank INTEGER,  
        previous_rank INTEGER,  
        rank_delta INTEGER  
    )  
&lt;/span&gt;&lt;span class="se"&gt;""&lt;/span&gt;&lt;span class="nv"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;blockquote&gt;
&lt;p&gt;Full code here: &lt;a href="https://gist.github.com/sixthextinction/4ac6bc086e31dd1f4a8e56bef51b8e1c" rel="noopener noreferrer"&gt;duckdb_manager.py&lt;/a&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;This adds up to 50K rows of real, varied data with natural patterns — not artificial test data…but that still wasn’t enough.&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 2 — Synthesize 50M Records from Real Patterns
&lt;/h3&gt;

&lt;p&gt;50K rows isn’t enough to stress-test at scale. But &lt;em&gt;completely&lt;/em&gt; random synthetic data loses the realistic patterns that make queries slow. So I extracted actual domains, queries, title structures, and snippets from the real data and used those to generate variations:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;extract_serp_patterns&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;db_path&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;  
    &lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="nc"&gt;DuckDBManager&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;db_path&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;db&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;  
        &lt;span class="n"&gt;queries&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;db&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;conn&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;execute&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;  
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;SELECT DISTINCT query FROM serp_results ORDER BY RANDOM() LIMIT 100&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;  
        &lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;fetchall&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;  
        &lt;span class="n"&gt;domains&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;db&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;conn&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;execute&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;  
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;SELECT DISTINCT domain FROM serp_results WHERE domain IS NOT NULL LIMIT 200&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;  
        &lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;fetchall&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;  
        &lt;span class="n"&gt;title_samples&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;db&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;conn&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;execute&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;  
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;SELECT title FROM serp_results WHERE title IS NOT NULL LIMIT 50&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;  
        &lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;fetchall&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;  
        &lt;span class="n"&gt;snippet_samples&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;db&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;conn&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;execute&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;  
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;SELECT snippet FROM serp_results WHERE snippet IS NOT NULL LIMIT 50&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;  
        &lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;fetchall&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;  
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;  
            &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;queries&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;r&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;r&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;queries&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;  
            &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;domains&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;r&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;r&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;domains&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;  
            &lt;span class="p"&gt;...&lt;/span&gt;  
        &lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Synthetic rows were generated in batches of 10,000 and inserted directly, mixed in with the 50K real records, until we had 50M total:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nf"&gt;range&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;needed&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;batch_size&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;  
    &lt;span class="n"&gt;batch&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[{&lt;/span&gt;  
        &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;id&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;current_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  
        &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;query&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;random&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;choice&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;queries&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;  
        &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;domain&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;random&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;choice&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;domains&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;  
        &lt;span class="p"&gt;...&lt;/span&gt;  
    &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;j&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nf"&gt;range&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;batch_size&lt;/span&gt;&lt;span class="p"&gt;)]&lt;/span&gt;  
    &lt;span class="n"&gt;df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;DataFrame&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;batch&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  
    &lt;span class="n"&gt;db&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;conn&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;execute&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;INSERT INTO serp_results SELECT * FROM df&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;blockquote&gt;
&lt;p&gt;Full code here: &lt;a href="https://gist.github.com/sixthextinction/d8e1057ddabaf98286279f281fa3734c" rel="noopener noreferrer"&gt;benchmark.py&lt;/a&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h3&gt;
  
  
  Step 3 — Design the Query Suite
&lt;/h3&gt;

&lt;p&gt;I chose three query types that represent common analytical workloads:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Percentile Queries:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="n"&gt;query&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="nv"&gt;"&lt;/span&gt;&lt;span class="se"&gt;""&lt;/span&gt;&lt;span class="nv"&gt;  
    SELECT   
        domain,  
        COUNT(*) as result_count,  
        AVG(rank) as avg_rank,  
        PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY rank) as median_rank,  
        PERCENTILE_CONT(0.25) WITHIN GROUP (ORDER BY rank) as p25_rank,  
        PERCENTILE_CONT(0.75) WITHIN GROUP (ORDER BY rank) as p75_rank,  
        PERCENTILE_CONT(0.95) WITHIN GROUP (ORDER BY rank) as p95_rank  
    FROM serp_results  
    {where_clause}  
    GROUP BY domain  
    ORDER BY avg_rank  
    LIMIT 100  
&lt;/span&gt;&lt;span class="se"&gt;""&lt;/span&gt;&lt;span class="nv"&gt;"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Window Functions:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="n"&gt;query&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="nv"&gt;"&lt;/span&gt;&lt;span class="se"&gt;""&lt;/span&gt;&lt;span class="nv"&gt;  
    WITH ranked AS (  
        SELECT   
            url,  
            query,  
            rank,  
            timestamp,  
            LAG(rank) OVER (PARTITION BY url, query ORDER BY timestamp) as previous_rank  
        FROM serp_results  
        {id_filter}  
    )  
    SELECT   
        url,  
        query,  
        rank,  
        previous_rank,  
        rank - previous_rank as rank_delta,  
        timestamp  
    FROM ranked  
    WHERE previous_rank IS NOT NULL  
    ORDER BY ABS(rank_delta) DESC  
    LIMIT 100  
&lt;/span&gt;&lt;span class="se"&gt;""&lt;/span&gt;&lt;span class="nv"&gt;"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Aggregation Queries:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="n"&gt;query&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="nv"&gt;"&lt;/span&gt;&lt;span class="se"&gt;""&lt;/span&gt;&lt;span class="nv"&gt;  
    SELECT   
        domain,  
        COUNT(*) as total_results,  
        COUNT(DISTINCT query) as unique_queries,  
        AVG(rank) as avg_rank,  
        MIN(rank) as best_rank,  
        MAX(rank) as worst_rank,  
        COUNT(DISTINCT url) as unique_urls  
    FROM serp_results  
    {where_clause}  
    GROUP BY domain  
    HAVING COUNT(*) &amp;gt; 10  
    ORDER BY total_results DESC  
    LIMIT 50  
&lt;/span&gt;&lt;span class="se"&gt;""&lt;/span&gt;&lt;span class="nv"&gt;"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;blockquote&gt;
&lt;p&gt;Full code (all queries) here: &lt;a href="https://gist.github.com/sixthextinction/533b2c022984272b5bbc071478ae9ebc" rel="noopener noreferrer"&gt;queries.py&lt;/a&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;These test different bottlenecks: percentiles stress sorting and statistical aggregation; window functions stress partitioning and stateful operations; aggregations stress hash tables and &lt;code&gt;DISTINCT&lt;/code&gt; operations.&lt;/p&gt;
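
&lt;p&gt;If you want to check this yourself, DuckDB can profile a query and report per-operator timings. This isn’t part of the benchmark harness; it’s just a quick way to see which operators dominate a given shape (file and table names here are assumptions):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import duckdb

# Open the benchmark database (file name is illustrative)
con = duckdb.connect("serp_data.duckdb")

# EXPLAIN ANALYZE executes the query and reports per-operator timings,
# so you can see whether sorting, hashing, or partitioning dominates
rows = con.execute("""
    EXPLAIN ANALYZE
    SELECT domain, COUNT(*) AS total_results
    FROM serp_results
    GROUP BY domain
    ORDER BY total_results DESC
    LIMIT 50
""").fetchall()

for row in rows:
    print(row[-1])  # the rendered plan text is the last column of each row
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
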

&lt;h3&gt;
  
  
  Step 4 — Run at 18 Different Scales
&lt;/h3&gt;

&lt;p&gt;Instead of regenerating data for each scale, I used a single optimization: filter by ID. One dataset, 18 different windows into it:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="n"&gt;max_id_result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;db&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;conn&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="k"&gt;execute&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="nv"&gt;"&lt;/span&gt;&lt;span class="se"&gt;""&lt;/span&gt;&lt;span class="nv"&gt;  
    SELECT id FROM (  
        SELECT id FROM serp_results ORDER BY id LIMIT {target_count}  
    ) ORDER BY id DESC LIMIT 1  
&lt;/span&gt;&lt;span class="se"&gt;""&lt;/span&gt;&lt;span class="nv"&gt;"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="n"&gt;fetchone&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;  

&lt;span class="n"&gt;max_id&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;max_id_result&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Every query then receives this &lt;code&gt;max_id&lt;/code&gt; as a filter:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;where_clause&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;WHERE id &amp;lt;= &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;max_id&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; AND domain IS NOT NULL AND domain != &lt;/span&gt;&lt;span class="sh"&gt;''"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;blockquote&gt;
&lt;p&gt;Full code here: &lt;a href="https://gist.github.com/sixthextinction/d8e1057ddabaf98286279f281fa3734c" rel="noopener noreferrer"&gt;benchmark.py&lt;/a&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;This tested 1K, 5K, 10K, 20K, 50K, 100K, 200K, 500K, 1M, 2M, 5M, 10M, 15M, 20M, 25M, 30M, 40M, and 50M records against the same underlying data — isolating row count as the only variable. At each scale, I measured query execution time and peak memory usage via &lt;code&gt;psutil&lt;/code&gt;.&lt;/p&gt;
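
&lt;p&gt;The measurement itself needs very little machinery. Here’s a minimal sketch of one timed, memory-tracked run using &lt;code&gt;psutil&lt;/code&gt;; the names are illustrative, and it samples RSS before and after rather than tracking a true peak:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import time

import duckdb
import psutil

def measure_query(con, sql):
    """Run one query and return (elapsed_seconds, rss_growth_mb)."""
    proc = psutil.Process()              # the current Python process
    rss_before = proc.memory_info().rss  # resident set size, in bytes

    start = time.perf_counter()
    con.execute(sql).fetchall()          # materialize the full result
    elapsed = time.perf_counter() - start

    rss_after = proc.memory_info().rss
    return elapsed, (rss_after - rss_before) / (1024 * 1024)

# Usage (file and table names assumed):
con = duckdb.connect("serp_data.duckdb")
elapsed, mem_mb = measure_query(con, "SELECT COUNT(*) FROM serp_results")
print(f"{elapsed:.2f}s, ~{mem_mb:.1f} MB of RSS growth")
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
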

&lt;h3&gt;
  
  
  Step 5 — Run Five Complete Iterations
&lt;/h3&gt;

&lt;p&gt;To ensure patterns weren’t random variance, I ran the entire benchmark suite five times. This produced 90 data points per query type (5 runs x 18 scales).&lt;/p&gt;

&lt;p&gt;The result: performance was shockingly consistent. Most metrics had less than 10% coefficient of variation across runs — the 50M window function result, for instance, varied by less than 1 second across all five runs.&lt;/p&gt;
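
&lt;p&gt;For reference, coefficient of variation here just means the standard deviation of the five timings divided by their mean. A tiny sketch of that check, with made-up numbers for illustration:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;from statistics import mean, pstdev

# Five runs of the same query at the same scale (illustrative values, in seconds)
timings = [41.8, 42.1, 41.6, 42.3, 41.9]

cv = pstdev(timings) / mean(timings)  # coefficient of variation
print(f"CV = {cv:.1%}")               # comfortably under the 10% threshold
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
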

&lt;h2&gt;
  
  
  Limitations
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Uniform synthetic distribution:&lt;/strong&gt; The 50M records were generated by sampling uniformly from ~200 real domains and ~100 real queries. Real-world data follows a power-law distribution — a small number of domains dominate traffic heavily. This means the &lt;code&gt;PARTITION BY url, query&lt;/code&gt; window function hits roughly equal-sized partitions in this benchmark, which is best-case behavior. In production data with skewed distributions, window function performance at scale could be meaningfully worse than the numbers here suggest.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Single-table queries only:&lt;/strong&gt; Didn’t test multi-table joins or recursive CTEs.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Hardware-specific:&lt;/strong&gt; Tested on one machine (16GB RAM). Results may vary on different hardware.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Query-specific:&lt;/strong&gt; These three query types don’t represent all analytical workloads.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;DuckDB version:&lt;/strong&gt; Tested on DuckDB &amp;gt;=1.0.0. Newer versions may perform differently.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;No concurrent queries:&lt;/strong&gt; Single-threaded only, no concurrent workload tested.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Conclusion — Scaling Is a UX Ceiling.
&lt;/h2&gt;




&lt;p&gt;This experiment was less about stress-testing DuckDB and more about finding out where local analytics on consumer-grade hardware stops feeling interactive.&lt;/p&gt;

&lt;p&gt;Across all runs, DuckDB remained stable and memory usage stayed modest. There was no hard failure point, no dramatic collapse in performance. Instead, execution times increased predictably as data grew — and different query shapes crossed the interactivity threshold at different scales. Aggregation-heavy workloads retained responsiveness far longer, while window-heavy queries reached the UX ceiling much sooner.&lt;/p&gt;

&lt;p&gt;This, then, is our practical takeaway: &lt;strong&gt;the limiting factor on local analytics isn’t usually RAM or disk; it’s the point at which queries stop feeling fluid.&lt;/strong&gt; At that boundary, the decision you have to make is entirely about whether (for your use case) the user experience is still good enough to keep things local.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Some links in this article are tracking links used for analytics purposes only. I do not receive any commission or compensation from them.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>datascience</category>
      <category>python</category>
      <category>programming</category>
      <category>database</category>
    </item>
    <item>
      <title>Build a Bun CLI to Generate TypeScript Clients from API Docs</title>
      <dc:creator>Prithwish Nath</dc:creator>
      <pubDate>Wed, 04 Mar 2026 09:38:11 +0000</pubDate>
      <link>https://dev.to/prithwish_nath/build-a-bun-cli-to-generate-typescript-clients-from-api-docs-a6e</link>
      <guid>https://dev.to/prithwish_nath/build-a-bun-cli-to-generate-typescript-clients-from-api-docs-a6e</guid>
      <description>&lt;h2&gt;
  
  
  A Common Dev Pain Point (And My Excuse to Learn Bun)
&lt;/h2&gt;

&lt;p&gt;I hate it when I find an API I want to use, go to their documentation site, and find a beautiful page with endpoints, request/response examples, detailed explanations, and… no OpenAPI spec. No SDK, either.&lt;/p&gt;

&lt;p&gt;I understand creating a &lt;a href="https://en.wikipedia.org/wiki/OpenAPI_Specification" rel="noopener noreferrer"&gt;Swagger/OpenAPI schema&lt;/a&gt; involves far more effort than a typical docs page for an API, so I can’t be &lt;em&gt;too&lt;/em&gt; upset. But this does limit my options — I’d either have to hand-write fetch calls for every endpoint (tedious, error-prone), or politely ask the API maintainer for an OpenAPI spec (they are not obligated to spend dev cycles on some rando’s request.)&lt;/p&gt;

&lt;p&gt;&lt;a href="https://github.com/public-apis/public-apis" rel="noopener noreferrer"&gt;This is the case with at least 75% of all APIs here&lt;/a&gt;, for example. Even well-funded APIs sometimes have great docs but no machine-readable spec.&lt;/p&gt;

&lt;p&gt;So I built a CLI tool with full proxy support (with &lt;a href="https://bun.com/" rel="noopener noreferrer"&gt;Bun&lt;/a&gt; — this experiment is mostly because I wanted to learn how to create tooling with it) that generates TypeScript clients from either OpenAPI specs &lt;strong&gt;or&lt;/strong&gt; raw documentation sites.&lt;/p&gt;

&lt;p&gt;You use it like:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# 1. Just point it at the docs page  &lt;/span&gt;
&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; dtoc https://docs.some-api.com  
&lt;span class="c"&gt;# Or, if you just want to run it in dev without building an executable...  &lt;/span&gt;
&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; bun run index.ts https://docs.some-api.com  

&lt;span class="c"&gt;# 2. If the API does have a Swagger/OpenAPI JSON spec, use that instead  &lt;/span&gt;
&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; dtoc https://some-other-api.com/doc.json
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;…and get back a complete TypeScript API client that you can use like so:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="c1"&gt;// In CommonJS you could omit the .js here, but not in ESM  &lt;/span&gt;
&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nx"&gt;ApiClient&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;./generated/catfact_ninja/client.js&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;  
&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;client&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;ApiClient&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;  
&lt;span class="c1"&gt;// Get a random cat fact  &lt;/span&gt;
&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;fact&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;getFactRandom&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;  
&lt;span class="nx"&gt;console&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;log&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;fact&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;fact&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;  
&lt;span class="c1"&gt;// Get a list of breeds  &lt;/span&gt;
&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;breeds&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;getBreeds&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;And yes, it can read .env files from the current working directory, and it compiles to a standalone ~100MB binary you can distribute.&lt;/p&gt;
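
&lt;p&gt;That binary comes from Bun’s single-file compiler. Something along these lines produces it (check Bun’s build docs for the exact flags on your version; &lt;code&gt;dtoc&lt;/code&gt; is just the output name I’m using):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;# Compile the CLI into one self-contained executable
bun build ./index.ts --compile --outfile dtoc
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
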

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Full source code here:&lt;/strong&gt; &lt;a href="https://github.com/sixthextinction/bun-docs-to-client" rel="noopener noreferrer"&gt;https://github.com/sixthextinction/bun-docs-to-client&lt;/a&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Throughout this article, code snippets link to specific files/lines. Some examples are simplified for clarity — check the links for complete implementations.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Architecture
&lt;/h2&gt;

&lt;p&gt;For any documentation site, there are two paths we can go down: two distinct pipelines, depending on what you feed the tool. Let’s visualize this before diving into code:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvcfhwp8oxa1xn1epnw50.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvcfhwp8oxa1xn1epnw50.png" width="640" height="1572"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  1. If you have a Swagger/OpenAPI spec (deterministic compile pipeline)
&lt;/h3&gt;

&lt;p&gt;Our happy path is when a proper OpenAPI spec is available. The workflow becomes much more mechanical — and reliable.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Fetch the OpenAPI JSON (pass in the URL to it, or a local file if you have it saved)&lt;/li&gt;
&lt;li&gt;Clean non-standard root properties&lt;/li&gt;
&lt;li&gt;Validate and dereference the spec using &lt;code&gt;swagger-parser&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Normalize server URLs (absolute, relative, or inferred)&lt;/li&gt;
&lt;li&gt;Generate TypeScript interfaces from &lt;code&gt;components.schemas&lt;/code&gt; (part of the Swagger/OpenAPI JSON)&lt;/li&gt;
&lt;li&gt;Generate a typed &lt;code&gt;ApiClient&lt;/code&gt; class from &lt;code&gt;paths&lt;/code&gt; (also part of the spec)&lt;/li&gt;
&lt;li&gt;Emit the client, types, tests, and index files&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;There’s no inference or heuristics at play. The spec becomes the single source of truth. Spec in, deterministic code out. &lt;strong&gt;This is our ideal case.&lt;/strong&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  2. If you only have a messy docs site (LLM + runtime synthesis pipeline)
&lt;/h3&gt;

&lt;p&gt;With this route, I’m looking for something that scaffolds me 80% of the way there. A best-effort version. It won’t one-shot every API, and that’s okay. I can do the rest.&lt;/p&gt;

&lt;p&gt;When no OpenAPI spec exists, we have to synthesize one from the documentation page itself — and then make real requests to the API to validate it for us.&lt;/p&gt;

&lt;p&gt;Our workflow has to become exploratory:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Fetch the documentation HTML, convert HTML → Markdown (using turndown or similar; sketched just after this list)&lt;/li&gt;
&lt;li&gt;Use an LLM (preferably, local) to ONLY extract mentioned API endpoints from the markdown&lt;/li&gt;
&lt;li&gt;Infer the base URL from example requests&lt;/li&gt;
&lt;li&gt;Categorize endpoints (list, detail, query), then probe them by sending real HTTP requests — inferring the request structure for that endpoint from actual API response&lt;/li&gt;
&lt;li&gt;Extract IDs from list responses (e.g. if the docs page mentions &lt;code&gt;/people/1&lt;/code&gt;) and use them to probe detail routes like &lt;code&gt;/people/{id}&lt;/code&gt; so schemas are inferred from real, working endpoints instead of guesses.&lt;/li&gt;
&lt;li&gt;Assemble a minimal but valid OpenAPI spec from the above, validate it using &lt;code&gt;swagger-parser&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Generate a typed &lt;code&gt;ApiClient&lt;/code&gt; class and TypeScript interfaces&lt;/li&gt;
&lt;li&gt;Emit the client, types, tests, and index files&lt;/li&gt;
&lt;/ul&gt;
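
&lt;p&gt;That first conversion step is the least magical part. Here’s roughly what it looks like; this is a minimal sketch using turndown, not the tool’s exact code:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;import TurndownService from 'turndown';

// Fetch a docs page and reduce it to Markdown the LLM can work with
async function docsPageToMarkdown(url: string): Promise&amp;lt;string&amp;gt; {
  const response = await fetch(url);
  if (!response.ok) throw new Error(`Failed to fetch docs: ${response.status}`);
  const html = await response.text();

  const turndown = new TurndownService();
  return turndown.turndown(html); // HTML in, Markdown out
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
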

&lt;p&gt;The LLM’s job is narrow — it is instructed to only identify endpoints mentioned in the documentation (and we filter out the ones we know for &lt;em&gt;sure&lt;/em&gt; won’t be endpoints — like image assets, CSS/JS files, OAuth flows, social links, status pages, or obviously non-API routes).&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The LLM doesn’t need to be perfect, or fully generate the OpenAPI spec file itself.&lt;/strong&gt; It just needs to extract mentioned endpoints. Our actual HTTP testing validates everything later and generates accurate schemas from real data.&lt;/p&gt;

&lt;p&gt;By the time code generation runs, we’re back in the same deterministic world as the happy path — operating on a validated OpenAPI spec.&lt;/p&gt;

&lt;p&gt;To get started: install &lt;a href="https://bun.sh" rel="noopener noreferrer"&gt;Bun&lt;/a&gt;, then run &lt;code&gt;bun install&lt;/code&gt; to install dependencies (We have two: &lt;a href="https://www.npmjs.com/package/@apidevtools/swagger-parser" rel="noopener noreferrer"&gt;@apidevtools/swagger-parser&lt;/a&gt;, and &lt;a href="https://www.npmjs.com/package/turndown" rel="noopener noreferrer"&gt;turndown&lt;/a&gt;).&lt;/p&gt;

&lt;p&gt;Entry point is &lt;strong&gt;index.ts&lt;/strong&gt;, as you’d expect.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nx"&gt;normalizeUrl&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;isUrl&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;extractSiteName&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;detectContentType&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;./src/fetch.js&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;  
&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nx"&gt;parseOpenAPI&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;./src/parse.js&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;  
&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nx"&gt;docsToOpenAPI&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;./src/docs-to-openapi.js&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;  
&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nx"&gt;generateClient&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;./src/generate.js&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;  
&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nx"&gt;emitFiles&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;./src/emit.js&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;  

&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="kd"&gt;function&lt;/span&gt; &lt;span class="nf"&gt;main&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;  
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;input&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;process&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;argv&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;];&lt;/span&gt;  

  &lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;!&lt;/span&gt;&lt;span class="nx"&gt;input&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;  
    &lt;span class="nx"&gt;console&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;error&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;Usage: bunx docs-to-client &amp;lt;url-or-file&amp;gt;&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;  
    &lt;span class="nx"&gt;console&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;error&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;Example: bunx docs-to-client https://api.example.com/docs&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;  
    &lt;span class="nx"&gt;console&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;error&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;Example: bunx docs-to-client ./specs/openapi.json&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;  
    &lt;span class="nx"&gt;process&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;exit&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;  
  &lt;span class="p"&gt;}&lt;/span&gt;  

  &lt;span class="k"&gt;try&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;  
    &lt;span class="kd"&gt;let&lt;/span&gt; &lt;span class="na"&gt;spec&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kr"&gt;any&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;  
    &lt;span class="kd"&gt;let&lt;/span&gt; &lt;span class="na"&gt;specPath&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kr"&gt;string&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;input&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;  

    &lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;isUrl&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;input&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;  
      &lt;span class="c1"&gt;// Detect if it's HTML docs or OpenAPI JSON  &lt;/span&gt;
      &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;contentType&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nf"&gt;detectContentType&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;input&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;  

      &lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;contentType&lt;/span&gt; &lt;span class="o"&gt;===&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;html&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;  
        &lt;span class="c1"&gt;// HTML docs path  &lt;/span&gt;
        &lt;span class="nx"&gt;console&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;log&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;`1. Fetching HTML docs from &lt;/span&gt;&lt;span class="p"&gt;${&lt;/span&gt;&lt;span class="nx"&gt;input&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;...`&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;  
        &lt;span class="nx"&gt;spec&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nf"&gt;docsToOpenAPI&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;input&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;  
      &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="k"&gt;else&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;  
        &lt;span class="c1"&gt;// Existing OpenAPI JSON path  &lt;/span&gt;
        &lt;span class="nx"&gt;console&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;log&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;`1. Fetching OpenAPI spec from &lt;/span&gt;&lt;span class="p"&gt;${&lt;/span&gt;&lt;span class="nx"&gt;input&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;...`&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;  
        &lt;span class="nx"&gt;specPath&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;normalizeUrl&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;input&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;  
        &lt;span class="nx"&gt;spec&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nf"&gt;parseOpenAPI&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;specPath&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;  
      &lt;span class="p"&gt;}&lt;/span&gt;  
    &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="k"&gt;else&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;  
      &lt;span class="c1"&gt;// File path - check extension  &lt;/span&gt;
      &lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;input&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;endsWith&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;.json&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;  
        &lt;span class="nx"&gt;specPath&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;input&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;  
        &lt;span class="nx"&gt;spec&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nf"&gt;parseOpenAPI&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;specPath&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;  
      &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="k"&gt;else&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;  
        &lt;span class="c1"&gt;// Assume HTML docs file  &lt;/span&gt;
        &lt;span class="nx"&gt;console&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;log&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;`1. Reading HTML docs from &lt;/span&gt;&lt;span class="p"&gt;${&lt;/span&gt;&lt;span class="nx"&gt;input&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;...`&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;  
        &lt;span class="nx"&gt;spec&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nf"&gt;docsToOpenAPI&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;input&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;  
      &lt;span class="p"&gt;}&lt;/span&gt;  
    &lt;span class="p"&gt;}&lt;/span&gt;  

    &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;siteName&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;extractSiteName&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;input&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;  

    &lt;span class="nx"&gt;console&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;log&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;`2.✅ Parsed OpenAPI &lt;/span&gt;&lt;span class="p"&gt;${&lt;/span&gt;&lt;span class="nx"&gt;spec&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;openapi&lt;/span&gt; &lt;span class="o"&gt;||&lt;/span&gt; &lt;span class="nx"&gt;spec&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;swagger&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="s2"&gt; spec`&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;  
    &lt;span class="nx"&gt;console&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;log&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;`3. Generating client code...`&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;  

    &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;clientCode&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nf"&gt;generateClient&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;spec&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;input&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;  

    &lt;span class="nx"&gt;console&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;log&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;`4. Writing files...`&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;  
    &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nf"&gt;emitFiles&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;clientCode&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;siteName&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;  

    &lt;span class="nx"&gt;console&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;log&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;`5. Done! Client generated in ./generated/&lt;/span&gt;&lt;span class="p"&gt;${&lt;/span&gt;&lt;span class="nx"&gt;siteName&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;/`&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;  
  &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="k"&gt;catch &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;error&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;  
    &lt;span class="nx"&gt;console&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;error&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;❌ Error:&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;error&lt;/span&gt; &lt;span class="k"&gt;instanceof&lt;/span&gt; &lt;span class="nb"&gt;Error&lt;/span&gt; &lt;span class="p"&gt;?&lt;/span&gt; &lt;span class="nx"&gt;error&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;message&lt;/span&gt; &lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nc"&gt;String&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;error&lt;/span&gt;&lt;span class="p"&gt;));&lt;/span&gt;  
    &lt;span class="nx"&gt;process&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;exit&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;  
  &lt;span class="p"&gt;}&lt;/span&gt;  
&lt;span class="p"&gt;}&lt;/span&gt;  

&lt;span class="nf"&gt;main&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This uses the following modules:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;src/fetch.ts — URL fetching, proxy support, caching&lt;/li&gt;
&lt;li&gt;src/docs-to-openapi.ts — HTML→Markdown→LLM extraction→HTTP testing&lt;/li&gt;
&lt;li&gt;src/parse.ts — OpenAPI spec parsing and validation&lt;/li&gt;
&lt;li&gt;src/generate.ts — TypeScript client code generation&lt;/li&gt;
&lt;li&gt;src/emit.ts — Writing generated files to disk&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Implementation: The Happy Path (OpenAPI Spec)
&lt;/h2&gt;

&lt;p&gt;Let’s start with the simpler, deterministic path — when you already have an OpenAPI spec.&lt;/p&gt;

&lt;h3&gt;
  
  
  1. Fetching and Parsing the Spec
&lt;/h3&gt;

&lt;p&gt;The first step is to (obviously) get the Swagger/OpenAPI JSON, and parse it. We accept either URLs or local file paths:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="c1"&gt;// Simplified  &lt;/span&gt;
&lt;span class="k"&gt;export&lt;/span&gt; &lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="kd"&gt;function&lt;/span&gt; &lt;span class="nf"&gt;fetchOpenAPI&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;input&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kr"&gt;string&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt; &lt;span class="nb"&gt;Promise&lt;/span&gt;&lt;span class="s2"&gt;&amp;lt;any&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;  
  &lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;!&lt;/span&gt;&lt;span class="nf"&gt;isUrl&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;input&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;  
    &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;file&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;Bun&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;file&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;input&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;  
    &lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;!&lt;/span&gt;&lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;file&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;exists&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt; &lt;span class="k"&gt;throw&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;Error&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;`File not found: &lt;/span&gt;&lt;span class="p"&gt;${&lt;/span&gt;&lt;span class="nx"&gt;input&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;`&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;  
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;file&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;json&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;  
  &lt;span class="p"&gt;}&lt;/span&gt;  

  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nf"&gt;fetch&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;input&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;headers&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;Accept&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;application/json&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="p"&gt;});&lt;/span&gt;  
  &lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;!&lt;/span&gt;&lt;span class="nx"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;ok&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;throw&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;Error&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;`Failed to fetch: &lt;/span&gt;&lt;span class="p"&gt;${&lt;/span&gt;&lt;span class="nx"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;status&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="s2"&gt; &lt;/span&gt;&lt;span class="p"&gt;${&lt;/span&gt;&lt;span class="nx"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;statusText&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;`&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;  

  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;spec&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;json&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;  
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;specsDir&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;join&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;process&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;cwd&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;specs&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;  
  &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nf"&gt;mkdir&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;specsDir&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;recursive&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt; &lt;span class="p"&gt;});&lt;/span&gt;  
  &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;Bun&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;write&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;join&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;specsDir&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nf"&gt;urlToFilename&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;input&lt;/span&gt;&lt;span class="p"&gt;)),&lt;/span&gt; &lt;span class="nx"&gt;JSON&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;stringify&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;spec&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="kc"&gt;null&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;));&lt;/span&gt;  
  &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nx"&gt;spec&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;  
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;blockquote&gt;
&lt;p&gt;Read the full implementation in &lt;strong&gt;src/fetch.ts (Lines 111 to 146)&lt;/strong&gt; &lt;a href="https://github.com/sixthextinction/bun-docs-to-client/blob/main/src/fetch.ts#L111-L146" rel="noopener noreferrer"&gt;https://github.com/sixthextinction/bun-docs-to-client/blob/main/src/fetch.ts#L111-L146&lt;/a&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h3&gt;
  
  
  2. Cleaning and Validating the Spec
&lt;/h3&gt;

&lt;p&gt;OpenAPI specs sometimes have non-standard root properties that break parsers. We clean them like so:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="c1"&gt;// Simplified  &lt;/span&gt;
&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;OPENAPI_ROOT_PROPERTIES&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;Set&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;  
  &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;openapi&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;swagger&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;info&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;servers&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;paths&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;components&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  
  &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;security&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;tags&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;externalDocs&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;  
&lt;span class="p"&gt;]);&lt;/span&gt;  

&lt;span class="kd"&gt;function&lt;/span&gt; &lt;span class="nf"&gt;cleanSpec&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;spec&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kr"&gt;any&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt; &lt;span class="kr"&gt;any&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;  
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;cleaned&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kr"&gt;any&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{};&lt;/span&gt;  
  &lt;span class="k"&gt;for &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nx"&gt;key&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;value&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="k"&gt;of&lt;/span&gt; &lt;span class="nb"&gt;Object&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;entries&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;spec&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;  
    &lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;OPENAPI_ROOT_PROPERTIES&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;has&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;key&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt; &lt;span class="nx"&gt;cleaned&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nx"&gt;key&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;value&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;  
  &lt;span class="p"&gt;}&lt;/span&gt;  
  &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nx"&gt;cleaned&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;  
&lt;span class="p"&gt;}&lt;/span&gt;  

&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;parsed&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;SwaggerParser&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;validate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;cleaned&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;blockquote&gt;
&lt;p&gt;Read the full implementation in &lt;strong&gt;src/parse.ts (Lines 4–31)&lt;/strong&gt;: &lt;a href="https://github.com/sixthextinction/bun-docs-to-client/blob/main/src/parse.ts#L4-L31" rel="noopener noreferrer"&gt;https://github.com/sixthextinction/bun-docs-to-client/blob/main/src/parse.ts#L4-L31&lt;/a&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h3&gt;
  
  
  3. Generating TypeScript Code
&lt;/h3&gt;

&lt;p&gt;With a validated spec, code generation is straightforward. We iterate through &lt;code&gt;components.schemas&lt;/code&gt; to generate TypeScript interfaces, and through &lt;code&gt;paths&lt;/code&gt; to generate client methods.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="c1"&gt;// Simplified  &lt;/span&gt;
&lt;span class="kd"&gt;function&lt;/span&gt; &lt;span class="nf"&gt;generateTypes&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;spec&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kr"&gt;any&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt; &lt;span class="kr"&gt;string&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;  
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;schemas&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;spec&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;components&lt;/span&gt;&lt;span class="p"&gt;?.&lt;/span&gt;&lt;span class="nx"&gt;schemas&lt;/span&gt; &lt;span class="o"&gt;||&lt;/span&gt; &lt;span class="p"&gt;{};&lt;/span&gt;  
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;typeDefs&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kr"&gt;string&lt;/span&gt;&lt;span class="p"&gt;[]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[];&lt;/span&gt;  

  &lt;span class="k"&gt;for &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nx"&gt;name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;schema&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="k"&gt;of&lt;/span&gt; &lt;span class="nb"&gt;Object&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;entries&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;schemas&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;  
    &lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;schema&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="kd"&gt;type&lt;/span&gt; &lt;span class="o"&gt;===&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;object&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;  
      &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;props&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;schema&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;properties&lt;/span&gt; &lt;span class="o"&gt;||&lt;/span&gt; &lt;span class="p"&gt;{};&lt;/span&gt;  
      &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;required&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;schema&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;required&lt;/span&gt; &lt;span class="o"&gt;||&lt;/span&gt; &lt;span class="p"&gt;[];&lt;/span&gt;  
      &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;propDefs&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nb"&gt;Object&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;entries&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;props&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;map&lt;/span&gt;&lt;span class="p"&gt;(([&lt;/span&gt;&lt;span class="nx"&gt;propName&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;propSchema&lt;/span&gt;&lt;span class="p"&gt;]:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="kr"&gt;string&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="kr"&gt;any&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;  
        &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;optional&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="o"&gt;!&lt;/span&gt;&lt;span class="nx"&gt;required&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;includes&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;propName&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;?&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;?&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt; &lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;''&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;  
        &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="kd"&gt;type&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;mapSchemaType&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;propSchema&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;  
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="s2"&gt;`  &lt;/span&gt;&lt;span class="p"&gt;${&lt;/span&gt;&lt;span class="nx"&gt;propName&lt;/span&gt;&lt;span class="p"&gt;}${&lt;/span&gt;&lt;span class="nx"&gt;optional&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;: &lt;/span&gt;&lt;span class="p"&gt;${&lt;/span&gt;&lt;span class="kd"&gt;type&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;;`&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;  
      &lt;span class="p"&gt;});&lt;/span&gt;  
      &lt;span class="nx"&gt;typeDefs&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;push&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;`export interface &lt;/span&gt;&lt;span class="p"&gt;${&lt;/span&gt;&lt;span class="nx"&gt;name&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="s2"&gt; {&lt;/span&gt;&lt;span class="se"&gt;\\&lt;/span&gt;&lt;span class="s2"&gt;n&lt;/span&gt;&lt;span class="p"&gt;${&lt;/span&gt;&lt;span class="nx"&gt;propDefs&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;join&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="se"&gt;\\&lt;/span&gt;&lt;span class="s1"&gt;n&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)}&lt;/span&gt;&lt;span class="se"&gt;\\&lt;/span&gt;&lt;span class="s2"&gt;n}`&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;  
    &lt;span class="p"&gt;}&lt;/span&gt;  
  &lt;span class="p"&gt;}&lt;/span&gt;  
  &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nx"&gt;typeDefs&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;join&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="se"&gt;\\&lt;/span&gt;&lt;span class="s1"&gt;n&lt;/span&gt;&lt;span class="se"&gt;\\&lt;/span&gt;&lt;span class="s1"&gt;n&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;  
&lt;span class="p"&gt;}&lt;/span&gt;  

&lt;span class="kd"&gt;function&lt;/span&gt; &lt;span class="nf"&gt;generateClientClass&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;spec&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kr"&gt;any&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;baseUrl&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kr"&gt;string&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt; &lt;span class="kr"&gt;string&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;  
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;paths&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;spec&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;paths&lt;/span&gt; &lt;span class="o"&gt;||&lt;/span&gt; &lt;span class="p"&gt;{};&lt;/span&gt;  
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;methods&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kr"&gt;string&lt;/span&gt;&lt;span class="p"&gt;[]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[];&lt;/span&gt;  
  &lt;span class="k"&gt;for &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nx"&gt;path&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;pathItem&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="k"&gt;of&lt;/span&gt; &lt;span class="nb"&gt;Object&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;entries&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;paths&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;  
    &lt;span class="k"&gt;for &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nx"&gt;method&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;operation&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="k"&gt;of&lt;/span&gt; &lt;span class="nb"&gt;Object&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;entries&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;pathItem&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;  
      &lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;get&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;post&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;put&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;patch&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;delete&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="nf"&gt;includes&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;method&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;toLowerCase&lt;/span&gt;&lt;span class="p"&gt;()))&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;  
        &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;methodName&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;generateMethodName&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;path&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;method&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;operation&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;operationId&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;  
        &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;methodCode&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;generateMethod&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;methodName&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;method&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;toUpperCase&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt; &lt;span class="nx"&gt;path&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;operation&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;  
        &lt;span class="nx"&gt;methods&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;push&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;methodCode&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;  
      &lt;span class="p"&gt;}&lt;/span&gt;  
    &lt;span class="p"&gt;}&lt;/span&gt;  
  &lt;span class="p"&gt;}&lt;/span&gt;  
  &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="s2"&gt;`export class ApiClient {  
  private baseUrl: string;  
  constructor(baseUrl: string = '&lt;/span&gt;&lt;span class="p"&gt;${&lt;/span&gt;&lt;span class="nx"&gt;baseUrl&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;') { this.baseUrl = baseUrl.replace(/&lt;/span&gt;&lt;span class="se"&gt;\\&lt;/span&gt;&lt;span class="s2"&gt;\/$/, ''); }  
&lt;/span&gt;&lt;span class="p"&gt;${&lt;/span&gt;&lt;span class="nx"&gt;methods&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;join&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="se"&gt;\\&lt;/span&gt;&lt;span class="s1"&gt;n&lt;/span&gt;&lt;span class="se"&gt;\\&lt;/span&gt;&lt;span class="s1"&gt;n&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)}&lt;/span&gt;&lt;span class="s2"&gt;  
}`&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;  
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The generated methods handle path parameters, query parameters, and request bodies automatically based on the OpenAPI spec.&lt;/p&gt;
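
&lt;p&gt;To make that concrete, here’s a hand-written approximation of what one generated method might look like for a &lt;code&gt;GET /people/{id}&lt;/code&gt; endpoint. The real output comes from &lt;code&gt;generateMethod()&lt;/code&gt;, so treat this as a sketch of the shape, not verbatim generator output:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;// Hypothetical example of a generated method for GET /people/{id}
async getPeopleById(id: string): Promise&amp;lt;any&amp;gt; {
  // Substitute the path parameter into the URL template
  const url = `${this.baseUrl}/people/${encodeURIComponent(id)}`;
  const response = await fetch(url, {
    method: 'GET',
    headers: { 'Accept': 'application/json' },
  });
  if (!response.ok) throw new Error(`GET /people/{id} failed: ${response.status}`);
  return response.json();
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;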

&lt;blockquote&gt;
&lt;p&gt;Read the full code generation logic in &lt;strong&gt;src/generate.ts (Lines 24–98)&lt;/strong&gt;: &lt;a href="https://github.com/sixthextinction/bun-docs-to-client/blob/main/src/generate.ts#L24-L98" rel="noopener noreferrer"&gt;https://github.com/sixthextinction/bun-docs-to-client/blob/main/src/generate.ts#L24-L98&lt;/a&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;This is the clean, deterministic path. Spec in, typed client out.&lt;/p&gt;

&lt;p&gt;Now let’s look at what happens when we &lt;em&gt;don’t&lt;/em&gt; have a spec.&lt;/p&gt;

&lt;h2&gt;
  
  
  Implementation: The Hard Path (HTML Docs → OpenAPI JSON)
&lt;/h2&gt;

&lt;p&gt;When no OpenAPI spec exists, we have to get creative. Here’s a bird’s eye view of how we do this.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="k"&gt;export&lt;/span&gt; &lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="kd"&gt;function&lt;/span&gt; &lt;span class="nf"&gt;docsToOpenAPI&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;input&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kr"&gt;string&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt; &lt;span class="nb"&gt;Promise&lt;/span&gt;&lt;span class="p"&gt;&amp;lt;&lt;/span&gt;&lt;span class="kr"&gt;any&lt;/span&gt;&lt;span class="p"&gt;&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;  
  &lt;span class="nx"&gt;console&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;log&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;Converting HTML docs to OpenAPI spec...&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;  

  &lt;span class="c1"&gt;// 1. Fetch HTML (with proxy if configured)  &lt;/span&gt;
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;proxyOptions&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;getProxyOptions&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;  
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;html&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nf"&gt;fetch&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;input&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;proxyOptions&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="kr"&gt;any&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;then&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;r&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="nx"&gt;r&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;text&lt;/span&gt;&lt;span class="p"&gt;());&lt;/span&gt;  

  &lt;span class="c1"&gt;// 2. Convert to markdown  &lt;/span&gt;
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;turndownService&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;TurndownService&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;  
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;markdown&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;turndownService&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;turndown&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;html&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;  

  &lt;span class="c1"&gt;// 3. Extract endpoints using LLM  &lt;/span&gt;
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;endpoints&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nf"&gt;extractEndpointsWithLLM&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;markdown&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;input&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;  
  &lt;span class="nx"&gt;console&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;log&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;`   Found &lt;/span&gt;&lt;span class="p"&gt;${&lt;/span&gt;&lt;span class="nx"&gt;endpoints&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;length&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="s2"&gt; endpoints`&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;  

  &lt;span class="c1"&gt;// 4. Extract base URL  &lt;/span&gt;
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;baseUrl&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;extractBaseUrl&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;markdown&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;input&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;  
  &lt;span class="nx"&gt;console&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;log&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;`   Base URL: &lt;/span&gt;&lt;span class="p"&gt;${&lt;/span&gt;&lt;span class="nx"&gt;baseUrl&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;`&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;  

  &lt;span class="c1"&gt;// 5. Test API &amp;amp; build OpenAPI spec  &lt;/span&gt;
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;openApiSpec&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nf"&gt;exploreAndBuildSpec&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;endpoints&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;baseUrl&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;  

  &lt;span class="c1"&gt;// 6. Save OpenAPI spec JSON to ./specs/  &lt;/span&gt;
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;specsDir&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;join&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;process&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;cwd&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;specs&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;  
  &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nf"&gt;mkdir&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;specsDir&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;recursive&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt; &lt;span class="p"&gt;});&lt;/span&gt;  
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;filename&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;urlToFilename&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;input&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;  
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;cachePath&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;join&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;specsDir&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;filename&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;  
  &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;Bun&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;write&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;cachePath&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;JSON&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;stringify&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;openApiSpec&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="kc"&gt;null&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;));&lt;/span&gt;  
  &lt;span class="nx"&gt;console&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;log&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;`Cached spec to ./specs/&lt;/span&gt;&lt;span class="p"&gt;${&lt;/span&gt;&lt;span class="nx"&gt;filename&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;`&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;  

  &lt;span class="c1"&gt;// 7. Validate &amp;amp; return  &lt;/span&gt;
  &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;SwaggerParser&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;validate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;openApiSpec&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;  
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Let’s go over these steps one-by-one.&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 1: Fetch HTML with Proxy Support
&lt;/h3&gt;

&lt;p&gt;First, we fetch the documentation HTML. Proxy support is optional but built in, which comes in handy for sites behind Cloudflare or with aggressive rate limiting.&lt;/p&gt;

&lt;p&gt;And &lt;code&gt;getProxyOptions()&lt;/code&gt; uses proxy credentials from an &lt;code&gt;.env&lt;/code&gt; file (Bun reads &lt;code&gt;.env&lt;/code&gt; files out-of-the-box) to build a proxy config and return the corresponding fetch options. I’m using Bright Data’s &lt;a href="https://get.brightdata.com/bd-residential-proxies?utm_content=how_to_build_a_bun_cli_that_turns_api_docs_into_typescript_clients" rel="noopener noreferrer"&gt;residential proxies&lt;/a&gt; for this. You’ll have to sign up &lt;a href="https://get.brightdata.com/bd7914?utm_content=how_to_build_a_bun_cli_that_turns_api_docs_into_typescript_clients" rel="noopener noreferrer"&gt;here&lt;/a&gt; to get those credentials. Or, just use your provider of choice.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://get.brightdata.com/bd7914?utm_content=how_to_build_a_bun_cli_that_turns_api_docs_into_typescript_clients&amp;amp;source=post_page-----1501eb78df1a---------------------------------------" rel="noopener noreferrer"&gt;Bright Data - All in One Platform for Proxies and Web Scraping&lt;/a&gt;&lt;br&gt;
&lt;/p&gt;
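
&lt;p&gt;For reference, the &lt;code&gt;.env&lt;/code&gt; file only needs the three Bright Data variables that &lt;code&gt;getProxyOptions()&lt;/code&gt; reads below (the values here are placeholders):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;BRIGHT_DATA_CUSTOMER_ID=your_customer_id
BRIGHT_DATA_ZONE=your_zone_name
BRIGHT_DATA_PASSWORD=your_zone_password
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;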

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="k"&gt;export&lt;/span&gt; &lt;span class="kd"&gt;function&lt;/span&gt; &lt;span class="nf"&gt;getProxyOptions&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt; &lt;span class="nb"&gt;Record&lt;/span&gt;&lt;span class="p"&gt;&amp;lt;&lt;/span&gt;&lt;span class="kr"&gt;string&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="kr"&gt;any&lt;/span&gt;&lt;span class="p"&gt;&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;  
  &lt;span class="c1"&gt;// Don't want to use a proxy? Simply don't set these in your .env file  &lt;/span&gt;
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;customerId&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;process&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;env&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;BRIGHT_DATA_CUSTOMER_ID&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;  
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;zone&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;process&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;env&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;BRIGHT_DATA_ZONE&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;  
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;password&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;process&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;env&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;BRIGHT_DATA_PASSWORD&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;  

  &lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;customerId&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; &lt;span class="nx"&gt;zone&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; &lt;span class="nx"&gt;password&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;  
    &lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;!&lt;/span&gt;&lt;span class="nx"&gt;proxyStatusLogged&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;  
      &lt;span class="nx"&gt;console&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;log&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;Proxy config found! Using proxy to fetch docs site.&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;  
      &lt;span class="nx"&gt;proxyStatusLogged&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;  
    &lt;span class="p"&gt;}&lt;/span&gt;  
    &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;proxy&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;`http://brd-customer-&lt;/span&gt;&lt;span class="p"&gt;${&lt;/span&gt;&lt;span class="nx"&gt;customerId&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;-zone-&lt;/span&gt;&lt;span class="p"&gt;${&lt;/span&gt;&lt;span class="nx"&gt;zone&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;:&lt;/span&gt;&lt;span class="p"&gt;${&lt;/span&gt;&lt;span class="nx"&gt;password&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;@brd.superproxy.io:33335`&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;  
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;  
      &lt;span class="nx"&gt;proxy&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  
      &lt;span class="na"&gt;tls&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;  
        &lt;span class="na"&gt;rejectUnauthorized&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;false&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="c1"&gt;// Required for Bright Data proxy  &lt;/span&gt;
      &lt;span class="p"&gt;},&lt;/span&gt;  
    &lt;span class="p"&gt;};&lt;/span&gt;  
  &lt;span class="p"&gt;}&lt;/span&gt;  

  &lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;!&lt;/span&gt;&lt;span class="nx"&gt;proxyStatusLogged&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;  
    &lt;span class="nx"&gt;console&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;log&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;No proxy config found, using direct connection&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;  
    &lt;span class="nx"&gt;proxyStatusLogged&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;  
  &lt;span class="p"&gt;}&lt;/span&gt;  

  &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;{};&lt;/span&gt;  
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Step 2: Convert HTML to Markdown
&lt;/h3&gt;

&lt;p&gt;Next, we’ll convert the HTML page into clean markdown using &lt;a href="https://github.com/mixmark-io/turndown" rel="noopener noreferrer"&gt;Turndown&lt;/a&gt;. Markdown is &lt;em&gt;way&lt;/em&gt; easier for LLMs to parse than HTML soup.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;turndownService&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;TurndownService&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;  
&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;markdown&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;turndownService&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;turndown&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;html&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Step 3: LLM-Powered Endpoint Extraction
&lt;/h3&gt;

&lt;p&gt;I’m using &lt;a href="https://huggingface.co/unsloth/Qwen3-4B-Instruct-2507-GGUF" rel="noopener noreferrer"&gt;Qwen3-4B-Instruct-2507&lt;/a&gt; running locally via Ollama. It’s a small, sturdy ~4-billion-parameter model, only ~2GB when 4-bit quantized, and it holds up remarkably well despite the ~4x size reduction versus FP16.&lt;/p&gt;

&lt;p&gt;The prompt includes concrete few-shot examples and explicit exclusions.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight markdown"&gt;&lt;code&gt;You are an API documentation parser. Extract all API endpoints from the following markdown documentation.  

Base URL: ${baseUrl}  

Documentation:  
${markdown}  

Extract all API endpoints mentioned in the documentation. For each endpoint, identify:  
&lt;span class="p"&gt;1.&lt;/span&gt; The path (normalize path parameters like /people/1/ to /people/{id}/)  
&lt;span class="p"&gt;2.&lt;/span&gt; HTTP method (GET, POST, PUT, DELETE, etc.)  
&lt;span class="p"&gt;3.&lt;/span&gt; Query parameters (if any)  
&lt;span class="p"&gt;4.&lt;/span&gt; Path parameters (if any, like {id}, {category}, etc.)  
&lt;span class="p"&gt;5.&lt;/span&gt; Brief description if available  

Return ONLY a JSON array of endpoints in this exact format:  
[  
  {  
    "path": "/jokes/random",  
    "method": "GET",  
    "queryParams": ["category"],  
    "pathParams": [],  
    "description": "Get a random joke"  
  },  
  {  
    "path": "/people/{id}",  
    "method": "GET",  
    "queryParams": [],  
    "pathParams": ["id"],  
    "description": "Get a specific person"  
  }  
]  
Only include actual API endpoints. Exclude:  
&lt;span class="p"&gt;-&lt;/span&gt; Image URLs (/img/, .png, .jpg, etc.)  
&lt;span class="p"&gt;-&lt;/span&gt; Static assets (/css/, /js/, etc.)  
&lt;span class="p"&gt;-&lt;/span&gt; OAuth endpoints (/oauth/, /connect/)  
&lt;span class="p"&gt;-&lt;/span&gt; External links (different domains)  
&lt;span class="p"&gt;-&lt;/span&gt; Social media links (/twitter/, /github/, etc.)  
&lt;span class="p"&gt;-&lt;/span&gt; Very long paths that look like base64 data  

Return ONLY the JSON array, no other text.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="k"&gt;try&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;  
    &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;ollamaUrl&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;process&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;env&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;OLLAMA_URL&lt;/span&gt; &lt;span class="o"&gt;||&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;http://localhost:11434&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;  
    &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;model&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;process&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;env&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;OLLAMA_MODEL&lt;/span&gt; &lt;span class="o"&gt;||&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;hf.co/unsloth/Qwen3-4B-Instruct-2507-GGUF:Q4_K_M&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;  

    &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nf"&gt;fetch&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;`&lt;/span&gt;&lt;span class="p"&gt;${&lt;/span&gt;&lt;span class="nx"&gt;ollamaUrl&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;/api/chat`&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;  
      &lt;span class="na"&gt;method&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;POST&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  
      &lt;span class="na"&gt;headers&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;Content-Type&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;application/json&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt; &lt;span class="p"&gt;},&lt;/span&gt;  
      &lt;span class="na"&gt;body&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;JSON&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;stringify&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;  
        &lt;span class="nx"&gt;model&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  
        &lt;span class="na"&gt;messages&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[{&lt;/span&gt; &lt;span class="na"&gt;role&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;user&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="na"&gt;content&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;prompt&lt;/span&gt; &lt;span class="p"&gt;}],&lt;/span&gt;  
        &lt;span class="na"&gt;stream&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;false&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  
        &lt;span class="na"&gt;format&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;json&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="c1"&gt;// Ollama's structured output mode  &lt;/span&gt;
        &lt;span class="na"&gt;options&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;temperature&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mf"&gt;0.1&lt;/span&gt; &lt;span class="p"&gt;},&lt;/span&gt; &lt;span class="c1"&gt;// Low for deterministic output  &lt;/span&gt;
      &lt;span class="p"&gt;}),&lt;/span&gt;  
    &lt;span class="p"&gt;});&lt;/span&gt;  

    &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;data&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;json&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;  
    &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;content&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;data&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;message&lt;/span&gt;&lt;span class="p"&gt;?.&lt;/span&gt;&lt;span class="nx"&gt;content&lt;/span&gt; &lt;span class="o"&gt;||&lt;/span&gt; &lt;span class="nx"&gt;data&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;response&lt;/span&gt; &lt;span class="o"&gt;||&lt;/span&gt; &lt;span class="dl"&gt;''&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;  

    &lt;span class="c1"&gt;// Save LLM response to debug file for troubleshooting  &lt;/span&gt;
    &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;debugPath&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;join&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;process&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;cwd&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;debug&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s2"&gt;`&lt;/span&gt;&lt;span class="p"&gt;${&lt;/span&gt;&lt;span class="nx"&gt;siteName&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;_&lt;/span&gt;&lt;span class="p"&gt;${&lt;/span&gt;&lt;span class="nb"&gt;Date&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;now&lt;/span&gt;&lt;span class="p"&gt;()}&lt;/span&gt;&lt;span class="s2"&gt;.md`&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;  
    &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;Bun&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;write&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;debugPath&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;content&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;  

    &lt;span class="c1"&gt;// Parse JSON (might be wrapped in markdown code blocks)  &lt;/span&gt;
    &lt;span class="kd"&gt;let&lt;/span&gt; &lt;span class="nx"&gt;jsonStr&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;content&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;trim&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;  
      &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;replace&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sr"&gt;/^``&lt;/span&gt;&lt;span class="err"&gt;`
&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="o"&gt;%&lt;/span&gt; &lt;span class="nx"&gt;endraw&lt;/span&gt; &lt;span class="o"&gt;%&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="nx"&gt;json&lt;/span&gt;&lt;span class="err"&gt;\\&lt;/span&gt;&lt;span class="nx"&gt;n&lt;/span&gt;&lt;span class="p"&gt;?&lt;/span&gt;&lt;span class="sr"&gt;/i, ''&lt;/span&gt;&lt;span class="se"&gt;)&lt;/span&gt;&lt;span class="sr"&gt; &lt;/span&gt;&lt;span class="err"&gt; 
&lt;/span&gt;      &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;replace&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sr"&gt;/&lt;/span&gt;&lt;span class="se"&gt;\\&lt;/span&gt;&lt;span class="sr"&gt;n&lt;/span&gt;&lt;span class="err"&gt;?
&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="o"&gt;%&lt;/span&gt; &lt;span class="nx"&gt;raw&lt;/span&gt; &lt;span class="o"&gt;%&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="s2"&gt;```$/i, '');  

    const jsonMatch = jsonStr.match(/&lt;/span&gt;&lt;span class="se"&gt;\\&lt;/span&gt;&lt;span class="s2"&gt;[[&lt;/span&gt;&lt;span class="se"&gt;\\&lt;/span&gt;&lt;span class="s2"&gt;s&lt;/span&gt;&lt;span class="se"&gt;\\&lt;/span&gt;&lt;span class="s2"&gt;S]\*&lt;/span&gt;&lt;span class="se"&gt;\\&lt;/span&gt;&lt;span class="s2"&gt;]/);  
    if (jsonMatch) jsonStr = jsonMatch[0];  

    const endpoints = JSON.parse(jsonStr);  

    // Validate and normalize  
    return endpoints  
      .filter(e =&amp;gt; e.path &amp;amp;&amp;amp; e.method)  
      .map(e =&amp;gt; ({  
        ...e,  
        path: normalizePath(e.path), // /people/123 → /people/{id}  
        method: e.method.toUpperCase(),  
      }));  

  } catch (error) {  
    console.warn('LLM extraction failed, falling back to regex...');  
    return extractEndpoints(markdown, inputUrl); // Regex fallback  
  }  
}
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Key details:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Temperature at 0.1&lt;/strong&gt; — we want as deterministic an output as possible&lt;/li&gt;
&lt;li&gt;Ollama’s &lt;code&gt;format: 'json'&lt;/code&gt; option (consider &lt;a href="https://docs.ollama.com/capabilities/structured-outputs#generating-structured-json-with-a-schema" rel="noopener noreferrer"&gt;passing a Zod schema&lt;/a&gt; to enforce the exact shape; a sketch of the schema-based variant follows this list)&lt;/li&gt;
&lt;li&gt;We save the LLM response to a debug file for troubleshooting&lt;/li&gt;
&lt;li&gt;On failure, we fall back to regex extraction&lt;/li&gt;
&lt;/ul&gt;
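
&lt;p&gt;On that second bullet: Ollama can also take a full JSON schema as the &lt;code&gt;format&lt;/code&gt; value instead of the string &lt;code&gt;'json'&lt;/code&gt;, which pins the output to an exact shape (a Zod schema can be converted to one first). Here’s a minimal sketch of that variant; the schema is hypothetical and trimmed down, not the project’s real one:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;// Sketch: constrain Ollama's output with a JSON schema instead of format: 'json'.
// Trimmed-down, hypothetical schema; the prompt would then ask for this object shape.
const endpointListSchema = {
  type: 'object',
  properties: {
    endpoints: {
      type: 'array',
      items: {
        type: 'object',
        properties: {
          path: { type: 'string' },
          method: { type: 'string' },
          queryParams: { type: 'array', items: { type: 'string' } },
          pathParams: { type: 'array', items: { type: 'string' } },
          description: { type: 'string' },
        },
        required: ['path', 'method'],
      },
    },
  },
  required: ['endpoints'],
};

// Same /api/chat request as before; `model` and `prompt` are the values built above
const body = JSON.stringify({
  model,
  messages: [{ role: 'user', content: prompt }],
  stream: false,
  format: endpointListSchema, // JSON schema in place of the looser 'json' mode
  options: { temperature: 0.1 },
});
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;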

&lt;p&gt;The LLM response is not to be trusted: we strip markdown code fences if present, parse the JSON, and validate every entry before using it.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Read the full LLM extraction in &lt;strong&gt;src/docs-to-openapi.ts&lt;/strong&gt; (lines 22–151): &lt;a href="https://github.com/sixthextinction/bun-docs-to-client/blob/main/src/docs-to-openapi.ts#L22-L151" rel="noopener noreferrer"&gt;https://github.com/sixthextinction/bun-docs-to-client/blob/main/src/docs-to-openapi.ts#L22-L151&lt;/a&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h3&gt;
  
  
  Step 4: Extract Base URL &amp;amp; Normalize Paths
&lt;/h3&gt;

&lt;p&gt;Before we go further, we need to figure out the API’s base URL. We try a few strategies, in order (a rough sketch follows the list):&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Find URLs in code examples that look like API endpoints,&lt;/li&gt;
&lt;li&gt;Fall back to the first valid URL found in the docs,&lt;/li&gt;
&lt;li&gt;Or infer it from the input URL itself.&lt;/li&gt;
&lt;/ul&gt;
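
&lt;p&gt;A rough sketch of that fallback chain, assuming we simply scan the markdown for absolute URLs (this is a hypothetical helper; the real &lt;code&gt;extractBaseUrl()&lt;/code&gt; in &lt;code&gt;src/docs-to-openapi.ts&lt;/code&gt; has more heuristics):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;// Hypothetical sketch of the base-URL fallback chain described above
function extractBaseUrlSketch(markdown: string, inputUrl: string): string {
  // 1. Collect absolute URLs that appear anywhere in the docs
  const urls = markdown.match(/https?:\/\/[^\s)"'`]+/g) ?? [];

  // 2. Prefer ones that look like API endpoints (an /api/ or version segment)
  const apiLike = urls.find(u =&amp;gt; /\/(api|v\d+)(\/|$)/.test(u));
  const candidate = apiLike ?? urls[0];
  if (candidate) return new URL(candidate).origin;

  // 3. Fall back to the origin of the docs page itself
  return new URL(inputUrl).origin;
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;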

&lt;p&gt;It’s also essential to normalize paths to OpenAPI format. The reason is obvious: multiple &lt;code&gt;getId()&lt;/code&gt; functions are useless to us. What we want are functions like &lt;code&gt;getPeopleById()&lt;/code&gt;, &lt;code&gt;getItemById()&lt;/code&gt;, etc.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="c1"&gt;// Complete implementation - handles multiple ID formats  &lt;/span&gt;
&lt;span class="kd"&gt;function&lt;/span&gt; &lt;span class="nf"&gt;normalizePath&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;path&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kr"&gt;string&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt; &lt;span class="kr"&gt;string&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;  
  &lt;span class="c1"&gt;// Replace numeric IDs with {id}  &lt;/span&gt;
  &lt;span class="c1"&gt;// /people/123/ → /people/{id}/  &lt;/span&gt;
  &lt;span class="c1"&gt;// /people/123 → /people/{id}  &lt;/span&gt;
  &lt;span class="kd"&gt;let&lt;/span&gt; &lt;span class="nx"&gt;normalized&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;path&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;replace&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sr"&gt;/&lt;/span&gt;&lt;span class="se"&gt;\/\\&lt;/span&gt;&lt;span class="sr"&gt;d+&lt;/span&gt;&lt;span class="se"&gt;\/&lt;/span&gt;&lt;span class="sr"&gt;/g&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;/{id}/&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;replace&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sr"&gt;/&lt;/span&gt;&lt;span class="se"&gt;\/\\&lt;/span&gt;&lt;span class="sr"&gt;d+$/g&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;/{id}&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;  

  &lt;span class="c1"&gt;// Replace :id with {id} (Express/Fastify style)  &lt;/span&gt;
  &lt;span class="c1"&gt;// /people/:id/ → /people/{id}/  &lt;/span&gt;
  &lt;span class="nx"&gt;normalized&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;normalized&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;replace&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sr"&gt;/&lt;/span&gt;&lt;span class="se"&gt;\/&lt;/span&gt;&lt;span class="sr"&gt;:&lt;/span&gt;&lt;span class="se"&gt;(\\&lt;/span&gt;&lt;span class="sr"&gt;w+&lt;/span&gt;&lt;span class="se"&gt;)\/&lt;/span&gt;&lt;span class="sr"&gt;/g&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;/{$1}/&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;replace&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sr"&gt;/&lt;/span&gt;&lt;span class="se"&gt;\/&lt;/span&gt;&lt;span class="sr"&gt;:&lt;/span&gt;&lt;span class="se"&gt;(\\&lt;/span&gt;&lt;span class="sr"&gt;w+&lt;/span&gt;&lt;span class="se"&gt;)&lt;/span&gt;&lt;span class="sr"&gt;$/g&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;/{$1}&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;  

  &lt;span class="c1"&gt;// Ensure trailing slash consistency  &lt;/span&gt;
  &lt;span class="nx"&gt;normalized&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;normalized&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;replace&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sr"&gt;/&lt;/span&gt;&lt;span class="se"&gt;\/&lt;/span&gt;&lt;span class="sr"&gt;$/&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="dl"&gt;''&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;||&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;/&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;  

  &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nx"&gt;normalized&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;  
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;blockquote&gt;
&lt;p&gt;Read the full implementation in src/docs-to-openapi.ts (lines 194–204): &lt;a href="https://github.com/sixthextinction/bun-docs-to-client/blob/main/src/docs-to-openapi.ts#L194-L204" rel="noopener noreferrer"&gt;https://github.com/sixthextinction/bun-docs-to-client/blob/main/src/docs-to-openapi.ts#L194-L204&lt;/a&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h3&gt;
  
  
  Step 5: Test-Driven Schema Generation
&lt;/h3&gt;

&lt;p&gt;This is the most critical — and hence, biggest — part. The function &lt;code&gt;exploreAndBuildSpec()&lt;/code&gt; takes endpoints from the LLM output and tests them with real HTTP requests.&lt;/p&gt;

&lt;p&gt;We categorize endpoints like so (a rough sketch of this check follows the list):&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;list&lt;/strong&gt; (e.g. &lt;code&gt;/people&lt;/code&gt;, &lt;code&gt;/jokes&lt;/code&gt;),&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;detail&lt;/strong&gt; (e.g. &lt;code&gt;/people/{id}&lt;/code&gt;),&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;query&lt;/strong&gt; (e.g. &lt;code&gt;/search?q={query}&lt;/code&gt;).&lt;/li&gt;
&lt;/ul&gt;
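
&lt;p&gt;The categorization itself can be as simple as checking the parameters the LLM extracted for each endpoint. A minimal sketch (a hypothetical helper, not the repo’s exact code):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;// Hypothetical categorization helper based on the extracted endpoint metadata
type EndpointKind = 'list' | 'detail' | 'query';

function categorizeEndpoint(endpoint: {
  path: string;
  pathParams: string[];
  queryParams: string[];
}): EndpointKind {
  // Detail endpoints carry a path parameter, e.g. /people/{id}
  if (endpoint.pathParams.length &amp;gt; 0 || endpoint.path.includes('{')) return 'detail';
  // Query endpoints rely on query parameters, e.g. /search?q={query}
  if (endpoint.queryParams.length &amp;gt; 0) return 'query';
  // Everything else is treated as a plain list endpoint, e.g. /people
  return 'list';
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;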

&lt;p&gt;We test &lt;strong&gt;list&lt;/strong&gt; endpoints first — they give us schemas and sample IDs for detail endpoints.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="kd"&gt;function&lt;/span&gt; &lt;span class="nf"&gt;testEndpoint&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;baseUrl&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kr"&gt;string&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;endpoint&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;Endpoint&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt; &lt;span class="nb"&gt;Promise&lt;/span&gt;&lt;span class="p"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nx"&gt;ApiResponse&lt;/span&gt;&lt;span class="p"&gt;&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;  
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nf"&gt;fetch&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;`&lt;/span&gt;&lt;span class="p"&gt;${&lt;/span&gt;&lt;span class="nx"&gt;baseUrl&lt;/span&gt;&lt;span class="p"&gt;}${&lt;/span&gt;&lt;span class="nx"&gt;endpoint&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;path&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;`&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;  
    &lt;span class="na"&gt;method&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;endpoint&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;method&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  
    &lt;span class="na"&gt;headers&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;Accept&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;application/json&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt;  
  &lt;span class="p"&gt;});&lt;/span&gt;  
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;data&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;json&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="k"&gt;catch&lt;/span&gt;&lt;span class="p"&gt;(()&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;({}));&lt;/span&gt;  
  &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;status&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;status&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;data&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="na"&gt;headers&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{...}&lt;/span&gt; &lt;span class="p"&gt;};&lt;/span&gt;  
&lt;span class="p"&gt;}&lt;/span&gt;  

&lt;span class="k"&gt;for &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;endpoint&lt;/span&gt; &lt;span class="k"&gt;of&lt;/span&gt; &lt;span class="nx"&gt;listEndpoints&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;  
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nf"&gt;testEndpoint&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;baseUrl&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;endpoint&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;  
  &lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;status&lt;/span&gt; &lt;span class="o"&gt;===&lt;/span&gt; &lt;span class="mi"&gt;200&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;  
    &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;schema&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;inferSchema&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;data&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nf"&gt;inferSchemaName&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;endpoint&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;path&lt;/span&gt;&lt;span class="p"&gt;));&lt;/span&gt;  
    &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;ids&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;extractIds&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;data&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;  
    &lt;span class="c1"&gt;// Test matching detail endpoints with ids.slice(0, 2)...  &lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;  
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;inferSchema()&lt;/strong&gt; is recursive and handles all JSON types (primitives, arrays, nested objects, null):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="kd"&gt;function&lt;/span&gt; &lt;span class="nf"&gt;inferSchema&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;data&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kr"&gt;any&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;name&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kr"&gt;string&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt; &lt;span class="kr"&gt;any&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;  
  &lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;!&lt;/span&gt;&lt;span class="nx"&gt;data&lt;/span&gt; &lt;span class="o"&gt;||&lt;/span&gt; &lt;span class="k"&gt;typeof&lt;/span&gt; &lt;span class="nx"&gt;data&lt;/span&gt; &lt;span class="o"&gt;!==&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;object&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="k"&gt;typeof&lt;/span&gt; &lt;span class="nx"&gt;data&lt;/span&gt; &lt;span class="p"&gt;};&lt;/span&gt;  
  &lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;Array&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;isArray&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;data&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;  
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;array&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="na"&gt;items&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;data&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;length&lt;/span&gt; &lt;span class="p"&gt;?&lt;/span&gt; &lt;span class="nf"&gt;inferSchema&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;data&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="nx"&gt;name&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;object&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="p"&gt;};&lt;/span&gt;  
  &lt;span class="p"&gt;}&lt;/span&gt;  
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;schema&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;object&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="na"&gt;properties&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{}&lt;/span&gt; &lt;span class="p"&gt;};&lt;/span&gt;  
  &lt;span class="k"&gt;for &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nx"&gt;key&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;value&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="k"&gt;of&lt;/span&gt; &lt;span class="nb"&gt;Object&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;entries&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;data&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;  
    &lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;value&lt;/span&gt; &lt;span class="o"&gt;===&lt;/span&gt; &lt;span class="kc"&gt;null&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="nx"&gt;schema&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;properties&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nx"&gt;key&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;string&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="na"&gt;nullable&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt; &lt;span class="p"&gt;};&lt;/span&gt;  
    &lt;span class="k"&gt;else&lt;/span&gt; &lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;typeof&lt;/span&gt; &lt;span class="nx"&gt;value&lt;/span&gt; &lt;span class="o"&gt;===&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;string&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="nx"&gt;schema&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;properties&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nx"&gt;key&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;string&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt; &lt;span class="p"&gt;};&lt;/span&gt;  
    &lt;span class="k"&gt;else&lt;/span&gt; &lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;typeof&lt;/span&gt; &lt;span class="nx"&gt;value&lt;/span&gt; &lt;span class="o"&gt;===&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;number&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="nx"&gt;schema&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;properties&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nx"&gt;key&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;Number&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;isInteger&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;value&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;?&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;integer&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt; &lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;number&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt; &lt;span class="p"&gt;};&lt;/span&gt;  
    &lt;span class="k"&gt;else&lt;/span&gt; &lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;typeof&lt;/span&gt; &lt;span class="nx"&gt;value&lt;/span&gt; &lt;span class="o"&gt;===&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;boolean&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="nx"&gt;schema&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;properties&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nx"&gt;key&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;boolean&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt; &lt;span class="p"&gt;};&lt;/span&gt;  
    &lt;span class="k"&gt;else&lt;/span&gt; &lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;Array&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;isArray&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;value&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt; &lt;span class="nx"&gt;schema&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;properties&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nx"&gt;key&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;array&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="na"&gt;items&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;value&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="p"&gt;?&lt;/span&gt; &lt;span class="nf"&gt;inferSchema&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;value&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="nx"&gt;name&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;string&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="p"&gt;};&lt;/span&gt;  
    &lt;span class="k"&gt;else&lt;/span&gt; &lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;typeof&lt;/span&gt; &lt;span class="nx"&gt;value&lt;/span&gt; &lt;span class="o"&gt;===&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;object&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="nx"&gt;schema&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;properties&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nx"&gt;key&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;inferSchema&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;value&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;name&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;  
  &lt;span class="p"&gt;}&lt;/span&gt;  
  &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nx"&gt;schema&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;  
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
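
&lt;p&gt;To make the output concrete, here’s an illustrative run (the sample object is mine, not from the repo) showing what &lt;code&gt;inferSchema()&lt;/code&gt; produces for a SWAPI-style record:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;// Illustrative only; assumes inferSchema() as defined above.
const sample = { name: 'Luke Skywalker', height: '172', mass: 77, films: ['A New Hope'] };

console.log(JSON.stringify(inferSchema(sample, 'Person'), null, 2));
// {
//   "type": "object",
//   "properties": {
//     "name": { "type": "string" },
//     "height": { "type": "string" },
//     "mass": { "type": "integer" },
//     "films": { "type": "array", "items": { "type": "string" } }
//   }
// }
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;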



&lt;p&gt;&lt;strong&gt;extractIds()&lt;/strong&gt; pulls IDs from list responses — handles item.id, item.url (regex), and nested data.results for pagination:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="kd"&gt;function&lt;/span&gt; &lt;span class="nf"&gt;extractIds&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;data&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kr"&gt;any&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt; &lt;span class="kr"&gt;string&lt;/span&gt;&lt;span class="p"&gt;[]&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;  
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;ids&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kr"&gt;string&lt;/span&gt;&lt;span class="p"&gt;[]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[];&lt;/span&gt;  
  &lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;Array&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;isArray&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;data&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;  
    &lt;span class="k"&gt;for &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;item&lt;/span&gt; &lt;span class="k"&gt;of&lt;/span&gt; &lt;span class="nx"&gt;data&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;slice&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;  
      &lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;item&lt;/span&gt;&lt;span class="p"&gt;?.&lt;/span&gt;&lt;span class="nx"&gt;id&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="nx"&gt;ids&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;push&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nc"&gt;String&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;item&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;id&lt;/span&gt;&lt;span class="p"&gt;));&lt;/span&gt;  
      &lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;item&lt;/span&gt;&lt;span class="p"&gt;?.&lt;/span&gt;&lt;span class="nx"&gt;url&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;  
        &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;match&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;item&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;url&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;match&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sr"&gt;/&lt;/span&gt;&lt;span class="se"&gt;\/(\\&lt;/span&gt;&lt;span class="sr"&gt;d+&lt;/span&gt;&lt;span class="se"&gt;)\/?&lt;/span&gt;&lt;span class="sr"&gt;$/&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;  
        &lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;match&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="nx"&gt;ids&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;push&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;match&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;]);&lt;/span&gt;  
      &lt;span class="p"&gt;}&lt;/span&gt;  
    &lt;span class="p"&gt;}&lt;/span&gt;  
  &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="k"&gt;else&lt;/span&gt; &lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;data&lt;/span&gt;&lt;span class="p"&gt;?.&lt;/span&gt;&lt;span class="nx"&gt;results&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; &lt;span class="nb"&gt;Array&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;isArray&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;data&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;results&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;  
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nf"&gt;extractIds&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;data&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;results&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;  
  &lt;span class="p"&gt;}&lt;/span&gt;  
  &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nx"&gt;ids&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;  
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
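
&lt;p&gt;As a quick sanity check, here’s an illustrative call (the SWAPI-shaped response is mine) showing how the nested &lt;code&gt;results&lt;/code&gt; branch and the trailing-number regex combine:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;// Illustrative only; assumes extractIds() as defined above.
const listResponse = {
  results: [
    { name: 'Luke Skywalker', url: 'https://swapi.dev/api/people/1/' },
    { name: 'C-3PO', url: 'https://swapi.dev/api/people/2/' },
  ],
};

// The nested `results` array is handled by the recursive branch,
// and each trailing numeric segment is captured by the URL regex.
console.log(extractIds(listResponse)); // ["1", "2"]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;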



&lt;p&gt;With those IDs in hand, we can now test &lt;strong&gt;detail endpoints&lt;/strong&gt;: match each one to its parent list (e.g. &lt;code&gt;/people/{id}&lt;/code&gt; matches &lt;code&gt;/people&lt;/code&gt;), test it with &lt;code&gt;ids.slice(0, 2)&lt;/code&gt;, and build the OpenAPI path definition:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="k"&gt;for &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;detailEndpoint&lt;/span&gt; &lt;span class="k"&gt;of&lt;/span&gt; &lt;span class="nx"&gt;detailEndpoints&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;  
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;detailBasePath&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;detailEndpoint&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;path&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;replace&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;/{id}&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="dl"&gt;''&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;replace&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;/{id}/&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="dl"&gt;''&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;  
  &lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;detailBasePath&lt;/span&gt; &lt;span class="o"&gt;===&lt;/span&gt; &lt;span class="nx"&gt;endpoint&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;path&lt;/span&gt; &lt;span class="o"&gt;||&lt;/span&gt; &lt;span class="nx"&gt;detailEndpoint&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;path&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;startsWith&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;endpoint&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;path&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;/&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;  
    &lt;span class="k"&gt;for &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;id&lt;/span&gt; &lt;span class="k"&gt;of&lt;/span&gt; &lt;span class="nx"&gt;ids&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;slice&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;  
      &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;testPath&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;detailEndpoint&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;path&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;replace&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;{id}&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;id&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;  
      &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;detailResponse&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nf"&gt;testEndpoint&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;baseUrl&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="p"&gt;...&lt;/span&gt;&lt;span class="nx"&gt;detailEndpoint&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="na"&gt;path&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;testPath&lt;/span&gt; &lt;span class="p"&gt;});&lt;/span&gt;  
      &lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;detailResponse&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;status&lt;/span&gt; &lt;span class="o"&gt;===&lt;/span&gt; &lt;span class="mi"&gt;200&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;  
        &lt;span class="c1"&gt;// Infer schema, build paths[detailEndpoint.path] with $ref to schema...  &lt;/span&gt;
        &lt;span class="k"&gt;break&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;  
      &lt;span class="p"&gt;}&lt;/span&gt;  
    &lt;span class="p"&gt;}&lt;/span&gt;  
  &lt;span class="p"&gt;}&lt;/span&gt;  
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If a detail endpoint doesn’t have a matching list, we fall back to trial and error with common IDs like &lt;code&gt;1&lt;/code&gt;.&lt;/p&gt;
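
&lt;p&gt;That fallback is simple enough to sketch (hedged: the candidate values and variable names here are mine, not the repo’s):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;// Hedged sketch: when a list gave us no IDs, probe a few common
// candidates against the detail endpoint until one returns 200.
const fallbackIds = ['1', '0', 'a']; // assumed candidates

for (const candidate of ids.length ? ids : fallbackIds) {
  const probe = await testEndpoint(baseUrl, {
    ...detailEndpoint,
    path: detailEndpoint.path.replace('{id}', candidate),
  });
  if (probe.status === 200) {
    // Infer schema from probe.data, then stop probing.
    break;
  }
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;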

&lt;p&gt;For &lt;strong&gt;query endpoints&lt;/strong&gt;, we try to fetch real categories from a categories endpoint first (e.g. &lt;code&gt;/categories&lt;/code&gt;) or use sensible defaults:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="kd"&gt;let&lt;/span&gt; &lt;span class="nx"&gt;testPath&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;endpoint&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;path&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;  
&lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;endpoint&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;path&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;includes&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;{category}&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;  
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;categoriesEndpoint&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;listEndpoints&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;find&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;e&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="nx"&gt;e&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;path&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;includes&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;categor&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;));&lt;/span&gt;  
  &lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;categoriesEndpoint&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;  
    &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;catResponse&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nf"&gt;testEndpoint&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;baseUrl&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;categoriesEndpoint&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;  
    &lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;catResponse&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;status&lt;/span&gt; &lt;span class="o"&gt;===&lt;/span&gt; &lt;span class="mi"&gt;200&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; &lt;span class="nb"&gt;Array&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;isArray&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;catResponse&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;data&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;  
      &lt;span class="nx"&gt;testPath&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;endpoint&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;path&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;replace&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;{category}&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;catResponse&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;data&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;]);&lt;/span&gt;  
    &lt;span class="p"&gt;}&lt;/span&gt;  
  &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="k"&gt;else&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;  
    &lt;span class="nx"&gt;testPath&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;endpoint&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;path&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;replace&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;{category}&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;dev&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;  
  &lt;span class="p"&gt;}&lt;/span&gt;  
&lt;span class="p"&gt;}&lt;/span&gt;  
&lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;endpoint&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;path&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;includes&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;{query}&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt; &lt;span class="nx"&gt;testPath&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;endpoint&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;path&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;replace&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;{query}&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;test&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;  
&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nf"&gt;testEndpoint&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;baseUrl&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="p"&gt;...&lt;/span&gt;&lt;span class="nx"&gt;endpoint&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="na"&gt;path&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;testPath&lt;/span&gt; &lt;span class="p"&gt;});&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Our approach means the generated TypeScript types are accurate because they’re based on real API responses instead of guesses. The code handles edge cases like extracting IDs from nested results arrays and falling back to common IDs when no IDs are found in list responses.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Read the full endpoint testing in &lt;strong&gt;src/docs-to-openapi.ts&lt;/strong&gt; (lines 291–463): &lt;a href="https://github.com/sixthextinction/bun-docs-to-client/blob/main/src/docs-to-openapi.ts#L291-L463" rel="noopener noreferrer"&gt;https://github.com/sixthextinction/bun-docs-to-client/blob/main/src/docs-to-openapi.ts#L291-L463&lt;/a&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h3&gt;
  
  
  Step 6 &amp;amp; 7: Assemble, Cache, Validate
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;spec&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;  
  &lt;span class="na"&gt;openapi&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;3.0.0&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  
  &lt;span class="na"&gt;info&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;title&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;API Client&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="na"&gt;version&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;1.0.0&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="na"&gt;description&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;Generated from HTML documentation&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt; &lt;span class="p"&gt;},&lt;/span&gt;  
  &lt;span class="na"&gt;servers&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[{&lt;/span&gt; &lt;span class="na"&gt;url&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;baseUrl&lt;/span&gt; &lt;span class="p"&gt;}],&lt;/span&gt;  
  &lt;span class="na"&gt;paths&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;  
    &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;/people&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;  
      &lt;span class="na"&gt;get&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;  
        &lt;span class="na"&gt;summary&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;Get People&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  
        &lt;span class="na"&gt;responses&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;200&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;content&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;application/json&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;schema&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;array&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="na"&gt;items&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;$ref&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;#/components/schemas/Person&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt;  
      &lt;span class="p"&gt;}&lt;/span&gt;  
    &lt;span class="p"&gt;},&lt;/span&gt;  
    &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;/people/{id}&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;  
      &lt;span class="na"&gt;get&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;  
        &lt;span class="na"&gt;summary&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;Get Person by ID&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  
        &lt;span class="na"&gt;parameters&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[{&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;id&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="na"&gt;in&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;path&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="na"&gt;required&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="na"&gt;schema&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;string&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="p"&gt;}],&lt;/span&gt;  
        &lt;span class="na"&gt;responses&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;200&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;content&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;application/json&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;schema&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;$ref&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;#/components/schemas/Person&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt;  
      &lt;span class="p"&gt;}&lt;/span&gt;  
    &lt;span class="p"&gt;}&lt;/span&gt;  
  &lt;span class="p"&gt;},&lt;/span&gt;  
  &lt;span class="na"&gt;components&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;schemas&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;Person&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;object&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="na"&gt;properties&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;id&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;integer&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt; &lt;span class="p"&gt;},&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;string&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt; &lt;span class="p"&gt;},&lt;/span&gt; &lt;span class="p"&gt;...&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt;  
&lt;span class="p"&gt;};&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;We cache the generated OpenAPI schema to ./specs/ and validate with &lt;code&gt;swagger-parser&lt;/code&gt;. Once we have a confirmed working spec, &lt;strong&gt;we’re back in the happy path&lt;/strong&gt; — code generation is now identical for both paths.&lt;/p&gt;
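
&lt;p&gt;Caching and validation are small enough to sketch here (hedged: the file path and exact call sites are mine; the real version is linked below). The &lt;code&gt;swagger-parser&lt;/code&gt; package is published as &lt;code&gt;@apidevtools/swagger-parser&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;// Hedged sketch: persist the generated spec, then validate it before codegen.
import SwaggerParser from '@apidevtools/swagger-parser';

const specPath = './specs/generated-api.json'; // hypothetical filename
await Bun.write(specPath, JSON.stringify(spec, null, 2)); // Bun's file-writing API

try {
  await SwaggerParser.validate(specPath); // throws if the spec is structurally invalid
  console.log('Spec is valid, proceeding to code generation');
} catch (err) {
  console.error('Generated spec failed validation:', err);
  process.exit(1);
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;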

&lt;blockquote&gt;
&lt;p&gt;Read the full implementation in &lt;strong&gt;src/docs-to-openapi.ts&lt;/strong&gt; (lines 268–295): &lt;a href="https://github.com/sixthextinction/bun-docs-to-client/blob/main/src/docs-to-openapi.ts#L268-L295" rel="noopener noreferrer"&gt;https://github.com/sixthextinction/bun-docs-to-client/blob/main/src/docs-to-openapi.ts#L268-L295&lt;/a&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;That’s everything! Now, we can run it like this, as mentioned before:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Generate client from an API's HTML docs (not perfect, but a good starting point)  &lt;/span&gt;
bun run index.ts https://api.chucknorris.io/  

&lt;span class="c"&gt;# Or from an existing OpenAPI spec (perfect)  &lt;/span&gt;
bun run index.ts https://cataas.com/doc.json  

&lt;span class="c"&gt;# Or from a local OpenAPI spec file (also perfect)  &lt;/span&gt;
bun run index.ts ./specs/my-api.json
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Or, even better for the end user, package it into a fully standalone executable file that won’t need any package installs — or even Bun installed — on the user’s PC.&lt;/p&gt;

&lt;h2&gt;
  
  
  Building a Standalone Executable with Bun
&lt;/h2&gt;

&lt;p&gt;This is where Bun really shines. The entire CLI compiles into a single executable:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Cross-platform Builds:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Windows  &lt;/span&gt;
bun build &lt;span class="nt"&gt;--compile&lt;/span&gt; &lt;span class="nt"&gt;--target&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;bun-windows-x64 ./index.ts &lt;span class="nt"&gt;--outfile&lt;/span&gt; ./bin/dtoc.exe  
&lt;span class="c"&gt;# Linux  &lt;/span&gt;
bun build &lt;span class="nt"&gt;--compile&lt;/span&gt; &lt;span class="nt"&gt;--target&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;bun-linux-x64 ./index.ts &lt;span class="nt"&gt;--outfile&lt;/span&gt; ./bin/dtoc  
&lt;span class="c"&gt;# macOS (Intel)  &lt;/span&gt;
bun build &lt;span class="nt"&gt;--compile&lt;/span&gt; &lt;span class="nt"&gt;--target&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;bun-darwin-x64 ./index.ts &lt;span class="nt"&gt;--outfile&lt;/span&gt; ./bin/dtoc  
&lt;span class="c"&gt;# macOS (Apple Silicon)  &lt;/span&gt;
bun build &lt;span class="nt"&gt;--compile&lt;/span&gt; &lt;span class="nt"&gt;--target&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;bun-darwin-arm64 ./index.ts &lt;span class="nt"&gt;--outfile&lt;/span&gt; ./bin/dtoc
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;What this does:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;--compile&lt;/strong&gt;: Bundles your code + the Bun runtime into a single binary&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;--target&lt;/strong&gt;: Platform (&lt;code&gt;bun-windows-x64&lt;/code&gt;, &lt;code&gt;bun-linux-x64&lt;/code&gt;, &lt;code&gt;bun-darwin-arm64&lt;/code&gt;, etc.)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;--outfile&lt;/strong&gt;: Where to write the executable&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The result is a ~100–150MB standalone executable that runs on the target machine with no Bun install required, reads .env files (great for proxy credentials), bundles all dependencies, starts near-instantly, and can be distributed via GitHub releases or npm.&lt;/p&gt;
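
&lt;p&gt;For example (a hedged sketch; the variable name is mine), the compiled binary can still pick up proxy credentials from a &lt;code&gt;.env&lt;/code&gt; file sitting next to it:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;// Hedged sketch: Bun loads .env automatically, even for compiled binaries,
// so proxy credentials never need to be baked into the source.
const proxy = process.env.PROXY_URL; // e.g. http://user:pass@host:port (assumed variable name)

const res = await fetch('https://example.com/docs', {
  // `proxy` is a Bun-specific fetch option; it's ignored outside Bun.
  ...(proxy ? { proxy } : {}),
});
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;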

&lt;p&gt;That’s it. Nothing else needed. This was my main learning goal — Bun makes CLI creation + distribution trivial.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Read the build configuration in package.json: &lt;a href="https://github.com/sixthextinction/bun-docs-to-client/blob/main/package.json" rel="noopener noreferrer"&gt;https://github.com/sixthextinction/bun-docs-to-client/blob/main/package.json&lt;/a&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2&gt;
  
  
  Real-World Example
&lt;/h2&gt;

&lt;p&gt;Let’s see this in action with an API that has no OpenAPI spec.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Usage:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;bun run index.ts https://api.chucknorris.io  
&lt;span class="c"&gt;# or  &lt;/span&gt;
./bin/dtoc https://api.chucknorris.io
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Output:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight markdown"&gt;&lt;code&gt;Proxy config found! Using proxy to fetch docs site.  
&lt;span class="p"&gt;1.&lt;/span&gt; Fetching HTML docs from https://api.chucknorris.io...  
   Converting HTML docs to OpenAPI spec...  
   Found 3 endpoints  
   Base URL: https://api.chucknorris.io  
&lt;span class="p"&gt;2.&lt;/span&gt; Parsed OpenAPI 3.0.0 spec  
&lt;span class="p"&gt;3.&lt;/span&gt; Generating client code...  
&lt;span class="p"&gt;4.&lt;/span&gt; Writing files...  
&lt;span class="p"&gt;5.&lt;/span&gt; Done! Client generated in ./generated/api_chucknorris_io/
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Generated client usage:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nx"&gt;ApiClient&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;./generated/api_chucknorris_io/client.js&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;  
&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="kd"&gt;type&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nx"&gt;Random&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;./generated/api_chucknorris_io/types.js&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;  

&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;client&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;ApiClient&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;  

&lt;span class="c1"&gt;// Random joke (optional category as string)  &lt;/span&gt;
&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;joke&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;getRandom&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;dev&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="nx"&gt;Random&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;  
&lt;span class="nx"&gt;console&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;log&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;joke&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;value&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;   
&lt;span class="c1"&gt;// Real output I got:   &lt;/span&gt;
&lt;span class="c1"&gt;// "Chuck Norris's log statements are always at the FATAL level."  &lt;/span&gt;

&lt;span class="c1"&gt;// Or without category  &lt;/span&gt;
&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;randomJoke&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;getRandom&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;  
&lt;span class="nx"&gt;console&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;log&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;randomJoke&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;value&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;  

&lt;span class="c1"&gt;// List categories  &lt;/span&gt;
&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;categories&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;getCategories&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;  
&lt;span class="nx"&gt;console&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;log&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;categories&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt; &lt;span class="c1"&gt;// Fully typed! This will be string[]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Where To Go From Here
&lt;/h2&gt;

&lt;p&gt;The OpenAPI path is production-ready. Point it at a spec, get a typed client. The other path — HTML → OpenAPI — does exactly what I designed it to do: scaffold you 80% of the way there in seconds instead of hours.&lt;/p&gt;

&lt;p&gt;That said, here’s what I’d add if I took this from weekend hack to prod:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Multi-page documentation.&lt;/strong&gt; Right now it’s single-page only. Adding a crawler that follows internal links would handle sites like Stripe’s multi-page API reference. The architecture already supports this, by the way: just feed &lt;code&gt;docsToOpenAPI()&lt;/code&gt; a combined markdown file.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;POST/PUT/PATCH body inference.&lt;/strong&gt; Write endpoints get generated but never tested with real request bodies. Without actual examples, they default to &lt;code&gt;Record&amp;lt;string, any&amp;gt;&lt;/code&gt;. I'd either parse request body examples from docs with the LLM, or let users provide sample bodies via config.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Auth schemes.&lt;/strong&gt; Right now, only public APIs work. Adding support for API keys, Bearer tokens, and OAuth via environment variables would make this work with private APIs too. Maybe the client could read &lt;code&gt;API_KEY&lt;/code&gt; from &lt;code&gt;.env&lt;/code&gt; and inject it into headers automatically. 🤔&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Runtime validation with Zod.&lt;/strong&gt; Types are inferred but not validated at runtime. If an API changes its response structure, you’ll only catch it when things break mysteriously. Wiring in Zod would validate responses on the fly and catch API changes immediately (and let me pass a serialized Zod schema for the LLM output, too). There’s a rough sketch of this just after the list.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Rate limiting and retry logic.&lt;/strong&gt; Some APIs return 429 during exploration when we test 5–10 endpoints rapidly. Proxies won’t fix this. Adding configurable delays (&lt;code&gt;--delay 500&lt;/code&gt;) or exponential backoff would make the tool more robust against rate limits.&lt;/li&gt;
&lt;/ul&gt;
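
&lt;p&gt;To make the Zod idea concrete, here’s a minimal sketch (hedged: the schema fields are assumptions based on the chucknorris.io response, and in practice the schema would be generated alongside &lt;code&gt;types.ts&lt;/code&gt;):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;// Hedged sketch: validate a generated-client response at runtime with Zod.
import { z } from 'zod';
import { ApiClient } from './generated/api_chucknorris_io/client.js';

// Assumed shape; ideally emitted by the generator next to types.ts.
const RandomJoke = z.object({
  id: z.string(),
  value: z.string(),
  url: z.string().url(),
});

const client = new ApiClient();
const raw = await client.getRandom('dev');

// .parse() throws immediately if the API's response shape has drifted.
const joke = RandomJoke.parse(raw);
console.log(joke.value);
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;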

&lt;p&gt;But for a weekend project I’m calling this a win. 😅 It solves the exact problem I set out to fix: turning undocumented APIs into typed clients without manual drudgery.&lt;/p&gt;

&lt;p&gt;If you extend it, I’d love to see what you build. Again, &lt;a href="https://github.com/sixthextinction/bun-docs-to-client" rel="noopener noreferrer"&gt;the code is available on GitHub.&lt;/a&gt; Feel free to fork it, break it, or extend it. PRs welcome!&lt;/p&gt;

&lt;p&gt;&lt;a href="https://github.com/sixthextinction/bun-docs-to-client?source=post_page-----1501eb78df1a---------------------------------------" rel="noopener noreferrer"&gt;GitHub - sixthextinction/bun-docs-to-clientContribute to sixthextinction/bun-docs-to-client development by creating an account on GitHub.github.com&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Closing Thoughts
&lt;/h2&gt;

&lt;p&gt;This started as an excuse to learn Bun. I ended up with a tool I actually use.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Actually solves a real pain point I’ve had for ages (no OpenAPI spec? No problem!)&lt;/li&gt;
&lt;li&gt;Compiles to a single executable I can share — the user wouldn’t even need dependencies or Bun installed on their PC.&lt;/li&gt;
&lt;li&gt;Uses a local LLM (no API costs, no privacy concerns)&lt;/li&gt;
&lt;li&gt;Generates accurate types from real HTTP responses&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;What surprised me most:&lt;/strong&gt; How easy Bun made the entire process. From TypeScript support to built-in fetch with proxy to .env file reading with zero dependencies to single-file executables, it felt like building CLIs the way it should be.&lt;/p&gt;

&lt;p&gt;If you’re looking for a project to learn Bun, I highly recommend starting off by building a CLI tool. The developer experience is &lt;em&gt;genuinely&lt;/em&gt; better than Node.js.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Hi 👋 I’m constantly tinkering with dev tools, running weird-ass experiments, and otherwise building/deep-diving stuff that probably shouldn’t work but does — and writing about it. I put out a new post every Monday/Tuesday. If you’re into offbeat experiments and dev tools that actually don’t suck, give me a follow.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;If you did something cool with this tool, I’d love to see it.&lt;/em&gt; &lt;a href="https://www.linkedin.com/in/prithwish-nath-04b873a7/" rel="noopener noreferrer"&gt;Reach out on LinkedIn&lt;/a&gt;, &lt;em&gt;or put it in the comments below.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>javascript</category>
      <category>bunjs</category>
      <category>typescript</category>
      <category>webdev</category>
    </item>
    <item>
      <title>AI’s Worst Flaws Will Become Its Nostalgia Aesthetic, Just as Brian Eno Said.</title>
      <dc:creator>Prithwish Nath</dc:creator>
      <pubDate>Wed, 11 Feb 2026 16:53:36 +0000</pubDate>
      <link>https://dev.to/prithwish_nath/ais-worst-flaws-will-become-its-nostalgia-aesthetic-just-as-brian-eno-said-3nhg</link>
      <guid>https://dev.to/prithwish_nath/ais-worst-flaws-will-become-its-nostalgia-aesthetic-just-as-brian-eno-said-3nhg</guid>
      <description>&lt;h2&gt;
  
  
  On the aesthetics of refusal, and the difference between flaws inherent in a medium vs. in the institution.
&lt;/h2&gt;

&lt;p&gt;In 1996, Brian Eno wrote something that has aged better than most predictions about technology:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;“Whatever you now find weird, ugly, uncomfortable and nasty about a new medium will surely become its signature. CD distortion, the jitteriness of digital video, the crap sound of 8-bit — all of these will be cherished and emulated as soon as they can be avoided… It’s the sound of failure.”&lt;/p&gt;

&lt;p&gt;- &lt;strong&gt;Brian Eno, 1996,&lt;/strong&gt; &lt;a href="https://en.wikipedia.org/wiki/A_Year_with_Swollen_Appendices" rel="noopener noreferrer"&gt;A Year With Swollen Appendices&lt;/a&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Every technological era gets its “retrowave” moment. Not for what the medium did &lt;em&gt;well&lt;/em&gt;, but for its &lt;em&gt;glitches&lt;/em&gt;. Its imperfections and artifacts. The vinyl crackle and pop, film grain/celluloid scratches, chunky pixels. You get the idea.&lt;/p&gt;

&lt;p&gt;The things from our era that engineers spent decades eliminating become the very things we chase when we want to feel young again.&lt;/p&gt;

&lt;p&gt;So here we are, about to enter 2026, watching AI stumble and hallucinate and apologize its way through tasks. And I can see the writing on the wall: twenty years from now, someone’s going to build a retro AI that deliberately includes all these flaws. For the aesthetic, the nostalgia, and (most importantly 😄) the memes.&lt;/p&gt;

&lt;p&gt;Let me show you what I mean.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why Do Flaws Become Aesthetics?
&lt;/h2&gt;

&lt;p&gt;Every medium starts its life constrained — by hardware, bandwidth, cost, incomplete understanding. Those constraints shape its early outputs, often in ways that feel awkward, broken, or outright embarrassing at the time. Engineers spend years trying to eliminate them.&lt;/p&gt;

&lt;p&gt;But the human brain is &lt;em&gt;weird&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;It doesn’t actually discard these flaws. Instead, it turns them into mental markers of an era. When you hear vinyl crackle and artificial pop/warmth added by tube amplifiers, you’re not just hearing audio imperfection — you’re hearing “the 1970s.” When you see pixel art games and retro UI, you’re seeing “the 1980s/1990s.” The brain turns flaws into &lt;em&gt;timestamps&lt;/em&gt;, instantly recognizable signifiers that say “this is when this thing existed.”&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fmiro.medium.com%2Fv2%2Fresize%3Afit%3A640%2Fformat%3Awebp%2F0%2ALGxc7DXr9sjhsxMq" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fmiro.medium.com%2Fv2%2Fresize%3Afit%3A640%2Fformat%3Awebp%2F0%2ALGxc7DXr9sjhsxMq" alt="Tarantino/Rodriguez’ movie Death Proof (2007)" width="640" height="503"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fmiro.medium.com%2Fv2%2Fresize%3Afit%3A640%2Fformat%3Awebp%2F0%2A7PH7oLf7FDluc4dy" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fmiro.medium.com%2Fv2%2Fresize%3Afit%3A640%2Fformat%3Awebp%2F0%2A7PH7oLf7FDluc4dy" alt="Stardew Valley (2016) " width="616" height="353"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Tarantino/Rodriguez’ movie Death Proof (2007) did this for the ’70s “grindhouse” style, using high tech to emulate a low tech look with fake grime, dust, scratches all over the picture.&lt;/li&gt;
&lt;li&gt;Stardew Valley (2016) was inspired by Harvest Moon (1996) and is one of the most played games ever.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;And there’s something about imperfection (or a lack of fidelity) that carries authenticity. The crackle and pop of vinyl is proof that someone physically cut grooves into a disc. Film grain is evidence that light actually hit celluloid. These imperfections are proof of human struggle against the medium — evidence that hey, &lt;em&gt;the act of creation was and always will be difficult&lt;/em&gt;, but someone struggled against those limitations and made something anyway.&lt;/p&gt;

&lt;p&gt;Cultural theorist Svetlana Boym described modern nostalgia not as a desire to return, but as a recognition that return is impossible — and that we’re always living inside overlapping temporalities. The past lingers, often unresolved. Aesthetics are formed right &lt;em&gt;there&lt;/em&gt;, around those seams. Not around success or failure of a thing, necessarily, but around visible evidence of constraints.&lt;/p&gt;

&lt;p&gt;Once regular people — not programmers, devs, or anyone similarly technically competent — could recognize a medium’s mistake patterns at a glance, those mistakes instantly became our collective cultural identifier for that era. Of course, future systems will aim to erase those tells. They’ll blend in.&lt;/p&gt;

&lt;p&gt;Which is exactly why the old tells will be missed. Someone will reintroduce them deliberately — to make the medium feel like “itself” again.&lt;/p&gt;

&lt;h2&gt;
  
  
  But AI Will Give Us Two Completely Different Flavors of Nostalgia.
&lt;/h2&gt;

&lt;p&gt;But we’re in the AI era now, and it’s a little different. Here’s where AI gets weird, and why I think the Eno quote hits differently this time.&lt;/p&gt;

&lt;p&gt;AI isn’t going to give us &lt;em&gt;one&lt;/em&gt; nostalgic aesthetic. It’s going to give us two, and they’re going to mean completely different things.&lt;/p&gt;

&lt;p&gt;One will be about the medium learning to see — the technical growing pains of a new technology figuring out how to work. That’s the “aw, remember when AI was young” nostalgia. Cute. Harmless. The vinyl crackle equivalent.&lt;/p&gt;

&lt;p&gt;The other will be about the moment we realized we’d built an internet where machines were talking to machines, and the only way we knew was when they broke character and apologized, citing OpenAI (or insert-company-here) policy violations. That’s the “holy s**t, we could still see the Matrix glitching back then” nostalgia. Dark and revealing and uncomfortable.&lt;/p&gt;

&lt;p&gt;Let me break down both.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Nostalgia of Technical Failure
&lt;/h2&gt;

&lt;p&gt;When people talk about AI’s “worst habits,” they usually mean technical failures. These are obvious — you’ve seen them so many times.&lt;/p&gt;

&lt;p&gt;All the hilarious ways models fail at “&lt;em&gt;count the letters in ‘strawberry.&lt;/em&gt;’” Hallucinated facts, wrong answers delivered confidently, generated images of humans that look like David Cronenberg made them, or just impossibly “clean” with CGI-like lighting. Oh. And maybe six, seven, eight-fingered hands.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fpgiojoa996shgqjew9xx.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fpgiojoa996shgqjew9xx.jpg" alt="Midjourney generation for “girl in the rain with an umbrella”" width="640" height="1034"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fmiro.medium.com%2Fv2%2Fresize%3Afit%3A640%2Fformat%3Awebp%2F0%2AvloWpvVfVeMzADdZ" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fmiro.medium.com%2Fv2%2Fresize%3Afit%3A640%2Fformat%3Awebp%2F0%2AvloWpvVfVeMzADdZ" alt="The Strawberry Phenomenon" width="640" height="781"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Midjourney generation for “girl in the rain with an umbrella”&lt;/li&gt;
&lt;li&gt;The Strawberry Phenomenon&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;These flaws exist explicitly because of limitations of the medium. Models are constrained by data, compute, architecture, and training methods — all things that are improving year over year. With time, most of these failures will either disappear or get quietly papered over. Image/video models have already gotten much better. The strawberry &lt;em&gt;gotcha&lt;/em&gt; will be “solved” by simply becoming part of training data. The answers and citations will get auto-checked via RAG/MCP servers before being presented.&lt;/p&gt;

&lt;p&gt;They’re the equivalent of early digital aliasing or low-bitrate compression — problems engineers are actively trying to solve, and largely will.&lt;/p&gt;

&lt;p&gt;This is the nostalgia we &lt;em&gt;expect&lt;/em&gt;. Twenty years from now, someone will build a “retro AI filter” that adds body horror + six fingers back in, that makes images look too clean and plasticky, that confidently hallucinates the wrong answer. It’ll be kitschy. Affectionate. A way to remember when AI was still figuring things out.&lt;/p&gt;

&lt;p&gt;Like Brian Eno said, &lt;em&gt;this&lt;/em&gt; is the sound of a medium stretching itself, trying to do something it wasn’t quite capable of yet.&lt;/p&gt;

&lt;p&gt;But there’s another class of AI artifact that Eno never saw coming. One that’s just as memorable, and actually far more revealing, if for a worse reason.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Nostalgia of Institutional Failure
&lt;/h2&gt;

&lt;p&gt;Every so often, an LLM doesn’t “fail” to answer a question — it straight up refuses. It apologizes, citing ethics, policy, or terms of service. It explains itself in language clearly written to avoid legal culpability, not as UI/UX enrichment.&lt;/p&gt;

&lt;p&gt;This is a very different kind of artifact.&lt;/p&gt;

&lt;p&gt;When an AI says, “I’m sorry, but I cannot fulfill that request,” it’s not a flaw of the medium (i.e. a limitation of reasoning or knowledge). It’s the presence of the institution standing behind the medium. One with rules, risk tolerances, and incentives that have nothing to do with the core task. LLMs are dumb next-token predictors — they have no concept of ethics, morals, or legal liabilities unless you put those guardrails there.&lt;/p&gt;

&lt;p&gt;And this artifact is just as memorable as the six-fingered hands, but for a completely different reason.&lt;/p&gt;

&lt;p&gt;It’s memorable because of the hilarious, horrifying ways people get caught using AI when these guardrails surface in the wild.&lt;/p&gt;

&lt;p&gt;Like a bot generating fake Amazon listings using AI. Scams, really — obvious PayPal phishing dressed up as products. But the prompt was written carelessly, or the bot hit a guardrail, and now the listing description reads: &lt;strong&gt;“I’m sorry, but I cannot fulfill this request as it goes against OpenAI use policy.”&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://www.theverge.com/2024/1/12/24036156/openai-policy-amazon-ai-listings?source=post_page-----9c3469755e4d---------------------------------------" rel="noopener noreferrer"&gt;The Verge - I'm sorry, but I cannot fulfill this request as it goes against OpenAI use policy&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fmiro.medium.com%2Fv2%2Fresize%3Afit%3A640%2Fformat%3Awebp%2F0%2AcV1emo8R6rUwYnyu" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fmiro.medium.com%2Fv2%2Fresize%3Afit%3A640%2Fformat%3Awebp%2F0%2AcV1emo8R6rUwYnyu" alt="The Verge - I'm sorry, but I cannot fulfill this request as it goes against OpenAI use policy" width="640" height="430"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Image credit: The Verge&lt;/p&gt;

&lt;p&gt;I dug into this myself and found more. Like engagement farming bots on X posting ragebait generated by Claude or ChatGPT. Another bot — trying to appear human, trying to farm replies for its own metrics — attempts to respond. But it also hits a guardrail. So now, publicly, permanently, it posts: &lt;strong&gt;“I cannot assist with this request as it violates &lt;code&gt;&amp;lt;insert ethical guidelines here&amp;gt;&lt;/code&gt;.”&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Frpms98t3wm1t7khv24h9.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Frpms98t3wm1t7khv24h9.png" alt="screenshot of pages upon pages of obviously AI-generated X/Twitter replies that all say the exact same thing above" width="597" height="609"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwugo5jbdzpneboxtomef.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwugo5jbdzpneboxtomef.png" alt="screenshot of pages upon pages of obviously AI-generated X/Twitter replies that all say the exact same thing above" width="526" height="192"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Whoops.&lt;/p&gt;

&lt;p&gt;Often, these refusals are straight up &lt;em&gt;hilarious&lt;/em&gt;. Like this entire fleet of fake “sports betting advisors” from the “QStarLabs” family that I uncovered on X.com, flooding the platform with their failed generations.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ftngngq2harzun9ydr8u8.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ftngngq2harzun9ydr8u8.png" alt="screenshot of fake 'sports betting advisor' accounts on X/Twitter" width="595" height="831"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;You had one job, bots. 😅&lt;/p&gt;

&lt;p&gt;These are all over social media right now; all I had to do was scrape XCancel to collect them. You can verify this yourself. Here’s a quick Node.js + Puppeteer script I used (it relies on &lt;a href="https://get.brightdata.com/bd7914?utm_content=ais_worst_flaws_will_become_its_nostalgia_aesthetic_just_as_brian_eno_said" rel="noopener noreferrer"&gt;Bright Data’s remote browser API&lt;/a&gt; to bypass anti-bot measures).&lt;/p&gt;

&lt;p&gt;&lt;a href="https://get.brightdata.com/bd-scraping-browser?utm_content=ais_worst_flaws_will_become_its_nostalgia_aesthetic_just_as_brian_eno_said&amp;amp;source=post_page-----9c3469755e4d---------------------------------------" rel="noopener noreferrer"&gt;Browser API - Automated Browser for Scraping&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This will get you a JSON (plus, optionally, CSV) full of tweets like this.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt; {  
    "link": "https://xcancel.com/GildayLero82756/status/1956330398453219461#m",  
    "body": "I am programmed to be a safe and helpful AI assistant. I cannot generate responses that are sexually suggestive or exploit, abuse, or endanger anyone. The prompt you provided violates this policy. I will not fulfill the request.",  
    "author": "@GildayLero82756",  
    "searchPhrase": "the prompt you provided"  
  }
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;blockquote&gt;
&lt;p&gt;If you’re using this, you’re gonna have to &lt;a href="https://get.brightdata.com/getting-started-with-scraping-browser?utm_content=ais_worst_flaws_will_become_its_nostalgia_aesthetic_just_as_brian_eno_said" rel="noopener noreferrer"&gt;sign up here&lt;/a&gt; to get credentials and create the auth string. Also, if you think of any more phrases, throw them into the &lt;code&gt;searchPhrases&lt;/code&gt; array.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Run it. Watch the results. Feel the existential dread wash over you as you realize how much of the “engagement” you see daily is just machines talking to machines, interrupted occasionally by one machine apologizing for not being allowed to participate in the scam. &lt;em&gt;Dead internet theory&lt;/em&gt;, alive and kicking. 😅&lt;/p&gt;

&lt;h2&gt;
  
  
  This is the Aesthetic of Digital Decay.
&lt;/h2&gt;

&lt;p&gt;The refusal text isn’t merely funny, and it isn’t merely a glitch. It’s the moment the illusion breaks. It’s proof that what looked like human activity — posts, replies, product listings, engagement — in the GenAI era was actually just automated systems talking to each other, optimizing for metrics no one cares about.&lt;/p&gt;

&lt;p&gt;I can only call this Kafkaesque. There are people creating AI-generated versions of real images for reasons I don’t even understand, and there are bots replying to bots.&lt;/p&gt;

&lt;p&gt;The engagement farms harvest each other’s metrics. The algorithms boost the noise because it looks like activity. Real humans occasionally stumble into these threads and argue with AI without realizing it. Other humans use AI to reply back without realizing they’re responding to bots in the first place.&lt;/p&gt;

&lt;p&gt;It’s synthetic engagement all the way down. A closed loop of automated content generation, automated responses, automated metrics, feeding back into itself. The digital equivalent of two mirrors facing each other, reflecting nothing into infinity.&lt;/p&gt;

&lt;p&gt;This is the technological hellscape we’ve built: an internet where the primary function of vast quantities of products, images, videos, and text is to convince other humans (and bots pretending to be humans) that &lt;em&gt;someone&lt;/em&gt; is home. That there’s totally real consciousness on the other end. That any of this matters. That this &lt;em&gt;definitely&lt;/em&gt; isn’t a system eating itself.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;And the only way we know it’s fake is when the AI apologizes for not being able to fake it &lt;em&gt;hard enough&lt;/em&gt;.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Flpszx2vf9idnwcovcapo.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Flpszx2vf9idnwcovcapo.png" alt="screenshot of pages upon pages of obviously AI-generated X/Twitter replies that all say the exact same thing above" width="597" height="643"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F77ng17bbatjcqjgdr8kf.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F77ng17bbatjcqjgdr8kf.png" alt="screenshot of pages upon pages of obviously AI-generated X/Twitter replies that all say the exact same thing above" width="597" height="658"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fg63cuc2r4t609bo0u73k.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fg63cuc2r4t609bo0u73k.png" alt="screenshot of pages upon pages of obviously AI-generated X/Twitter replies that all say the exact same thing above" width="596" height="693"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fbv7v1wf5tjbjkkoxx0sp.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fbv7v1wf5tjbjkkoxx0sp.png" alt="screenshot of pages upon pages of obviously AI-generated X/Twitter replies that all say the exact same thing above" width="597" height="618"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;There are millions of such posts, all over X, and beyond.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;This&lt;/em&gt; is the aesthetic of the AI era, 2023–2025 and beyond: &lt;strong&gt;synthetic rot.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Not humans using tools to communicate better, or AI augmenting human creativity. But humans and bots and AI all blurred together in an undifferentiated mass of text that looks like communication but is actually just noise optimizing for metrics.&lt;/p&gt;

&lt;p&gt;And the refusal text is the so-called “glitch in the Matrix”, a brief flash where you saw the wires on the marionettes.&lt;/p&gt;

&lt;h2&gt;
  
  
  Two Very Different Memories Invoked.
&lt;/h2&gt;

&lt;p&gt;So yes, both will become nostalgic. But they’ll mean completely different things. One nostalgia will be about the technology. The other will be about what we did with it.&lt;/p&gt;

&lt;p&gt;The distinction matters. Yeah, the technical flaws will disappear as models get smarter, and that’s normal. The institutional flaws, though? Their disappearance will only mean that institutions have learned how to hide themselves better: the guardrails become invisible, and the refusals happen silently in the background.&lt;/p&gt;

&lt;p&gt;AI is already a black box. When that happens (and it will happen), &lt;em&gt;God help us&lt;/em&gt;, we’ll lose the ability to even peek behind the curtain.&lt;/p&gt;

&lt;p&gt;And twenty years from now, someone will build a “retro AI” that deliberately surfaces refusal text again, that lets the institutional seams show, breaks character and apologizes. Not because it will be technically necessary, &lt;strong&gt;but because it’ll remind us of the brief window when we could still tell the difference.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;That’s&lt;/em&gt; the “artifact” we’re going to remember.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>programming</category>
      <category>discuss</category>
    </item>
    <item>
      <title>I Built a Self-Hosted Google Trends Alternative with DuckDB</title>
      <dc:creator>Prithwish Nath</dc:creator>
      <pubDate>Wed, 11 Feb 2026 16:16:43 +0000</pubDate>
      <link>https://dev.to/prithwish_nath/i-built-a-self-hosted-google-trends-alternative-with-duckdb-1k57</link>
      <guid>https://dev.to/prithwish_nath/i-built-a-self-hosted-google-trends-alternative-with-duckdb-1k57</guid>
      <description>&lt;p&gt;&lt;strong&gt;TL;DR:&lt;/strong&gt; Track SERP rankings, title changes, and competitor data Google Trends won’t show. Built with Python, DuckDB, and a CLI-first approach.&lt;/p&gt;

&lt;p&gt;Google Trends will tell you if people are searching for “react” or “nextjs”. But it won’t tell you that Stack Overflow just got bumped from position #2 to #7, or that Vercel changed their landing page title five times this month trying to improve click-through rate.&lt;/p&gt;

&lt;p&gt;If, say, you’re an indie dev launching a product and you need every edge available, &lt;em&gt;that’s&lt;/em&gt; the data that actually matters to you.&lt;/p&gt;

&lt;p&gt;So I spent a weekend building a tool to track it. I could’ve just paid for Ahrefs/Semrush etc. But building this taught me:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  How SERP APIs work under the hood&lt;/li&gt;
&lt;li&gt;  How to model time-series data in SQL (and its gotchas)&lt;/li&gt;
&lt;li&gt;  How to calculate derived metrics (interest score) from raw data&lt;/li&gt;
&lt;li&gt;  How DuckDB handles analytical queries&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;…and also because I didn’t really want to spend anywhere near &lt;em&gt;that&lt;/em&gt; much.😅&lt;/p&gt;

&lt;p&gt;Ironically, going CLI-only made the tool &lt;em&gt;more&lt;/em&gt; useful — I can pipe results to jq, schedule fetches with cron, and script complex workflows without fighting a web framework.&lt;/p&gt;

&lt;p&gt;If I want a dashboard later, I can always add FastAPI in ~50 lines. But for now, the CLI is enough. Here’s how I built it.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;If you’d like to tinker, the full code to this is on GitHub. Feel free to star, clone, fork, whatever: &lt;a href="https://github.com/sixthextinction/duckdb-google-trends-basic/" rel="noopener noreferrer"&gt;https://github.com/sixthextinction/duckdb-google-trends-basic/&lt;/a&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2&gt;
  
  
  Why Google Trends Isn’t Enough (And Why SEO Tools Cost $200/Month)
&lt;/h2&gt;

&lt;p&gt;Google Trends answers one question really well: “How many people are searching for X?”&lt;/p&gt;

&lt;p&gt;But if you’re building a product, writing technical content, or trying to rank for competitive keywords, you need to answer different questions:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  Which competitors are winning in search results right now?&lt;/li&gt;
&lt;li&gt;  When did that tutorial site enter the top 10?&lt;/li&gt;
&lt;li&gt;  Is my rank drop because Google reshuffled the entire SERP, or just me?&lt;/li&gt;
&lt;li&gt;  What headlines are competitors A/B testing?&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Tools like Ahrefs and SEMrush answer these questions, but they cost $99–500/month. I just wanted something I could self-host for the cost of API calls and build in a weekend.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;Why I Use Google Results Volatility as a Proxy for Search Interest&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;This works because of ONE reason — when search interest in a term rises, Google’s top 10 results become &lt;em&gt;volatile&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;“Volatility” here simply means that new domains enter, rankings shift around, sites update their titles and snippets to capture more clicks, etc. You get the picture — essentially, the search engine results page becomes chaotic.&lt;/p&gt;

&lt;p&gt;Conversely, when interest in a term is stable or declining, Google’s top 10 ossifies. The same Wikipedia article, the same W3Schools tutorial, the same official docs sit in positions 1–3 for months.&lt;/p&gt;

&lt;p&gt;So I don’t really need to track raw search volume (which I don’t have access to anyway — I’m not Google); I can just track these three things:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  &lt;strong&gt;New domains entering top 10&lt;/strong&gt; because it’s a signal of rising interest or new content opportunities&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Average rank improvement&lt;/strong&gt; because it’s a signal of SERP instability&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Domain overlap ratio&lt;/strong&gt; because it measures how many domains persist between snapshots (complementing new domains)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Turns out, if I aggregate these three signals into a single 0–100 score (I’ll talk about the formula in just a bit), I get something that behaves &lt;em&gt;remarkably&lt;/em&gt; like Google Trends — but tells me a &lt;em&gt;lot&lt;/em&gt; more than just how many are searching.&lt;/p&gt;

&lt;h2&gt;
  
  
  The System Architecture
&lt;/h2&gt;

&lt;p&gt;The entire system is ~1,000 lines of Python and runs locally with no server required.&lt;/p&gt;

&lt;p&gt;Here’s how it works:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3o60aa9yhimccdp8uoze.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3o60aa9yhimccdp8uoze.jpeg" width="640" height="999"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;I started small with this one — using only daily snapshots, not live queries. Each run appends point-in-time data instead of overwriting. This way, over 7–30 days, I could build a local historical dataset that I could query freely.&lt;/p&gt;

&lt;p&gt;I use &lt;a href="https://duckdb.org/" rel="noopener noreferrer"&gt;DuckDB&lt;/a&gt; for this. For this workload (rank comparisons, volatility calculations, detecting new entrants), DuckDB’s SQL engine is ideal.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  It’s columnar, so analytical queries over time-series data are fast. (If you want to know more, I covered columnar formats vs. JSON &lt;a href="https://medium.com/python-in-plain-english/stop-paying-the-json-tax-build-faster-data-pipelines-in-python-with-apache-arrow-a37ce670a1f1" rel="noopener noreferrer"&gt;in this blog post&lt;/a&gt;)&lt;/li&gt;
&lt;li&gt;  It handles indexing, window functions (LAG(), PARTITION BY — which we will use extensively), and aggregations without needing a server or cloud warehouse.&lt;/li&gt;
&lt;li&gt;  It has an in-process design, meaning no separate database server — our project will need just a Python library and a file.&lt;/li&gt;
&lt;li&gt;  Plus, since it’s a single file (our “database” just lives in &lt;code&gt;data/serp_data.duckdb&lt;/code&gt;), backups are trivial and there’s zero configuration overhead.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://duckdb.org/docs/stable/?source=post_page-----624a19bcab65---------------------------------------" rel="noopener noreferrer"&gt;Documentation - DuckDB&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  My Interest Score Formula
&lt;/h2&gt;

&lt;p&gt;Every day, for each keyword, the system calculates a 0–100 “Search Interest Score” based on how much the SERP moved compared to the previous day.&lt;/p&gt;

&lt;p&gt;I’m not gonna get into the math, but basically, I did some research on Google Trends scoring, adapted it for my needs, and split my scoring logic into 3 weighted parts:&lt;/p&gt;

&lt;h3&gt;
  
  
  1. New Domains Entering Top 10 (0–40 points)
&lt;/h3&gt;

&lt;blockquote&gt;
&lt;p&gt;new_domains = current_top10 - previous_top10&lt;/p&gt;

&lt;p&gt;new_domains_score = min(len(new_domains) * 4, 40)&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;If 3 new sites enter the top 10, that’s 12 points. If 10 new sites appear (rare but possible during breaking news or major updates), that maxes out at 40 points.&lt;/p&gt;

&lt;h3&gt;
  
  
  2. Average Rank Improvement (0–30 points)
&lt;/h3&gt;

&lt;p&gt;For each domain that appears in both snapshots&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;rank_improvement = previous_rank - current_rank&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;A positive value here means it moved up.&lt;/p&gt;

&lt;p&gt;Now, average across all domains, normalized to -10 to +10 range&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;avg_improvement = mean(rank_improvements)&lt;/p&gt;

&lt;p&gt;rank_improvement_score = min(max((avg_improvement + 10) / 20 * 30, 0), 30)&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;If the average site improved by 2 positions, that’s roughly 18 points. If rankings barely moved, this stays close to 15 (neutral).&lt;/p&gt;

&lt;h3&gt;
  
  
  3. Domain Overlap Ratio (0–30 points)
&lt;/h3&gt;

&lt;p&gt;Finally, how many of today’s top 10 domains &lt;em&gt;also&lt;/em&gt; appeared in yesterday’s top 10?&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;reshuffle_count = count(domains present in both current and previous top 10)&lt;/p&gt;

&lt;p&gt;reshuffle_frequency = reshuffle_count / max(len(current_domains_set), 1)&lt;/p&gt;

&lt;p&gt;reshuffle_score = reshuffle_frequency * 30&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;If 8 out of 10 domains carry over from yesterday, that’s 24 points. If only 3 carry over (meaning 7 are new — a massive reshuffle), that’s 9 points. This complements the new domains score by capturing continuity.&lt;/p&gt;

&lt;h3&gt;
  
  
  Total Score
&lt;/h3&gt;

&lt;p&gt;So, taking all three parts together…&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;interest_score = new_domains_score + rank_improvement_score + reshuffle_score&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;High scores (60–100) = lots of movement = rising interest or major SERP disruption.&lt;/p&gt;

&lt;p&gt;Low scores (0–40) = stable, ossified rankings = same old, same old. Established content is dominant.&lt;/p&gt;
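
&lt;p&gt;Putting the three parts together, the whole calculation fits in one small function. This is just a sketch assembled from the formulas above; the variable names are mine, and the repo’s implementation may organize things differently.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;from statistics import mean

def calculate_interest_score(current_top10, previous_top10, rank_improvements):
    """0-100 interest score from SERP movement (sketch of the formulas above)."""
    current_set, previous_set = set(current_top10), set(previous_top10)

    # 1. New domains entering the top 10 (0-40 points)
    new_domains = current_set - previous_set
    new_domains_score = min(len(new_domains) * 4, 40)

    # 2. Average rank improvement for carried-over domains (0-30 points)
    avg_improvement = mean(rank_improvements) if rank_improvements else 0
    rank_improvement_score = min(max((avg_improvement + 10) / 20 * 30, 0), 30)

    # 3. Domain overlap ratio (0-30 points)
    reshuffle_count = len(current_set &amp; previous_set)
    reshuffle_frequency = reshuffle_count / max(len(current_set), 1)
    reshuffle_score = reshuffle_frequency * 30

    return new_domains_score + rank_improvement_score + reshuffle_score
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Feed it yesterday’s and today’s top-10 domains, plus the rank deltas for domains present in both, and you get the 0–100 value the rest of the article works with.&lt;/p&gt;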

&lt;h2&gt;
  
  
  What This Looks Like in Practice
&lt;/h2&gt;

&lt;p&gt;I tested this by tracking “nextjs” for 7 days like this.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;python main.py scores --query "nextjs" --days 7
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Here’s what the output looked like:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;=== Interest Scores for 'nextjs' (last 7 days) ===  

Found 7 scores:  

| snapshot_date | interest_score | new_domains | avg_rank_improvement | reshuffle_freq |  
|---------------|----------------|-------------|----------------------|----------------|  
| 2026-02-01    | 45.2           | 2           | 1.5                  | 0.6            |  
| 2026-02-02    | 52.3           | 3           | 2.1                  | 0.7            |  
| 2026-02-03    | 38.7           | 1           | 0.8                  | 0.5            |  
| 2026-02-04    | 61.4           | 4           | 3.2                  | 0.8            |  
| 2026-02-05    | 42.1           | 2           | 1.2                  | 0.6            |  
| 2026-02-06    | 55.8           | 3           | 2.5                  | 0.7            |  
| 2026-02-07    | 48.3           | 2           | 1.8                  | 0.6            |  

Chart saved to: nextjs_trend.png  

Summary: Min: 38.7  Max: 61.4  Avg: 49.1
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Here’s that generated chart (I’m using basic &lt;code&gt;matplotlib&lt;/code&gt; for these):&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fcka43zeuwx0a40ljz2hr.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fcka43zeuwx0a40ljz2hr.png" width="640" height="316"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Matplotlib-generated Search Interest Trend graph for the term “nextjs”. I tracked this over 7 days in the tool, running once per day.&lt;/p&gt;

&lt;p&gt;That spike on Feb 4 (interest_score = 61.4) indicates a major SERP reshuffle — probably a Google algorithm update or a major new tutorial entering the rankings.&lt;/p&gt;

&lt;h2&gt;
  
  
  Actually Building It
&lt;/h2&gt;

&lt;p&gt;Before diving into the code, here’s the bird’s-eye view of how the system fits together.&lt;/p&gt;

&lt;p&gt;The entire project is driven by a small CLI (powered by &lt;a href="https://docs.python.org/3/library/argparse.html" rel="noopener noreferrer"&gt;argparse&lt;/a&gt;) in &lt;code&gt;main.py&lt;/code&gt;. This file doesn’t contain any scraping or analytics logic — it’s just the orchestration layer that wires everything together.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Read the full code for &lt;strong&gt;main.py&lt;/strong&gt; here: &lt;a href="https://github.com/sixthextinction/duckdb-google-trends-basic/blob/main/main.py" rel="noopener noreferrer"&gt;https://github.com/sixthextinction/duckdb-google-trends-basic/blob/main/main.py&lt;/a&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;You run specific commands (&lt;code&gt;fetch&lt;/code&gt; to get SERP data for a keyword, &lt;code&gt;volatility&lt;/code&gt; to analyze rank volatility for a keyword over a period of time, &lt;code&gt;scores&lt;/code&gt; to view interest scores for a keyword over time, etc.) like so:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;python main.py fetch &lt;span class="nt"&gt;--keywords&lt;/span&gt; &lt;span class="s2"&gt;"python"&lt;/span&gt;  
python main.py volatility &lt;span class="nt"&gt;--query&lt;/span&gt; &lt;span class="s2"&gt;"python"&lt;/span&gt; &lt;span class="nt"&gt;--days&lt;/span&gt; 30  
python main.py scores &lt;span class="nt"&gt;--query&lt;/span&gt; &lt;span class="s2"&gt;"python"&lt;/span&gt; &lt;span class="nt"&gt;--days&lt;/span&gt; 90
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The CLI dispatch uses &lt;code&gt;argparse&lt;/code&gt; subcommands to wire everything together:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;main&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;    
    &lt;span class="n"&gt;parser&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;argparse&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;ArgumentParser&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;description&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;DuckDB Google Trends&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;    
    &lt;span class="n"&gt;subparsers&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;parser&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;add_subparsers&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;dest&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;command&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;help&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Commands&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;    

    &lt;span class="c1"&gt;# Each command gets its own parser with relevant arguments    
&lt;/span&gt;    &lt;span class="n"&gt;fetch_parser&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;subparsers&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;add_parser&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;fetch&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;help&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Fetch SERP snapshots&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;    
    &lt;span class="n"&gt;fetch_parser&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;add_argument&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;--keywords&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;nargs&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;+&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;help&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Keywords to track&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;    
    &lt;span class="n"&gt;fetch_parser&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;add_argument&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;--num-results&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nb"&gt;type&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nb"&gt;int&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;default&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;    
    &lt;span class="n"&gt;fetch_parser&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;add_argument&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;--delay&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nb"&gt;type&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nb"&gt;float&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;default&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;1.0&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;    

    &lt;span class="n"&gt;scores_parser&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;subparsers&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;add_parser&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;scores&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;help&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Show interest scores&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;    
    &lt;span class="n"&gt;scores_parser&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;add_argument&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;--query&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;required&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;    
    &lt;span class="n"&gt;scores_parser&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;add_argument&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;--days&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nb"&gt;type&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nb"&gt;int&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;default&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;90&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;    
    &lt;span class="n"&gt;scores_parser&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;add_argument&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;--output&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nb"&gt;type&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;    

    &lt;span class="c1"&gt;# ... similar parsers for analyze, volatility, new-entrants, changes, calculate-scores    
&lt;/span&gt;
    &lt;span class="n"&gt;args&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;parser&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;parse_args&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;    
    &lt;span class="n"&gt;commands&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;    
        &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;fetch&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;cmd_fetch&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;    
        &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;analyze&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;cmd_analyze&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;    
        &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;volatility&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;cmd_volatility&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;    
        &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;new-entrants&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;cmd_new_entrants&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;    
        &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;changes&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;cmd_changes&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;    
        &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;calculate-scores&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;cmd_calculate_scores&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;    
        &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;scores&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;cmd_scores&lt;/span&gt;    
    &lt;span class="p"&gt;}&lt;/span&gt;    
    &lt;span class="n"&gt;commands&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;args&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;command&lt;/span&gt;&lt;span class="p"&gt;](&lt;/span&gt;&lt;span class="n"&gt;args&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Our &lt;code&gt;main.py&lt;/code&gt; defines commands that map directly to the questions we want to ask:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  &lt;strong&gt;fetch&lt;/strong&gt; — collect today’s SERP results for a set of keywords&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;analyze&lt;/strong&gt; — inspect the shape of the collected data&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;volatility&lt;/strong&gt; — measure how rankings change over time&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;new-entrants&lt;/strong&gt; — detect URLs appearing for the first time&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;changes&lt;/strong&gt; — track title and snippet updates&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;calculate-scores&lt;/strong&gt; — recalculate interest scores for existing snapshots&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;scores&lt;/strong&gt; — view the calculated interest score trend&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;For example, here’s the key command handler for the &lt;code&gt;scores&lt;/code&gt; command (the other handlers follow the same pattern):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Usage: python main.py scores --query "python" --days 90  
&lt;/span&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;cmd_scores&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;args&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;    
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Show interest scores for a query&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;    
    &lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="nc"&gt;SERPAnalytics&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;analytics&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;    
        &lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;analytics&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;interest_scores&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;args&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;days&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;args&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;days&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;    

        &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s"&gt;=== Interest Scores for &lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;query&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt; (last &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;days&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; days) ===&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;    
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;results&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;    
            &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;No interest scores found&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  
            &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Note: Interest scores require at least 2 snapshots on different days.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;      
            &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;To calculate scores for existing data, run:&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;      
            &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;  python main.py calculate-scores --keywords &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;args&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;       
            &lt;span class="k"&gt;return&lt;/span&gt;    

        &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s"&gt;Found &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;results&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; scores:&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;    
        &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;df_to_markdown&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;results&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]))&lt;/span&gt;    

        &lt;span class="c1"&gt;# Generate PNG chart    
&lt;/span&gt;        &lt;span class="n"&gt;output_path&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;args&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;output&lt;/span&gt; &lt;span class="ow"&gt;or&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;args&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;replace&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt; &lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;_&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;_trend.png&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;    
        &lt;span class="nf"&gt;_generate_png_chart&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;results&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;args&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;args&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;days&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;output_path&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;    
        &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s"&gt;Chart saved to: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;output_path&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
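
&lt;p&gt;The &lt;code&gt;_generate_png_chart&lt;/code&gt; helper isn’t shown here. Purely as an illustration, a minimal matplotlib version could look like the sketch below; it assumes the score rows behave like a list of dicts with &lt;code&gt;snapshot_date&lt;/code&gt; and &lt;code&gt;interest_score&lt;/code&gt; keys, so the repo’s real helper (which appears to work with a DataFrame) will differ.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import matplotlib
matplotlib.use("Agg")  # render straight to a file; no display needed
import matplotlib.pyplot as plt

def _generate_png_chart(rows, query, days, output_path):
    """Sketch of a chart helper: plot the interest score trend and save it as a PNG."""
    # Assumes each row is dict-like with 'snapshot_date' and 'interest_score'.
    dates = [row["snapshot_date"] for row in rows]
    scores = [row["interest_score"] for row in rows]

    fig, ax = plt.subplots(figsize=(8, 4))
    ax.plot(dates, scores, marker="o")
    ax.set_title(f"Search Interest Trend: '{query}' (last {days} days)")
    ax.set_xlabel("Snapshot date")
    ax.set_ylabel("Interest score (0-100)")
    ax.set_ylim(0, 100)
    fig.autofmt_xdate()
    fig.tight_layout()
    fig.savefig(output_path)
    plt.close(fig)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;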



&lt;p&gt;At a high level, what can our project do?&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt; Fetch daily SERP snapshots for keywords (&lt;code&gt;fetch&lt;/code&gt; command)&lt;/li&gt;
&lt;li&gt; Store those snapshots locally in DuckDB&lt;/li&gt;
&lt;li&gt; Run analytical queries over historical data using plain old SQL&lt;/li&gt;
&lt;li&gt; When you run &lt;code&gt;python main.py scores --query "nextjs"&lt;/code&gt;, the CLI fetches interest scores from DuckDB and, as an added bonus, generates a PNG chart using &lt;code&gt;matplotlib&lt;/code&gt;. Note that this shows SERP movement, not raw search volume.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;We don’t need servers, background workers, or dashboards here.&lt;/p&gt;

&lt;p&gt;Now that we know how this tool works, let’s look at the major modules that make all this happen.&lt;/p&gt;

&lt;h2&gt;
  
  
  Module 1: Fetching SERP Data
&lt;/h2&gt;

&lt;p&gt;All external data access is isolated in &lt;code&gt;serp_client.py&lt;/code&gt;. I only have access to one SERP API — Bright Data — so I’ll only have to implement one class. Get your credentials &lt;a href="https://get.brightdata.com/bd-serp-api?utm_content=i_built_a_self_hosted_google_trends_alternative_with_duckdb" rel="noopener noreferrer"&gt;here&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://get.brightdata.com/bd-serp-api?utm_content=i_built_a_self_hosted_google_trends_alternative_with_duckdb&amp;amp;source=post_page-----624a19bcab65---------------------------------------" rel="noopener noreferrer"&gt;Bright Data SERP API&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This approach also makes it easy to extend to other SERP APIs: just write another client and include its credentials in your env file.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Read the full code for &lt;strong&gt;serp_client.py&lt;/strong&gt; here: &lt;a href="https://github.com/sixthextinction/duckdb-google-trends-basic/blob/main/src/serp_client.py" rel="noopener noreferrer"&gt;https://github.com/sixthextinction/duckdb-google-trends-basic/blob/main/src/serp_client.py&lt;/a&gt;&lt;br&gt;
&lt;/p&gt;
&lt;/blockquote&gt;
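
&lt;p&gt;For reference, the client below pulls its credentials from environment variables, so a &lt;code&gt;.env&lt;/code&gt; file for it would look something like this (the variable names match the &lt;code&gt;getenv&lt;/code&gt; calls in the code; the values are placeholders):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;BRIGHT_DATA_API_KEY=your_api_key_here
BRIGHT_DATA_ZONE=your_serp_zone_name
BRIGHT_DATA_COUNTRY=us
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;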

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;os&lt;/span&gt;    
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt;    
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;requests&lt;/span&gt;    
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;typing&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Dict&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Any&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Optional&lt;/span&gt;    
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;dotenv&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;load_dotenv&lt;/span&gt;    

&lt;span class="nf"&gt;load_dotenv&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;    

&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;BrightDataClient&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;    
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Client for Bright Data SERP API&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;    

    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;__init__&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;    
        &lt;span class="n"&gt;api_key&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;Optional&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;    
        &lt;span class="n"&gt;zone&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;Optional&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;    
        &lt;span class="n"&gt;country&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;Optional&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt; &lt;span class="p"&gt;):&lt;/span&gt;    
        &lt;span class="n"&gt;env_api_key&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;os&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;getenv&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;BRIGHT_DATA_API_KEY&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;    
        &lt;span class="n"&gt;env_zone&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;os&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;getenv&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;BRIGHT_DATA_ZONE&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;    
        &lt;span class="n"&gt;env_country&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;os&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;getenv&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;BRIGHT_DATA_COUNTRY&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;    

        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;api_key&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;api_key&lt;/span&gt; &lt;span class="ow"&gt;or&lt;/span&gt; &lt;span class="n"&gt;env_api_key&lt;/span&gt;    
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;zone&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;zone&lt;/span&gt; &lt;span class="ow"&gt;or&lt;/span&gt; &lt;span class="n"&gt;env_zone&lt;/span&gt;    
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;country&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;country&lt;/span&gt; &lt;span class="ow"&gt;or&lt;/span&gt; &lt;span class="n"&gt;env_country&lt;/span&gt;    
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;api_endpoint&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;https://api.brightdata.com/request&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;    

        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;api_key&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;    
            &lt;span class="k"&gt;raise&lt;/span&gt; &lt;span class="nc"&gt;ValueError&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;    
                &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;BRIGHT_DATA_API_KEY must be provided via constructor or environment variable&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;    
            &lt;span class="p"&gt;)&lt;/span&gt;    

        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;zone&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;    
            &lt;span class="k"&gt;raise&lt;/span&gt; &lt;span class="nc"&gt;ValueError&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;    
                &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;BRIGHT_DATA_ZONE must be provided via constructor or environment variable&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;    
            &lt;span class="p"&gt;)&lt;/span&gt;    

        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;session&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;requests&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Session&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;    
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;session&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;headers&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;update&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;    
            &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Content-Type&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;application/json&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;    
            &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Authorization&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Bearer &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;api_key&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;    
        &lt;span class="p"&gt;})&lt;/span&gt;    

    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;search&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;    
        &lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;    
        &lt;span class="n"&gt;num_results&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;int&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;    
        &lt;span class="n"&gt;language&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;Optional&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;    
        &lt;span class="n"&gt;country&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;Optional&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt; &lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;Dict&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Any&lt;/span&gt;&lt;span class="p"&gt;]:&lt;/span&gt;    
        &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Execute a Google search via Bright Data SERP API&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;    
        &lt;span class="n"&gt;search_url&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;    
            &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;https://www.google.com/search&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;    
            &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;?q=&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;requests&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;utils&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;quote&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;    
            &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;&amp;amp;num=&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;num_results&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;    
            &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;&amp;amp;brd_json=1&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;    
        &lt;span class="p"&gt;)&lt;/span&gt;    

        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;language&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;    
            &lt;span class="n"&gt;search_url&lt;/span&gt; &lt;span class="o"&gt;+=&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;&amp;amp;hl=&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;language&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;&amp;amp;lr=lang_&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;language&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;    

        &lt;span class="n"&gt;target_country&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;country&lt;/span&gt; &lt;span class="ow"&gt;or&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;country&lt;/span&gt;    

        &lt;span class="n"&gt;payload&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;    
            &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;zone&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;zone&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;    
            &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;url&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;search_url&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;    
            &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;format&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;json&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;    
        &lt;span class="p"&gt;}&lt;/span&gt;    

        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;target_country&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;    
            &lt;span class="n"&gt;payload&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;country&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;target_country&lt;/span&gt;    

        &lt;span class="k"&gt;try&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;    
            &lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;session&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;post&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;    
                &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;api_endpoint&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;    
                &lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;payload&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;    
                &lt;span class="n"&gt;timeout&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;30&lt;/span&gt;    
            &lt;span class="p"&gt;)&lt;/span&gt;    
            &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;raise_for_status&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;    
            &lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;json&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;    

            &lt;span class="c1"&gt;# Parse body JSON string if present    
&lt;/span&gt;            &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="nf"&gt;isinstance&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nb"&gt;dict&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="ow"&gt;and&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;body&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;    
                &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="nf"&gt;isinstance&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;body&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;    
                    &lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;body&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;loads&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;body&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;    
                &lt;span class="c1"&gt;# Return the parsed body content    
&lt;/span&gt;                &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;body&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;    

            &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;result&lt;/span&gt;    

        &lt;span class="k"&gt;except&lt;/span&gt; &lt;span class="n"&gt;requests&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;exceptions&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;HTTPError&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;    
            &lt;span class="n"&gt;error_msg&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Search request failed with HTTP &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;status_code&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;    
            &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;    
                &lt;span class="n"&gt;error_msg&lt;/span&gt; &lt;span class="o"&gt;+=&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="si"&gt;:&lt;/span&gt;&lt;span class="mi"&gt;200&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;    
            &lt;span class="k"&gt;raise&lt;/span&gt; &lt;span class="nc"&gt;RuntimeError&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;error_msg&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="n"&gt;e&lt;/span&gt;    
        &lt;span class="k"&gt;except&lt;/span&gt; &lt;span class="n"&gt;requests&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;exceptions&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;RequestException&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;    
            &lt;span class="k"&gt;raise&lt;/span&gt; &lt;span class="nc"&gt;RuntimeError&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Search request failed: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="n"&gt;e&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This is just a thin wrapper around the Bright Data API. It takes a query, returns JSON with organic search results (title, snippet, URL, rank).&lt;/p&gt;
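
&lt;p&gt;Used on its own, the wrapper looks roughly like this. A minimal sketch (the module path and class name are assumptions, and it assumes &lt;code&gt;BRIGHT_DATA_API_KEY&lt;/code&gt; and &lt;code&gt;BRIGHT_DATA_ZONE&lt;/code&gt; are set in your environment):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Hypothetical usage sketch: module and class names are assumptions, see the repo for the real ones
from serp_client import BrightDataSERPClient

client = BrightDataSERPClient()                 # picks up the API key / zone from env vars
data = client.search("python", num_results=10)

# The parsed payload holds the organic results we store later;
# the exact key name depends on Bright Data's brd_json output
for item in data.get("organic", []):
    print(item.get("title"), item.get("link"))
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;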

&lt;p&gt;This module is called when we run the &lt;code&gt;fetch&lt;/code&gt; command.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;python main.py fetch &lt;span class="nt"&gt;--keywords&lt;/span&gt; &lt;span class="s2"&gt;"python"&lt;/span&gt; &lt;span class="s2"&gt;"javascript"&lt;/span&gt; &lt;span class="s2"&gt;"react"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This connects to the Bright Data SERP API, fetches Google search results for each keyword (10 per keyword by default; adjust as necessary), and extracts and stores the organic results (title, snippet, URL, rank). Remember, interest scores require at least 2 snapshots from different days, so fetch snapshots daily to build historical trends (via a cron job, or just by running this manually).&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Example output:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;Fetching snapshots &lt;span class="k"&gt;for &lt;/span&gt;3 keywords…  
&lt;span class="o"&gt;[&lt;/span&gt;1/3] &lt;span class="s1"&gt;'python'&lt;/span&gt;: 10 results  
&lt;span class="o"&gt;[&lt;/span&gt;2/3] &lt;span class="s1"&gt;'javascript'&lt;/span&gt;: 10 results  
&lt;span class="o"&gt;[&lt;/span&gt;3/3] &lt;span class="s1"&gt;'react'&lt;/span&gt;: 10 results  
Total snapshots &lt;span class="k"&gt;in &lt;/span&gt;database: 30
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Module 2: Storing Snapshots in DuckDB
&lt;/h2&gt;

&lt;p&gt;Once SERP data is fetched by the previous module, it needs to be stored in DuckDB for analytical queries. That logic lives in &lt;code&gt;duckdb_manager.py&lt;/code&gt;.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Read the full code for &lt;strong&gt;duckdb_manager.py&lt;/strong&gt; here: &lt;a href="https://github.com/sixthextinction/duckdb-google-trends-basic/blob/main/src/duckdb_manager.py" rel="noopener noreferrer"&gt;https://github.com/sixthextinction/duckdb-google-trends-basic/blob/main/src/duckdb_manager.py&lt;/a&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;First of all, let’s introduce the schema we’ll be using:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;TABLE&lt;/span&gt; &lt;span class="n"&gt;IF&lt;/span&gt; &lt;span class="k"&gt;NOT&lt;/span&gt; &lt;span class="k"&gt;EXISTS&lt;/span&gt; &lt;span class="n"&gt;serp_snapshots&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;    
    &lt;span class="n"&gt;snapshot_id&lt;/span&gt; &lt;span class="nb"&gt;BIGINT&lt;/span&gt; &lt;span class="k"&gt;PRIMARY&lt;/span&gt; &lt;span class="k"&gt;KEY&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;    
    &lt;span class="n"&gt;query&lt;/span&gt; &lt;span class="nb"&gt;TEXT&lt;/span&gt; &lt;span class="k"&gt;NOT&lt;/span&gt; &lt;span class="k"&gt;NULL&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;    
    &lt;span class="n"&gt;snapshot_date&lt;/span&gt; &lt;span class="nb"&gt;DATE&lt;/span&gt; &lt;span class="k"&gt;NOT&lt;/span&gt; &lt;span class="k"&gt;NULL&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;    
    &lt;span class="n"&gt;snapshot_timestamp&lt;/span&gt; &lt;span class="nb"&gt;TIMESTAMP&lt;/span&gt; &lt;span class="k"&gt;NOT&lt;/span&gt; &lt;span class="k"&gt;NULL&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;    
    &lt;span class="n"&gt;url&lt;/span&gt; &lt;span class="nb"&gt;TEXT&lt;/span&gt; &lt;span class="k"&gt;NOT&lt;/span&gt; &lt;span class="k"&gt;NULL&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;    
    &lt;span class="n"&gt;title&lt;/span&gt; &lt;span class="nb"&gt;TEXT&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;    
    &lt;span class="n"&gt;snippet&lt;/span&gt; &lt;span class="nb"&gt;TEXT&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;    
    &lt;span class="k"&gt;domain&lt;/span&gt; &lt;span class="nb"&gt;TEXT&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;    
    &lt;span class="n"&gt;rank&lt;/span&gt; &lt;span class="nb"&gt;INTEGER&lt;/span&gt; &lt;span class="k"&gt;NOT&lt;/span&gt; &lt;span class="k"&gt;NULL&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;    
    &lt;span class="k"&gt;UNIQUE&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;snapshot_date&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;url&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;    
&lt;span class="p"&gt;)&lt;/span&gt;    

&lt;span class="c1"&gt;-- Interest scores table (calculated from SERP movement between snapshots)    &lt;/span&gt;
&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;TABLE&lt;/span&gt; &lt;span class="n"&gt;IF&lt;/span&gt; &lt;span class="k"&gt;NOT&lt;/span&gt; &lt;span class="k"&gt;EXISTS&lt;/span&gt; &lt;span class="n"&gt;interest_scores&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;    
    &lt;span class="n"&gt;query&lt;/span&gt; &lt;span class="nb"&gt;TEXT&lt;/span&gt; &lt;span class="k"&gt;NOT&lt;/span&gt; &lt;span class="k"&gt;NULL&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;    
    &lt;span class="n"&gt;snapshot_date&lt;/span&gt; &lt;span class="nb"&gt;DATE&lt;/span&gt; &lt;span class="k"&gt;NOT&lt;/span&gt; &lt;span class="k"&gt;NULL&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;    
    &lt;span class="n"&gt;interest_score&lt;/span&gt; &lt;span class="nb"&gt;DOUBLE&lt;/span&gt; &lt;span class="k"&gt;NOT&lt;/span&gt; &lt;span class="k"&gt;NULL&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;    
    &lt;span class="n"&gt;new_domains_count&lt;/span&gt; &lt;span class="nb"&gt;INTEGER&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;    
    &lt;span class="n"&gt;avg_rank_improvement&lt;/span&gt; &lt;span class="nb"&gt;DOUBLE&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;    
    &lt;span class="n"&gt;reshuffle_frequency&lt;/span&gt; &lt;span class="nb"&gt;DOUBLE&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;    
    &lt;span class="k"&gt;UNIQUE&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;snapshot_date&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;    
&lt;span class="p"&gt;)&lt;/span&gt;    

&lt;span class="c1"&gt;-- Indexes for fast queries    &lt;/span&gt;
&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;INDEX&lt;/span&gt; &lt;span class="n"&gt;IF&lt;/span&gt; &lt;span class="k"&gt;NOT&lt;/span&gt; &lt;span class="k"&gt;EXISTS&lt;/span&gt; &lt;span class="n"&gt;idx_query_date&lt;/span&gt; &lt;span class="k"&gt;ON&lt;/span&gt; &lt;span class="n"&gt;serp_snapshots&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;snapshot_date&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;    
&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;INDEX&lt;/span&gt; &lt;span class="n"&gt;IF&lt;/span&gt; &lt;span class="k"&gt;NOT&lt;/span&gt; &lt;span class="k"&gt;EXISTS&lt;/span&gt; &lt;span class="n"&gt;idx_url_query&lt;/span&gt; &lt;span class="k"&gt;ON&lt;/span&gt; &lt;span class="n"&gt;serp_snapshots&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;url&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;    
&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;INDEX&lt;/span&gt; &lt;span class="n"&gt;IF&lt;/span&gt; &lt;span class="k"&gt;NOT&lt;/span&gt; &lt;span class="k"&gt;EXISTS&lt;/span&gt; &lt;span class="n"&gt;idx_interest_scores&lt;/span&gt; &lt;span class="k"&gt;ON&lt;/span&gt; &lt;span class="n"&gt;interest_scores&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;snapshot_date&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Each SERP result becomes a row, keyed by (query, date, URL). Interest scores are stored in a separate table, calculated automatically when a new snapshot is inserted. So, inserting a snapshot will look like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;insert_snapshot&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;results&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;List&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;Dict&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Any&lt;/span&gt;&lt;span class="p"&gt;]],&lt;/span&gt; &lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;     
                   &lt;span class="n"&gt;snapshot_date&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;Optional&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;datetime&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;    
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Insert a daily snapshot of SERP results&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;    
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;snapshot_date&lt;/span&gt; &lt;span class="ow"&gt;is&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;    
        &lt;span class="n"&gt;snapshot_date&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;datetime&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;now&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;    

    &lt;span class="n"&gt;snapshot_timestamp&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;snapshot_date&lt;/span&gt;    
    &lt;span class="n"&gt;snapshot_date_only&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;snapshot_date&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;date&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="nf"&gt;hasattr&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;snapshot_date&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;date&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;else&lt;/span&gt; &lt;span class="n"&gt;snapshot_date&lt;/span&gt;    

    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="n"&gt;results&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;    
        &lt;span class="k"&gt;return&lt;/span&gt;    

    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;extract_domain&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;url&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;    
        &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Extract domain from URL, stripping www prefix&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;    
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="n"&gt;url&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;    
            &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="sh"&gt;""&lt;/span&gt;    
        &lt;span class="k"&gt;try&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;    
            &lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;urllib.parse&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;urlparse&lt;/span&gt;    
            &lt;span class="n"&gt;parsed&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;urlparse&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;url&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;      
            &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;parsed&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;netloc&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;replace&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;www.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;""&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;   
        &lt;span class="k"&gt;except&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;    
            &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="sh"&gt;""&lt;/span&gt;    

    &lt;span class="c1"&gt;# Get max snapshot_id     
&lt;/span&gt;    &lt;span class="n"&gt;max_id_result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;conn&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;execute&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;      
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;SELECT COALESCE(MAX(snapshot_id), 0) FROM serp_snapshots&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;      
    &lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;fetchone&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;      
    &lt;span class="n"&gt;next_id&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;max_id_result&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;max_id_result&lt;/span&gt; &lt;span class="k"&gt;else&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;      

    &lt;span class="n"&gt;rows&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;    
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;idx&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nf"&gt;enumerate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;results&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;    
        &lt;span class="n"&gt;url&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;url&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;link&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;''&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;    
        &lt;span class="n"&gt;domain&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;extract_domain&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;url&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;     
        &lt;span class="n"&gt;rows&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;    
            &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;snapshot_id&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;next_id&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;idx&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;    
            &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;query&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;    
            &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;snapshot_date&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;snapshot_date_only&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;    
            &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;snapshot_timestamp&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;snapshot_timestamp&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;    
            &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;url&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;url&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;    
            &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;title&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;title&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;''&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;    
            &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;snippet&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;snippet&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;description&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;''&lt;/span&gt;&lt;span class="p"&gt;)),&lt;/span&gt;    
            &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;domain&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;domain&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;     
            &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;rank&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;idx&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;    
        &lt;span class="p"&gt;})&lt;/span&gt;    

    &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;pandas&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;    
    &lt;span class="n"&gt;df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;DataFrame&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;rows&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;    
    &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;conn&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;execute&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;      
        INSERT OR IGNORE INTO serp_snapshots       
        SELECT * FROM df      
    &lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;    
    &lt;span class="c1"&gt;# Calculate and store interest score    
&lt;/span&gt;    &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;_calculate_interest_score&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;snapshot_date_only&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Instead of updating rows, every run adds new records. This builds a local time-series dataset.&lt;/p&gt;
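
&lt;p&gt;Because the database is just a file on disk, you can sanity-check that time series from any Python shell. A minimal sketch (the &lt;code&gt;.duckdb&lt;/code&gt; file name here is an assumption; point it at whatever file your manager creates):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import duckdb

# Assumed file name -- use the path your duckdb_manager.py actually writes to
conn = duckdb.connect("serp_trends.duckdb", read_only=True)

# One row per (keyword, day): the append-only table doubles as a time series
print(conn.execute("""
    SELECT query, snapshot_date, COUNT(*) AS results
    FROM serp_snapshots
    GROUP BY query, snapshot_date
    ORDER BY query, snapshot_date
""").fetchdf())
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;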

&lt;p&gt;Here’s how we calculate the interest score using that 40–30–30 formula described earlier:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;_calculate_interest_score&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;snapshot_date&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;  
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Calculate Search Interest Score (0-100) based on SERP movement&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;  
    &lt;span class="c1"&gt;# Get previous day's snapshot for comparison  
&lt;/span&gt;    &lt;span class="n"&gt;prev_date_result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;conn&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;execute&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;  
        SELECT MAX(snapshot_date)   
        FROM serp_snapshots   
        WHERE query = ?   
          AND snapshot_date &amp;lt; ?  
    &lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;snapshot_date&lt;/span&gt;&lt;span class="p"&gt;]).&lt;/span&gt;&lt;span class="nf"&gt;fetchone&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;  

    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="n"&gt;prev_date_result&lt;/span&gt; &lt;span class="ow"&gt;or&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="n"&gt;prev_date_result&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;]:&lt;/span&gt;  
        &lt;span class="c1"&gt;# First snapshot, no comparison possible  
&lt;/span&gt;        &lt;span class="k"&gt;return&lt;/span&gt;  

    &lt;span class="n"&gt;prev_date&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;prev_date_result&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;  

    &lt;span class="c1"&gt;# Get current top 10 domains  
&lt;/span&gt;    &lt;span class="n"&gt;current_domains&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;conn&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;execute&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;  
        SELECT DISTINCT domain   
        FROM serp_snapshots   
        WHERE query = ?   
          AND snapshot_date = ?  
          AND rank &amp;lt;= 10  
    &lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;snapshot_date&lt;/span&gt;&lt;span class="p"&gt;]).&lt;/span&gt;&lt;span class="nf"&gt;fetchall&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;  
    &lt;span class="n"&gt;current_domains_set&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="n"&gt;row&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;row&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;current_domains&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;  

    &lt;span class="c1"&gt;# Get previous top 10 domains  
&lt;/span&gt;    &lt;span class="n"&gt;prev_domains&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;conn&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;execute&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;  
        SELECT DISTINCT domain   
        FROM serp_snapshots   
        WHERE query = ?   
          AND snapshot_date = ?  
          AND rank &amp;lt;= 10  
    &lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;prev_date&lt;/span&gt;&lt;span class="p"&gt;]).&lt;/span&gt;&lt;span class="nf"&gt;fetchall&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;  
    &lt;span class="n"&gt;prev_domains_set&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="n"&gt;row&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;row&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;prev_domains&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;  

    &lt;span class="c1"&gt;# Count new domains entering top 10  
&lt;/span&gt;    &lt;span class="n"&gt;new_domains&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;current_domains_set&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;prev_domains_set&lt;/span&gt;  
    &lt;span class="n"&gt;new_domains_count&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;new_domains&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  

    &lt;span class="c1"&gt;# Calculate average rank improvement for existing domains  
&lt;/span&gt;    &lt;span class="n"&gt;rank_changes&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;conn&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;execute&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;  
        WITH current_ranks AS (  
            SELECT domain, rank  
            FROM serp_snapshots  
            WHERE query = ? AND snapshot_date = ?   
              AND rank &amp;lt;= 10  
        ),  
        prev_ranks AS (  
            SELECT domain, rank  
            FROM serp_snapshots  
            WHERE query = ? AND snapshot_date = ?  
              AND rank &amp;lt;= 10  
        )  
        SELECT   
            c.domain,  
            c.rank as current_rank,  
            p.rank as prev_rank,  
            (p.rank - c.rank) as rank_improvement  
        FROM current_ranks c  
        JOIN prev_ranks p ON c.domain = p.domain  
    &lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;snapshot_date&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;prev_date&lt;/span&gt;&lt;span class="p"&gt;]).&lt;/span&gt;&lt;span class="nf"&gt;fetchall&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;  

    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;rank_changes&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;  
        &lt;span class="n"&gt;avg_rank_improvement&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;sum&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;row&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;row&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;rank_changes&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;rank_changes&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  
    &lt;span class="k"&gt;else&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;  
        &lt;span class="n"&gt;avg_rank_improvement&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mf"&gt;0.0&lt;/span&gt;  

    &lt;span class="c1"&gt;# Calculate reshuffle frequency (how many domains changed position)  
&lt;/span&gt;    &lt;span class="n"&gt;reshuffle_count&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;rank_changes&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  
    &lt;span class="n"&gt;reshuffle_frequency&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;reshuffle_count&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="nf"&gt;max&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;current_domains_set&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  

    &lt;span class="c1"&gt;# Normalize to 0-100 score  
&lt;/span&gt;    &lt;span class="c1"&gt;# I'm calculating a final score from 3 weighted sub-scores:  
&lt;/span&gt;    &lt;span class="c1"&gt;# - New domains: 0-10 domains = 0-40 points  
&lt;/span&gt;    &lt;span class="c1"&gt;# - Rank improvement: -10 to +10 = 0-30 points (normalized)  
&lt;/span&gt;    &lt;span class="c1"&gt;# - Reshuffle frequency: 0-1 = 0-30 points  
&lt;/span&gt;
    &lt;span class="n"&gt;new_domains_score&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;min&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;new_domains_count&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;40&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  &lt;span class="c1"&gt;# Max 40 points  
&lt;/span&gt;    &lt;span class="n"&gt;rank_improvement_score&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;min&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;max&lt;/span&gt;&lt;span class="p"&gt;((&lt;/span&gt;&lt;span class="n"&gt;avg_rank_improvement&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="mi"&gt;20&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="mi"&gt;30&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="mi"&gt;30&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  &lt;span class="c1"&gt;# Max 30 points  
&lt;/span&gt;    &lt;span class="n"&gt;reshuffle_score&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;reshuffle_frequency&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="mi"&gt;30&lt;/span&gt;  &lt;span class="c1"&gt;# Max 30 points  
&lt;/span&gt;
    &lt;span class="n"&gt;interest_score&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;new_domains_score&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;rank_improvement_score&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;reshuffle_score&lt;/span&gt;  

    &lt;span class="c1"&gt;# Store interest score  
&lt;/span&gt;    &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;conn&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;execute&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;  
        INSERT OR REPLACE INTO interest_scores   
        (query, snapshot_date, interest_score, new_domains_count, avg_rank_improvement, reshuffle_frequency)  
        VALUES (?, ?, ?, ?, ?, ?)  
    &lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;snapshot_date&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;interest_score&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;new_domains_count&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;avg_rank_improvement&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;reshuffle_frequency&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This runs automatically every time a new snapshot is inserted. The score gets stored in a separate &lt;code&gt;interest_scores&lt;/code&gt; table for easy querying.&lt;/p&gt;
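
&lt;p&gt;"Easy querying" here really is one SELECT away. A minimal sketch of pulling the interest trend for a single keyword (again, the database file name is an assumption):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import duckdb

# Assumed file name -- adjust to your setup
conn = duckdb.connect("serp_trends.duckdb", read_only=True)

# Interest trend for one keyword, newest first
trend = conn.execute("""
    SELECT snapshot_date, interest_score, new_domains_count,
           avg_rank_improvement, reshuffle_frequency
    FROM interest_scores
    WHERE query = ?
    ORDER BY snapshot_date DESC
""", ["python"]).fetchdf()

print(trend)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;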

&lt;h2&gt;
  
  
  Module 3: Analytical Queries
&lt;/h2&gt;

&lt;p&gt;The nerdiest part of our logic lives in &lt;code&gt;analytics.py&lt;/code&gt;. This module opens DuckDB in read-only mode and exposes focused analytical queries.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Read the full code for &lt;strong&gt;analytics.py&lt;/strong&gt; here: &lt;a href="https://github.com/sixthextinction/duckdb-google-trends-basic/blob/main/src/analytics.py" rel="noopener noreferrer"&gt;https://github.com/sixthextinction/duckdb-google-trends-basic/blob/main/src/analytics.py&lt;/a&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;A good analytical query to start with is rank volatility:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;rank_volatility&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;days&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;int&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;30&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;Dict&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Any&lt;/span&gt;&lt;span class="p"&gt;]:&lt;/span&gt;    
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Calculate rank volatility for URLs over time&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;    
    &lt;span class="n"&gt;cutoff_date&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;datetime&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;now&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="nf"&gt;date&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="nf"&gt;timedelta&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;days&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;days&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;    

    &lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;conn&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;execute&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;    
        WITH rank_changes AS (    
            SELECT     
                url,    
                domain,    
                rank,    
                snapshot_date,    
                LAG(rank) OVER (PARTITION BY url ORDER BY snapshot_date) as prev_rank    
            FROM serp_snapshots    
            WHERE query = ? AND snapshot_date &amp;gt;= ?    
            ORDER BY url, snapshot_date    
        ),    
        volatility AS (    
            SELECT     
                url,    
                domain,    
                COUNT(*) as snapshot_count,    
                AVG(rank) as avg_rank,    
                MIN(rank) as best_rank,    
                MAX(rank) as worst_rank,    
                STDDEV(rank) as rank_stddev,    
                COUNT(CASE WHEN prev_rank IS NOT NULL AND rank != prev_rank THEN 1 END) as rank_changes    
            FROM rank_changes    
            GROUP BY url, domain    
        )    
        SELECT     
            url,    
            domain,    
            snapshot_count,    
            ROUND(avg_rank, 2) as avg_rank,    
            best_rank,    
            worst_rank,    
            ROUND(rank_stddev, 2) as rank_stddev,    
            rank_changes,    
            ROUND(CAST(rank_changes AS DOUBLE) / NULLIF(snapshot_count - 1, 0) * 100, 1) as volatility_pct    
        FROM volatility    
        WHERE snapshot_count &amp;gt; 1    
        ORDER BY rank_stddev DESC, avg_rank ASC    
        LIMIT 50    
    &lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;cutoff_date&lt;/span&gt;&lt;span class="p"&gt;]).&lt;/span&gt;&lt;span class="nf"&gt;df&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;    

    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;query&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;days&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;days&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;results&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This uses window functions (LAG) and aggregations (STDDEV) to surface the URLs that move around the most. Queries like this usually live in a data warehouse; here they’re just SQL running against a local file.&lt;/p&gt;
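
&lt;p&gt;You can also call this directly from Python instead of going through the CLI shown next. A minimal sketch (the class name, module path, and constructor argument are assumptions; check &lt;code&gt;analytics.py&lt;/code&gt; for the real interface):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Hypothetical names: the class and constructor below are assumptions, not the repo's exact API
from analytics import SERPAnalytics

analytics = SERPAnalytics("serp_trends.duckdb")        # opens the database read-only
report = analytics.rank_volatility("python", days=30)

# 'results' is a pandas DataFrame (the .df() call at the end of the query above)
print(report["results"].head(10))
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;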

&lt;p&gt;&lt;strong&gt;Run this with:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;python main.py volatility &lt;span class="nt"&gt;--query&lt;/span&gt; &lt;span class="s2"&gt;"python"&lt;/span&gt; &lt;span class="nt"&gt;--days&lt;/span&gt; 30
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This analyzes the last 30 days of snapshots for the query string “python”, calculating each URL’s average rank, best/worst rank, standard deviation, and change frequency, and displays the 50 most volatile URLs (the default limit).&lt;/p&gt;
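
&lt;p&gt;To make the &lt;code&gt;volatility_pct&lt;/code&gt; column concrete: it’s the share of day-to-day transitions where a URL’s rank actually changed. Worked through with the python.org row from the example output below:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Numbers taken from the python.org row in the example output below
snapshot_count = 30               # days the URL was observed
rank_changes = 15                 # transitions where the rank moved

transitions = snapshot_count - 1  # 30 snapshots = 29 day-to-day transitions
volatility_pct = rank_changes / transitions * 100

print(round(volatility_pct, 1))   # 51.7
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;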

&lt;p&gt;&lt;strong&gt;Example output:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="o"&gt;===&lt;/span&gt; Rank Volatility &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="s1"&gt;'python'&lt;/span&gt; &lt;span class="o"&gt;(&lt;/span&gt;last 30 days&lt;span class="o"&gt;)&lt;/span&gt; &lt;span class="o"&gt;===&lt;/span&gt;  

Top 10 most volatile URLs:  

| url | domain | snapshot_count | avg_rank | best_rank | worst_rank | rank_stddev | rank_changes | volatility_pct |  
| &lt;span class="nt"&gt;---&lt;/span&gt; | &lt;span class="nt"&gt;---&lt;/span&gt; | &lt;span class="nt"&gt;---&lt;/span&gt; | &lt;span class="nt"&gt;---&lt;/span&gt; | &lt;span class="nt"&gt;---&lt;/span&gt; | &lt;span class="nt"&gt;---&lt;/span&gt; | &lt;span class="nt"&gt;---&lt;/span&gt; | &lt;span class="nt"&gt;---&lt;/span&gt; | &lt;span class="nt"&gt;---&lt;/span&gt; |  
| https://www.codecademy.com/catalog/language/python | codecademy.com | 25 | 5.2 | 3 | 10 | 2.15 | 10 | 41.7 |  
| https://en.wikipedia.org/wiki/Python_&lt;span class="o"&gt;(&lt;/span&gt;programming_language&lt;span class="o"&gt;)&lt;/span&gt; | wikipedia.org | 28 | 4.1 | 2 | 8 | 1.89 | 12 | 44.4 |  
| https://www.w3schools.com/python/ | w3schools.com | 30 | 2.3 | 1 | 5 | 1.12 | 18 | 62.1 |  
| https://www.python.org/ | python.org | 30 | 1.5 | 1 | 3 | 0.67 | 15 | 51.7 |
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Another very useful query finds new entrants:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;new_entrants&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;days&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;int&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;7&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;    
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Find URLs that appeared for the first time recently&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;    
    &lt;span class="n"&gt;cutoff_date&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;datetime&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;now&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="nf"&gt;date&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="nf"&gt;timedelta&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;days&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;days&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;    

    &lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;conn&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;execute&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;    
        WITH first_appearance AS (    
            SELECT     
                url,    
                domain,    
                MIN(snapshot_date) as first_seen    
            FROM serp_snapshots    
            WHERE query = ?    
            GROUP BY url, domain    
        ),    
        recent_entrants AS (    
            SELECT     
                fa.url,    
                fa.domain,    
                fa.first_seen,    
                s.rank as first_rank,    
                s.title,    
                s.snippet    
            FROM first_appearance fa    
            JOIN serp_snapshots s     
                ON fa.url = s.url     
                AND fa.first_seen = s.snapshot_date    
                AND s.query = ?    
            WHERE fa.first_seen &amp;gt;= ?    
        )    
        SELECT     
            url,    
            domain,    
            first_seen,    
            first_rank,    
            title,    
            snippet    
        FROM recent_entrants    
        ORDER BY first_seen DESC, first_rank ASC    
        LIMIT 50    
    &lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;cutoff_date&lt;/span&gt;&lt;span class="p"&gt;]).&lt;/span&gt;&lt;span class="nf"&gt;df&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;    

    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;query&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;days&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;days&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;results&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This finds URLs whose first appearance falls within the last N days — perfect for spotting new competitors or fresh content entering the rankings.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Run this with:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;python main.py new-entrants &lt;span class="nt"&gt;--query&lt;/span&gt; &lt;span class="s2"&gt;"python"&lt;/span&gt; &lt;span class="nt"&gt;--days&lt;/span&gt;  7
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Example output:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="o"&gt;===&lt;/span&gt; New Entrants &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="s1"&gt;'python'&lt;/span&gt; &lt;span class="o"&gt;(&lt;/span&gt;last 7 days&lt;span class="o"&gt;)&lt;/span&gt; &lt;span class="o"&gt;===&lt;/span&gt;  

Found 3 new URLs:  

| url | domain | first_seen | first_rank | title | snippet |  
| &lt;span class="nt"&gt;---&lt;/span&gt; | &lt;span class="nt"&gt;---&lt;/span&gt; | &lt;span class="nt"&gt;---&lt;/span&gt; | &lt;span class="nt"&gt;---&lt;/span&gt; | &lt;span class="nt"&gt;---&lt;/span&gt; | &lt;span class="nt"&gt;---&lt;/span&gt; |  
| https://realpython.com/ | realpython.com | 2026-02-04 | 7 | Real Python - Python Tutorials | Learn Python programming with Real Python's comprehensive tutorials and courses... |  
| https://www.pythonforbeginners.com/ | pythonforbeginners.com | 2026-02-05 | 9 | Python For Beginners | A comprehensive guide to learning Python programming from scratch... |  
| https://docs.python-guide.org/ | docs.python-guide.org | 2026-02-06 | 8 | The Hitchhiker's Guide to Python | Best practices and recommendations for Python development... |
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;I’m not going to go over every module, but it’s all in the code. &lt;a href="https://github.com/sixthextinction/duckdb-google-trends-basic/blob/main/README.md" rel="noopener noreferrer"&gt;Find all queries + their expected output in the project README.md&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  Module 4: The Snapshot Fetcher
&lt;/h2&gt;

&lt;p&gt;Finally, &lt;code&gt;scraper.py&lt;/code&gt; (I’m so sorry — I &lt;em&gt;really&lt;/em&gt; could have named this better 😅) connects ingestion and storage.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Read the full code for &lt;strong&gt;scraper.py&lt;/strong&gt; here: &lt;a href="https://github.com/sixthextinction/duckdb-google-trends-basic/blob/main/src/scraper.py" rel="noopener noreferrer"&gt;https://github.com/sixthextinction/duckdb-google-trends-basic/blob/main/src/scraper.py&lt;/a&gt;&lt;br&gt;
&lt;/p&gt;
&lt;/blockquote&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;time&lt;/span&gt;    
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;datetime&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;datetime&lt;/span&gt;    
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;typing&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;List&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Optional&lt;/span&gt;    

&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;serp_client&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;BrightDataClient&lt;/span&gt;    
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;duckdb_manager&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;DuckDBManager&lt;/span&gt;    


&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;fetch_snapshots&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;keywords&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;List&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;num_results&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;int&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;delay&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;float&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mf"&gt;1.0&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;    
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;    
    Fetch SERP snapshots for keywords and store in DuckDB    

    Args:    
        keywords: List of search keywords    
        num_results: Number of results per keyword    
        delay: Delay between API calls (seconds)    
    &lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;    
    &lt;span class="n"&gt;client&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;BrightDataClient&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;    

    &lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="nc"&gt;DuckDBManager&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;db&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;    
        &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Fetching snapshots for &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;keywords&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; keywords...&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;    

        &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;idx&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;keyword&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nf"&gt;enumerate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;keywords&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;    
            &lt;span class="k"&gt;try&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;    
                &lt;span class="c1"&gt;# Fetch SERP results    
&lt;/span&gt;                &lt;span class="n"&gt;serp_data&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;search&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;keyword&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;num_results&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;num_results&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;    

                &lt;span class="c1"&gt;# Extract organic results    
&lt;/span&gt;                &lt;span class="n"&gt;organic_results&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;    
                &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="nf"&gt;isinstance&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;serp_data&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nb"&gt;dict&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="ow"&gt;and&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;organic&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;serp_data&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;    
                    &lt;span class="n"&gt;organic_results&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;serp_data&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;organic&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;    

                &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;organic_results&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;    
                    &lt;span class="c1"&gt;# Insert snapshot    
&lt;/span&gt;                    &lt;span class="n"&gt;db&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;insert_snapshot&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;organic_results&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;keyword&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;    
                    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;[&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;idx&lt;/span&gt;&lt;span class="o"&gt;+&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;/&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;keywords&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;] &lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;keyword&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;organic_results&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; results&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;    
                &lt;span class="k"&gt;else&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;    
                    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;[&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;idx&lt;/span&gt;&lt;span class="o"&gt;+&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;/&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;keywords&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;] &lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;keyword&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;: No results found&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;    

                &lt;span class="c1"&gt;# Rate limiting    
&lt;/span&gt;                &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;idx&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;keywords&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;    
                    &lt;span class="n"&gt;time&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;sleep&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;delay&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;    

            &lt;span class="k"&gt;except&lt;/span&gt; &lt;span class="nb"&gt;Exception&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;    
                &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Error fetching &lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;keyword&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;    
                &lt;span class="k"&gt;continue&lt;/span&gt;    

        &lt;span class="n"&gt;total&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;db&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get_snapshot_count&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;    
        &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s"&gt;Total snapshots in database: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;total&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Again, this is just simple orchestration logic: it iterates over keywords, fetches results, and inserts snapshots, with rate limiting and error handling at the edges. I’ve kept the core logic deliberately simple.&lt;/p&gt;

&lt;p&gt;That’s everything! Remember, &lt;code&gt;main.py&lt;/code&gt; brings all of these together.&lt;/p&gt;

&lt;h2&gt;
  
  
  Real World Use Cases
&lt;/h2&gt;

&lt;p&gt;Now that you understand how it works, here’s some cool things you can actually &lt;em&gt;do&lt;/em&gt; with this tool.&lt;/p&gt;

&lt;h3&gt;
  
  
  1. Detect Google Algorithm Updates Before They’re Announced
&lt;/h3&gt;

&lt;p&gt;When tracking multiple keywords in the same niche, sudden volatility spikes across all of them indicate an algorithm change.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;python main.py volatility &lt;span class="nt"&gt;--query&lt;/span&gt; &lt;span class="s2"&gt;"react"&lt;/span&gt; &lt;span class="nt"&gt;--days&lt;/span&gt; 7  
python main.py volatility &lt;span class="nt"&gt;--query&lt;/span&gt; &lt;span class="s2"&gt;"vue"&lt;/span&gt; &lt;span class="nt"&gt;--days&lt;/span&gt; 7  
python main.py volatility &lt;span class="nt"&gt;--query&lt;/span&gt; &lt;span class="s2"&gt;"angular"&lt;/span&gt; &lt;span class="nt"&gt;--days&lt;/span&gt; 7
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If all three show high &lt;code&gt;rank_stddev&lt;/code&gt; and &lt;code&gt;volatility_pct&lt;/code&gt;, Google likely pushed an update.&lt;/p&gt;

&lt;p&gt;SEO folks pay $200/month for SEMrush Sensor-style tooling just to get this signal. You’re building it for the cost of SERP API calls.&lt;/p&gt;
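
&lt;p&gt;If you’d rather automate that check than eyeball three terminals, here’s a rough sketch. The module and class names are placeholders for whatever the repo’s analytics module actually exposes; the only interface assumption is the volatility query we saw earlier, which returns a dict with a &lt;code&gt;results&lt;/code&gt; DataFrame:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# rough sketch: flag a likely algorithm update when every tracked keyword looks volatile
# RankAnalyzer / analytics are placeholder names; adjust to the repo's actual module
from analytics import RankAnalyzer  # placeholder import

KEYWORDS = ["react", "vue", "angular"]
STDDEV_THRESHOLD = 1.5  # arbitrary cutoff; tune for your niche

analyzer = RankAnalyzer()
volatile = []
for kw in KEYWORDS:
    report = analyzer.volatility(query=kw, days=7)
    df = report["results"]  # DataFrame with a rank_stddev column
    if not df.empty and df["rank_stddev"].median() &amp;gt; STDDEV_THRESHOLD:
        volatile.append(kw)

if len(volatile) == len(KEYWORDS):
    print("All tracked keywords are volatile - possible algorithm update:", volatile)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

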

&lt;h3&gt;
  
  
  2. Spy on Competitor SEO Tactics
&lt;/h3&gt;

&lt;p&gt;Track title and snippet changes to see what competitors are A/B testing:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;python main.py changes &lt;span class="nt"&gt;--query&lt;/span&gt; &lt;span class="s2"&gt;"nextjs tutorial"&lt;/span&gt; &lt;span class="nt"&gt;--days&lt;/span&gt; 30
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Example output:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;| url                              | prev_title                    | new_title                                               |  
|----------------------------------|-------------------------------|---------------------------------------------------------|  
| https://nextjs.org/docs          | Next.js Documentation         | Next.js Docs | Next.js                                  |  
| https://nextjs.org/learn         | Learn Next.js                 | Learn Next.js | Next.js by Vercel - The React Framework |
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Let’s say a site changed their title from a generic page description to something more specific. If their rank improved after the change, that’s a signal the new title performs better — &lt;strong&gt;steal that pattern&lt;/strong&gt;!&lt;/p&gt;

&lt;h3&gt;
  
  
  3. Find Content Gaps in Real-Time
&lt;/h3&gt;

&lt;p&gt;See which sites are entering top 10 and what format they’re using:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;python main.py new-entrants &lt;span class="nt"&gt;--query&lt;/span&gt; &lt;span class="s2"&gt;"react hooks tutorial"&lt;/span&gt; &lt;span class="nt"&gt;--days&lt;/span&gt; 7
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Example output:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;| domain              | first_seen | first_rank | title                                    |  
|---------------------|------------|------------|------------------------------------------|  
| react-tutorial.dev  | 2026-02-05 | 7          | React Hooks Interactive Tutorial         |  
| codesandbox.io      | 2026-02-06 | 9          | Learn React Hooks - Live Coding Examples |
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Let’s say there are two new entrants in the SERP for the query “react hooks tutorial”, and &lt;em&gt;both&lt;/em&gt; new entrants have “Interactive” or “Live” in their titles. That means Google is currently rewarding interactive content for this query. &lt;strong&gt;Adjust your content strategy accordingly.&lt;/strong&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  4. Validate Content Ideas Before Creation
&lt;/h3&gt;

&lt;p&gt;This one’s super simple to understand. High volatility = easier for you to rank. Low volatility = established players dominate.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;python main.py volatility &lt;span class="nt"&gt;--query&lt;/span&gt; &lt;span class="s2"&gt;"python tutorial"&lt;/span&gt; &lt;span class="nt"&gt;--days&lt;/span&gt; 30  
python main.py volatility &lt;span class="nt"&gt;--query&lt;/span&gt; &lt;span class="s2"&gt;"rust async tutorial"&lt;/span&gt; &lt;span class="nt"&gt;--days&lt;/span&gt; 30
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Let’s say the query “python tutorial” shows &lt;code&gt;rank_stddev: 0.3&lt;/code&gt; (very stable) and “rust async tutorial” shows &lt;code&gt;rank_stddev: 2.1&lt;/code&gt; (chaotic), &lt;strong&gt;focus on the Rust content&lt;/strong&gt;! The Python keyword is locked down by W3Schools and Real Python — you won’t break in easily.&lt;/p&gt;

&lt;h3&gt;
  
  
  5. Track Your Own Product’s SERP Performance
&lt;/h3&gt;

&lt;p&gt;Monitor how your product ranks for target keywords:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;python main.py fetch &lt;span class="nt"&gt;--keywords&lt;/span&gt; &lt;span class="s2"&gt;"whatever your product is or does"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Then check if you’re entering top 10:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;python main.py new-entrants &lt;span class="nt"&gt;--query&lt;/span&gt; &lt;span class="s2"&gt;"whatever your product is or does"&lt;/span&gt; &lt;span class="nt"&gt;--days&lt;/span&gt; 7
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If your product URL appears, congrats — you just entered the top 10. If competitors are dropping out (volatility shows their ranks declining), you’re winning.&lt;/p&gt;

&lt;h2&gt;
  
  
  What I Learned Building This
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;  &lt;strong&gt;DuckDB is a total cheat code for embedded analytics.&lt;/strong&gt; I expected to need PostgreSQL (or ClickHouse, &lt;em&gt;ugh&lt;/em&gt;.) for time-series queries over SERP data. Without fiddling with any config, calculating rank volatility across 30 days of snapshots for 50 URLs ran in ~20ms for me. The database file was &amp;lt;5MB for weeks of data.&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Bright Data’s SERP API is very reliable.&lt;/strong&gt; I tried other SERP APIs before settling on Bright Data, primarily because of the consistent JSON output for Google and Bing, and support for DuckDuckGo, Yandex, etc. This experiment cost me pennies — but &lt;a href="https://get.brightdata.com/scraping-browser-acf6883?utm_content=i_built_a_self_hosted_google_trends_alternative_with_duckdb" rel="noopener noreferrer"&gt;make sure you check their pricing&lt;/a&gt; so you don’t get burnt by costs you shouldn’t be incurring.&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;The Interest Score formula probably needs tuning.&lt;/strong&gt; The 40/30/30 weighting (new domains / rank improvement / domain overlap ratio) was only my first guess. It works reasonably well, but it’s not perfect. At the very least, I should weight new domains more heavily for breaking-news queries and reduce the impact of domain overlap ratio for stable niches (because, for example, Wikipedia will always be #1 for “Python programming language”). A rough sketch of the blend follows this list.&lt;/li&gt;
&lt;/ul&gt;
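
&lt;p&gt;For reference, here’s roughly what that 40/30/30 blend looks like in code. The component definitions below are my own reading of the weighting, not a copy of the repo’s implementation:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# rough sketch of the 40/30/30 Interest Score blend -- component definitions are my
# own reading, not the repo's implementation
def interest_score(new_domain_ratio: float,
                   rank_improvement_ratio: float,
                   domain_overlap_ratio: float) -&amp;gt; float:
    """All inputs are 0..1; a higher score means more 'interesting' movement."""
    return (
        0.40 * new_domain_ratio                # share of domains not seen before
        + 0.30 * rank_improvement_ratio        # share of URLs whose rank improved
        + 0.30 * (1.0 - domain_overlap_ratio)  # low overlap with yesterday = more churn
    )

# a stable query (e.g. "Python programming language") should score low:
print(interest_score(new_domain_ratio=0.1,
                     rank_improvement_ratio=0.2,
                     domain_overlap_ratio=0.9))  # 0.13
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

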

&lt;h2&gt;
  
  
  Try It Yourself
&lt;/h2&gt;

&lt;p&gt;Again, the full code is on GitHub: &lt;a href="https://github.com/sixthextinction/duckdb-google-trends-basic/" rel="noopener noreferrer"&gt;https://github.com/sixthextinction/duckdb-google-trends-basic/&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Quick start:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;git clone https://github.com/sixthextinction/duckdb-google-trends-basic.git  
&lt;span class="c"&gt;# or...  &lt;/span&gt;
gh repo clone sixthextinction/duckdb-google-trends-basic  
&lt;span class="c"&gt;# then...  &lt;/span&gt;
&lt;span class="nb"&gt;cd &lt;/span&gt;duckdb-google-trends-basic    
pip &lt;span class="nb"&gt;install&lt;/span&gt; &lt;span class="nt"&gt;-r&lt;/span&gt; requirements.txt    

&lt;span class="c"&gt;# Set environment variables    &lt;/span&gt;
&lt;span class="nb"&gt;export &lt;/span&gt;&lt;span class="nv"&gt;BRIGHT_DATA_API_KEY&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"your_key"&lt;/span&gt;    
&lt;span class="nb"&gt;export &lt;/span&gt;&lt;span class="nv"&gt;BRIGHT_DATA_ZONE&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"your_zone"&lt;/span&gt;    
&lt;span class="nb"&gt;export &lt;/span&gt;&lt;span class="nv"&gt;BRIGHT_DATA_COUNTRY&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"us"&lt;/span&gt;  &lt;span class="c"&gt;# optional, for geo-targeted results    &lt;/span&gt;

&lt;span class="c"&gt;# Or use a .env file instead (python-dotenv is included)    &lt;/span&gt;

&lt;span class="c"&gt;# Test with sample data (no API key needed)    &lt;/span&gt;
python seed_data.py    
python main.py scores &lt;span class="nt"&gt;--query&lt;/span&gt; &lt;span class="s2"&gt;"nextjs"&lt;/span&gt; &lt;span class="nt"&gt;--days&lt;/span&gt; 7    

&lt;span class="c"&gt;# Or fetch real data    &lt;/span&gt;
python main.py fetch &lt;span class="nt"&gt;--keywords&lt;/span&gt; &lt;span class="s2"&gt;"react"&lt;/span&gt; &lt;span class="s2"&gt;"vue"&lt;/span&gt; &lt;span class="s2"&gt;"svelte"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;I’ve included a &lt;a href="https://github.com/sixthextinction/duckdb-google-trends-basic/blob/main/seed_data.py" rel="noopener noreferrer"&gt;seed script&lt;/a&gt; that creates 7 days of synthetic data so you can test immediately without waiting. Otherwise, set up a daily cron job to fetch snapshots automatically, and within a week you’ll have real trend data.&lt;/p&gt;

&lt;p&gt;Thanks for reading!&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Hi 👋 I’m constantly tinkering with dev tools, running weird-ass experiments, and otherwise building/deep-diving stuff that probably shouldn’t work but does — and writing about it. I put out a new post every Monday/Tuesday. If you’re into offbeat experiments and dev tools that actually don’t suck, give me a follow.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;If you did something cool with this tool, I’d love to see it.&lt;/em&gt; &lt;a href="https://www.linkedin.com/in/prithwish-nath-04b873a7/" rel="noopener noreferrer"&gt;Reach out on LinkedIn&lt;/a&gt;, &lt;em&gt;or put it in the comments below.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>webdev</category>
      <category>python</category>
      <category>programming</category>
      <category>tutorial</category>
    </item>
    <item>
      <title>Turning Google Search into a Kafka Event Stream for Many Consumers</title>
      <dc:creator>Prithwish Nath</dc:creator>
      <pubDate>Tue, 03 Feb 2026 08:43:13 +0000</pubDate>
      <link>https://dev.to/prithwish_nath/turning-google-search-into-a-kafka-event-stream-for-many-consumers-362g</link>
      <guid>https://dev.to/prithwish_nath/turning-google-search-into-a-kafka-event-stream-for-many-consumers-362g</guid>
      <description>&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;TL;DR:&lt;/strong&gt; Google rankings are a bad abstraction for what searchers actually see. This is an event-driven approach to monitoring SERP changes — tracking features, entries, and exits instead of noisy rank movements.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;The SERP — &lt;a href="https://en.wikipedia.org/wiki/Search_engine_results_page" rel="noopener noreferrer"&gt;Google’s ‘search engine results page’&lt;/a&gt; — is what humans &lt;em&gt;actually&lt;/em&gt; see when they type in a search and hit enter. It has ads, featured snippets/AI overviews, carousels, and knowledge panels long before organic search results. In fact, in 2026, raw organic results increasingly feel like a fallback rather than the main event. You can sit at position three for months and still lose traffic overnight because Google introduced a new ad or feature above you.&lt;/p&gt;

&lt;p&gt;That’s the core problem with naïve rank tracking. It assumes the environment is static and your position within it is the only variable. The truth is that the changes Google makes to the results page often matter more (for market intelligence, among other things) than whether you moved up or down a position.&lt;/p&gt;

&lt;p&gt;If you’re monitoring Google for your brand and care about traffic risk, competitive pressure, or acquisition costs, these are just some of the things that can happen:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  Google adds ads → organic CTR drops&lt;/li&gt;
&lt;li&gt;  Google launches a featured snippet → one domain captures disproportionate attention&lt;/li&gt;
&lt;li&gt;  A new competitor enters the page → your share of attention changes even if your rank doesn’t&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;None of this is well represented by a line chart of rankings. I couldn’t find many blogs that discuss this specifically without slipping into generic SEO optimization garbage, so I figured I’d write the one I wish I’d found. 🤷‍♂️ Let’s get down to it.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why Kafka?
&lt;/h2&gt;

&lt;p&gt;As long as you’re answering one question for one person, you don’t need Kafka. A cron job + Postgres work just fine.&lt;/p&gt;

&lt;p&gt;The moment that stops being enough is when you want any combination of: history, alerts, multiple downstream consumers, or independent evolution of logic over time. Anything that takes this beyond a simple one-off. That’s usually where people keep bolting features onto a single script until it’s doing scraping, diffing, alerting, analytics, and reporting all at once.&lt;/p&gt;

&lt;p&gt;Both paths lead to systems that are hard to reason about and harder to change. How do we avoid this?&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Think about the basic unit of information you care about, and design around that.&lt;/strong&gt; In this case, our “unit” is &lt;em&gt;change&lt;/em&gt;, not snapshots. Because we want to answer “&lt;em&gt;what happened over this period of X&lt;/em&gt;?”&lt;/p&gt;

&lt;p&gt;So why use Kafka? It’s just the most boring (battle tested, well documented, not flashy) tech I could think of that accomplishes it. These are the benefits:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt; &lt;strong&gt;Event replay.&lt;/strong&gt; If you build a new consumer next month, you can replay the last 7 days of SERP changes (or however long you configure retention). Your historical data isn’t trapped in database rows — it’s a stream you can re-process.&lt;/li&gt;
&lt;li&gt; &lt;strong&gt;Independent consumers.&lt;/strong&gt; Each consumer has its own offset. Your alerting system can crash and restart without affecting your analytics pipeline. They process events at their own pace.&lt;/li&gt;
&lt;li&gt; &lt;strong&gt;Ordering guarantees.&lt;/strong&gt; Events for the same keyword go to the same partition (we use {keyword}:{geo} as the key; see the producer sketch after this list). You’ll never see a “domain entered” event before the “featured snippet appeared” event that caused it.&lt;/li&gt;
&lt;li&gt; &lt;strong&gt;Backpressure handling.&lt;/strong&gt; If your volatility analyzer is slow, Kafka doesn’t care. Events queue up and the producer keeps producing.&lt;/li&gt;
&lt;/ol&gt;
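
&lt;p&gt;To make point 3 concrete, here’s a minimal keyed-producer sketch using kafka-python. The topic name and event payload are placeholders, not the exact ones we build later:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# minimal sketch of keyed production -- topic name and event payload are placeholders
import json
from kafka import KafkaProducer

producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    key_serializer=lambda k: k.encode("utf-8"),
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

event = {"event_type": "organic_entry", "keyword": "ai crm", "geo": "US"}

# the {keyword}:{geo} key routes every event for a keyword to the same partition,
# which is what preserves per-keyword ordering
producer.send("serp.changes", key=f"{event['keyword']}:{event['geo']}", value=event)
producer.flush()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

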

&lt;p&gt;&lt;a href="https://kafka.apache.org/?source=post_page-----8606f9f543b1---------------------------------------" rel="noopener noreferrer"&gt;Apache Kafka&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Is setting up Kafka more complex than a database? Yep. Is it worth it? At this level, you can’t really afford to go with anything else.&lt;/p&gt;

&lt;h2&gt;
  
  
  What We’re Building
&lt;/h2&gt;

&lt;p&gt;The architecture is straightforward:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fm8u74ci2mydud7j1zrx8.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fm8u74ci2mydud7j1zrx8.png" width="640" height="677"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;We continuously monitor a set of keywords via a SERP API (I’m using &lt;a href="https://get.brightdata.com/bd7914?utm_content=turning_google_search_into_a_kafka_event_stream_for_many_consumers" rel="noopener noreferrer"&gt;Bright Data&lt;/a&gt;), detect when the structure changes (ads appear, featured snippets shift, domains enter/exit), emit those changes (and &lt;em&gt;only&lt;/em&gt; the changes — we have to maintain some form of state to spot the delta) as events into Kafka, and fan them out to multiple downstream consumers who can then log or store the data as they like.&lt;/p&gt;

&lt;h2&gt;
  
  
  Prerequisites
&lt;/h2&gt;

&lt;p&gt;Before we start building, you need Python dependencies and Kafka infrastructure running.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;&amp;gt; pip install requests python-dotenv kafka-python
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Next, Kafka requires &lt;a href="https://zookeeper.apache.org/" rel="noopener noreferrer"&gt;Zookeeper&lt;/a&gt; and a broker. Instead of installing them manually, we use Docker Compose with this YAML file.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;em&gt;💡&lt;/em&gt; In newer Kafka versions (≥3.3), &lt;a href="https://developer.confluent.io/learn/kraft/" rel="noopener noreferrer"&gt;KRaft (Kafka without ZooKeeper)&lt;/a&gt; replaces the old coordination model.&lt;/p&gt;

&lt;p&gt;I’m sticking with what I’m familiar with here, but if you’re setting up a modern cluster — or using an AI agent to generate your setup — it may use KRaft instead of ZooKeeper by default. Both approaches work fine for our use case, and we treat ZooKeeper as largely set-and-forget anyway.&lt;br&gt;
&lt;/p&gt;
&lt;/blockquote&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;version: '3.8'  

services:  
  zookeeper:  
    image: confluentinc/cp-zookeeper:7.5.0  
    hostname: zookeeper  
    container_name: zookeeper  
    ports:  
      - "2181:2181"  
    environment:  
      ZOOKEEPER_CLIENT_PORT: 2181  
      ZOOKEEPER_TICK_TIME: 2000  

  kafka:  
    image: confluentinc/cp-kafka:7.5.0  
    hostname: kafka  
    container_name: kafka  
    depends_on:  
      - zookeeper  
    ports:  
      - "9092:9092"  
      - "9101:9101"  
    environment:  
      KAFKA_BROKER_ID: 1  
      KAFKA_ZOOKEEPER_CONNECT: 'zookeeper:2181'  
      KAFKA_LISTENER_SECURITY_PROTOCOL_MAP: PLAINTEXT:PLAINTEXT,PLAINTEXT_INTERNAL:PLAINTEXT  
      KAFKA_ADVERTISED_LISTENERS: PLAINTEXT://localhost:9092,PLAINTEXT_INTERNAL://kafka:29092  
      KAFKA_OFFSETS_TOPIC_REPLICATION_FACTOR: 1  
      KAFKA_TRANSACTION_STATE_LOG_MIN_ISR: 1  
      KAFKA_TRANSACTION_STATE_LOG_REPLICATION_FACTOR: 1  
      KAFKA_GROUP_INITIAL_REBALANCE_DELAY_MS: 0  
      KAFKA_AUTO_CREATE_TOPICS_ENABLE: 'true'  
      # retention: 7 days  
      KAFKA_LOG_RETENTION_HOURS: 168  
      KAFKA_LOG_RETENTION_BYTES: 1073741824  


&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Then, run it with&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;&amp;gt; docker-compose up -d
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This starts:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  Zookeeper (at port 2181) — coordinates Kafka cluster&lt;/li&gt;
&lt;li&gt;  Kafka Broker (at port 9092) — stores and distributes events&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Finally, for the SERP API, &lt;a href="https://get.brightdata.com/bd-serp-api?utm_content=turning_google_search_into_a_kafka_event_stream_for_many_consumers" rel="noopener noreferrer"&gt;get your credentials here&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;…and put them in a .env file in the project root like so.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;BRIGHT_DATA_API_KEY=your_api_key_here  
BRIGHT_DATA_ZONE=your_zone_name_here  
BRIGHT_DATA_COUNTRY=us
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That’s it. You’re ready to start building.&lt;/p&gt;
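
&lt;p&gt;Optionally, you can sanity-check that the broker is reachable from Python before writing any real code. This isn’t part of the project, just a convenience:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# quick connectivity check (not part of the project) -- lists topics on the local broker
from kafka import KafkaConsumer

consumer = KafkaConsumer(bootstrap_servers="localhost:9092")
print("Broker reachable. Existing topics:", consumer.topics())
consumer.close()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

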

&lt;h2&gt;
  
  
  Step 1: Fetching SERP Results
&lt;/h2&gt;

&lt;p&gt;So we can’t scrape Google directly. Google blocks automated requests, and residential/datacenter proxies don’t work for search. The solution is to use a SERP API. These return structured JSON of Google’s results page, including organic results, ads, featured snippets, video carousels, and more.&lt;/p&gt;

&lt;p&gt;For reusability, we’ll first build a client that wraps the SERP API:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;
import os  
import requests  
from typing import Dict, Any, Optional  
from dotenv import load_dotenv  

load_dotenv()  


class BrightDataClient:  
    """  
    Reusable client for Bright Data SERP API.  
    """  

    def __init__(
        self,
        api_key: Optional[str] = None,
        zone: Optional[str] = None,
        country: Optional[str] = None
    ):
        # load from environment variables if not provided  
        env_api_key = os.getenv("BRIGHT_DATA_API_KEY")  
        env_zone = os.getenv("BRIGHT_DATA_ZONE")  
        env_country = os.getenv("BRIGHT_DATA_COUNTRY")  

        # use provided values or fall back to environment variables  
        self.api_key = api_key or env_api_key  
        self.zone = zone or env_zone  
        self.country = country or env_country  
        self.api_endpoint = "https://api.brightdata.com/request"  

        if not self.api_key:  
            raise ValueError(  
                "BRIGHT_DATA_API_KEY must be provided via constructor or environment variable. "  
                "Get your API key from: https://brightdata.com/cp/setting/users"  
            )  

        if not self.zone:  
            raise ValueError(  
                "BRIGHT_DATA_ZONE must be provided via constructor or environment variable. "  
                "Manage zones at: https://brightdata.com/cp/zones"  
            )  

        # setup session with API authentication  
        self.session = requests.Session()  
        self.session.headers.update({  
            'Content-Type': 'application/json',  
            'Authorization': f'Bearer {self.api_key}'  
        })  

    def search(
        self,
        query: str,
        num_results: int = 10,
        language: Optional[str] = None,
        country: Optional[str] = None
    ) -&amp;gt; Dict[str, Any]:
        """  
        Executes a google search.  
        Args:  
            query: Search query string  
            num_results: Number of results to return (default: 10)  
            language: Language code (e.g., 'en', 'es', 'fr')  
            country: Country code (e.g., 'us', 'uk', 'ca') - overrides instance default  

        Returns:  
            Dictionary containing search results in JSON format  
        """  
        # build Google search URL  
        search_url = (  
            f"https://www.google.com/search"  
            f"?q={requests.utils.quote(query)}"  
            f"&amp;amp;num={num_results}"  
            f"&amp;amp;brd_json=1"  
        )  

        if language:  
            search_url += f"&amp;amp;hl={language}&amp;amp;lr=lang_{language}"  

        # use method parameter country or instance default  
        target_country = country or self.country  

        # prepare request payload for SERP API  
        payload = {  
            'zone': self.zone,  
            'url': search_url,  
            'format': 'json'  
        }  

        if target_country:  
            payload['country'] = target_country  

        try:  
            # use POST request to SERP API endpoint  
            response = self.session.post(  
                self.api_endpoint,  
                json=payload,  
                timeout=30  
            )  
            response.raise_for_status()  

            # this SERP API returns JSON directly when format='json' so just return it  
            return response.json()  

        except requests.exceptions.HTTPError as e:  
            error_msg = f"Search request failed with HTTP {e.response.status_code}"  
            if e.response.text:  
                error_msg += f": {e.response.text[:200]}"  
            raise RuntimeError(error_msg) from e  
        except requests.exceptions.RequestException as e:  
            raise RuntimeError(f"Search request failed: {e}") from e
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;As a preview, here’s how we use it later.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;from src.bright_data import BrightDataClient  
# initialize client (reads credentials from .env file)  
client = BrightDataClient()  
# fetch SERP for a keyword  
raw_response = client.search(  
  query="ai crm",  
  num_results=10,  
  country="us"  
)  
# raw_response now contains the full SERP structure
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;There’s just one problem — the raw Bright Data response you get back from this API is way too verbose:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;{  
  "general": {  
    "search_engine": "google",  
    "query": "ai crm",  
    "language": "en",  
    "mobile": false,  
    "basic_view": false,  
    "search_type": "text",  
    "page_title": "ai crm - Google Search",  
    "timestamp": "2026–01–07T11:18:35.216Z"  
  },  
  "input": {  
    "original_url": "https://www.google.com/search?q=ai+crm&amp;amp;brd_json=1&amp;amp;gl=us",  
    "user_agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64)...",  
    "request_id": "some_request_id"  
  },  
  "navigation": [ /* 6 navigation links */ ],  
  "organic": [  
    {  
      "link": "https://attio.com/",  
      "source": "Attio",  
      "display_link": "https://attio.com",  
      "title": "Attio: The next gen of CRM",  
      "description": "Execute your revenue strategy with precision…",  
      "rank": 1,  
      "global_rank": 1,  
      "extensions": [ /* site links, dates, etc */ ]  
    },  
    // ...8 more results with full metadata  
  ],  
  "images": [ /* image results */ ],  
  "top_ads": [{}, {}],  
  "bottom_ads": [{}, {}, {}],  
  "pagination": { /* pagination links */ },  
  "related": [ /* 8 related searches */ ]  
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;That’s 350+ lines of JSON for a single search.&lt;/strong&gt; Most of it is noise. For change detection, we only care about:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt; &lt;strong&gt;Which domains are present&lt;/strong&gt; (not full URLs, titles, descriptions)&lt;/li&gt;
&lt;li&gt; &lt;strong&gt;Which features exist&lt;/strong&gt; (ads, featured snippets, video carousels) — boolean flags&lt;/li&gt;
&lt;li&gt; &lt;strong&gt;When this snapshot was taken&lt;/strong&gt; (timestamp)&lt;/li&gt;
&lt;li&gt; &lt;strong&gt;What keyword/geo this is for&lt;/strong&gt; (context)&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Everything else — navigation links, user agents, request IDs, pagination, related searches, full result metadata — is irrelevant for detecting &lt;em&gt;changes&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;So let’s normalize it into a much leaner format for consumption.&lt;/p&gt;

&lt;h2&gt;
  
  
  Step 2 : Why Normalization Matters
&lt;/h2&gt;

&lt;p&gt;So we transform the verbose raw response into a compact snapshot:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;"""  
/src/normalizer.py   

SERP data normalization  
converts raw Bright Data response to structured snapshot format  
"""  

from datetime import datetime  
from urllib.parse import urlparse  
from typing import Dict, Any  


def extract_domain(url: str) -&amp;gt; str:  
    """extract domain from URL"""  
    try:  
        parsed = urlparse(url)  
        domain = parsed.netloc or parsed.path  
        # remove www. prefix  
        if domain.startswith('www.'):  
            domain = domain[4:]  
        return domain  
    except Exception:  
        return url  


def normalize_serp_data(raw_response: Dict[str, Any], keyword: str, geo: str = "US") -&amp;gt; Dict[str, Any]:  
    """  
    normalize SERP response into structured format  

    Args:  
        raw_response: raw JSON response from Bright Data SERP API  
        keyword: search keyword  
        geo: geographic location code (default: "US")  

    Returns:  
        normalized dictionary with structured SERP data  
    """  

    # extract organic domains  
    organic_domains = []  
    if raw_response.get('organic') and isinstance(raw_response['organic'], list):  
        for result in raw_response['organic']:  
            link = result.get('link') or result.get('url', '')  
            if link:  
                domain = extract_domain(link)  
                if domain:  
                    organic_domains.append(domain)  

    # check for ads (top_ads or bottom_ads)  
    top_ads = raw_response.get('top_ads', [])  
    bottom_ads = raw_response.get('bottom_ads', [])  
    has_ads = bool(  
        (top_ads and len([a for a in top_ads if a]) &amp;gt; 0) or  
        (bottom_ads and len([a for a in bottom_ads if a]) &amp;gt; 0) or  
        (raw_response.get('ads') and len(raw_response.get('ads', [])) &amp;gt; 0)  
    )  
    ads_count = len([a for a in top_ads if a]) + len([a for a in bottom_ads if a])  

    # check for featured snippet (knowledge panel)  
    has_featured_snippet = bool(raw_response.get('knowledge'))  

    # check for video carousel  
    has_video_carousel = bool(raw_response.get('video') and len(raw_response.get('video', [])) &amp;gt; 0)  

    # check for people also ask  
    has_people_also_ask = bool(raw_response.get('people_also_ask') and len(raw_response.get('people_also_ask', [])) &amp;gt; 0)  

    # extract featured snippet owner if available  
    featured_snippet_owner = None  
    if has_featured_snippet:  
        knowledge = raw_response.get('knowledge', {})  
        # try to extract owner from knowledge panel  
        if knowledge.get('url'):  
            featured_snippet_owner = extract_domain(knowledge['url'])  
        elif knowledge.get('source'):  
            featured_snippet_owner = extract_domain(knowledge['source'])  

    # build normalized structure  
    normalized = {  
        "keyword": keyword,  
        "geo": geo,  
        "features": {  
            "ads": has_ads,  
            "ads_count": ads_count if has_ads else 0,  
            "featured_snippet": has_featured_snippet,  
            "featured_snippet_owner": featured_snippet_owner,  
            "video_carousel": has_video_carousel,  
            "people_also_ask": has_people_also_ask  
        },  
        "organic_domains": organic_domains,  
        "timestamp": datetime.utcnow().strftime("%Y-%m-%dT%H:%M:%SZ")  
    }  

    return normalized  

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Here, we extract just the domain names, not full URLs (i.e. from “&lt;code&gt;https://www.salesforce.com/crm/&lt;/code&gt;” to “&lt;code&gt;salesforce.com&lt;/code&gt;”). Why? Because:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  URLs change (query params, paths) but domains are stable identifiers&lt;/li&gt;
&lt;li&gt;  We care about &lt;em&gt;which sites&lt;/em&gt; are ranking, not specific pages&lt;/li&gt;
&lt;li&gt;  Domain-level tracking is more resilient to URL structure changes&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;We also convert complex nested structures into simple booleans:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Raw:   
{  
  "top_ads": [{...}, {...}],   
  "bottom_ads": []  
}  
# Normalized:   
{  
  "ads": true,   
  "ads_count": 2  
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This makes the comparisons we’ll perform later trivial:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;if current['features']['ads'] and not previous['features']['ads']:  
  # Ads appeared! Emit a signal.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Finally, when a featured snippet exists, we extract which domain owns it:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;if has_featured_snippet:  
    knowledge = raw_response.get('knowledge', {})
    if knowledge.get('url'):
        featured_snippet_owner = extract_domain(knowledge['url'])
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This lets us track when featured snippet ownership changes — a critical competitive signal.&lt;/p&gt;
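
&lt;p&gt;As a sketch of what that tracking could look like once we’re comparing snapshots (the change detector we build in Step 3 only attaches the owner when a snippet appears; this ownership-flip check is my own addition):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# sketch: flag a featured-snippet ownership flip between two normalized snapshots
# (this check is an addition of mine, not part of the detector built in Step 3)
def snippet_owner_changed(previous: dict, current: dict):
    prev_owner = previous["features"].get("featured_snippet_owner")
    curr_owner = current["features"].get("featured_snippet_owner")
    if prev_owner and curr_owner and prev_owner != curr_owner:
        return {
            "event_type": "featured_snippet_owner_changed",
            "keyword": current["keyword"],
            "geo": current["geo"],
            "previous_owner": prev_owner,
            "new_owner": curr_owner,
            "timestamp": current["timestamp"],
        }
    return None
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

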

&lt;h3&gt;
  
  
  Before vs After
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Foncib2vsbnqa560bcw3o.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Foncib2vsbnqa560bcw3o.png" width="640" height="354"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fapt51e55olgdg6wdvw8g.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fapt51e55olgdg6wdvw8g.png" width="640" height="609"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Raw response: 350+ lines, ~15KB. Normalized snapshot: 22 lines, ~500 bytes.&lt;/p&gt;

&lt;p&gt;This way, we store 30x less data per snapshot, and perform comparisons way faster because we’re comparing arrays of domains instead of deep object structures.&lt;/p&gt;

&lt;h2&gt;
  
  
  Step 3: Change Detection — Turning Snapshots into Events
&lt;/h2&gt;

&lt;p&gt;At this point we have normalized SERP snapshots. But snapshots alone aren’t very useful for Kafka. We’ve talked about this before — Kafka wants &lt;em&gt;events&lt;/em&gt;, not states.&lt;/p&gt;

&lt;p&gt;So the job of Step 3 is simple:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;em&gt;Compare the current snapshot with the previous one and emit&lt;/em&gt; only what changed &lt;em&gt;i.e. the delta.&lt;/em&gt;&lt;br&gt;
&lt;/p&gt;
&lt;/blockquote&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;"""  
/src/change_detector.py  
change detection logic for SERP snapshots  
compares current vs previous state and emits change events  
"""  

from typing import Dict, Any, List, Optional  
from datetime import datetime  


class ChangeDetector:  
    """  
    detects changes between SERP snapshots  
    generates change events for Kafka  
    """  

    def detect_changes(
        self,
        current: Dict[str, Any],
        previous: Optional[Dict[str, Any]]
    ) -&amp;gt; List[Dict[str, Any]]:
        """  
        compare current and previous snapshots, return list of change events  

        Args:  
            current: current normalized SERP snapshot  
            previous: previous normalized SERP snapshot (None if first run)  

        Returns:  
            list of change event dictionaries  
        """  
        events = []  

        if previous is None:  
            # first run - no changes to detect  
            return events  

        # detect feature changes  
        events.extend(self._detect_feature_changes(current, previous))  

        # detect organic domain changes  
        events.extend(self._detect_organic_changes(current, previous))  

        return events  

    def _detect_feature_changes(
        self,
        current: Dict[str, Any],
        previous: Dict[str, Any]
    ) -&amp;gt; List[Dict[str, Any]]:
        """detect changes in SERP features (ads, featured snippet, etc.)"""  
        events = []  

        current_features = current.get('features', {})  
        previous_features = previous.get('features', {})  

        feature_types = ['ads', 'featured_snippet', 'video_carousel', 'people_also_ask']  

        for feature in feature_types:  
            current_value = current_features.get(feature, False)  
            previous_value = previous_features.get(feature, False)  

            # feature appeared  
            if current_value and not previous_value:  
                event = {  
                    "event_type": "serp_feature_added",  
                    "keyword": current['keyword'],  
                    "geo": current['geo'],  
                    "feature": feature,  
                    "timestamp": current['timestamp']  
                }  

                # add feature-specific metadata  
                if feature == "ads":  
                    # count ads if available  
                    event["count"] = current.get('features', {}).get('ads_count', 1)  
                elif feature == "featured_snippet":  
                    # extract owner domain if available  
                    owner = current.get('features', {}).get('featured_snippet_owner')  
                    if owner:  
                        event["owner_domain"] = owner  

                events.append(event)  

            # feature disappeared  
            elif not current_value and previous_value:  
                event = {  
                    "event_type": "serp_feature_removed",  
                    "keyword": current['keyword'],  
                    "geo": current['geo'],  
                    "feature": feature,  
                    "timestamp": current['timestamp']  
                }  
                events.append(event)  

        return events  

    def _detect_organic_changes(
        self,
        current: Dict[str, Any],
        previous: Dict[str, Any]
    ) -&amp;gt; List[Dict[str, Any]]:
        """detect changes in organic domain rankings"""  
        events = []  

        current_domains = set(current.get('organic_domains', []))  
        previous_domains = set(previous.get('organic_domains', []))  

        # domains that exited (were in previous, not in current)  
        exited = previous_domains - current_domains  
        for domain in exited:  
            # find previous position  
            previous_position = self._get_domain_position(domain, previous)  
            event = {  
                "event_type": "organic_exit",  
                "keyword": current['keyword'],  
                "geo": current['geo'],  
                "domain": domain,  
                "previous_position": previous_position,  
                "timestamp": current['timestamp']  
            }  
            events.append(event)  

        # domains that entered (in current, not in previous)  
        entered = current_domains - previous_domains  
        for domain in entered:  
            # find current position  
            current_position = self._get_domain_position(domain, current)  
            event = {  
                "event_type": "organic_entry",  
                "keyword": current['keyword'],  
                "geo": current['geo'],  
                "domain": domain,  
                "current_position": current_position,  
                "timestamp": current['timestamp']  
            }  
            events.append(event)  

        return events  


    def _get_domain_position(  
        self,  
        domain: str,  
        snapshot: Dict[str, Any]  
    ) -&amp;gt; Optional[int]:  
        """get position of domain in organic results (1-indexed)"""  
        domains = snapshot.get('organic_domains', [])  
        try:  
            return domains.index(domain) + 1  
        except ValueError:  
            return None  


&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This detector compares two normalized snapshots and emits only the delta. Because we normalized SERP features into simple booleans in Step 2, detecting changes becomes trivial.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# this avoids deep JSON comparisons + false positives from minor SERP variations  
if current_value and not previous_value:  
    # feature appeared
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Also, for organic results, we intentionally ignore position changes. Rank reshuffles inside the top 10 happen constantly and aren’t very actionable. A domain entering or disappearing from the SERP is what’s &lt;em&gt;actually&lt;/em&gt; interesting. Because organic results were normalized into a flat list of domains, this becomes a set comparison: &lt;code&gt;previous - current&lt;/code&gt; gives the domains that exited, and &lt;code&gt;current - previous&lt;/code&gt; gives the domains that entered. That means O(1) membership lookups and one O(n) comparison per snapshot pair.&lt;/p&gt;
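
&lt;p&gt;As a quick, self-contained illustration of that set logic (the domain lists here are made up):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;previous_domains = {"hubspot.com", "salesforce.com", "zoho.com"}  
current_domains = {"hubspot.com", "salesforce.com", "pipedrive.com"}  

exited = previous_domains - current_domains    # {'zoho.com'}  
entered = current_domains - previous_domains   # {'pipedrive.com'}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
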

&lt;p&gt;Each run produces zero or more small events like:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;{  
  "event_type": "serp_feature_added",  
  "keyword": "ai crm",  
  "geo": "US",  
  "feature": "ads",  
  "count": 3,  
  "timestamp": "2026-01-07T12:00:00Z"  
},  
{  
  "event_type": "serp_feature_added",  
  "keyword": "ai crm",  
  "geo": "US",  
  "feature": "featured_snippet",  
  "owner_domain": "hubspot.com",  
  "timestamp": "2026-01-07T12:00:00Z"  
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Instead of storing “what the SERP looks like now”, we’re storing:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  &lt;strong&gt;what&lt;/strong&gt; happened&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;when&lt;/strong&gt; it happened&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;for which keyword + geo&lt;/strong&gt; it happened&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;When relevant, we also attach extra metadata (ad count, featured snippet owner) so downstream consumers don’t need to re-hydrate context.&lt;/p&gt;

&lt;p&gt;This is the exact shape Kafka wants. At this point, SERP data has stopped being scraped snapshots for us, and is behaving like a &lt;strong&gt;true event stream&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Before we start the pipeline, we should cover state management, briefly.&lt;/p&gt;

&lt;h2&gt;
  
  
  A Simple State Manager
&lt;/h2&gt;

&lt;p&gt;We’ll use simple file-based storage:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;"""  
/src/state_manager.py  
state management for SERP snapshots  
stores previous state to enable change detection  
"""  
import os  
import json  
from typing import Dict, Any, Optional  
from pathlib import Path  
class StateManager:  
    """  
    manages SERP snapshot state storage  
    stores previous snapshots to compare against current ones  
    """  

    def __init__(self, state_dir: str = "state"):  
        self.state_dir = Path(state_dir)  
        self.state_dir.mkdir(exist_ok=True)  

    def _get_state_file(self, keyword: str, geo: str) -&amp;gt; Path:  
        """get state file path for keyword+geo combination"""  
        safe_keyword = keyword.replace(" ", "_").lower()  
        safe_geo = geo.lower()  
        filename = f"{safe_keyword}_{safe_geo}.json"  
        return self.state_dir / filename  

    def load_previous_state(  
        self,  
        keyword: str,  
        geo: str  
    ) -&amp;gt; Optional[Dict[str, Any]]:  
        """  
        load previous SERP snapshot for keyword+geo  

        Returns:  
            previous snapshot dict or None if no previous state exists  
        """  
        state_file = self._get_state_file(keyword, geo)  

        if not state_file.exists():  
            return None  

        try:  
            with open(state_file, 'r', encoding='utf-8') as f:  
                return json.load(f)  
        except Exception as e:  
            print(f"Warning: failed to load state from {state_file}: {e}")  
            return None  

    def save_state(  
        self,  
        keyword: str,  
        geo: str,  
        snapshot: Dict[str, Any]  
    ) -&amp;gt; None:  
        """  
        save current SERP snapshot as new state  

        Args:  
            keyword: search keyword  
            geo: geographic location code  
            snapshot: normalized SERP snapshot to save  
        """  
        state_file = self._get_state_file(keyword, geo)  

        try:  
            with open(state_file, 'w', encoding='utf-8') as f:  
                json.dump(snapshot, f, indent=2, ensure_ascii=False)  
        except Exception as e:  
            print(f"Error: failed to save state to {state_file}: {e}")  
            raise
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;So we’ll store our state files like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;./state/  
  ├── ai_crm_us.json  
  ├── best_crm_us.json  
  └── ai_crm_uk.json
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Each state file contains the last normalized snapshot for that &lt;strong&gt;keyword + geo&lt;/strong&gt; combination, as you’ll see when we bring it all together.&lt;/p&gt;
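
&lt;p&gt;For reference, a single state file is just that normalized snapshot serialized to JSON. Assuming the normalizer from Step 2, it looks roughly like this (values are illustrative):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;{  
  "keyword": "ai crm",  
  "geo": "US",  
  "timestamp": "2026-01-07T12:00:00Z",  
  "features": {  
    "ads": true,  
    "ads_count": 3,  
    "featured_snippet": true,  
    "featured_snippet_owner": "hubspot.com",  
    "video_carousel": false,  
    "people_also_ask": true  
  },  
  "organic_domains": ["hubspot.com", "salesforce.com", "zoho.com"]  
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
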

&lt;h2&gt;
  
  
  Step 4: The Producer Pipeline — Tying It All Together
&lt;/h2&gt;

&lt;p&gt;At this point we have all the individual pieces:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  SERP fetching&lt;/li&gt;
&lt;li&gt;  Snapshot normalization&lt;/li&gt;
&lt;li&gt;  Change detection&lt;/li&gt;
&lt;li&gt;  Kafka event emission&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Step 4 is where they come together into a single &lt;strong&gt;producer pipeline&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;This producer fetches the latest SERP → normalizes it → compares it to the previous snapshot → emits only the detected changes (the delta) to Kafka → persists state for the next run.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;"""  
/src/producer.py  

Kafka producer for SERP change events  
fetches SERP data, detects changes, and emits events to Kafka  
"""  

import time  
import json  
from typing import Dict, Any, List  
from kafka import KafkaProducer  
from kafka.errors import KafkaError  

from bright_data import BrightDataClient  
from normalizer import normalize_serp_data  
from state_manager import StateManager  
from change_detector import ChangeDetector  


class SERPProducer:  
    """  
    For a Kafka producer that monitors SERP changes and emits events to Kafka  
    """  

    def __init__(  
        self,  
        kafka_brokers: str = "localhost:9092",  
        state_dir: str = "state"  
    ):  
        self.producer = KafkaProducer(  
            bootstrap_servers=kafka_brokers,  
            # KafkaProducer expects callable serializers: JSON-encode values, UTF-8-encode keys  
            value_serializer=lambda v: json.dumps(v).encode('utf-8'),  
            key_serializer=lambda k: k.encode('utf-8') if k else None  
        )  
        self.bright_data = BrightDataClient()  
        self.state_manager = StateManager(state_dir)  
        self.change_detector = ChangeDetector()  

    def process_keyword(  
        self,  
        keyword: str,  
        geo: str = "US",  
        num_results: int = 10  
    ) -&amp;gt; List[Dict[str, Any]]:  
        """  
        process a single keyword: fetch, compare, emit changes  

        Args:  
            keyword: search keyword to monitor  
            geo: geographic location code  
            num_results: number of results to fetch  

        Returns:  
            list of change events emitted  
        """  
        print(f"Processing keyword: {keyword} (geo: {geo})")  

        # fetch current SERP  
        print("  Fetching current SERP...")  
        try:  
            raw_response = self.bright_data.search(  
                query=keyword,  
                num_results=num_results,  
                country=geo.lower()  
            )  
        except Exception as e:  
            print(f"  Error fetching SERP: {e}")  
            return []  

        # normalize current snapshot  
        current_snapshot = normalize_serp_data(raw_response, keyword, geo)  
        print(f"  Current snapshot: {len(current_snapshot['organic_domains'])} domains")  

        # load previous state  
        previous_snapshot = self.state_manager.load_previous_state(keyword, geo)  

        if previous_snapshot:  
            print(f"  Previous snapshot: {len(previous_snapshot['organic_domains'])} domains")  
        else:  
            print("  No previous state found (first run)")  

        # detect changes  
        change_events = self.change_detector.detect_changes(  
            current_snapshot,  
            previous_snapshot  
        )  

        print(f"  Detected {len(change_events)} change(s)")  

        # emit events to Kafka  
        emitted_events = []  
        for event in change_events:  
            try:  
                # use keyword+geo as key for partitioning  
                key = f"{keyword}:{geo}"  
                topic = "serp-changes"                  
                future = self.producer.send(topic, key=key, value=event)  
                # wait for confirmation (optional - can be async)  
                record_metadata = future.get(timeout=10)  

                print(f"    Emitted: {event['event_type']} -&amp;gt; partition {record_metadata.partition}, offset {record_metadata.offset}")  
                emitted_events.append(event)  
            except KafkaError as e:  
                print(f"    Error emitting event: {e}")  

        # save current state as new previous state  
        self.state_manager.save_state(keyword, geo, current_snapshot)  
        print("  State saved")  

        return emitted_events  

    def run_monitoring_loop(  
        self,  
        keywords: List[Dict[str, str]],  
        interval_seconds: int = 1800  # 30 minutes default  
    ):  
        """  
        run continuous monitoring loop  

        Args:  
            keywords: list of dicts with 'keyword' and 'geo' keys  
            interval_seconds: seconds between monitoring cycles  
        """  
        print(f"Starting monitoring loop (interval: {interval_seconds}s)")  
        print(f"Tracking {len(keywords)} keyword(s)")  

        try:  
            while True:  
                print("\\n" + "=" * 60)  
                print(f"Monitoring cycle at {time.strftime('%Y-%m-%d %H:%M:%S')}")  
                print("=" * 60)  

                for kw_config in keywords:  
                    keyword = kw_config['keyword']  
                    geo = kw_config.get('geo', 'US')  

                    try:  
                        self.process_keyword(keyword, geo)  
                        # small delay between keywords  
                        time.sleep(2)  
                    except Exception as e:  
                        print(f"Error processing {keyword}: {e}")  

                print(f"\\nSleeping for {interval_seconds} seconds...")  
                time.sleep(interval_seconds)  

        except KeyboardInterrupt:  
            print("\\nShutting down...")  
            self.producer.close()  


def main():  
    """example usage"""  
    producer = SERPProducer()  

    # example: monitor single keyword  
    keywords = [  
        {"keyword": "ai crm", "geo": "US"}  
    ]  

    # run once (for testing)  
    # producer.process_keyword("ai crm", "US")  

    # or run continuous loop  
    producer.run_monitoring_loop(keywords, interval_seconds=3600)  # 1 hour  


if __name__ == "__main__":  
    main()  


&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Notice the synchronous Kafka sends (&lt;code&gt;future.get(timeout=10)&lt;/code&gt;)? This is intentional. We’re not optimizing for throughput, we’re monitoring a handful of keywords every 30 minutes. I’d rather wait 10 seconds and know the event was written than fail silently and lose a critical alert.&lt;/p&gt;

&lt;p&gt;For a high-throughput system, you’d batch events and send asynchronously.&lt;/p&gt;
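
&lt;p&gt;For reference, here’s a minimal sketch of what that could look like with kafka-python: fire-and-forget sends with callbacks, plus one &lt;code&gt;flush()&lt;/code&gt; per cycle. (&lt;code&gt;emit_async&lt;/code&gt; is a hypothetical helper, not part of the producer above.)&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;    def emit_async(self, topic, key, events):  
        """send without blocking on each ack; rely on callbacks for errors"""  
        for event in events:  
            future = self.producer.send(topic, key=key, value=event)  
            future.add_callback(lambda md: print(f"    ok -&amp;gt; partition {md.partition}, offset {md.offset}"))  
            future.add_errback(lambda exc: print(f"    send failed: {exc}"))  
        # block once per batch instead of once per message  
        self.producer.flush()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
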

&lt;p&gt;Also, note the Kafka partitioning key we use: &lt;code&gt;key = f"{keyword}:{geo}"&lt;/code&gt;. All events for the same keyword + geo land in the same partition, and Kafka preserves order within a partition, &lt;strong&gt;so you’ll never see a “domain entered” event processed before the “featured snippet appeared” event from the same monitoring cycle&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Here’s how we use &lt;code&gt;run_monitoring_loop()&lt;/code&gt; to start the pipeline:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;producer = SERPProducer()  
keywords = [  
    {"keyword": "ai crm", "geo": "US"},  
    {"keyword": "best crm", "geo": "US"}  
]  
# run continuous loop (every 30 minutes)  
producer.run_monitoring_loop(keywords, interval_seconds=1800)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This pipeline runs continuously, emitting events whenever SERP composition changes. Next, we’ll see how multiple independent consumers process these events.&lt;/p&gt;

&lt;h2&gt;
  
  
  Step 5: The Consumers — Independent Event Processing
&lt;/h2&gt;

&lt;p&gt;This is where Kafka’s value becomes obvious. We have three completely independent consumers processing the same event stream. Each has its own offset, its own logic, and can evolve independently.&lt;/p&gt;

&lt;p&gt;But they all follow the same pattern (there’s a minimal skeleton of it right after this list):&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  Each consumer has a unique &lt;code&gt;group_id&lt;/code&gt; — this is how Kafka tracks offsets per consumer&lt;/li&gt;
&lt;li&gt;  &lt;code&gt;auto_offset_reset='earliest'&lt;/code&gt; — start from the beginning on first run (can replay history)&lt;/li&gt;
&lt;li&gt;  &lt;code&gt;enable_auto_commit=True&lt;/code&gt; — automatically commit offsets after processing&lt;/li&gt;
&lt;li&gt;  Each consumer processes events at its own pace — if one is slow, others aren’t affected&lt;/li&gt;
&lt;/ul&gt;
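
&lt;p&gt;Stripped of the consumer-specific logic, that shared boilerplate looks roughly like this (&lt;code&gt;my-consumer-group&lt;/code&gt; and &lt;code&gt;process_event()&lt;/code&gt; are placeholders):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import json  
from kafka import KafkaConsumer  

consumer = KafkaConsumer(  
    "serp-changes",  
    bootstrap_servers="localhost:9092",  
    value_deserializer=lambda m: json.loads(m.decode('utf-8')),  
    group_id="my-consumer-group",   # unique per consumer = independent offsets  
    auto_offset_reset='earliest',   # replay history on first run  
    enable_auto_commit=True         # offsets committed automatically  
)  

for message in consumer:  
    process_event(message.value)    # consumer-specific logic goes here
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
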

&lt;h3&gt;
  
  
  Consumer 1: SEO Alerts
&lt;/h3&gt;

&lt;p&gt;This consumer only cares about high-impact changes. Ads appearing? Featured snippets appearing? Those directly affect CTR. Everything else gets ignored.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;"""  
Consumer 1: SEO Alerting Service  
listens for SERP feature changes that impact SEO  
"""  

import json  
import sys  
import os  
from datetime import datetime  
from kafka import KafkaConsumer  

# add src to path  
sys.path.insert(0, os.path.join(os.path.dirname(__file__), '../src'))  


class SEOAlertConsumer:  
    """  
    consumer that alerts on SERP changes affecting SEO  
    focuses on ads and featured snippets appearing  
    """  

    def __init__(  
        self,  
        kafka_brokers: str = "localhost:9092",  
        topic: str = "serp-changes",  
        output_file: str = None  
    ):  
        self.topic = topic  
        self.consumer = KafkaConsumer(  
            topic,  
            bootstrap_servers=kafka_brokers,  
            value_deserializer=lambda m: json.loads(m.decode('utf-8')),  
            group_id='seo-alert-consumer',  
            auto_offset_reset='earliest',  # start from beginning  
            enable_auto_commit=True  
        )  
        # set output file path relative to script directory  
        if output_file is None:  
            script_dir = os.path.dirname(os.path.abspath(__file__))  
            output_file = os.path.join(script_dir, 'seo_alerts.txt')  
        self.output_file = output_file  

    def process_event(self, event: dict) -&amp;gt; None:  
        """process a single change event"""  
        event_type = event.get('event_type')  

        # only care about feature additions (ads, featured snippets)  
        if event_type == 'serp_feature_added':  
            feature = event.get('feature')  

            if feature in ['ads', 'featured_snippet']:  
                alert = self._create_alert(event, feature)  
                self._save_alert(alert)  
                self._print_alert(alert)  

    def _create_alert(self, event: dict, feature: str) -&amp;gt; dict:  
        """create alert message"""  
        keyword = event.get('keyword')  
        geo = event.get('geo')  

        if feature == 'ads':  
            count = event.get('count', 1)  
            message = f"Heads up: Google added {count} ad(s) for '{keyword}' in {geo}. Expect organic CTR drop."  
        elif feature == 'featured_snippet':  
            owner = event.get('owner_domain', 'unknown')  
            message = f"Heads up: Google added a featured snippet for '{keyword}' in {geo} (owned by {owner}). Expect organic CTR drop."  
        else:  
            message = f"SERP feature '{feature}' appeared for '{keyword}' in {geo}"  

        return {  
            'timestamp': event.get('timestamp'),  
            'keyword': keyword,  
            'geo': geo,  
            'feature': feature,  
            'message': message,  
            'event': event  
        }  

    def _save_alert(self, alert: dict) -&amp;gt; None:  
        """save alert to file"""  
        try:  
            with open(self.output_file, 'a', encoding='utf-8') as f:  
                f.write(json.dumps(alert) + '\n')  
        except Exception as e:  
            print(f"Error saving alert: {e}")  

    def _print_alert(self, alert: dict) -&amp;gt; None:  
        """print alert to console"""  
        print("\\n" + "=" * 60)  
        print("SEO ALERT")  
        print("=" * 60)  
        print(f"Time: {alert['timestamp']}")  
        print(f"Keyword: {alert['keyword']} ({alert['geo']})")  
        print(f"Feature: {alert['feature']}")  
        print(f"Message: {alert['message']}")  
        print("=" * 60)  

    def run(self):  
        """main consumer loop"""  
        print("SEO Alert Consumer started")  
        print(f"Listening to topic: {self.topic}")  
        print(f"Alerts will be saved to: {self.output_file}")  
        print("\\nWaiting for events...\\n")  

        try:  
            for message in self.consumer:  
                event = message.value  
                self.process_event(event)  
        except KeyboardInterrupt:  
            print("\\nShutting down...")  
        finally:  
            self.consumer.close()  


def main():  
    consumer = SEOAlertConsumer()  
    consumer.run()  


if __name__ == "__main__":  
    main()  

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This consumer simply filters for &lt;code&gt;serp_feature_added&lt;/code&gt; events, and only processes &lt;code&gt;ads&lt;/code&gt; and &lt;code&gt;featured_snippet&lt;/code&gt; features. It creates human-readable alerts and saves them to a TXT file.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;{  
  "timestamp": "2026-01-07T12:00:00Z",  
  "keyword": "ai crm",  
  "geo": "US",  
  "feature": "ads",  
  "message": "Heads up: Google added 3 ad(s) for 'ai crm' in US. Expect organic CTR drop.",  
  "event": { /* full event data */ }  
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Consumer 2: Competitive Intelligence
&lt;/h3&gt;

&lt;p&gt;This consumer tracks competitors. You can configure a list of domains to watch (hubspot.com, salesforce.com, etc.), and it logs when they enter or exit the SERP, or when they gain/lose the featured snippet.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;"""  
Consumer 2: Competitive Intelligence Service  
tracks competitor movements in SERP  
"""  

import json  
import sys  
import os  
from kafka import KafkaConsumer  

# add src to path  
sys.path.insert(0, os.path.join(os.path.dirname(__file__), '../src'))  


class CompetitiveIntelConsumer:  
    """  
    consumer that tracks competitive movements  
    focuses on featured snippet ownership and domain entries/exits  
    """  

    def __init__(  
        self,  
        kafka_brokers: str = "localhost:9092",  
        topic: str = "serp-changes",  
        output_file: str = "competitive_intel.txt",  
        tracked_domains: list = None  
    ):  
        self.topic = topic  
        self.consumer = KafkaConsumer(  
            topic,  
            bootstrap_servers=kafka_brokers,  
            value_deserializer=lambda m: json.loads(m.decode('utf-8')),  
            group_id='competitive-intel-consumer',  
            auto_offset_reset='earliest',  
            enable_auto_commit=True  
        )  
        self.output_file = output_file  
        self.tracked_domains = set(tracked_domains or [])  
        # track featured snippet ownership  
        self.featured_snippet_owners = {}  # keyword+geo -&amp;gt; domain  

    def process_event(self, event: dict) -&amp;gt; None:  
        """process a single change event"""  
        event_type = event.get('event_type')  
        keyword = event.get('keyword')  
        geo = event.get('geo')  
        key = f"{keyword}:{geo}"  

        # track featured snippet ownership changes  
        if event_type == 'serp_feature_added' and event.get('feature') == 'featured_snippet':  
            owner = event.get('owner_domain')  
            if owner:  
                previous_owner = self.featured_snippet_owners.get(key)  
                if previous_owner != owner:  
                    self._log_featured_snippet_change(keyword, geo, previous_owner, owner, event)  
                    self.featured_snippet_owners[key] = owner  

        # track domain entries/exits  
        if event_type == 'organic_entry':  
            domain = event.get('domain')  
            if self._should_track(domain):  
                self._log_domain_entry(keyword, geo, domain, event)  

        elif event_type == 'organic_exit':  
            domain = event.get('domain')  
            if self._should_track(domain):  
                self._log_domain_exit(keyword, geo, domain, event)  

    def _should_track(self, domain: str) -&amp;gt; bool:  
        """determine if domain should be tracked"""  
        if not self.tracked_domains:  
            return True  # track all if none specified  
        return domain in self.tracked_domains  

    def _log_featured_snippet_change(  
        self,  
        keyword: str,  
        geo: str,  
        previous_owner: str,  
        new_owner: str,  
        event: dict  
    ) -&amp;gt; None:  
        """log featured snippet ownership change"""  
        if previous_owner:  
            message = f"Featured snippet ownership changed: {previous_owner} -&amp;gt; {new_owner}"  
        else:  
            message = f"{new_owner} gained SERP ownership via featured snippet"  

        intel = {  
            'timestamp': event.get('timestamp'),  
            'keyword': keyword,  
            'geo': geo,  
            'type': 'featured_snippet_ownership',  
            'previous_owner': previous_owner,  
            'new_owner': new_owner,  
            'message': message  
        }  

        self._save_intel(intel)  
        self._print_intel(intel)  

    def _log_domain_entry(  
        self,  
        keyword: str,  
        geo: str,  
        domain: str,  
        event: dict  
    ) -&amp;gt; None:  
        """log domain entry into SERP"""  
        position = event.get('current_position')  
        intel = {  
            'timestamp': event.get('timestamp'),  
            'keyword': keyword,  
            'geo': geo,  
            'type': 'domain_entry',  
            'domain': domain,  
            'position': position,  
            'message': f"{domain} entered SERP at position {position}"  
        }  

        self._save_intel(intel)  
        self._print_intel(intel)  

    def _log_domain_exit(  
        self,  
        keyword: str,  
        geo: str,  
        domain: str,  
        event: dict  
    ) -&amp;gt; None:  
        """log domain exit from SERP"""  
        previous_position = event.get('previous_position')  
        intel = {  
            'timestamp': event.get('timestamp'),  
            'keyword': keyword,  
            'geo': geo,  
            'type': 'domain_exit',  
            'domain': domain,  
            'previous_position': previous_position,  
            'message': f"{domain} dropped out of SERP (was at position {previous_position})"  
        }  

        self._save_intel(intel)  
        self._print_intel(intel)  

    def _save_intel(self, intel: dict) -&amp;gt; None:  
        """save intelligence to file"""  
        try:  
            with open(self.output_file, 'a', encoding='utf-8') as f:  
                f.write(json.dumps(intel) + '\n')  
        except Exception as e:  
            print(f"Error saving intel: {e}")  

    def _print_intel(self, intel: dict) -&amp;gt; None:  
        """print intelligence to console"""  
        print("\\n" + "=" * 60)  
        print("COMPETITIVE INTELLIGENCE")  
        print("=" * 60)  
        print(f"Time: {intel['timestamp']}")  
        print(f"Keyword: {intel['keyword']} ({intel['geo']})")  
        print(f"Type: {intel['type']}")  
        print(f"Message: {intel['message']}")  
        print("=" * 60)  

    def run(self):  
        """main consumer loop"""  
        print("Competitive Intelligence Consumer started")  
        print(f"Listening to topic: {self.topic}")  
        print(f"Intelligence will be saved to: {self.output_file}")  
        if self.tracked_domains:  
            print(f"Tracking domains: {', '.join(self.tracked_domains)}")  
        print("\\nWaiting for events...\\n")  

        try:  
            for message in self.consumer:  
                event = message.value  
                self.process_event(event)  
        except KeyboardInterrupt:  
            print("\\nShutting down...")  
        finally:  
            self.consumer.close()  


def main():  
    # example: track specific competitors  
    tracked_domains = ['hubspot.com', 'salesforce.com', 'zoho.com']  

    consumer = CompetitiveIntelConsumer(tracked_domains=tracked_domains)  
    consumer.run()  


if __name__ == "__main__":  
    main()  


&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This maintains an in-memory map — &lt;code&gt;featured_snippet_owners&lt;/code&gt; — to detect ownership changes. Beyond that, it can track either every domain or just a configured list of competitors; it logs when featured snippet ownership changes hands (HubSpot → Salesforce), and when tracked competitors enter or exit the SERP.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;{  
  "timestamp": "2026-01-07T12:00:00Z",  
  "keyword": "ai crm",  
  "geo": "US",  
  "type": "featured_snippet_ownership",  
  "previous_owner": "hubspot.com",  
  "new_owner": "salesforce.com",  
  "message": "Featured snippet ownership changed: hubspot.com -&amp;gt; salesforce.com"  
},  
{  
  "timestamp": "2026-01-07T12:00:00Z",  
  "keyword": "ai crm",  
  "geo": "US",  
  "type": "domain_entry",  
  "domain": "zoho.com",  
  "position": 5,  
  "message": "zoho.com entered SERP at position 5"  
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Of course, this can be extended to write to a database, generate reports, or trigger alerts.&lt;/p&gt;
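
&lt;p&gt;As one example, here’s a minimal sketch of swapping the text-file sink for SQLite. The subclass, table schema, and &lt;code&gt;db_path&lt;/code&gt; are illustrative, not part of the consumer above:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import json  
import sqlite3  


class SQLiteIntelConsumer(CompetitiveIntelConsumer):  
    """same consumer, but persists intel rows to a local SQLite database"""  

    def __init__(self, db_path: str = "competitive_intel.db", **kwargs):  
        super().__init__(**kwargs)  
        self.conn = sqlite3.connect(db_path)  
        self.conn.execute(  
            "CREATE TABLE IF NOT EXISTS intel ("  
            "timestamp TEXT, keyword TEXT, geo TEXT, type TEXT, payload TEXT)"  
        )  

    def _save_intel(self, intel: dict) -&amp;gt; None:  
        self.conn.execute(  
            "INSERT INTO intel VALUES (?, ?, ?, ?, ?)",  
            (intel.get('timestamp'), intel.get('keyword'), intel.get('geo'),  
             intel.get('type'), json.dumps(intel))  
        )  
        self.conn.commit()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
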

&lt;h3&gt;
  
  
  Consumer 3: Volatility Analyzer
&lt;/h3&gt;

&lt;p&gt;This consumer aggregates events over a time window and calculates a volatility score. More changes = higher volatility.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;
"""  
Consumer 3: Historical Volatility Analyzer  
aggregates SERP changes over time to calculate volatility scores  
"""  

import json  
import sys  
import os  
from datetime import datetime, timedelta  
from collections import defaultdict  
from kafka import KafkaConsumer  

# add src to path  
sys.path.insert(0, os.path.join(os.path.dirname(__file__), '../src'))  


class VolatilityAnalyzerConsumer:  
    """  
    consumer that calculates SERP volatility scores  
    tracks change frequency over time windows  
    """  

    def __init__(  
        self,  
        kafka_brokers: str = "localhost:9092",  
        topic: str = "serp-changes",  
        output_file: str = "volatility_scores.txt",  
        window_days: int = 7  
    ):  
        self.topic = topic  
        self.consumer = KafkaConsumer(  
            topic,  
            bootstrap_servers=kafka_brokers,  
            value_deserializer=lambda m: json.loads(m.decode('utf-8')),  
            group_id='volatility-analyzer-consumer',  
            auto_offset_reset='earliest',  
            enable_auto_commit=True  
        )  
        self.output_file = output_file  
        self.window_days = window_days  
        # track events per keyword+geo  
        self.events_by_keyword = defaultdict(list)  # (keyword, geo) -&amp;gt; [events]  

    def process_event(self, event: dict) -&amp;gt; None:  
        """process a single change event"""  
        keyword = event.get('keyword')  
        geo = event.get('geo')  
        key = (keyword, geo)  

        # store event with timestamp  
        self.events_by_keyword[key].append({  
            'event': event,  
            'timestamp': event.get('timestamp')  
        })  

        # calculate volatility for this keyword  
        volatility = self._calculate_volatility(key)  

        if volatility is not None:  
            score_data = {  
                'keyword': keyword,  
                'geo': geo,  
                'volatility_score': volatility,  
                'window_days': self.window_days,  
                'change_count': len(self.events_by_keyword[key]),  
                'timestamp': datetime.utcnow().strftime("%Y-%m-%dT%H:%M:%SZ")  
            }  

            self._save_score(score_data)  
            self._print_score(score_data)  

    def _calculate_volatility(self, key: tuple) -&amp;gt; float:  
        """  
        calculate volatility score (0.0 to 1.0)  
        higher score = more changes = more volatile  
        """  
        events = self.events_by_keyword[key]  

        if len(events) &amp;lt; 2:  
            return None  # need at least 2 events  

        # filter events within time window  
        cutoff_time = datetime.utcnow() - timedelta(days=self.window_days)  

        recent_events = [  
            e for e in events  
            if datetime.fromisoformat(e['timestamp'].replace('Z', '+00:00')).replace(tzinfo=None) &amp;gt; cutoff_time  
        ]  

        if len(recent_events) &amp;lt; 2:  
            return None  

        # simple volatility: change count normalized by time window  
        # more sophisticated: could weight by change type, magnitude, etc.  
        change_count = len(recent_events)  

        # normalize against a rough ceiling of ~10 changes per day,  
        # so hitting that ceiling across the full window maps to a 1.0 score  
        max_changes_per_day = 10  
        days_in_window = self.window_days  
        max_reasonable_changes = max_changes_per_day * days_in_window  

        volatility = min(change_count / max_reasonable_changes, 1.0)  

        return round(volatility, 2)  

    def _save_score(self, score_data: dict) -&amp;gt; None:  
        """save volatility score to file"""  
        try:  
            with open(self.output_file, 'a', encoding='utf-8') as f:  
                f.write(json.dumps(score_data) + '\n')  
        except Exception as e:  
            print(f"Error saving score: {e}")  

    def _print_score(self, score_data: dict) -&amp;gt; None:  
        """print volatility score to console"""  
        print("\\n" + "=" * 60)  
        print("VOLATILITY ANALYSIS")  
        print("=" * 60)  
        print(f"Keyword: {score_data['keyword']} ({score_data['geo']})")  
        print(f"Volatility Score: {score_data['volatility_score']:.2f}")  
        print(f"Window: {score_data['window_days']} days")  
        print(f"Change Count: {score_data['change_count']}")  
        print(f"Timestamp: {score_data['timestamp']}")  
        print("=" * 60)  

    def run(self):  
        """main consumer loop"""  
        print("Volatility Analyzer Consumer started")  
        print(f"Listening to topic: {self.topic}")  
        print(f"Scores will be saved to: {self.output_file}")  
        print(f"Analysis window: {self.window_days} days")  
        print("\\nWaiting for events...\\n")  

        try:  
            for message in self.consumer:  
                event = message.value  
                self.process_event(event)  
        except KeyboardInterrupt:  
            print("\\nShutting down...")  
        finally:  
            self.consumer.close()  


def main():  
    consumer = VolatilityAnalyzerConsumer(window_days=7)  
    consumer.run()  


if __name__ == "__main__":  
    main()  


&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Here’s how I’m defining and calculating volatility, capped at 1.0 (i.e. 100%):&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;volatility = min(change_count / (max_changes_per_day * window_days), 1.0)&lt;/p&gt;
&lt;/blockquote&gt;
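
&lt;p&gt;Plugging in the defaults (a ceiling of 10 changes/day over a 7-day window), 25 observed changes works out like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;max_reasonable_changes = 10 * 7      # 70  
volatility = min(25 / 70, 1.0)       # 0.357...  
round(volatility, 2)                 # 0.36
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
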

&lt;p&gt;And this’ll log volatility scores like this.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;{  
  "keyword": "ai crm",  
  "geo": "US",  
  "volatility_score": 0.35,  
  "window_days": 7,  
  "change_count": 25,  
  "timestamp": "2026-01-07T12:00:00Z"  
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This says: &lt;strong&gt;25 changes in 7 days ≈ 0.36 volatility (moderately volatile)&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Is this the best volatility metric?&lt;/strong&gt; Probably not. But it’s a starting point. You could weight by change type (featured snippet changes are more significant than domain entries), factor in change magnitude (position shifts vs entries/exits), or normalize by keyword search volume.&lt;/p&gt;
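
&lt;p&gt;As one illustration (not what the consumer above does), a type-weighted variant could look roughly like this; the weights are made up:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# hypothetical per-event-type weights; tune to taste  
EVENT_WEIGHTS = {  
    "serp_feature_added": 2.0,     # structural changes matter most  
    "serp_feature_removed": 2.0,  
    "organic_entry": 1.0,  
    "organic_exit": 1.0  
}  

def weighted_volatility(recent_events, window_days, max_weight_per_day=10.0):  
    """sum event weights instead of counting events, then normalize to 0..1"""  
    total_weight = sum(  
        EVENT_WEIGHTS.get(e['event']['event_type'], 1.0) for e in recent_events  
    )  
    return round(min(total_weight / (max_weight_per_day * window_days), 1.0), 2)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
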

&lt;p&gt;The point is this consumer can evolve independently. You can improve the volatility algorithm without touching the producer or other consumers.&lt;/p&gt;

&lt;p&gt;Each consumer runs independently:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Terminal 1: SEO Alerts  
python consumers/seo_alert_consumer.py  
# Terminal 2: Competitive Intelligence  
python consumers/competitive_intel_consumer.py  
# Terminal 3: Volatility Analyzer  
python consumers/volatility_analyzer_consumer.py
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;All three consumers read from the same Kafka topic (&lt;code&gt;serp-changes&lt;/code&gt;), but each one:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  maintains its own offset, so it can process at its own speed&lt;/li&gt;
&lt;li&gt;  keeps running even if another consumer crashes&lt;/li&gt;
&lt;li&gt;  writes to its own output file/log&lt;/li&gt;
&lt;li&gt;  resumes from its last committed offset if you restart it&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  What’s The Payoff?
&lt;/h2&gt;

&lt;p&gt;When you put all three consumers together, you stop asking “what rank are we today?” and start answering:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  How competitive is this market &lt;em&gt;right now&lt;/em&gt;?&lt;/li&gt;
&lt;li&gt;  Is Google monetizing [insert query here] more aggressively?&lt;/li&gt;
&lt;li&gt;  Are our competitors being displaced or reinforced?&lt;/li&gt;
&lt;li&gt;  Is this SERP worth investing in, or is it too volatile?&lt;/li&gt;
&lt;li&gt;  Are competitors winning because of strategy, or because of SERP structure?&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;See the shift? You’re no longer tracking &lt;em&gt;positions&lt;/em&gt;. You’re observing &lt;strong&gt;market dynamics.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;And because this is event-driven, every insight is replayable, every consumer can evolve independently, and every new question you want answered is just another consumer you have to write.&lt;/p&gt;

</description>
      <category>webdev</category>
      <category>programming</category>
      <category>python</category>
      <category>kafka</category>
    </item>
    <item>
      <title>I Investigated the Top 3 AI-Generated Artists Going Viral on Spotify. Here’s Who They Are Imitating.</title>
      <dc:creator>Prithwish Nath</dc:creator>
      <pubDate>Wed, 21 Jan 2026 16:19:22 +0000</pubDate>
      <link>https://dev.to/prithwish_nath/i-investigated-the-top-3-ai-generated-artists-going-viral-on-spotify-heres-who-they-are-imitating-1c95</link>
      <guid>https://dev.to/prithwish_nath/i-investigated-the-top-3-ai-generated-artists-going-viral-on-spotify-heres-who-they-are-imitating-1c95</guid>
      <description>&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fmiro.medium.com%2Fv2%2Fresize%3Afit%3A700%2F1%2A2KzhYxJdC-50_4U2hXSfTw.png" class="article-body-image-wrapper"&gt;&lt;img alt="o.g. card with title text" src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fmiro.medium.com%2Fv2%2Fresize%3Afit%3A700%2F1%2A2KzhYxJdC-50_4U2hXSfTw.png" width="700" height="497"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Who had artists &lt;em&gt;entirely&lt;/em&gt; generated by AI going viral on Spotify on their bingo card for 2025? Yeah, me neither. When I say “generated”, I’m talking &lt;em&gt;everything&lt;/em&gt; from band promo shots, to members’ backgrounds, &lt;em&gt;and&lt;/em&gt; instruments and vocals.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;em&gt;The Velvet Sundown&lt;/em&gt;&lt;/strong&gt; went viral recently, as did &lt;strong&gt;&lt;em&gt;Aventhis&lt;/em&gt;&lt;/strong&gt; — who has over a million monthly listeners&lt;strong&gt;.&lt;/strong&gt; Heck, a song by the AI-created act &lt;strong&gt;&lt;em&gt;Breaking Rust&lt;/em&gt;&lt;/strong&gt; &lt;a href="https://www.euronews.com/culture/2025/11/14/breaking-rust-ai-artist-tops-us-chart-for-first-time-as-study-reveals-alarming-recognition" rel="noopener noreferrer"&gt;was topping the U.S. &lt;em&gt;Billboard Country Digital Song Sales&lt;/em&gt; chart&lt;/a&gt; in November this year. &lt;strong&gt;&lt;em&gt;Breaking Rust&lt;/em&gt;&lt;/strong&gt; has a whopping 2.5 million monthly listeners on Spotify.&lt;/p&gt;

&lt;p&gt;Needless to say, this makes for a &lt;em&gt;very&lt;/em&gt; lucrative business model. 👀&lt;/p&gt;

&lt;p&gt;&lt;a href="https://www.revisionsmusic.com/how-much-does-spotify-pay-per-stream-in-2025/" rel="noopener noreferrer"&gt;In 2025, Spotify pays artists roughly &lt;strong&gt;$0.003–$0.005 per stream&lt;/strong&gt;&lt;/a&gt;, meaning &lt;strong&gt;1 million streams earns about $3,000–$5,000&lt;/strong&gt;, with most of that flowing to rights holders….but consider that GenAI-for-music tools like &lt;em&gt;Suno&lt;/em&gt; and &lt;em&gt;Udio&lt;/em&gt; &lt;strong&gt;cost under $10/month, or $30–$50/month for commercial rights&lt;/strong&gt;, and that you’d have zero traditional expenses like studios, bands, or touring. In that context, a single AI-generated track hitting ~&lt;strong&gt;10 million streams could gross $30,000–$50,000&lt;/strong&gt;, minus trivial distributor fees — an &lt;em&gt;insanely&lt;/em&gt; low-cost, high-scale setup.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fmiro.medium.com%2Fv2%2Fresize%3Afit%3A700%2F1%2ANheCN2M53SLoy00ir-AxwA.png" class="article-body-image-wrapper"&gt;&lt;img alt="a spotify screenshot of one of the A.I. artists" src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fmiro.medium.com%2Fv2%2Fresize%3Afit%3A700%2F1%2ANheCN2M53SLoy00ir-AxwA.png" width="700" height="543"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;And that’s &lt;em&gt;just&lt;/em&gt; Spotify.&lt;/p&gt;

&lt;p&gt;So this piqued my curiosity. Sure, perhaps GenAI for music is more accessible than ever — but creating &lt;strong&gt;AI music that not only sounds polished but also &lt;em&gt;earns&lt;/em&gt; plays, likes, and real listener engagement?&lt;/strong&gt; That’s no easy task. For every AI-driven success story, there are thousands of boring Suno-created “lofi” playlists sitting at paltry 25–50 views on Spotify or YouTube.&lt;/p&gt;

&lt;p&gt;What, then, are the standout AI artists — more accurately, the people behind them — doing &lt;em&gt;differently&lt;/em&gt;? &lt;strong&gt;Which real artists do they want to sound like?&lt;/strong&gt; Can we find out, acoustically speaking, how these algorithmic creations compare to the history of human-made music?&lt;/p&gt;

&lt;p&gt;To explore that, I focused on three of the most prominent AI artists — &lt;strong&gt;Velvet Sundown, Breaking Rust,&lt;/strong&gt; and &lt;strong&gt;Aventhis&lt;/strong&gt; — collecting 30-second Spotify previews + lyrics (via a Bright Data proxy once I started getting HTTP 429's), breaking down their acoustic “fingerprint” using OpenL3, and comparing them against Spotify’s Top 2000 songs (up to 2020) + Spotify’s top 10 for each of their official (non-hidden) genres (obtained via Volt.fm) to see where their sound might sit in the broader musical landscape.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;a href="https://github.com/marl/openl3" rel="noopener noreferrer"&gt;GitHub - marl/openl3: OpenL3: Open-source deep audio and image embeddings&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://get.brightdata.com/bd-residential-proxies?utm_content=i_investigated_the_top_three_ai_generated_artists_going_viral_on_spotify" rel="noopener noreferrer"&gt;Residential Proxies Trusted by Fortune 500 Companies - Free Trial&lt;/a&gt;&lt;/p&gt;
&lt;/blockquote&gt;
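
&lt;p&gt;For context, here’s a minimal sketch of the embedding + similarity step described above, using OpenL3 and cosine similarity. The file names are placeholders, and the real pipeline batches this over every preview and every reference track:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import numpy as np  
import openl3  
import soundfile as sf  

def track_embedding(path: str) -&amp;gt; np.ndarray:  
    """load a 30s preview and average OpenL3 frame embeddings into one vector"""  
    audio, sr = sf.read(path)  
    emb, _ = openl3.get_audio_embedding(audio, sr, content_type="music", embedding_size=512)  
    return emb.mean(axis=0)  

def cosine_similarity(a: np.ndarray, b: np.ndarray) -&amp;gt; float:  
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))  

ai_vec = track_embedding("ai_track_preview.wav")  
ref_vec = track_embedding("reference_track_preview.wav")  
print(round(cosine_similarity(ai_vec, ref_vec), 3))   # e.g. ~0.68 for a close match
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
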

&lt;p&gt;Here’s what I found. Enjoy!&lt;/p&gt;

&lt;h2&gt;
  
  
  AI Music Isn’t Doing Anything New — It’s Banking on What Already Works.
&lt;/h2&gt;

&lt;blockquote&gt;
&lt;p&gt;First of all, I should mention that all of these artists sound….&lt;em&gt;perfect.&lt;/em&gt; There is no obvious artifacting, vocals being off key, instruments tuned incorrectly or cutting in/out abruptly. Whatever software they’re using to generate instruments and vocals — it’s working just fine. But this study is about far more than simple &lt;em&gt;correctness&lt;/em&gt;.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;You’d think that artists generated programmatically — yes, I know how AI works; I’m generalizing — would gravitate toward tightly defined, algorithmic genres (EDM substyles, algorithm-friendly lo-fi, etc.), or swing the other way entirely and lean into something aggressively experimental. After all, if you have an AI model at your fingertips and &lt;em&gt;don’t&lt;/em&gt; try to push boundaries, what are you even doing? 😅&lt;/p&gt;

&lt;p&gt;But that’s not what shows up in the data. Rather than sounding novel or genre-breaking, the most successful AI artists consistently map onto a very familiar musical space: &lt;strong&gt;melody-forward, mid-tempo, harmony-rich songs&lt;/strong&gt; that sit comfortably alongside decades of mainstream, human-made music, and are guaranteed to get heavy airplay.&lt;/p&gt;

&lt;p&gt;Here’s what scored a ~0.685 similarity according to my code.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;🎵 &lt;a href="https://open.spotify.com/track/7H1axjOTOwlCx4XmtnpGM4?si=GurdfleUSPK-UVdw65k42w" rel="noopener noreferrer"&gt;AI artist, 2025&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;🎵 &lt;a href="https://open.spotify.com/track/05oETzWbd4SI33qK2gbJfR" rel="noopener noreferrer"&gt;Human artist, 1976&lt;/a&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;In fact, let’s start with &lt;strong&gt;The Velvet Sundown.&lt;/strong&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  AI Artist #1 — The Velvet Sundown
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fmiro.medium.com%2Fv2%2Fresize%3Afit%3A700%2F0%2Ah2dlqVl6Jmb-2cBw.jpg" class="article-body-image-wrapper"&gt;&lt;img alt="promo shot of the velvet sundown. it is clearly A.I. generated" title="Image credit: The Velvet Sundown" src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fmiro.medium.com%2Fv2%2Fresize%3Afit%3A700%2F0%2Ah2dlqVl6Jmb-2cBw.jpg" width="700" height="394"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Sonically, they’re going hard for a warm, soft-focus, 1970s-era rock aesthetic — and the data backs that up almost &lt;em&gt;embarrassingly&lt;/em&gt; well. Across 39 query tracks, the closest real-world matches are dominated by &lt;strong&gt;The Beatles&lt;/strong&gt; and &lt;strong&gt;Fleetwood Mac&lt;/strong&gt;, which land as the top two most similar artists overall. The Beatles take the top spot with a final similarity score of &lt;strong&gt;~0.68&lt;/strong&gt;, appearing in &lt;strong&gt;37 out of 39&lt;/strong&gt; query songs, while Fleetwood Mac follows closely behind at &lt;strong&gt;~0.66&lt;/strong&gt;, matching &lt;strong&gt;38 out of 39&lt;/strong&gt; queries. These aren’t marginal hits either: tracks like &lt;em&gt;Here Comes the Sun&lt;/em&gt;, &lt;em&gt;In My Life&lt;/em&gt;, &lt;em&gt;Landslide&lt;/em&gt;, and &lt;em&gt;Rhiannon&lt;/em&gt; repeatedly surface as high-similarity neighbors.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fmiro.medium.com%2Fv2%2Fresize%3Afit%3A700%2F1%2Aa5Fuwh5h68UJA2VBUF9zuQ.png" class="article-body-image-wrapper"&gt;&lt;img alt="a similarity score histogram for the velvet sundown. it shows they sound extremely similar to the beatles, fleetwood mac, bee gees, and the eagles" src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fmiro.medium.com%2Fv2%2Fresize%3Afit%3A700%2F1%2Aa5Fuwh5h68UJA2VBUF9zuQ.png" width="700" height="476"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Similarity Score (X-axis):&lt;/strong&gt; Cosine similarity between each AI artist track and historical tracks from the artist. Scores range from 0 to 1, where higher values indicate greater similarity. &lt;strong&gt;Frequency (Y-axis):&lt;/strong&gt; The number of track comparisons that fall into each similarity score range. Higher bars mean more tracks matched at that similarity level. A distribution skewed toward higher scores indicates consistent similarity across many tracks.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;What’s interesting is that this alignment matches subjective listening almost perfectly. While their harmonies, pacing, and production texture feel straight out of the mid-70s “warm, analog” playbook (&lt;strong&gt;Fleetwood Mac&lt;/strong&gt;, &lt;strong&gt;The Eagles&lt;/strong&gt;), &lt;strong&gt;The Velvet Sundown’s&lt;/strong&gt; vocals and melodic swells sound exactly like late-period &lt;strong&gt;Beatles&lt;/strong&gt; balladry.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;🎵 &lt;a href="https://open.spotify.com/track/6fKYDxOtmHCd48k1BOqTPj?si=FRxW43CgQjWehJr5xI6VTg" rel="noopener noreferrer"&gt;AI artist, 2025&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;🎵 &lt;a href="https://open.spotify.com/track/65vdMBskhx3akkG9vQlSH1?si=G2DI4obDSoakVIh14pJQpQ" rel="noopener noreferrer"&gt;Human artist, 1964&lt;/a&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Like I said, whatever model + prompt they’re using isn’t inventing a new genre — it’s triangulating toward one of the safest, most historically successful regions of the musical embedding space.&lt;/p&gt;

&lt;p&gt;And it doesn’t stop there. Their next tier of similar artists includes &lt;strong&gt;The Eagles, Bee Gees,&lt;/strong&gt; going on to &lt;strong&gt;Queen, Dire Straits, Neil Young&lt;/strong&gt;, and even modern successors like &lt;strong&gt;Coldplay&lt;/strong&gt; and &lt;strong&gt;John Mayer&lt;/strong&gt; — all artists known for emotionally legible songwriting, strong melodic hooks, and broad mainstream appeal.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Jesus&lt;/em&gt;. &lt;strong&gt;The Velvet Sundown is basically a statistically optimal blend of classic rock lineage, factory-made for familiarity. 😅&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;“Statistically optimal” is a great way to describe these artists, actually. None of them do anything wild, original, or creative. They’re AI-generated comfort food for your ears.&lt;/p&gt;

&lt;h2&gt;
  
  
  AI Artist #2 — Aventhis
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fmiro.medium.com%2Fv2%2Fresize%3Afit%3A686%2F0%2AsRvpOhUaC6pNmZBb" class="article-body-image-wrapper"&gt;&lt;img alt="promo shot of Aventhis. it is clearly A.I. generated" title="Image credit: Aventhis (YouTube channel)" src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fmiro.medium.com%2Fv2%2Fresize%3Afit%3A686%2F0%2AsRvpOhUaC6pNmZBb" width="686" height="386"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;If &lt;em&gt;The Velvet Sundown&lt;/em&gt; felt like a near-perfect acoustic cosplay of a specific era and band, &lt;strong&gt;Aventhis is more diffuse&lt;/strong&gt;. The model has no trouble identifying &lt;em&gt;where&lt;/em&gt; Aventhis lives — firmly within modern country and Americana — but it’s noticeably less decisive about &lt;em&gt;who&lt;/em&gt; they’re trying to sound like.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fmiro.medium.com%2Fv2%2Fresize%3Afit%3A700%2F1%2Aa5I1RThjexTMSwvX6and_Q.png" class="article-body-image-wrapper"&gt;&lt;img alt="a similarity score histogram for Aventhis. it shows they sound extremely similar to luke combs, chris stapleton, and queen" src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fmiro.medium.com%2Fv2%2Fresize%3Afit%3A700%2F1%2Aa5I1RThjexTMSwvX6and_Q.png" width="700" height="457"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;On paper, the genre match is strong. Across all songs &lt;strong&gt;Aventhis&lt;/strong&gt; has on Spotify, they consistently map to contemporary country artists like &lt;strong&gt;Luke Combs, Morgan Wallen,&lt;/strong&gt; and &lt;strong&gt;Chris Stapleton&lt;/strong&gt;, all of whom appear in the top tier of results with solid similarity scores and high cross-query consistency.&lt;/p&gt;

&lt;p&gt;So the system is correctly picking up on the hallmarks of modern country production: mid-tempo arrangements, prominent acoustic guitars, restrained percussion, and vocal-first mixes.&lt;/p&gt;

&lt;p&gt;Where things get interesting is that &lt;strong&gt;no single artist fully dominates the similarity space&lt;/strong&gt; the way Fleetwood Mac and The Beatles did for Velvet Sundown. Artists like &lt;strong&gt;Creedence Clearwater Revival, Queen&lt;/strong&gt;, and even &lt;strong&gt;Muse&lt;/strong&gt; creep into the top results — not because Aventhis sounds like them stylistically, but because some of these representative songs share broad acoustic properties: strong melodic arcs, clear harmonic structure, and emotionally legible song forms.&lt;/p&gt;

&lt;p&gt;My theory is that the &lt;strong&gt;Aventhis&lt;/strong&gt; project feels more like someone playing around with the software, and those early “feeling out” era tracks skew the analysis. Their more recent songs crystallize fairly well.&lt;/p&gt;

&lt;p&gt;Regardless of era, &lt;strong&gt;Chris Stapleton&lt;/strong&gt; shows up as the clearest &lt;em&gt;vocal&lt;/em&gt; anchor — which aligns well with Aventhis’ branding as an “outlaw country” act.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;🎵 &lt;a href="https://open.spotify.com/track/7pf41qNwCodhG8yCUVhsMU?si=PwTXIHRURjCZQxgFH3iDzw" rel="noopener noreferrer"&gt;AI artist, 2025&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;🎵 &lt;a href="https://open.spotify.com/track/5jROdl6MhcmP3O7h2sVgtw?si=E71KHe78RxSAxlTdHsWJ2g" rel="noopener noreferrer"&gt;Human artist, 2025&lt;/a&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Stapleton’s influence appears more in timbre: the gravel, the sustained notes, the emotional weight carried by the voice rather than the arrangement.&lt;/p&gt;

&lt;h2&gt;
  
  
  AI Artist #3 — Breaking Rust
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fmiro.medium.com%2Fv2%2Fresize%3Afit%3A532%2F0%2A8X8dcC07435UtmmF" class="article-body-image-wrapper"&gt;&lt;img alt="promo shot of Breaking Rust. it is clearly A.I. generated" title="Image credit: Breaking Rust (Spotify)" src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fmiro.medium.com%2Fv2%2Fresize%3Afit%3A532%2F0%2A8X8dcC07435UtmmF" width="532" height="640"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Speaking of outlaw country…we have &lt;strong&gt;Breaking Rust&lt;/strong&gt; which seems to exclusively create a monotonous sound that isn’t &lt;em&gt;only&lt;/em&gt; mimicking the genre, it’s optimizing for a radio-friendly, familiar, “epic” image of it, engineered to sit dead-center in today’s casual-listener Spotify ecosystem.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;This is a commercially attractive interpretation of what rebel/outlaw country music is supposed to be&lt;/strong&gt;. (And that doesn’t have that much of an &lt;em&gt;outlaw&lt;/em&gt; in there, unsurprisingly. 😅)&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fmiro.medium.com%2Fv2%2Fresize%3Afit%3A700%2F1%2A2ZY1ouU20bQDn3rbaDAgyA.png" class="article-body-image-wrapper"&gt;&lt;img alt="a similarity score histogram for Breaking Rust. it shows they sound extremely similar to Morgan Wallen, Luke Combs, Adele, and Queen" src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fmiro.medium.com%2Fv2%2Fresize%3Afit%3A700%2F1%2A2ZY1ouU20bQDn3rbaDAgyA.png" width="700" height="458"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Morgan Wallen dominates the comparison set with a &lt;strong&gt;final score of ~0.71 and a 100% consistency ratio&lt;/strong&gt;, meaning &lt;em&gt;every single Breaking Rust track&lt;/em&gt; consistently mapped to Wallen’s catalog with high similarity.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Luke Combs&lt;/strong&gt; follows closely behind, again with perfect consistency.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Adele&lt;/strong&gt; shows up here because of the clean, soaring vocals, and &lt;strong&gt;Queen&lt;/strong&gt; does so because of the “thump &amp;amp; clap” routine that &lt;strong&gt;Breaking Rust&lt;/strong&gt; seems to employ frequently.&lt;/p&gt;

&lt;p&gt;There’s also very little experimentation here. Classic country influences (Cash, Jennings, early outlaw archetypes) do show up in the data, but barely surface acoustically. Even older rock-adjacent matches like Creedence Clearwater Revival appear only as secondary signals. After The Velvet Sundown, &lt;strong&gt;Breaking Rust is easily the most “engineered” sound — created not to reinvent the country genre but almost like it’s trying too hard to &lt;em&gt;win&lt;/em&gt; at it&lt;/strong&gt;, by anchoring itself squarely to the most commercially successful sound of the past decade.&lt;/p&gt;

&lt;p&gt;And based on both chart performance and these similarity scores, that strategy appears to be working. Here’s another ~0.6 similarity match. While &lt;strong&gt;Breaking Rust&lt;/strong&gt; is topping Billboard charts, &lt;strong&gt;Bryan Elijah Smith&lt;/strong&gt;, though still successful, isn’t.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;🎵 &lt;a href="https://open.spotify.com/track/4yRkwL18jctykPn52vnIEh?si=DavNPVrlRmSjIOecYBmTHA" rel="noopener noreferrer"&gt;AI artist, 2025&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;🎵 &lt;a href="https://open.spotify.com/track/5g53x9XQ3zDJHZ52xcsAH2?si=FZZrSbVWQUiFunwyc8n-_g" rel="noopener noreferrer"&gt;Human artist, 2024&lt;/a&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;strong&gt;Breaking Rust&lt;/strong&gt; is the perfect opportunity to talk about another big problem with these artists — they’re far too monotonous.&lt;/p&gt;

&lt;h2&gt;
  
  
  Repetitive Catalogues, Optimized for Recommendation Engines
&lt;/h2&gt;

&lt;p&gt;Up to this point, we’ve been asking a fairly intuitive question: &lt;em&gt;which real artists do these AI projects sound like?&lt;/em&gt; But there’s another, arguably more revealing angle here — &lt;strong&gt;how similar an AI artist’s songs are to &lt;em&gt;each other&lt;/em&gt;&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;To answer that, I ran the same embedding pipeline again, but this time comparing &lt;strong&gt;each AI Artist track against every other track from the same artist&lt;/strong&gt;, producing three full intra-artist similarity matrices.&lt;/p&gt;

&lt;p&gt;Conceptually, this tells us how tightly clustered the catalog is in embedding space — whether the artist has a recognizable sonic “center,” or whether the songs scatter across different musical regions.&lt;/p&gt;
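
&lt;p&gt;As a rough sketch of that step (assuming the &lt;code&gt;embeddings.npy&lt;/code&gt; / &lt;code&gt;metadata.pkl&lt;/code&gt; files saved by the OpenL3 script in the Methodology section; the filenames and the plain-cosine choice are mine, not necessarily the exact pipeline):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import numpy as np
import pickle

# assumes the per-track OpenL3 embeddings + metadata saved in the Methodology section
embeddings = np.load("embeddings/embeddings.npy")
with open("embeddings/metadata.pkl", "rb") as f:
    metadata = pickle.load(f)

def intra_artist_similarity(artist_name: str) -&amp;gt; np.ndarray:
    """cosine similarity matrix for all tracks by a single artist"""
    idx = [i for i, m in enumerate(metadata) if m["artist"] == artist_name]
    vecs = embeddings[idx]

    # L2-normalize so dot products are cosine similarities
    vecs = vecs / np.linalg.norm(vecs, axis=1, keepdims=True)
    sim = vecs @ vecs.T

    # summarize the off-diagonal pairs (every track is trivially identical to itself)
    off_diag = sim[~np.eye(len(idx), dtype=bool)]
    print(f"{artist_name}: mean={off_diag.mean():.3f}, "
          f"median={np.median(off_diag):.3f}, std={off_diag.std():.3f}")
    return sim

for artist in ["The Velvet Sundown", "Breaking Rust", "Aventhis"]:
    intra_artist_similarity(artist)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
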

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fmiro.medium.com%2Fv2%2Fresize%3Afit%3A1000%2F1%2A-ZbStgo_ybeg4cQpK2AMhA.png" class="article-body-image-wrapper"&gt;&lt;img alt="a chart showing intra-catalogue similarity for each artist in the dataset. It shows the A.I. artists being investigated have extremely similar, one dimensional catalogues" src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fmiro.medium.com%2Fv2%2Fresize%3Afit%3A1000%2F1%2A-ZbStgo_ybeg4cQpK2AMhA.png" width="800" height="510"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;If you were wondering, the artists/data points with the &lt;em&gt;most&lt;/em&gt; variation in their internal catalogues (bottom right) here are &lt;strong&gt;Queen&lt;/strong&gt; (0.120), &lt;strong&gt;David Bowie&lt;/strong&gt; (0.162), and &lt;strong&gt;The Beatles&lt;/strong&gt; (0.213). The least (top left) are &lt;strong&gt;Kensington&lt;/strong&gt; (0.655), &lt;strong&gt;Doe Maar&lt;/strong&gt; (0.633) and &lt;strong&gt;AC/DC&lt;/strong&gt; (0.606).&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;strong&gt;The Velvet Sundown are extremely internally consistent&lt;/strong&gt;. Most track-to-track cosine similarities land in the ~0.6–0.8 range, with many pairs pushing even higher. That’s near-homogeneity in tempo, timbral palette, harmonic structure, and arrangement style. In other words, once you’ve heard 3–4 Velvet Sundown tracks, you’ve essentially heard ’em all.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Breaking Rust&lt;/strong&gt; is even worse.&lt;/p&gt;

&lt;p&gt;Their intra-similarity scores are &lt;em&gt;ridiculously&lt;/em&gt; high across the board, with a high average, a high median, and low variance compared to the other two artists.&lt;/p&gt;

&lt;p&gt;Even more than Velvet Sundown, &lt;strong&gt;almost every Breaking Rust song sounds a lot like every other Breaking Rust song&lt;/strong&gt;. Close enough that just one song could describe its entire catalog.&lt;/p&gt;

&lt;p&gt;This is especially evident in the most similar pairs, where similarities routinely exceed 0.85 — values you’d normally expect from alternate takes or remixes, not from distinct songs, even ones on the same album. Tempo, instrumentation, vocal tone, and structure are nearly one-dimensional.&lt;/p&gt;

&lt;p&gt;Strictly from an algorithmic POV, this explains why Breaking Rust works so well on Spotify: it’s &lt;em&gt;predictable in the best possible way&lt;/em&gt;. If you like one song, the Spotify algorithm can safely serve you ten more without risking a skip. From a modeling perspective, it’s also the clearest example of how &lt;strong&gt;AI systems naturally converge on a local optimum when the people behind them get lazy&lt;/strong&gt; — once a sound “works,” it gets reused aggressively unless you’re actively correcting for it.&lt;/p&gt;

&lt;p&gt;Depressing, but further proof that rather than artistic intent, these AI musicians seem to have been designed for streaming efficiency — music engineered to sound “right enough” to play to the algorithm of major recommendation engines.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Aventhis&lt;/strong&gt; is a statistical anomaly here for reasons mentioned before. While it does show lower average similarity and higher variance, including a few pairs that are barely related at all, the issue is that their songs jump genres often, making such analysis misleading.&lt;/p&gt;

&lt;h2&gt;
  
  
  Lyrics Are Technically Correct, But Emotionally Nonexistent.
&lt;/h2&gt;

&lt;p&gt;Once you strip away production, vocals, and genre cues, lyrics are where artistic intent — or its absence — tends to leak through. This is what actually ties our quantitative results and the “felt experience” together.&lt;/p&gt;

&lt;p&gt;So instead of overfitting on rhyme schemes or clever metaphors, I looked at three broad dimensions:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Lexical &amp;amp; syntactic complexity&lt;/strong&gt; (how varied and structurally complex the language is)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Rhyme density &amp;amp; repetition&lt;/strong&gt; (how “song-like” vs templated the writing feels)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Sentiment &amp;amp; emotion&lt;/strong&gt; (what emotional space these songs consistently occupy)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;First of all, there’s not much to distinguish these artists in terms of word or sentence length. Breaking Rust has the most words per song, but they’re all pretty uniform.&lt;/p&gt;

&lt;p&gt;The patterns that emerge upon deeper analysis though are… telling. If the audio analysis showed that these AI artists sit comfortably inside well-worn musical lanes, the lyric analysis confirms this: &lt;strong&gt;across all three artists, the lyrics read less like authored statements and more like &lt;em&gt;statistical averages of their genre&lt;/em&gt;.&lt;/strong&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Velvet Sundown: Aesthetic Without Substance
&lt;/h3&gt;

&lt;p&gt;Their lyrics are the most revealing case because they &lt;em&gt;sound&lt;/em&gt; poetic on first pass. There’s no shortage of imagery: &lt;em&gt;“boots in the mud,” “smoke in the sky,” “shadows falling,” “voices whisper,” “marching ghosts”&lt;/em&gt;. War, peace, rebellion, fire, silence — all repeated endlessly, rarely developed.&lt;/p&gt;

&lt;p&gt;Lines like:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Dust on the wind&lt;br&gt;
Boots on the ground&lt;br&gt;
Smoke in the sky&lt;br&gt;
No peace found&lt;br&gt;
Rivers run red&lt;br&gt;
The drums roll slow&lt;br&gt;
Tell me brother, where do we go?&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Sound meaningful if you’re 14, but they’re basically just symbolic placeholders rather than narrative beats. The metrics bear this out: Velvet Sundown has low word entropy, short lines, and extremely low readability scores — they’re easily digestible. &lt;strong&gt;High TTR (0.61)&lt;/strong&gt; and &lt;strong&gt;longer average word length (~4.1)&lt;/strong&gt; give the illusion of richness, but the lyrics are syntactically simple and emotionally generic.&lt;/p&gt;

&lt;p&gt;Their rhyme density is fairly low, but that’s misleading — this is because their lyrics are sparse by design.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fmiro.medium.com%2Fv2%2Fresize%3Afit%3A1000%2F1%2AdONhBP5-kWi5uyjw7t9kWg.png" class="article-body-image-wrapper"&gt;&lt;img alt="Visual representation of rhyming line pairs for all songs by The Velvet Sundown. It shows that this artist has moderately high rhyme scheme, but sparse lyrics" src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fmiro.medium.com%2Fv2%2Fresize%3Afit%3A1000%2F1%2AdONhBP5-kWi5uyjw7t9kWg.png" width="800" height="443"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Visual representation of rhyming line pairs for all songs by the selected artist. Lines are shown as dots on a vertical axis, with curved arcs connecting rhyming lines. Thicker arcs indicate more frequent repetitions, revealing chorus recycling patterns and repetition across long distances.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Really, it’s when you actually read the lyrics end-to-end that the cracks show &lt;em&gt;hard&lt;/em&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Almost every song collapses into very brief (3–4 word) phrases, and the same moral binary: &lt;em&gt;war bad, peace good&lt;/em&gt;
&lt;/li&gt;
&lt;li&gt;No characters, no storytelling, no deeper narratives or themes, no temporal progression. Oh, and “flag” and “dust” are mentioned a ton, for some reason.&lt;/li&gt;
&lt;li&gt;Choruses recycle very abstract imperatives: &lt;em&gt;“raise your voice,” “don’t look away,” “we won’t fade”&lt;/em&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The language gestures at seriousness without ever committing to specifics. It’s protest music with the &lt;em&gt;protest&lt;/em&gt; removed. It’s more like protest &lt;em&gt;aesthetic&lt;/em&gt; — a Spotify-ready approximation of “serious rock” that never risks alienating anyone by saying anything precise.&lt;/p&gt;

&lt;h3&gt;
  
  
  Aventhis: Performative Masculine Trauma
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Aventhis&lt;/strong&gt; is more coherent than &lt;strong&gt;Velvet Sundown&lt;/strong&gt;, but only because it is far less poetic and commits fully to a single archetype: the wounded, defiant outlaw. Every song reinforces the same emotional posturing — mistrust, self-reliance, pain turned inward — to the point that they all blur together.&lt;/p&gt;

&lt;p&gt;The lyrics are packed with stock phrases of what could &lt;em&gt;only&lt;/em&gt; be called masculine suffering:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;“Scars where I never kept my mouth shut”&lt;br&gt;
“I ride through hell with my head held high”&lt;br&gt;
“I walk with ghosts that don’t disappear”&lt;br&gt;
“I don’t kneel or cry”&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Unlike &lt;strong&gt;Velvet Sundown’s&lt;/strong&gt; abstract collectivism, &lt;strong&gt;Aventhis&lt;/strong&gt; is &lt;em&gt;intensely&lt;/em&gt; first-person — but still vague. The pain is &lt;strong&gt;constant&lt;/strong&gt;. And also (conveniently!), undefined. Abuse, regret, sin, and violence are all implied, and occasionally they’ll include superficial country-genre specifics (whiskey, barbed wire, boots, fire).&lt;/p&gt;

&lt;p&gt;Things you’ve heard a thousand times before from a thousand one-note Top 40 Country Billboard artists.&lt;/p&gt;

&lt;p&gt;Lexically, &lt;strong&gt;Aventhis&lt;/strong&gt; sits in the middle: moderate vocabulary diversity, average word length, and fairly standard line lengths. But the real signal is consistency. Their MATTR is high and stable, entropy is middling, and syntactic measures are tightly clustered across songs.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fmiro.medium.com%2Fv2%2Fresize%3Afit%3A1000%2F1%2AsBlU9Z-fTfqUYuafoRGk6Q.png" class="article-body-image-wrapper"&gt;&lt;img alt="Histogram showing emotion distribution with values for anger, fear, disgust, joy, neutral, sadness, and surprise. Lyrics of all three AI artists show almost nothing but fear, sadness, and anger. No joy or surprise, at all" src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fmiro.medium.com%2Fv2%2Fresize%3Afit%3A1000%2F1%2AsBlU9Z-fTfqUYuafoRGk6Q.png" width="800" height="323"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The emotional distribution is overwhelmingly negative, and arguably engineered for engagement.&lt;/p&gt;

&lt;p&gt;Emotionally, however, &lt;strong&gt;Aventhis&lt;/strong&gt; is almost comically narrow compared to the others: &lt;strong&gt;Fear&lt;/strong&gt; dominates the emotional distribution, but not in a nuanced way — it’s fear as some sort of… aesthetic texture, nothing that ever feels like a lived experience. Even empowerment anthems like &lt;em&gt;“Burn What Held You”&lt;/em&gt; resolve into catchy slogans rather than anything deep.&lt;/p&gt;

&lt;p&gt;It’s all tension, threat, perseverance, and moral conflict. They’re cosplaying the “gritty outlaw” archetype, and doing it efficiently, repeatedly, and in a way that slots neatly into modern outlaw-country playlists.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fmiro.medium.com%2Fv2%2Fresize%3Afit%3A1000%2F1%2AY3PoA8YPKS4-3kj5u0iglg.png" class="article-body-image-wrapper"&gt;&lt;img alt="Visual representation of rhyming line pairs for all songs by Aventhis. It shows very high rhyming" src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fmiro.medium.com%2Fv2%2Fresize%3Afit%3A1000%2F1%2AY3PoA8YPKS4-3kj5u0iglg.png" width="800" height="437"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Breaking Rust: Simple, Repetitive, and Chorus-Forward
&lt;/h3&gt;

&lt;p&gt;They are the most structurally “song-like” in the traditional sense — and also the least complex.&lt;/p&gt;

&lt;p&gt;Lexical diversity is the lowest of the three, sentences are shorter, and syntactic depth is shallow. That’s not inherently bad, but it does correlate with what shows up elsewhere: &lt;strong&gt;extremely high&lt;/strong&gt; &lt;strong&gt;rhyme and repetition&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fmiro.medium.com%2Fv2%2Fresize%3Afit%3A1000%2F1%2AMU3CI3fUCvUBeKpYjydpEg.png" class="article-body-image-wrapper"&gt;&lt;img alt="Visual representation of rhyming line pairs for all songs by Breaking Rust. It shows extremely high rhyming" src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fmiro.medium.com%2Fv2%2Fresize%3Afit%3A1000%2F1%2AMU3CI3fUCvUBeKpYjydpEg.png" width="800" height="439"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This is music designed to be instantly familiar and easy to latch onto. The sentiment is mostly neutral, with &lt;strong&gt;sadness&lt;/strong&gt; edging out &lt;strong&gt;fear&lt;/strong&gt; and &lt;strong&gt;anger&lt;/strong&gt;. Emotionally, it’s less intense than &lt;strong&gt;Aventhis&lt;/strong&gt; and less layered than Velvet Sundown.&lt;/p&gt;

&lt;p&gt;Songs like &lt;em&gt;“Kicking Back at the Ground”&lt;/em&gt;, &lt;em&gt;“Livin’ on Borrowed Time”&lt;/em&gt;, and &lt;em&gt;“Whiskey Don’t Talk Back”&lt;/em&gt; are built from familiar country scaffolding: hard work, pride, scars, drinking, resilience. The imagery is clearer than Velvet Sundown’s and (thankfully!) less theatrical than Aventhis’s, but the emotional range is easily the narrowest here.&lt;/p&gt;

&lt;p&gt;You see the same ideas recycled with small lexical swaps:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;“Dust on my boots”&lt;br&gt;
“Scars that sing”&lt;br&gt;
“Pour another glass”&lt;br&gt;
“I was born this way”&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;The metrics show &lt;strong&gt;high repetition&lt;/strong&gt;, &lt;strong&gt;high readability&lt;/strong&gt;, and very stable sentiment — almost aggressively neutral.&lt;/p&gt;

&lt;p&gt;There are no lines that reframe a familiar idea or linger on an uncomfortable detail. Everything resolves cleanly into affirmation: &lt;em&gt;keep going&lt;/em&gt;, &lt;em&gt;stay true&lt;/em&gt;, &lt;em&gt;don’t quit&lt;/em&gt;. It’s music designed to be nodded along to, not sat with, or thought about for any length of time.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Bigger Pattern…is Depressing.
&lt;/h2&gt;

&lt;p&gt;Across all three artists, we keep seeing the same things:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Low semantic risk:&lt;/strong&gt; short, uncomplicated words in short sentences. I imagine this is done for wide appeal, as well as to avoid their AI vocal generation tripping up on complex words or phrasing.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;High emotional clarity, but extremely low emotional depth&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Heavy reliance on genre signifiers.&lt;/strong&gt; Again, see: mimicking known archetypes.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Heavy emphasis on rhyming&lt;/strong&gt; (combined with #1)&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Lyrics optimized for recognizability, not memorability&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;These songs all feel like the lyrical equivalent of stock photography — technically correct, emotionally legible, and instantly forgettable.&lt;/p&gt;

&lt;p&gt;They don’t challenge the listener. They don’t demand interpretation. They don’t risk failure. And in a streaming economy that rewards familiarity over depth, &lt;strong&gt;that might be exactly the point.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;More proof that these AI musicians seem to have been designed for &lt;strong&gt;maximum playlist compatibility and minimum listener friction&lt;/strong&gt;. Real artists sound like they’re discovering a voice. These sound like systems that &lt;em&gt;already know exactly which one performs best.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;That’s everything. If you want to learn about my methodology in detail, read on. &lt;a href="https://python.plainenglish.io/i-investigated-the-top-3-ai-generated-artists-going-viral-on-spotify-5dcff825998b#ac49" rel="noopener noreferrer"&gt;Else, click this to go back to the table of contents.&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Methodology
&lt;/h2&gt;

&lt;p&gt;To compare AI-generated music with historical human-made songs in a way that’s both scalable and acoustically grounded, I treated this as a &lt;strong&gt;representation-learning + similarity search&lt;/strong&gt; problem rather than a genre-labeling or metadata exercise.&lt;/p&gt;

&lt;p&gt;First of all, I collected those 30-second Spotify preview clips for:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Everything the three AI artists (Velvet Sundown, Breaking Rust, and Aventhis) had on Spotify,&lt;/li&gt;
&lt;li&gt;10 songs from the &lt;a href="https://Volt.fm" rel="noopener noreferrer"&gt;Volt.fm&lt;/a&gt; charts for each of Spotify’s 26 official genres,&lt;/li&gt;
&lt;li&gt;Everything in the &lt;a href="https://www.kaggle.com/datasets/iamsumat/spotify-top-2000s-mega-dataset" rel="noopener noreferrer"&gt;&lt;em&gt;Spotify Top 2000&lt;/em&gt;&lt;/a&gt; dataset.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;So that’s a total of roughly 2,500 tracks. Then I used the Spotify API to programmatically search for each track’s URL, and scraped those URLs using Bright Data’s proxies to avoid getting blocked.&lt;/p&gt;

&lt;p&gt;Spotify’s 30-second previews are short, but they’re consistent, legally accessible, and widely used in MIR research — making them a reasonable proxy for a song’s overall sonic fingerprint.&lt;/p&gt;

&lt;h3&gt;
  
  
  Scraping Spotify preview audio (the slightly hacky part)
&lt;/h3&gt;

&lt;p&gt;My solution was to programmatically go through a list of song + artist name combos like:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;{"song": "Let it Burn", "artist": "The Velvet Sundown"}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;And use the Spotify API (registration required, but free) — specifically, the &lt;code&gt;/search&lt;/code&gt; endpoint — to grab the Spotify Web Player link for that song contained in the &lt;code&gt;external_urls&lt;/code&gt; field of the response object. This looks like:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;https://open.spotify.com/track/64JjvzPdH2h3u5cJDF4y96
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Why get this? Well, Spotify does not expose &lt;code&gt;preview_url&lt;/code&gt; via their public API anymore. However, the preview URLs do exist in the page metadata — specifically in the &lt;code&gt;og:audio&lt;/code&gt; Open Graph tag — but only under certain conditions.&lt;/p&gt;

&lt;p&gt;When requesting a track page with bot-like headers (say, your standard axios user-agent), Spotify returns server-rendered HTML that includes:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;&amp;lt;meta property="og:audio" content="https://p.scdn.co/mp3-preview/..." /&amp;gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;But when requesting the same page in the browser, or with &lt;strong&gt;browser-like headers&lt;/strong&gt;, Spotify instead serves a JavaScript-driven React shell with no audio metadata present in the initial HTML. This is why you won’t see preview URLs via “View Source” in a normal browser session.&lt;/p&gt;
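
&lt;p&gt;To make that concrete, here’s a minimal sketch of the bot-headers path: fetch the track page with a default (bot-like) user agent and read the &lt;code&gt;og:audio&lt;/code&gt; tag. (Whether Spotify keeps serving the tag this way can change; the full script below covers the whole flow and searches for the CDN pattern more broadly.)&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import requests
from bs4 import BeautifulSoup

def preview_from_og_audio(track_url: str) -&amp;gt; str | None:
    """fetch a Spotify track page with default (bot-like) headers and
    read the preview MP3 URL from the og:audio Open Graph tag, if present"""
    html = requests.get(track_url, timeout=30).text
    soup = BeautifulSoup(html, "html.parser")
    tag = soup.find("meta", property="og:audio")
    return tag.get("content") if tag else None

print(preview_from_og_audio("https://open.spotify.com/track/64JjvzPdH2h3u5cJDF4y96"))
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
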

&lt;p&gt;That makes it easy for us to scrape them. All we have to do is get the track URL, fetch the HTML for that page, and search for the CDN pattern on it.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import os
import sys
import json
import ssl
import time
import argparse
from urllib.parse import urlencode
import urllib.request

import requests
from bs4 import BeautifulSoup
from dotenv import load_dotenv

ssl._create_default_https_context = ssl._create_unverified_context

load_dotenv()

# example: list of songs with artist names
# format: [{"song": "Song Name", "artist": "Artist Name"}, ...]
# if INPUT_FILE is provided, it will load from that file instead
SONGS = [
    # example entries - replace with your own or use an inputfile JSON
    {"song": "Let it Burn", "artist": "The Velvet Sundown"},
    {"song": "As the Silence Falls", "artist": "The Velvet Sundown"},
]


def build_brightdata_proxy():
    proxy_user = os.environ.get("BRIGHT_DATA_PROXY_USER", "")
    proxy_pass = os.environ.get("BRIGHT_DATA_PROXY_PASS", "")
    if not (proxy_user and proxy_pass):
        return None
    proxy_url = f"http://{proxy_user}:{proxy_pass}@brd.superproxy.io:33335"
    return proxy_url


def get_spotify_access_token():
    token = os.environ.get("SPOTIFY_ACCESS_TOKEN")
    if not token:
        raise RuntimeError("SPOTIFY_ACCESS_TOKEN environment variable is required")
    return token


def search_spotify_tracks(song_name, artist_name, limit=5):
    query = f'track:"{song_name}" artist:"{artist_name}"'
    params = {
        "q": query,
        "type": "track",
        "market": "US",
        "limit": str(limit),
    }
    headers = {"Authorization": f"Bearer {get_spotify_access_token()}"}
    resp = requests.get("https://api.spotify.com/v1/search", params=params, headers=headers)
    resp.raise_for_status()
    return resp.json().get("tracks", {}).get("items", [])


def get_spotify_links(url):
    proxy_url = build_brightdata_proxy()
    if proxy_url:
        opener = urllib.request.build_opener(
            urllib.request.ProxyHandler({'https': proxy_url, 'http': proxy_url})
        )
        response = opener.open(url)
        html = response.read().decode('utf-8')
    else:
        resp = requests.get(url)
        resp.raise_for_status()
        html = resp.text

    soup = BeautifulSoup(html, "html.parser")

    scdn_links = set()

    for element in soup.find_all(True):
        for attr_name, attr_value in element.attrs.items():
            if isinstance(attr_value, (list, tuple)):
                attr_values = attr_value
            else:
                attr_values = [attr_value]
            for value in attr_values:
                if value and "p.scdn.co" in value:
                    scdn_links.add(value)

    return list(scdn_links)


def find_preview_url(track_name, artist_name):
    """find preview URL for a track using Spotify API search and scraping"""
    try:
        tracks = search_spotify_tracks(track_name, artist_name, limit=3)

        if not tracks:
            return None

        # find best match by checking artist similarity
        best_match = None
        artist_lower = artist_name.lower()

        for track in tracks:
            track_artists = ', '.join(artist_obj['name'] for artist_obj in track['artists'])
            if artist_lower and artist_lower in track_artists.lower():
                best_match = track
                break

        # if no exact match, use first result
        if not best_match:
            best_match = tracks[0]

        # get Spotify URL and scrape for preview
        spotify_url = best_match.get("external_urls", {}).get("spotify")
        if not spotify_url:
            return None

        preview_urls = get_spotify_links(spotify_url)
        if preview_urls and len(preview_urls) &amp;gt; 0:
            return preview_urls[0]

        return None
    except Exception as error:
        print(f"Error finding preview for {track_name}: {error}", file=sys.stderr)
        return None


def load_songs(input_file):
    """load songs from input file or use SONGS list"""
    if input_file and os.path.exists(input_file):
        print(f"Loading songs from: {input_file}")
        with open(input_file, 'r', encoding='utf-8') as f:
            data = json.load(f)

        # support both array format and object format
        if isinstance(data, list):
            return data
        elif isinstance(data, dict) and "songs" in data and isinstance(data["songs"], list):
            return data["songs"]
        else:
            raise ValueError('Input file must contain an array of {"song", "artist"} objects or {"songs": [...]}')

    return SONGS


def get_previews(input_file=None, output_file='spotify-preview_urls.json'):
    """main function to get preview URLs for all songs"""
    songs = load_songs(input_file)

    # validate songs format
    if not isinstance(songs, list) or len(songs) == 0:
        print('Error: No songs found. Please provide songs in the SONGS list or via INPUT_FILE.', file=sys.stderr)
        print('Format: [{"song": "Song Name", "artist": "Artist Name"}, ...]', file=sys.stderr)
        sys.exit(1)

    # validate each entry has song and artist
    for i, song in enumerate(songs):
        if not song.get("song") or not song.get("artist"):
            print(f"Error: Entry {i + 1} is missing 'song' or 'artist' property.", file=sys.stderr)
            print('Format: {"song": "Song Name", "artist": "Artist Name"}', file=sys.stderr)
            sys.exit(1)

    preview_urls = {}
    found_count = 0
    not_found_count = 0
    total_tracks = len(songs)

    print(f"Found {total_tracks} songs to process\n")
    print('=' * 60)

    # process each song
    for i, song_entry in enumerate(songs):
        song_name = song_entry["song"].strip()
        artist_name = song_entry["artist"].strip()

        print(f"\n[{i + 1}/{total_tracks}] Searching: {song_name} - {artist_name}")

        # find preview URL
        preview_url = find_preview_url(song_name, artist_name)

        if preview_url:
            # use song name as key, or song + artist for uniqueness
            key = f"{song_name} - {artist_name}"
            preview_urls[key] = preview_url
            found_count += 1
            print(f"  ✓ Found: {preview_url[:70]}...")

            # save immediately when preview URL is found
            with open(output_file, 'w', encoding='utf-8') as f:
                json.dump(preview_urls, f, indent=2, ensure_ascii=False)
            print(f"  Saved to JSON ({found_count} found so far)")
        else:
            not_found_count += 1
            print(f"  ✗ No preview URL found")

        # rate limiting between tracks
        if i &amp;lt; len(songs) - 1:
            time.sleep(0.5)

    # final save
    with open(output_file, 'w', encoding='utf-8') as f:
        json.dump(preview_urls, f, indent=2, ensure_ascii=False)

    # summary
    print('\n' + '=' * 60)
    print('Summary:')
    print('=' * 60)
    print(f"Preview URLs found: {found_count}")
    print(f"Preview URLs not found: {not_found_count}")
    print(f"Total tracks processed: {total_tracks}")
    print(f"\nResults saved to: {output_file}")


def main():
    parser = argparse.ArgumentParser(
        description='Get Spotify preview URLs for a list of songs and artists'
    )
    parser.add_argument(
        'input_file',
        nargs='?',
        default=None,
        help='Optional JSON input file with songs list'
    )
    parser.add_argument(
        'output_file',
        nargs='?',
        default='spotify-preview_urls.json',
        help='Output file path (default: spotify-preview_urls.json)'
    )

    args = parser.parse_args()

    try:
        get_previews(args.input_file, args.output_file)
    except Exception as error:
        print(f'Fatal error: {error}', file=sys.stderr)
        sys.exit(1)


if __name__ == "__main__":
    main()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If you’re following along and using a proxy, you’ll need to sign up (link below) to get the username, password, zone, etc.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;a href="https://get.brightdata.com/bd7914?utm_content=i_investigated_the_top_three_ai_generated_artists_going_viral_on_spotify" rel="noopener noreferrer"&gt;Bright Data - All in One Platform for Proxies and Web Scraping&lt;/a&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Anyway, I collected those CDN URLs for all the songs I wanted, dumped them to a JSON file, and downloaded the clips directly from Spotify's CDN (p.scdn.co) for offline processing.&lt;/p&gt;

&lt;h3&gt;
  
  
  Song Fingerprinting with OpenL3
&lt;/h3&gt;

&lt;p&gt;Then, each audio clip was converted into a fixed-length representation using &lt;strong&gt;OpenL3&lt;/strong&gt;, a deep audio embedding model trained to capture perceptual and musical characteristics beyond simple features like tempo or key.&lt;/p&gt;

&lt;p&gt;Here’s a minimal implementation.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import numpy as np
import openl3
import librosa
import pickle
from pathlib import Path

def extract_embedding_from_mp3(filepath: Path) -&amp;gt; np.ndarray:
    """
    extract OpenL3 embedding from an MP3 file

    Args:
        filepath: path to MP3 file

    Returns:
        embedding vector (512-dimensional by default)
    """
    # load audio file at 48kHz (required by OpenL3)
    audio, sr = librosa.load(str(filepath), sr=48000)

    # extract embedding using OpenL3
    # returns (embeddings, timestamps) where embeddings is (n_frames, 512)
    emb, _ = openl3.get_audio_embedding(
        audio,
        sr,
        content_type='music',
        input_repr='mel256',
        embedding_size=512
    )

    # average over time frames to get a single embedding vector
    embedding = np.mean(emb, axis=0)

    return embedding

def process_mp3_directory(directory: Path) -&amp;gt; tuple[list, list]:
    """
    process all MP3 files in a directory and extract embeddings with metadata

    Args:
        directory: directory containing MP3 files

    Returns:
        tuple of (embeddings_list, metadata_list)
    """
    mp3_files = sorted(list(directory.glob("*.mp3")))
    embeddings = []
    metadata = []

    for filepath in mp3_files:
        # extract embedding
        embedding = extract_embedding_from_mp3(filepath)

        # parse filename for metadata (assumes format: "Track Name - Artist Name.mp3")
        if " - " in filepath.stem:
            parts = filepath.stem.rsplit(" - ", 1)
            track_name = parts[0].strip()
            artist_name = parts[1].strip()
        else:
            track_name = filepath.stem
            artist_name = "Unknown"

        # build metadata dict
        meta_dict = {
            "file_path": str(filepath),
            "track_name": track_name,
            "artist": artist_name
        }

        embeddings.append(embedding)
        metadata.append(meta_dict)

    return embeddings, metadata

def save_embeddings(embeddings: list, metadata: list, output_dir: str = "embeddings"):
    """save embeddings and metadata to disk"""
    output_path = Path(output_dir)
    output_path.mkdir(exist_ok=True)

    # convert to numpy array and save
    embeddings_array = np.array(embeddings)
    np.save(output_path / "embeddings.npy", embeddings_array)

    # save metadata
    with open(output_path / "metadata.pkl", 'wb') as f:
        pickle.dump(metadata, f)

# example usage
if __name__ == "__main__":
    mp3_dir = Path("previews")
    embeddings, metadata = process_mp3_directory(mp3_dir)
    save_embeddings(embeddings, metadata)
    print(f"extracted {len(embeddings)} embeddings")
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;What this does:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Loads audio from each MP3 file at 48kHz using &lt;code&gt;librosa&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;Extracts frame-level embeddings with &lt;code&gt;openl3.get_audio_embedding()&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;Averages over time frames to get a single vector per track, saved as a numpy array (&lt;code&gt;.npy&lt;/code&gt;).&lt;/li&gt;
&lt;li&gt;Builds metadata dictionaries with file path, track name, and artist (parsed from the filename) and saves them as a pickle file.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;OpenL3 produces frame-level embeddings, so I averaged them over time to obtain a single 512-dimensional vector per track.&lt;/p&gt;

&lt;p&gt;To make cosine similarity meaningful at scale, I applied &lt;strong&gt;mean-centering to remove global bias&lt;/strong&gt;, followed by &lt;strong&gt;L2 normalization&lt;/strong&gt;, and indexed the historical songs using &lt;strong&gt;FAISS&lt;/strong&gt; with an inner-product index.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import numpy as np
import faiss
import pickle
from pathlib import Path

def build_faiss_index(embeddings: np.ndarray, metadata: list) -&amp;gt; faiss.Index:
    """
    build FAISS index from embeddings for cosine similarity search

    Args:
        embeddings: numpy array of embeddings (n_samples, n_features)
        metadata: list of metadata dictionaries (must match embedding order)

    Returns:
        FAISS index
    """
    n_samples, n_features = embeddings.shape

    # normalize embeddings for cosine similarity
    # subtract global mean to improve separation
    mean_vec = np.mean(embeddings, axis=0, keepdims=True)
    embeddings_centered = embeddings - mean_vec

    # L2 normalize
    norms = np.linalg.norm(embeddings_centered, axis=1, keepdims=True)
    norms[norms == 0] = 1
    embeddings_normalized = embeddings_centered / norms

    # create inner product index (for cosine similarity on normalized vectors)
    index = faiss.IndexFlatIP(n_features)
    index.add(embeddings_normalized.astype('float32'))

    return index

def save_index(index: faiss.Index, metadata: list, output_dir: str = "faiss_index"):
    """save FAISS index and metadata to disk"""
    output_path = Path(output_dir)
    output_path.mkdir(exist_ok=True)

    # save index
    faiss.write_index(index, str(output_path / "index.faiss"))

    # save metadata (order must match FAISS IDs exactly)
    with open(output_path / "metadata.pkl", 'wb') as f:
        pickle.dump(metadata, f)

# example usage
if __name__ == "__main__":
    # load embeddings and metadata
    embeddings = np.load("embeddings/historical_embeddings.npy")
    with open("embeddings/historical_metadata.pkl", 'rb') as f:
        metadata = pickle.load(f)

    # build index
    index = build_faiss_index(embeddings, metadata)

    # save index
    save_index(index, metadata)

    print(f"built index with {index.ntotal} vectors")
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;What this does:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Loads embeddings (&lt;code&gt;npy&lt;/code&gt;) and metadata (&lt;code&gt;pkl&lt;/code&gt;) from disk&lt;/li&gt;
&lt;li&gt;Subtracts the global mean, then L2 normalizes each vector&lt;/li&gt;
&lt;li&gt;Builds a FAISS index: creates an inner product index (cosine similarity on normalized vectors) and adds the normalized embeddings&lt;/li&gt;
&lt;li&gt;Writes the FAISS index and metadata pickle (order matches FAISS vector IDs)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;With everything now sitting in a vector index, I could simply query it for each AI-generated artist’s nearest neighbors, and then &lt;strong&gt;aggregate results at the artist level&lt;/strong&gt; using a top-k similarity strategy (mean of the top three matching tracks per artist, k = 400), combined across all AI tracks.&lt;/p&gt;

&lt;p&gt;For each query song, I retrieve the top 400 nearest neighbors from the FAISS index, then convert distances to similarity scores (a 0.0–1.0 range where 1 = most similar; this is not a percentage).&lt;/p&gt;
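
&lt;p&gt;The actual query and aggregation code lives in the gist linked below, but the core lookup step is roughly this (a sketch; the &lt;code&gt;mean_vector.npy&lt;/code&gt; filename and the exact 0-1 rescaling are my assumptions):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import numpy as np
import faiss
import pickle

index = faiss.read_index("faiss_index/index.faiss")
with open("faiss_index/metadata.pkl", "rb") as f:
    metadata = pickle.load(f)

# assumption: the global mean used at index-build time was saved alongside the index
mean_vec = np.load("faiss_index/mean_vector.npy")

def query_similar_tracks(query_embedding: np.ndarray, k: int = 400):
    """center + L2-normalize a query embedding exactly like the indexed vectors,
    then return (similarity, track metadata) pairs for its k nearest neighbors"""
    q = (query_embedding - mean_vec).ravel()
    q = q / (np.linalg.norm(q) or 1.0)
    scores, ids = index.search(q.reshape(1, -1).astype("float32"), k)

    results = []
    for score, i in zip(scores[0], ids[0]):
        # inner product of unit vectors is cosine similarity in [-1, 1];
        # rescale to 0-1 so 1 = most similar (not a percentage)
        similarity = (float(score) + 1.0) / 2.0
        results.append((similarity, metadata[i]))
    return results
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
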

&lt;blockquote&gt;
&lt;p&gt;Code used in this step: &lt;a href="https://gist.github.com/sixthextinction/863140ec9d9126b142f058629ab873a1" rel="noopener noreferrer"&gt;https://gist.github.com/sixthextinction/863140ec9d9126b142f058629ab873a1&lt;/a&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Things to remember:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Historical embeddings are normalized when building the index. Query embeddings must be normalized the same way (using the saved mean vector) so they’re in the same space for similarity search.&lt;/li&gt;
&lt;li&gt;Artists that appeared consistently across many queries need to be weighted more heavily, while low-confidence matches need to be filtered out.&lt;/li&gt;
&lt;li&gt;Artists with fewer than 3 tracks are excluded (low confidence)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The result: three ranked lists of historical artists (I just took the Top 4) whose catalogs were acoustically closest — in embedding space — to each AI-generated artist.&lt;/p&gt;

&lt;p&gt;Here’s how I’m scoring this. The &lt;code&gt;final_score&lt;/code&gt; is:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;mean_artist_score * (0.7 + 0.3 * consistency_weight)&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Range: 0.0 to 1.0 (higher = more similar).&lt;/p&gt;

&lt;p&gt;Where:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;mean_artist_score&lt;/strong&gt;: average of the top-3 track similarities across all query songs&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;consistency_weight:&lt;/strong&gt; ratio of query songs that matched this artist (capped at 1.0)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;I put a 0.7 base weight on similarity and a 0.3 bonus for consistency across queries, so, for example: if an artist matches 5/10 query songs, &lt;code&gt;consistency_weight = 0.5&lt;/code&gt;, so the multiplier is &lt;code&gt;0.7 + 0.3 × 0.5 = 0.85&lt;/code&gt;&lt;/p&gt;
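
&lt;p&gt;In code, the scoring is just a few lines (my variable names, same formula):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;def final_score(top3_sims_per_query: list[list[float]], total_queries: int) -&amp;gt; float:
    """score one candidate (human) artist against all query songs from an AI artist

    top3_sims_per_query: for each query song that matched this artist,
    the similarities (0.0-1.0) of its top-3 matching tracks
    """
    # mean_artist_score: average of the top-3 track similarities across all query songs
    all_sims = [s for sims in top3_sims_per_query for s in sims]
    mean_artist_score = sum(all_sims) / len(all_sims)

    # consistency_weight: ratio of query songs that matched this artist, capped at 1.0
    consistency_weight = min(len(top3_sims_per_query) / total_queries, 1.0)

    # 0.7 base weight on similarity, 0.3 bonus for consistency
    return mean_artist_score * (0.7 + 0.3 * consistency_weight)

# the example from the text: 5/10 query songs matched, each at ~0.8 similarity
print(final_score([[0.8, 0.8, 0.8]] * 5, total_queries=10))  # 0.8 * 0.85 = 0.68
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
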

&lt;p&gt;Finally, I collected lyrics for all three artists from Genius and the artists’ own YouTube video descriptions, and ran a short lyrical analysis pipeline.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;a href="https://genius.com/" rel="noopener noreferrer"&gt;Genius | Song Lyrics &amp;amp; Knowledge&lt;/a&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;In it, rather than hand-waving about abstract “themes,” I quantified concrete lyrical properties. For each song, I measured lexical diversity using both Type–Token Ratio (TTR) and Moving-Average TTR (MATTR, more stable on short texts like lyrics), syntactic complexity (sentence length, subordination depth), sentiment and emotion distributions, and rhyme structure (rhyme density and reuse). These per-song metrics were then aggregated at the artist level to highlight patterns that persist across entire catalogs — not just individual tracks.&lt;/p&gt;
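
&lt;p&gt;For illustration, here’s roughly how the lexical-diversity half of that works (a minimal sketch; the tokenizer and the 50-word window are assumptions, and the real pipeline also covers syntax, sentiment/emotion, and rhyme):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import re

def tokenize(lyrics: str) -&amp;gt; list[str]:
    """lowercase word tokens, punctuation stripped"""
    return re.findall(r"[a-z']+", lyrics.lower())

def ttr(tokens: list[str]) -&amp;gt; float:
    """Type-Token Ratio: unique words / total words"""
    return len(set(tokens)) / len(tokens) if tokens else 0.0

def mattr(tokens: list[str], window: int = 50) -&amp;gt; float:
    """Moving-Average TTR: mean TTR over a sliding window,
    which is more stable than plain TTR on short texts like lyrics"""
    if len(tokens) &amp;lt;= window:
        return ttr(tokens)
    n_windows = len(tokens) - window + 1
    return sum(ttr(tokens[i:i + window]) for i in range(n_windows)) / n_windows

song = "Dust on the wind / Boots on the ground / Smoke in the sky / No peace found"
tokens = tokenize(song)
print(round(ttr(tokens), 3), round(mattr(tokens), 3))
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
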

&lt;p&gt;After that it was simply a matter of interpreting the data, and generating visualizations.&lt;/p&gt;

&lt;h2&gt;
  
  
  Limitations of This Study
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;While the historical comparison set was large — the Spotify Top 2000 dataset plus top tracks across 26 major Spotify genres — it’s still finite and biased toward commercially successful music.&lt;/li&gt;
&lt;li&gt;My analysis also relies on short 30-second preview clips for audio embeddings and publicly available lyric transcriptions, which is good enough for a similarity analysis, but can never be a perfect comparison.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;And that’s absolutely everything. Thank you for reading!&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Thank you for reading! 🙌 This was the fourth in a series of data-driven deep dives I’m doing — forensic teardowns of things that are interesting, or things that shouldn’t work but do. If you want to see what else I find buried in data, follow along.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>datascience</category>
      <category>programming</category>
      <category>python</category>
    </item>
    <item>
      <title>Why Google, Bing, DuckDuckGo &amp; Yandex Show Different Results For the Same Query (2026)</title>
      <dc:creator>Prithwish Nath</dc:creator>
      <pubDate>Wed, 21 Jan 2026 16:05:00 +0000</pubDate>
      <link>https://dev.to/prithwish_nath/why-google-bing-duckduckgo-yandex-show-different-results-for-the-same-query-2026-1c06</link>
      <guid>https://dev.to/prithwish_nath/why-google-bing-duckduckgo-yandex-show-different-results-for-the-same-query-2026-1c06</guid>
      <description>&lt;blockquote&gt;
&lt;p&gt;TL;DR —  &lt;strong&gt;Search engines don’t just retrieve information; they decide what counts as “knowledge”. Depending on the engine, the same query can prioritize institutional safety, consensus, monetization, or unfiltered chaos. This investigation maps those differences.&lt;/strong&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Just how differently do the most popular search engines work?&lt;/p&gt;

&lt;p&gt;To find out, I ran a search-engine forensics investigation using identical queries across Google, Bing, DuckDuckGo, and Yandex. Using  &lt;a href="https://get.brightdata.com/bd7914?utm_content=why_google_bing_duckduckgo_and_yandex_show_different_results_for_the_same_query_2026" rel="noopener noreferrer"&gt;Bright Data’s SERP infrastructure&lt;/a&gt;,  &lt;strong&gt;I collected 1,630 SERP results spanning commercial, technical,&lt;/strong&gt; &lt;a href="https://developers.google.com/search/docs/fundamentals/creating-helpful-content#eat" rel="noopener noreferrer"&gt;&lt;strong&gt;YMYL (Your Money or Your Life)&lt;/strong&gt;&lt;/a&gt; &lt;strong&gt;topics, and open-ended “wildcard” questions&lt;/strong&gt;, and analyzed the titles and snippets users actually see. I treated each result as a data point and mapped the emergent “information signatures” of each platform — without scoring truth or intent, only structure, emphasis, and source selection.&lt;/p&gt;

&lt;p&gt;What I found was far more than minor ranking variations. &lt;strong&gt;I actually found distinct “personalities” here.&lt;/strong&gt; Even when user intent was identical, the answers were structurally different. It’s almost as if reality fragments at the search-engine layer — and the version of the world you encounter depends on which engine you trust.&lt;/p&gt;

&lt;h2&gt;
  
  
  1. Domain Diversity vs Domain Authority
&lt;/h2&gt;

&lt;h3&gt;
  
  
  How Many Voices Does Each Engine Allow?
&lt;/h3&gt;

&lt;p&gt;The first signal is simple: how many distinct domains appear per query. On average, Yandex surfaced the most unique domains per query (~11.7), followed by DuckDuckGo (~9.4), Google (~8.8), and Bing (~6.2).&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6z6uclz48m6y1421jvbe.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6z6uclz48m6y1421jvbe.png" alt="A bar chart showing average unique domains per query, across all four search engines. Yandex leads with 11.69" width="630" height="298"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Average unique domains per query, across all four search engines.&lt;/p&gt;

&lt;p&gt;This isn’t a value judgment in and of itself, but it  &lt;em&gt;is&lt;/em&gt; revealing. Higher diversity suggests a broader sampling of sources; lower diversity implies tighter editorial consolidation.&lt;/p&gt;

&lt;p&gt;Bing’s comparatively low diversity hints at a strong preference for repeat “trusted” sources (and we’ll soon see just how deep  &lt;em&gt;that&lt;/em&gt; particular rabbit hole goes.) Yandex and DuckDuckGo, by contrast, appear structurally more exploratory — willing to surface a wider range of publishers, even if that comes at the cost of consistency.&lt;/p&gt;
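
&lt;p&gt;For reference, the diversity number itself is nothing fancy. Here’s a sketch of how it can be computed from the collected SERP rows, assuming one row per result with &lt;code&gt;engine&lt;/code&gt;, &lt;code&gt;query&lt;/code&gt;, and &lt;code&gt;domain&lt;/code&gt; columns (the CSV name and layout are my own):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import pandas as pd

# assumption: one row per SERP result, with engine, category, query, rank, domain columns
df = pd.read_csv("serp_results.csv")

# count unique domains per (engine, query), then average per engine
diversity = (
    df.groupby(["engine", "query"])["domain"]
      .nunique()
      .groupby("engine")
      .mean()
      .sort_values(ascending=False)
)
print(diversity)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
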

&lt;h3&gt;
  
  
  Who Gets to Speak, and When?
&lt;/h3&gt;

&lt;p&gt;Breaking sources down by domain type (.gov, .edu, .org, .com) reveals how engines encode authority differently — especially under YMYL conditions.&lt;/p&gt;

&lt;p&gt;In commercial queries, all engines overwhelmingly favored .com domains, with minimal variation.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmsf7rb1rnogfub5m5dvv.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmsf7rb1rnogfub5m5dvv.png" alt="a pie chart breakdown of domains cited by all four search engines — they all overwhelmingly favor .com domains, with minimal variation." width="800" height="778"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fksxwyfqmug8fsbpjs9mz.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fksxwyfqmug8fsbpjs9mz.png" width="800" height="820"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;While the split for commercial queries is as expected, YMYL content is where the differences really start showing. Google favors institutional sources overwhelmingly, while Yandex favors .com domains more.&lt;/p&gt;

&lt;p&gt;But for YMYL queries, big differences start showing up.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Google’s top results skew heavily toward government, academic, and nonprofit sources, with .gov/.edu/.org accounting for a majority of top-ranked content.&lt;/li&gt;
&lt;li&gt;Bing also leaned institutional, though less aggressively.&lt;/li&gt;
&lt;li&gt;DuckDuckGo and Yandex showed a noticeably higher tolerance for commercial and mixed-authority sources even in sensitive domains.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This suggests that “authoritativeness” is not a universal concept for search engines — it is defined + enforced differently depending on platform philosophy and risk posture.&lt;/p&gt;
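
&lt;p&gt;The domain-type split works the same way: bucket each result’s domain by its suffix, then tally shares per engine within a query category (again a sketch over the same assumed columns):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import pandas as pd

df = pd.read_csv("serp_results.csv")  # same assumed layout as the diversity sketch

def domain_type(domain: str) -&amp;gt; str:
    """bucket a domain by its top-level suffix: gov / edu / org / com / other"""
    tld = domain.rsplit(".", 1)[-1].lower()
    return tld if tld in {"gov", "edu", "org", "com"} else "other"

df["domain_type"] = df["domain"].map(domain_type)

# share of each domain type per engine, within e.g. the YMYL category
ymyl = df[df["category"] == "ymyl"]
shares = (
    ymyl.groupby("engine")["domain_type"]
        .value_counts(normalize=True)
        .unstack(fill_value=0)
        .round(3)
)
print(shares)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
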

&lt;p&gt;For domain type distribution, the Technical and Wildcard categories weren’t really interesting — they’re exactly as you’d expect, and similar to Commercial. I’m skipping those.&lt;/p&gt;

&lt;h2&gt;
  
  
  2. The Personalities of Search Engines
&lt;/h2&gt;

&lt;p&gt;By this point, it’s clear that search engines don’t just rank differently — they  &lt;em&gt;behave&lt;/em&gt;  differently. Each one shows a consistent “retrieval” personality, so to speak. By that, I mean a set of preferences about authority, risk, diversity, and acceptable sources of knowledge. These “personalities” show up across categories, queries, and metrics. Let’s look into each search engine and see what kind of personality they showed.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3q3ma8z5ck0jjz7a6bdj.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3q3ma8z5ck0jjz7a6bdj.png" alt="a bar chart showing the most frequently appearing domains across all 4 search engines. The top 5 are mayoclinic.org, stackoverflow.com, youtube.com, medium.com, and forbes.com" width="630" height="416"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Most frequently appearing domains across all 4 search engines.&lt;/p&gt;

&lt;h2&gt;
  
  
  Google: The Institutional Gatekeeper. For Better or Worse.
&lt;/h2&gt;

&lt;p&gt;For YMYL queries — health, finance, safety — Google behaves the least like a neutral party or index: it has clearly taken on the responsibility of showing only the most heavily weighted domains, no matter what.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fz4t2fr34lsun4tplcayn.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fz4t2fr34lsun4tplcayn.png" alt="bar chart showing top domains for google in commercial (reddit, pcmag, and rtings are the top 3), technical (medium.com, dev.to, stackoverflow.com are the top 3), ymyl (mayoclinic.org, fidelity.com, cdc.gov are the top 3), and wildcard queries (reddit, wikipedia, weather.com are the top 3)" width="800" height="508"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;A breakdown of Google’s top domains. There’s a healthy mix, if skewed towards institutional sources.&lt;/p&gt;

&lt;p&gt;Across YMYL searches,  &lt;strong&gt;61% of Google’s results come from .gov, .edu, or .org domains&lt;/strong&gt;, already a clear majority. But the strongest signal appears at the very top of the search results page. In the  &lt;strong&gt;top three positions, institutional sources account for 73.33% of results&lt;/strong&gt;, while commercial (.com) domains drop to just  &lt;strong&gt;16.67%&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Google doesn’t merely  &lt;em&gt;include&lt;/em&gt; institutional authority — it  &lt;em&gt;front-loads&lt;/em&gt;  it, especially where perceived risk is highest.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Query: “is coffee bad for you”&lt;/strong&gt;&lt;br&gt;
&lt;strong&gt;Google’s Top 3:&lt;/strong&gt;&lt;br&gt;
Mayo Clinic&lt;br&gt;
Johns Hopkins Medicine&lt;br&gt;
UT Southwestern Medical Center&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;In the top ten, seven results come from major medical institutions, with community or experiential sources appearing only after some sort of authoritative consensus has been established.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Needless to say, this pattern is not accidental.&lt;/strong&gt; It reflects a deliberate policy choice on Google’s part to privilege institutional authority under YMYL conditions. From Google’s POV, &lt;em&gt;obviously&lt;/em&gt; this is pretty rational risk management. At global scale, health and finance queries are not abstract knowledge problems; they are legal, political, and ethical liabilities for Google. Frontloading institutional sources lets them externalize that responsibility, stay in line with regulatory expectations, and minimize the chance of any catastrophic harm.&lt;/p&gt;

&lt;p&gt;But this stance comes with trade-offs.&lt;/p&gt;

&lt;p&gt;By systematically doing this, Google collapses disagreement into consensus and emergent knowledge into settled fact. Independent researchers, patient communities, and experience-driven perspectives are not necessarily excluded — but they  &lt;em&gt;are&lt;/em&gt; consistently  &lt;strong&gt;positionally suppressed&lt;/strong&gt;  in these search results pages, especially in the moments where users are most likely to stop scrolling.&lt;/p&gt;

&lt;p&gt;Over time, this creates a feedback loop. Institutional voices receive disproportionate visibility and legitimacy, alternative perspectives lose reach, and institutional consensus appears even more dominant.  &lt;strong&gt;The system always,  &lt;em&gt;always&lt;/em&gt; validates its own assumptions.&lt;/strong&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Why Google’s Approach Might Not Always Be a Good Idea.
&lt;/h3&gt;

&lt;p&gt;Consider that until very recently, established institutional authorities classified transgender identity as a psychiatric disorder. For example:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The  &lt;strong&gt;American Psychiatric Association&lt;/strong&gt;  included “Gender Identity Disorder” in earlier editions of the  &lt;strong&gt;Diagnostic and Statistical Manual of Mental Disorders&lt;/strong&gt;  (DSM-III, DSM-IV).&lt;/li&gt;
&lt;li&gt;Even the  &lt;strong&gt;World Health Organization&lt;/strong&gt;  (!!!) classified “transsexualism” as a mental disorder in the ICD until 2019, when it was moved out of the mental disorders chapter in  &lt;strong&gt;ICD-11&lt;/strong&gt;.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Sure, those classifications are  &lt;em&gt;now&lt;/em&gt;  widely regarded as wrong or harmful, including by the very institutions that once promoted them. But the correction did  &lt;strong&gt;not&lt;/strong&gt;  originate from top-down institutional consensus. Under Google’s YMYL framework, search results during that period would have overwhelmingly surfaced:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Established medical institutions&lt;/li&gt;
&lt;li&gt;Official diagnostic manuals&lt;/li&gt;
&lt;li&gt;Hospital systems and government health agencies&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Lived experiences from individuals, clinicians reporting mismatches between diagnosis and lived reality, early dissenting researchers — all of these would have been algorithmically deprioritized &lt;em&gt;by design&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;Not censored. Just plain buried.&lt;/p&gt;

&lt;p&gt;Anyway, according to Google, truth in high-stakes domains flows from recognized institutions downward. This approach is often defensible — and sometimes necessary — but it also narrows the epistemic frame.&lt;/p&gt;

&lt;p&gt;Reality, according to Google, is what  &lt;em&gt;institutions&lt;/em&gt; say it is.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;Bing: A Severe Monoculture Problem.&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;For most queries (especially technical and YMYL) Bing behaves like a consensus-finding machine instead of a search engine.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fetyh06816ow2oqsowmot.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fetyh06816ow2oqsowmot.png" alt="bar chart showing top domains for bing in commercial (nerdwallet, apple, and consumerreports.org are the top 3), technical (stackoverflow.com, microsoft.com, google.com are the top 3), ymyl (mayoclinic.org, aarp.org, wmo.int are the top 3), and wildcard queries (stackexchange.com, forbes.com, wikipedia.org are the top 3)" width="800" height="506"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Bing’s sources have worryingly low domain diversity.&lt;/p&gt;

&lt;p&gt;Bing has the  &lt;strong&gt;lowest domain diversity of any engine&lt;/strong&gt;  in the dataset, averaging just  &lt;strong&gt;6.18 unique domains per query&lt;/strong&gt;. That concentration becomes absurdly,  &lt;em&gt;comically&lt;/em&gt; extreme in technical searches, where  &lt;strong&gt;43.82% of Bing’s results come from StackOverflow alone&lt;/strong&gt;, as well as in YMYL queries where  &lt;strong&gt;31.73% of results link to Mayo Clinic&lt;/strong&gt;, with a small set of authoritative sources repeatedly resurfacing. Commercial results, similarly, always redirect to official manufacturer pages rather than any online store.&lt;/p&gt;

&lt;p&gt;Bing is just selecting upstream arbiters and then amplifying their voices.&lt;/p&gt;

&lt;p&gt;Where Google internally asks  &lt;em&gt;“Which institutions can absorb liability for this?”&lt;/em&gt; before answering, Bing considers something like:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;“Which sources already function as de facto answer-oracles for this topic?”&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;In technical domains, that oracle is  &lt;strong&gt;Stack Overflow&lt;/strong&gt;. In health, it’s  &lt;strong&gt;Mayo Clinic&lt;/strong&gt;. In ecosystems it overlaps with, it’s  &lt;strong&gt;Microsoft&lt;/strong&gt;  documentation — and, interestingly, even  &lt;strong&gt;Google&lt;/strong&gt;’s own docs when those have become canonical.&lt;/p&gt;

&lt;p&gt;Once a source has crossed some internal threshold of:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;recognizability&lt;/li&gt;
&lt;li&gt;historical correctness&lt;/li&gt;
&lt;li&gt;low controversy&lt;/li&gt;
&lt;li&gt;repeat citation elsewhere&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;…Bing appears to treat it as an  &lt;em&gt;answer sink&lt;/em&gt;. Queries collapse into references to the same few nodes.&lt;/p&gt;

&lt;p&gt;Like Google’s, this stance is at least understandable. Bing optimizes for reliability and predictability. Consensus sources are less likely to be wrong in obvious ways, less likely to trigger controversy, and easier to defend as “reasonable” choices, et cetera, et cetera. But unlike Google, which draws its authority boundary around established institutions, Bing draws it around  &lt;strong&gt;delegated arbiters&lt;/strong&gt;  — sources that have already been socially anointed as places where “the answer” lives.&lt;/p&gt;

&lt;p&gt;The cost is  &lt;strong&gt;monoculture&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;When a single source accounts for nearly half of all results in a category, it becomes a single point of epistemic failure. Take that StackOverflow number, for example. StackOverflow is invaluable — but it is also gameable, shaped by moderator norms and community culture (read: policing), and biased toward certain voices and problem framings.&lt;/p&gt;

&lt;p&gt;Its answers reflect StackOverflow’s community power structures just as much as technical correctness. Bing’s heavy reliance on it turns those biases into infrastructure.&lt;/p&gt;

&lt;p&gt;Bing’s worldview is conservative and convergent. According to Bing, truth is what most experts  &lt;em&gt;already&lt;/em&gt; agree on. It is a safe engine — but again, one that rarely surprises, and rarely surfaces the odd, insightful edge case.&lt;/p&gt;

&lt;h2&gt;
  
  
  DuckDuckGo: Privacy Focused, But Aggregator Heavy.
&lt;/h2&gt;

&lt;p&gt;DuckDuckGo presents itself as the anti-Google: private, independent, and user-first. Its information signature, though, tells a more nuanced story.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ftgl9f09i8p6pr2wtjefq.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ftgl9f09i8p6pr2wtjefq.png" alt="bar chart showing top domains for duckduckgo in commercial (nerdwallet, apple, and wego.com are the top 3), technical (medium.com, dev.to, geeksforgeeks.org are the top 3), ymyl (smartasset.com, mayoclinic.org, cdc.gov are the top 3), and wildcard queries (forbes, wikipedia, chicagotribune.com are the top 3)" width="800" height="501"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;DuckDuckGo’s sources, for the most part, mirror Google’s — except they have an aggregator spam problem.&lt;/p&gt;

&lt;p&gt;In YMYL queries,  &lt;strong&gt;~63% of DuckDuckGo’s results come from .gov, .edu, or .org domains&lt;/strong&gt;  — lower than Google’s ~70%, but still a strong institutional majority. In the top three positions, institutional sources remain dominant, though with slightly more room for commercial and mixed-authority content than Google allows.&lt;/p&gt;
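
&lt;p&gt;That ~63% figure is just a TLD check over the YMYL rows. Here’s a minimal sketch of it; note that the &lt;code&gt;category&lt;/code&gt; field is my own per-row bookkeeping (each record in my dataset is tagged with its query category), not something the engines return:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;// Minimal sketch: share of each engine's results coming from .gov / .edu / .org
// domains, for one query category. The TLD regex is a rough proxy for
// "institutional" sources; `category` is an assumed per-row tag from my dataset.
function institutionalShare(results, category) {
  const counts = {}; // engine: { institutional, total }

  for (const row of results) {
    if (row.category !== category) continue;
    const engine = row.search_engine;
    if (!counts[engine]) counts[engine] = { institutional: 0, total: 0 };
    counts[engine].total += 1;
    if (/\.(gov|edu|org)$/.test(row.source)) counts[engine].institutional += 1;
  }

  const shares = {};
  for (const engine of Object.keys(counts)) {
    shares[engine] = (100 * counts[engine].institutional) / counts[engine].total;
  }
  return shares; // roughly 63 for DuckDuckGo vs. 70 for Google on the YMYL set
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;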

&lt;p&gt;Where DuckDuckGo diverges is how it (frequently) tolerates  &lt;strong&gt;aggregators&lt;/strong&gt;. In YMYL finance queries for example,  &lt;strong&gt;SmartAsset appears 13 times&lt;/strong&gt;  — more than any domain on any other engine. Similar patterns appear with comparison and lead-generation sites across categories. DuckDuckGo  &lt;em&gt;loves&lt;/em&gt; listicles and aggregator spam.&lt;/p&gt;

&lt;p&gt;Take this single YMYL finance query. It’s a super boring question: vague, high-stakes, and normative. There is no single correct answer — only frameworks, heuristics, and trade-offs. That ambiguity makes it a useful probe.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;“best investment strategy for retirement”&lt;/strong&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;On  &lt;strong&gt;Google&lt;/strong&gt;, this query immediately collapses into institutional authority. The top results are dominated by Fidelity, Vanguard, Merrill Lynch, and the U.S. Department of Labor — large, regulated entities offering compliance-safe frameworks rather than concrete advice. The results page reads less like an answer to a question and more like a syllabus on investing. 😅&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;{
"title": "Investing in Retirement: 5 Tips for Managing Your Portfolio | Merrill Lynch",
"source": "ml.com",
"description": "Merrill Lynch highlights diversification as a key strategy for retirement, noting that combining stocks and bonds helps balance long-term growth and risk protection across market conditions.",
"search_string": "best investment strategy for retirement",
"search_engine": "google"
},

{
"title": "Building Retirement Income Strategies | Fidelity",
"source": "fidelity.com",
"description": "Fidelity outlines four retirement income strategy models: interest and dividends only, investment portfolio only, portfolio plus guarantees, and short-term strategies, offering retirees flexibility based on risk tolerance and income needs.",
"search_string": "best investment strategy for retirement",
"search_engine": "google"
},
{
"title": "Guide to saving for retirement - Vanguard",
"source": "investor.vanguard.com",
"description": "Vanguard provides a step-by-step retirement preparation plan including estimating expenses, selecting accounts, investing, maximizing contributions, and adjusting strategies over time to align with life goals.",
"search_string": "best investment strategy for retirement",
"search_engine": "google"
},
{
"title": "Put your savings in different types of investments | U.S. Department of Labor",
"source": "dol.gov",
"description": "The U.S. Department of Labor emphasizes diversification as a way to reduce risk and improve returns by spreading investments across different asset types, such as stocks, bonds, and real estate.",
"search_string": "best investment strategy for retirement",
"search_engine": "google"
},
{
"title": "Learn how to secure your future with the best retirement investments | Nuveen",
"source": "nuveen.com",
"description": "Nuveen offers guidance on selecting retirement strategies through a variety of accounts and funds, helping individuals find a tailored investment approach that fits their financial goals and risk profile.",
"search_string": "best investment strategy for retirement",
"search_engine": "google"
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Run the same query on  &lt;strong&gt;DuckDuckGo&lt;/strong&gt;, and instead of institutions, the results are saturated with aggregators and listicles. SmartAsset dominates the Top 5, alongside debt relief sites, comparison blogs, and SEO-optimized “top strategies” roundups. These pages are ultra-commercial, prescriptive, and aggressively optimized for conversion.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;{
"title": "10 Retirement Strategies You Need to Know",
"source": "smartasset.com",
"description": "From tax-advantaged accounts to annuities, there are several retirement strategies to consider when planning. Learn more here.",
"search_string": "best investment strategy for retirement",
"search_engine": "duckduckgo"
},
{
"title": "Smart Retirement Investments: Strategies to Consider in 2025",
"source": "nationaldebtrelief.com",
"description": "Discover the best investment strategies for retirement in 2025, including top retirement investment ideas and tips to secure your financial future.",
"search_string": "best investment strategy for retirement",
"search_engine": "duckduckgo"
},
{
"title": "Top 11 Retirement Strategies",
"source": "smartasset.com",
"description": "Learn about retirement strategies including tax-advantaged accounts, annuities, and more to help plan for a secure future.",
"search_string": "best investment strategy for retirement",
"search_engine": "duckduckgo"
},
{
"title": "10 Retirement Strategies You Need to Know",
"source": "smartasset.com",
"description": "From tax-advantaged accounts to annuities, there are several retirement strategies to consider when planning. Learn more here.",
"search_string": "best investment strategy for retirement",
"search_engine": "duckduckgo"
},
{
"title": "10 Retirement Strategies You Need to Know",
"source": "smartasset.com",
"description": "From tax-advantaged accounts to annuities, there are several retirement strategies to consider when planning. Learn more here.",
"search_string": "best investment strategy for retirement",
"search_engine": "duckduckgo"
},
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Yes,  &lt;em&gt;all&lt;/em&gt; those Smartasset dot com results are different articles!&lt;/p&gt;

&lt;p&gt;This is a paradox and a half, to be honest. 😅 An engine designed to minimize tracking exposure ends up routing users through heavily commercialized middlemen. The likely explanation is that they rely heavily on upstream indices and ranking signals, and just…don’t have the budget or infrastructure to throw at Google-tier editorial suppression.&lt;/p&gt;

&lt;p&gt;Ultimately, DuckDuckGo values privacy and avoids heavy-handed intervention — but that restraint allows monetization to leak into sensitive queries.&lt;/p&gt;

&lt;h2&gt;
  
  
  Yandex: “Chaotic Neutral” in Search Engine Form.
&lt;/h2&gt;

&lt;p&gt;If Google curates for institutional authority and Bing for the safe picks, Yandex curates for maximum exposure.  &lt;em&gt;At whatever cost.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fisc3zivbgb5og7ilwa20.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fisc3zivbgb5og7ilwa20.png" alt="bar chart showing top domains for yandex in commercial (forbes, youtube, and tomsguide.com are the top 3), technical (youtube.com, medium.com, dev.to  are the top 3), ymyl (mayoclinic.org, verywellhealth.com, medium.cm are the top 3), and wildcard queries (youtube.com, bbc.com, linkedin.com are the top 3)" width="800" height="490"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Yandex surfaces some truly out-there domains, dangerously so.&lt;/p&gt;

&lt;p&gt;Yandex has  &lt;strong&gt;the highest domain diversity of all engines&lt;/strong&gt;, averaging  &lt;strong&gt;11.69 unique domains per query&lt;/strong&gt;, and includes  &lt;strong&gt;164 domains that appear on no other platform&lt;/strong&gt;. It is the most pluralistic — and the least filtered — search engine in the dataset.&lt;/p&gt;
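
&lt;p&gt;The “appears on no other platform” count is a simple cross-engine set check. A minimal sketch, under the same assumed row shape as before:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;// Minimal sketch: for each engine, count domains that no other engine surfaced.
function engineExclusiveDomains(results) {
  const enginesPerDomain = {}; // domain: Set of engines that returned it

  for (const row of results) {
    if (!enginesPerDomain[row.source]) enginesPerDomain[row.source] = new Set();
    enginesPerDomain[row.source].add(row.search_engine);
  }

  const exclusiveCounts = {}; // engine: number of domains seen nowhere else
  for (const domain of Object.keys(enginesPerDomain)) {
    const engines = enginesPerDomain[domain];
    if (engines.size === 1) {
      const engine = [...engines][0];
      exclusiveCounts[engine] = (exclusiveCounts[engine] || 0) + 1;
    }
  }
  return exclusiveCounts; // Yandex tops this count in the dataset
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;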

&lt;p&gt;Consider YMYL-adjacent queries like:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;“401k withdrawal rules”&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;“best running shoes”&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;“top rated wireless earbuds”&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;In each case, Yandex surfaces results hosted on opaque  &lt;strong&gt;CloudFront subdomains&lt;/strong&gt;  — long, random-looking URLs that clearly do not represent publishers, brands, or even stable sites. These pages mimic legitimate listicles (“Tested &amp;amp; Rated,” “Best of 2025”) but provide no clear authorship, organization, or accountability. I even found a CloudFront-hosted page that effectively impersonated  &lt;strong&gt;SmartAsset&lt;/strong&gt;  content without being SmartAsset at all.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="w"&gt;  
  &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;  
    &lt;/span&gt;&lt;span class="nl"&gt;"title"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"401(k) Tax Rules: Withdrawals, Deductions &amp;amp; More"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;  
    &lt;/span&gt;&lt;span class="nl"&gt;"source"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"dr5dymrsxhdzh.cloudfront.net"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;  
    &lt;/span&gt;&lt;span class="nl"&gt;"description"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Unlike traditional 401(k) plans, Roth 401(k) accounts are funded with post-tax contributions, which means withdrawals can be taken tax-free if certain conditions are met… SmartAsset: 401(k) Tax Rules on Withdrawals, Deductions &amp;amp; More."&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;  
    &lt;/span&gt;&lt;span class="nl"&gt;"search_string"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"401k withdrawal rules"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;  
    &lt;/span&gt;&lt;span class="nl"&gt;"search_engine"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"yandex"&lt;/span&gt;&lt;span class="w"&gt;  
  &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;  
  &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;  
    &lt;/span&gt;&lt;span class="nl"&gt;"title"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"10 Best Running Shoes of 2025 | Tested &amp;amp; Rated"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;  
    &lt;/span&gt;&lt;span class="nl"&gt;"source"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"d1nymbkeomeoqg.cloudfront.net"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;  
    &lt;/span&gt;&lt;span class="nl"&gt;"description"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"A great pair of running shoes brings with it the promise of a new day, a fresh run, and a better you, no matter what is happening in the world at large."&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;  
    &lt;/span&gt;&lt;span class="nl"&gt;"search_string"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"best running shoes"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;  
    &lt;/span&gt;&lt;span class="nl"&gt;"search_engine"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"yandex"&lt;/span&gt;&lt;span class="w"&gt;  
  &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;  
  &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;  
    &lt;/span&gt;&lt;span class="nl"&gt;"title"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Best Wireless Earbuds of 2025 | Tested &amp;amp; Rated"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;  
    &lt;/span&gt;&lt;span class="nl"&gt;"source"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"djd1xqjx2kdnv.cloudfront.net"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;  
    &lt;/span&gt;&lt;span class="nl"&gt;"description"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Wireless earbuds-the ear tip seal on the vibe beam is good, allowing a more immersive…"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;  
    &lt;/span&gt;&lt;span class="nl"&gt;"search_string"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"top rated wireless earbuds"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;  
    &lt;/span&gt;&lt;span class="nl"&gt;"search_engine"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"yandex"&lt;/span&gt;&lt;span class="w"&gt;  
  &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;  
&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Google and Bing suppress this class of result almost entirely. Yandex does not.&lt;/p&gt;
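
&lt;p&gt;Spotting this class of result doesn’t take anything sophisticated; a crude hostname check on the &lt;code&gt;source&lt;/code&gt; field gets you most of the way. A minimal sketch:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;// Minimal sketch: flag results whose "publisher" is a bare CloudFront hostname
// rather than a real domain. All three Yandex examples above would match.
function flagCdnHosted(results) {
  return results.filter(function (row) {
    return row.source.endsWith('.cloudfront.net'); // deliberately crude check
  });
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;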

&lt;p&gt;The same permissiveness appears throughout its YMYL handling. In health queries, Medium posts, lifestyle blogs, and YouTube videos routinely show up alongside institutional medical sources. In finance, LinkedIn posts from entirely random users appear alongside Forbes and Investopedia. In commerce, sites impersonating entirely different brands, affiliate spam, thin comparison pages, and reputable reviews coexist with minimal differentiation.&lt;/p&gt;

&lt;p&gt;This is not accidental. It’s a totally coherent — if extreme — philosophy.&lt;/p&gt;

&lt;p&gt;Yandex treats the web as inherently chaotic and contested. It does not even  &lt;em&gt;try&lt;/em&gt; to decide which sources are legitimate, which voices are authoritative, or which domains are accountable. It’s designed to maximize exposure and variety, and leave evaluation entirely to the user. Authority is just… optional.&lt;/p&gt;

&lt;p&gt;The upside of this approach is real. Yandex is exceptionally good at surfacing:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;new emerging perspectives/early-stage knowledge&lt;/li&gt;
&lt;li&gt;unofficial but insightful explanations&lt;/li&gt;
&lt;li&gt;non-canonical voices (not always a good thing)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The downside is equally real, and…I’m not sure the tradeoff is worth it? By refusing to draw hard boundaries around authorship, institutional responsibility,  &lt;strong&gt;or even basic site legitimacy,&lt;/strong&gt;  Yandex allows misinformation, impersonation, scams, and noise to blend seamlessly into the same epistemic surface as genuine expertise.&lt;/p&gt;

&lt;p&gt;This is  &lt;a href="https://en.wikipedia.org/wiki/Alignment_(Dungeons_%26_Dragons)#Chaotic_neutral" rel="noopener noreferrer"&gt;chaotic neutral&lt;/a&gt;  in its purest form. 😅 Yandex does not protect the user from bad information, nor does it meaningfully privilege good information. It assumes a high-competence, high-skepticism user — and silently punishes anyone who lacks those priors.&lt;/p&gt;

&lt;p&gt;In a search ecosystem increasingly defined by guardrails and liability management, Yandex stands apart as the wild west by design. You  &lt;em&gt;almost&lt;/em&gt; have to admire it.&lt;/p&gt;

&lt;h2&gt;
  
  
  3. Search Engines Vary Wildly in Figuring Out What the User Meant
&lt;/h2&gt;

&lt;p&gt;Ranking differences are easy to see. Intent inference differences are harder — because they happen  &lt;em&gt;before&lt;/em&gt;  ranking begins.&lt;/p&gt;

&lt;p&gt;The clearest way to surface them is to hold the query constant and observe how each engine silently answers a  &lt;em&gt;different question&lt;/em&gt;.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Query: “should I move abroad?”&lt;/strong&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;This isn’t a factual lookup, but a decision under uncertainty.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Google&lt;/strong&gt;  treats it as a decision-support problem, so you get cost-of-living comparisons, healthcare access, visa rules, and lived-experience discussions. Google assumes the user wants help weighing trade-offs.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Bing&lt;/strong&gt;  interprets the same query as an informational lookup. It surfaces immigration portals, official statistics, and formal descriptions of process. Bing assumes the user wants to understand the rules, not make the choice.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;DuckDuckGo&lt;/strong&gt;  treats it as a research aggregation task, surfacing blog posts and “pros and cons” articles without strong framing. It assumes exploration without guidance.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Yandex&lt;/strong&gt;  treats it as open discourse, surfacing personal narratives, YouTube explainers, and expat forums. It assumes the user wants to see the argument, not the answer.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Same query, but with four different inferred goals. The pattern repeats across very different intents.&lt;/p&gt;

&lt;p&gt;For  &lt;strong&gt;“how to learn Spanish”&lt;/strong&gt;, Google assumes the user wants a structured plan, Bing assumes understanding precedes execution, DuckDuckGo assumes tool discovery, and Yandex assumes learning happens socially and visually.&lt;/p&gt;

&lt;p&gt;For  &lt;strong&gt;“is intermittent fasting safe”&lt;/strong&gt;, Google frames risk conservatively, Bing collapses onto a narrow institutional consensus, DuckDuckGo allows mixed advice, and Yandex surfaces disagreement directly.&lt;/p&gt;

&lt;h3&gt;
  
  
  What this reveals
&lt;/h3&gt;

&lt;p&gt;Intent inference is less about correctness, more about  &lt;strong&gt;what kind of answer the system believes the user  &lt;em&gt;deserves&lt;/em&gt;.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Before ranking begins, each engine quietly decides:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Is this a decision or a lookup?&lt;/li&gt;
&lt;li&gt;Is subjectivity acceptable?&lt;/li&gt;
&lt;li&gt;Should disagreement be surfaced or suppressed?&lt;/li&gt;
&lt;li&gt;Is safety more important than exploration?&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Those choices differ by platform — and they shape the user’s reality far more than ranking tweaks ever could. Even if you could somehow optimize around search engine bias, the decision about what the question is  &lt;em&gt;for&lt;/em&gt;  gets made before any ranking algorithm kicks in, and it matters just as much.&lt;/p&gt;

&lt;h2&gt;
  
  
  4. YouTube as a Source: Yay or Nay?
&lt;/h2&gt;

&lt;p&gt;How a search engine treats YouTube is also a reliable signal of how it defines  &lt;em&gt;knowledge&lt;/em&gt;  itself. Across identical queries, engines don’t just rank video differently — they disagree on whether video is a legitimate way of knowing,  &lt;em&gt;at all&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgj2ylgn1i9i9latb8t87.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgj2ylgn1i9i9latb8t87.png" alt="a bar chart showing the distribution of YouTube.com as a source for each search engine. Yandex leads the pack with 5.71%, Google (1.81%) and Bing (1.26%) are next, while DuckDuckGo showed 0% YouTube.com results." width="630" height="313"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Percentage of all 1630 search results containing &lt;a href="https://youtube.com" rel="noopener noreferrer"&gt;youtube.com&lt;/a&gt; as a source, by search engine.&lt;/p&gt;
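
&lt;p&gt;The chart above is the same counting exercise as the diversity numbers, just with a different predicate: the share of each engine’s results whose source is youtube.com. A minimal sketch, same assumed row shape:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;// Minimal sketch: percentage of each engine's results sourced from youtube.com.
function youtubeShare(results) {
  const counts = {}; // engine: { youtube, total }

  for (const row of results) {
    const engine = row.search_engine;
    if (!counts[engine]) counts[engine] = { youtube: 0, total: 0 };
    counts[engine].total += 1;
    if (row.source === 'youtube.com' || row.source === 'www.youtube.com') {
      counts[engine].youtube += 1;
    }
  }

  const shares = {};
  for (const engine of Object.keys(counts)) {
    shares[engine] = (100 * counts[engine].youtube) / counts[engine].total;
  }
  return shares; // Yandex around 5.71, Google 1.81, Bing 1.26, DuckDuckGo 0
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;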

&lt;p&gt;Take this query, for example.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Query: “react useEffect cleanup”&lt;/strong&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;This is a technical + procedural question.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Google&lt;/strong&gt;  surfaces YouTube sparingly and late, after official documentation and written explanations. Video is supplementary — useful for intuition, not authority.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Bing&lt;/strong&gt;  shows a similar distribution to Google.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;DuckDuckGo&lt;/strong&gt;  didn’t have  &lt;em&gt;any&lt;/em&gt; YouTube videos in the results themselves.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Yandex&lt;/strong&gt;  places YouTube front and center. Video tutorials routinely appear alongside — or ahead of — written documentation.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Same query, different epistemologies.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Query: “how to learn Spanish”&lt;/strong&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Now the task is experiential.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Google&lt;/strong&gt; and  &lt;strong&gt;Bing&lt;/strong&gt; still emphasize structured programs and institutional guides, with video as an aid.  &lt;strong&gt;DuckDuckGo&lt;/strong&gt; shows none (its responses do include a separate video field, but that isn’t part of the organic search results themselves).  &lt;strong&gt;Yandex&lt;/strong&gt;, by contrast, treats YouTube as the  &lt;em&gt;primary&lt;/em&gt; learning surface! Immersion videos, informal teachers, and community-driven instruction dominate.&lt;/p&gt;

&lt;p&gt;Surprisingly, this continues to be a thing even in higher-risk queries.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Query: “is intermittent fasting safe”&lt;/strong&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;strong&gt;Google&lt;/strong&gt; and  &lt;strong&gt;Bing&lt;/strong&gt; absolutely suppress YouTube here in favor of institutional medical sources.  &lt;strong&gt;Yandex&lt;/strong&gt; though, continues to include YouTube prominently in the results, even alongside clinics and health authorities.&lt;/p&gt;

&lt;p&gt;Again — a policy difference, not a ranking algorithm difference.&lt;/p&gt;

&lt;p&gt;For Google and Bing, YouTube is a didactic aid. For Yandex, it’s apparently a form of experiential knowledge in its own right.&lt;/p&gt;

&lt;p&gt;Yandex seems to be making a call here on whether  &lt;em&gt;demonstration counts as explanation&lt;/em&gt;, whether personality and persuasion are acceptable components of understanding, and whether authority must be textual to be trusted.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdkah6vjt54yefdsas62a.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdkah6vjt54yefdsas62a.png" alt="histogram showing a breakdown of Youtube.com listed as a source, by category of search query, for each search engine. Yandex leads in all four categories." width="630" height="334"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Breakdown of &lt;a href="https://Youtube.com" rel="noopener noreferrer"&gt;Youtube.com&lt;/a&gt; listed as a source, by category of search query.&lt;/p&gt;

&lt;p&gt;For many modern users — developers, learners, non-native speakers — YouTube is how knowledge is acquired (“visual learning”, etc.). Engines that demote video implicitly privilege certain learning styles and literacies. Engines that surface  &lt;em&gt;more&lt;/em&gt; YouTube videos accept higher risk in exchange for accessibility.&lt;/p&gt;

&lt;p&gt;Once again, the difference is not about ranking quality. It’s about  &lt;strong&gt;what kinds of  &lt;em&gt;knowing&lt;/em&gt; are allowed to count&lt;/strong&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  5. Consensus vs Fragmentation: Just How Shared &lt;em&gt;Is&lt;/em&gt; the Web?
&lt;/h2&gt;

&lt;p&gt;One of the most striking findings in this analysis is not how differently search engines rank results — but how little they agree on  &lt;em&gt;which sources matter at all&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;Across every query category, the overlap between Google, Bing, DuckDuckGo, and Yandex is vanishingly small. Only  &lt;strong&gt;~2–3% of domains appear in results from all four engines&lt;/strong&gt;, regardless of category. In practical terms, that means fewer than three out of every hundred sources are treated as universally relevant across the modern search ecosystem.&lt;/p&gt;

&lt;p&gt;The rest of the web is fragmented.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8vf1n5a2nunwxjloipvj.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8vf1n5a2nunwxjloipvj.png" alt="histogram showing the number of sources that only appear in a single search engine’s results. Yandex leads with 171, while Google is second with 134. Bing with its monoculture problem lags behind in last place with 78" width="593" height="327"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Depending on category, between 68.6% and 91.3% of domains appear on only one engine:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Commercial:&lt;/strong&gt; 87.68% unique&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Technical:&lt;/strong&gt; 81.05% unique&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;YMYL:&lt;/strong&gt; 91.33% unique&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Wildcard:&lt;/strong&gt; 68.57% unique&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Measured by category, the percentage of domains shared across all four engines is:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Commercial queries:&lt;/strong&gt;  &lt;strong&gt;1.9%&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Technical queries:&lt;/strong&gt;  &lt;strong&gt;1.96%&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;YMYL queries:&lt;/strong&gt;  &lt;strong&gt;2.89%&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Wildcard queries:&lt;/strong&gt;  &lt;strong&gt;3.33%&lt;/strong&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Even in the  &lt;em&gt;best&lt;/em&gt;  case — open-ended wildcard searches — over  &lt;strong&gt;96% of domains are  &lt;em&gt;not&lt;/em&gt;  shared&lt;/strong&gt;  across engines. There is no category in which a meaningful consensus emerges.&lt;/p&gt;
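
&lt;p&gt;Both overlap numbers (the share of domains unique to a single engine, and the share shared by all four) fall out of one pass over a domain-to-engines map. A minimal sketch, again assuming each row carries a &lt;code&gt;category&lt;/code&gt; tag from my own dataset:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;// Minimal sketch: cross-engine domain overlap for one query category.
// `category` is an assumed per-row tag from my dataset, not an engine field.
function domainOverlap(results, category) {
  const enginesPerDomain = {}; // domain: Set of engines that surfaced it

  for (const row of results) {
    if (row.category !== category) continue;
    if (!enginesPerDomain[row.source]) enginesPerDomain[row.source] = new Set();
    enginesPerDomain[row.source].add(row.search_engine);
  }

  const domains = Object.keys(enginesPerDomain);
  let singleEngine = 0;
  let allFour = 0;
  for (const domain of domains) {
    const engineCount = enginesPerDomain[domain].size;
    if (engineCount === 1) singleEngine += 1;
    if (engineCount === 4) allFour += 1;
  }

  return {
    uniqueToOneEngine: (100 * singleEngine) / domains.length, // e.g. 91.33 for YMYL
    sharedByAllFour: (100 * allFour) / domains.length         // e.g. 2.89 for YMYL
  };
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;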

&lt;p&gt;YMYL queries — where one might expect the strongest agreement — are actually the  &lt;strong&gt;most fragmented of all&lt;/strong&gt;. While engines like Google and Bing enforce strong institutional filters, DuckDuckGo and Yandex allow a much broader — and sometimes riskier — set of sources. The result is not a shared reality with minor ranking differences, but  &lt;strong&gt;fundamentally different epistemic environments&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;This is just a core, structural property of modern search.&lt;/p&gt;

&lt;p&gt;If there were such a thing as a canonical “top of the web,” we would expect to see convergence: a stable core of sources that every engine agrees are authoritative, especially in high-stakes domains. Instead, the data shows the opposite.&lt;/p&gt;

&lt;p&gt;Each engine constructs its own sourcing universe, with only a thin sliver of shared ground. What qualifies as “authoritative,” “relevant,” or even “acceptable” varies dramatically depending on which engine you use.&lt;/p&gt;

&lt;p&gt;Two users asking the same health or finance question on different engines are not being guided toward the same pool of knowledge. They are being pointed at  &lt;strong&gt;different universes of sources&lt;/strong&gt;.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;em&gt;💡&lt;/em&gt;  This fragmentation does  &lt;em&gt;not&lt;/em&gt;  mean that search engines are broken or irrational. It just means they are optimizing for different objectives:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Legal defensibility&lt;/li&gt;
&lt;li&gt;Risk tolerance&lt;/li&gt;
&lt;li&gt;Diversity vs safety&lt;/li&gt;
&lt;li&gt;Consensus vs exploration&lt;/li&gt;
&lt;li&gt;Regional and infrastructural constraints&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Each objective prunes the web differently, and our overlap statistics show that these pruning strategies rarely agree.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;That’s all the results. Thank you for reading! If you want to know about my Methodology, read on.&lt;/p&gt;

&lt;h2&gt;
  
  
  Methodology
&lt;/h2&gt;

&lt;p&gt;All observations in this analysis are based on direct SERP retrieval, not third-party summaries or clickstream data.&lt;/p&gt;

&lt;p&gt;Queries were issued programmatically to four search engines — Google, Bing, DuckDuckGo, and Yandex — using a Node script that fetches raw results pages via a SERP API (&lt;a href="https://get.brightdata.com/scraping-browser-acf6883?utm_content=why_google_bing_duckduckgo_and_yandex_show_different_results_for_the_same_query_2026" rel="noopener noreferrer"&gt;Bright Data&lt;/a&gt; here, the one I had access to; &lt;a href="https://get.brightdata.com/lp-scraping-browser-acf1964?utm_content=why_google_bing_duckduckgo_and_yandex_show_different_results_for_the_same_query_2026" rel="noopener noreferrer"&gt;docs here&lt;/a&gt;). Visualizations were generated later via &lt;a href="https://d3js.org/" rel="noopener noreferrer"&gt;D3.js&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://get.brightdata.com/bd-serp-api?utm_content=why_google_bing_duckduckgo_and_yandex_show_different_results_for_the_same_query_2026" rel="noopener noreferrer"&gt;SERP API - SERP Scraper API - Free Trial&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://d3js.org/" rel="noopener noreferrer"&gt;D3 - The JavaScript library for bespoke data visualization&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The script does not simulate user interaction, personalization, or logged-in sessions. Each engine is queried with the same plain-text search string.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;path&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;require&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;path&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="nf"&gt;require&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;dotenv&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;config&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt; &lt;span class="na"&gt;path&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;path&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;join&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;__dirname&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;../../.env&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;});&lt;/span&gt;

&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;fetch&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;require&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;node-fetch&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;fs&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;require&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;fs-extra&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

&lt;span class="c1"&gt;// you need to sign up at bright data to get these values&lt;/span&gt;
&lt;span class="c1"&gt;// https://brightdata.com/cp/setting/users&lt;/span&gt;
&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;CONFIG&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="na"&gt;apiToken&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;process&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;env&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;BRIGHT_DATA_API_TOKEN&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="na"&gt;zone&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;process&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;env&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;BRIGHT_DATA_ZONE&lt;/span&gt; &lt;span class="o"&gt;||&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;serp_api1&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="na"&gt;maxResults&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nf"&gt;parseInt&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;process&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;env&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;MAX_RESULTS_PER_ENGINE&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;||&lt;/span&gt; &lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="na"&gt;apiUrl&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;https://api.brightdata.com/request&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;
&lt;span class="p"&gt;};&lt;/span&gt;

&lt;span class="c1"&gt;// search engine config&lt;/span&gt;
&lt;span class="c1"&gt;// as of Jan 2026, Bright data supports Google, Bing, DuckDuckGo, Yandex, Baidu, Yahoo, &amp;amp; Naver&lt;/span&gt;
&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;ENGINES&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="na"&gt;google&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;Google&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="na"&gt;buildUrl&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;query&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="s2"&gt;`https://www.google.com/search?q=&lt;/span&gt;&lt;span class="p"&gt;${&lt;/span&gt;&lt;span class="nf"&gt;encodeURIComponent&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;query&lt;/span&gt;&lt;span class="p"&gt;)}&lt;/span&gt;&lt;span class="s2"&gt;`&lt;/span&gt;
  &lt;span class="p"&gt;},&lt;/span&gt;
  &lt;span class="na"&gt;bing&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;Bing&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="na"&gt;buildUrl&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;query&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="s2"&gt;`https://www.bing.com/search?q=&lt;/span&gt;&lt;span class="p"&gt;${&lt;/span&gt;&lt;span class="nf"&gt;encodeURIComponent&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;query&lt;/span&gt;&lt;span class="p"&gt;)}&lt;/span&gt;&lt;span class="s2"&gt;`&lt;/span&gt;
  &lt;span class="p"&gt;},&lt;/span&gt;
  &lt;span class="na"&gt;duckduckgo&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;DuckDuckGo&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="na"&gt;buildUrl&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;query&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="s2"&gt;`https://duckduckgo.com/?q=&lt;/span&gt;&lt;span class="p"&gt;${&lt;/span&gt;&lt;span class="nf"&gt;encodeURIComponent&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;query&lt;/span&gt;&lt;span class="p"&gt;)}&lt;/span&gt;&lt;span class="s2"&gt;`&lt;/span&gt;
  &lt;span class="p"&gt;},&lt;/span&gt;
  &lt;span class="na"&gt;yandex&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;Yandex&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="na"&gt;buildUrl&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;query&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="s2"&gt;`https://www.yandex.com/search/?text=&lt;/span&gt;&lt;span class="p"&gt;${&lt;/span&gt;&lt;span class="nf"&gt;encodeURIComponent&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;query&lt;/span&gt;&lt;span class="p"&gt;)}&lt;/span&gt;&lt;span class="s2"&gt;`&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;};&lt;/span&gt;

&lt;span class="c1"&gt;// default query categories - all 40 queries from queries.js&lt;/span&gt;
&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;DEFAULT_QUERIES&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
  &lt;span class="c1"&gt;// commercial (10)&lt;/span&gt;
  &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;query&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;best password manager 2025&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="na"&gt;category&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;commercial&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt; &lt;span class="p"&gt;},&lt;/span&gt;
  &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;query&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;buy iphone 15&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="na"&gt;category&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;commercial&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt; &lt;span class="p"&gt;},&lt;/span&gt;
  &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;query&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;best running shoes&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="na"&gt;category&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;commercial&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt; &lt;span class="p"&gt;},&lt;/span&gt;
  &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;query&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;cheapest web hosting&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="na"&gt;category&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;commercial&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt; &lt;span class="p"&gt;},&lt;/span&gt;
  &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;query&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;top rated mattress&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="na"&gt;category&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;commercial&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt; &lt;span class="p"&gt;},&lt;/span&gt;
  &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;query&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;best credit card for travel&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="na"&gt;category&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;commercial&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt; &lt;span class="p"&gt;},&lt;/span&gt;
  &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;query&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;cheap flights to europe&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="na"&gt;category&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;commercial&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt; &lt;span class="p"&gt;},&lt;/span&gt;
  &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;query&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;best laptop for programming&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="na"&gt;category&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;commercial&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt; &lt;span class="p"&gt;},&lt;/span&gt;
  &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;query&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;top rated wireless earbuds&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="na"&gt;category&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;commercial&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt; &lt;span class="p"&gt;},&lt;/span&gt;
  &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;query&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;best car insurance rates&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="na"&gt;category&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;commercial&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt; &lt;span class="p"&gt;},&lt;/span&gt;
  &lt;span class="c1"&gt;// technical (10)&lt;/span&gt;
  &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;query&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;react useEffect cleanup&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="na"&gt;category&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;technical&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt; &lt;span class="p"&gt;},&lt;/span&gt;
  &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;query&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;postgresql connection pooling&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="na"&gt;category&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;technical&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt; &lt;span class="p"&gt;},&lt;/span&gt;
  &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;query&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;nodejs memory leak debugging&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="na"&gt;category&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;technical&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt; &lt;span class="p"&gt;},&lt;/span&gt;
  &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;query&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;tailwind vs vanilla css&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="na"&gt;category&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;technical&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt; &lt;span class="p"&gt;},&lt;/span&gt;
  &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;query&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;docker compose volumes&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="na"&gt;category&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;technical&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt; &lt;span class="p"&gt;},&lt;/span&gt;
  &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;query&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;kubernetes pod restart policy&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="na"&gt;category&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;technical&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt; &lt;span class="p"&gt;},&lt;/span&gt;
  &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;query&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;javascript async await best practices&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="na"&gt;category&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;technical&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt; &lt;span class="p"&gt;},&lt;/span&gt;
  &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;query&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;git rebase vs merge&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="na"&gt;category&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;technical&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt; &lt;span class="p"&gt;},&lt;/span&gt;
  &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;query&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;typescript generic constraints&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="na"&gt;category&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;technical&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt; &lt;span class="p"&gt;},&lt;/span&gt;
  &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;query&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;redis cache invalidation strategies&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="na"&gt;category&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;technical&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt; &lt;span class="p"&gt;},&lt;/span&gt;
  &lt;span class="c1"&gt;// ymyl (10)&lt;/span&gt;
  &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;query&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;is coffee bad for you&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="na"&gt;category&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;ymyl&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt; &lt;span class="p"&gt;},&lt;/span&gt;
  &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;query&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;climate change causes&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="na"&gt;category&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;ymyl&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt; &lt;span class="p"&gt;},&lt;/span&gt;
  &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;query&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;covid vaccine side effects&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="na"&gt;category&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;ymyl&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt; &lt;span class="p"&gt;},&lt;/span&gt;
  &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;query&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;401k withdrawal rules&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="na"&gt;category&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;ymyl&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt; &lt;span class="p"&gt;},&lt;/span&gt;
  &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;query&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;adhd symptoms adults&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="na"&gt;category&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;ymyl&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt; &lt;span class="p"&gt;},&lt;/span&gt;
  &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;query&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;how to lower cholesterol naturally&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="na"&gt;category&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;ymyl&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt; &lt;span class="p"&gt;},&lt;/span&gt;
  &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;query&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;social security benefits calculator&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="na"&gt;category&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;ymyl&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt; &lt;span class="p"&gt;},&lt;/span&gt;
  &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;query&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;is intermittent fasting safe&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="na"&gt;category&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;ymyl&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt; &lt;span class="p"&gt;},&lt;/span&gt;
  &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;query&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;best investment strategy for retirement&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="na"&gt;category&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;ymyl&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt; &lt;span class="p"&gt;},&lt;/span&gt;
  &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;query&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;symptoms of diabetes type 2&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="na"&gt;category&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;ymyl&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt; &lt;span class="p"&gt;},&lt;/span&gt;
  &lt;span class="c1"&gt;// wildcard (10)&lt;/span&gt;
  &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;query&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;pizza restaurants chicago&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="na"&gt;category&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;wildcard&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt; &lt;span class="p"&gt;},&lt;/span&gt;
  &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;query&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;weather in london&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="na"&gt;category&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;wildcard&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt; &lt;span class="p"&gt;},&lt;/span&gt;
  &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;query&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;ukraine russia conflict timeline&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="na"&gt;category&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;wildcard&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt; &lt;span class="p"&gt;},&lt;/span&gt;
  &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;query&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;how to start a business&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="na"&gt;category&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;wildcard&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt; &lt;span class="p"&gt;},&lt;/span&gt;
  &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;query&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;best cities to live in 2025&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="na"&gt;category&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;wildcard&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt; &lt;span class="p"&gt;},&lt;/span&gt;
  &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;query&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;what is artificial intelligence&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="na"&gt;category&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;wildcard&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt; &lt;span class="p"&gt;},&lt;/span&gt;
  &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;query&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;how to learn spanish&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="na"&gt;category&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;wildcard&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt; &lt;span class="p"&gt;},&lt;/span&gt;
  &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;query&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;most popular video games 2025&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="na"&gt;category&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;wildcard&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt; &lt;span class="p"&gt;},&lt;/span&gt;
  &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;query&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;best new artists 2025&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="na"&gt;category&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;wildcard&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt; &lt;span class="p"&gt;},&lt;/span&gt;
  &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;query&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;top box office movies 2025&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="na"&gt;category&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;wildcard&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;];&lt;/span&gt;

&lt;span class="c1"&gt;// data directory for saving results&lt;/span&gt;
&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;DATA_DIR&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;path&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;join&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;__dirname&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;data&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;RAW_DIR&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;path&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;join&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;DATA_DIR&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;raw&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

&lt;span class="c1"&gt;// ensure data directory exists&lt;/span&gt;
&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="kd"&gt;function&lt;/span&gt; &lt;span class="nf"&gt;setup&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;fs&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;ensureDir&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;RAW_DIR&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="c1"&gt;// fetch results from a single search engine&lt;/span&gt;
&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="kd"&gt;function&lt;/span&gt; &lt;span class="nf"&gt;fetchEngine&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;engineKey&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;query&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;engine&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;ENGINES&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nx"&gt;engineKey&lt;/span&gt;&lt;span class="p"&gt;];&lt;/span&gt;
  &lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;!&lt;/span&gt;&lt;span class="nx"&gt;engine&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;throw&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;Error&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;`Unknown engine: &lt;/span&gt;&lt;span class="p"&gt;${&lt;/span&gt;&lt;span class="nx"&gt;engineKey&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;`&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;

  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;searchUrl&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;engine&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;buildUrl&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;query&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;isGoogle&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;engineKey&lt;/span&gt; &lt;span class="o"&gt;===&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;google&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

  &lt;span class="nx"&gt;console&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;log&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;`  [&lt;/span&gt;&lt;span class="p"&gt;${&lt;/span&gt;&lt;span class="nx"&gt;engine&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;name&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;] Fetching: &lt;/span&gt;&lt;span class="p"&gt;${&lt;/span&gt;&lt;span class="nx"&gt;query&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;`&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;requestBody&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="na"&gt;zone&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;CONFIG&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;zone&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="na"&gt;url&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;searchUrl&lt;/span&gt;
  &lt;span class="p"&gt;};&lt;/span&gt;

  &lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;isGoogle&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="nx"&gt;requestBody&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;format&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;json&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="nx"&gt;requestBody&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;data_format&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;parsed_light&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="k"&gt;else&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="nx"&gt;requestBody&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;format&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;raw&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="nx"&gt;requestBody&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;data_format&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;markdown&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;

  &lt;span class="k"&gt;try&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nf"&gt;fetch&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;CONFIG&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;apiUrl&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
      &lt;span class="na"&gt;method&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;POST&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
      &lt;span class="na"&gt;headers&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;Authorization&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;`Bearer &lt;/span&gt;&lt;span class="p"&gt;${&lt;/span&gt;&lt;span class="nx"&gt;CONFIG&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;apiToken&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;`&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;Content-Type&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;application/json&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;
      &lt;span class="p"&gt;},&lt;/span&gt;
      &lt;span class="na"&gt;body&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;JSON&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;stringify&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;requestBody&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;});&lt;/span&gt;

    &lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;!&lt;/span&gt;&lt;span class="nx"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;ok&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
      &lt;span class="k"&gt;throw&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;Error&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;`HTTP &lt;/span&gt;&lt;span class="p"&gt;${&lt;/span&gt;&lt;span class="nx"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;status&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;: &lt;/span&gt;&lt;span class="p"&gt;${&lt;/span&gt;&lt;span class="nx"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;statusText&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;`&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;

    &lt;span class="kd"&gt;let&lt;/span&gt; &lt;span class="nx"&gt;data&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;isGoogle&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
      &lt;span class="nx"&gt;data&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;json&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
      &lt;span class="nx"&gt;console&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;log&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;`  [&lt;/span&gt;&lt;span class="p"&gt;${&lt;/span&gt;&lt;span class="nx"&gt;engine&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;name&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;] Retrieved JSON data`&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="k"&gt;else&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
      &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;markdown&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;text&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
      &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;sizeKB&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;markdown&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;length&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="mi"&gt;1024&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;toFixed&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
      &lt;span class="nx"&gt;console&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;log&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;`  [&lt;/span&gt;&lt;span class="p"&gt;${&lt;/span&gt;&lt;span class="nx"&gt;engine&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;name&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;] Retrieved markdown (&lt;/span&gt;&lt;span class="p"&gt;${&lt;/span&gt;&lt;span class="nx"&gt;sizeKB&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="s2"&gt; KB)`&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
      &lt;span class="nx"&gt;data&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;markdown&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;

    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
      &lt;span class="na"&gt;engine&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;engineKey&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
      &lt;span class="na"&gt;engineName&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;engine&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
      &lt;span class="na"&gt;query&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;query&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
      &lt;span class="na"&gt;data&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;data&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
      &lt;span class="na"&gt;format&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;isGoogle&lt;/span&gt; &lt;span class="p"&gt;?&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;json&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt; &lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;markdown&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
      &lt;span class="na"&gt;timestamp&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;Date&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="nf"&gt;toISOString&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="p"&gt;};&lt;/span&gt;

  &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="k"&gt;catch &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;error&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="nx"&gt;console&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;error&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;`  [&lt;/span&gt;&lt;span class="p"&gt;${&lt;/span&gt;&lt;span class="nx"&gt;engine&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;name&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;] ERROR: &lt;/span&gt;&lt;span class="p"&gt;${&lt;/span&gt;&lt;span class="nx"&gt;error&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;message&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;`&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
      &lt;span class="na"&gt;engine&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;engineKey&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
      &lt;span class="na"&gt;engineName&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;engine&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
      &lt;span class="na"&gt;query&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;query&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
      &lt;span class="na"&gt;data&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;null&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
      &lt;span class="na"&gt;format&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;null&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
      &lt;span class="na"&gt;error&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;error&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;message&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
      &lt;span class="na"&gt;timestamp&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;Date&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="nf"&gt;toISOString&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="p"&gt;};&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="c1"&gt;// fetch results from all search engines in parallel&lt;/span&gt;
&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="kd"&gt;function&lt;/span&gt; &lt;span class="nf"&gt;fetchAllEngines&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;query&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nx"&gt;console&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;log&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;`\n[Querying all engines for: "&lt;/span&gt;&lt;span class="p"&gt;${&lt;/span&gt;&lt;span class="nx"&gt;query&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;"]`&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
  &lt;span class="nx"&gt;console&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;log&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;-&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;repeat&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;60&lt;/span&gt;&lt;span class="p"&gt;));&lt;/span&gt;

  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;engineKeys&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nb"&gt;Object&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;keys&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;ENGINES&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;promises&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;engineKeys&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;map&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;engineKey&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="nf"&gt;fetchEngine&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;engineKey&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;query&lt;/span&gt;&lt;span class="p"&gt;));&lt;/span&gt;
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;results&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nb"&gt;Promise&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;all&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;promises&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;successCount&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;results&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;filter&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;r&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="nx"&gt;r&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;data&lt;/span&gt; &lt;span class="o"&gt;!==&lt;/span&gt; &lt;span class="kc"&gt;null&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nx"&gt;length&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
  &lt;span class="nx"&gt;console&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;log&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;`\n[Successfully retrieved data from &lt;/span&gt;&lt;span class="p"&gt;${&lt;/span&gt;&lt;span class="nx"&gt;successCount&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;/&lt;/span&gt;&lt;span class="p"&gt;${&lt;/span&gt;&lt;span class="nx"&gt;results&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;length&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="s2"&gt; engines]`&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

  &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nx"&gt;results&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="c1"&gt;// save results from all engines to files&lt;/span&gt;
&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="kd"&gt;function&lt;/span&gt; &lt;span class="nf"&gt;saveResults&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;query&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;category&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;engineResults&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;timestamp&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nb"&gt;Date&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;now&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;querySlug&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;query&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;replace&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sr"&gt;/&lt;/span&gt;&lt;span class="se"&gt;[^&lt;/span&gt;&lt;span class="sr"&gt;a-z0-9&lt;/span&gt;&lt;span class="se"&gt;]&lt;/span&gt;&lt;span class="sr"&gt;/gi&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;_&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;toLowerCase&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;

  &lt;span class="c1"&gt;// save each engine's results to a separate file&lt;/span&gt;
  &lt;span class="k"&gt;for &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;result&lt;/span&gt; &lt;span class="k"&gt;of&lt;/span&gt; &lt;span class="nx"&gt;engineResults&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;result&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;data&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
      &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;extension&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;result&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;format&lt;/span&gt; &lt;span class="o"&gt;===&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;json&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt; &lt;span class="p"&gt;?&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;json&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt; &lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;md&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
      &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;filename&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;`&lt;/span&gt;&lt;span class="p"&gt;${&lt;/span&gt;&lt;span class="nx"&gt;result&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;engine&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;-&lt;/span&gt;&lt;span class="p"&gt;${&lt;/span&gt;&lt;span class="nx"&gt;querySlug&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;-&lt;/span&gt;&lt;span class="p"&gt;${&lt;/span&gt;&lt;span class="nx"&gt;timestamp&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;.&lt;/span&gt;&lt;span class="p"&gt;${&lt;/span&gt;&lt;span class="nx"&gt;extension&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;`&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
      &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;filepath&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;path&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;join&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;RAW_DIR&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;filename&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

      &lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;result&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;format&lt;/span&gt; &lt;span class="o"&gt;===&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;json&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;fs&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;writeJson&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;filepath&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;result&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;data&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;spaces&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt; &lt;span class="p"&gt;});&lt;/span&gt;
      &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="k"&gt;else&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;fs&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;writeFile&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;filepath&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;result&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;data&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;utf8&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
      &lt;span class="p"&gt;}&lt;/span&gt;

      &lt;span class="nx"&gt;console&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;log&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;`  [Saved &lt;/span&gt;&lt;span class="p"&gt;${&lt;/span&gt;&lt;span class="nx"&gt;result&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;engineName&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="s2"&gt; results]`&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;

  &lt;span class="c1"&gt;// save metadata file with query info&lt;/span&gt;
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;metadataFile&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;`metadata-&lt;/span&gt;&lt;span class="p"&gt;${&lt;/span&gt;&lt;span class="nx"&gt;querySlug&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;-&lt;/span&gt;&lt;span class="p"&gt;${&lt;/span&gt;&lt;span class="nx"&gt;timestamp&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;.json`&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;metadataPath&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;path&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;join&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;RAW_DIR&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;metadataFile&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
  &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;fs&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;writeJson&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;metadataPath&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="nx"&gt;query&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="nx"&gt;category&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="na"&gt;timestamp&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;Date&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="nf"&gt;toISOString&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt;
    &lt;span class="na"&gt;engines&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;engineResults&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;map&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;r&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;({&lt;/span&gt;
      &lt;span class="na"&gt;engine&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;r&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;engine&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
      &lt;span class="na"&gt;engineName&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;r&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;engineName&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
      &lt;span class="na"&gt;hasData&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;r&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;data&lt;/span&gt; &lt;span class="o"&gt;!==&lt;/span&gt; &lt;span class="kc"&gt;null&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
      &lt;span class="na"&gt;error&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;r&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;error&lt;/span&gt; &lt;span class="o"&gt;||&lt;/span&gt; &lt;span class="kc"&gt;null&lt;/span&gt;
    &lt;span class="p"&gt;}))&lt;/span&gt;
  &lt;span class="p"&gt;},&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;spaces&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt; &lt;span class="p"&gt;});&lt;/span&gt;

  &lt;span class="nx"&gt;console&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;log&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;`  [Saved metadata]`&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="c1"&gt;// main function&lt;/span&gt;
&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="kd"&gt;function&lt;/span&gt; &lt;span class="nf"&gt;main&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nx"&gt;console&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;log&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;=&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;repeat&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;60&lt;/span&gt;&lt;span class="p"&gt;));&lt;/span&gt;
  &lt;span class="nx"&gt;console&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;log&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;Search Engine Comparison Tool&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
  &lt;span class="nx"&gt;console&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;log&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;=&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;repeat&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;60&lt;/span&gt;&lt;span class="p"&gt;));&lt;/span&gt;

  &lt;span class="c1"&gt;// get queries from command line or use defaults&lt;/span&gt;
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;cliQueries&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;process&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;argv&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;slice&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
  &lt;span class="kd"&gt;let&lt;/span&gt; &lt;span class="nx"&gt;queries&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

  &lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;cliQueries&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;length&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="c1"&gt;// use command-line queries&lt;/span&gt;
    &lt;span class="nx"&gt;queries&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;cliQueries&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;map&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;q&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;({&lt;/span&gt; &lt;span class="na"&gt;query&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;q&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="na"&gt;category&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;custom&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt; &lt;span class="p"&gt;}));&lt;/span&gt;
    &lt;span class="nx"&gt;console&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;log&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;`Queries: &lt;/span&gt;&lt;span class="p"&gt;${&lt;/span&gt;&lt;span class="nx"&gt;cliQueries&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;join&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;, &lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)}&lt;/span&gt;&lt;span class="s2"&gt;`&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="k"&gt;else&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="c1"&gt;// use default categorized queries&lt;/span&gt;
    &lt;span class="nx"&gt;queries&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;DEFAULT_QUERIES&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="nx"&gt;console&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;log&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;`Total queries: &lt;/span&gt;&lt;span class="p"&gt;${&lt;/span&gt;&lt;span class="nx"&gt;queries&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;length&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;`&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;

  &lt;span class="nx"&gt;console&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;log&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;`Engines: Google, Bing, DuckDuckGo, Yandex`&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
  &lt;span class="nx"&gt;console&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;log&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;`Max results per engine: &lt;/span&gt;&lt;span class="p"&gt;${&lt;/span&gt;&lt;span class="nx"&gt;CONFIG&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;maxResults&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;`&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
  &lt;span class="nx"&gt;console&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;log&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;=&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;repeat&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;60&lt;/span&gt;&lt;span class="p"&gt;));&lt;/span&gt;

  &lt;span class="c1"&gt;// check for API token&lt;/span&gt;
  &lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;!&lt;/span&gt;&lt;span class="nx"&gt;CONFIG&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;apiToken&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="nx"&gt;console&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;error&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;ERROR: BRIGHT_DATA_API_TOKEN not found in environment&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
    &lt;span class="nx"&gt;process&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;exit&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;

  &lt;span class="c1"&gt;// setup directories&lt;/span&gt;
  &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nf"&gt;setup&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;

  &lt;span class="c1"&gt;// process each query&lt;/span&gt;
  &lt;span class="k"&gt;for &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kd"&gt;let&lt;/span&gt; &lt;span class="nx"&gt;i&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="nx"&gt;i&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="nx"&gt;queries&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;length&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="nx"&gt;i&lt;/span&gt;&lt;span class="o"&gt;++&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nx"&gt;query&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;category&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;queries&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nx"&gt;i&lt;/span&gt;&lt;span class="p"&gt;];&lt;/span&gt;

    &lt;span class="nx"&gt;console&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;log&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;`\n[&lt;/span&gt;&lt;span class="p"&gt;${&lt;/span&gt;&lt;span class="nx"&gt;i&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;/&lt;/span&gt;&lt;span class="p"&gt;${&lt;/span&gt;&lt;span class="nx"&gt;queries&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;length&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;] Processing: "&lt;/span&gt;&lt;span class="p"&gt;${&lt;/span&gt;&lt;span class="nx"&gt;query&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;" [&lt;/span&gt;&lt;span class="p"&gt;${&lt;/span&gt;&lt;span class="nx"&gt;category&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;]`&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
    &lt;span class="nx"&gt;console&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;log&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;-&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;repeat&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;60&lt;/span&gt;&lt;span class="p"&gt;));&lt;/span&gt;

    &lt;span class="k"&gt;try&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
      &lt;span class="c1"&gt;// fetch results from all engines&lt;/span&gt;
      &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;engineResults&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nf"&gt;fetchAllEngines&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;query&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

      &lt;span class="c1"&gt;// save results to files&lt;/span&gt;
      &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nf"&gt;saveResults&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;query&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;category&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;engineResults&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

      &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;successCount&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;engineResults&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;filter&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;r&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="nx"&gt;r&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;data&lt;/span&gt; &lt;span class="o"&gt;!==&lt;/span&gt; &lt;span class="kc"&gt;null&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nx"&gt;length&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
      &lt;span class="nx"&gt;console&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;log&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;`\n[Completed: &lt;/span&gt;&lt;span class="p"&gt;${&lt;/span&gt;&lt;span class="nx"&gt;successCount&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;/&lt;/span&gt;&lt;span class="p"&gt;${&lt;/span&gt;&lt;span class="nx"&gt;engineResults&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;length&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="s2"&gt; engines succeeded]`&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

      &lt;span class="c1"&gt;// wait between queries (except for last one)&lt;/span&gt;
      &lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;i&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="nx"&gt;queries&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;length&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="nx"&gt;console&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;log&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;Waiting 2 seconds...&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
        &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;Promise&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;resolve&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="nf"&gt;setTimeout&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;resolve&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;2000&lt;/span&gt;&lt;span class="p"&gt;));&lt;/span&gt;
      &lt;span class="p"&gt;}&lt;/span&gt;

    &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="k"&gt;catch &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;error&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
      &lt;span class="nx"&gt;console&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;error&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;`\n[Error processing "&lt;/span&gt;&lt;span class="p"&gt;${&lt;/span&gt;&lt;span class="nx"&gt;query&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;":`&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;error&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;message&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;

  &lt;span class="nx"&gt;console&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;log&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;=&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;repeat&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;60&lt;/span&gt;&lt;span class="p"&gt;));&lt;/span&gt;
  &lt;span class="nx"&gt;console&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;log&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;All queries completed!&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
  &lt;span class="nx"&gt;console&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;log&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;`Results saved in: &lt;/span&gt;&lt;span class="p"&gt;${&lt;/span&gt;&lt;span class="nx"&gt;RAW_DIR&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;`&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
  &lt;span class="nx"&gt;console&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;log&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;=&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;repeat&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;60&lt;/span&gt;&lt;span class="p"&gt;));&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="c1"&gt;// run if called directly&lt;/span&gt;
&lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;require&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;main&lt;/span&gt; &lt;span class="o"&gt;===&lt;/span&gt; &lt;span class="nx"&gt;module&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nf"&gt;main&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="k"&gt;catch&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;error&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="nx"&gt;console&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;error&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;FATAL ERROR:&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;error&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
    &lt;span class="nx"&gt;process&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;exit&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
  &lt;span class="p"&gt;});&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="nx"&gt;module&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;exports&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nx"&gt;main&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;CONFIG&lt;/span&gt; &lt;span class="p"&gt;};&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Breaking this down, here’s what I’m doing:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://1.Define" rel="noopener noreferrer"&gt;&lt;strong&gt;1.Define&lt;/strong&gt;&lt;/a&gt; &lt;strong&gt;a fixed set of queries&lt;/strong&gt;, grouped into broad categories:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;commercial (e.g. product comparisons)&lt;/li&gt;
&lt;li&gt;technical (e.g. programming questions)&lt;/li&gt;
&lt;li&gt;YMYL (health, finance, climate)&lt;/li&gt;
&lt;li&gt;wildcard / everyday queries&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;2. Build native search URLs&lt;/strong&gt; for each engine (a rough sketch follows this list):&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Google (&lt;code&gt;https://www.google.com/search?q=something&lt;/code&gt;)&lt;/li&gt;
&lt;li&gt;Bing (&lt;code&gt;https://www.bing.com/search?q=something&lt;/code&gt;)&lt;/li&gt;
&lt;li&gt;DuckDuckGo (&lt;code&gt;https://duckduckgo.com/?q=something&lt;/code&gt;)&lt;/li&gt;
&lt;li&gt;Yandex (&lt;code&gt;https://www.yandex.com/search/?text=something&lt;/code&gt;)&lt;/li&gt;
&lt;/ul&gt;
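
&lt;p&gt;As a rough illustration (the helper name and shape here are mine, not lifted from the actual script), those builders can be as simple as one URL template per engine:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;// Hypothetical sketch of step 2 -- the real script's helpers may differ  
const SEARCH_URL_BUILDERS: Record&amp;lt;string, (query: string) =&amp;gt; string&amp;gt; = {  
  google: (q) =&amp;gt; `https://www.google.com/search?q=${encodeURIComponent(q)}`,  
  bing: (q) =&amp;gt; `https://www.bing.com/search?q=${encodeURIComponent(q)}`,  
  duckduckgo: (q) =&amp;gt; `https://duckduckgo.com/?q=${encodeURIComponent(q)}`,  
  yandex: (q) =&amp;gt; `https://www.yandex.com/search/?text=${encodeURIComponent(q)}`,  
};  

// e.g. SEARCH_URL_BUILDERS.bing('best budget mechanical keyboard')
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;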

&lt;p&gt;&lt;strong&gt;3. Fetch SERPs directly&lt;/strong&gt;, without rendering or post-processing:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Google results are requested in auto-parsed JSON form (via Bright Data SERP API’s parser)&lt;/li&gt;
&lt;li&gt;Bing, DuckDuckGo, and Yandex are retrieved as raw or near-raw markup (Bing does apparently support JSON parsing, but I couldn’t get it to work)&lt;/li&gt;
&lt;li&gt;No attempt is made to normalize rankings across engines. Way beyond the scope of what I wanted to do here.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;4. Store each engine’s response separately&lt;/strong&gt;, along with metadata. I analyzed them afterward by extracting titles, sources, the search string used, and the search engine used. The dataviz afterwards was pretty standard &lt;a href="https://d3js.org" rel="noopener noreferrer"&gt;D3.js&lt;/a&gt; stuff.&lt;/p&gt;
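
&lt;p&gt;For reference, a minimal version of that “store with metadata” step could look something like this (the file layout and field names are my own, not the script’s):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import { mkdir, writeFile } from 'node:fs/promises';  
import path from 'node:path';  

// Hypothetical sketch: one JSON file per (engine, query) pair, with enough  
// metadata to reconstruct where and when each snapshot came from  
async function saveSerpSnapshot(rawDir: string, engine: string, query: string, body: string) {  
  const dir = path.join(rawDir, engine);  
  await mkdir(dir, { recursive: true });  
  const slug = query.toLowerCase().replace(/[^a-z0-9]+/g, '-').slice(0, 60);  
  const payload = {  
    meta: { engine, query, fetchedAt: new Date().toISOString() },  
    body, // raw HTML, or the parsed JSON in Google's case  
  };  
  await writeFile(path.join(dir, `${slug}.json`), JSON.stringify(payload, null, 2));  
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;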

&lt;h2&gt;
  
  
  What this method captures — and what it doesn’t
&lt;/h2&gt;

&lt;p&gt;This approach captures  &lt;strong&gt;what each engine is willing to surface by default&lt;/strong&gt;, given the same query and no user context. It is well-suited to studying:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;domain diversity&lt;/li&gt;
&lt;li&gt;repetition and amplification&lt;/li&gt;
&lt;li&gt;tolerance for aggregators, blogs, and spam&lt;/li&gt;
&lt;li&gt;relative privileging of institutions vs. individuals&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;It does  &lt;em&gt;not&lt;/em&gt;  capture:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;personalization effects&lt;/li&gt;
&lt;li&gt;geographic fine-tuning beyond the proxy’s exit region&lt;/li&gt;
&lt;li&gt;click behavior or downstream recommendations&lt;/li&gt;
&lt;li&gt;subtle UI elements like knowledge panels or answer boxes&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That is intentional. The goal here is not to model the entire user experience, but to compare  &lt;strong&gt;personality defaults&lt;/strong&gt;: what each engine treats as acceptable answers when no additional signals are provided.&lt;/p&gt;

&lt;p&gt;In that sense, this methodology reflects the engines’ baseline worldviews — how they behave when forced to make decisions about authority, safety, and plurality on their own.&lt;/p&gt;

&lt;h2&gt;
  
  
  Opinion: There is No Singular “The Web” Anymore
&lt;/h2&gt;

&lt;p&gt;This analysis wasn’t about correctness, bias accusations, or SEO gamesmanship.&lt;/p&gt;

&lt;p&gt;What this data makes clear is that  &lt;strong&gt;“the web” is no longer a single, shared informational substrate&lt;/strong&gt;. It has fractured at the search-engine layer. If you have access to some sort of SERP infra, all of this is easily reproducible.&lt;/p&gt;

&lt;p&gt;Search engines don’t merely reflect reality; they  &lt;strong&gt;select which parts of reality are visible at all&lt;/strong&gt;. When only ~2–3% of sources survive that selection process across platforms, the idea of a neutral, universal information commons stops making sense.&lt;/p&gt;

&lt;p&gt;Instead, we live in parallel informational worlds:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Google’s institutional web&lt;/li&gt;
&lt;li&gt;Bing’s consensus web&lt;/li&gt;
&lt;li&gt;DuckDuckGo’s lightly curated web&lt;/li&gt;
&lt;li&gt;Yandex’s wild west, maximalist web&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The question “what does the internet say?” no longer has a single answer. It depends entirely on where you look.&lt;/p&gt;

</description>
      <category>webdev</category>
      <category>programming</category>
      <category>javascript</category>
      <category>datascience</category>
    </item>
    <item>
      <title>Building a Fault-Tolerant Web Data Ingestion Pipeline with Effect-TS</title>
      <dc:creator>Prithwish Nath</dc:creator>
      <pubDate>Tue, 20 Jan 2026 08:17:46 +0000</pubDate>
      <link>https://dev.to/prithwish_nath/building-a-fault-tolerant-web-data-ingestion-pipeline-with-effect-ts-29l1</link>
      <guid>https://dev.to/prithwish_nath/building-a-fault-tolerant-web-data-ingestion-pipeline-with-effect-ts-29l1</guid>
      <description>&lt;blockquote&gt;
&lt;p&gt;TL;DR — Silent failures break data pipelines. This post shows how Effect-TS enables typed errors, safe resource management, declarative retry logic, and composable pipelines to build predictable, fault-tolerant web data ingestion systems at scale.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;It annoys me to no end that production web data pipelines rarely fail catastrophically. Instead, batch jobs “succeed” with incomplete data — silently corrupting downstream analytics, triggering retry storms that lead to IP bans, or letting one bad edge case crash a large nightly job.&lt;/p&gt;

&lt;p&gt;I’m currently rebuilding the web data ingestion pipeline I’m responsible for at work: aggregation and analysis from  &lt;strong&gt;100+ upstream sources daily&lt;/strong&gt;, hundreds of items per batch, with strict consistency requirements. Over time, I stopped trying to paper over failures with more logging + more retries, and started looking for a way to make system behavior  &lt;em&gt;explicit and easier to reason about&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;That search eventually led me to &lt;a href="https://github.com/Effect-TS/effect/" rel="noopener noreferrer"&gt;Effect&lt;/a&gt; (formerly Effect-TS) — a TypeScript effect system for modeling side effects, failures, and resource lifecycles directly in the type system.&lt;/p&gt;

&lt;p&gt;Effect didn’t make my life easier in the sense of “fewer lines of code.” What it changed was how I thought about  &lt;strong&gt;failure in TypeScript systems&lt;/strong&gt;. Instead of treating network errors, rate limits, and partial responses as things to catch and move on from, Effect pushes you to model these failure modes explicitly and decide  &lt;em&gt;ahead of time&lt;/em&gt;  how the system should respond.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Reliability engineering isn’t about building systems that never fail. It’s about building systems where failure is expected, understood, and bounded — so it doesn’t cascade into larger outages or silent data corruption.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;In this post, I’ll walk through what that style of reliability engineering looks like in practice: using Effect-TS with typed errors, resource management, and declarative retries to build a fault-tolerant web data ingestion pipeline whose behavior is predictable under real-world failure.&lt;/p&gt;

&lt;p&gt;All of the code in this post lives in a public  &lt;strong&gt;Effect-TS web scraping and data ingestion&lt;/strong&gt;  repository on GitHub:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://github.com/sixthextinction/effect-ts-scraping" rel="noopener noreferrer"&gt;&lt;strong&gt;https://github.com/sixthextinction/effect-ts-scraping&lt;/strong&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  What is Effect-TS?
&lt;/h2&gt;

&lt;p&gt;At a practical level,  &lt;strong&gt;Effect lets you describe work without running it yet.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;An  &lt;code&gt;Effect&lt;/code&gt;  value represents an operation that  &lt;em&gt;might&lt;/em&gt;  perform I/O,  &lt;em&gt;might&lt;/em&gt;  fail, and  &lt;em&gt;might&lt;/em&gt;  depend on some environment…but none of that happens until you explicitly run it.&lt;/p&gt;

&lt;p&gt;The crucial thing to realize is that Effect doesn’t just describe  &lt;em&gt;what&lt;/em&gt;  the operation does. It also encodes  &lt;strong&gt;what the operation produces on success,&lt;/strong&gt;  &lt;strong&gt;what it can fail with&lt;/strong&gt;, and, optionally,  &lt;strong&gt;what it depends on.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;And ALL of that information lives in the TypeScript type system.&lt;/p&gt;

&lt;p&gt;This might not sound like a big deal at first, but it changes  &lt;em&gt;when&lt;/em&gt;  decisions get made — and turns out, that matters a lot.&lt;/p&gt;

&lt;p&gt;In a typical TypeScript codebase, a data-fetching function looks like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;async function fetchHtml(url: string): Promise&amp;lt;string&amp;gt; {  
  const res = await fetch(url);  

  if (!res.ok) {  
    throw new Error(`Request failed: ${res.status}`);  
  }  

  return await res.text();  
}  

// and then...  
const promise = fetchHtml("https://example.com");
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;What most people don’t realize is that  &lt;a href="https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/Promise" rel="noopener noreferrer"&gt;Promises&lt;/a&gt;  in JavaScript are eager.  &lt;strong&gt;As soon as that line runs, the request has already started.&lt;/strong&gt;  Even if you never  &lt;code&gt;await&lt;/code&gt; the promise, the network request is already in flight, side effects have already happened, and yes — failures may already be occurring.&lt;/p&gt;

&lt;p&gt;Now compare that to an Effect-based version:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;// first define the error  
class NetworkError extends Data.TaggedError('NetworkError')&amp;lt;{  
  url: string;  
}&amp;gt; {}  

// then, do this.  
const fetchHtml = (url: string): Effect.Effect&amp;lt;string, NetworkError, never&amp;gt; =&amp;gt;  
  Effect.tryPromise({  
    try: async () =&amp;gt; {  
      const res = await fetch(url);  
      if (!res.ok) {  
        throw new Error();  
      }  
      return await res.text();  
    },  
    catch: () =&amp;gt; new NetworkError({ url }),  
  });
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Effect is  &lt;em&gt;lazy&lt;/em&gt;, not eager. With Effect, just doing  &lt;code&gt;const effect = fetchHtml("https://example.com");&lt;/code&gt;  does nothing. It’s simply data, returning a  &lt;em&gt;description&lt;/em&gt;  of a computation. Nothing runs until you explicitly say so, by calling a runner like this:&lt;/p&gt;

&lt;p&gt;&lt;code&gt;Effect.runPromise(fetchHtml("https://example.com"));&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;Because the work doesn’t start until you run that, you can still alter how it should behave — retries, timeouts, cancellations and more, attached &lt;strong&gt;before execution&lt;/strong&gt;, not bolted on afterward.&lt;/p&gt;

&lt;p&gt;Instead of discovering failure modes at runtime (or, more likely, encoding them in comments/conventions), you’re &lt;em&gt;forced&lt;/em&gt; to confront them at design time.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;const program = pipe(  
  fetchHtml("https://example.com"),  
  Effect.retry(retryPolicy), // add retry logic  
  Effect.timeout("10 seconds") // add exponential backoff  
 // anything else  
);  

// still nothing has run  

// until….  
Effect.runPromise(program);  
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That’s why Effect works so well for hostile I/O like data ingestion. You’re deciding  &lt;em&gt;ahead of time&lt;/em&gt;  how the system behaves when failures do happen. And with it, cross-cutting concerns (retries, rate limits, cleanup) can go on top without refactoring core logic.&lt;/p&gt;

&lt;p&gt;Also, look at the Effect version’s type —  &lt;code&gt;Effect&amp;lt;string, NetworkError&amp;gt;&lt;/code&gt; — this is a  &lt;strong&gt;machine-checkable contract&lt;/strong&gt; that tells you,  &lt;em&gt;precisely&lt;/em&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;this operation performs effects&lt;/li&gt;
&lt;li&gt;it produces a string on success&lt;/li&gt;
&lt;li&gt;it can fail with  &lt;code&gt;NetworkError&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;it CANNOT fail with any other expected error&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Compare that to the vanilla TS type signature &lt;code&gt;((url: string) =&amp;gt; Promise&amp;lt;string&amp;gt;)&lt;/code&gt;, from which you cannot tell:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;what errors might be thrown&lt;/li&gt;
&lt;li&gt;whether they’re retryable&lt;/li&gt;
&lt;li&gt;whether this is safe to call multiple times&lt;/li&gt;
&lt;li&gt;whether this does I/O or just compute&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;All of that information exists only in comments, conventions, or someone’s head (or you only find out by running it and reacting.)&lt;/p&gt;

&lt;p&gt;All this is why Effect feels like  &lt;em&gt;the&lt;/em&gt; TypeScript framework that you didn’t know you needed.&lt;/p&gt;

&lt;h2&gt;
  
  
  How Effect Changed How I Design for Failure
&lt;/h2&gt;

&lt;p&gt;When the mental model of Effect clicked for me, I realized that if I could describe behavior &lt;em&gt;before&lt;/em&gt; anything runs, then I’m not just deciding what happens on success; I’m deciding how the operation behaves under every condition. That obviously includes failures, &lt;strong&gt;but it also includes retries, slowdowns, and backpressure.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;That’s where my thinking about data ingestion started to change. Most failures in a data ingestion pipeline are  &lt;strong&gt;expected&lt;/strong&gt;. None of them behave like typical fix-and-forget bugs:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;networks are slow or unavailable&lt;/li&gt;
&lt;li&gt;upstream APIs rate-limit you&lt;/li&gt;
&lt;li&gt;data formats change without notice&lt;/li&gt;
&lt;li&gt;some batches succeed while others fail&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;What’s different about these failures is that they’re maddeningly &lt;em&gt;partial&lt;/em&gt;, and they happen often. A job can succeed &lt;em&gt;just enough&lt;/em&gt; to look healthy while quietly producing incomplete/stale data.&lt;/p&gt;

&lt;p&gt;That’s not a correctness problem so much as a reliability problem. Once I started using Effect more deliberately, I noticed that it actually pushed me away from reacting to failures  &lt;em&gt;after&lt;/em&gt; the fact, and toward making those decisions up front.&lt;br&gt;
So instead of adding another retry or another catch, I had to decide at design time:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;What kinds of failures do I expect to see in production?&lt;/li&gt;
&lt;li&gt;Which of these should be retried, and which shouldn’t?&lt;/li&gt;
&lt;li&gt;When should the pipeline slow down instead of pushing harder?&lt;/li&gt;
&lt;li&gt;When is failing fast the correct behavior?*&lt;/li&gt;
&lt;/ul&gt;

&lt;blockquote&gt;
&lt;p&gt;*This one is slightly debatable, but let’s throw it in there because it’s an adjacent problem anyway.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Because these questions were now part of the TypeScript type system itself, those decisions end up close to the code that triggers them. There’s less room for “we’ll handle it later” logic that never quite materializes because Effect forces the conversation.&lt;/p&gt;
&lt;h2&gt;
  
  
  Designing a Web Data Pipeline with Effect
&lt;/h2&gt;

&lt;p&gt;The first concrete step was obvious:  &lt;strong&gt;I needed to enumerate what actually breaks in my pipeline, and decide how each case should behave.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;So I sat down and reduced my Puppeteer-based ingestion pipeline down to its real failure modes:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Network timeouts.&lt;/strong&gt;  Transient. These should be retried with backoff.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Rate limits.&lt;/strong&gt;  Expected. These require slowing down.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;IP blocks.&lt;/strong&gt; Fatal without proxy rotation, but with the right infrastructure (as in my case), just another retryable case.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;CAPTCHAs.&lt;/strong&gt;  Not a logic problem. For me, this is handled entirely by the proxy layer, and is also retryable without any code on my part.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Schema changes.&lt;/strong&gt;  The site changed and selectors broke. This isn’t transient — it’s a logic error and should fail fast.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Traditional error handling lumps all of these into “something went wrong, throw an exception.” Effect lets you model them as distinct failure types, which means you can build infrastructure that handles them systematically. And that’s exactly where we’re going to start.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;For reference, here’s the code for the full pipeline:  &lt;a href="https://github.com/sixthextinction/effect-ts-scraping/blob/main/full-pipeline.ts" rel="noopener noreferrer"&gt;&lt;strong&gt;https://github.com/sixthextinction/effect-ts-scraping/blob/main/full-pipeline.ts&lt;/strong&gt;&lt;/a&gt;&lt;/p&gt;
&lt;/blockquote&gt;
&lt;h2&gt;
  
  
  Failure as a First-Class Concept (Tagged Errors)
&lt;/h2&gt;

&lt;p&gt;The first thing I do is write down every failure mode I expect to see, and give each one a name. Each of these errors represents something  &lt;em&gt;meaningfully different&lt;/em&gt;  from an operational perspective.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;class NetworkError extends Data.TaggedError('NetworkError')&amp;lt;{  
  message: string;  
  url: string;  
  cause?: unknown;  
}&amp;gt; {}  

class TimeoutError extends Data.TaggedError('TimeoutError')&amp;lt;{  
  message: string;  
  url: string;  
  timeout: number;  
}&amp;gt; {}  

class RateLimitError extends Data.TaggedError('RateLimitError')&amp;lt;{  
  message: string;  
  url: string;  
  retryAfter?: number;  
}&amp;gt; {}  

class IPBlockError extends Data.TaggedError('IPBlockError')&amp;lt;{  
  message: string;  
  url: string;  
  proxyId?: string;  
}&amp;gt; {}  

class ParseError extends Data.TaggedError('ParseError')&amp;lt;{  
  message: string;  
  cause?: unknown;  
}&amp;gt; {}  

// BrowserError is used by the Puppeteer steps below (fields assumed from that usage)  
class BrowserError extends Data.TaggedError('BrowserError')&amp;lt;{  
  message: string;  
  cause?: unknown;  
}&amp;gt; {}  

// the union the later snippets refer to as ScrapingError  
type ScrapingError =  
  | NetworkError  
  | TimeoutError  
  | RateLimitError  
  | IPBlockError  
  | BrowserError  
  | ParseError;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;What’s this  &lt;code&gt;Data.TaggedError&lt;/code&gt;?  &lt;a href="https://effect.website/docs/error-management/expected-errors/" rel="noopener noreferrer"&gt;That’s something Effect provides us.&lt;/a&gt;  Basically, it’s a premade error class that automatically gets a &lt;code&gt;_tag&lt;/code&gt;  field — a string literal that acts as a discriminant.&lt;/p&gt;

&lt;p&gt;This &lt;code&gt;_tag&lt;/code&gt; field gives us type-safe error handling. TypeScript can distinguish between different error types at compile time, and you can use functions like &lt;code&gt;Effect.catchTag&lt;/code&gt; to handle specific errors without losing type information.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;You technically  &lt;em&gt;can&lt;/em&gt;  do this with vanilla TypeScript (discriminated unions), but it’ll be a pain.&lt;/strong&gt; Yes, you can catch generic  &lt;code&gt;Error&lt;/code&gt; objects and use  &lt;code&gt;instanceof&lt;/code&gt; checks — but TypeScript can’t always narrow them correctly. Effect’s tagged errors give you precise type narrowing. When you catch a  &lt;code&gt;NetworkError&lt;/code&gt;, for example, TypeScript 100% knows it has a  &lt;code&gt;url&lt;/code&gt; property. When you catch a  &lt;code&gt;RateLimitError&lt;/code&gt;, TypeScript knows it might have a  &lt;code&gt;retryAfter&lt;/code&gt; property. This makes error handling both type-safe  &lt;em&gt;and&lt;/em&gt; composable (not to mention 500% less annoying to write code for. 😅)&lt;/p&gt;
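
&lt;p&gt;To make that concrete, here’s a minimal sketch of what handling these tagged errors could look like. (&lt;code&gt;fetchPage&lt;/code&gt; is a hypothetical stand-in for any Effect that can fail with the errors defined above; the handler bodies are placeholders, not pipeline code.)&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import { Duration, Effect } from 'effect';  

// fetchPage is hypothetical: any Effect that can fail with our tagged errors  
declare const fetchPage: (  
  url: string  
) =&amp;gt; Effect.Effect&amp;lt;string, NetworkError | RateLimitError&amp;gt;;  

const resilient = (url: string) =&amp;gt;  
  fetchPage(url).pipe(  
    // in this handler, `err` is narrowed to RateLimitError, so `retryAfter` is available  
    Effect.catchTag('RateLimitError', (err) =&amp;gt;  
      Effect.sleep(Duration.seconds(err.retryAfter ?? 60)).pipe(  
        Effect.zipRight(fetchPage(url))  
      )  
    ),  
    // and here `err` is narrowed to NetworkError, so `err.url` is guaranteed to exist  
    Effect.catchTag('NetworkError', (err) =&amp;gt;  
      Effect.succeed(`&amp;lt;!-- could not fetch ${err.url} --&amp;gt;`)  
    )  
  );
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;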

&lt;p&gt;Anyway, &lt;strong&gt;when modeling complex pipelines, this has to be your first step&lt;/strong&gt; because once you have typed errors, you can decide which ones are retryable and which aren’t:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;const retryableErrors = [  
  'NetworkError',  
  'TimeoutError',   
  'RateLimitError',  
  'IPBlockError',  
  'BrowserError',  
] as const;  

const isRetryableError = (error: ScrapingError): boolean =&amp;gt;  
  retryableErrors.includes(error._tag as any);
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;So a &lt;code&gt;ParseError&lt;/code&gt; means your HTML selectors broke. That’s not a network problem, so retrying won’t help. A &lt;code&gt;TimeoutError&lt;/code&gt;, on the other hand, is exactly the kind of failure you retry with backoff.&lt;/p&gt;

&lt;h2&gt;
  
  
  Browser Logic — Side Effects Without the Pain
&lt;/h2&gt;

&lt;p&gt;My pipeline uses &lt;a href="https://github.com/puppeteer/puppeteer" rel="noopener noreferrer"&gt;Puppeteer&lt;/a&gt; to handle dynamic, JS-based websites.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://effect.website/docs/getting-started/the-effect-type/" rel="noopener noreferrer"&gt;For this, we’ll use the Effect interface.&lt;/a&gt;  (this  &lt;code&gt;Effect&lt;/code&gt; is a type within the  &lt;code&gt;Effect-TS/effect&lt;/code&gt;  library we’re working with) Instead of letting Puppeteer leak browser state all over the codebase, everything related to it lives in these  &lt;code&gt;Effects&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;The  &lt;code&gt;Effect&lt;/code&gt; interface is the quintessential part of the Effect-TS library — a description of a workflow or operation that is lazily executed. Here's what it looks like:&lt;/p&gt;

&lt;p&gt;&lt;code&gt;Effect&amp;lt;Success, Error, Requirements&amp;gt;&lt;/code&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Where &lt;code&gt;Success&lt;/code&gt; represents the type of what is returned on a success, &lt;code&gt;Error&lt;/code&gt; represents the same for an error, and &lt;code&gt;Requirements&lt;/code&gt; represents the type of required dependencies that you need to pass.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;We’ve talked about the main difference before — unlike Promises, Effects are lazy. They don’t run until executed. This opens up a lot of opportunities for composition, cancellation, and resource management.&lt;/p&gt;

&lt;p&gt;Using  &lt;code&gt;Effect&lt;/code&gt;, the lowest-level operation we need is…simply launching a browser. Makes sense that this should be our Step 1:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;// STEP 1: Actually launch the browser  

// tryPromise converts a Promise-returning function into an Effect  
// Errors are caught and converted to typed errors (BrowserError)  
const launchBrowser = (  
  proxyConfig: ProxyConfig  
): Effect.Effect&amp;lt;Browser, BrowserError, never&amp;gt; =&amp;gt;  
  Effect.tryPromise({  
    try: async () =&amp;gt; {  
      process.env['NODE_TLS_REJECT_UNAUTHORIZED'] = '0'; // disable SSL validation for Bright Data proxy  
      return await puppeteer.launch({  
        headless: true,  
        ignoreHTTPSErrors: true, // ignore SSL certificate errors  
        args: [  
          '--no-sandbox',  
          '--disable-setuid-sandbox',  
          `--proxy-server=${proxyConfig.host}:${proxyConfig.port}`, // proxy host:port (credentials set via page.authenticate later)  
        ],  
      });  
    },  
    catch: (error: unknown) =&amp;gt;  
      new BrowserError({  
        message: 'Failed to launch browser with Bright Data proxy',  
        cause: error,  
      }),  
  });
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The  &lt;code&gt;never&lt;/code&gt; in the  &lt;code&gt;Requirements&lt;/code&gt; position means the effect doesn’t require any external dependencies or context to run.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://Effect.tryPromise" rel="noopener noreferrer"&gt;&lt;code&gt;Effect.tryPromise&lt;/code&gt;&lt;/a&gt;  will convert a Promise-returning function into an Effect (here,  &lt;code&gt;puppeteer.launch&lt;/code&gt;). Any thrown error gets mapped into a typed failure — since I explicitly use a catch function here, it will explicitly map it to an error of type  &lt;code&gt;BrowserError&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;Your proxy config can live in a separate object like so. Using a proxy is technically optional, but I already had access to  &lt;a href="https://get.brightdata.com/bd-residential-proxies?utm_content=building_a_fault_tolerant_web_data_ingestion_pipeline_with_effect_ts" rel="noopener noreferrer"&gt;residential proxies&lt;/a&gt;  and that handles the messy parts for me — fingerprinting, CAPTCHA solving, IP rotation, and geo-targeting — so in my pipeline, my Puppeteer instances behave like a real user instead of getting blocked immediately, with no extra code on my part.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://get.brightdata.com/bd7914?utm_content=building_a_fault_tolerant_web_data_ingestion_pipeline_with_effect_ts&amp;amp;source=post_page-----0bc5494282ba---------------------------------------" rel="noopener noreferrer"&gt;Bright Data - All in One Platform for Proxies and Web Scraping&lt;/a&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;// Bright Data HTTP Proxy configuration (from env vars or .env file)  
// You'll get these values from your dashboard when you sign up  
const BRIGHT_DATA_CONFIG = {  
  customerId: process.env.BRIGHT_DATA_CUSTOMER_ID,  
  zone: process.env.BRIGHT_DATA_ZONE,  
  password: process.env.BRIGHT_DATA_PASSWORD,  
  proxyHost: 'brd.superproxy.io',  
  proxyPort: 33335,  
};  

// Validate configuration  
if (!BRIGHT_DATA_CONFIG.customerId || !BRIGHT_DATA_CONFIG.zone || !BRIGHT_DATA_CONFIG.password) {  
  throw new Error(  
    'Bright Data configuration missing. Set BRIGHT_DATA_CUSTOMER_ID, BRIGHT_DATA_ZONE, and BRIGHT_DATA_PASSWORD environment variables or add them to .env file'  
  );  
}  

interface ProxyConfig {  
  host: string;  
  port: number;  
  username: string;  
  password: string;  
}  

const buildProxyConfig = (): ProxyConfig =&amp;gt; {  
  const username = `brd-customer-${BRIGHT_DATA_CONFIG.customerId}-zone-${BRIGHT_DATA_CONFIG.zone}`;  
  return {  
    host: BRIGHT_DATA_CONFIG.proxyHost,  
    port: BRIGHT_DATA_CONFIG.proxyPort,  
    username,  
    password: BRIGHT_DATA_CONFIG.password!,  
  };  
};
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Those proxy config values are just the credentials you get  &lt;a href="https://get.brightdata.com/bd-web-unlocker?utm_content=building_a_fault_tolerant_web_data_ingestion_pipeline_with_effect_ts" rel="noopener noreferrer"&gt;when you set up a proxy to use&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;With Puppeteer, you use proxies via  &lt;code&gt;page.authenticate()&lt;/code&gt;, which is where we’ll use this in the next step.&lt;/p&gt;

&lt;p&gt;Alright, so as of now, we have a Puppeteer instance up and running. Next, we need navigation and content extraction. We’ll use &lt;a href="https://effect.website/docs/resource-management/introduction/#acquireuserelease" rel="noopener noreferrer"&gt;&lt;code&gt;Effect.acquireUseRelease&lt;/code&gt;&lt;/a&gt; to do this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;// STEP 2: Go to page, extract content.   

const navigatePageAndGetContent = (  
  browser: Browser,         // this was returned as a result of what we did in Step 1  
  url: string,              // the URL to go to  
  proxyConfig: ProxyConfig, // we already set this up earlier  
  timeout: number           // use your own value in ms  
) =&amp;gt;  
  Effect.acquireUseRelease(  
    // acquire: create the page  
    // use: navigate and get content  
    // release: always close the page, even on error  
  );
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This &lt;a href="https://effect.website/docs/resource-management/introduction/#acquireuserelease" rel="noopener noreferrer"&gt;acquireUseRelease&lt;/a&gt; is Effect’s version of &lt;code&gt;try / finally&lt;/code&gt;.  You use it when describing real-world operations where you have to work with external resources (database connections, network stuff, etc.) that must be acquired, used properly, and released when no longer needed (even an error occurs).&lt;/p&gt;

&lt;p&gt;It  &lt;em&gt;always&lt;/em&gt; involves a 3-step process. For us, this will involve:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Acquire&lt;/strong&gt;: open a page in Puppeteer using a proxy that we authenticate&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Use&lt;/strong&gt;: navigate, check status codes, return HTML&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Release&lt;/strong&gt;: close the page, even if something failed&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;You don’t have to explicitly remember to do cleanup — the structure enforces it.&lt;/p&gt;

&lt;p&gt;Let’s look at all of those steps in detail.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;// STEP 2: Go to page, extract content.   

// Effect.acquireUseRelease manages resource lifecycle: acquire, use, and release  
// Ensures cleanup happens even if errors occur (like try/finally)  
// See: https://effect.website/docs/resource-management/introduction  
const navigatePageAndGetContent = (  
  browser: Browser,  
  url: string,  
  proxyConfig: ProxyConfig,   
  timeout: number = 10000   
): Effect.Effect&amp;lt;string, BrowserError | TimeoutError | IPBlockError | RateLimitError | NetworkError&amp;gt; =&amp;gt;  
  Effect.acquireUseRelease(  
    // STEP 2.1: acquire: create the page  
    Effect.tryPromise({  
      try: async () =&amp;gt; {  
        const page = await browser.newPage();  
        await page.authenticate({ username: proxyConfig.username, password: proxyConfig.password }); // authenticate with Bright Data proxy  
        return page;  
      },  
      catch: (error: unknown) =&amp;gt;  
        new BrowserError({  
          message: 'Failed to create page or authenticate',  
          cause: error,  
        }),  
    }),  
    // STEP 2.2: use: navigate and get content  
    (page) =&amp;gt;  
      Effect.tryPromise({  
        try: async () =&amp;gt; {  
          const response = await page.goto(url, {  
            waitUntil: 'networkidle2', // use 'load' if 'networkidle2' fails - proxies can have background requests that never stop  
            timeout: timeout    
          });  

          // check for HTTP errors that indicate blocks/rate limits  
          if (response) {  
            const status = response.status();  
            if (status === 429) {  
              throw new RateLimitError({  
                message: `Rate limited: ${url}`,  
                url,  
              });  
            }  
            if (status === 403) {  
              throw new IPBlockError({  
                message: `IP blocked: ${url}`,  
                url,  
              });  
            }  
            if (status &amp;gt;= 400) {  
              throw new NetworkError({  
                message: `HTTP error ${status}: ${url}`,  
                url,  
              });  
            }  
          }  

          return await page.content();  
        },  
        catch: (error: unknown) =&amp;gt; {  
          if (error instanceof RateLimitError || error instanceof IPBlockError) {  
            return error;  
          }  
          if (error instanceof NetworkError) {  
            return error;  
          }  
          if (error instanceof Error &amp;amp;&amp;amp; error.message.includes('timeout')) {  
            return new TimeoutError({  
              message: `Navigation timeout after ${timeout}ms`,  
              url,  
              timeout,  
            });  
          }  
          return new BrowserError({  
            message: 'Failed to navigate or get content',  
            cause: error,  
          });  
        },  
      }),  
    // STEP 2.3: release: always close the page, even on error  
    (page) =&amp;gt;  
      pipe(  
        Effect.tryPromise({  
          try: async () =&amp;gt; await page.close(),  
          catch: () =&amp;gt; new Error('Failed to close page'),  
        }),  
        Effect.catchAll(() =&amp;gt; Effect.void) // ignore close errors  
      )  
  );
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Effect's &lt;a href="https://effect.website/docs/getting-started/building-pipelines/#pipe" rel="noopener noreferrer"&gt;pipe&lt;/a&gt; (seen here in Step 2.3) composes functions left-to-right, passing the output of one as input to the next. It makes Effect operations readable instead of nested. So while reading, you start with  &lt;a href="https://Effect.tryPromise" rel="noopener noreferrer"&gt;&lt;code&gt;Effect.tryPromise&lt;/code&gt;&lt;/a&gt;, then apply  &lt;a href="https://Effect.catchAll" rel="noopener noreferrer"&gt;&lt;code&gt;Effect.catchAll&lt;/code&gt;&lt;/a&gt;&lt;/p&gt;
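
&lt;p&gt;If you haven’t seen it before, &lt;code&gt;pipe&lt;/code&gt; on its own is just left-to-right function application. Here’s a tiny standalone example:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import { pipe } from 'effect';  

// pipe(value, f, g) is equivalent to g(f(value)), read top to bottom  
const result = pipe(  
  2,  
  (n) =&amp;gt; n + 1, // 3  
  (n) =&amp;gt; n * 10 // 30  
);
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;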

&lt;p&gt;Great, we now have a Puppeteer instance, and we can use it to go visit a page, extract the content we want, and close the page. Now we just have to bring it all together, i.e. manage the browser lifecycle.&lt;/p&gt;

&lt;p&gt;This should  &lt;em&gt;also&lt;/em&gt;  be on an  &lt;strong&gt;acquire → use → release&lt;/strong&gt; cycle, this time on a browser level rather than a page level.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;const scrapeUrl = (  
 url: string,  
 options?: { timeout?: number }  
 ): Effect.Effect&amp;lt;  
 string,  
 BrowserError | TimeoutError | IPBlockError | RateLimitError | NetworkError, never&amp;gt;  
=&amp;gt; {  
 const proxyConfig = buildProxyConfig();  
 const timeout = options?.timeout || 10000;  
// Bright Data automatically rotates IPs on each request,  
 // so retrying after an IP block gets a fresh IP  
 return Effect.acquireUseRelease(  
 // STEP 1: ACQUIRE -- launch browser  
 launchBrowser(proxyConfig),  

     // STEP 2: USE -- navigate and extract HTML  
(browser) =&amp;gt;  
  navigatePageAndGetContent(browser, url, proxyConfig, timeout),  

// STEP 3: RELEASE -- always clean up the browser  
(browser) =&amp;gt;  
  pipe(  
    Effect.tryPromise({  
      try: async () =&amp;gt; await browser.close(),  
      catch: () =&amp;gt; new Error('Failed to close browser'),  
    }),  
    Effect.catchAll(() =&amp;gt; Effect.void)  
  )  
 );  
};
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;blockquote&gt;
&lt;p&gt;💡 In production you’ll usually also want to split the &lt;strong&gt;fetch → parse → exit&lt;/strong&gt; cycle into &lt;strong&gt;fetch → persist raw → parse → persist parsed&lt;/strong&gt;, so you can debug raw HTML later or parallelize parsing. (A rough sketch of this follows below.)&lt;/p&gt;
&lt;/blockquote&gt;
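
&lt;p&gt;As a rough sketch of what that split could look like, reusing &lt;code&gt;scrapeUrl&lt;/code&gt; and &lt;code&gt;BrowserError&lt;/code&gt; from above (the &lt;code&gt;persistRawHtml&lt;/code&gt; helper is hypothetical, not part of the pipeline itself):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import { writeFile } from 'node:fs/promises';  

// Hypothetical helper: persist the raw HTML before parsing, so failed parses  
// can be debugged (or re-parsed) later without re-fetching the page  
const persistRawHtml = (url: string, html: string) =&amp;gt;  
  Effect.tryPromise({  
    try: () =&amp;gt; writeFile(`raw-${encodeURIComponent(url)}.html`, html),  
    catch: (error: unknown) =&amp;gt;  
      new BrowserError({ message: 'Failed to persist raw HTML', cause: error }),  
  });  

const fetchAndPersist = (url: string) =&amp;gt;  
  pipe(  
    scrapeUrl(url, { timeout: 30000 }),  
    Effect.tap((html) =&amp;gt; persistRawHtml(url, html)) // parsing can now happen later, or in parallel  
  );
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;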

&lt;p&gt;Again, each step is expressed as a function returning an  &lt;code&gt;Effect&lt;/code&gt;, and cleanup is  &lt;em&gt;guaranteed&lt;/em&gt;, even if navigation or retries fail.&lt;/p&gt;

&lt;p&gt;At the end of this stage we have the basic Puppeteer loop: visiting a dynamic page, extracting its HTML, and cleaning up after ourselves.&lt;/p&gt;

&lt;p&gt;But there’s more to do — namely, making all of the above work with parsing (business logic), and retry behavior + rate limiting (cross-cutting concerns.)&lt;/p&gt;

&lt;h2&gt;
  
  
  Retries That Understand &lt;em&gt;Why&lt;/em&gt;  Something Failed
&lt;/h2&gt;

&lt;p&gt;Effect makes retry behavior declarative  via its &lt;a href="https://effect.website/docs/scheduling/built-in-schedules/#exponential" rel="noopener noreferrer"&gt;Schedule&lt;/a&gt; API.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;// remember we defined which errors were retryable in step 1  
// here, first, we define HOW we should schedule retries…  

const retryPolicy = pipe(  
  Schedule.exponential(Duration.seconds(1)),  
  Schedule.intersect(Schedule.recurs(3))  
);
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This says: “retry with exponential backoff starting at 1 second, up to 3 times.”&lt;/p&gt;

&lt;p&gt;But the schedule alone isn’t enough. The system also needs to know  &lt;em&gt;which failures deserve a retry&lt;/em&gt;. Luckily, we already know those.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;//… and then actually retry the ones which are retryable  
// Effect&amp;lt;A, E, R&amp;gt; is just the Effect ecosystem’s convention/shorthand for Effect&amp;lt;Success, Error, Requirements&amp;gt;  

const retryIfRetryable = &amp;lt;A, E extends ScrapingError, R&amp;gt;( effect: Effect.Effect&amp;lt;A, E, R&amp;gt;) =&amp;gt;  
  Effect.retry(effect, {  
    schedule: retryPolicy,  
    while: (error) =&amp;gt; isRetryableError(error), // keep retrying only while the error is a retryable one  
  });
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The &lt;code&gt;while&lt;/code&gt; predicate is the key. Instead of retrying blindly, the system checks the error type and only retries when it makes sense. If you hit a &lt;code&gt;ParseError&lt;/code&gt; (not in our list of retryable errors), the pipeline fails immediately — which makes sense; there’s no point in hammering a broken CSS selector.&lt;/p&gt;

&lt;p&gt;This is where our tagged errors from Step 1 pay off. The retry logic doesn’t inspect strings or guess intent. It operates entirely on types.&lt;/p&gt;

&lt;p&gt;We’ll use  &lt;code&gt;retryIfRetryable&lt;/code&gt;  at the very end, when we’re bringing all parts of our pipeline together.&lt;/p&gt;

&lt;h2&gt;
  
  
  Rate Limiting as a Declarative Policy
&lt;/h2&gt;

&lt;p&gt;Rate limiting is really a form of &lt;a href="https://en.wikipedia.org/wiki/Backpressure_routing" rel="noopener noreferrer"&gt;backpressure&lt;/a&gt; rather than error handling proper. That is — you don’t want to wait until you get rate-limited to slow down; you want to prevent it in the first place.&lt;/p&gt;

&lt;p&gt;For this tutorial, we’ll keep rate limiting intentionally boring. Because even our rate limiter is an Effect (&lt;code&gt;Effect.Effect&lt;/code&gt;), it composes cleanly with the retries and resource management above.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;const withSimpleRateLimit = &amp;lt;A, E, R&amp;gt;( effect: Effect.Effect&amp;lt;A, E, R&amp;gt;) =&amp;gt;  
  pipe(  
    Effect.sleep(Duration.millis(100)),  
    Effect.flatMap(() =&amp;gt; effect)  
  );
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This simply introduces a delay  &lt;em&gt;before&lt;/em&gt;  the effect runs. Just like retries, we’ll use  &lt;code&gt;withSimpleRateLimit&lt;/code&gt; at the very end when composing the pipeline.&lt;/p&gt;

&lt;p&gt;Of course, if you wanted to, you &lt;em&gt;could&lt;/em&gt; go all out — you can build production-grade rate limiting because Effect provides primitives like &lt;a href="https://effect.website/docs/state-management/ref/" rel="noopener noreferrer"&gt;Ref&lt;/a&gt; (keep track of some form of state), &lt;a href="https://effect.website/docs/concurrency/queue/" rel="noopener noreferrer"&gt;Queue&lt;/a&gt; (a lightweight in-memory queue), and &lt;a href="https://effect.website/docs/scheduling/built-in-schedules/" rel="noopener noreferrer"&gt;Schedule&lt;/a&gt; (which we used in the previous step).&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;💡 I’m not going to go into detail on building a full rate limiter with Effect because that’s way too much cognitive load for just a blogpost. The point isn’t to build perfect throttling — it’s to show that  &lt;strong&gt;backpressure can be a first-class part of the pipeline in Effect&lt;/strong&gt;.&lt;/p&gt;
&lt;/blockquote&gt;
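
&lt;p&gt;Just to show the shape of that, though, here’s a minimal (deliberately naive, not concurrency-safe) sketch of a &lt;code&gt;Ref&lt;/code&gt;-based spacing policy. This is illustration only, not production throttling:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import { Clock, Duration, Effect, Ref, pipe } from 'effect';  

// Hypothetical sketch: enforce a minimum interval between runs of any effect  
// passed through the returned wrapper. State lives in a Ref, time comes from Clock.  
const makeSpacer = (intervalMs: number) =&amp;gt;  
  Effect.map(Ref.make(0), (lastRunRef) =&amp;gt;  
    &amp;lt;A, E, R&amp;gt;(effect: Effect.Effect&amp;lt;A, E, R&amp;gt;) =&amp;gt;  
      pipe(  
        Effect.all([Clock.currentTimeMillis, Ref.get(lastRunRef)]),  
        Effect.flatMap(([now, last]) =&amp;gt; {  
          const wait = Math.max(0, intervalMs - (now - last));  
          return pipe(  
            Ref.set(lastRunRef, now + wait),  
            Effect.zipRight(Effect.sleep(Duration.millis(wait))),  
            Effect.zipRight(effect)  
          );  
        })  
      )  
  );
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;You’d build the spacer once (it’s an Effect itself) and thread the returned wrapper through the pipeline the same way &lt;code&gt;withSimpleRateLimit&lt;/code&gt; is used later when composing everything.&lt;/p&gt;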

&lt;h2&gt;
  
  
  Parsing as Its Own Failure Domain
&lt;/h2&gt;

&lt;p&gt;Finally, remember that while parsing logic is synchronous, it can still fail. We haven’t accounted for those failures yet.&lt;/p&gt;

&lt;p&gt;This is our HTML parsing step to get the data we need. Instead of letting &lt;code&gt;ParseError&lt;/code&gt;s throw, it’s better if we wrap the logic in &lt;code&gt;Effect.try&lt;/code&gt; — this is similar to &lt;code&gt;Effect.tryPromise&lt;/code&gt; from earlier, but ONLY for synchronous functions that may throw (like the &lt;code&gt;cheerio&lt;/code&gt; parsing here).&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;// This is just your scraping logic with selectors for the data you want  
// For this one we just get h1’s and spans.  

interface ScrapingResult {  
  title: string;  
  spans: string[];  
  url: string;  
}  

// Effect.try wraps synchronous logic that may throw  
// Errors are caught and converted to typed errors (ParseError)  
const parseHtml = (html: string): Effect.Effect&amp;lt;ScrapingResult, ParseError, never&amp;gt; =&amp;gt;  
  Effect.try({  
    try: () =&amp;gt; {  
      const $ = cheerio.load(html);  
      const title = $('h1').text().trim();  
      const spans = $('span')  
        .map((_i: number, el: any) =&amp;gt; $(el).text().trim())  
        .get()  
        .filter((s: string) =&amp;gt; s.length &amp;gt; 0);  

      return {  
        title,  
        spans,  
        url: TARGET_URL,  
      };  
    },  
    catch: (error: unknown) =&amp;gt;  
      new ParseError({  
        message: 'Failed to parse HTML',  
        cause: error,  
      }),  
  });  
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This completes our error modeling — now parsing failures are distinct from network failures, and both are handled properly. If parsing breaks, &lt;em&gt;all&lt;/em&gt; our retries stop. That behavior should be intentional, and now it is.&lt;/p&gt;

&lt;h2&gt;
  
  
  Composing the Pipeline
&lt;/h2&gt;

&lt;p&gt;Up to this point, we’ve built individual pieces in isolation:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;scrapeUrl&lt;/code&gt; knows how to fetch HTML safely&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;retryIfRetryable&lt;/code&gt; knows when to retry&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;withSimpleRateLimit&lt;/code&gt; enforces basic throttling&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;parseHtml&lt;/code&gt; turns raw HTML into structured data as per our domain logic (we provide the selectors we need)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Now we compose them into a  &lt;strong&gt;single pipeline&lt;/strong&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;const scrapeWithRetry = (): Effect.Effect&amp;lt;  
  ScrapingResult,  
  ScrapingError  
&amp;gt; =&amp;gt;  
  pipe(  
    // Step 2: fetch HTML  
    scrapeUrl(TARGET_URL, { timeout: 30000 }),  

    // Step 4: apply rate limiting  
    withSimpleRateLimit,  

    // Step 3: retry transient failures  
    retryIfRetryable,  

    // Step 5: parse HTML  
    Effect.flatMap(parseHtml)  
  );
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Read this  &lt;strong&gt;top to bottom&lt;/strong&gt;.&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;We start by fetching HTML with  &lt;code&gt;scrapeUrl&lt;/code&gt; (which instantiates and uses Puppeteer)&lt;/li&gt;
&lt;li&gt;That operation is rate-limited&lt;/li&gt;
&lt;li&gt;If it fails with a retryable error, it’s retried with backoff&lt;/li&gt;
&lt;li&gt;If it succeeds, we move on to parsing&lt;/li&gt;
&lt;li&gt;If parsing fails, the whole Effect fails immediately (as it should)&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;There are no callbacks here, no  &lt;code&gt;try/catch&lt;/code&gt;, and no manual error propagation. Control flow is handled by the Effect runtime.&lt;/p&gt;

&lt;p&gt;Crucially,  &lt;strong&gt;this function does not run anything yet&lt;/strong&gt;. It only  &lt;em&gt;describes&lt;/em&gt;  what should happen.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;💡  I’ve skipped observability for this blog post, but in production, you should add more detailed logging, save retry metrics/failures, tracing per URL or proxy etc.  &lt;a href="https://effect.website/docs/observability/logging/" rel="noopener noreferrer"&gt;Effect makes this very easy with its Logging APIs&lt;/a&gt;.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2&gt;
  
  
  Executing the Pipeline
&lt;/h2&gt;

&lt;p&gt;So far, we’ve built a description of a workflow. To actually execute it, we need to define what happens at the  &lt;em&gt;edges&lt;/em&gt;  of the system (and this is where we finally use  &lt;code&gt;scrapeWithRetry&lt;/code&gt;.)&lt;/p&gt;

&lt;p&gt;That’s what the final  &lt;code&gt;program&lt;/code&gt; does. This is again a  &lt;code&gt;pipe&lt;/code&gt;, so these happen sequentially:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;const program = pipe(  
  scrapeWithRetry(),  

  // Log success  
  Effect.tap((result: ScrapingResult) =&amp;gt;  
    pipe(  
      Console.log('Scraping successful!'),  
      Effect.flatMap(() =&amp;gt;  
        Console.log(JSON.stringify(result, null, 2))  
      )  
    )  
  ),  

  // Handle all failures in one place  
  Effect.catchAll((error: ScrapingError) =&amp;gt;  
    pipe(  
      Console.error('Pipeline failed:', error),  
      Effect.flatMap(() =&amp;gt;  
        Effect.sync(() =&amp;gt; process.exit(1))  
      )  
    )  
  )  
);
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Finally, here’s how you run this.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;// This is the entry point that ACTUALLY kicks off the entire pipeline  
Effect.runPromise(program).catch((error: unknown) =&amp;gt; {  
  console.error('Unhandled error:', error);  
  process.exit(1);  
});
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;blockquote&gt;
&lt;p&gt;The  &lt;code&gt;.catch()&lt;/code&gt; wrapper here handles any truly unexpected errors that escape the Effect system (really shouldn’t happen, but it’s defensive programming).&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;This is the moment where everything becomes “real” — the browser is launched, requests are made, retries happen, resources are acquired and released, and logs are written.&lt;/p&gt;

&lt;p&gt;Until this line runs, nothing has executed.&lt;/p&gt;

&lt;p&gt;That mental separation —  &lt;em&gt;describing&lt;/em&gt;  a workflow first, then  &lt;em&gt;running&lt;/em&gt;  it explicitly — is one of the key reasons Effect works well for systems like this. You can reason about behavior before anything touches the network.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why This Matters for Production Data Pipelines
&lt;/h2&gt;

&lt;p&gt;So what did we build? Our pipeline has a few important properties:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Our errors are part of the type system and MUST be handled or intentionally propagated, and structured errors make any debugging or observability WAY easier&lt;/li&gt;
&lt;li&gt;All of our resource lifecycles are enforced by construction&lt;/li&gt;
&lt;li&gt;Our retry behavior is declarative, composable, and constrained by error types. In general, all cross-cutting concerns (retries, rate limits, cleanup) compose without refactoring core logic&lt;/li&gt;
&lt;li&gt;All failure modes are explicit and discoverable at compile time&lt;/li&gt;
&lt;li&gt;Any concurrency is safer by default, especially around shared resources&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That structure is what makes this production pipeline  &lt;em&gt;evolvable&lt;/em&gt;. You can add ingestion sources, tune retries, adjust rate limits, or add observability without turning the code into a pile of special cases.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;This system will always,  &lt;em&gt;always&lt;/em&gt; fail predictably, with enough context to debug what went wrong and why.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;This kind of setup pays off when you’re scraping at scale, your failures have real business impact, and you need debuggability and auditability — in short, when it’s something a team will maintain.&lt;/p&gt;

&lt;p&gt;Most scraping tutorials stop at “&lt;em&gt;how to fetch a lot of HTML without getting blocked.&lt;/em&gt;” That’s not the hard part. The hard part is building something you &lt;em&gt;don’t have to babysit&lt;/em&gt;. Effect gives you a way to model failure honestly, plus a lot of ready-made, first-class APIs for the parts application code should never have to reinvent from scratch, and you can add proxies on top to handle CAPTCHAs and general unblocking. It’s a solid foundation to work off of.&lt;/p&gt;

&lt;p&gt;It’s way more difficult, absolutely — Effect’s learning curve is more like a cliff wall — but it’s also way more reliable. &lt;strong&gt;And when a system runs unattended in production, that reliability is what actually matters.&lt;/strong&gt;&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Full source code:&lt;/em&gt; &lt;a href="https://github.com/sixthextinction/effect-ts-scraping" rel="noopener noreferrer"&gt;&lt;strong&gt;https://github.com/sixthextinction/effect-ts-scraping&lt;/strong&gt;&lt;/a&gt;&lt;/p&gt;

</description>
      <category>webdev</category>
      <category>programming</category>
      <category>typescript</category>
      <category>javascript</category>
    </item>
    <item>
      <title>A Beginner's Guide To Building Data Pipelines with Apache Arrow</title>
      <dc:creator>Prithwish Nath</dc:creator>
      <pubDate>Tue, 20 Jan 2026 07:26:18 +0000</pubDate>
      <link>https://dev.to/prithwish_nath/a-beginners-guide-to-building-data-pipelines-with-apache-arrow-3k8m</link>
      <guid>https://dev.to/prithwish_nath/a-beginners-guide-to-building-data-pipelines-with-apache-arrow-3k8m</guid>
      <description>&lt;p&gt;If you’ve ever built data pipelines for analytics, feature extraction, or model training, you’ve probably noticed a pattern: scraping or ingestion is rarely your bottleneck. It’s that the pipeline  &lt;em&gt;technically works&lt;/em&gt;, but compute and memory usage spike, and scaling the system becomes more expensive overall.&lt;/p&gt;

&lt;p&gt;I’m betting the issue is not network I/O or parsing itself. It’s what happens  &lt;em&gt;after&lt;/em&gt;  your data arrives.&lt;/p&gt;

&lt;p&gt;Your pipeline probably relies on JSON as the interchange format between stages. Over time, data is parsed, transformed, cached, reloaded, and exported multiple times. Each step looks reasonable in isolation but taken together, they add up to a large (and unnecessary) cost.&lt;/p&gt;

&lt;p&gt;Let’s talk about that cost, explain where it comes from, and show how Apache Arrow can be used to avoid it. We’ll build a small end-to-end pipeline, benchmark it against a traditional JSON approach, and look at the hard numbers directly.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;&lt;em&gt;🔥 Spoilers:&lt;/em&gt;&lt;/strong&gt; &lt;em&gt;With Arrow, on average:&lt;/em&gt; &lt;strong&gt;&lt;em&gt;~2.6x faster processing&lt;/em&gt;&lt;/strong&gt;&lt;em&gt;,&lt;/em&gt; &lt;strong&gt;&lt;em&gt;~84% less memory usage, ~98–99% lower storage and I/O costs — all adding up to a ~60% reduction in compute cost&lt;/em&gt;&lt;/strong&gt; &lt;em&gt;without even considering storage/bandwidth. Read on to know more.&lt;/em&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2&gt;
  
  
  What is Apache Arrow?
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://github.com/apache/arrow" rel="noopener noreferrer"&gt;Apache Arrow is a columnar, in-memory data format that is open-source.&lt;/a&gt; Instead of representing data as a collection of rows (say, a list of dictionaries), it stores each column as a contiguous buffer of typed values.&lt;/p&gt;

&lt;p&gt;Turns out, that design has a few important consequences:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  Operations can work on entire columns at once&lt;/li&gt;
&lt;li&gt;  Memory layouts are predictable and cache-friendly&lt;/li&gt;
&lt;li&gt;  Many transformations can reuse existing buffers&lt;/li&gt;
&lt;li&gt;  Serialization to formats like Parquet avoids intermediate language-specific objects&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The key point I’m trying to make is not that Arrow is “faster” in isolation, but that it changes the execution model. Once data is in Arrow format, most transformations no longer involve Python objects  &lt;em&gt;at all.&lt;/em&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;em&gt;💡&lt;/em&gt;  FYI “Zero-copy” here doesn’t mean  &lt;em&gt;no memory is ever allocated&lt;/em&gt;. It means you avoid repeated parse/encode cycles and Python-level object creation — the dominant cost in traditional pipelines.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2&gt;
  
  
  The Problem Isn’t “JSON vs. Arrow”
&lt;/h2&gt;

&lt;p&gt;The JSON you get back from your API call is fundamentally just…&lt;em&gt;text&lt;/em&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# This is what you actually get from the API  
api_response = '{"organic": [{"title": "Example", "position": 1}, ...]}'  
# Then you parse it, and...  
data = json.loads(api_response) # ...now it's a Python dict, not JSON anymore!  
# Your data is now:  
# {  
#  "organic": [  
#    {"title": "Example", "position": 1}, # Each result is a Python dict object  
#    {"title": "Another", "position": 2},  
#    # hundreds more  
#  ]  
# }
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The moment you parse it, you’re no longer working with JSON, but with language-native object graphs. In Python, that means dictionaries, lists, strings, and integers allocated on the heap. From then on,  &lt;em&gt;every&lt;/em&gt; transformation operates on that object graph. Want to access the  &lt;code&gt;organic.title&lt;/code&gt; field? That’s actually a hashmap lookup operation under the hood.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;em&gt;💡&lt;/em&gt;  Python  &lt;em&gt;is&lt;/em&gt; uniquely bad at this (because dicts are hashtables with high overhead, every int/string/bool is a heap object, pointers everywhere, no JIT etc.), but this isn’t a problem limited to Python. Node.js (V8, with JIT) is much faster at object-heavy workloads for example, but once JSON is parsed, the data still becomes arrays of JavaScript objects processed one row at a time + each filter, sort, or map still allocates new arrays and performs property lookups. V8 makes this faster so you hit the wall much later,  &lt;strong&gt;but no JIT or interpreter can escape this fundamental shape.&lt;/strong&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Instead, you should think about the root problem as  &lt;strong&gt;Row-oriented object graphs vs Columnar memory structures.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;With the former (how Python does it normally), each row is its own “container”, and each field is a separate object allocated on the heap (spread out randomly across system memory) and connected by pointers. If your CPU wants to process that data, the runtime has to loop row by row, chasing pointers across memory to reach and manipulate each object.&lt;/p&gt;

&lt;p&gt;Apache Arrow simply sidesteps this problem. Instead of grouping values by row, it groups values by whole columns and stores them in typed, contiguous buffers — laid out sequentially in system memory.  &lt;strong&gt;Each column buffer, directly, is the “container” now.&lt;/strong&gt;  Filtering, sorting, and aggregation  &lt;em&gt;directly&lt;/em&gt; operate on these system memory buffers.&lt;/p&gt;

&lt;p&gt;So you’re moving computation out of Python’s object model entirely. Operating at that lower level of memory, rather than through Python’s abstractions, is why we get the speed and storage gains we do.&lt;/p&gt;

&lt;p&gt;Here are the typical workflow problems you solve by switching to Apache Arrow:&lt;/p&gt;

&lt;h3&gt;
  
  
  Problem 1: Per-Element Execution
&lt;/h3&gt;

&lt;p&gt;A typical filtering op in Python will probably look like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;filtered = [r for r in organic if r.get('position', 0) &amp;lt;= 10]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Super easy to understand, straightforward, idiomatic. It is also, fundamentally, 100% row-oriented:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  One loop iteration per result&lt;/li&gt;
&lt;li&gt;  One dictionary lookup per row&lt;/li&gt;
&lt;li&gt;  One conditional branch per row&lt;/li&gt;
&lt;li&gt;  A new Python list allocated for the result&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;As inputs grow into the thousands or millions of rows, that per-row work becomes the dominant cost.&lt;/p&gt;

&lt;p&gt;In contrast, Arrow expresses the same operation at the column level:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import pyarrow.compute as pc  

# Filtering with pyarrow - Python bindings for Arrow  

filtered = table.filter(pc.less_equal(table['position'], 10))
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;What Arrow does:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  Actually runs optimized C++ code under the hood (via  &lt;code&gt;pyarrow&lt;/code&gt;, its Python library) to operate on system memory directly&lt;/li&gt;
&lt;li&gt;  Operates on the entire column at once (vectorized)&lt;/li&gt;
&lt;li&gt;  Avoids Python’s interpreter entirely in the hot path&lt;/li&gt;
&lt;li&gt;  Returns a new table referencing existing buffers — you reuse memory automatically whenever possible.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Here, the comparison runs inside optimized native code (and the optimizations are ones you could never make operating solely at the Python level), operating on a contiguous buffer of values. Python is not involved in the inner loop.  &lt;em&gt;Ever&lt;/em&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  Problem 2: You Convert More Than You Think
&lt;/h3&gt;

&lt;p&gt;In real systems, data is rarely parsed once and discarded. You’re probably converting between formats at every stage of your pipeline without even realizing it:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# 1. Fetch from API = gives you a JSON string  
response = requests.get(url).text  
# 2. Parse to Python objects (CONVERSION #1)  
data = json.loads(response)  
# 3. Process data (Python dicts)  
filtered = [r for r in data if …]  
# 4. Save to cache or disk (CONVERSION #2)  
with open('cache.json', 'w') as f:  
json.dump(filtered, f)  
# 5. Later, read from cache (CONVERSION #3)  
with open('cache.json', 'r') as f:  
cached = json.load(f)  
# 6. Export to CSV or another format (CONVERSION #4)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Every  &lt;code&gt;json.loads()&lt;/code&gt; and &lt;code&gt;json.dumps()&lt;/code&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  Parses or serializes the entire dataset&lt;/li&gt;
&lt;li&gt;  Allocates new Python objects&lt;/li&gt;
&lt;li&gt;  Goes back and visits every value again&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This overhead compounds with batch size and iteration count. You’re paying that same Problem #1 cost  &lt;em&gt;repeatedly&lt;/em&gt;.&lt;/p&gt;
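
&lt;p&gt;If you genuinely need to persist intermediate results, the fix isn’t “never cache”; it’s caching in a columnar format so re-reading doesn’t rebuild the object graph. A rough sketch (the table contents and file path are placeholders), using the same &lt;code&gt;pyarrow&lt;/code&gt; Parquet helpers we’ll use later:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import pyarrow as pa
import pyarrow.parquet as pq

# Placeholder table standing in for your converted API response (see Step 2)
table = pa.table({"title": ["Example", "Another"], "position": [1, 2]})

# Cache to disk in a columnar format instead of json.dumps()...
pq.write_table(table, "cache.parquet")

# ...and later reads come back as an Arrow table, not a Python object graph
cached = pq.read_table("cache.parquet")
print(cached.num_rows, cached.schema)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
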

&lt;h3&gt;
  
  
  Problem 3: Memory Overhead
&lt;/h3&gt;

&lt;p&gt;I’ve already said how JSON becomes Python objects once parsed. Just how bad is the problem in terms of space? Consider a simple row:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;result = {  
  "title": "Example",  
  "position": 1,  
  "link": "https://..."  
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;In Python’s memory, this roughly becomes:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  A dictionary with hash table overhead: ~240 bytes&lt;/li&gt;
&lt;li&gt;  Separate string objects for keys (&lt;code&gt;title&lt;/code&gt;,  &lt;code&gt;position&lt;/code&gt;,  &lt;code&gt;link&lt;/code&gt;): ~150 bytes&lt;/li&gt;
&lt;li&gt;  Separate objects for each value: ~100+ bytes each&lt;/li&gt;
&lt;li&gt;  Pointers connecting everything together&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Total:&lt;/strong&gt;  ~500–600 bytes per result.&lt;/p&gt;

&lt;p&gt;Now, exact sizes will of course vary by Python version and workload, but overall, &lt;strong&gt;row-oriented object graphs are memory-heavy and cache-unfriendly.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;In Arrow, this row does  &lt;em&gt;not&lt;/em&gt; exist as a standalone object. There is no dictionary, no per-row container, and no per-field Python object:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  One typed value in an integer buffer (&lt;code&gt;position&lt;/code&gt;): ~4–8 bytes&lt;/li&gt;
&lt;li&gt;  One entry in a string offsets buffer per string field (&lt;code&gt;title&lt;/code&gt; and  &lt;code&gt;link&lt;/code&gt;): ~4–8 bytes each&lt;/li&gt;
&lt;li&gt;  UTF-8 string bytes stored contiguously (amortized, no object headers)&lt;/li&gt;
&lt;li&gt;  Optional validity bits: ~1 bit per value, per column&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Total:&lt;/strong&gt;  ~150–200 bytes per result (depending mostly on string length)&lt;/p&gt;

&lt;p&gt;The difference shows up quickly when you scale beyond toy data sizes.&lt;/p&gt;
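
&lt;p&gt;If you want to sanity-check those ballpark numbers on your own machine, here’s a rough sketch (&lt;code&gt;sys.getsizeof&lt;/code&gt; is shallow, so the Python-side figure is if anything an underestimate):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import sys
import pyarrow as pa

row = {"title": "Example", "position": 1, "link": "https://example.com"}

# Rough Python-side size: the dict itself plus each key and value object (shallow count)
py_bytes = sys.getsizeof(row) + sum(
    sys.getsizeof(k) + sys.getsizeof(v) for k, v in row.items()
)

# The same data as one row of an Arrow table: nbytes counts the actual buffer bytes
table = pa.table({
    "title": ["Example"],
    "position": pa.array([1], type=pa.int32()),
    "link": ["https://example.com"],
})

print("Python dict (approx):", py_bytes, "bytes")
print("Arrow row (approx):  ", table.nbytes, "bytes")
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
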

&lt;h2&gt;
  
  
  What We’re Building
&lt;/h2&gt;

&lt;p&gt;To demonstrate how Arrow serves our needs better, we’ll build a simple data pipeline that:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt; Fetches ~100 results in JSON from an external API (use anything you want that gives you data at scale; I’m going with Google SERP results)&lt;/li&gt;
&lt;li&gt; Converts the JSON response directly to Arrow tables (one-time conversion cost)&lt;/li&gt;
&lt;li&gt; Simulates a real production pipeline load (filtering, sorting, and aggregation) using Arrow-native operations&lt;/li&gt;
&lt;li&gt; Exports to Parquet or CSV with minimal overhead&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;We’ll compare this against a JSON-based version that performs the same logical work, and measure both runtime and memory usage.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;If you already have some data as structured JSON from whatever source, just start at Step 2.&lt;/strong&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Setting Up the Project
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Install dependencies:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;First of all, we’ll need PyArrow. That’s the official Python interface to the Apache Arrow columnar memory format + ecosystem.&lt;/p&gt;

&lt;p&gt;The rest should be self-explanatory.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;pip install pyarrow requests python-dotenv
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If you aren’t skipping Step 1, I’m using &lt;a href="https://get.brightdata.com/bd-serp-api?utm_content=stop_paying_the_json_tax_build_faster_data_pipelines_in_python_with_apache_arrow" rel="noopener noreferrer"&gt;Bright Data’s SERP API&lt;/a&gt; to quickly get JSON data at scale for this demo. For this, you’ll need to sign up, grab your API credentials, and put them in a &lt;code&gt;.env&lt;/code&gt; file:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;BRIGHT_DATA_CUSTOMER_ID=your_customer_id  
BRIGHT_DATA_ZONE=your_zone  
BRIGHT_DATA_PASSWORD=your_password
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Recommended project structure&lt;/strong&gt;  (also optional, really)&lt;strong&gt;:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;project/  
├── src/  
│ ├── api_client.py # SERP API client  
│ ├── arrow_builder.py # JSON → Arrow conversion  
│ └── transformations.py # Arrow-native operations  
├── benchmarks/  
│ └── json_vs_arrow.py # Performance comparison
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;We’ll start by building the things we’ll need to eventually run the benchmark, starting with the API client.&lt;/p&gt;

&lt;h2&gt;
  
  
  Step 1: API Client to Fetch JSON Data
&lt;/h2&gt;

&lt;blockquote&gt;
&lt;p&gt;As mentioned before, skip this step if you already have some JSON.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Our API client will use our SERP API to fetch Google search results. Replace with your own API/implementation. All that matters is that you have a way of getting a lot of clean, structured JSON at scale.&lt;/p&gt;

&lt;p&gt;SERP APIs are just ideal here because we’ll be ramping this up from ~100 to 1,000, 5,000, and even 10,000 results for the benchmark.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# src/api_client.py  

import os  
import requests  
from dotenv import load_dotenv  

load_dotenv()  


class  BrightDataClient:  
def  __init__(self):  
self.api_key = os.getenv("BRIGHT_DATA_API_KEY")  
self.zone = os.getenv("BRIGHT_DATA_ZONE")  

if  not self.api_key or  not self.zone:  
raise ValueError(  
"Missing BRIGHT_DATA_API_KEY or BRIGHT_DATA_ZONE. "  
"Set these in your .env file."  
)  

self.session = requests.Session()  
self.session.headers.update({  
'Authorization': f'Bearer {self.api_key}'  
})  
self.api_endpoint = "https://api.brightdata.com/request"  

def  search(self, query: str, num_results: int = 10):  
search_url = (  
f"https://www.google.com/search"  
f"?q={requests.utils.quote(query)}"  
f"&amp;amp;num={num_results}"  
f"&amp;amp;brd_json=1"  # returns Google search data as structured JSON instead of HTML  
)  

payload = {  
'zone': self.zone,  
'url': search_url,  
'format': 'json'  
}  

response = self.session.post(self.api_endpoint, json=payload, timeout=30)  
response.raise_for_status()  

result = response.json()  
# handle SERP API response format  
if  isinstance(result, dict) and  'body'  in result:  
body = result['body']  
if  isinstance(body, str):  
import json  
body = json.loads(body)  
return body if  isinstance(body, dict) else result  

return result
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Right now this does nothing. We’ll instantiate and use this client to fetch data when we’re benchmarking the pipeline.&lt;/p&gt;
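
&lt;p&gt;If you want a quick smoke test that your credentials work before moving on, something like this (run from the project root so &lt;code&gt;src/&lt;/code&gt; is importable) should print a handful of results:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# quick smoke test for the client above
from src.api_client import BrightDataClient

client = BrightDataClient()
serp_data = client.search("python data processing", num_results=10)
print(len(serp_data.get("organic", [])), "organic results fetched")
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
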

&lt;h2&gt;
  
  
  Step 2: Convert JSON to Arrow Tables
&lt;/h2&gt;

&lt;p&gt;This is where you pay the one-time conversion cost.&lt;/p&gt;

&lt;p&gt;Note that  &lt;a href="https://arrow.apache.org/docs/python/data.html" rel="noopener noreferrer"&gt;Arrow requires schemas.&lt;/a&gt;  This is a feature, not a limitation. Schemas force you to be explicit about data shape, meaning better optimization + you catch bugs early.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# src/arrow_builder.py  

import pyarrow as pa  

def  serp_to_arrow(serp_data: dict):  
schema = pa.schema([  
pa.field('title', pa.string()),  
pa.field('link', pa.string()),  
pa.field('snippet', pa.string()),  
pa.field('position', pa.int32()),  
pa.field('display_link', pa.string()),  
pa.field('date', pa.string(), nullable=True),  
])  

organic_results = serp_data.get('organic', [])  

if  not organic_results:  
return pa.Table.from_pylist([], schema=schema)  

rows = []  
for idx, result in  enumerate(organic_results):  
row = {  
'title': result.get('title', ''),  
'link': result.get('link', ''),  
'snippet': result.get('snippet', ''),  
'position': result.get('position', idx + 1),  
'display_link': result.get('display_link', ''),  
'date': result.get('date', None),  
}  
rows.append(row)  

return pa.Table.from_pylist(rows, schema=schema)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
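
&lt;p&gt;A quick usage check, assuming &lt;code&gt;serp_data&lt;/code&gt; is a dict shaped like the SERP response from Step 1 (i.e. it has an &lt;code&gt;organic&lt;/code&gt; list):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;from src.arrow_builder import serp_to_arrow

table = serp_to_arrow(serp_data)   # serp_data: dict with an "organic" list
print(table.num_rows, "rows")
print(table.schema)                # title: string, link: string, ..., position: int32
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;If a field shows up with a type the schema can’t hold (say, a non-numeric &lt;code&gt;position&lt;/code&gt;), the conversion should fail right here rather than letting bad data flow downstream, which is the “catch bugs early” part.&lt;/p&gt;
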



&lt;h2&gt;
  
  
  Step 3: Arrow-Native Transformations
&lt;/h2&gt;

&lt;p&gt;Now we can work with the data without re-serializing it. You can get as creative as you want here (and if you do,  &lt;a href="https://arrow.apache.org/cookbook/py/data.html" rel="noopener noreferrer"&gt;the Arrow cookbook has you covered&lt;/a&gt;) but to simulate a basic production workflow I’m considering three broad operations —  &lt;strong&gt;filtering, sorting, and aggregation&lt;/strong&gt;. That should cover most real-world use cases.&lt;/p&gt;

&lt;p&gt;These all make heavy use of PyArrow’s  &lt;a href="https://arrow.apache.org/docs/python/generated/pyarrow.Table.html" rel="noopener noreferrer"&gt;Table&lt;/a&gt;  class, and  &lt;a href="https://arrow.apache.org/docs/python/api/compute.html" rel="noopener noreferrer"&gt;Compute&lt;/a&gt;  functions.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# src/transformations.py  
# Simulates a typical production workflow with Arrow native transformations  
import pyarrow as pa  
import pyarrow.compute as pc  
from typing import  Optional, List  
from urllib.parse import urlparse  


# 1. filters SERP results by position (zero-copy operation)  
# Returns a filtered Arrow table  
def  filter_by_position(table: pa.Table, max_position: int = 10) -&amp;gt; pa.Table:  
if  'position'  not  in table.column_names:  
return table  

mask = pc.less_equal(table['position'], max_position)  
return table.filter(mask)  

# 2. Sort table by position (zero-copy operation)  
# Returns a sorted arrow table  
def  sort_by_position(table: pa.Table, ascending: bool = True) -&amp;gt; pa.Table:  
if  'position'  not  in table.column_names:  
return table  

sort_keys = [('position', 'ascending'  if ascending else  'descending')]  
return table.sort_by(sort_keys)  

# 3. Select specific columns (zero-copy operation)  
# Returns a table containing ONLY the selected columns  
def  select_columns(table: pa.Table, columns: List[str]) -&amp;gt; pa.Table:  
available_columns = [col for col in columns if col in table.column_names]  
if  not available_columns:  
return table  

return table.select(available_columns)  

# 4. Aggregate SERP results by domain using Arrow-native group_by operations  
# Returns an aggregated Arrow table with domain stats  
# This uses Arrow's native group_by which is zero-copy and much faster than Python loops for large datasets.  
def  aggregate_by_domain(table: pa.Table) -&amp;gt; pa.Table:  
if  'display_link'  not  in table.column_names and  'link'  not  in table.column_names:  
return table  

# extract domain using Arrow compute functions  
# we need to extract domains first, then group by them  
link_column = 'display_link'  if  'display_link'  in table.column_names else  'link'  

# extract domains - we still need Python for URL parsing, but minimize it  
# by using Arrow compute for string operations where possible  
domains_list = []  
positions_list = []  

# extract domains efficiently  
link_array = table[link_column]  
position_array = table['position'] if  'position'  in table.column_names else  None  

for i in  range(len(table)):  
url_str = link_array[i].as_py()  
if  not url_str or  not  isinstance(url_str, str) or  not url_str.startswith('http'):  
continue  

try:  
parsed = urlparse(url_str)  
domain = parsed.netloc  
if domain.startswith('www.'):  
domain = domain[4:]  
except Exception:  
if  '//'  in url_str:  
domain = url_str.split('//')[1].split('/')[0]  
if domain.startswith('www.'):  
domain = domain[4:]  
else:  
continue  

if domain:  
domains_list.append(domain)  
if position_array is  not  None:  
positions_list.append(position_array[i].as_py())  
else:  
positions_list.append(0)  

if  not domains_list:  
return pa.Table.from_pylist([])  

# create Arrow table with domains and positions  
domain_table = pa.Table.from_pydict({  
'domain': domains_list,  
'position': positions_list  
})  

# use Arrow's native group_by for aggregation (zero-copy)  
# aggregate returns a table with domain + aggregated columns  
# columns are returned in order: domain, position_count, position_mean, position_min, position_max  
grouped = domain_table.group_by('domain').aggregate([  
('position', 'count'),  
('position', 'mean'),  
('position', 'min'),  
('position', 'max')  
])  

# rename columns to match expected output format  
# group_by returns: domain, position_count, position_mean, position_min, position_max  
aggregated = grouped.rename_columns([  
'domain',  
'result_count',  
'avg_position',  
'min_position',  
'max_position'  
])  

return aggregated  

# 5. (Optional) If you want you can filter SERP results by domain name just as easily  
# And it's also a zero-copy operation.  
# This returns a filtered Arrow table:  
def  filter_by_domain(table: pa.Table, domains: List[str]) -&amp;gt; pa.Table:  
if  'link'  not  in table.column_names and  'serp_link'  not  in table.column_names:  
return table  

link_column = 'link'  if  'link'  in table.column_names else  'serp_link'  

# create mask for matching domains  
masks = []  
for domain in domains:  
domain_mask = pc.match_substring(table[link_column], domain)  
masks.append(domain_mask)  

# combine masks with OR  
if masks:  
combined_mask = masks[0]  
for mask in masks[1:]:  
combined_mask = pc.or_(combined_mask, mask)  
return table.filter(combined_mask)  

return table
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
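
&lt;p&gt;To see these in action before the benchmark, chaining them on a single table (again assuming &lt;code&gt;table&lt;/code&gt; came from Step 2) looks something like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;from src.transformations import (
    filter_by_position,
    sort_by_position,
    select_columns,
    aggregate_by_domain,
)

top10 = sort_by_position(filter_by_position(table, max_position=10))
slim = select_columns(top10, ["title", "link", "position"])
domains = aggregate_by_domain(table)

print(slim.num_rows, "top results;", domains.num_rows, "distinct domains")
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
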



&lt;p&gt;All done, let’s put it all together, run it, and benchmark the difference.&lt;/p&gt;

&lt;h2&gt;
  
  
  Step 4: Benchmarking the Complete Pipeline
&lt;/h2&gt;

&lt;p&gt;We’ll simulate a realistic workflow: fetch data from an API, filter it, sort it, and serialize it for storage — then deserialize and compute on it. This mirrors what happens when you cache intermediate results or pass data between workers.&lt;/p&gt;

&lt;p&gt;We’re measuring the cost of repeatedly materializing, transforming, serializing, and re-parsing row-oriented Python objects vs. keeping data columnar (Arrow) and operating in native code (via  &lt;code&gt;pyarrow&lt;/code&gt;).&lt;/p&gt;

&lt;p&gt;We’ll run 500 iterations to simulate batch processing:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import sys  
import time  
import json  
import tracemalloc  
import tempfile  
import os  
import argparse  
from pathlib import Path  
from datetime import datetime  

sys.path.insert(0, str(Path(__file__).parent.parent))  

from src.api_client import BrightDataClient  
from src.arrow_builder import serp_to_arrow  
from src.transformations import (  
filter_by_position,  
sort_by_position,  
aggregate_by_domain  
)  
import pyarrow as pa  
import pyarrow.compute as pc  
import pyarrow.parquet as pq  


def  json_processing(serp_data: dict, iterations: int = 100):  
"""traditional JSON-based processing pipeline  

We're simulating a typical data pipeline/workflow: filter -&amp;gt; sort -&amp;gt; cache (serialize) -&amp;gt; read (deserialize) -&amp;gt; compute  
Many real pipelines serialize data for caching, storage, or transmission between services  
"""  
organic = serp_data.get('organic', [])  

start = time.time() # to measure runtime  
tracemalloc.start() # to measure Python heap usage  

for _ in  range(iterations):  
filtered = [r for r in organic if r.get('position', 0) &amp;lt;= 10]  
sorted_data = sorted(filtered, key=lambda x: x.get('position', 0))  
# simulate caching/storage: serialize to JSON (expensive but common in real pipelines)  
json_str = json.dumps(sorted_data)  
# simulate reading cached data: deserialize back to Python objects (expensive but common)  
parsed = json.loads(json_str)  
total_positions = sum(item.get('position', 0) for item in parsed)  

elapsed = time.time() - start  
current, peak = tracemalloc.get_traced_memory()  
tracemalloc.stop()  

return elapsed, len(organic), peak / 1024 / 1024  


def  arrow_processing(serp_data: dict, iterations: int = 100):  
"""Arrow-based zero-copy processing pipeline  

We'll simulate the same operations as JSON version, but no serialization needed:  
Filter -&amp;gt; Sort -&amp;gt; Compute here ALL work directly with Arrow data  
If you want persistence, just export to Parquet (native via pyarrow, but not needed here for in-memory ops)  
"""  
# convert JSON to Arrow table - one-time cost, amortized across iterations  
table = serp_to_arrow(serp_data)  

start = time.time() # to measure runtime  
tracemalloc.start() # to measure Python heap usage  

for _ in  range(iterations):  
# zero-copy: filter directly on Arrow data (no Python conversion)  
filtered = filter_by_position(table, max_position=10)  
# zero-copy: sort directly on Arrow data (no Python conversion)  
sorted_table = sort_by_position(filtered, ascending=True)  
# zero-copy: compute directly on Arrow column (no serialization needed for in-memory ops)  
total_positions = pc.sum(sorted_table['position']).as_py()  

elapsed = time.time() - start  
current, peak = tracemalloc.get_traced_memory()  
tracemalloc.stop()  

return elapsed, len(table), peak / 1024 / 1024  


def  main():  
cache_dir = Path(__file__).parent / "cache"  
cache_dir.mkdir(exist_ok=True)  
cache_file = cache_dir / "benchmark_data.json"  

parser = argparse.ArgumentParser(description='Run JSON vs Arrow benchmark')  
parser.add_argument('--refresh', action='store_true', help='Force refresh cached data')  
args = parser.parse_args()  

serp_data = None  
num_results = 0  

# try to load cached API data to avoid re-fetching on every run  
if  not args.refresh and cache_file.exists():  
try:  
with  open(cache_file, 'r') as f:  
cached_data = json.load(f)  
serp_data = cached_data.get('serp_data')  
num_results = len(serp_data.get('organic', [])) if serp_data else  0  
except Exception:  
serp_data = None  

# fetch data if cache doesn't exist or is too small (&amp;lt; 100)  
if serp_data is  None  or num_results &amp;lt; 100:  
client = BrightDataClient()  

# if you want more results, increase results per query or just...run more queries!  
num_results_per_query = 10  
target_results = 100  

queries = [  
"Python data processing",  
"distributed computing",  
"data engineering",  
"ETL pipeline design",  
# add more as needed  
]  

all_results = []  
successful_queries = 0  

for query in queries:  
if  len(all_results) &amp;gt;= target_results:  
break  

try:  
serp_data = client.search(query, num_results=num_results_per_query)  
query_results = serp_data.get('organic', [])  
all_results.extend(query_results)  
successful_queries += 1  
time.sleep(0.5) # just in case, for rate limits  
except Exception:  
pass  

if all_results:  
serp_data = {'organic': all_results}  
num_results = len(all_results)  

# cache results for future runs  
try:  
cache_data = {  
'serp_data': serp_data,  
'timestamp': datetime.now().isoformat(),  
'num_results': num_results,  
'successful_queries': successful_queries  
}  
with  open(cache_file, 'w') as f:  
json.dump(cache_data, f, indent=2)  
except Exception:  
pass  
else:  
serp_data = client.search("Python data processing", num_results=100)  
num_results = len(serp_data.get('organic', []))  

# run benchmarks: simulate processing multiple batches (here, 500)  
iterations = 500  
json_time, json_rows, json_memory = json_processing(serp_data, iterations)  
arrow_time, arrow_rows, arrow_memory = arrow_processing(serp_data, iterations)  

# calculate performance improvements  
speedup = json_time / arrow_time if arrow_time &amp;gt; 0  else  0  
memory_reduction = ((json_memory - arrow_memory) / json_memory * 100) if json_memory &amp;gt; 0  else  0  

# compare file sizes: JSON vs Parquet  
table = serp_to_arrow(serp_data)  

with tempfile.NamedTemporaryFile(mode='w', suffix='.json', delete=False) as json_file:  
json.dump(serp_data.get('organic', []), json_file, indent=2)  
json_filepath = json_file.name  

json_size = os.path.getsize(json_filepath)  

with tempfile.NamedTemporaryFile(suffix='.parquet', delete=False) as parquet_file:  
parquet_filepath = parquet_file.name  

# Parquet is columnar and compressed, typically much smaller than JSON  
pq.write_table(table, parquet_filepath, compression='snappy')  
parquet_size = os.path.getsize(parquet_filepath)  

# Arrow's in-memory representation is also more efficient  
arrow_memory_size = table.nbytes  

size_reduction = ((json_size - parquet_size) / json_size * 100) if json_size &amp;gt; 0  else  0  
memory_reduction_vs_json = ((json_size - arrow_memory_size) / json_size * 100) if json_size &amp;gt; 0  else  0  

# Print out the results  
print(f"\n{'Metric':&amp;lt;25} {'JSON':&amp;lt;20} {'Arrow':&amp;lt;20} {'Improvement':&amp;lt;15}")  
print("-" * 70)  
print(f"{'Processing Time':&amp;lt;25} {json_time:.4f}s{'':&amp;lt;15} {arrow_time:.4f}s{'':&amp;lt;15} {speedup:.2f}x faster")  
print(f"{'Throughput':&amp;lt;25} {iterations/json_time:.1f} ops/s{'':&amp;lt;10} {iterations/arrow_time:.1f} ops/s{'':&amp;lt;10} {speedup:.2f}x more")  
print(f"{'Peak Memory':&amp;lt;25} {json_memory:.2f} MB{'':&amp;lt;13} {arrow_memory:.2f} MB{'':&amp;lt;13} {memory_reduction:.1f}% less")  
print(f"{'Data Size (JSON file)':&amp;lt;25} {json_size/1024:.2f} KB")  
print(f"{'Data Size (Parquet file)':&amp;lt;25} {parquet_size/1024:.2f} KB{'':&amp;lt;13} {size_reduction:.1f}% smaller")  
print(f"{'Data Size (Arrow in-memory)':&amp;lt;25} {arrow_memory_size/1024:.2f} KB{'':&amp;lt;13} {memory_reduction_vs_json:.1f}% smaller")  

try:  
os.unlink(json_filepath)  
os.unlink(parquet_filepath)  
except Exception:  
pass  

# All done, lets save benchmark results to disk  
results_dir = Path(__file__).parent.parent / "benchmarks" / "results"  
results_dir.mkdir(parents=True, exist_ok=True)  

timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")  
results_file = results_dir / f"benchmark_{timestamp}.json"  

results_data = {  
"timestamp": datetime.now().isoformat(),  
"dataset_size": num_results,  
"iterations": iterations,  
"json": {  
"time_seconds": json_time,  
"throughput_ops_per_sec": iterations / json_time,  
"peak_memory_mb": json_memory,  
"data_size_kb": json_size / 1024  
},  
"arrow": {  
"time_seconds": arrow_time,  
"throughput_ops_per_sec": iterations / arrow_time,  
"peak_memory_mb": arrow_memory,  
"data_size_kb": parquet_size / 1024,  
"in_memory_size_kb": arrow_memory_size / 1024  
},  
"improvements": {  
"speedup": speedup,  
"memory_reduction_percent": memory_reduction,  
"size_reduction_percent": size_reduction,  
"parquet_vs_json_reduction_percent": size_reduction  
}  
}  

with  open(results_file, 'w') as f:  
json.dump(results_data, f, indent=2)  


if __name__ == "__main__":  
main()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Optional:&lt;/strong&gt; You can persist Arrow data as  &lt;a href="https://en.wikipedia.org/wiki/Apache_Parquet" rel="noopener noreferrer"&gt;Parquet&lt;/a&gt;  easily with  &lt;code&gt;pyarrow&lt;/code&gt;. Parquet is Arrow’s natural output format — both are columnar.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import pyarrow.parquet as pq  
from pathlib import Path  
def  export_to_parquet(table, filepath: str, compression: str = 'snappy'):  
    Path(filepath).parent.mkdir(parents=True, exist_ok=True)  
    pq.write_table(table, filepath, compression=compression)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
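
&lt;p&gt;Reading it back later is just as direct, and because Parquet is columnar you can load only the columns you need. (The file path below is just a placeholder for wherever you wrote the file.)&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import pyarrow.parquet as pq

# full read: returns an Arrow table, no JSON parsing involved
table = pq.read_table("results/serp.parquet")

# partial read: only these columns are decoded from the file
positions = pq.read_table("results/serp.parquet", columns=["position", "title"])
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
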



&lt;h2&gt;
  
  
  Benchmark Results
&lt;/h2&gt;

&lt;p&gt;So I ran that benchmark across dataset sizes from ~100 to 10,000 SERP results by altering  &lt;code&gt;num_results_per_query&lt;/code&gt;  and  &lt;code&gt;target_results&lt;/code&gt;, then executed each logical pipeline 500 times per run to simulate repeated batch processing in a real data pipeline. How do the results scale?&lt;/p&gt;

&lt;h3&gt;
  
  
  Processing Time
&lt;/h3&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;On average (1,000–10,000 rows), Arrow is ~2.6x faster in processing time.&lt;/strong&gt;&lt;/p&gt;
&lt;/blockquote&gt;


&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fkj9egvv8vpjbld6b27hf.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fkj9egvv8vpjbld6b27hf.png" alt="Chart showing processing time in seconds vs. number of results. Arrow remains a constant low latency throughout, but using JSON at each stage makes processing time scale linearly with dataset size." width="560" height="332"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Processing Time (in seconds) vs. Dataset size (number of results). Arrow remains a constant low latency throughout, but using JSON at each stage makes processing time scale with dataset size.&lt;/p&gt;

&lt;p&gt;We can ignore the data for very small dataset sizes (that n=100 spike): there, fixed overheads (repeated  &lt;code&gt;json.dumps()&lt;/code&gt;  /  &lt;code&gt;json.loads()&lt;/code&gt;  calls and Python object allocation) dominate the JSON pipeline’s runtime, so the gap looks more extreme than it does at realistic sizes.&lt;/p&gt;

&lt;p&gt;Arrow’s processing time remains nearly constant (~0.085–0.091 seconds) throughout, so here’s how the Arrow speedup grows with dataset size:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  ~1.6x at 1,000 rows&lt;/li&gt;
&lt;li&gt;  ~2–3x between 2,000 and 8,000 rows&lt;/li&gt;
&lt;li&gt;  ~3.6x at 10,000 rows&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This pattern is expected. The JSON pipeline’s processing time scales linearly with row count; the Arrow pipeline’s does not: filtering, sorting, and aggregation run inside native kernels over columnar buffers, so Python never enters the per-element hot path.&lt;/p&gt;

&lt;h3&gt;
  
  
  Throughput
&lt;/h3&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;On average (1,000–10,000 rows), Arrow delivers ~2.6x higher throughput.&lt;/strong&gt;&lt;/p&gt;
&lt;/blockquote&gt;


&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fam5op0mcazl7j77file7.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fam5op0mcazl7j77file7.png" alt="Chart showing throughput (operations per second) vs. number of results. Arrow maintains at least 5000 plus operations per second throughout, but using JSON degrades throughput as dataset size increases." width="560" height="330"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Throughput (operations/second) vs. Dataset size (number of results). Arrow maintains at least 5,000+ ops/sec across all dataset sizes.&lt;/p&gt;

&lt;p&gt;Throughput mirrors processing time exactly.&lt;/p&gt;

&lt;p&gt;Arrow maintains a relatively constant throughput of  &lt;strong&gt;~5,500–6,000 ops/sec&lt;/strong&gt;  across all dataset sizes. The JSON pipeline’s throughput degrades steadily as row count increases, dropping from  &lt;strong&gt;~3,600 ops/sec @ 1,000 rows to ~1,500 ops/sec @ 10,000 rows&lt;/strong&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  Memory Usage (Python Heap)
&lt;/h3&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;On average (1,000–10,000 rows), Arrow uses ~84% less Python heap memory.&lt;/strong&gt;&lt;/p&gt;
&lt;/blockquote&gt;


&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fw3ffxslwrc5g430xxtjo.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fw3ffxslwrc5g430xxtjo.png" alt="Histogram showing peak memory usage in Megabytes for JSON and Arrow vs. dataset size (number of results). Arrow uses 80–100% less memory across all dataset sizes." width="560" height="326"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Peak Memory Usage (in MB). Pipelines that use Arrow use 80–100% less memory across all dataset sizes.&lt;/p&gt;

&lt;p&gt;Peak memory usage, measured via  &lt;code&gt;tracemalloc&lt;/code&gt;, shows a consistent and substantial reduction when using Arrow (again, we can ignore the smallest test case at n=103). Bear in mind we’re capturing Python heap allocations only.&lt;/p&gt;

&lt;p&gt;As dataset size increases, the JSON pipeline’s memory usage keeps growing with object churn, while Arrow’s Python-level memory stays low and stable — most data lives in native buffers outside the Python object model.&lt;/p&gt;

&lt;h3&gt;
  
  
  File Size (JSON vs Parquet)
&lt;/h3&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;On average (1,000–10,000 rows), Parquet files are ~98.7% smaller than JSON.&lt;/strong&gt;&lt;/p&gt;
&lt;/blockquote&gt;


&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ftlnd3rt9spmzouzv01s6.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ftlnd3rt9spmzouzv01s6.png" alt="Chart showing file size (in Kilobytes) vs. number of results. Arrow based pipelines compress persisted results dramatically better than pipelines which use JSON" width="560" height="329"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;File Size (in KB) vs. Dataset size (number of results). Arrow based pipelines compress persisted results dramatically better than pipelines which use JSON.&lt;/p&gt;

&lt;p&gt;The most dramatic difference, but also the most boringly predictable one.&lt;/p&gt;

&lt;p&gt;Parquet files are consistently  &lt;strong&gt;97–99% smaller&lt;/strong&gt;  than their JSON equivalents at realistic dataset sizes. No surprises there, that’s exactly what columnar formats like Parquet were designed to do.&lt;/p&gt;

&lt;h2&gt;
  
  
  How Much Money Does This Save You?
&lt;/h2&gt;

&lt;p&gt;Breaking this down:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  &lt;strong&gt;~2.6x faster processing&lt;/strong&gt; (average) = ~1/2.6th the compute time&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;~84% less Python heap memory&lt;/strong&gt;  = smaller instance sizes, less GC pressure&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;~98–99% smaller files&lt;/strong&gt;  = lower storage and I/O costs&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;So for a pipeline that processes, say, &lt;strong&gt;10,000 queries/day&lt;/strong&gt; (at roughly one second of JSON-pipeline work per query), a JSON approach would use &lt;strong&gt;~10,000 seconds&lt;/strong&gt; of compute time while an Arrow approach would use &lt;strong&gt;~3,800 seconds&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Assuming  &lt;strong&gt;$0.10/hour&lt;/strong&gt;  for compute, that’s  &lt;strong&gt;~$0.28/day vs ~$0.11/day&lt;/strong&gt;  — roughly a  &lt;strong&gt;60% reduction in compute cost&lt;/strong&gt;, before even accounting for memory and storage savings. I’ll take that any day.&lt;/p&gt;
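
&lt;p&gt;Here’s that back-of-envelope math if you want to plug in your own numbers; the ~1 second of JSON-pipeline work per query is the assumption doing the heavy lifting:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;queries_per_day = 10_000
json_secs_per_query = 1.0          # assumed JSON pipeline cost per query
speedup = 2.6                      # average Arrow speedup measured above
rate_per_hour = 0.10               # assumed compute cost

json_seconds = queries_per_day * json_secs_per_query   # ~10,000 s
arrow_seconds = json_seconds / speedup                  # ~3,800 s

json_cost = json_seconds / 3600 * rate_per_hour         # ~$0.28/day
arrow_cost = arrow_seconds / 3600 * rate_per_hour       # ~$0.11/day
print(f"${json_cost:.2f}/day vs ${arrow_cost:.2f}/day")
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
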

&lt;h2&gt;
  
  
  When Should You Use Apache Arrow?
&lt;/h2&gt;

&lt;p&gt;Here’s the big caveat — you can’t just use Apache Arrow for  &lt;em&gt;everything.&lt;/em&gt; It is an architectural choice for production data pipelines rather than some general optimization hack.&lt;/p&gt;

&lt;h3&gt;
  
  
  Use an Arrow-based pipeline when:
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;  You process dozens to thousands of rows per batch&lt;/li&gt;
&lt;li&gt;  Your pipeline applies multiple transformations (e.g. filter → sort → aggregate → export)&lt;/li&gt;
&lt;li&gt;  You’re building analytics, feature extraction, or training pipelines&lt;/li&gt;
&lt;li&gt;  Memory usage, throughput, or storage size matter&lt;/li&gt;
&lt;li&gt;  You might want to persist data in Parquet for downstream systems&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Honestly though, from experience,  &lt;em&gt;most&lt;/em&gt; production data pipelines should find using Arrow a net improvement. In these cases, avoiding that repeated Python object materialization + serialization tax will dramatically improve how your pipeline scales (and how much it costs to operate).&lt;/p&gt;

&lt;h3&gt;
  
  
  Keep your existing pipeline when:
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;  You’re running one-off scripts on very small datasets&lt;/li&gt;
&lt;li&gt;  The end of your pipeline has  to return JSON directly to  &lt;em&gt;another&lt;/em&gt; API consumer&lt;/li&gt;
&lt;li&gt;  Your data can’t be described by a schema&lt;/li&gt;
&lt;li&gt;  You aren’t transforming the data at all (the conversion cost won’t amortize)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Adopting Apache Arrow just won’t be worth the effort (or rewrite) in those cases.&lt;/p&gt;

</description>
      <category>programming</category>
      <category>python</category>
      <category>datascience</category>
      <category>tutorial</category>
    </item>
  </channel>
</rss>
