DEV Community: Prithwish Nath

A Practical Guide To Entity Resolution in Python (No Database, No Machine Learning)

Prithwish Nath — Tue, 26 May 2026 07:36:44 +0000

TL;DR: Learn a very simple way to normalize, dedupe, and fuzzy-match records that refer to the same real-world entity in Python, without a database or any ML pipelines.

I was working on a Crunchbase dataset last Friday. I joined it against our CRM and got 56 hits out of 96. The other 40 were sitting right there in both tables: Necker FinTech in the extracted data was Necker FinTech Holdings Inc. in the CRM, and Investing.com in the data was Fusion Media Limited in the CRM. But JOIN ... ON name = name obviously doesn't care; it will shrug and return nothing. If I'd shipped that, some sales rep would have ended up cold-pitching an existing customer because of it. 😅

This is the core problem of entity resolution: the same real-world entity wearing different names in different systems. Naive text equality checks are borderline useless in the real world. I'd been meaning to do something less embarrassing than a raw == for a while, so I spent the rest of the weekend on a simple pipeline: scrape company names from Crunchbase hubs via Bright Data, normalize, deduplicate, and fuzzy-match against the CRM list using RapidFuzz (fuzz.WRatio). I deliberately chose NOT to use ML, vector embeddings, or a database.

The join rate on this dataset jumped from ~58% to 100%.

Metric	Exact (normalized string)	Fuzzy (WRatio ≥ 90)
Scraped hub rows → CRM	58.3% (56 / 96)	100% (96 / 96)
CRM rows → scraped data	34.8% (48 / 138)	100% (138 / 138)

The reason exact matching loses so badly is that any real CRM list you’re handed will almost always have multiple legal-name variants per company — I had three different Necker spellings pointing at one hub listing alone. Fuzzy matching earns its keep by collapsing those variants back into a single canonical cluster, and that’s most of what the rest of this post is about.

I’ll walk through it; I hope it’s useful for anyone starting with fuzzy algorithms!

What Is Entity Resolution vs Fuzzy Matching?

Entity resolution matches records that describe the same company under different surface strings.

If you use exact matching, you ask: are these two strings identical? After you lowercase and strip punctuation, "Necker FinTech" and "Necker FinTech Holdings Inc." are still different strings — so a SQL JOIN or a Python == check will incorrectly say no match.

-- Exact join on raw names returns no row when spellings differ  
SELECT h.company_name AS hub_name, c.company_name AS crm_name  
FROM   hub_scrape h  
JOIN   crm_accounts c ON c.company_name = h.company_name  
WHERE  h.company_name = 'Necker FinTech';  
-- This will return 0 rows   
-- Remember, CRM has "Necker FinTech Holdings Inc.", not the Crunchbase title

This is why you use fuzzy matching. It asks a looser question: how similar are these two strings? You get a score, usually 0 to 100, instead of true or false. Names that are clearly the same company but spelled differently (Necker FinTech vs Necker FinTech Holdings Inc.) will score high, while unrelated names will score low. You pick a threshold (we use 90): if the score is at or above it, you treat the pair as a match; otherwise you don't.

from rapidfuzz import fuzz  
THRESHOLD = 90  
def is_match(a: str, b: str) -> bool:  
    return fuzz.WRatio(a, b) >= THRESHOLD  
pairs = [  
    ("Necker FinTech", "Necker FinTech Holdings Inc."),   # same company, legal suffix  
    ("PointsKash", "Points Kash"),                        # same company, spacing  
    ("Investing.com", "Fusion Media Limited"),            # brand vs legal entity  
    ("Stripe", "Climate Corp"),                           # different companies  
]  
for a, b in pairs:  
    score = fuzz.WRatio(a, b)  
    print(f"{score:5.1f}  match={score >= THRESHOLD!s:5}  {a!r}  vs  {b!r}")

This is the same scoring logic we’ll use for the rest of the tutorial, so pip install rapidfuzz is all you need to follow along.

GitHub - rapidfuzz/RapidFuzz: Rapid fuzzy string matching in Python using various string metrics (github.com)

Running the demo pairs above through fuzz.WRatio with a threshold of 90 yields:

Pair	WRatio	Match at ≥ 90?	Drift type
`Necker FinTech` vs `Necker FinTech Holdings Inc.`	90.0	Yes	Legal suffix
`PointsKash` vs `Points Kash`	95.2	Yes	Token spacing
`Investing.com` vs `Fusion Media Limited`	30.0	No	Brand vs legal entity
`Stripe` vs `Climate Corp`	45.0	No	Unrelated companies

Think of it like a strict spell-check or a “did you mean X?” suggestion, but for whole company names. It is not machine learning — no model is trained on your data. The library compares characters and words using fixed rules: how many edits to turn one string into another, whether one name is contained in the other, whether the same words appear in a different order. That’s why it’s fast, easy to audit, and good enough for a large class of real-world messiness — extra words, Inc. vs LLC, odd spacing, punctuation.
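To make "how many edits to turn one string into another" concrete, here is a minimal pure-Python sketch of Levenshtein edit distance and a 0 to 100 similarity derived from it. It is an illustration only: the function names `edit_distance` and `similarity` are made up for this example, and it will not reproduce the WRatio numbers in the table, since RapidFuzz uses its own metrics and weighting.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: fewest single-character inserts, deletes, or substitutions."""
    if len(a) < len(b):
        a, b = b, a  # keep the inner row as short as possible
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(
                previous[j] + 1,         # delete ca
                current[j - 1] + 1,      # insert cb
                previous[j - 1] + cost,  # substitute, or keep if equal
            ))
        previous = current
    return previous[-1]


def similarity(a: str, b: str) -> float:
    """Scale the distance to 0-100, where 100 means identical strings."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 100.0
    return 100.0 * (1 - edit_distance(a, b) / longest)


print(edit_distance("PointsKash", "Points Kash"))         # 1
print(round(similarity("PointsKash", "Points Kash"), 1))  # 90.9
```

One inserted space is a single edit, which is why spacing drift barely dents the score while unrelated names rack up many edits.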

💡 If two names share almost no letters — Investing.com and Fusion Media Limited for example — the score stays low and fuzzy matching correctly refuses to merge them. Those cases need a real identifier (domain, LEI, enrichment API, some sort of ML pipeline etc.), not smarter string math.

Fuzzy matching vs Lookup table vs ML

Here’s a quick summary.

Approach	Best when	Used in this pipeline?
Fuzzy matching (RapidFuzz WRatio)	Same entity, stylistic drift — legal suffixes, spacing, punctuation	Yes — primary method
Lookup table / enrichment API (GLEIF, Clearbit, domain)	Brand vs legal name; names share almost no tokens	Partial — `RESEARCHED` dict in `build_sample_crm.py`
ML record linkage (Dedupe, Splink)	Large-scale probabilistic linkage, many fields beyond name	No — names-only, no training step

Basically, choose fuzzy matching when two name strings likely describe the same company but spell it differently.

Only choose a lookup or enrichment layer when the strings are related entities (brand vs operator) rather than variants of one name.
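One way to combine the two layers is to check a small alias table first and only then fall back to string scoring. The sketch below is an assumption about how that could look, not the code from this pipeline: `KNOWN_ALIASES`, `resolve`, and `stdlib_score` are hypothetical names, and the scorer uses `difflib` from the standard library as a stand-in so the snippet runs without RapidFuzz. Passing `fuzz.WRatio` as `scorer` should work the same way.

```python
from difflib import SequenceMatcher
from typing import Callable, Dict, Optional

# Hypothetical alias table: brand name -> legal entity, both lowercased.
# The Investing.com pair comes from the article; the dict itself is illustrative.
KNOWN_ALIASES: Dict[str, str] = {
    "investing.com": "fusion media limited",
}


def stdlib_score(a: str, b: str) -> float:
    """Stand-in scorer (0-100) built on difflib, not RapidFuzz's WRatio."""
    return 100.0 * SequenceMatcher(None, a, b).ratio()


def resolve(
    hub_name: str,
    crm_name: str,
    threshold: float = 90.0,
    scorer: Callable[[str, str], float] = stdlib_score,
) -> Optional[str]:
    """Return how the pair matched ("lookup" or "fuzzy"), or None for no match."""
    a, b = hub_name.lower().strip(), crm_name.lower().strip()
    # Related entities (brand vs operator) can only be matched by lookup
    if KNOWN_ALIASES.get(a) == b or KNOWN_ALIASES.get(b) == a:
        return "lookup"
    # Variants of one name are left to the string scorer
    if scorer(a, b) >= threshold:
        return "fuzzy"
    return None


print(resolve("Investing.com", "Fusion Media Limited"))  # lookup
print(resolve("PointsKash", "Points Kash"))              # fuzzy
print(resolve("Stripe", "Climate Corp"))                 # None
```

Returning the match method rather than a bare boolean keeps the audit trail: you can later count how many joins came from the table and how many from fuzzy scoring.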

The Pipeline at a Glance

Entity resolution in this pipeline is a fetch → extract → normalize → fuzzy-cluster → join loop on canonical_id.

hub_urls.json  
      │  
      ▼  
fetch_hubs.py ──calls──► bright_data_unlocker.py     Bright Data POST → page body (markdown/HTML)  
      │                           │  
      └──calls──► parse_hubs.py ◄─┘                  regex → org slug + display name  
      │  
      ▼  
hub_snapshot.json                                    (+ cached bodies in data/hub_responses/)  

extract.py ──► raw_records.json                      flat table  

reconcile.py ──► reconciled.json                     canonical clusters + aliases  

run_fuzzy.py                                         CLI part. This just runs extract + reconcile   

── optional eval ──  
post_fuzzy_eval.py                                   All done, so run a real-world test, calc metrics, then print to stdout

Each stage is a pure transform: JSON in, JSON out. Nothing stateful, nothing that requires a running service, and nothing I can't git diff between runs.
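The "JSON in, JSON out" shape can be captured in one small helper. This is a sketch under my own assumptions rather than code from the repo: `run_stage` and `count_records` are hypothetical names, and `sort_keys=True` is an addition of mine to keep diffs between runs small.

```python
import json
import tempfile
from pathlib import Path
from typing import Any, Callable, Dict


def run_stage(
    in_path: Path,
    out_path: Path,
    transform: Callable[[Dict[str, Any]], Dict[str, Any]],
) -> Dict[str, Any]:
    """Read JSON, apply a pure function, write JSON. Nothing else is touched."""
    data = json.loads(in_path.read_text(encoding="utf-8"))
    result = transform(data)
    out_path.parent.mkdir(parents=True, exist_ok=True)
    # Stable key order and indent make the output friendlier to git diff
    out_path.write_text(
        json.dumps(result, indent=2, sort_keys=True, ensure_ascii=False) + "\n",
        encoding="utf-8",
    )
    return result


def count_records(data: Dict[str, Any]) -> Dict[str, Any]:
    """Toy transform: no I/O, no globals, output depends only on the input."""
    records = data.get("records") or []
    return {"record_count": len(records), "records": records}


# Demo in a throwaway directory so nothing in the project is overwritten
with tempfile.TemporaryDirectory() as tmp:
    src = Path(tmp) / "raw_records.json"
    dst = Path(tmp) / "out" / "counted.json"
    src.write_text(json.dumps({"records": [{"id": "hub:0:0"}]}), encoding="utf-8")
    run_stage(src, dst, count_records)
    written = json.loads(dst.read_text(encoding="utf-8"))

print(written["record_count"])  # 1
```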

Stage 1: Fetching Data from Crunchbase Hubs

I’m scraping four Crunchbase hub leaderboard pages, defined in a hub_urls.json:

[  
  { "category": "fintech",                "url": "https://www.crunchbase.com/hub/fintech-companies-seed-funding" },  
  { "category": "cybersecurity",          "url": "https://www.crunchbase.com/hub/cyber-security-startups" },  
  { "category": "saas",                   "url": "https://www.crunchbase.com/hub/saas-companies-seed-funding" },  
  { "category": "artificial_intelligence","url": "https://www.crunchbase.com/hub/artificial-intelligence-companies-early-stage-venture-funding" }  
]

Replace with your own, obviously.

Crunchbase is a JavaScript-heavy SPA — it won’t respond to a plain requests.get. So before we fetch, I use Bright Data's Web Unlocker, which handles JS rendering and anti-bot for me.

I set up a reusable client for this, and this is just a thin wrapper around their single POST endpoint https://api.brightdata.com/request. Make sure you’ve signed up, and have these set in your .env file first:

BRIGHT_DATA_API_KEY=your_api_token
BRIGHT_DATA_UNLOCKER_ZONE=your_web_unlocker_zone_name

bright_data_unlocker.py

"""Fetch hub/listing pages as HTML or markdown."""
from __future__ import annotations

import json
import os
import time
from typing import Any, Dict, Literal, Optional

import requests
from dotenv import load_dotenv

load_dotenv()

ContentFormat = Literal["html", "markdown"]


class BrightDataUnlockerClient:
    """POST https://api.brightdata.com/request (Web Unlocker zone)."""

    def __init__(
        self,
        api_key: Optional[str] = None,
        zone: Optional[str] = None,
        country: Optional[str] = None,
    ):
        self.api_key = api_key or os.getenv("BRIGHT_DATA_API_KEY")
        self.zone = zone or os.getenv("BRIGHT_DATA_UNLOCKER_ZONE")
        self.country = country or os.getenv("BRIGHT_DATA_COUNTRY") # optional
        self.api_endpoint = "https://api.brightdata.com/request"

        if not self.api_key:
            raise ValueError("BRIGHT_DATA_API_KEY is required.")
        if not self.zone:
            raise ValueError(
                "BRIGHT_DATA_UNLOCKER_ZONE is required. "
                "Create a Web Unlocker API zone in Bright Data."
            )

        self.session = requests.Session()
        self.session.headers.update(
            {
                "Content-Type": "application/json",
                "Authorization": f"Bearer {self.api_key}",
            }
        )

    def fetch(
        self,
        url: str,
        *,
        content_format: ContentFormat = "markdown",
        max_retries: int = 2,
    ) -> str:
        """Fetch page body. markdown => format=raw + data_format=markdown (Bright Data)."""
        last_err: Optional[Exception] = None
        for attempt in range(max_retries + 1):
            try:
                return self._do_fetch(url, content_format=content_format)
            except Exception as e:
                last_err = e
                if attempt < max_retries:
                    time.sleep(0.5 * (attempt + 1))
        assert last_err is not None
        raise last_err

    def fetch_markdown(self, url: str, max_retries: int = 2) -> str:
        return self.fetch(url, content_format="markdown", max_retries=max_retries)

    def fetch_html(self, url: str, max_retries: int = 2) -> str:
        return self.fetch(url, content_format="html", max_retries=max_retries)

    def _do_fetch(self, url: str, *, content_format: ContentFormat) -> str:
        payload: Dict[str, Any] = {
            "zone": self.zone,
            "url": url,
            "format": "raw",
        }
        if content_format == "markdown":
            payload["data_format"] = "markdown"
        if self.country:
            payload["country"] = self.country

        response = self.session.post(self.api_endpoint, json=payload, timeout=120)
        response.raise_for_status()

        try:
            result = response.json()
        except json.JSONDecodeError:
            # data_format=markdown often returns the page body directly, not a JSON envelope
            text = response.text
            if not text.strip():
                raise RuntimeError("Bright Data Unlocker empty response body")
            return text

        if not isinstance(result, dict):
            raise RuntimeError(f"Bright Data unexpected response type: {type(result)}")

        inner_status = result.get("status_code")
        if inner_status is not None and inner_status != 200:
            raise RuntimeError(f"Bright Data Unlocker status_code={inner_status}")

        body = result.get("body")
        if body is None:
            if "status_code" in result and result.get("status_code") == 200:
                raise RuntimeError("Bright Data Unlocker empty body")
            raise RuntimeError(f"Bright Data Unlocker missing body: {list(result.keys())}")

        if isinstance(body, str):
            if body.strip().startswith("{"):
                try:
                    nested = json.loads(body)
                    if isinstance(nested, dict) and "body" in nested:
                        body = nested["body"]
                except json.JSONDecodeError:
                    pass
            if not str(body).strip():
                raise RuntimeError("Bright Data Unlocker empty body string")
            return str(body)
        if isinstance(body, dict):
            return json.dumps(body)
        return str(body)

Note how we can request data_format=markdown. With this parameter, Bright Data returns a sanitized markdown rendering of the page, which is much easier to parse with regex than raw HTML.

💡 If markdown still yields zero orgs for a hub, fetch_hubs.py --fallback-html can fetch or use cached HTML and run the HTML parser instead.

With that in place, here’s our actual fetch script — fetch_hubs.py

fetch_hubs.py

"""Fetch Crunchbase hub pages via Bright Data Web Unlocker; write hub_snapshot.json."""

from __future__ import annotations

import argparse
import json
import time
from datetime import datetime, timezone
from pathlib import Path
from typing import Any, Dict, List, Optional

from dotenv import load_dotenv

from bright_data_unlocker import BrightDataUnlockerClient, ContentFormat
from parse_hubs import parse_organizations

load_dotenv()

_ROOT = Path(__file__).resolve().parent
_DEFAULT_RESPONSES_DIR = _ROOT / "data" / "hub_responses"


def load_hub_urls(path: Path) -> List[Dict[str, str]]:
    raw = json.loads(path.read_text(encoding="utf-8"))
    if not isinstance(raw, list):
        raise ValueError("hub_urls.json must be a JSON array")
    out: List[Dict[str, str]] = []
    for item in raw:
        if not isinstance(item, dict):
            continue
        url = (item.get("url") or "").strip()
        category = (item.get("category") or "unknown").strip()
        if url:
            out.append({"category": category, "url": url})
    return out


def _response_file(category: str, content_format: ContentFormat) -> str:
    ext = "md" if content_format == "markdown" else "html"
    safe = "".join(c if c.isalnum() or c in "-_" else "_" for c in category)
    return f"{safe}.{ext}"


def _response_path(
    responses_dir: Path, category: str, content_format: ContentFormat
) -> Path:
    return responses_dir / _response_file(category, content_format)


def load_cached_body(
    responses_dir: Path, category: str, content_format: ContentFormat
) -> Optional[str]:
    path = _response_path(responses_dir, category, content_format)
    if not path.is_file() or path.stat().st_size == 0:
        return None
    return path.read_text(encoding="utf-8")


def save_response_body(
    responses_dir: Path,
    category: str,
    hub_url: str,
    content_format: ContentFormat,
    body: str,
) -> Path:
    responses_dir.mkdir(parents=True, exist_ok=True)
    path = _response_path(responses_dir, category, content_format)
    path.write_text(body, encoding="utf-8")
    return path


def _manifest_path(responses_dir: Path) -> Path:
    return responses_dir / "manifest.json"


def _load_manifest(responses_dir: Path) -> Dict[str, Any]:
    path = _manifest_path(responses_dir)
    if not path.is_file():
        return {"hubs": []}
    return json.loads(path.read_text(encoding="utf-8"))


def _upsert_manifest_entry(
    responses_dir: Path,
    category: str,
    hub_url: str,
    content_format: ContentFormat,
    response_path: Path,
    *,
    fetched_at: str,
) -> None:
    entry = {
        "category": category,
        "hub_url": hub_url,
        "content_format": content_format,
        "response_file": response_path.name,
        "fetched_at": fetched_at,
    }
    manifest = _load_manifest(responses_dir)
    hubs = [h for h in manifest.get("hubs") or [] if h.get("category") != category]
    hubs.append(entry)
    manifest["hubs"] = hubs
    manifest["updated_at"] = datetime.now(timezone.utc).isoformat()
    _manifest_path(responses_dir).write_text(
        json.dumps(manifest, indent=2, ensure_ascii=False) + "\n",
        encoding="utf-8",
    )


def _parse_body(
    body: str,
    hub_url: str,
    content_format: ContentFormat,
    max_orgs: int,
) -> List[Dict[str, Any]]:
    return parse_organizations(body, hub_url, content_format=content_format, max_orgs=max_orgs)


def main() -> None:
    ap = argparse.ArgumentParser(
        description="Fetch Crunchbase hub pages (Web Unlocker) and extract organization URLs.",
    )
    ap.add_argument("--hubs-json", type=Path, default=_ROOT / "hub_urls.json")
    ap.add_argument("--out", type=Path, default=_ROOT / "data" / "hub_snapshot.json")
    ap.add_argument(
        "--format",
        choices=("markdown", "html"),
        default="markdown",
    )
    ap.add_argument("--max-orgs-per-hub", type=int, default=80)
    ap.add_argument("--delay", type=float, default=1.0)
    ap.add_argument(
        "--responses-dir",
        type=Path,
        default=_DEFAULT_RESPONSES_DIR,
        help="Directory for cached raw hub page bodies (default: data/hub_responses).",
    )
    ap.add_argument(
        "--refetch",
        action="store_true",
        help="Call Bright Data even if a cached response file exists.",
    )
    ap.add_argument(
        "--parse-only",
        action="store_true",
        help="Parse cached responses only; never call Bright Data.",
    )
    ap.add_argument(
        "--fallback-html",
        action="store_true",
        help="If markdown parse finds 0 orgs, try cached or fetched HTML.",
    )
    args = ap.parse_args()

    responses_dir = args.responses_dir

    hubs = load_hub_urls(args.hubs_json)
    if not hubs:
        raise SystemExit("No hubs in hub_urls.json")

    args.out.parent.mkdir(parents=True, exist_ok=True)
    client: Optional[BrightDataUnlockerClient] = None
    if not args.parse_only:
        client = BrightDataUnlockerClient()

    content_format: ContentFormat = args.format

    payload: Dict[str, Any] = {
        "fetched_at": datetime.now(timezone.utc).isoformat(),
        "source": "bright_data_web_unlocker",
        "content_format": content_format,
        "responses_dir": str(responses_dir),
        "hubs": [],
    }

    n_hubs = len(hubs)
    for i, hub in enumerate(hubs, start=1):
        category = hub["category"]
        url = hub["url"]
        print(f"\n[{i}/{n_hubs}] hub [{category}]: starting...", flush=True)
        block: Dict[str, Any] = {
            "category": category,
            "hub_url": url,
            "error": None,
            "organic_count": 0,
            "rows": [],
            "response_file": _response_file(category, content_format),
        }
        parse_format: ContentFormat = content_format

        try:
            body: Optional[str] = None
            if not args.refetch:
                body = load_cached_body(responses_dir, category, content_format)

            if body is None:
                if args.parse_only:
                    raise FileNotFoundError(
                        f"no cached response at {_response_path(responses_dir, category, content_format)} "
                        "(run without --parse-only to fetch)"
                    )
                print(
                    f"[{i}/{n_hubs}] hub [{category}]: fetching ({content_format})...",
                    flush=True,
                )
                assert client is not None
                body = client.fetch(url, content_format=content_format)
                print(
                    f"[{i}/{n_hubs}] hub [{category}]: fetch done "
                    f"({len(body):,} chars)",
                    flush=True,
                )
                saved = save_response_body(
                    responses_dir, category, url, content_format, body
                )
                _upsert_manifest_entry(
                    responses_dir,
                    category,
                    url,
                    content_format,
                    saved,
                    fetched_at=datetime.now(timezone.utc).isoformat(),
                )
                print(f"[{i}/{n_hubs}] hub [{category}]: saved {saved}", flush=True)
            else:
                print(
                    f"[{i}/{n_hubs}] hub [{category}]: using cache "
                    f"{_response_path(responses_dir, category, content_format)}",
                    flush=True,
                )

            print(f"[{i}/{n_hubs}] hub [{category}]: parsing...", flush=True)
            rows = _parse_body(body, url, parse_format, args.max_orgs_per_hub)

            if not rows and args.fallback_html and parse_format == "markdown":
                html_body = load_cached_body(responses_dir, category, "html")
                if html_body is None and not args.parse_only:
                    print(
                        f"[{i}/{n_hubs}] hub [{category}]: markdown had 0 orgs, "
                        "fetching HTML...",
                        flush=True,
                    )
                    assert client is not None
                    html_body = client.fetch(url, content_format="html")
                    print(
                        f"[{i}/{n_hubs}] hub [{category}]: HTML fetch done "
                        f"({len(html_body):,} chars)",
                        flush=True,
                    )
                    saved = save_response_body(
                        responses_dir, category, url, "html", html_body
                    )
                    print(f"[{i}/{n_hubs}] hub [{category}]: saved {saved}", flush=True)
                elif html_body is None:
                    raise FileNotFoundError(
                        f"no cached HTML at {_response_path(responses_dir, category, 'html')}"
                    )
                else:
                    print(
                        f"[{i}/{n_hubs}] hub [{category}]: markdown had 0 orgs, "
                        "using cached HTML...",
                        flush=True,
                    )
                print(f"[{i}/{n_hubs}] hub [{category}]: parsing HTML...", flush=True)
                rows = _parse_body(html_body, url, "html", args.max_orgs_per_hub)
                parse_format = "html"
                block["response_file"] = _response_file(category, "html")

            block["content_format"] = parse_format
            block["organic_count"] = len(rows)
            block["rows"] = rows
            print(
                f"[{i}/{n_hubs}] hub [{category}]: done - "
                f"{len(rows)} organizations",
                flush=True,
            )

        except Exception as e:
            print(f"[{i}/{n_hubs}] hub [{category}]: failed - {e}", flush=True)
            block["error"] = str(e)

        payload["hubs"].append(block)
        if not args.parse_only:
            time.sleep(args.delay)

    args.out.write_text(
        json.dumps(payload, indent=2, ensure_ascii=False) + "\n",
        encoding="utf-8",
    )
    total = sum(h.get("organic_count") or 0 for h in payload["hubs"])
    print(
        f"\nAll hubs processed. Wrote {args.out} "
        f"({total} organizations across {n_hubs} hubs).",
        flush=True,
    )


if __name__ == "__main__":
    main()

Note how I’m caching the raw bodies under data/hub_responses/ so re-runs with --parse-only don't burn any API credits.
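Stripped of the CLI plumbing, the caching idea is a cache-or-fetch function. The sketch below is a simplified stand-in for what fetch_hubs.py does, with hypothetical names (`cached_fetch`, `fake_fetcher`) and a fake fetcher in place of the Bright Data client so it runs offline.

```python
import tempfile
from pathlib import Path
from typing import Callable, List


def cached_fetch(
    cache_file: Path,
    fetcher: Callable[[], str],
    *,
    refetch: bool = False,
) -> str:
    """Return the cached body if present and non-empty; otherwise fetch and store it."""
    if not refetch and cache_file.is_file() and cache_file.stat().st_size > 0:
        return cache_file.read_text(encoding="utf-8")
    body = fetcher()  # the only line that would cost an API call
    cache_file.parent.mkdir(parents=True, exist_ok=True)
    cache_file.write_text(body, encoding="utf-8")
    return body


calls: List[int] = []


def fake_fetcher() -> str:
    """Stand-in for a real network fetch; records each invocation."""
    calls.append(1)
    return "# fake hub page"


with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "hub_responses" / "fintech.md"
    first = cached_fetch(path, fake_fetcher)   # miss: calls the fetcher
    second = cached_fetch(path, fake_fetcher)  # hit: served from disk

print(len(calls))  # 1
```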

Stage 2: Parsing from Markdown into Organization Rows

Our parse_hubs.py pulls organization slugs and display names out of the cached page bodies from the previous step. It runs three regex patterns in priority order:

# parse_hubs.py
import re

# Priority 1: Bright Data relative markdown links
# Matches: ](/organization/slug "Display Name")
_ORG_REL_LINK = re.compile(
    r"\]\(/organization/([a-z0-9_-]+)(?:\s+\"([^\"]*)\")?\s*\)",
    re.I,
)

# Priority 2: Standard absolute markdown links
# Matches: [Company Name](https://www.crunchbase.com/organization/slug)
_ORG_MD_LINK = re.compile(
    r"\[([^\]]+)\]\(\s*<?https?://[^>\s)]*crunchbase\.com/organization/([a-z0-9_-]+)/?>?\s*\)",
    re.I,
)

# Fallback: bare /organization/slug anywhere in text
_ORG_IN_TEXT = re.compile(
    r"(?:https?://[^/\s]*crunchbase\.com)?/organization/([a-z0-9_-]+)",
    re.I,
)

Each hub gets parsed into rows like:

{ "url": "https://www.crunchbase.com/organization/lovable", "slug": "lovable", "title": "Lovable" }
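As a quick check of how a match becomes a row, here is a small sketch that runs the relative-link pattern over a made-up markdown fragment. The sample text and the `some-startup` slug are hypothetical; real hub pages are much noisier.

```python
import re

# Same relative-link pattern as in the snippet above
_ORG_REL_LINK = re.compile(
    r"\]\(/organization/([a-z0-9_-]+)(?:\s+\"([^\"]*)\")?\s*\)",
    re.I,
)

# Hypothetical markdown fragment: one link with a quoted title, one without
sample = (
    '[Lovable](/organization/lovable "Lovable")\n'
    '[no title here](/organization/some-startup)\n'
)

rows = []
for m in _ORG_REL_LINK.finditer(sample):
    slug = m.group(1).lower()
    # group(2) is None when the link has no quoted title, so fall back to the slug
    title = (m.group(2) or "").strip() or slug.replace("-", " ").title()
    rows.append(
        {
            "url": f"https://www.crunchbase.com/organization/{slug}",
            "slug": slug,
            "title": title,
        }
    )

for row in rows:
    print(row)
```

The second link has no quoted title, so its display name is derived from the slug ("Some Startup"), which mirrors the slug fallback in the full parser.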

Here’s the full code for parse_hubs.py. Note that I also keep a blocklist of well-known VCs and accelerators (y-combinator, techstars, andreessen-horowitz, etc.) that show up on hub pages but are the investors, not the companies being listed. Without this, you get YC ranked #1 on every hub it's ever touched, which is obviously not what we want.

parse_hubs.py

"""Parse Crunchbase hub pages (markdown or HTML) for /organization/ links."""

from __future__ import annotations

import re
from typing import Any, Dict, List, Literal, Set
from urllib.parse import urljoin, urlparse

ContentFormat = Literal["html", "markdown"]

_ORG_IN_TEXT = re.compile(
    r"(?:https?://[^/\s]*crunchbase\.com)?/organization/([a-z0-9_-]+)",
    re.I,
)
# [Company Name](https://www.crunchbase.com/organization/slug)
_ORG_MD_LINK = re.compile(
    r"\[([^\]]+)\]\(\s*<?https?://[^>\s)]*crunchbase\.com/organization/([a-z0-9_-]+)/?>?\s*\)",
    re.I,
)
# Bright Data markdown: multi-line link ending with ](/organization/slug "Display Name")
_ORG_REL_LINK = re.compile(
    r"\]\(/organization/([a-z0-9_-]+)(?:\s+\"([^\"]*)\")?\s*\)",
    re.I,
)
_ORG_BLOCKLIST = frozenset(
    {
        "y-combinator",
        "techstars",
        "national-science-foundation",
        "masschallenge",
        "easme",
        "andreessen-horowitz",
        "sequoia-capital",
        "accel",
    }
)


def slug_to_display_name(slug: str) -> str:
    return slug.replace("-", " ").title()


def _append_org(
    rows: List[Dict[str, Any]],
    seen_slugs: Set[str],
    *,
    slug: str,
    title: str,
    hub_url: str,
    max_orgs: int,
) -> None:
    if len(rows) >= max_orgs:
        return
    slug = slug.lower()
    if slug in _ORG_BLOCKLIST or slug in seen_slugs:
        return
    seen_slugs.add(slug)
    base = f"{urlparse(hub_url).scheme}://{urlparse(hub_url).netloc}"
    name = (title or "").strip() or slug_to_display_name(slug)
    rows.append(
        {
            "url": urljoin(base, f"/organization/{slug}"),
            "slug": slug,
            "title": name,
        }
    )


def parse_organizations_from_markdown(
    markdown: str,
    hub_url: str,
    *,
    max_orgs: int = 80,
) -> List[Dict[str, Any]]:
    """Extract orgs from markdown links; fall back to bare organization URLs."""
    seen_slugs: Set[str] = set()
    rows: List[Dict[str, Any]] = []

    for match in _ORG_REL_LINK.finditer(markdown):
        slug = match.group(1)
        title = (match.group(2) or "").strip()
        _append_org(rows, seen_slugs, slug=slug, title=title, hub_url=hub_url, max_orgs=max_orgs)
        if len(rows) >= max_orgs:
            return rows

    for match in _ORG_MD_LINK.finditer(markdown):
        title, slug = match.group(1).strip(), match.group(2)
        _append_org(rows, seen_slugs, slug=slug, title=title, hub_url=hub_url, max_orgs=max_orgs)
        if len(rows) >= max_orgs:
            return rows

    if rows:
        return rows

    for match in _ORG_IN_TEXT.finditer(markdown):
        _append_org(
            rows,
            seen_slugs,
            slug=match.group(1),
            title="",
            hub_url=hub_url,
            max_orgs=max_orgs,
        )
        if len(rows) >= max_orgs:
            break
    return rows


def parse_organizations_from_html(
    html: str,
    hub_url: str,
    *,
    max_orgs: int = 80,
) -> List[Dict[str, Any]]:
    """Extract unique organization rows from hub page HTML."""
    seen_slugs: Set[str] = set()
    rows: List[Dict[str, Any]] = []

    for match in _ORG_IN_TEXT.finditer(html):
        _append_org(
            rows,
            seen_slugs,
            slug=match.group(1),
            title="",
            hub_url=hub_url,
            max_orgs=max_orgs,
        )
        if len(rows) >= max_orgs:
            break
    return rows


def parse_organizations(
    body: str,
    hub_url: str,
    *,
    content_format: ContentFormat = "markdown",
    max_orgs: int = 80,
) -> List[Dict[str, Any]]:
    if content_format == "markdown":
        return parse_organizations_from_markdown(body, hub_url, max_orgs=max_orgs)
    return parse_organizations_from_html(body, hub_url, max_orgs=max_orgs)

The first-run gotcha I hit was a classic. My original parser expected absolute URLs (https://www.crunchbase.com/organization/...), but Bright Data's markdown renderer produces relative links (/organization/slug "Display Name") 🙃. So zero companies were extracted on the first run, simply because the regex didn't match.

So I just added _ORG_REL_LINK to the parser and re-ran Stage 1 with --parse-only, fixing it at no additional API cost. This is why we cached our raw response bodies. Your parser will probably need more than one round of trial and error, and you don't want to re-fetch the data each time.
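The miss is easy to reproduce offline. In this sketch, `cached_body` is a hypothetical one-line stand-in for a cached markdown response, and the two patterns are the absolute and relative ones from parse_hubs.py under shorter names.

```python
import re

# Absolute-URL pattern: what the first version of the parser relied on
ABSOLUTE = re.compile(
    r"\[([^\]]+)\]\(\s*<?https?://[^>\s)]*crunchbase\.com/organization/([a-z0-9_-]+)/?>?\s*\)",
    re.I,
)
# Relative-link pattern added afterwards
RELATIVE = re.compile(
    r"\]\(/organization/([a-z0-9_-]+)(?:\s+\"([^\"]*)\")?\s*\)",
    re.I,
)

# Hypothetical stand-in for a cached Bright Data markdown body
cached_body = '[Lovable](/organization/lovable "Lovable")'

absolute_hits = ABSOLUTE.findall(cached_body)
relative_hits = RELATIVE.findall(cached_body)

print(len(absolute_hits))  # 0
print(len(relative_hits))  # 1
```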

Output of this stage: A hub_snapshot.json — 96 organizations across 4 hubs (Fintech produced 26, Cybersecurity: 24, SaaS: 22, AI: 24). Note that these are hub leaderboard entries, not full Crunchbase exports.

The full Crunchbase lists run to thousands of companies. I'm taking the curated top slice on purpose: the cleaner my source is, the more clearly the fuzzy lift shows up against it.

Stage 3: Turn Scraped Records into a Flat Table

Before clustering, I flatten the nested snapshot into one uniform record per company appearance. extract.py handles this:

"""From hub_snapshot.json to raw_records.json with company_name per organization."""    

from __future__ import annotations    

import json    
from datetime import datetime, timezone    
from pathlib import Path    
from typing import Any, Dict, List    


def records_from_hub_snapshot(data: Dict[str, Any]) -> List[Dict[str, Any]]:    
    records: List[Dict[str, Any]] = []    
    for hi, block in enumerate(data.get("hubs") or []):    
        if block.get("error"):    
            continue    
        category = (block.get("category") or "unknown").strip()    
        hub_url = block.get("hub_url") or ""    
        for ri, row in enumerate(block.get("rows") or []):    
            if not isinstance(row, dict):    
                continue    
            url = (row.get("url") or "").strip()    
            if not url or "/organization/" not in url.lower():    
                continue    
            title = (row.get("title") or "").strip()    
            slug = (row.get("slug") or "").strip()    
            company_name = title or (slug.replace("-", " ").title() if slug else "")    
            if not company_name:    
                continue    
            records.append(    
                {    
                    "id": f"hub:{hi}:{ri}",    
                    "source": "crunchbase_hub",    
                    "category": category,    
                    "company_name": company_name,    
                    "raw_name": title or company_name,    
                    "url": url,    
                    "domain": "www.crunchbase.com",    
                    "hub_url": hub_url,    
                    "position": ri + 1,    
                }    
            )    
    return records    


def build_raw_payload(snapshot_path: Path) -> Dict[str, Any]:    
    raw = json.loads(snapshot_path.read_text(encoding="utf-8"))    
    if not isinstance(raw.get("hubs"), list):    
        raise ValueError(f"{snapshot_path}: expected hub snapshot with 'hubs' array")    
    records = records_from_hub_snapshot(raw)    
    return {    
        "extracted_at": datetime.now(timezone.utc).isoformat(),    
        "snapshot": str(snapshot_path.name),    
        "record_count": len(records),    
        "records": records,    
    }    


def write_raw_records(snapshot_path: Path, out_path: Path) -> Dict[str, Any]:    
    payload = build_raw_payload(snapshot_path)    
    out_path.parent.mkdir(parents=True, exist_ok=True)    
    out_path.write_text(    
        json.dumps(payload, indent=2, ensure_ascii=False) + "\n",    
        encoding="utf-8",    
    )    
    return payload

The id field (hub:0:3, hub:2:11, etc.) is our stable key that links each raw record to its canonical cluster in Stage 4. Deterministic, derivable from position, and most importantly, easy to debug.

Output: raw_records.json, with 96 rows, each carrying source: "crunchbase_hub" and tagged by category.
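
To make the name fallback and the id scheme concrete, here is a minimal sketch that isolates those two pieces of records_from_hub_snapshot. The helper name display_name and the sample rows are assumptions made up for illustration; they are not part of the original pipeline.

```python
from typing import Dict, List, Optional


def display_name(row: Dict[str, str]) -> Optional[str]:
    """Prefer the scraped title; fall back to a title-cased slug (hypothetical helper)."""
    title = (row.get("title") or "").strip()
    slug = (row.get("slug") or "").strip()
    name = title or (slug.replace("-", " ").title() if slug else "")
    return name or None


# Made-up rows for a single hub (hub index 0).
rows: List[Dict[str, str]] = [
    {"title": "Necker FinTech", "slug": "necker-fintech"},
    {"title": "", "slug": "necker-fintech"},
    {"title": "", "slug": ""},
]

names = [display_name(r) for r in rows]
ids = [f"hub:0:{ri}" for ri, _ in enumerate(rows)]

print(names)  # ['Necker FinTech', 'Necker Fintech', None]
print(ids)    # ['hub:0:0', 'hub:0:1', 'hub:0:2']
```

Note that the slug fallback loses internal capitalization ("Necker Fintech" rather than "Necker FinTech"). That is harmless for matching, because normalization lowercases everything anyway, but it can matter if the fallback name ends up as a display name.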

Stage 4: Reconciliation — Normalize + Dedupe + Fuzzy Cluster

Entity resolution reconciliation (Stage 4) collapses duplicate company names into canonical clusters. In this dataset, 96 scraped rows become 88 canonical companies after normalization and fuzzy clustering. Four names show up on more than one hub — Callaghan Innovation and EISMEA on all four leaderboards, PayTic and SixThirty on two — which gives duplicate rows before clustering.

After exact normalization there are 88 distinct normalized names, which happens to be the same count as final clusters at WRatio threshold 90 — meaning no additional fuzzy merges were needed beyond collapsing the cross-hub duplicates.

I run reconciliation in two passes.

See full code here for reconcile.py: https://gist.github.com/sixthextinction/5c711e48353f4f7765e13cc4bb1b25de

Pass 1: Exact normalization

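# reconcile.py  
import re  

# Legal suffixes to strip. The \b word boundaries stop "co" from matching inside longer words.  
_LEGAL     = re.compile(  
    r"\b(inc\.?|llc\.?|ltd\.?|plc\.?|corp\.?|corporation|co\.?|company|limited)\b",  
    re.I,  
)  
# Anything that is neither a word character nor whitespace, i.e. punctuation.  
_NON_ALNUM = re.compile(r"[^\w\s]", re.UNICODE)  

def normalize_company_name(s: str) -> str:  
    s = s.lower().strip()  
    s = _NON_ALNUM.sub(" ", s)   # strip punctuation  
    s = _LEGAL.sub(" ", s)       # drop legal suffixes  
    s = re.sub(r"\s+", " ", s).strip()   # collapse runs of whitespace  
    return s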

After normalization, I group records by their normalized string. "Lovable", "lovable", and "Lovable." all collapse into the same group. This removes trivial duplicates before the more expensive fuzzy-matching pass. TL;DR: Do the cheap pass first, expensive pass second — same reason you’d put a WHERE clause before a JOIN.
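
The grouping step itself is only a few lines. Below is a minimal, self-contained sketch of it; the function name exact_groups and the sample list are assumptions for illustration (the real pipeline groups record dicts, not bare strings).

```python
import re
from collections import defaultdict
from typing import Dict, List

_LEGAL = re.compile(
    r"\b(inc\.?|llc\.?|ltd\.?|plc\.?|corp\.?|corporation|co\.?|company|limited)\b",
    re.I,
)
_NON_ALNUM = re.compile(r"[^\w\s]", re.UNICODE)


def normalize_company_name(s: str) -> str:
    s = s.lower().strip()
    s = _NON_ALNUM.sub(" ", s)  # strip punctuation
    s = _LEGAL.sub(" ", s)      # drop legal suffixes
    return re.sub(r"\s+", " ", s).strip()


def exact_groups(names: List[str]) -> Dict[str, List[str]]:
    """Bucket raw names by their normalized form: one dict insert per name."""
    groups: Dict[str, List[str]] = defaultdict(list)
    for name in names:
        groups[normalize_company_name(name)].append(name)
    return dict(groups)


sample = [
    "Lovable", "lovable", "Lovable.",
    "Acme Corp", "ACME Corporation",
    "Points Kash", "PointsKash",
]
groups = exact_groups(sample)
for key, members in groups.items():
    print(f"{key!r}: {members}")
```

The three Lovable spellings and the two Acme spellings collapse, but "Points Kash" and "PointsKash" stay in separate groups because they normalize to different strings. Spacing drift like that is what the fuzzy pass is for.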

For this dataset that’s 96 hash inserts (one normalize_company_name() call plus one dict lookup per row), so roughly O(n).

The important optimization is that normalization shrinks the search space before the quadratic fuzzy pass runs. Without Pass 1, naïve all-pairs fuzzy matching over n = 10,000 unique names would require:

n(n−1)/2 ≈ 50 million comparisons

It takes my laptop ~1.3 µs per RapidFuzz WRatio call on ~30-character names, so 50 million comparisons push the runtime to roughly 65 seconds instead of milliseconds. Not ideal, which is exactly why Pass 1 exists: to reduce n before the O(n²) step becomes expensive.

Pass 2: Fuzzy merge

I then compare each exact group against existing clusters using WRatio from RapidFuzz.

# reconcile.py  
from rapidfuzz import fuzz  
_FUZZY_SCORER = fuzz.WRatio  

def _fuzzy_merge_groups(  
    groups: List[List[Dict[str, Any]]],  
    threshold: float,           # default: 90.0  
) -> List[Cluster]:  
    clusters: List[Cluster] = []  
    for group in sorted(groups, key=lambda g: (  
        min(_source_rank(m.get("source") or "") for m in g),  
        -len(g),  
    )):  
        rep = pick_canonical_name(group)  
        placed = False  
        for cluster in clusters:  
            if _FUZZY_SCORER(rep, cluster.canonical_name) >= threshold:  
                cluster.members.extend(group)  
                cluster.canonical_name = pick_canonical_name(cluster.members)  
                cluster.canonical_id   = make_canonical_id(cluster.canonical_name)  
                placed = True  
                break  
        if not placed:  
            clusters.append(Cluster(  
                canonical_id=make_canonical_id(rep),  
                canonical_name=rep,  
                members=list(group),  
            ))  
    return clusters
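
The listing above relies on Cluster and make_canonical_id, which are not shown in this post. The sketch below is one plausible shape for them plus a stripped-down version of the greedy loop, so the control flow can be run in isolation. Everything here is an assumption: the dataclass fields mirror how the listing uses them, the id scheme is guessed from the c_necker_fintech example later in the post, and stdlib_score is a difflib stand-in that is not equivalent to fuzz.WRatio. It also skips the source-rank sort and the canonical re-pick.

```python
import re
from dataclasses import dataclass, field
from difflib import SequenceMatcher
from typing import Callable, List


@dataclass
class Cluster:
    canonical_id: str
    canonical_name: str
    members: List[str] = field(default_factory=list)


def make_canonical_id(name: str) -> str:
    # Hypothetical scheme: "c_" + lowercase alphanumeric tokens joined by "_".
    return "c_" + "_".join(re.findall(r"[a-z0-9]+", name.lower()))


def stdlib_score(a: str, b: str) -> float:
    # Stand-in scorer on a 0-100 scale. NOT the same algorithm as fuzz.WRatio.
    return 100.0 * SequenceMatcher(None, a.lower(), b.lower()).ratio()


def greedy_cluster(
    names: List[str],
    threshold: float = 90.0,
    scorer: Callable[[str, str], float] = stdlib_score,
) -> List[Cluster]:
    clusters: List[Cluster] = []
    for name in names:
        for cluster in clusters:
            # First cluster whose canonical name clears the threshold wins.
            if scorer(name, cluster.canonical_name) >= threshold:
                cluster.members.append(name)
                break
        else:
            # No existing cluster matched: start a new one.
            clusters.append(Cluster(make_canonical_id(name), name, [name]))
    return clusters


clusters = greedy_cluster(["PointsKash", "Points Kash", "qBotica", "Stripe"])
for c in clusters:
    print(c.canonical_id, c.members)
```

Because the loop stops at the first cluster that clears the threshold, results depend on input order. That is why the real function sorts groups by source rank and size before iterating.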

Here, we have to compare each group’s representative against existing cluster canonicals. So the worst case with g = 88 exact groups would be

0 + 1 + 2 + … + 87 = 3,828 comparisons

That’s roughly O(g²).
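
Both comparison counts quoted in this section come from the same formula, n(n−1)/2. A quick check, assuming the ~1.3 µs per-call figure measured above (your hardware will differ):

```python
def pairwise_comparisons(n: int) -> int:
    """Number of unordered pairs among n items: n(n-1)/2."""
    return n * (n - 1) // 2


# Worst case for Pass 2 on this dataset: 88 exact groups.
g = 88
worst_case = pairwise_comparisons(g)
assert worst_case == sum(range(g))  # same as 0 + 1 + ... + 87

# Hypothetical larger run with no Pass 1: 10,000 unique names.
n = 10_000
pairs = pairwise_comparisons(n)
seconds_per_call = 1.3e-6  # assumed, taken from the laptop measurement above
est_seconds = pairs * seconds_per_call

print(f"{worst_case:,} comparisons for g = {g}")  # 3,828 comparisons for g = 88
print(f"{pairs:,} comparisons for n = {n:,}")     # 49,995,000 comparisons for n = 10,000
print(f"about {est_seconds:.0f} s at 1.3 µs per call")  # about 65 s at 1.3 µs per call
```

At 3,828 comparisons the fuzzy pass finishes in a few milliseconds, so the quadratic term only starts to hurt several orders of magnitude above this dataset's size.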

What is WRatio and Why Use It?

RapidFuzz ships several scorers — see the rapidfuzz.fuzz docs for the full list. We use fuzz.WRatio (weighted ratio; same algorithm family as FuzzyWuzzy’s WRatio) because company names drift in different ways and no single metric covers all of them.

WRatio is a meta-scorer: for each pair of strings it runs several ratio algorithms internally (with length-based weighting) and returns the best score. It combines:

ratio — character-level edit distance. Good for typos; bad when one name is much longer (Necker FinTech vs Necker FinTech Holdings Inc. looks like a poor match).
partial_ratio — treats the shorter string as a substring of the longer one. Catches display names contained in legal names.
token_sort_ratio — splits on words, sorts them, then compares. Handles reordered tokens.
token_set_ratio — compares word sets, ignoring duplicates and extras like Inc. or Group.

You rarely know in advance which kind of drift a CRM row will have — suffix appended, spacing changed, words reordered. WRatio picks the strategy that scores highest for that specific pair, which is exactly what you want for entity resolution on names alone.
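
To see why a single metric is not enough, here is a rough standard-library illustration of three of those ideas. These helpers are simplified stand-ins written with difflib, not RapidFuzz's implementations, and their scores will not match fuzz.ratio, fuzz.token_sort_ratio, or fuzz.token_set_ratio. The "Labs Acme" pair is made up to show reordering.

```python
from difflib import SequenceMatcher


def char_ratio(a: str, b: str) -> float:
    """Character-level similarity on a 0-100 scale (stand-in for a plain ratio)."""
    return 100.0 * SequenceMatcher(None, a.lower(), b.lower()).ratio()


def token_sort(a: str, b: str) -> float:
    """Sort the words first, then compare: reordering no longer matters."""
    sa = " ".join(sorted(a.lower().split()))
    sb = " ".join(sorted(b.lower().split()))
    return 100.0 * SequenceMatcher(None, sa, sb).ratio()


def token_subset(a: str, b: str) -> bool:
    """True when one name's words are fully contained in the other's."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return ta <= tb or tb <= ta


short, legal = "Necker FinTech", "Necker FinTech Holdings Inc."

print(round(char_ratio(short, legal), 1))            # well under 90: the suffix hurts
print(token_subset(short, legal))                    # True: the display name is contained
print(token_sort("Labs Acme", "Acme Labs"))          # 100.0: same words, different order
print(round(char_ratio("PointsKash", "Points Kash"), 1))  # above 90: one space of drift
print(token_subset("Stripe", "Climate Corp"))        # False: unrelated names
```

Each helper rescues a different kind of drift and misses the others, which is the motivation for a scorer that tries several strategies and keeps the best result.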

We default to threshold 90: strict enough that unrelated pairs (Stripe vs Climate Corp) stay out, loose enough that real variants (PointsKash vs Points Kash) merge. Tune it on your data.

On this dataset specifically, WRatio handles the drift patterns we actually see in company names (or historically have, anyway):

Hub / scraped name	CRM variant (in `sample_crm.json`)	Drift type
`Necker FinTech`	`Necker FinTech Holdings Inc.`	Legal suffix + spacing (`Fin Tech` vs `FinTech`)
`PANTA`	`PANTA Group`	Type descriptor appended
`Physical Intelligence`	`Physical Intelligence (Pi), Inc.`	Parenthetical + legal suffix
`PointsKash`	`Points Kash`	Token spacing
`qBotica`	`q Botica`	Token spacing

Pure ratio (edit distance) would heavily penalize Necker FinTech vs Necker FinTech Holdings Inc. because the two extra words add significant distance. partial_ratio handles that kind of containment, token_sort_ratio handles reordering, and token_set_ratio handles extra tokens. WRatio picks the strategy that produces the best score for each specific pair, which is exactly the behavior you want when you don't know in advance how a name is going to drift.

Display names with no shared tokens to the legal entity — e.g. hub title Investing.com vs operator Fusion Media Limited, or Lyrie.ai vs OTT Cybersecurity Inc. — stay below threshold. WRatio correctly refuses to merge them. That’s a good thing — those belong in a lookup table or enrichment API, not in a string-similarity pass (see Caveats).

One last thing before we move on to the demo — every time a cluster gains new members, I re-evaluate its canonical name. The source ranking (crunchbase_hub = 0, anything else = 99) ensures that short, clean display names win over longer legal variants:

def pick_canonical_name(members: Sequence[Dict[str, Any]]) -> str:  
    def sort_key(m):  
        name = (m.get("company_name") or "").strip()  
        return (_source_rank(m.get("source") or ""), len(name), name.lower())  
    return min(members, key=sort_key)["company_name"].strip()

A Crunchbase display name like "Lovable" will always beat "Lovable Technologies Inc." as the canonical — it's from a trusted source and it's shorter. The legal variant ends up as an alias, which is exactly the right relationship.

Output of this stage: reconciled.json — 88 canonical clusters, alias mappings with WRatio scores, and CRM join metrics.

That’s it, we’re all done with the fuzzy pipeline. Let’s see if that improved things.

Stage 5: Performing a Real CRM Join

Our sample_crm.json simulates the data you’d get from a real CRM — I simply researched legal names and known alternate spellings online for the companies I had, and put it in a JSON file. This gave me 138 rows representing the same 88 canonical companies.

Some companies had one exact-match entry — these are easy for us to handle. Others had three or four variants that I’d name like this:

{ "id": "crm:necker_fintech_0", "company_name": "Necker Fin Tech" },  
{ "id": "crm:necker_fintech_1", "company_name": "Necker FinTech Group" },  
{ "id": "crm:necker_fintech_2", "company_name": "Necker FinTech Holdings Inc." }

Our join logic in the post_fuzzy_eval.py demo runs exact normalization first, then falls back to fuzzy. Note how this is the same “cheap pass first” pattern as the cluster builder:

post_fuzzy_eval.py

"""Optional CRM join evaluation — exact vs fuzzy match rates (not part of core reconcile)."""

from __future__ import annotations

import json
from pathlib import Path
from typing import Any, Dict, List, Optional, Sequence

from rapidfuzz import fuzz

from reconcile import (
    Cluster,
    DEFAULT_THRESHOLD,
    _exact_groups,
    normalize_company_name,
)

_FUZZY_SCORER = fuzz.WRatio


def load_crm(path: Path) -> List[Dict[str, Any]]:
    raw = json.loads(path.read_text(encoding="utf-8"))
    if isinstance(raw, list):
        rows = raw
    elif isinstance(raw, dict) and "companies" in raw:
        rows = raw["companies"]
    else:
        raise ValueError(f"{path}: expected list or {{'companies': [...]}}")
    out: List[Dict[str, Any]] = []
    for i, row in enumerate(rows):
        if not isinstance(row, dict):
            continue
        name = (row.get("company_name") or "").strip()
        if not name:
            continue
        out.append(
            {
                "id": row.get("id") or f"crm:{i}",
                "company_name": name,
            }
        )
    return out


def _record_to_cluster_map(clusters: Sequence[Cluster]) -> Dict[str, str]:
    out: Dict[str, str] = {}
    for cluster in clusters:
        for m in cluster.members:
            out[m["id"]] = cluster.canonical_id
    return out


def crm_to_canonical(
    crm_rows: Sequence[Dict[str, Any]],
    clusters: Sequence[Cluster],
    threshold: float,
) -> Dict[str, Optional[str]]:
    out: Dict[str, Optional[str]] = {}
    for row in crm_rows:
        key = str(row.get("id") or row.get("company_name"))
        name = (row.get("company_name") or "").strip()
        if not name:
            out[key] = None
            continue
        norm = normalize_company_name(name)
        matched: Optional[str] = None
        for cluster in clusters:
            if any(
                normalize_company_name(m.get("company_name") or "") == norm
                for m in cluster.members
            ):
                matched = cluster.canonical_id
                break
        if not matched:
            best_score = 0.0
            best_id: Optional[str] = None
            for cluster in clusters:
                score = _FUZZY_SCORER(name, cluster.canonical_name)
                if score > best_score:
                    best_score = score
                    best_id = cluster.canonical_id
            matched = best_id if best_score >= threshold else None
        out[key] = matched
    return out


def join_metrics(
    records: Sequence[Dict[str, Any]],
    crm_rows: Sequence[Dict[str, Any]],
    clusters: Sequence[Cluster],
    threshold: float,
) -> Dict[str, Any]:
    record_to_cid = _record_to_cluster_map(clusters)
    crm_to_cid = crm_to_canonical(crm_rows, clusters, threshold)

    crm_norms = {
        normalize_company_name((r.get("company_name") or ""))
        for r in crm_rows
        if normalize_company_name(r.get("company_name") or "")
    }
    crm_mapped_cids = {v for v in crm_to_cid.values() if v}

    scraped_exact = 0
    scraped_fuzzy = 0
    for r in records:
        norm = normalize_company_name(r.get("company_name") or "")
        if norm in crm_norms:
            scraped_exact += 1
        cid = record_to_cid.get(r["id"])
        if cid and cid in crm_mapped_cids:
            scraped_fuzzy += 1

    crm_exact = 0
    crm_fuzzy = 0
    scraped_norms = {
        normalize_company_name(r.get("company_name") or "") for r in records
    }
    scraped_cids = set(record_to_cid.values())
    for row in crm_rows:
        norm = normalize_company_name(row.get("company_name") or "")
        if norm in scraped_norms:
            crm_exact += 1
        cid_key = str(row.get("id") or row.get("company_name"))
        cid = crm_to_cid.get(cid_key)
        if cid and cid in scraped_cids:
            crm_fuzzy += 1

    n_scraped = len(records) or 1
    n_crm = len(crm_rows) or 1
    return {
        "scraped_rows": len(records),
        "crm_rows": len(crm_rows),
        "canonical_clusters": len(clusters),
        "exact_normalized_unique": len(_exact_groups(records)),
        "scraped_exact_join_pct": round(100.0 * scraped_exact / n_scraped, 1),
        "scraped_fuzzy_join_pct": round(100.0 * scraped_fuzzy / n_scraped, 1),
        "crm_exact_join_pct": round(100.0 * crm_exact / n_crm, 1),
        "crm_fuzzy_join_pct": round(100.0 * crm_fuzzy / n_crm, 1),
    }


def eval_crm_join(
    records: Sequence[Dict[str, Any]],
    clusters: Sequence[Cluster],
    crm_path: Path,
    threshold: float = DEFAULT_THRESHOLD,
) -> Dict[str, Any]:
    """Load CRM file and compute join metrics against existing clusters."""
    crm_rows = load_crm(crm_path)
    return join_metrics(records, crm_rows, clusters, threshold)

Here’s how we measure this JOIN operation (join_metrics in post_fuzzy_eval.py):

Scraped data to CRM, exact: the scraped row’s normalized company_name appears in the set of normalized CRM names.
Scraped data to CRM, fuzzy: the scraped row’s canonical cluster id is also reached by at least one CRM row (exact norm match on a cluster member, or WRatio ≥ threshold on the cluster’s canonical name).
CRM to scraped data: the symmetric checks from the CRM row’s perspective.
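
The two exact-match rates are not symmetric, and a toy example shows why. This sketch assumes the names are already normalized and uses a made-up five-row CRM; the helper name exact_join_pct is hypothetical and only mirrors the exact branch of join_metrics.

```python
from typing import List


def exact_join_pct(left: List[str], right: List[str]) -> float:
    """Percentage of rows in `left` whose name also appears in `right`."""
    right_set = set(right)
    hits = sum(1 for name in left if name in right_set)
    return round(100.0 * hits / (len(left) or 1), 1)


# Already-normalized names. The CRM holds several variants per company.
scraped = ["necker fintech", "panta", "lovable"]
crm = [
    "necker fintech holdings",
    "necker fin tech",
    "panta",
    "panta group",
    "lovable",
]

scraped_to_crm = exact_join_pct(scraped, crm)
crm_to_scraped = exact_join_pct(crm, scraped)

print(scraped_to_crm)  # 66.7 (2 of 3 scraped rows find an exact CRM name)
print(crm_to_scraped)  # 40.0 (2 of 5 CRM rows find an exact scraped name)
```

Every extra CRM variant adds a row to the CRM-side denominator without adding an exact hit, so the CRM-to-scraped rate is always the more pessimistic of the two when the CRM carries multiple spellings per company.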

Post-Fuzzy Matching Results

So how did we do?

Question	Exact match	Fuzzy (WRatio ≥ 90)
Of 96 scraped rows, how many link to a CRM row?	58.3% (56 rows)	100% (96 rows)
Of 138 CRM rows, how many link back to scraped data?	34.8% (48 rows)	100% (138 rows)

The 58.3% exact baseline isn’t actually bad — over half of raw hub titles normalize to a CRM string exactly. The other 41.7% however, absolutely need fuzzy matching via WRatio because the CRM holds legal or alternate spellings (Necker FinTech Holdings Inc. vs hub Necker FinTech, etc.) that no amount of lowercasing or other normalization will save you from.

The fuzzy pass closes the gap on this dataset at WRatio threshold 90. WRatio is strict enough to avoid merging unrelated names while still picking up suffix and token drift — which is fantastic — just what we want!

Running It

Commands below assume Python 3.10+ and a venv. All of this runs locally; the only network calls are to Bright Data during the initial fetch.

# Install deps  
pip install rapidfuzz requests python-dotenv  

# Fetch all 4 hubs (costs API credits)  
python fetch_hubs.py  

# Already have cached responses? Re-parse for free  
python fetch_hubs.py --parse-only  

# Extract + reconcile + print CRM metrics (default: both stages)  
python run_fuzzy.py  

# Regenerate sample_crm.json from raw_records (optional)  
python build_sample_crm.py  

# Tune the threshold (try 85 for more aggressive merging)  
python run_fuzzy.py --threshold 85  

# Run individual stages  
python run_fuzzy.py --extract  
python run_fuzzy.py --reconcile

Sample CLI output after a full run:

wrote data/raw_records.json  
  records: 96  
  category artificial_intelligence: 24  
  category cybersecurity: 24  
  category fintech: 26  
  category saas: 22  

wrote data/reconciled.json  

-- join metrics (CRM) --  
  scraped rows: 96 | exact-normalized unique: 88 | canonical clusters: 88  
  scraped -> CRM  exact: 58.3% | fuzzy: 100.0%  
  CRM -> scraped   exact: 34.8% | fuzzy: 100.0%  

-- top 10 canonicals (by alias count) --  
  Callaghan Innovation  (4 aliases, sources: crunchbase_hub)  
  EISMEA  (4 aliases, sources: crunchbase_hub)  
  PayTic  (2 aliases, sources: crunchbase_hub)  
  SixThirty  (2 aliases, sources: crunchbase_hub)  
  ...more

A Quick Note: The Review Queue

I’ve also added a diagnostic queue into the pipeline for low-confidence alias assignments — records whose WRatio against their cluster’s canonical falls below the threshold. This will show us merges that look suspicious and deserve a human eye:

# reconcile.py  
def review_queue(  
    records: Sequence[Dict[str, Any]],  
    clusters: Sequence[Cluster],  
    threshold: float,  
    limit: int = 8,  
) -> List[Tuple[float, str, str, str]]:  
    rid_to_cluster = {m["id"]: c for c in clusters for m in c.members}  
    lows = []  
    for r in records:  
        c     = rid_to_cluster.get(r["id"])  
        if c is None:   # record was never clustered, so there is nothing to score  
            continue  
        name  = r.get("company_name") or ""  
        score = _FUZZY_SCORER(name, c.canonical_name)  
        if score < threshold:  
            lows.append((score, name, c.canonical_name, c.canonical_id))  
    lows.sort(key=lambda x: x[0])  
    return lows[:limit]

In production this would feed a human-review UI or write to a needs_review table. Here it just prints to stdout — but my point stands: fuzzy matching isn't a black box. You can always surface the borderline decisions and let a human confirm them.
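
As a rough idea of the needs_review route, the sketch below writes review-queue tuples into a SQLite table using only the standard library. The table layout, the function name write_review_queue, and the sample rows (including their scores, which are placeholders and not real WRatio output) are all assumptions.

```python
import sqlite3
from typing import List, Tuple

# Same tuple shape review_queue() returns: (score, name, canonical_name, canonical_id).
ReviewRow = Tuple[float, str, str, str]


def write_review_queue(conn: sqlite3.Connection, rows: List[ReviewRow]) -> int:
    """Upsert review rows and return how many rows the table holds afterwards."""
    conn.execute(
        """
        CREATE TABLE IF NOT EXISTS needs_review (
            score          REAL,
            name           TEXT,
            canonical_name TEXT,
            canonical_id   TEXT,
            PRIMARY KEY (name, canonical_id)
        )
        """
    )
    with conn:  # commits on success, rolls back on error
        conn.executemany(
            "INSERT OR REPLACE INTO needs_review VALUES (?, ?, ?, ?)", rows
        )
    return conn.execute("SELECT COUNT(*) FROM needs_review").fetchone()[0]


# Placeholder rows for illustration only.
sample_rows: List[ReviewRow] = [
    (84.0, "Necker FinTech Group", "Necker FinTech", "c_necker_fintech"),
    (71.5, "PANTA Group", "PANTA", "c_panta"),
]

conn = sqlite3.connect(":memory:")
first = write_review_queue(conn, sample_rows)
second = write_review_queue(conn, sample_rows)  # re-running does not duplicate rows
print(first, second)  # 2 2
```

Keying the table on (name, canonical_id) keeps repeated pipeline runs idempotent, so a reviewer never sees the same borderline pair twice.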

That’s everything, thanks for reading!

Frequently Asked Questions

Q: Do you need ML or vector embeddings for company name matching?

A: No, not for stylistic drift (legal suffixes, spacing, punctuation). Our pipeline uses RapidFuzz fuzz.WRatio, which is a rule-based string similarity scorer, not a trained model.

Q: What similarity threshold should you use with WRatio?

A: Start at WRatio threshold 90. At 90, unrelated pairs like Stripe vs Climate Corp score 45.0 and stay out, while suffix/spacing variants like Necker FinTech vs Necker FinTech Holdings Inc. score 90.0+ and merge. See the score_cutoff parameter in the docs if you want early-exit optimization.

Q: When does fuzzy matching fail for company names?

A: When names share almost no tokens — e.g. brand Investing.com vs legal entity Fusion Media Limited (WRatio 30.0). Use a lookup table, domain, LEI, or enrichment API instead.

Q: Why not join on company_name in SQL?

A: Because raw name joins will often miss legal variants. Resolve each row to a canonical_id in Python, load clusters into Postgres, and only then can you safely do a JOIN ... USING (canonical_id).

Caveats

I should clear some things up about this tutorial.

The sample is leaderboard-only. These are the top-ranked companies on each hub, not a random draw from Crunchbase’s full 6,000+ company lists. Leaderboards are always curated. Noisier source data would push the exact-match baseline down, making the fuzzy lift even bigger.
The CRM examples are handmade. I researched exact hub titles plus known legal variants like ARYZE ApS, Count Finance LTD, PANTA Group. A real CRM would actually be dirtier: misspellings, stale names, entries from multiple import sources with inconsistent formatting. In practice the fuzzy pass may not hit 100%, but it'll still get you much closer than exact matching does.
Some display names need a lookup table, not fuzzy strings. Pairs like Investing.com / Fusion Media Limited or Lyrie.ai / OTT Cybersecurity Inc. share almost no tokens, so WRatio stays low and that's correct behavior. For irreconcilable aliases like that you still want GLEIF, Clearbit, or simply a maintained slug → legal_name map. Fuzzy matching handles stylistic drift on the same name; it can’t handle unrelated brand vs legal entity pairs.

Use Cases for Entity Resolution in Python

The normalize → exact-group → fuzzy-cluster → CRM join pattern I’ve described here applies directly to:

CRM deduplication: merge Acme Corp, Acme Corporation, and ACME before they become three separate accounts in your sales pipeline.
Lead enrichment: match inbound form submissions against existing accounts without requiring exact name entry from the user.
M&A / investor research: reconcile company lists from multiple data vendors (Crunchbase, PitchBook, LinkedIn) that use different display names for the same entity.
Changelog and release tracking: match product names across sources (GitHub repo name, npm package name, marketing site name) that follow different conventions.

That WRatio threshold is something you should play around with. At WRatio threshold 90 (the default in this pipeline), clearly unrelated pairs stay out (Stripe vs Climate Corp scores 45.0) while suffix and spacing drift gets in. Drop to 80 and you'll catch more variants but start seeing false positives. This will differ based on your dataset, obviously, and the review queue is your safety net either way.

Next step in production: load reconciled.json into Postgres, resolve each CRM row to a canonical_id (same logic as crm_to_canonical in Python), then join on that key instead of company_name.

-- Tables loaded from pipeline output (reconciled.json + raw_records + sample_crm)  
CREATE TABLE canonicals (  
  canonical_id   TEXT PRIMARY KEY,  
  canonical_name TEXT NOT NULL  
);  

CREATE TABLE entity_aliases (  
  canonical_id TEXT NOT NULL REFERENCES canonicals (canonical_id),  
  alias_name   TEXT NOT NULL,  
  source       TEXT,  
  match_score  NUMERIC,  
  PRIMARY KEY (canonical_id, alias_name)  
);  

CREATE TABLE hub_scrape (  
  id            TEXT PRIMARY KEY,  
  company_name  TEXT NOT NULL,  
  canonical_id  TEXT REFERENCES canonicals (canonical_id),  
  category      TEXT,  
  url           TEXT  
);  

CREATE TABLE crm_accounts (  
  id            TEXT PRIMARY KEY,  
  company_name  TEXT NOT NULL,  
  canonical_id  TEXT REFERENCES canonicals (canonical_id)  -- from Python CRM mapping  
);  

-- Broken: join on raw company_name  
SELECT COUNT(*) AS matched_rows  
FROM   hub_scrape h  
JOIN   crm_accounts c ON c.company_name = h.company_name;  
-- 56 / 96 (~58%) on this dataset  

-- Fixed: join on canonical_id (assigned during ETL from reconciled.json)  
SELECT h.company_name AS hub_name,  
       c.company_name AS crm_name,  
       h.canonical_id  
FROM   hub_scrape h  
JOIN   crm_accounts c USING (canonical_id)  
WHERE  h.company_name = 'Necker FinTech';  
-- hub_name: Necker FinTech  
-- crm_name: Necker FinTech Holdings Inc.  (or Necker Fin Tech, etc.)  
-- canonical_id: c_necker_fintech

Load canonicals and entity_aliases from reconciled.json.
Set hub_scrape.canonical_id from the aliases array (id → canonical_id).
Set crm_accounts.canonical_id with the same crm_to_canonical logic you already run in Python (exact norm match, then WRatio ≥ 90).
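
To sanity-check the pattern without a Postgres instance, the same two joins can be run against an in-memory SQLite database from Python. This is only a stand-in: the tables are trimmed to the columns the joins need, the hub row id hub:0:3 is a made-up position, and the canonical_id values are typed in by hand rather than loaded from reconciled.json.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE hub_scrape (
        id TEXT PRIMARY KEY, company_name TEXT NOT NULL, canonical_id TEXT
    );
    CREATE TABLE crm_accounts (
        id TEXT PRIMARY KEY, company_name TEXT NOT NULL, canonical_id TEXT
    );
    """
)
conn.executemany(
    "INSERT INTO hub_scrape VALUES (?, ?, ?)",
    [("hub:0:3", "Necker FinTech", "c_necker_fintech")],
)
conn.executemany(
    "INSERT INTO crm_accounts VALUES (?, ?, ?)",
    [
        ("crm:necker_fintech_0", "Necker Fin Tech", "c_necker_fintech"),
        ("crm:necker_fintech_2", "Necker FinTech Holdings Inc.", "c_necker_fintech"),
    ],
)

# Join on the raw name: no CRM spelling equals the hub title, so nothing matches.
raw_hits = conn.execute(
    "SELECT COUNT(*) FROM hub_scrape h "
    "JOIN crm_accounts c ON c.company_name = h.company_name"
).fetchone()[0]

# Join on canonical_id: every CRM variant links back to the hub row.
fixed = conn.execute(
    "SELECT c.company_name FROM hub_scrape h "
    "JOIN crm_accounts c USING (canonical_id) ORDER BY c.id"
).fetchall()

print(raw_hits)  # 0
print(fixed)     # [('Necker Fin Tech',), ('Necker FinTech Holdings Inc.',)]
```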

After that, SQL stays a plain equi-join — fuzzy matching happens once upstream, and not inside the database. I won’t cover that though; the pattern is the point, not the warehouse you choose to use.

None of this is new: entity resolution is a well-studied problem with industrial-strength tools (Dedupe, Splink, various record linkage toolkits) when you need them. But for the common case of “I have two lists of company names and I need to join them,” you really don’t. A normalization pass and a WRatio threshold get you most of the way there in an afternoon, in pure Python, with zero infrastructure.

5 Production Stacks for Live Data Ingestion at Scale (Without Getting Blocked)

Prithwish Nath — Tue, 19 May 2026 09:53:42 +0000

TL;DR: Most teams over-engineer data ingestion. They use Kafka before they’ve hit their first rate limit, or Playwright before they’ve checked the network tab. This guide shows five production-tested stacks for live data ingestion from minimal fetch + cron up to LLM agents calling Model Context Protocol (MCP) tools — with the specific failure mode each one solves.

What You’ll Learn

When a plain fetch loop with a cron job is truly all you need
How an agent calling MCP tools adapts where other methods can’t
The serverless + object storage pattern that handles high fan-out volume
How to add retries, idempotency, and replay to any I/O layer without rewriting it
When (and only when) to reach for a headless browser

The right ingestion stack is the one with the fewest moving parts that still handles your specific failure mode. Not someone else’s failure mode. Yours.

Stack	Failure mode it solves	Vendor surface	Complexity	Cost floor	Real ceiling
1. Bun/Node fetch + allowlist	Stable public APIs, no anti-bot story	None	Minimal	Free	Upstream rate limits, not your hardware. Most public APIs cap at 60–6,000 req/min.
2. Agent + Bright Data MCP	High-complexity targets: anti-bot, JS rendering, multi-step flows, adaptive extraction — at moderate volume	Bright Data (MCP tooling w/ free tier)	Low	Bright Data free tier	Rapid: 5K req/mo free. Pro tools and `web_data_*` extractors bill separately — check Bright Data pricing before you rely on them.
3. Serverless cron → object storage	Bursty ingest, fan-out, raw payload durability	Cloud provider only	Low-medium	~$0	Sub-request limit per invocation (50 Free / 1K Paid on CF Workers). Bypass with Queues.
4. Durable workflow engine + swappable I/O	Flaky upstreams, retries, idempotency, replay	Workflow engine + optional proxy	Medium	Varies by engine	Concurrency, history size, replay behavior, and hosted usage limits vary — model fan-out before you scale.
5. Minimal Playwright headless	JS-rendered pages, SPAs, click-to-render flows	Optional proxy vendor	Medium-high	Compute cost	Memory-bound. Each Chromium context ~200–500 MB. Parallelize based on your instance's RAM, not intuition.

1. Bun/Node `fetch` + Allowlist — The Boring Baseline That Works

Documentation: MDN: Fetch API · Bun HTTP

License: N/A

Free Tier: N/A

Best for: Stable public APIs, RSS feeds, open datasets, internal tooling that hits your own endpoints, anything where the anti-bot story is that there is no anti-bot story.

What is the fetch + cron stack?

Plain fetch against a list of known-good URLs, on a timer, writing results to flat files or SQLite. No framework, queue, or service dependencies. A script you can read start to finish in five minutes.

Why use fetch + cron for live data ingestion?

I want to be honest about how often this is all you need. If you’re hitting stable public APIs — government data portals, RSS feeds, well-behaved JSON endpoints, open datasets that publish on a schedule — there is no failure mode that demands anything more than this.

This is the stack everything else is measured against. Before you add a workflow engine like Temporal/Trigger, agents, a proxy, or anything else — see if this one is enough. It usually is.

// bun run ingest.ts   

import { Database } from "bun:sqlite";  

const ALLOWLIST = [  
  "https://api.github.com/repos/vercel/next.js/releases",  
  "https://registry.npmjs.org/react",  
  "https://data.gov/some-dataset.json",  
  // whatever else  
];  

const db = new Database("./data.sqlite");  

db.run(`  
  CREATE TABLE IF NOT EXISTS raw_payloads (  
    id INTEGER PRIMARY KEY AUTOINCREMENT,  
    url TEXT,  
    fetched_at TEXT,  
    payload TEXT  
  )  
`);  

async function ingest() {  
  for (const url of ALLOWLIST) {  
    try {  
      const res = await fetch(url, {  
        headers: { "User-Agent": "my-ingest-bot/1.0" },  
        signal: AbortSignal.timeout(10_000),  
      });  

      if (!res.ok) {  
        console.warn(`[${res.status}] ${url}`);  
        continue;  
      }  

      const payload = await res.text();  

      db.run(  
        `INSERT INTO raw_payloads (url, fetched_at, payload) VALUES (?, ?, ?)`,  
        [url, new Date().toISOString(), payload],  
      );  
    } catch (err) {  
      console.error(`Failed: ${url}`, err);  
    }  
  }  
}  

ingest();

Run it with a system cron, a GitHub Actions schedule, or a simple setInterval. That's literally the whole stack.

How to Handle Pagination

Most real APIs paginate. A very simple, intuitive pattern is to keep one loop, store each payload, and stop only when the API gives no next pointer.

type Page = { items?: unknown[]; next_cursor?: string; next_url?: string };  

async function ingestPaginated(baseUrl: string) {  
  let url: string | null = baseUrl;  
  let page = 0;  

  while (url) {  
    const res = await fetch(url, {  
      headers: { "User-Agent": "my-ingest-bot/1.0" },  
      signal: AbortSignal.timeout(10_000),  
    });  
    if (!res.ok) {  
      console.warn(`[${res.status}] page ${page} — stopping`);  
      break;  
    }  

    const json = (await res.json()) as Page;  

    db.run(  
      `INSERT INTO raw_payloads (url, fetched_at, payload) VALUES (?, ?, ?)`,  
      [url, new Date().toISOString(), JSON.stringify(json)],  
    );  

    url =  
      json.next_url ??  
      (json.next_cursor ? `${baseUrl}?cursor=${json.next_cursor}` : null);  
    page++;  
    if (url) await new Promise((r) => setTimeout(r, 250)); // polite pacing  
  }  

  console.log(`Ingested ${page} pages from ${baseUrl}`);  
}

For offset-based APIs (?page=1&per_page=100), increment page and stop when the response is empty. For link-header APIs (Link: <url>; rel="next"), parse res.headers.get("link") or similar for the next URL.

When fetch + cron isn’t enough

You start getting 403s or rate-limited responses → Stack 2 (Agent + Bright Data MCP) or Stack 5 (Playwright)
You need retries with backoff and idempotency → Stack 4 (durable workflow engine)
Your list of URLs grows to thousands and you need fanout → Stack 3 (serverless)
The pages require JS execution → Stack 5 (Playwright)

What I got wrong

Always store the raw payload, just in case! I built a pipeline that extracted five fields and stored them in a normalized SQLite table, and didn’t see the point of keeping the raw response. Three weeks later I needed a sixth field, one that had been in every response the whole time. Re-fetching that government dataset took four days because of their rate limits. So, yeah. The disk cost of storing res.text() verbatim is trivial. The cost of finding out your schema was wrong after the fact, and wasting time and money re-ingesting, is decidedly not. 😅 You can parse however you want downstream, separately, later.

2. Agent + Bright Data MCP — Complexity Scale Without the Infra Tax

Repository: https://github.com/brightdata/brightdata-mcp

License: MIT

Free Tier: 5,000 requests/month

Best for: High-complexity, moderate-volume targets (competitive pricing, job market analysis, funding data, SERP monitoring) where the hard problem is anti-bot, adaptive navigation, or frequent site changes; not throughput.

What is the agent + Bright Data MCP stack?

An agent loop — running in the IDE, as a headless script, or on a schedule — wired to Bright Data’s MCP (Model Context Protocol) server as its acquisition layer. The agent calls MCP tools, gets structured data back, and writes results to whatever sink fits your pipeline — files, NDJSON, a database, object storage — without fixed selectors, without a scraping framework, and without proxy infra you operate yourself.

Why use Bright Data MCP for agentic data extraction?

Most guides treat “scale” as a throughput problem. This stack solves a different one: complexity scale — targets where the hard problem is defeating defenses, not managing volume.

A traditional scraper against a heavily defended site is a maintenance contract. Every DOM restructure breaks a selector, every bot detection upgrade breaks your fingerprint, and every geo-block breaks your IP pool. You spend more time maintaining the scraper than using the data. An agent in the loop sidesteps this — it reads what’s on the page and derives the extraction schema from the current DOM, the same way a person would. When the site changes, the agent adapts — autonomously configuring + calling into Bright Data primitives like its proxy network for bot bypass, Scraping Browser when a real browser session is required, and the SERP API when you need to hit Google/Bing etc.

This is also the right stack when you need adaptive acquisition — targets where the data you want depends on what you find at each step. Navigating a site to a specific product variant, following a pagination trail that changes shape, clicking through a login flow — these aren’t hard for a browser-capable agent and are genuinely painful to script deterministically. The proven production pattern here is LLM as orchestrator, pre-built tools as the acquisition layer — which is exactly what MCP provides.

An MCP client like Claude Desktop, Cursor, etc. is just the easiest entry point:

Basic setup:

{  
  "mcpServers": {  
    "Bright Data": {  
      "command": "npx",  
      "args": [  
        "mcp-remote",  
        "https://mcp.brightdata.com/mcp?token=YOUR_API_TOKEN"  
      ]  
    }  
  }  
}

For local + advanced config:

{  
  "mcpServers": {  
    "Bright Data": {  
      "command": "npx",  
      "args": ["@brightdata/mcp"],  
      "env": {  
        "API_TOKEN": "YOUR_API_TOKEN",  
        "PRO_MODE": "true",  
        "WEB_UNLOCKER_ZONE": "custom",  
        "BROWSER_ZONE": "custom_browser"  
      }  
    }  
  }  
}

But the same MCP wiring can also run headlessly from a script, triggered by cron, or invoked from any orchestrator. The agent loop is not coupled to a GUI.

How to run Bright Data MCP headlessly (without Cursor or Claude Desktop)

You don’t actually need Cursor, Claude Desktop, or any hosted client. The MCP TypeScript SDK gives you a reference Client: install @modelcontextprotocol/client (or use the umbrella @modelcontextprotocol/sdk with the .../client/*.js paths). That client can spawn @brightdata/mcp as a subprocess, negotiate the MCP handshake, and expose listTools() / callTool() the same way an IDE-hosted MCP client does.

It looks something like this:

import { Client } from "@modelcontextprotocol/client";  
import { StdioClientTransport } from "@modelcontextprotocol/client/stdio";  

const client = new Client({ name: "ingest-client", version: "1.0.0" });  
const transport = new StdioClientTransport({  
  command: "npx",  
  args: ["@brightdata/mcp"],  
  env: { ...process.env, API_TOKEN: process.env.API_TOKEN!, PRO_MODE: "true" },  
});  

await client.connect(transport);  

// Call any Bright Data MCP tool (SDK: distinguish result.isError from thrown ProtocolError/SdkError)  
const result = await client.callTool({  
  name: "scrape_as_markdown",  
  arguments: { url: "https://example.com" },  
});

For the hosted endpoint instead of stdio, swap StdioClientTransport for StreamableHTTPClientTransport and point it at https://mcp.brightdata.com/mcp?token=…. The MCP TypeScript SDK client guide covers transports and error handling.

What I got wrong

To explain what I did wrong, first, here are the tools included in the Bright Data MCP:

**scrape_as_markdown** / **scrape_as_html** — General-purpose scraping with bot bypass
**search_engine** — SERP (search engine results page) data without writing a scraper
**navigate** / **click** / **type** — Full browser automation for flows that require interaction
60+ specialized extractors for Amazon, LinkedIn, Crunchbase, Yahoo Finance, and more

The first lesson was a tier/billing one. The default Rapid mode gives you search and Web Unlocker-backed scraping — not the 60+ Pro tools, browser automation, or the **web_data_*** APIs. Those require PRO_MODE=true, which is pay-as-you-go on top of the free tier. I'd skimmed the marketing copy and missed that footnote entirely.

The second lesson was an API semantics one. I had lowered POLLING_TIMEOUT thinking it was a standard request timeout. It isn't — those web_data_* tools submit a background data-collection job and then poll for the result, and POLLING_TIMEOUT controls how long that polling is allowed to run. Slow extractions just need more time. BASE_TIMEOUT and BASE_MAX_RETRIES are what you actually want for the base tools (search_engine, scrape_as_markdown) — they don't affect the polling path at all.
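To make the difference concrete, here is a generic submit-then-poll loop sketched in Python. This is not Bright Data's implementation, and poll_for_result, fake_job_status, and the parameter names are hypothetical; it only shows why a polling budget is a different thing from a per-request timeout.

```python
import time
from typing import Callable, Optional


def poll_for_result(
    check: Callable[[], Optional[dict]],
    polling_timeout_s: float,
    interval_s: float = 1.0,
) -> dict:
    """Call `check` until it returns a result or the polling budget runs out."""
    deadline = time.monotonic() + polling_timeout_s
    while True:
        result = check()
        if result is not None:
            return result
        if time.monotonic() >= deadline:
            # Each status check may have been fast; the job itself was slow.
            raise TimeoutError("job did not finish within the polling budget")
        time.sleep(interval_s)


# Hypothetical background job that becomes ready on the third status check.
calls = {"n": 0}


def fake_job_status() -> Optional[dict]:
    calls["n"] += 1
    return {"rows": 42} if calls["n"] >= 3 else None


result = poll_for_result(fake_job_status, polling_timeout_s=5.0, interval_s=0.01)
```

Shrinking polling_timeout_s here does not make any single request faster. It just gives slow jobs less time to finish, which is the mistake I made.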

3. Serverless Cron + Object Storage — Disposable Compute, Durable Data

Repository: https://github.com/cloudflare/workers-sdk

Documentation: Cloudflare Workers · Cloudflare R2 · AWS Lambda

License: MIT

Free Tier: Cloud provider free tiers (limits apply — e.g. Workers limits)

Best for: High-volume URL lists, fan-out workloads, raw payload archiving.


What is the serverless cron + object storage pattern?

Basically, Cloudflare Workers / AWS Lambda + Cloudflare R2/AWS S3 + optional manifest (DynamoDB, Postgres, or a key in R2 itself). A short-lived worker that fires on a schedule, fetches one or more payloads, and lands them in object storage as raw files. The compute is fully disposable. The storage is durable. A small manifest (optional but useful) tracks what’s been fetched and when.

Why use serverless workers + R2/S3 for fan-out ingestion?

This is the pattern I reach for when I need fan-out. If Stack 1 is “one script, one machine, one process,” this is “N concurrent workers, each responsible for a slice of the work, all landing to the same durable sink.” You can ingest a thousand URLs in parallel without managing a server, and the raw payloads survive whatever happens to the compute.

The interesting design decision is the separation: when you run (cron) is completely decoupled from what does the fetching (the worker). That worker is a pure function — give it a URL, it gives you a payload in storage. You can swap the acquisition layer (direct fetch today, proxy tomorrow) without touching the scheduling or the storage format.

// Cloudflare Worker (wrangler.toml has crons configured)
export default {
  async scheduled(event: ScheduledEvent, env: Env, ctx: ExecutionContext) {
    const urls = await getUrlBatch(env); // from KV, D1, or hardcoded slice

    await Promise.allSettled(
      urls.map(async (url) => {
        try {
          const res = await fetch(url, {
            // Cap per-request wait so one bad host doesn't stall the batch (good hygiene, not CF's max wall time)
            signal: AbortSignal.timeout(25_000),
          });

          if (!res.ok) return;

          const key = `raw/${new Date().toISOString().slice(0, 10)}/${encodeURIComponent(url)}.json`;

          await env.BUCKET.put(key, res.body, {
            httpMetadata: { contentType: "application/json" },
            customMetadata: {
              source_url: url,
              fetched_at: new Date().toISOString(),
              status: String(res.status),
            },
          });
        } catch (err) {
          console.error(`Failed: ${url}`, err);
        }
      }),
    );
  },
};

wrangler.toml

[[r2_buckets]]  
binding = "BUCKET"  
bucket_name = "my-ingest-bucket"  

[triggers]  
crons = ["0 * * * *"]

Do I need a manifest for serverless ingest?

You don’t always need one. But if you need to know “did I already ingest this URL today?” or “which keys are new since the last pipeline run?”, a manifest pays for itself fast. Cheapest would be a JSON file in R2 itself, keyed by date. Need more? Try a DynamoDB table or a single Postgres table with (url, date, key) rows.
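As a rough illustration of the cheapest option, here is the date-keyed JSON manifest idea sketched in Python. The manifest shape (source URL mapped to object key) and the helper names are my assumptions for the example. Reading and writing the file to R2 is left out.

```python
import json
from typing import Dict, List


def load_manifest(text: str) -> Dict[str, str]:
    """Parse one day's manifest: source URL -> object key. Empty text means no entries."""
    return json.loads(text) if text.strip() else {}


def pending_urls(urls: List[str], manifest: Dict[str, str]) -> List[str]:
    """URLs not yet ingested today, in their original order."""
    return [u for u in urls if u not in manifest]


def record(manifest: Dict[str, str], url: str, key: str) -> str:
    """Add one entry and return the serialized manifest to write back."""
    manifest[url] = key
    return json.dumps(manifest, sort_keys=True)


manifest = load_manifest('{"https://a.example/x": "raw/2026-05-01/a.json"}')
todo = pending_urls(["https://a.example/x", "https://b.example/y"], manifest)
serialized = record(manifest, "https://b.example/y", "raw/2026-05-01/b.json")
```

One caveat with this design: concurrent workers rewriting the same JSON file can clobber each other, which is usually the moment a real table starts to look attractive.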

What I got wrong

The biggest gotcha is that Cloudflare Workers silently caps outbound fetch calls at 50 sub-requests on the Free plan per invocation — the excess doesn't error, it simply doesn't fire. I learned this from the logs that weren't there. For any batch larger than that cap, you need to dispatch to Cloudflare Queues and process in smaller chunks.
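The chunking half of that fix is simple enough to sketch. This is Python rather than Worker code, purely to show the slicing; the batch size of 50 is the cap quoted above, and the queue dispatch itself is not shown.

```python
from typing import Iterator, List


def chunked(items: List[str], size: int) -> Iterator[List[str]]:
    """Yield consecutive slices of at most `size` items."""
    if size <= 0:
        raise ValueError("size must be positive")
    for start in range(0, len(items), size):
        yield items[start:start + size]


urls = [f"https://example.com/page/{i}" for i in range(120)]
batches = list(chunked(urls, 50))
batch_sizes = [len(batch) for batch in batches]  # [50, 50, 20]
```

Each batch would then become one queue message, so no single invocation has to make more outbound calls than the plan allows.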

The other thing is that Workers has a wall-clock time limit — 30 seconds on the Free plan. When you fan out to 40 URLs and three of them are a slow government portal that takes 28 seconds to respond, those three will get cut off at the execution limit, and the logs will show nothing wrong. The only way I caught it was tracking expected vs. actually-written object counts in the manifest — when those numbers diverged, something had timed out quietly. Per-request AbortSignal.timeout helps, but a manifest count is the only reliable canary.
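The canary itself can be a plain set difference. A minimal Python sketch, assuming you can list what you expected to fetch and what actually landed in storage (both lists here are invented):

```python
from typing import Iterable, List


def missing_writes(expected_urls: Iterable[str], written_urls: Iterable[str]) -> List[str]:
    """Return expected URLs that have no stored object, sorted for stable output."""
    return sorted(set(expected_urls) - set(written_urls))


expected = ["https://a.example", "https://b.example", "https://slow.example"]
written = ["https://a.example", "https://b.example"]
gaps = missing_writes(expected, written)
```

If gaps is non-empty after a run, something timed out or failed quietly, whatever the logs say.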

4. A Durable Workflow Engine + Swappable I/O — The Stable Orchestration Layer

Repository: Trigger.dev · Temporal

Documentation: Trigger.dev docs · Temporal docs

License: Varies by engine (Trigger.dev: Apache 2.0; Temporal: MIT)

Free Tier: Varies by engine — hosted usage tiers, self-hosted deployments, and managed-cloud limits all differ

Best for: Any ingest workload where “what failed and why” needs to be answerable, upstreams are flaky, or you’re running at a scale where silent failures are unacceptable.


What is the workflow engine stack?

A workflow engine (Trigger.dev, Temporal, Inngest, AWS Step Functions, etc.) handles retries, idempotency, scheduling, replay, and observability — while the actual acquisition is a swappable I/O step inside the workflow.

Why use Temporal, Inngest, or AWS Step Functions for resilient ingestion?

The counterintuitive argument here is that this stack isn’t heavier than Stack 3 in any meaningful sense — it just makes the complexity visible instead of hiding it in ad-hoc retry logic and try/catch soup.

Every serious ingestion system eventually needs automatic retries with backoff, deduplication of runs, visibility into what failed and why, and the ability to replay a failed run without re-running the whole pipeline. Trigger.dev, Temporal, Inngest, and Step Functions all live in this category. The APIs differ but the job is the same.

So with durable orchestration stabilized, your I/O step — direct fetch, proxy, browser job — is the interchangeable part. When your upstream starts blocking you, you change one function. The retries, scheduling, idempotency, and replay story stay intact.

Here’s a Trigger.dev example for this stack:

// trigger.dev tasks (check the SDK import path for your version; often `@trigger.dev/sdk`)
// writeToStorage() and getTargetUrls() are your own helpers, not part of the SDK.
import { task, schedules, idempotencyKeys } from "@trigger.dev/sdk";

export const ingestUrl = task({
  id: "ingest-url",
  retry: {
    maxAttempts: 5,
    factor: 2,
    minTimeoutInMs: 1000,
    maxTimeoutInMs: 30_000,
  },
  run: async (payload: { url: string; date: string }) => {
    // Swap this block for proxy / Web Unlocker / browser job when direct fetch isn't enough.
    const res = await fetch(payload.url, {
      signal: AbortSignal.timeout(15_000),
    });
    if (!res.ok) throw new Error(`HTTP ${res.status}`);
    const data = await res.json();
    await writeToStorage(data, payload.url);
    return { success: true, url: payload.url };
  },
});

// Fan out a batch on a schedule; cron-driven tasks are declared with schedules.task
export const ingestBatch = schedules.task({
  id: "ingest-batch",
  cron: "0 */6 * * *",
  run: async () => {  
    const urls = await getTargetUrls();  
    const date = new Date().toISOString().slice(0, 10);  
    const items = await Promise.all(  
      urls.map(async (url) => ({  
        payload: { url, date },  
        options: {  
          idempotencyKey: await idempotencyKeys.create(`ingest:${url}:${date}`, {  
            scope: "global",  
          }),  
        },  
      }))  
    );  
    await ingestUrl.batchTriggerAndWait(items);  
  },  
});

When to swap the I/O step in a workflow

When the I/O step starts failing — consistent 403s, CAPTCHAs, geo-blocks — you replace fetch(url) with a call through a proxy or unlocker API. The retry logic, the scheduling, the idempotency — none of it changes. You changed one line. That's the payoff.

// Before  
const res = await fetch(url);  
// After — swap the I/O layer for a proxy or unlocker when direct fetch isn't enough  
// const res = await proxyClient.fetch(url);

What I got wrong

Don’t neglect idempotency keys! They feel optional until the first time you need to replay something, which is always, eventually.
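If the mechanism feels abstract, here is the idea boiled down to a few lines of Python. This is not how Trigger.dev implements it; the in-memory dict stands in for the engine's durable store, and the function names are hypothetical.

```python
import hashlib
from typing import Callable, Dict, List


def idempotency_key(url: str, date: str) -> str:
    """The same (url, date) pair always yields the same key."""
    return hashlib.sha256(f"ingest:{url}:{date}".encode()).hexdigest()


def run_once(completed: Dict[str, str], key: str, work: Callable[[], str]) -> str:
    """Run `work` only if `key` has no recorded result; otherwise reuse the result."""
    if key not in completed:
        completed[key] = work()
    return completed[key]


completed: Dict[str, str] = {}
calls: List[int] = []


def fetch_and_store() -> str:
    calls.append(1)  # count how many times the real work ran
    return "stored"


key = idempotency_key("https://example.com/a", "2026-05-01")
first = run_once(completed, key, fetch_and_store)
replay = run_once(completed, key, fetch_and_store)  # replay does not redo the work
```

Because the key is derived from the URL and date, replaying a whole batch only re-runs the items that never completed.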

5. Minimal Playwright Headless — The Last Resort

Repository: https://github.com/microsoft/playwright

Documentation: https://playwright.dev/docs/intro

License: Apache 2.0

Free Tier: Unlimited (open source; you pay for compute/hosting)

Best for: SPAs, client-rendered dashboards, sites that gate content behind click interactions, any page that simply doesn’t exist until JavaScript runs.


What is the minimal Playwright headless stack?

One Playwright browser context, strict timeouts, screenshot and/or HTML captured to storage, with an optional residential/datacenter proxy. No parallel contexts unless you’ve actually measured you need them. No fancy orchestration unless you’ve already tried Stack 4 with a browser job.

When to use Playwright for scraping (and why it’s a last resort)

Headless browsers are the right tool for exactly one specific failure mode: the page doesn’t exist until JavaScript executes it. SPAs, client-side rendered dashboards, pages that require a click to reveal pricing — these can’t be handled by any of the previous stacks without adding a browser layer.

But headless is expensive. CPU, memory, time. A Playwright context consumes dramatically more resources than a fetch. You serialize concurrency. Cold starts on serverless are brutal. If you're reaching for Playwright because you might need it, DON'T. Try one of the above stacks first.

The minimal version of this stack will be familiar to most readers: one browser context, one page, strict timeouts, then write to disk. Proxies are optional.

import { chromium } from "playwright";  
import { writeFile } from "fs/promises";  

async function scrapeWithBrowser(url: string, outputDir: string) {  
  const browser = await chromium.launch({  
    headless: true,  
    args: [  
      "--no-sandbox",  
      "--disable-setuid-sandbox",  
      "--disable-dev-shm-usage", // critical for containerized environments  
    ],  
  });  

  const context = await browser.newContext({  
    // Proxy config goes here when you need it:  
    // proxy: { server: "http://proxy.brightdata.com:22225", username: "...", password: "..." },  
    userAgent:  
      "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36",  
    viewport: { width: 1280, height: 800 },  
  });  

  const page = await context.newPage();  

  try {  
    // Hard timeout on navigation — don't wait for every lazy-loaded asset  
    await page.goto(url, {  
      waitUntil: "domcontentloaded", // not 'networkidle' — too slow  
      timeout: 20_000,  
    });  

    // Wait for the element you actually want, not the whole page  
    await page.waitForSelector("[data-testid='price']", { timeout: 8_000 });  
    const html = await page.content();  
    const slug = encodeURIComponent(url).slice(0, 80);  
    const ts = Date.now();  
    await Promise.all([  
      writeFile(`${outputDir}/${slug}-${ts}.html`, html),  
      page.screenshot({  
        path: `${outputDir}/${slug}-${ts}.png`,  
        fullPage: false,  
      }),  
    ]);  

    return { success: true, url };  
  } finally {  
    // Always close — leaked contexts accumulate fast  
    await context.close();  
    await browser.close();  
  }  
}

When to add a proxy to your Playwright stack

Only when you’ve been blocked, definitively. Not before. The minimal version without a proxy will work on the vast majority of public pages. When you start seeing CAPTCHAs, bot challenges, or suspiciously empty responses, that’s when you plug in residential proxies or a scraping browser service (which handles fingerprinting and unblocking at the browser level).

What I got wrong

Of course, the one gotcha that catches everyone on the containerization side: --disable-dev-shm-usage. I deployed without it. Container crashed with exit code 1 and no useful error. Two hours of debugging: Chromium was exhausting Docker's default 64MB /dev/shm. The flag routes Chromium's shared memory to /tmp instead. It is mandatory in any containerized environment. It is never in the first tutorial you read. It is always in the second incident report.

Decision Tree: Choosing Your Ingestion Stack

Start with the simplest viable ingestion layer, then add capabilities only as you actually need more complexity:

Stack 1 → Simple public APIs and small URL sets
Stack 2 → Agentic navigation, anti-bot handling, or validation
Stack 3 → Large-scale fan-out across many URLs
Stack 4 → Durable orchestration, retries, and observability
Stack 5 → JavaScript-rendered pages via headless browsers

Remember: these stacks are composable layers, not mutually exclusive choices.

The most common production pattern I’ve seen is Stack 4 (orchestration) wrapping Stack 5 (browser) or Stack 3 (serverless fan-out) as the I/O step.

Frequently Asked Questions (FAQ)

Q: What is the simplest data ingestion stack for a small project?

A: A plain fetch loop on a cron schedule writing to SQLite or flat files. No framework, no queue, no service dependencies — a script you can read end-to-end in five minutes. This works for the majority of stable public APIs, RSS feeds, and open datasets.

Q: Can I run Bright Data MCP without Claude Desktop or Cursor?

A: Yes. The official @modelcontextprotocol/sdk for TypeScript ships an MCP client (Client + StdioClientTransport) that spawns @brightdata/mcp as a subprocess from any Node script. From there, bridge the tool loop to whatever LLM you prefer: Anthropic's Messages API, Ollama's /api/chat with a local tool-calling model, or any OpenAI-compatible endpoint. No proprietary client install required.

Q: Should I store raw API payloads or only the parsed fields I need?

A: Always store the raw payload first. Schemas evolve, fields you didn’t think you needed become important, and re-fetching upstream data — especially rate-limited public APIs — can take days.

Q: When is it worth paying for a proxy or unblocking service?

A: When you’re seeing 403s, 429s, CAPTCHAs, or geo-blocks on your target. Not before. Vanilla fetch and a minimal Playwright context work on the vast majority of public pages. Add a paid proxy or unblocking layer only when you've measured the specific failure mode you're solving. The same applies to LLM-agent stacks: validate the data exists and has the shape you need before committing to paid infrastructure.

Q: How do I run an MCP server programmatically from a backend service?

A: Use the MCP TypeScript SDK’s Client with the transport that matches your deployment: StdioClientTransport for local subprocesses (e.g. npx @brightdata/mcp), or StreamableHTTPClientTransport / SSEClientTransport for a remote MCP URL like https://mcp.brightdata.com/mcp?token=…. Call client.connect(transport), then listTools() / callTool() exactly like an IDE would. The MCP TypeScript SDK client guide covers all transports.

Q: When should I combine stacks rather than pick one?

A: Almost always! These patterns are primitives, not solo architectures. The most common production shape is Stack 4 (durable orchestration and retries) wrapping Stack 5 (Playwright as the browser I/O step) or Stack 3 (serverless fan-out for parallel fetching). Stack 1 or 2 for ad-hoc validation before you commit to building any of it. Treat the decision tree as "which I/O layer do I need?" not "which complete system should I adopt?"

What stack are you running for live data ingestion? Have you run into walls I didn’t cover here? Drop it in the comments. 👇

Turning Google into an Explorable Knowledge Graph Using Pure k-NN

Prithwish Nath — Fri, 15 May 2026 11:09:58 +0000

TL;DR: I ran K-Nearest Neighbors (KNN) over a Google search corpus to find cross-query connections no single search can ever surface.

Human learning is all about building connections in your head. Like last week, I read an ArXiv paper on quantization, which prompted me to do some Google-fu for an FP16 vs INT8 comparison on NVIDIA’s forums, and then run a site:github.com search for a Llama.cpp fork with optimized kernels to try it myself. This takes time. Google (or an LLM) can’t make these mental hops for you.

So I wanted to see if I could speed this up by programmatically finding and shortlisting these connections for me to review later, using a classic algorithm from 1951. To collect the raw material, I used my SERP API to run 100 varied Google searches on a specific topic — then merged the ~800 results into one corpus, embedded every row, and ran cosine k-NN over the whole thing.

From that new data, I could click any result in my UI and see its nearest semantic neighbors — not just from the same search, but anywhere in the dataset, across all 100 searches — fully explorable.

Highlighted links in the Related section mean they were from different queries.

This worked exceptionally well. A whopping 42.2% of all neighbor links crossed query boundaries, and every one of the 797 documents in my corpus had at least one cross-search connection in its top 8.

I’ll present my approach and findings here, and the full code is available on GitHub to review.

What is the K-Nearest Neighbors Algorithm (KNN)?

Similar things tend to be near each other. The k-nearest neighbors algorithm (k-NN) formalizes this:

Given a point in space, find the k closest points to it using some distance metric (here, cosine similarity over embeddings).

I treat each Google result as a point in a shared semantic space. That changes the question from “what ranks for this query?” to “what lives near this document?” Going from Google’s ranking to proximity is what makes connections show up across queries, domains, and levels of abstraction.

Why k-NN? Because it is local and doesn’t need training. It simply operates over the structure already present in the embeddings, and because it runs over the entire merged corpus, neighbors can come from anywhere in the data.
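To make that concrete, here is a brute-force cosine k-NN in plain Python. It is a toy sketch, not the project's neighbors.py (which queries Chroma): the three-dimensional vectors and document IDs are invented, and real embeddings have hundreds of dimensions.

```python
import math
from typing import Dict, List, Tuple


def cosine_similarity(a: List[float], b: List[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def nearest_neighbors(
    query_id: str, vectors: Dict[str, List[float]], k: int
) -> List[Tuple[str, float]]:
    """Return the k most similar documents to `query_id`, excluding itself."""
    query = vectors[query_id]
    scored = [
        (doc_id, cosine_similarity(query, vec))
        for doc_id, vec in vectors.items()
        if doc_id != query_id
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Made-up embeddings: two quantization-flavored documents and one unrelated one.
vectors = {
    "quantization-paper": [0.9, 0.1, 0.0],
    "int8-forum-thread": [0.8, 0.2, 0.1],
    "wasm-runtime-docs": [0.0, 0.1, 0.9],
}
neighbors = nearest_neighbors("quantization-paper", vectors, k=2)
```

Note there is no training step anywhere: the "model" is just the vectors plus a sort. That is also why brute force stops scaling at some point and a vector store takes over.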

Architecture

I’m too lazy for full orchestration layers or distributed systems—so this project is just a sequence of steps that add structure, progressively.

Each stage does one thing. ingest.py collects results across many queries into a single DuckDB table, preserving (url, query) pairs as distinct rows so context isn’t lost. Then embed.py converts each row into a vector (title + snippet + domain + query) and stores it in Chroma. Next, neighbors.py runs cosine k-NN over that global space and hydrates results back from DuckDB. Finally, serve.py exposes this through a minimal API and HTML/JS/CSS UI, where we can click any result, and see its nearest neighbors from anywhere in the corpus.

💡 I could have stored everything in Chroma with metadata fields and skipped DuckDB entirely. I didn’t, because Chroma does not make for a very good source of truth. Metadata in it is harder to query, to inspect ad hoc, and to rebuild from.

DuckDB, on the other hand, is a single portable file, queryable with standard SQL, trivially exportable, and completely replaceable without touching the vector layer.

Prerequisites

Install: Python 3.10+, uv (or another venv workflow), Ollama with nomic-embed-text:latest pulled, and Docker (or any Chroma HTTP server) — e.g. docker compose up -d in this folder so Chroma listens on localhost:8000.

Python dependencies (requirements.txt):

python-dotenv>=1.0.0  
requests>=2.28.0  
chromadb>=0.5.0  
duckdb>=1.0.0  
fastapi>=0.115.0  
uvicorn[standard]>=0.32.0

Environment Variables: Set at least **BRIGHT_DATA_API_KEY** and **BRIGHT_DATA_ZONE** (required for ingest.py). You get them from your Bright Data dashboard after signing up here; replace with your own if using some other SERP API. Everything else is optional, documented in README.md.

Run order: create/activate a venv, uv pip install -r requirements.txt, start Chroma, then do python ingest.py → python embed.py → python serve.py and open the app URL (default http://127.0.0.1:8766/).

Step 1: Building a Multi-Angle Query Set for Your Research Topic

Our list of Google queries can live in a queries.json. For best results, try to cover as many angles as you can think of. Example: I wanted to research ML on edge devices (as of 2026), so I included strings covering hardware, software, models/compression, web/WASM, research and benchmarks, and more.

Full code here: queries.json

[  
  "how to run a neural network on a microcontroller",  
  "edge AI chips compared Coral TPU vs Jetson Nano",  
  "TensorFlow Lite Micro supported operators",  
  "ONNX Runtime Web WebGPU backend",  
  "site:arxiv.org efficient LLM survey",  
  "MLPerf Tiny benchmark results"  
  // ...see queries.json for the full list  
]

We’ll read this file once at ingest time and never again.

Step 2: Setting Up a Bright Data SERP API Client in Python

Here, we’re gonna wrap the Bright Data SERP API in a thin, defensive client that fails loudly on bad responses instead of silently passing garbage downstream.

Some quick gotchas:

Don’t use the **&num=** parameter. Google deprecated that parameter in September 2025. Now you get ~10 organics regardless. Cap rows by slicing after the response, which is what limit_organic() does.
Use a retry loop, but keep it simple. Rather than pulling in a specialized retry library, search() makes three attempts with a linear backoff of 0.5s × (attempt + 1). Good enough.
Don’t forget to unwrap the response. The "format": "json" parameter brings in a response that is an envelope with its own status_code, headers, and body — and the actual SERP payload lives inside body, so you need a second json.loads.

Full code here: bright_data_serp.py

import json
import os
import time
from typing import Any, Dict, Optional

import requests


# a util function really
def limit_organic(data: Dict[str, Any], max_results: int) -> Dict[str, Any]:
    # Keep at most `max_results` organic rows.
    if max_results <= 0:
        return data
    organic = data.get("organic")
    if isinstance(organic, list) and len(organic) > max_results:
        return {**data, "organic": organic[:max_results]}
    return data


class BrightDataSERPClient:  
    def __init__(
        self,
        api_key: Optional[str] = None,
        zone: Optional[str] = None,
        country: Optional[str] = None,
    ):
        self.api_key = api_key or os.getenv("BRIGHT_DATA_API_KEY")  
        self.zone = zone or os.getenv("BRIGHT_DATA_ZONE")  
        self.country = country or os.getenv("BRIGHT_DATA_COUNTRY")  
        self.api_endpoint = "https://api.brightdata.com/request"  

        if not self.api_key:  
            raise ValueError("BRIGHT_DATA_API_KEY is required.")  
        if not self.zone:  
            raise ValueError("BRIGHT_DATA_ZONE is required.")  

        self.session = requests.Session()  
        self.session.headers.update(  
            {  
                "Content-Type": "application/json",  
                "Authorization": f"Bearer {self.api_key}",  
            }  
        )  

    def search(
        self,
        query: str,
        num_results: int = 10,
        language: Optional[str] = None,
        country: Optional[str] = None,
        max_retries: int = 2,
    ) -> Dict[str, Any]:
        last_err: Optional[Exception] = None  
        for attempt in range(max_retries + 1):  
            try:  
                return self._do_search(query, num_results, language, country)  
            except Exception as e:  
                last_err = e  
                if attempt < max_retries:
                    # simple linear backoff  
                    time.sleep(0.5 * (attempt + 1))  
        assert last_err is not None  
        raise last_err  

    def _do_search(
        self,
        query: str,
        num_results: int,
        language: Optional[str],
        country: Optional[str],
    ) -> Dict[str, Any]:
        search_url = (  
            f"https://www.google.com/search"  
            f"?q={requests.utils.quote(query)}"  
            f"&brd_json=1"  
        )  
        if language:  
            search_url += f"&hl={language}&lr=lang_{language}"  
        target_country = country or self.country  
        payload: Dict[str, Any] = {  
            "zone": self.zone,  
            "url": search_url,  
            "format": "json",  
        }  
        if target_country:  
            payload["country"] = target_country  

        response = self.session.post(self.api_endpoint, json=payload, timeout=60)  
        response.raise_for_status()  
        result = response.json()  
        if not isinstance(result, dict):  
            raise RuntimeError(f"Bright Data unexpected response type: {type(result)}")  
        inner_status = result.get("status_code")  
        if inner_status is not None and inner_status != 200:  
            raise RuntimeError(f"Bright Data SERP status_code={inner_status}")  
        if "body" in result:  
            body = result["body"]  
            if isinstance(body, str):  
                if not body.strip():  
                    raise RuntimeError("Bright Data SERP empty body")  
                result = json.loads(body)  
            else:  
                result = body  
        elif "organic" not in result:  
            raise RuntimeError("Bright Data response missing 'body' and 'organic'")  
        return limit_organic(result, num_results)

None of this is glamorous, really. But a pipeline that silently ingests empty responses is worse than one that crashes loudly. So I intentionally fail fast here so the rest of the pipeline can trust its input.

Step 3: Ingesting Multi-Query Search Results Into DuckDB

With a reliable client in place, ingest.py has just one job: loop over every query in queries.json, fetch Google’s organic results, and write them into a single DuckDB table.

For the primary key, I decided on a SHA-256 hash of url + source_query. This gives us three things for free:

The same URL retrieved by two different queries becomes two distinct rows, with different source_query values. We don't lose that provenance.
Re-ingesting a query produces the same IDs deterministically, so DELETE WHERE source_query = ? followed by re-insert is safe to run as many times as you like.
And finally, you won’t need an autoincrement sequence or UUID generation — the ID is fully derivable from the content itself.

def row_id(url: str, source_query: str) -> str:  
    return hashlib.sha256(f"{url}\t{source_query}".encode()).hexdigest()

I've added a --refresh option too; this wipes the table and re-fetches all queries from scratch (which you'd use when you want a completely clean corpus).
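A quick way to convince yourself of the key properties is to hash a few pairs by hand. This is a standalone sketch with made-up URLs and queries, assuming a tab separator between the two parts of the key:

```python
import hashlib


def row_id(url: str, source_query: str) -> str:
    # The tab separator keeps ("ab", "c") and ("a", "bc") from hashing alike.
    return hashlib.sha256(f"{url}\t{source_query}".encode()).hexdigest()


url = "https://example.com/tinyml"
id_a = row_id(url, "tinyml benchmarks")
id_b = row_id(url, "microcontroller inference")
id_a_again = row_id(url, "tinyml benchmarks")

same_url_two_queries_differ = id_a != id_b  # provenance is preserved
refetch_is_deterministic = id_a == id_a_again  # safe to delete and re-insert
```

The separator matters more than it looks: without one, different (url, query) pairs can concatenate to the same string and collide.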

Full code here: ingest.py

# Fetch Google SERP via Bright Data and write a single DuckDB table.  

# Default: merge — skip queries that already have rows (same ``source_query`` string).  
# With ``--refresh``: delete all rows, then re-fetch every query in the file.  

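import hashlib
import json
import os
import time
from pathlib import Path
from typing import Any, Dict, List
from urllib.parse import urlparse

import duckdb

from bright_data_serp import BrightDataSERPClient

# Assumption: _DIR is the folder containing this script (its definition is not shown in this excerpt).
_DIR = Path(__file__).resolve().parent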

# resolve DuckDB + queries file and ensure data directory exists.  
def db_path() -> str:  
    return os.getenv("DUCKDB_PATH", str(_DIR / "data" / "serp.duckdb"))  
def queries_path() -> str:  
    return os.getenv("QUERIES_JSON", str(_DIR / "queries.json"))  
def ensure_data_dir() -> None:  
    Path(db_path()).parent.mkdir(parents=True, exist_ok=True)  

# Default table name; this can also be an env var if you have a multi-table database  
TABLE = "serp_results"  

# Deterministic primary key with SHA256. Re-fetching a query overwrites the same id rows.  
def row_id(url: str, source_query: str) -> str:  
    return hashlib.sha256(f"{url}\t{source_query}".encode()).hexdigest()

# Flatten SERP `organic` into DB-ready dicts  
def organic_to_rows(data: Dict[str, Any], source_query: str) -> List[Dict[str, Any]]:  
    organic = data.get("organic")  
    if not isinstance(organic, list):  
        return []  
    out: List[Dict[str, Any]] = []  
    for i, row in enumerate(organic):  
        if not isinstance(row, dict):  
            continue  
        # link vs url, description vs snippet — handle both  
        url = row.get("link") or row.get("url") or ""  
        if not url:  
            continue  
        title = (row.get("title") or "")[:8000]  
        snippet = (row.get("description") or row.get("snippet") or "") or ""  
        snippet = snippet[:16000]  
        pos = row.get("rank") or row.get("position") or (i + 1)  
        try:  
            position = int(pos)  
        except (TypeError, ValueError):  
            position = i + 1  
        domain = urlparse(url).netloc or ""  
        rid = row_id(url, source_query)  
        out.append(  
            {  
                "id": rid,  
                "source_query": source_query,  
                "url": url,  
                "title": title,  
                "snippet": snippet,  
                "domain": domain,  
                "position": position,  
            }  
        )  
    return out  


# Read query strings: either a top-level JSON array or passing in a full object ``{"queries": [...]}``.  
def load_queries(path: str) -> List[str]:  
    p = Path(path)  
    raw = json.loads(p.read_text(encoding="utf-8"))  
    if isinstance(raw, list):  
        return [str(x).strip() for x in raw if str(x).strip()]  
    if isinstance(raw, dict) and "queries" in raw:  
        return [str(x).strip() for x in raw["queries"] if str(x).strip()]  
    raise ValueError('queries.json must be a JSON array or {"queries": [...]}')


# One-time table definition. Obviously, safe to run every ingest.  
def ensure_schema(con: duckdb.DuckDBPyConnection) -> None:  
    con.execute(  
        f"""  
        CREATE TABLE IF NOT EXISTS {TABLE} (  
            id VARCHAR PRIMARY KEY,  
            source_query VARCHAR NOT NULL,  
            url VARCHAR NOT NULL,  
            title VARCHAR,  
            snippet VARCHAR,  
            domain VARCHAR,  
            position INTEGER NOT NULL  
        )  
        """  
    )  

# 1) Connect and ensure the schema.   
# 2) If ``--refresh``, delete every row.   
# 3) For each query: in default merge mode, skip if that ``source_query`` already has rows, else call Bright Data.  
def main() -> None:  
    con = duckdb.connect(dpath)  
    ensure_schema(con)  
    if args.refresh:  
        con.execute(f"DELETE FROM {TABLE}")  # full wipe, then every query is fetched again  

    bd = BrightDataSERPClient()  
    query_list = load_queries(qpath)  
    for q in query_list:  
        if not args.refresh and con.execute(  
            f"SELECT COUNT(*) FROM {TABLE} WHERE source_query = ?",  
            [q],  
        ).fetchone()[0]:  
            continue  # merge: already have this ``source_query``  
        raw = bd.search(q, num_results=args.num_results)  
        rows = organic_to_rows(raw, q)  
        con.execute(f"DELETE FROM {TABLE} WHERE source_query = ?", [q])  
        if not rows:  
            time.sleep(args.delay)  # rate limit: full script paces on empty organics too  
            continue  
        con.executemany(  
            f"INSERT INTO {TABLE} (id, source_query, url, title, snippet, domain, position) VALUES (?, ?, ?, ?, ?, ?, ?)",  
            [  
                (r["id"], r["source_query"], r["url"], r["title"], r["snippet"], r["domain"], r["position"])  
                for r in rows  
            ],  
        )  
        time.sleep(args.delay)  # pace the script between calls. The full ingest.py also sleeps after a failed search; dropping this invites throttling.

    # All done! now count rows, close the connection, and maybe stdout a quick summary
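
To see why the delete-then-insert step is safe to repeat, here is a small sketch. It uses the standard library's sqlite3 as a stand-in for DuckDB (an assumption made so the example runs with no extra installs), a trimmed three-column table, and made-up URLs.

```python
import hashlib
import sqlite3

def row_id(url: str, source_query: str) -> str:
    return hashlib.sha256(f"{url}\t{source_query}".encode()).hexdigest()

def reingest(con: sqlite3.Connection, query: str, urls: list[str]) -> None:
    # Delete-then-insert for one query. Ids are derived from content,
    # so a repeat run recreates exactly the same primary keys.
    con.execute("DELETE FROM serp_results WHERE source_query = ?", (query,))
    con.executemany(
        "INSERT INTO serp_results (id, source_query, url) VALUES (?, ?, ?)",
        [(row_id(u, query), query, u) for u in urls],
    )

con = sqlite3.connect(":memory:")  # stand-in for the DuckDB file
con.execute(
    "CREATE TABLE serp_results ("
    "id TEXT PRIMARY KEY, source_query TEXT NOT NULL, url TEXT NOT NULL)"
)

# Hypothetical query and URLs
urls = ["https://example.com/a", "https://example.com/b"]
reingest(con, "tinyML survey", urls)
first = con.execute("SELECT id FROM serp_results ORDER BY id").fetchall()
reingest(con, "tinyML survey", urls)  # second run: no duplicates, no key errors
second = con.execute("SELECT id FROM serp_results ORDER BY id").fetchall()

print(len(second), first == second)
```

The second run leaves the table with the same two rows and the same ids as the first.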

Step 4: Embedding and Indexing Google Results in ChromaDB

For embedding, I picked four fields, concatenated: title, snippet, domain, source query. I’m including domain here because it carries implicit topical weight: arxiv.org and thinkrobotics.com should mean something different even with identical text.

Why add source query as well? Because the same URL retrieved by two different searches should embed slightly differently — it captures why the result was surfaced, not just what it says.

We don’t need to go over the full embedding logic, so let’s just cover the most important bits.

Full code here: embed.py

# String fed to the embedding model  
# (title → snippet → domain → source query).  
def embedding_text(title: str, snippet: str, domain: str, source_query: str) -> str:
    t = (title or "").strip()
    s = (snippet or "").strip()
    d = (domain or "").strip()
    q = (source_query or "").strip()
    # One field per line, in a fixed order
    return f"{t}\n{s}\n{d}\n{q}".strip()


# POST /api/embed. New API returns "embeddings" (list of one vector per input)  
def ollama_embed_one(
    host: str,
    model: str,
    text: str,
    session: requests.Session,
) -> List[float]:
    url = host.rstrip("/") + "/api/embed"  
    r = session.post(  
        url,  
        json={"model": model, "input": text},  
        timeout=120,  
    )  
    r.raise_for_status()  
    data = r.json()  
    embs = data.get("embeddings")  
    if isinstance(embs, list) and embs and isinstance(embs[0], list):  
        return [float(x) for x in embs[0]]  
    one = data.get("embedding")  
    if isinstance(one, list):  
        return [float(x) for x in one]  
    raise RuntimeError(f"Ollama embed response missing embeddings: {data!r}")

Next, an excerpt of main() in embed.py. Notice that it deletes and recreates the Chroma collection every run. That's deliberate: DuckDB is the source of truth, and rebuilding the vector collection keeps Chroma in sync without writing diff/upsert logic. The tradeoff is that embedding is all-or-nothing; this script is not doing incremental vector maintenance. 😅

# Read all DuckDB rows, drop/recreate the Chroma collection, embed, and add in batches.  
    con = duckdb.connect(dpath, read_only=True)  
    rows = con.execute(  
        f"SELECT id, source_query, title, snippet, domain FROM {TABLE} ORDER BY id"  
    ).fetchall()  
    con.close()  
    if not rows:  
        raise SystemExit(f"No rows in {TABLE}; run ingest first.")  

    client = chroma_client()  # CHROMA_HOST / CHROMA_PORT / CHROMA_SSL  
    name = args.collection  
    try:  
        client.delete_collection(name)  
    except Exception:  
        pass  # no collection yet on first run  
    collection = client.create_collection(  
        name=name,  
        metadata={"hnsw:space": "cosine"},  # cosine in Chroma matches query style in serve  
    )  

    session = requests.Session()  
    ids: List[str] = []  
    embeddings: List[List[float]] = []  
    batch_size = 32  
    for i, (rid, source_query, title, snippet, domain) in enumerate(rows):  
        text = embedding_text(  
            str(title or ""),  
            str(snippet or ""),  
            str(domain or ""),  
            str(source_query or ""),  
        )  
        if not text:  
            text = str(rid)  # last resort so Ollama never sees an empty string  
        emb = ollama_embed_one(args.ollama_host, args.model, text, session)  
        ids.append(str(rid))  
        embeddings.append(emb)  
        if len(ids) >= batch_size or i == len(rows) - 1:  
            collection.add(ids=ids, embeddings=embeddings)  
            print(f"Added {len(ids)} vectors (row {i + 1}/{len(rows)})")  
            ids = []  
            embeddings = []
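
The flush condition in that loop (buffer full, or last row reached) is the part that is easy to get subtly wrong, so here is a tiny sketch that isolates it. The `batch_sizes` helper is hypothetical and only counts rows instead of embedding them.

```python
from typing import List

def batch_sizes(n_rows: int, batch_size: int = 32) -> List[int]:
    # Mirror the flush condition: flush when the buffer is full or on the last row
    sizes: List[int] = []
    buffered = 0
    for i in range(n_rows):
        buffered += 1
        if buffered >= batch_size or i == n_rows - 1:
            sizes.append(buffered)
            buffered = 0
    return sizes

print(batch_sizes(70))  # [32, 32, 6]
print(batch_sizes(64))  # [32, 32]
print(batch_sizes(0))   # []
```

An exact multiple of the batch size does not produce an empty trailing batch, and an empty table produces no `add` calls at all.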

Step 5: Running Cosine k-NN Over a Merged Corpus

At this point, our two data stores have distinct jobs:

Chroma knows vectors and row ids;
DuckDB holds everything else — title, snippet, URL, domain, position, source_query.

Our k-NN implementation, therefore, is simple: look up the anchor in DuckDB → fetch its vector from Chroma → query for nearby ids → hydrate back from DuckDB. The Chroma layer can stay thin, all display fields come from one place.
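
If cosine k-NN itself is new to you, this is roughly what the vector store is answering, written as a brute-force sketch in plain Python. Chroma uses an approximate HNSW index instead of a full scan, and the ids and 2-D vectors below are toy values.

```python
import math
from typing import Dict, List, Tuple

def cosine_distance(a: List[float], b: List[float]) -> float:
    # 1 - cosine similarity; 0.0 means the vectors point the same way
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / norm

def knn(anchor_id: str, vectors: Dict[str, List[float]], k: int) -> List[Tuple[str, float]]:
    # Rank every other id by distance to the anchor and keep the k closest
    anchor = vectors[anchor_id]
    scored = [
        (rid, cosine_distance(anchor, v))
        for rid, v in vectors.items()
        if rid != anchor_id
    ]
    return sorted(scored, key=lambda pair: pair[1])[:k]

# Toy 2-D vectors with hypothetical ids
vectors = {"a": [1.0, 0.0], "b": [0.9, 0.1], "c": [0.0, 1.0], "d": [-1.0, 0.0]}
result = knn("a", vectors, k=2)
print([rid for rid, _ in result])  # ['b', 'c']
```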

Full code here: neighbors.py

Two implementation details you should know about:

The numpy guard: Chroma may return embeddings as a nested list or an ndarray. The usual if not embs breaks on arrays, so first_embedding_for_query normalizes the first vector without relying on truthiness.

def first_embedding_for_query(embs: Any) -> Optional[List[float]]:  
    # bc Chroma may return `embeddings` as a nested list or `ndarray`   
    # This avoids `if not embs` on arrays.  
    if embs is None:  
        return None  
    if isinstance(embs, np.ndarray):  
        if embs.size == 0:  
            return None  
        v = embs[0] if embs.ndim > 1 else embs  
        return v.tolist()  
    if isinstance(embs, (list, tuple)):  
        if len(embs) == 0 or embs[0] is None:  
            return None  
        v = embs[0]  
        return v.tolist() if hasattr(v, "tolist") else list(v)  
    return None

Hydration: Chroma returns ids and distances, not the rows themselves. So rows_by_ids fetches the DuckDB records for those ids, keyed by id so Chroma's ranked order is preserved when distances are stitched back on.

def rows_by_ids(con: duckdb.DuckDBPyConnection, ids: list[str]) -> dict[str, dict]:  
    if not ids:  
        return {}  
    placeholders = ",".join(["?"] * len(ids))  
    out: dict[str, dict] = {}  
    for row in con.execute(  
        f"""  
        SELECT id, source_query, url, title, snippet, domain, position  
        FROM {TABLE} WHERE id IN ({placeholders})  
        """,  
        ids,  
    ).fetchall():  
        rid = str(row[0])  
        out[rid] = {  
            "id": rid,  
            "source_query": row[1],  
            "url": row[2],  
            "title": row[3],  
            "snippet": row[4],  
            "domain": row[5],  
            "position": row[6],  
        }  
    return out
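
The reason for returning a dict is easier to see in isolation. In this sketch (the `stitch` helper and the data are hypothetical), the SQL result has no useful order, so the ranked id list drives the loop and the dict only supplies fields.

```python
from typing import Dict, List

def stitch(ranked_ids: List[str], distances: List[float], by_id: Dict[str, dict]) -> List[dict]:
    # Walk ids in the vector store's ranked order; look each one up by key
    out = []
    for rid, dist in zip(ranked_ids, distances):
        row = by_id.get(rid)
        if row is not None:  # skip ids missing from the metadata store
            out.append({**row, "distance": dist})
    return out

# Hypothetical data: the lookup dict is in arbitrary order
by_id = {"z": {"id": "z", "title": "Z"}, "x": {"id": "x", "title": "X"}}
neighbors = stitch(["x", "missing", "z"], [0.11, 0.15, 0.20], by_id)
print([(n["id"], n["distance"]) for n in neighbors])  # [('x', 0.11), ('z', 0.2)]
```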

Note that our default path is always pure k-NN: ask Chroma for k+1 nearest row ids, drop the anchor itself, hydrate from DuckDB.

But I’ve also added a cross_query_only flag — a UI filter toggled via a "Cross-query neighbors only" checkbox. Not pure k-NN, but useful to end users.

Results with cross_query_only switched on via UI

When this is on, compute_neighbors drops any candidate whose source_query matches the anchor's. The nearest neighbors in vector space are often siblings from the same original search, so this lets you ask "show me related results from other queries" without touching the index.

Regardless, compute_neighbors wires the two stores together.

def compute_neighbors(anchor: str, k: int = DEFAULT_K, *, cross_query_only: bool = False) -> dict:  
    k = max(1, min(int(k), 50))  
    dpath = db_path()  

    # 1) DuckDB validates the anchor and gives us the row metadata, including  
    # ``source_query`` for cross-query filtering.  
    con = duckdb.connect(dpath, read_only=True)  
    try:  
        anchor_row = row_by_id(con, anchor)  
    finally:  
        con.close()  
    if not anchor_row:  
        return _neighbor_error("unknown id", k, cross_query_only=cross_query_only)  

    # 2) Chroma stores the vector under the same row id.  
    coll = chroma_client().get_collection(collection_name())  
    got = coll.get(ids=[anchor], include=["embeddings"])  
    vector = first_embedding_for_query(got.get("embeddings"))  
    if vector is None:  
        return _neighbor_error("no embedding for id (re-run embed.py)", k, cross_query_only=cross_query_only)  

    max_n = max(1, min(int(coll.count()), 5000))  
    neighbors: list[dict] = []  

    if not cross_query_only:  
        # Normal mode: ask for k + 1 because the nearest result is usually the anchor itself.  
        qres = coll.query(query_embeddings=[vector], n_results=min(k + 1, max_n), include=["distances"])  
        out_ids, out_dist = _ids_distances_from_query(qres, anchor)  
        out_ids, out_dist = out_ids[:k], out_dist[:k]  

        con = duckdb.connect(dpath, read_only=True)  
        try:  
            by_id = rows_by_ids(con, out_ids)  
        finally:  
            con.close()  
        for nid, dist in zip(out_ids, out_dist):  
            if nid in by_id:  
                neighbors.append({**by_id[nid], "distance": dist})  

    else:  
        # Cross-query mode: nearest neighbors often share the same Google query,  
        # so over-fetch, filter by ``source_query``, and widen if we still need more.  
        anchor_seed = str(anchor_row.get("source_query") or "").strip()  
        n_results = min(max(k * 4 + 1, k + 12, 24), max_n)  
        while len(neighbors) < k and n_results <= max_n:
            qres = coll.query(query_embeddings=[vector], n_results=n_results, include=["distances"])  
            out_ids, out_dist = _ids_distances_from_query(qres, anchor)  

            con = duckdb.connect(dpath, read_only=True)  
            try:  
                by_id = rows_by_ids(con, out_ids)  
            finally:  
                con.close()  

            neighbors = []  
            for nid, dist in zip(out_ids, out_dist):  
                row = by_id.get(nid)  
                if not row:  
                    continue  
                if str(row.get("source_query") or "").strip() == anchor_seed:  
                    continue  
                neighbors.append({**row, "distance": dist})  
                if len(neighbors) >= k:
                    break  

            if len(neighbors) >= k or n_results >= max_n:  
                break  
            n_results = min(max(n_results * 2, k + 1), max_n)  

    return _neighbor_ok(anchor_row, neighbors, k, cross_query_only)
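
The cross-query branch is the least obvious part, so here is the over-fetch-and-widen idea stripped down to a sketch. The `fake_query` function stands in for the Chroma query, the hit tuples are invented, and anchor removal is left out for brevity.

```python
from typing import Callable, List, Tuple

Hit = Tuple[str, str, float]  # (row id, source_query, distance)

def cross_query_neighbors(
    query_fn: Callable[[int], List[Hit]],
    anchor_query: str,
    k: int,
    max_n: int,
) -> List[Hit]:
    # Over-fetch, drop same-query hits, and double the window until k survive
    n_results = min(max(k * 4 + 1, k + 12, 24), max_n)
    while True:
        kept = [h for h in query_fn(n_results) if h[1] != anchor_query][:k]
        if len(kept) >= k or n_results >= max_n:
            return kept
        n_results = min(n_results * 2, max_n)

# Hypothetical index: the 30 nearest hits are same-query siblings, then 10 others
hits = [(f"s{i}", "q-anchor", 0.01 * i) for i in range(30)]
hits += [(f"o{i}", "q-other", 0.5 + 0.01 * i) for i in range(10)]
calls: List[int] = []

def fake_query(n: int) -> List[Hit]:
    calls.append(n)
    return hits[:n]

found = cross_query_neighbors(fake_query, "q-anchor", k=3, max_n=len(hits))
print(calls, [h[0] for h in found])  # [24, 40] ['o0', 'o1', 'o2']
```

The first window of 24 contains only siblings, so the loop widens once and then finds the three nearest results from another query.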

Step 6: Serving ChromaDB Vectors with FastAPI

The backend is prime gotcha territory. Two of them, specifically:

Route order is load-bearing. FastAPI evaluates routes in declaration order. StaticFiles with html=True is a catch-all — it will attempt to serve any path that isn't already handled as a file. If you mount it before registering the API routes, every request to /api/rows tries to find a file named api/rows in the static directory and returns 404. Make sure you register the API routes first.
Disable docs, redoc, and openapi. For a local tool you’re restarting constantly, FastAPI’s schema introspection at startup is just noise.

PORT = int(os.getenv("SERVE_PORT", "8766"))  # override with env if needed  

app = FastAPI(  
    title="k-NN SERP",  
    docs_url=None,  
    redoc_url=None,  
    openapi_url=None,  
)

The actual API surface is minimal. We’ll only need two endpoints:

First, /api/rows dumps the full DuckDB corpus ordered by query then position — this is what populates the main table on load.
Next, /api/neighbors takes an id and k, calls compute_neighbors, and routes errors to the appropriate HTTP status codes via _neighbors_http_response.

💡 The full project has a third endpoint, /api/metrics, serving a precomputed knn_metrics.json from internal/knn_metrics.py. See the full code for that one.

We have to make each failure mode distinct because the frontend decides what to show based on it. So, 404 for an unknown id, 400 for a missing embedding (re-run embed.py), 503 for Chroma unreachable (Docker daemon not running, etc.).

Full code here: serve.py

def _neighbors_http_response(result: dict) -> JSONResponse | dict:  
    err = result.get("error")  
    if not err:  
        return {  
            "anchor": result["anchor"],  
            "neighbors": result["neighbors"],  
            "k": result["k"],  
            "cross_query_only": result.get("cross_query_only", False),  
        }  
    if err == "unknown id":  
        return JSONResponse({...}, status_code=404)  
    if "DuckDB not found" in err:  
        return JSONResponse({...}, status_code=500)  
    if err.startswith("Chroma") or "Chroma" in err:  
        return JSONResponse({...}, status_code=503)  # dependency down  
    if "no embedding" in err:  
        return JSONResponse({...}, status_code=400)  # re-run embed.py  
    return JSONResponse({...}, status_code=500)

@app.get("/api/rows", response_model=None)  
def api_rows() -> JSONResponse | dict[str, Any]:  
    con = duckdb.connect(db_path(), read_only=True)  
    try:  
        rows = con.execute(  
            f"SELECT id, source_query, url, title, snippet, domain, position"  
            f" FROM {TABLE} ORDER BY source_query, position, id"  
        ).fetchall()  
    finally:  
        con.close()  
    payload = [  
        {  
            "id": r[0],  
            "source_query": r[1],  
            "url": r[2],  
            "title": r[3],  
            "snippet": r[4],  
            "domain": r[5],  
            "position": r[6],  
        }  
        for r in rows  
    ]  
    return {"rows": payload}  


@app.get("/api/neighbors", response_model=None)  
def api_neighbors(
    id: str | None = Query(default=None),
    k: int = DEFAULT_K,
    cross_query: str = "0",  # query params are strings; convert below
) -> JSONResponse | dict[str, Any]:
    anchor = (id or "").strip()  
    if not anchor:  
        return JSONResponse({"error": "missing id", ...}, status_code=400)  
    cross_query_only = cross_query.strip().lower() in ("1", "true", "yes", "on")  
    return _neighbors_http_response(compute_neighbors(anchor, k, cross_query_only=cross_query_only))  


def main() -> None:  
    print(f"k-NN SERP UI: http://127.0.0.1:{PORT}/")  
    uvicorn.run(app, host="127.0.0.1", port=PORT)  


# Must come last — catches all unhandled paths as static files  
app.mount("/", StaticFiles(directory=str(STATIC), html=True), name="ui")

Step 7: Serving a Neighbor Explorer UI with JavaScript

I won’t go into UI design — this isn’t a frontend tutorial. Just use whatever rendering approach makes sense for you.

Filtered corpus view by "llama.cpp"

So let’s just talk about the core JS we need: a loadNeighbors function that hits /api/neighbors, builds the anchor card, and maps each neighbor into a table row with rank, cosine distance, title, domain, seed query, and snippet.

Full code here: /static/index.html

async function loadNeighbors(id, tr) {
  setRowActive(tr);
  focusId = id;
  const seq = ++loadSeq;  // capture sequence before any await
  nPanel.hidden = false;
  anchorBox.innerHTML = '<p class="meta loading-cell">Loading…</p>';
  neighborsBox.innerHTML = "";  

  try {  
    const res = await fetch("/api/neighbors?id=" + encodeURIComponent(id) + "&k=8");  
    const data = await res.json();  
    if (seq !== loadSeq) return;  // a newer click landed first, discard this result  

    if (!res.ok) {  
      anchorBox.innerHTML = '<p class="err">' + escapeHtml(data.error || res.statusText) + "</p>";
      return;  
    }  

    const a = data.anchor;  
    anchorBox.innerHTML = /* anchor card HTML */;  

    const n = data.neighbors || [];  
    neighborsBox.innerHTML = n.length === 0  
      ? '<p class="meta">No neighbors (check embed.py / Chroma).</p>'
      : buildNeighborTable(n);  

  } catch (e) {  
    if (seq !== loadSeq) return;  
    anchorBox.innerHTML = '<p class="err">' + escapeHtml(String(e)) + "</p>";
  }  
}  

(async function init() {  
  try {  
    const res = await fetch("/api/rows");  
    const data = await res.json();  
    allRows = data.rows || [];  
    renderTable(allRows);  
  } catch (e) {  
    rowMeta.textContent = "Failed to load /api/rows: " + e;  
  }  
})();
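
The `seq !== loadSeq` check is a "latest request wins" guard against out-of-order responses. Here is the same idea as a Python asyncio sketch, purely to show the mechanics; the names and delays are invented, and a list stands in for the DOM panel.

```python
import asyncio
from typing import List

load_seq = 0            # bumped on every new request
shown: List[str] = []   # stands in for the neighbor panel

async def load_neighbors(row_id: str, delay: float) -> None:
    global load_seq
    load_seq += 1
    seq = load_seq               # capture before the first await
    await asyncio.sleep(delay)   # simulated network latency
    if seq != load_seq:
        return                   # a newer request started meanwhile: discard
    shown.append(row_id)

async def main() -> None:
    # The first click is slow, the second is fast; only the second may render
    await asyncio.gather(
        load_neighbors("first", 0.05),
        load_neighbors("second", 0.01),
    )

asyncio.run(main())
print(shown)  # ['second']
```

Without the guard, the slow first response would land last and overwrite the panel for the row the user actually clicked.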

And that’s everything for code! Let’s look at some interesting results I found.

Results: What Cosine k-NN Reveals Across 100 Google Searches

I ran five metrics over every document in the corpus to verify the pipeline was actually bridging queries (rather than just clustering within them). The answer was a resounding yes: 42.2% of all neighbor links crossed query boundaries, and every one of the 797 documents had at least one cross-query neighbor in its top 8, including the niche ones!

You can see the full raw data here: metrics.json
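
For reference, a cross-query rate like the one above can be computed in a few lines. This is a sketch of the idea on a toy corpus, not the project's metrics script; the function name and data are hypothetical.

```python
from typing import Dict, List

def cross_query_rate(source_of: Dict[str, str], neighbors: Dict[str, List[str]]) -> float:
    # Fraction of anchor -> neighbor links whose two ends came from different queries
    total = 0
    crossing = 0
    for anchor, nbrs in neighbors.items():
        for n in nbrs:
            total += 1
            if source_of[anchor] != source_of[n]:
                crossing += 1
    return crossing / total if total else 0.0

# Toy corpus: document id -> source query, and document id -> its nearest ids
source_of = {"a": "q1", "b": "q1", "c": "q2", "d": "q2"}
neighbors = {"a": ["b", "c"], "b": ["a", "d"], "c": ["d", "a"], "d": ["c", "b"]}
rate = cross_query_rate(source_of, neighbors)
print(rate)  # 0.5
```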

The per-query breakdown is where it gets really interesting.

Hub Queries vs. Island Queries: How Semantic Density Varies by Topic

Cross-query neighbor rate by source query: 5 best examples of each on the left, the full list on the right. Images created by author via D3.js.

The tinyML getting started Arduino query scores 95.3% — the highest in the corpus, and on the surface, a beginner tutorial query. But its documents live at a vocabulary crossroads: the k-NN pulled neighbors from 17 different source queries, spanning hardware datasheets, ArXiv surveys, RTOS scheduling guides, and mobile deployment docs. Without the merged corpus you'd only have a list of tutorials. With it, we see that this query sits at the center of the whole topic space. Turns out, a specific chip or runtime can be a crossroads of multiple topics.

The “island” end is just as revealing. pruning vs quantization, weight clustering, knowledge distillation — all 12–19%. Clearly, model compression theory forms a tight, self-contained cluster that talks to itself fluently and barely touches the rest of the corpus. If you're researching compression techniques, you're in a separate conversation from the people researching deployment and hardware — even though most would assume those worlds overlap.

How Query-to-Query Edges Reveal Hidden Connections

The query-to-query edge count measures how many neighbor links flow between each pair of source queries across the whole corpus:

| Rank | Query A | A→B | B→A | Total | Query B |  
| ---: | --- | --: | --: | --: | --- |  
| 1 | `site:pytorch.org mobile deployment` | 31 | 26 | 57 | `PyTorch ExecuTorch mobile deployment guide` |  
| 2 | `WebAssembly machine learning inference browser` | 27 | 24 | 51 | `WebGPU machine learning inference browser` |  
| 3 | `site:arxiv.org tinyML survey 2024` | 21 | 16 | 37 | `site:arxiv.org efficient LLM survey` |  
| 4 | `llama.cpp performance ARM CPU benchmark` | 18 | 17 | 35 | `llama.cpp vs MLC LLM phone comparison` |  
| 5 | `ONNX Runtime vs TensorFlow Lite 2025` | 18 | 13 | 31 | `TensorFlow Lite vs ONNX Runtime for edge deployment` |  
| 6 | `INT8 vs INT4 accuracy loss LLM` | 15 | 14 | 29 | `INT4 quantization large language model accuracy` |  
| 7 | `Whisper tiny on-device speech recognition` | 13 | 10 | 23 | `offline speech recognition Android` |  
| 8 | `MediaPipe on-device LLM inference` | 14 | 8 | 22 | `Hugging Face on-device inference blog` |  
| 9 | `memory footprint LLM quantization MB` | 11 | 10 | 21 | `KV cache quantization LLM` |  
| 10 | `WebNN API machine learning browser native` | 13 | 6 | 19 | `WebAssembly machine learning inference browser` |

The site-scoped pairs at ranks 1 and 3 are worth a second look: site:pytorch.org pairs tightly with the broader ExecuTorch guide; site:arxiv.org pairs with the wider LLM survey. Our pipeline is detecting that a scoped search is a zoom-in on a broader topic — without being told.
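
A table like this can be derived from the same neighbor lists. The sketch below counts directed query-to-query links and folds each pair into one row; the helper names and the toy data are assumptions, not the project's code.

```python
from collections import Counter
from typing import Dict, List, Tuple

def query_edges(source_of: Dict[str, str], neighbors: Dict[str, List[str]]) -> Counter:
    # Count directed links (anchor's query -> neighbor's query), skipping same-query links
    edges: Counter = Counter()
    for anchor, nbrs in neighbors.items():
        for n in nbrs:
            qa, qb = source_of[anchor], source_of[n]
            if qa != qb:
                edges[(qa, qb)] += 1
    return edges

def pair_totals(edges: Counter) -> List[Tuple[str, int, int, int, str]]:
    # Fold A->B and B->A into one row: (A, A->B, B->A, total, B), largest total first
    rows = []
    for a, b in {tuple(sorted(pair)) for pair in edges}:
        ab, ba = edges[(a, b)], edges[(b, a)]
        rows.append((a, ab, ba, ab + ba, b))
    return sorted(rows, key=lambda r: r[3], reverse=True)

# Toy corpus with three source queries
source_of = {"a": "q1", "b": "q1", "c": "q2", "d": "q3"}
neighbors = {"a": ["c", "d"], "b": ["c"], "c": ["a"], "d": ["c"]}
rows = pair_totals(query_edges(source_of, neighbors))
print(rows[0])  # ('q1', 2, 1, 3, 'q2')
```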

Query Boundaries Barely Exist in Embedding Space

For each document, I measured the cosine distance difference between its nearest same-query neighbor and its nearest cross-query neighbor. A cross-query neighbor at distance 0.246 was nearly as semantically close as a same-query neighbor at 0.202.

Also, on average, such cross-query neighbors sit only ~0.06 cosine distance farther away than same-query ones. That’s definitely not a loose thematic connection; in fact, it’s nearly as tight as the results Google already ranked together. Our pipeline is actually finding close matches that were never in the same search to begin with.
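
As a rough sketch of how that per-document gap can be measured: take the nearest same-query distance, the nearest cross-query distance, and subtract. The `nearest_gap` helper is hypothetical, and the ranked list reuses the two distances quoted above purely as illustrative input.

```python
from typing import List, Optional, Tuple

def nearest_gap(anchor_query: str, ranked: List[Tuple[str, float]]) -> Optional[float]:
    # ranked holds (neighbor's source query, cosine distance), nearest first.
    # Returns nearest cross-query distance minus nearest same-query distance.
    same = next((d for q, d in ranked if q == anchor_query), None)
    cross = next((d for q, d in ranked if q != anchor_query), None)
    if same is None or cross is None:
        return None
    return cross - same

ranked = [("q1", 0.202), ("q2", 0.246), ("q1", 0.30)]
gap = nearest_gap("q1", ranked)
print(round(gap, 3))  # 0.044
```

Averaging this value over every document gives a single corpus-level number for how much farther the nearest cross-query neighbor sits.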

Conclusion: A Proximity-Based Knowledge Graph

Going proximity-based instead of Google’s traditional relevance-ranked, and cross-query instead of query-bound, gives us something Google does not. The 42.2% cross-query rate and the 3.52 average unique queries per neighborhood are evidence that the semantic space over a merged corpus has structure that rewards exploration.

This is a fine research tool, but also a setup for something even MORE useful.

You could always swap Nomic for a larger embedding model, add reranking, build a graph visualization over the query-to-query edges, or pipe this into a RAG system as a retrieval layer. If you take a shot at this, let me know in the comments, or just reach out on LinkedIn.

Some links in this article are tracking links used for analytics purposes only. I do not receive any commission or compensation from them.

How Failing at Fantasy Baseball Made Me Fix My Cron Jobs with Temporal

Prithwish Nath — Tue, 05 May 2026 16:03:31 +0000

So I made a bad trade in my fantasy baseball league. Dropped Kaz Okamoto because, according to my data, he’d been cold for two weeks. In reality, he’d been on a tear for the last 9 days. 😅 This was a bad decision made because of bad data: my stats cron job had hit a rate limit, exited with no errors, and my FastAPI backend kept serving a stale JSON snapshot.

Well, I’d been meaning to fix that setup anyway. This time I did — and instead of patching the script, I tried out Temporal and…it worked embarrassingly well. Retries, backoff, execution history — things I’d normally bolt on manually were just… there. And if the network layer itself was flaky — rate limits, geo blocks — I could just add a proxy as a hardening layer.

This actually prompted me to go look at some of our production ingest jobs at work, and I thought: these are the same pattern, just with more surface area! I ended up swapping out one of them, tentatively, then another. Same pattern, just more scale.

This is my (admittedly very casual) write-up of what I learned. I hope it’s useful!

💡 I use Temporal’s Python SDK here, but they have one for TypeScript too — if that’s your thing.

How Cron Jobs Can Burn You

Here’s the brittle script I was using:

# One-shot MLB.com player fetch + write  

def main() -> None:
    player_url = os.getenv(
        "PLAYER_URL",
        "https://www.mlb.com/player/kazuma-okamoto-672960",
    )
    out_dir = Path(os.getenv("OUTPUT_DIR", "./data/runs"))
    out = out_dir / "latest.json"

    try:
        r = requests.get(player_url, timeout=60)
        if r.status_code != 200:
            print(f"WARN: HTTP {r.status_code}, leaving {out} unchanged")
            return
        stats = extract_stats_datatable(r.text)
        out_dir.mkdir(parents=True, exist_ok=True)
        out.write_text(json.dumps(stats), encoding="utf-8")
        print(f"Wrote stats to {out}")
    except Exception as e:
        print(f"WARN: ingest failed ({e!r}), leaving {out} unchanged")

if __name__ == "__main__":
    main()

That script lived behind a super simple crontab line — once a night, fixed schedule, stdout/stderr logs:

# crontab -l (excerpt)
0 2 * * * cd /home/me/fantasy-stats && .venv/bin/python scripts/fetch_player_stats.py >> /var/log/mlb_fetch.log 2>&1

The script is about 20 lines of Python plus one line of schedule. How many potential failures can you spot here?

The first is the thing I thought was prudent: if the fetch looks wrong, don’t overwrite the snapshot. So any 429, timeout, or 200 with a layout that no longer contains the marker extract_stats_datatable expects becomes a printed WARN, a no-op, and main() returns — exit code 0. No raise_for_status(); no sys.exit(1). Cron is “happy”; the one-line warning vanishes in a log I wasn’t tailing; latest.json never updates. Nine “successful” runs later…I made a bad decision because I had bad data. (Flip it to raise_for_status() and you get the opposite smell: a non-zero exit, still no retry, still stale data until someone fixes the feed — pick your poison 😬)

The other two are subtler — and I didn’t personally run into them — but upon review, they were just as likely to have burnt me.

Fixed output path. Every run writes to latest.json. If two runs overlap — which happens the moment a run is slow and the next cron tick fires — it’s a race condition. One overwrites the other mid-write. You might read a corrupted file, or never know which run's data you actually have.
Non-atomic write. out.write_text() is not atomic. If the process dies mid-write (OOM, signal, anything) you get a partial JSON file. The next reader gets a parse error and now has to figure out if the file is corrupted or just empty. This is exactly the kind of bug that shows up at 2am on a production system.
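
For the non-atomic write in particular, the usual standard-library fix is to write a temp file and rename it into place. This is a minimal sketch of that pattern, not what the original script did; `write_json_atomic` and the throwaway output directory are made up for the example.

```python
import json
import os
import tempfile
from pathlib import Path
from typing import Any

def write_json_atomic(path: Path, payload: Any) -> None:
    # Write to a temp file in the same directory, then swap it into place.
    # os.replace is atomic when source and target share a filesystem, so
    # readers see either the old file or the new one, never a partial write.
    path.parent.mkdir(parents=True, exist_ok=True)
    fd, tmp_name = tempfile.mkstemp(dir=path.parent, suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as f:
            json.dump(payload, f)
            f.flush()
            os.fsync(f.fileno())  # push bytes to disk before the rename
        os.replace(tmp_name, path)
    except BaseException:
        # Clean up the temp file on any failure, then re-raise
        if os.path.exists(tmp_name):
            os.unlink(tmp_name)
        raise

# Hypothetical output location under a throwaway directory
out = Path(tempfile.mkdtemp()) / "runs" / "latest.json"
write_json_atomic(out, {"avg": ".301"})
write_json_atomic(out, {"avg": ".305"})
print(json.loads(out.read_text(encoding="utf-8")))
```

This covers a crash mid-write. It does not stop two overlapping runs from both succeeding, so you still can't tell which run's data you have.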

The real problem isn’t any ONE of these — it’s really that cron gives you exactly one bit of feedback: exit zero or exit non-zero — and as this script shows, exit zero can lie. It can’t give you a retry policy, overlap protection, or artifact history. No way to answer “what did this job actually do at 3am last Tuesday?”

And yes, you can try to patch around that. Add a retry loop and exponential backoff, ship logs somewhere. But now your retry state lives in the process memory that disappears on crash, your backoff is hand-rolled, and your observability is still pretty much log spelunking. At that point you’re not using cron anymore. You’re rebuilding a tiny, worse workflow engine around cron.
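
To make "hand-rolled" concrete, this is roughly what that patch looks like. It is a sketch with invented names and a simulated flaky fetch, and it shows the weakness too: the attempt counter is a local variable, so a crash loses it.

```python
import time
from typing import Callable, List, TypeVar

T = TypeVar("T")

def retry_with_backoff(
    fn: Callable[[], T],
    *,
    attempts: int = 4,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    # Try fn up to `attempts` times, doubling the wait after each failure.
    # All retry state lives in this process and is gone if it dies.
    for attempt in range(attempts):
        try:
            return fn()
        except Exception:
            if attempt == attempts - 1:
                raise
            sleep(base_delay * 2 ** attempt)
    raise ValueError("attempts must be >= 1")

# Simulated flaky fetch: fails twice, then succeeds
calls = {"n": 0}

def flaky() -> str:
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("rate limited")
    return "ok"

waits: List[float] = []
result = retry_with_backoff(flaky, sleep=waits.append)  # record waits instead of sleeping
print(result, waits)  # ok [1.0, 2.0]
```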

What is Temporal?

Temporal is a “durable execution platform”. What that really means in practice is that you’ll write ordinary functions — a Workflow that orchestrates things, and Activities that do the actual work — and Temporal will:

Make the execution survive process crashes,
Retry failed steps with backoff,
Prevent overlapping runs, and
Record the full history of every execution.

💡 Durable execution is the simple idea that your code should keep running to completion even if the machine running it doesn’t. The mental model is that your workflow is a function call that cannot be interrupted, even if the worker reboots halfway through. State lives in Temporal’s history, not in the worker’s memory.

The architecture for our project will look like this (by default you get the Temporal Web UI on localhost:8233):

To grok Temporal properly, understand that the Workflow owns when things happen, while the Activity owns what happens, i.e., the page fetch, the stats extraction, the file write. Workflows must stay deterministic; all side effects belong in Activities. Why does that matter? We’ll come back to that in a second.

Getting player data from MLB.com

Before any of the Temporal machinery, you need reliable data ingestion. For us, that means loading an MLB.com player page and extracting the stats blob embedded in the initial HTML.

MLB.com currently renders a player page with a JavaScript object that starts with stats: {"statsDatatable"...}. That's convenient: no browser, no Playwright, no screenshot automation needed.

It's critical to understand that a successful HTTP fetch on its own does not tell you that the stats blob is still there or that the row you care about was parsed correctly.
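
Pulling a JSON object out of the middle of a script tag works because `json.JSONDecoder.raw_decode` parses one value from the start of a string and ignores whatever follows. Here is a minimal sketch on an invented page fragment; the real MLB.com markup will differ.

```python
from json import JSONDecoder

# Hypothetical page fragment: a JS object literal with trailing script content
html = (
    '<script>window.app = { stats: {"statsDatatable": {"hitting": {"large": []}}}, '
    "other: 1 };</script>"
)

needle = 'stats: {"statsDatatable"'
i = html.find(needle)
if i == -1:
    raise ValueError("marker not found (layout may have changed)")
start = i + len("stats: ")

# raw_decode returns (value, end_index); text after end_index is left untouched
obj, end = JSONDecoder().raw_decode(html[start:])
rest = html[start + end:]
print(obj)
print(rest.startswith(", other"))
```

If the marker disappears, `find` returns -1 and the sketch raises instead of quietly parsing the wrong thing.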

mlb_player_stats.py

"""  
MLB.com player page: HTTP fetch + embedded JSON extraction.  
"""  
from __future__ import annotations  
import os  
import re  
from json import JSONDecoder  
from typing import Any, Dict, List  
from urllib.parse import quote  
import requests  

def _strip_tags(s: Any) -> Any:  
    if not isinstance(s, str):  
        return s  
    s = re.sub(r"<[^>]+>", "", s)
    return s.strip()  

def _sanitize_row(row: Dict[str, Any]) -> Dict[str, Any]:  
    out: Dict[str, Any] = {}  
    for k, v in row.items():  
        out[k] = _strip_tags(v)  
    return out  

def extract_stats_datatable(html: str) -> Dict[str, Any]:  
    needle = 'stats: {"statsDatatable"'  
    i = html.find(needle)  
    if i == -1:  
        raise ValueError(  
            "Could not find stats JSON marker (page layout may have changed)."  
        )  
    start = i + len("stats: ")  
    obj, _ = JSONDecoder().raw_decode(html[start:])  
    return obj  

def pick_current_season_row(rows: List[Dict[str, Any]]) -> Dict[str, Any] | None:  
    for row in rows:  
        h = row.get("header", "")  
        if isinstance(h, str) and "Regular Season" in h and "Career" not in h:  
            return row  
    return rows[0] if rows else None  

def build_requests_proxies() -> Dict[str, str] | None:  
    # Bright Data super proxy; returns None if credentials unset  
    explicit = os.getenv("BRIGHT_DATA_PROXY_URL", "").strip()  
    if explicit:  
        return {"http": explicit, "https": explicit}  
    host = os.getenv("BRIGHT_DATA_PROXY_HOST", "brd.superproxy.io").strip()  
    port = os.getenv("BRIGHT_DATA_PROXY_PORT", "33335").strip()  
    username = os.getenv("BRIGHT_DATA_PROXY_USERNAME", "").strip()  
    password = os.getenv("BRIGHT_DATA_PROXY_PASSWORD", "").strip()  
    if not username or not password:  
        return None  
    user_enc = quote(username, safe="")  
    pass_enc = quote(password, safe="")  
    proxy_url = f"http://{user_enc}:{pass_enc}@{host}:{port}"  
    return {"http": proxy_url, "https": proxy_url}  

def fetch_player_page(player_url: str, *, timeout: int = 60) -> requests.Response:  
    proxies = build_requests_proxies()  
    return requests.get(  
        player_url,  
        timeout=timeout,  
        proxies=proxies  
    )  

def build_stats_payload(player_url: str, *, timeout: int = 60) -> Dict[str, Any]:  
    # Fetch page, parse embedded hitting summary rows (sanitized)  
    r = fetch_player_page(player_url, timeout=timeout)  
    r.raise_for_status()  
    blob = extract_stats_datatable(r.text)  
    hitting_large = blob["statsDatatable"]["hitting"]["large"]  
    block = hitting_large[0] if isinstance(hitting_large, list) else hitting_large  
    filtered = block["filteredRows"]  
    current = pick_current_season_row(filtered)  
    career = next(  
        (row for row in filtered if row.get("header") == "Career Regular Season"),  
        None,  
    )  
    via_proxy = build_requests_proxies() is not None  
    return {  
        "source_url": player_url,  
        "http_status": r.status_code,  
        "via_bright_data_proxy": via_proxy,  
        "current_regular_season": _sanitize_row(current) if current else None,  
        "career_regular_season_row": _sanitize_row(career) if career else None,  
        "all_summary_rows": [_sanitize_row(row) for row in filtered],  
    }

Keeping the fetch inside an activity means the execution model stays unchanged while the network path can evolve independently. The same fetch_player_page call can run directly or be routed through a proxy layer without touching the workflow logic.

Note that Temporal can only give you execution reliability: retries, timeouts, and visibility. It cannot make a failing network succeed. If every attempt returns 429 from the same IP, Temporal will reliably retry a failing request until the policy is exhausted. You won’t know how valuable our proxy layer is until you really need it (if you’re following along, get it here).

Data Extraction with Temporal.io

The Temporal Workflow

@workflow.defn  
class StatsCollectionWorkflow:  
    @workflow.run  
    async def run(self, job: StatsJob) -> CollectStatsResult:  
        info = workflow.info()  
        return await workflow.execute_activity(  
            collect_stats,  
            CollectStatsInput(  
                player_url=job.player_url,  
                workflow_id=info.workflow_id,  
                run_id=info.run_id,  
                output_dir=job.output_dir,  
            ),  
            start_to_close_timeout=timedelta(minutes=10),  
            retry_policy=RetryPolicy(  
                initial_interval=timedelta(seconds=3),  
                backoff_coefficient=2.0,  
                maximum_interval=timedelta(minutes=2),  
                maximum_attempts=8,  
            ),  
        )

That RetryPolicy block is the part most of us have written manually at some point — a while loop, a try/except, a time.sleep, a counter, hopefully a max attempts check. Here it's declared once, lives outside the business logic, and survives worker crashes. If the worker process dies on attempt 3 of 8, the next worker that comes up picks up at attempt 4. The state is in Temporal, not in memory.

The start_to_close_timeout is the hard limit on how long a single activity attempt can run. Without it, a stalled HTTP request holds a worker slot indefinitely. Decidedly not what we want.
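To make the RetryPolicy above concrete, here is a rough sketch of the waits it implies between attempts. It assumes the backoff is initial_interval times backoff_coefficient to the power n, capped at maximum_interval, and it ignores the time each attempt itself takes; retry_waits is a hypothetical helper, not a Temporal API.

```python
from datetime import timedelta
from typing import List


def retry_waits(
    initial: timedelta,
    coefficient: float,
    maximum: timedelta,
    max_attempts: int,
) -> List[timedelta]:
    """Waits between consecutive attempts under capped exponential backoff."""
    waits = []
    for n in range(max_attempts - 1):  # no wait after the final attempt
        waits.append(min(initial * (coefficient ** n), maximum))
    return waits


# Same numbers as the RetryPolicy in the workflow above.
waits = retry_waits(timedelta(seconds=3), 2.0, timedelta(minutes=2), 8)
print([int(w.total_seconds()) for w in waits])  # [3, 6, 12, 24, 48, 96, 120]
print(sum(waits, timedelta()))                  # 0:05:09
```

So eight attempts spend a little over five minutes just waiting, on top of up to ten minutes per attempt from start_to_close_timeout.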

The Temporal Activity

Here’s activities.py:

"""Activities: MLB.com fetch + stats extraction + artifact write (all side effects here)."""  
from __future__ import annotations  
import json  
import os  
from dataclasses import dataclass  
from pathlib import Path  
from typing import Any, Dict, Union  
from temporalio import activity  
from temporal_cron.mlb_player_stats import build_stats_payload  

@dataclass  
class CollectStatsInput:  
    player_url: str  
    workflow_id: str  
    run_id: str  
    output_dir: str  

@dataclass  
class CollectStatsResult:  
    artifact_path: str  
    home_runs: Union[str, int]  
    player_url: str  

def _atomic_write_json(path: Path, data: Dict[str, Any]) -> None:  
    path.parent.mkdir(parents=True, exist_ok=True)  
    tmp = path.with_suffix(path.suffix + ".tmp")  
    tmp.write_text(json.dumps(data, indent=2), encoding="utf-8")  
    tmp.replace(path)  

@activity.defn  
def collect_stats(input: CollectStatsInput) -> CollectStatsResult:  
    # One HTTP try per activity attempt; workflow RetryPolicy owns backoff/attempts.  
    data = build_stats_payload(input.player_url, timeout=60)  
    current = data.get("current_regular_season") or {}  
    hr = current.get("homeRuns", 0)  
    base = Path(input.output_dir or os.getenv("OUTPUT_DIR", "./data/runs"))  
    safe_wid = input.workflow_id.replace(os.sep, "_").replace(":", "_")  
    safe_rid = input.run_id.replace(os.sep, "_").replace(":", "_")  
    out_path = base / f"{safe_wid}__{safe_rid}.json"  
    payload = {  
        "workflow_id": input.workflow_id,  
        "run_id": input.run_id,  
        "player_url": input.player_url,  
        "data": data,  
    }  
    _atomic_write_json(out_path, payload)  
    return CollectStatsResult(  
        artifact_path=str(out_path.resolve()),  
        home_runs=hr if hr is not None else 0,  
        player_url=input.player_url,  
    )

The output path uses the run ID, not a fixed filename. Every execution gets its own artifact — stats-manual-abc123__run456.json. No races, no overwrites, and you have a full history of every run. You can diff two runs. You can see exactly what data you had on any given night. This alone would have saved me.

The file write is atomic: _atomic_write_json (in the excerpt above) writes to a .tmp file first, then replace() on the same filesystem. A reader either sees the old file or the new file — never a partial write. The brittle script calls write_text() directly; if the process died mid-write, you got corrupted JSON and a confusing parse error at the worst possible time.
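Because every run leaves its own artifact, comparing two nights is a few lines of code. This is a sketch that assumes the artifact layout written by collect_stats (a top-level "data" key holding "current_regular_season"); diff_current_season is a hypothetical helper, not part of the repo.

```python
import json
from pathlib import Path
from typing import Any, Dict, Tuple


def diff_current_season(a: Path, b: Path) -> Dict[str, Tuple[Any, Any]]:
    """Return {field: (old, new)} for fields that differ between two run artifacts."""

    def row(path: Path) -> Dict[str, Any]:
        doc = json.loads(path.read_text(encoding="utf-8"))
        return doc.get("data", {}).get("current_regular_season") or {}

    old, new = row(a), row(b)
    return {
        k: (old.get(k), new.get(k))
        for k in sorted(set(old) | set(new))
        if old.get(k) != new.get(k)
    }
```

Point it at two files under data/runs/ and you get back only the stats that moved between those runs.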

💡 Note that collect_stats is a sync function, not async. That's intentional — sync activities run in a thread pool, so blocking I/O doesn't block the event loop. The Temporal SDK supports both; sync is the right call when your activity is mostly waiting on a network request.

The Temporal Worker

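from concurrent.futures import ThreadPoolExecutor  
from temporalio.client import Client  
from temporalio.worker import Worker  

async def _main() -> None:  
    client = await Client.connect(host, namespace=namespace)  
    # collect_stats is a sync activity, so the worker needs a thread pool to run it in.  
    # The pool size of 10 is an arbitrary choice; tune it for your workload.  
    with ThreadPoolExecutor(max_workers=10) as executor:  
        worker = Worker(  
            client,  
            task_queue=task_queue,  
            workflows=[StatsCollectionWorkflow],  
            activities=[collect_stats],  
            activity_executor=executor,  
            max_concurrent_activities=10,  
        )  
        await worker.run()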

The Worker is the only process that ever touches MLB.com. The workflow and scheduler are pure orchestration — they tell Temporal what to do, they don’t do any work themselves. You can scale workers horizontally without touching the scheduling layer. You’ll want that when you take this pattern to production.

The worker view shows the stats-pipeline worker polling alongside Temporal's own system worker.

The Temporal Schedule

Putting this on a schedule is one function call:

schedule = Schedule(  
    action=ScheduleActionStartWorkflow(  
        StatsCollectionWorkflow.run,  
        job,  
        id=workflow_id,  
        task_queue=task_queue,  
    ),  
    spec=ScheduleSpec(cron_expressions=[cron]),  
    policy=SchedulePolicy(overlap=ScheduleOverlapPolicy.SKIP),  
)

SKIP is the thing cron simply cannot do.

Say MLB.com is slow today, or your fetches are getting rate-limited harder than usual, and your run takes 12 minutes while your schedule fires every 15. Eventually a slow run bleeds into the next tick. With cron, you now have two instances running simultaneously, both writing to latest.json, racing each other. With SKIP, the new scheduled run sees the previous one is still active and does nothing. When the previous run finishes, the schedule resumes normally at the next tick. That's an entire class of bug you stop thinking about.
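Here is a toy model of that behavior, just to show which ticks get dropped. It is not how Temporal implements SKIP internally; simulate_skip and the run durations are made up for illustration.

```python
from typing import List, Tuple


def simulate_skip(interval: int, durations: List[int]) -> List[Tuple[int, str]]:
    """Ticks fire every `interval` minutes; a tick is skipped while the
    previously started run is still going."""
    events = []
    busy_until = 0  # minute at which the active run finishes
    started = 0     # number of runs started so far
    tick = 0
    while started < len(durations):
        if tick >= busy_until:
            busy_until = tick + durations[started]
            started += 1
            events.append((tick, "start"))
        else:
            events.append((tick, "skip"))
        tick += interval
    return events


# A 12-minute run fits inside the 15-minute window; a 20-minute run does not.
print(simulate_skip(15, [12, 20, 12]))
# [(0, 'start'), (15, 'start'), (30, 'skip'), (45, 'start')]
```

The run that started at minute 15 is still active at minute 30, so that tick is skipped and the schedule picks up again at minute 45.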

The schedule script also handles create-or-update correctly — describe() first, catch NOT_FOUND, then either create or update:

try:  
    await handle.describe()  
except RPCError as err:  
    if err.status != RPCStatusCode.NOT_FOUND:  
        raise  
    await client.create_schedule(schedule_id, schedule)  
else:  
    await handle.update(lambda _input: ScheduleUpdate(schedule=schedule))

Run it once to create the schedule. Run it again to change the cron expression or job parameters. Same command either way.

What you actually see in the UI

Pop open http://localhost:8233 while a workflow is running. Every step is there — which activity ran, how many attempts it took, what the retry intervals were, what came back. If it failed on attempt 2 and succeeded on attempt 5, you can see that. You can see the exact input that went in and the exact output that came out. You can see how long each attempt took.

The workflow list gives you the first-level answer cron never gives cleanly: what ran, when, and whether it completed.

The timeline connects the workflow input, activity execution, and output artifact in one place. The event history is the audit trail: scheduled tasks, activity start/completion, workflow task transitions, and final result.

The activity details will also show the successful third attempt and preserve the previous failure: 429 Too Many Requests.

Temporal activity event showing attempt #3 with previous HTTP 429 failure details

Compare that to the cron version — just a process exit code and whatever you print() to stdout. If the job failed at 3am and you weren't tailing logs, that information is gone.

This observability is honestly why teams end up on Temporal even for jobs that aren’t that complicated. This removes entire categories of debugging work. No log spelunking to figure out whether a retry happened. No guessing how many attempts ran. No reconstructing a timeline from scattered stdout. You won’t have to infer from fragments; all this is something you can just look up.

Running it yourself

Commands below assume uv is available with a venv set up.

# Start the local Temporal server  
temporal server start-dev  
# In another terminal, start the worker  
uv run temporal-cron-worker  
# Trigger a single run  
uv run temporal-cron-start  
# Or put it on a schedule (defaults to every 15 minutes, set SCHEDULE_CRON in .env to change)  
uv run temporal-cron-schedule

Open http://localhost:8233 and you'll see the workflow execution. Artifacts land under data/runs/, one file per run, named by workflow and run ID.

A Note on Production

This just runs against a local Temporal dev server with a single worker. Taking it to production means picking Temporal Cloud or running your own cluster, adding proper secrets management, structured logging, metrics, and deployment automation.

Cron is all you need for simple, isolated tasks. Probably not so much when you have to introduce retries, external dependencies, or jobs that can overlap or run longer than their schedule.

When you get into that zone, Temporal gives you a durable execution layer around ingest: retries, timeouts, overlap control, and a history you can audit. And then when that layer needs to scale up or if you start hitting IP-based (or geo-based, even) friction, Bright Data’s proxies give you the hardened network path you’ll need. It’s been a pretty natural pairing, in my experience. You can route the same requests.get() call through a proxy and let Temporal keep owning retries, timeouts, and audit history.

Regardless, if you remember one pattern for Temporal, make it this one: Workflows own orchestration, and Activities own side effects.

I Built a $0 Search Engine on Real Web Data (No Algolia or Elasticsearch)

Prithwish Nath — Tue, 21 Apr 2026 06:43:00 +0000

I’ve been reviewing some old RAG code I wrote a year ago and boy has it not aged well. For context: frequently, I’ll need to do what I call a fast “literature” pass. What’s the latest opinion on long context vs. RAG? What’s new for agentic retrieval? Where does hybrid search fit in 2026?

I depend heavily on arXiv papers for this. Using a SERP API (Bright Data) for Google, running site:arxiv.org plus a well-framed query gets me recent, relevant papers. I’d run four or five of these, collate the results…and then inevitably get bogged down scrolling JSON, opening new tabs, running grep. Just godawful UX for what is, at its core, a search problem, not a data problem.

So I spent this past week refactoring it all, and eventually ended up building a local faceted search surface over real web data.

Fetch organic Google results (site:arxiv.org + query)
Index them with Typesense (a FOSS, lightning-fast, local-first Algolia alternative)
Proxy queries from the browser with a few lines of Python server code

This gives me a live, filterable index where I now quickly search across all my query runs at once, see which papers surfaced under which query angle, and spot overlaps in seconds instead of minutes.

This pattern should work for any research domain where Google is a better discovery layer than the source’s own search, so I’m open sourcing it and writing this up. I hope it’s useful!

Find the full code here:

GitHub - sixthextinction/typesense: POC for local-first Algolia-style search but FOSS. Ingests…

Prerequisites

I wasn’t kidding about keeping the Python side minimal. We’ll only need Docker, and three Python packages (no frameworks):

requests>=2.28.0  
python-dotenv>=1.0.0  
typesense>=0.21.0

The only one worth mentioning is the Typesense Python client — it handles schema creation, JSONL import, and search. The other two are bog-standard Requests and python-dotenv.

Typesense itself runs in Docker Compose. Let’s make it one container with a persistent volume that survives restarts:

docker-compose.yml

services:  
  typesense:  
    image: typesense/typesense:26.0  
    restart: unless-stopped  
    ports:  
      - "8108:8108"  
    volumes:  
      - typesense-data:/data  
    command: >  
      --data-dir /data  
      --api-key devtypesense  
      --listen-port 8108  
      --enable-cors  

volumes:  
  typesense-data:

# and then you can do  
docker compose up -d

.env

BRIGHT_DATA_API_KEY=your_api_key  
BRIGHT_DATA_ZONE=serp  
BRIGHT_DATA_COUNTRY=us  
TYPESENSE_API_KEY=devtypesense

The TYPESENSE_API_KEY can be anything really — it just has to match the --api-key flag in the compose file. I'll explain why the browser never sees it when we get to serve.py.

Bright Data credentials come from your account. If you’re swapping in another SERP API, this is the only file you’d change.

How the pieces fit together

We have four files:

bright_data_serp.py   # Bright Data SERP client  
ingest.py             # fetch → transform → upsert into Typesense  
serve.py              # /api/search proxy + static file server  
static/index.html     # search UI

ingest.py sends queries to Bright Data, maps each organic result to a Typesense document, and bulk-imports the batch. After that, serve.py sits between the browser and Typesense — authenticated calls go out, plain JSON comes back. The browser never talks to Typesense directly.

Let’s go through each.

How to get structured SERP data from Bright Data

Our client will just POST to https://api.brightdata.com/request with a Bearer token, a zone name, and a Google URL string.

Critically, you have to include brd_json=1. Without it you get raw HTML. With it, you get a parsed organic JSON array — each row has title, link, description, rank, and usually more.

bright_data_serp.py

import json  
import os  
import time  
from typing import Any, Dict, Optional  

import requests  
from dotenv import load_dotenv  

load_dotenv()  


def limit_organic(data: Dict[str, Any], max_results: int) -> Dict[str, Any]:  
    """Keep at most ``max_results`` organic rows. Google/Bright Data often ignore ``&num=``; slice client-side."""  
    if max_results <= 0:  
        return data  
    organic = data.get("organic")  
    if isinstance(organic, list) and len(organic) > max_results:  
        return {**data, "organic": organic[:max_results]}  
    return data  


class BrightDataSERPClient:  
    def __init__(  
        self,  
        api_key: Optional[str] = None,  
        zone: Optional[str] = None,  
        country: Optional[str] = None,  
    ):  
        self.api_key = api_key or os.getenv("BRIGHT_DATA_API_KEY")  
        self.zone = zone or os.getenv("BRIGHT_DATA_ZONE")  
        self.country = country or os.getenv("BRIGHT_DATA_COUNTRY")  
        self.api_endpoint = "https://api.brightdata.com/request"  

        if not self.api_key:  
            raise ValueError("BRIGHT_DATA_API_KEY is required.")  
        if not self.zone:  
            raise ValueError("BRIGHT_DATA_ZONE is required.")  

        self.session = requests.Session()  
        self.session.headers.update(  
            {  
                "Content-Type": "application/json",  
                "Authorization": f"Bearer {self.api_key}",  
            }  
        )  

    def search(  
        self,  
        query: str,  
        num_results: int = 10,  
        language: Optional[str] = None,  
        country: Optional[str] = None,  
        max_retries: int = 2,  
    ) -> Dict[str, Any]:  
        last_err: Optional[Exception] = None  
        for attempt in range(max_retries + 1):  
            try:  
                return self._do_search(query, num_results, language, country)  
            except Exception as e:  
                last_err = e  
                if attempt < max_retries:  
                    time.sleep(0.5 * (attempt + 1))  
        assert last_err is not None  
        raise last_err  

    def _do_search(  
        self,  
        query: str,  
        num_results: int,  
        language: Optional[str],  
        country: Optional[str],  
    ) -> Dict[str, Any]:  
        # Omit &num=: deprecated by Google (Bright Data strips it); use limit_organic after fetch.  
        search_url = (  
            f"https://www.google.com/search"  
            f"?q={requests.utils.quote(query)}"  
            f"&brd_json=1"  
        )  
        if language:  
            search_url += f"&hl={language}&lr=lang_{language}"  
        target_country = country or self.country  
        payload: Dict[str, Any] = {  
            "zone": self.zone,  
            "url": search_url,  
            "format": "json",  
        }  
        if target_country:  
            payload["country"] = target_country  

        response = self.session.post(self.api_endpoint, json=payload, timeout=60)  
        response.raise_for_status()  
        result = response.json()  
        if not isinstance(result, dict):  
            raise RuntimeError(f"Bright Data unexpected response type: {type(result)}")  
        inner_status = result.get("status_code")  
        if inner_status is not None and inner_status != 200:  
            raise RuntimeError(f"Bright Data SERP status_code={inner_status}")  
        if "body" in result:  
            body = result["body"]  
            if isinstance(body, str):  
                if not body.strip():  
                    raise RuntimeError("Bright Data SERP empty body")  
                result = json.loads(body)  
            else:  
                result = body  
        elif "organic" not in result:  
            raise RuntimeError("Bright Data response missing 'body' and 'organic'")  
        return limit_organic(result, num_results)

A very common gotcha: no matter how intuitive it might feel, do not put &num= on that search URL to request N results like this:

search_url = (  
    f"https://www.google.com/search"  
    f"?q={requests.utils.quote(query)}"  
    f"&num=50"  
    f"&brd_json=1"  
)

Google deprecated the num parameter for ordinary web search back in September 2025. Now, you typically get about one page of organics (~10). So we'll have to cap rows in code with limit_organic(..., num_results) — slice organic after the response, not via the URL.

With "format": "json", the JSON you parse from the HTTP response is an envelope, not the SERP object itself: status_code, headers, and body. The real SERP payload is inside body, usually as a JSON string you must json.loads again.

That means a 200 from api.brightdata.com alone is not enough: check the inner status_code too (e.g. an inner 401 comes with an empty body). The client rejects a non-200 inner status, an empty body, and a response that has neither body nor organic, so ingest doesn’t silently index nothing.

result = response.json()  
inner = result.get("status_code")  
if inner is not None and inner != 200:  
    raise RuntimeError(f"Bright Data SERP status_code={inner}")  
if "body" in result:  
    body = result["body"]  
    if isinstance(body, str):  
        if not body.strip():  
            raise RuntimeError("Bright Data SERP empty body")  
        result = json.loads(body)  
    else:  
        result = body  
# ...  
return limit_organic(result, num_results)

If you skip the unwrap and pass the top-level dict to organic_to_documents, there's no organic key — and with no check, you get an empty index and no error message. It just silently indexes nothing. (Ask me how I know.🙃)

Finally, our client retries with a short backoff — 0.5s * (attempt + 1) — so a transient failure on one query doesn't kill the whole run.
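Worth noting that this backoff is linear, not exponential. A quick sketch of the sleeps that search() performs, mirroring its loop (backoff_delays is a hypothetical helper for illustration):

```python
from typing import List


def backoff_delays(max_retries: int, base: float = 0.5) -> List[float]:
    """Sleep before each retry: base * (attempt + 1) for attempts 0..max_retries-1.
    There is no sleep after the final attempt; the error is re-raised instead."""
    return [base * (attempt + 1) for attempt in range(max_retries)]


print(backoff_delays(2))  # [0.5, 1.0]
```

With the default max_retries=2, a query gets three attempts and at most 1.5 seconds of sleeping before the error propagates to ingest.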

How to design the Typesense Schema

Typesense needs a collection before anything can go in. The schema maps directly to the shape of an organic SERP result — I didn’t add any fields I wasn’t already getting for free:

ingest.py

# Fetches Google SERP via Bright Data THEN indexes organic results into Typesense.  
# Use --append to upsert into an existing index.   
# Use --query and/or --queries-file to override the built-in demo query list.  

import argparse  
import hashlib  
import json  
import os  
import time  
from pathlib import Path  
from typing import Any, Dict, List  
from urllib.parse import urlparse  

import typesense  
from dotenv import load_dotenv  
from typesense.exceptions import ObjectNotFound  

from bright_data_serp import BrightDataSERPClient  

load_dotenv()  

COLLECTION = "serp_results"  

# Some obvious "RAG and retrieval" topics  
DEFAULT_QUERIES = [  
    "site:arxiv.org retrieval augmented generation 2026",  
    "site:arxiv.org hybrid search reranking 2026",  
    "site:arxiv.org agentic RAG 2026",  
    "site:arxiv.org long context vs RAG 2026",  
]  


def typesense_client() -> typesense.Client:  
    return typesense.Client(  
        {  
            "nodes": [  
                {  
                    "host": os.getenv("TYPESENSE_HOST", "localhost"),  
                    "port": os.getenv("TYPESENSE_PORT", "8108"),  
                    "protocol": os.getenv("TYPESENSE_PROTOCOL", "http"),  
                }  
            ],  
            "api_key": os.environ["TYPESENSE_API_KEY"],  
            "connection_timeout_seconds": 30,  
        }  
    )  


def collection_schema() -> Dict[str, Any]:  
    return {  
        "name": COLLECTION,  
        "fields": [  
            {"name": "title", "type": "string"},  
            {"name": "url", "type": "string"},  
            {"name": "snippet", "type": "string", "optional": True},  
            {"name": "source_query", "type": "string", "facet": True},  
            {"name": "domain", "type": "string", "facet": True},  
            {"name": "position", "type": "int32"},  
        ],  
        "default_sorting_field": "position",  
    }  


def organic_to_documents(data: Dict[str, Any], source_query: str) -> List[Dict[str, Any]]:  
    organic = data.get("organic")  
    if not isinstance(organic, list):  
        return []  
    out: List[Dict[str, Any]] = []  
    for i, row in enumerate(organic):  
        if not isinstance(row, dict):  
            continue  
        url = row.get("link") or row.get("url") or ""  
        if not url:  
            continue  
        title = (row.get("title") or "")[:8000]  
        snippet = (row.get("description") or row.get("snippet") or "") or ""  
        snippet = snippet[:16000]  
        pos = row.get("rank") or row.get("position") or (i + 1)  
        try:  
            position = int(pos)  
        except (TypeError, ValueError):  
            position = i + 1  
        domain = urlparse(url).netloc or ""  
        doc_id = hashlib.sha256(f"{url}\t{source_query}".encode()).hexdigest()  
        out.append(  
            {  
                "id": doc_id,  
                "title": title,  
                "url": url,  
                "snippet": snippet,  
                "source_query": source_query,  
                "domain": domain,  
                "position": position,  
            }  
        )  
    return out  


def ensure_collection(client: typesense.Client, *, recreate: bool) -> None:  
    if recreate:  
        try:  
            client.collections[COLLECTION].delete()  
        except ObjectNotFound:  
            pass  
        client.collections.create(collection_schema())  
        return  
    try:  
        client.collections[COLLECTION].retrieve()  
    except ObjectNotFound:  
        client.collections.create(collection_schema())  


def load_queries(args: argparse.Namespace) -> List[str]:  
    queries: List[str] = []  
    if args.queries_file:  
        text = Path(args.queries_file).read_text(encoding="utf-8")  
        for line in text.splitlines():  
            line = line.strip()  
            if not line or line.startswith("#"):  
                continue  
            queries.append(line)  
    extra = args.queries or []  
    queries.extend(extra)  
    if not queries:  
        return list(DEFAULT_QUERIES)  
    return queries  


def main() -> None:  
    p = argparse.ArgumentParser(description="Ingest Bright Data SERP into Typesense.")  
    p.add_argument(  
        "--num-results",  
        type=int,  
        default=8,  
        help="Max organic rows to index per query after fetch (Google ignores &num=; we slice client-side).",  
    )  
    p.add_argument(  
        "--delay",  
        type=float,  
        default=0.6,  
        help="Seconds between Bright Data requests.",  
    )  
    p.add_argument(  
        "--append",  
        action="store_true",  
        help="Do not drop the collection; create it only if missing. Use for multiple ingest runs into one index.",  
    )  
    p.add_argument(  
        "--query",  
        action="append",  
        dest="queries",  
        metavar="Q",  
        help="SERP query string (repeatable). Default: built-in demo queries if no --queries-file/--query.",  
    )  
    p.add_argument(  
        "--queries-file",  
        type=str,  
        default=None,  
        help="Path to a file with one query per line (# and blank lines ignored).",  
    )  
    args = p.parse_args()  

    client = typesense_client()  
    ensure_collection(client, recreate=not args.append)  

    bd = BrightDataSERPClient()  
    all_docs: List[Dict[str, Any]] = []  
    query_list = load_queries(args)  

    for q in query_list:  
        print(f"Query: {q!r}")  
        try:  
            raw = bd.search(q, num_results=args.num_results)  
        except Exception as e:  
            print(f"  error: {e}")  
            continue  
        docs = organic_to_documents(raw, q)  
        print(f"  indexed {len(docs)} organic rows")  
        all_docs.extend(docs)  
        time.sleep(args.delay)  

    if not all_docs:  
        print("No documents to import. Check Bright Data credentials and SERP response.")  
        return  

    jsonl = "\n".join(json.dumps(d, ensure_ascii=False) for d in all_docs)  
    imp = client.collections[COLLECTION].documents.import_(jsonl, {"action": "upsert"})  
    # import_ returns one JSON object per line  
    errors = [line for line in imp.split("\n") if line and '"success":false' in line]  
    if errors:  
        print("Import reported errors (first few):", errors[:3])  
    print(f"Done. Total documents: {len(all_docs)}")  


if __name__ == "__main__":  
    main()

Two fields have facet: True: source_query and domain. These are what the filter chips in the UI are built on. source_query is the exact string sent to the SERP API — i.e. not a label you add later, the actual query. domain is extracted from the URL at ingest time.

Both become filterable for free here, which is a huge win for us.

Also, default_sorting_field: "position" means results come back in the same order Google returned them. I do want that as a default — it's the ranking signal I'm using Bright Data to get in the first place.
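To show how those two facet fields get used at query time, here is a rough sketch of building search parameters for this schema. The parameter names (q, query_by, facet_by, filter_by, sort_by) are standard Typesense search parameters, but build_search_params is a hypothetical helper, the choice of query_by fields is my assumption, and the real serve.py may set things differently. I spell out sort_by explicitly so that rank 1 comes first regardless of defaults.

```python
from typing import Any, Dict, Optional


def build_search_params(
    q: str,
    source_query: Optional[str] = None,
    domain: Optional[str] = None,
) -> Dict[str, Any]:
    """Build a Typesense search request that facets on source_query and domain."""
    filters = []
    if source_query:
        # Backticks keep spaces, colons, and commas in the value literal.
        filters.append(f"source_query:=`{source_query}`")
    if domain:
        filters.append(f"domain:=`{domain}`")
    params: Dict[str, Any] = {
        "q": q or "*",  # "*" matches everything when the search box is empty
        "query_by": "title,snippet",
        "facet_by": "source_query,domain",
        "sort_by": "position:asc",
    }
    if filters:
        params["filter_by"] = " && ".join(filters)
    return params
```

The resulting dict is what you would hand to client.collections["serp_results"].documents.search(...); the facet counts in the response are what the chips render from.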

Some Common Gotchas

When you’re mapping organic results to documents, the first question is how to generate document IDs. The move that feels right is to simply hash the URL — deduplicate on URL, one document per link.

Don’t listen to that instinct. Don’t do this:

doc_id = hashlib.sha256(url.encode()).hexdigest()

What you should do is bake the query into the ID so the same link under two Bright Data runs is two documents, each tagged with the query that surfaced it:

doc_id = hashlib.sha256(f"{url}\t{source_query}".encode()).hexdigest()

The ID is the sha256 of the URL and source_query joined by a tab, so the same paper appearing under two different queries becomes two separate documents. Search for a paper title and both facet chips show up, so you can see exactly which of your Bright Data runs found it. If you hash on URL alone, that's gone permanently. The index looks cleaner but you've thrown away the only thing that makes the source_query facet meaningful.
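A tiny demonstration of the difference, using the same two hashing schemes side by side. The URL is a placeholder, not a real paper, and the function names are mine.

```python
import hashlib


def url_only_id(url: str) -> str:
    return hashlib.sha256(url.encode()).hexdigest()


def url_query_id(url: str, source_query: str) -> str:
    return hashlib.sha256(f"{url}\t{source_query}".encode()).hexdigest()


url = "https://arxiv.org/abs/0000.00000"  # placeholder URL
queries = [
    "site:arxiv.org agentic RAG 2026",
    "site:arxiv.org long context vs RAG 2026",
]

# URL-only: both queries map to one ID, so the second upsert overwrites the first.
print(len({url_only_id(url) for _ in queries}))      # 1
# URL + query: one document per (url, query) pair.
print(len({url_query_id(url, q) for q in queries}))  # 2
```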

One more thing that will ruin your day if you miss it: Bright Data returns link in most payloads but url in some. Likewise, either description or snippet can carry the text for the snippet field, depending on the response. Handle both, or some batches will index with blank snippets and no errors or warnings:

url     = row.get("link") or row.get("url") or ""  
snippet = (row.get("description") or row.get("snippet") or "")[:16000]

The difference between a snapshot vs. a corpus

ingest.py runs in two modes:

python ingest.py            # drops and recreates the collection  
python ingest.py --append   # creates only if missing, then upserts

Running without --append wipes and recreates the collection every time — that's probably fine for exploration, throwaway by design. --append creates the collection only if it doesn't exist, then upserts into it.

Here’s why that matters. Say I ran the default four queries on Monday, and on Thursday I wanted to add site:arxiv.org graph RAG 2026 to the same index, to compare it against what I'd already collected rather than start over. With --append, the new results land alongside the originals and the new seed query shows up as a chip immediately. Without it, I'd be choosing between Monday's index and Thursday's.

That’s what I meant by “collect once, query many times” — the index accumulates and doesn’t reset or get overwritten each time.
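If it helps, here is the difference between the two modes as a toy model, with a plain dict keyed by document ID standing in for the Typesense collection. This is an illustration only; ingest() here is hypothetical and does not talk to Typesense.

```python
from typing import Dict, List


def ingest(index: Dict[str, dict], docs: List[dict], *, append: bool) -> Dict[str, dict]:
    """Default mode drops and recreates; --append keeps what is already there."""
    index = dict(index) if append else {}
    for doc in docs:
        index[doc["id"]] = doc  # upsert: insert, or overwrite the same id
    return index


monday = [{"id": "a", "source_query": "q1"}, {"id": "b", "source_query": "q2"}]
thursday = [{"id": "c", "source_query": "q3"}]

corpus = ingest(ingest({}, monday, append=False), thursday, append=True)
snapshot = ingest(ingest({}, monday, append=False), thursday, append=False)
print(len(corpus), len(snapshot))  # 3 1
```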

Custom queries work inline or from a file:

python ingest.py --append --query "site:arxiv.org graph RAG 2026"  
python ingest.py --append --queries-file my_queries.txt

Keeping the Typesense API key server-side

You could point the browser straight at Typesense and skip serve.py entirely. The problem is that Typesense's API key is an admin key — the same one that can drop your collection. Put it in client-side JS and anyone who opens devtools has it.

So serve.py is just a proxy. The browser calls /api/search, the server makes the authenticated Typesense request and JSON comes back.

I kept it as stdlib [http.server](https://docs.python.org/3/library/http.server.html) — no Flask or FastAPI. Adding a framework to wrap thirty lines of routing is honestly just adding a dependency for the sake of having a dependency. If you want to build on top of this, swapping in your preferred framework takes an hour.

The search parameters passed to Typesense are set once.

serve.py

import json  
import os  
import urllib.parse  
from http.server import BaseHTTPRequestHandler, HTTPServer  
from pathlib import Path  

import typesense  
from dotenv import load_dotenv  

load_dotenv()  

STATIC = Path(__file__).resolve().parent / "static"  
COLLECTION = "serp_results"  
PORT = int(os.getenv("SERVE_PORT", "8765"))  


def client() -> typesense.Client:  
    return typesense.Client(  
        {  
            "nodes": [  
                {  
                    "host": os.getenv("TYPESENSE_HOST", "localhost"),  
                    "port": os.getenv("TYPESENSE_PORT", "8108"),  
                    "protocol": os.getenv("TYPESENSE_PROTOCOL", "http"),  
                }  
            ],  
            "api_key": os.environ["TYPESENSE_API_KEY"],  
            "connection_timeout_seconds": 10,  
        }  
    )  


class Handler(BaseHTTPRequestHandler):  
    _ts: typesense.Client | None = None  

    @classmethod  
    def typesense(cls) -> typesense.Client:  
        if cls._ts is None:  
            cls._ts = client()  
        return cls._ts  

    def log_message(self, fmt: str, *args: object) -> None:  
        print(f"[{self.address_string()}] {fmt % args}")  

    def do_GET(self) -> None:  
        parsed = urllib.parse.urlparse(self.path)  
        if parsed.path == "/api/search":  
            self._search(parsed.query)  
            return  
        if parsed.path == "/" or parsed.path == "/index.html":  
            self._file(STATIC / "index.html", "text/html; charset=utf-8")  
            return  
        self.send_error(404, "Not found")  

    def _file(self, path: Path, content_type: str) -> None:  
        if not path.is_file():  
            self.send_error(404, "Not found")  
            return  
        data = path.read_bytes()  
        self.send_response(200)  
        self.send_header("Content-Type", content_type)  
        self.send_header("Content-Length", str(len(data)))  
        self.end_headers()  
        self.wfile.write(data)  

    def _search(self, query: str) -> None:  
        qs = urllib.parse.parse_qs(query)  
        q = (qs.get("q") or [""])[0].strip()  
        fq = (qs.get("filter_by") or [""])[0].strip()  

        if not q:  
            payload = {  
                "hits": [],  
                "found": 0,  
                "facet_counts": [],  
                "q": q,  
            }  
            self._json(payload)  
            return  

        # Text search spans four stored fields (see ingest schema). Weights tune BM25-style  
        # ranking: a term in the title should matter more than the same term buried in the  
        # snippet, and more than an incidental match in the URL or domain string.  
        # Order MUST match query_by — Typesense applies weights positionally.  
        query_by = "title,snippet,url,domain"  
        query_by_weights = "4,3,1,1" # so titles are more important than snippets, which are more important than urls, which are more important than domains  

        params: dict = {  
            "q": q,  
            "query_by": query_by,  
            "query_by_weights": query_by_weights,  
            "facet_by": "source_query,domain",  
            "max_facet_values": 40,  
            "per_page": 25,  
        }  
        if fq:  
            params["filter_by"] = fq  

        try:  
            result = self.typesense().collections[COLLECTION].documents.search(params)  
        except Exception as e:  
            self.send_response(500)  
            self.send_header("Content-Type", "application/json")  
            self.end_headers()  
            self.wfile.write(json.dumps({"error": str(e)}).encode())  
            return  

        self._json(result)  

    def _json(self, obj: object) -> None:  
        data = json.dumps(obj, ensure_ascii=False).encode("utf-8")  
        self.send_response(200)  
        self.send_header("Content-Type", "application/json; charset=utf-8")  
        self.send_header("Content-Length", str(len(data)))  
        self.end_headers()  
        self.wfile.write(data)  


def main() -> None:  
    server = HTTPServer(("127.0.0.1", PORT), Handler)  
    print(f"SERP demo UI: http://127.0.0.1:{PORT}/")  
    server.serve_forever()  


if __name__ == "__main__":  
    main()

query_by_weights runs in the same order as query_by. A match in title outscores the same match in snippet, which outscores a match in url or domain. That nudges ranking toward "this is what the page is about" rather than "this word appears somewhere in the metadata" — no embeddings, no extra service, just the standard keyword-search lever.
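
Because the two strings are matched up by position, it's easy to edit one and forget the other. One way to make that mistake impossible is to build both from a single list of pairs. This is only a sketch; build_search_params is a hypothetical helper, not part of serve.py.

```python
def build_search_params(q: str, weighted_fields: list[tuple[str, int]]) -> dict:
    """Derive query_by and query_by_weights from one list so their order cannot drift."""
    if not weighted_fields:
        raise ValueError("need at least one field to search")
    fields = [name for name, _ in weighted_fields]
    weights = [str(weight) for _, weight in weighted_fields]
    return {
        "q": q,
        "query_by": ",".join(fields),
        "query_by_weights": ",".join(weights),
    }


params = build_search_params(
    "graph RAG",
    [("title", 4), ("snippet", 3), ("url", 1), ("domain", 1)],
)
print(params["query_by"], params["query_by_weights"])
```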

domain being in query_by is a small trick: searching arxiv.org directly returns everything from that domain. Useful when you've mixed sources in one index, and costs nothing.

facet_by returns counts alongside every search response — the UI builds chips from those without a second request.

One UX detail I cared about: if a facet filter produces zero results, the UI reruns the query without filter_by, keeps the chips populated from those broader counts, and tells you that your filters might be hiding matches. You don’t want a blank screen with zero explanation, do you? 🙃
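
In the demo this fallback lives in the browser JavaScript. The same logic, written as a Python sketch so it stays in one language with the rest of the code here, could look like the following. The search callable, the function name, and the filters_hid_matches flag are hypothetical names for illustration.

```python
from typing import Callable


def search_with_fallback(search: Callable[[dict], dict], q: str, filter_by: str = "") -> dict:
    """Run the filtered search; if it finds nothing, rerun unfiltered to keep the chips alive."""
    params = {"q": q}
    if filter_by:
        params["filter_by"] = filter_by
    result = search(params)

    if filter_by and result.get("found", 0) == 0:
        broad = search({"q": q})  # same query, no filter_by
        return {
            "hits": [],
            "found": 0,
            # Chips are built from the broader counts, not the empty filtered ones.
            "facet_counts": broad.get("facet_counts", []),
            "filters_hid_matches": broad.get("found", 0) > 0,
        }
    return {**result, "filters_hid_matches": False}


def fake_search(params: dict) -> dict:
    """Stand-in for the /api/search call: any filter yields zero hits."""
    if "filter_by" in params:
        return {"hits": [], "found": 0, "facet_counts": []}
    return {"hits": [{"id": "1"}], "found": 1, "facet_counts": [{"field_name": "domain"}]}


filtered = search_with_fallback(fake_search, "rag", "domain:=`example.org`")
unfiltered = search_with_fallback(fake_search, "rag")
print(filtered["filters_hid_matches"], unfiltered["found"])
```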

Typesense Facets in Vanilla JavaScript

The UI can just be regular JavaScript/CSS. I won't go into much detail, since frontend UI design isn’t the point of this post. All you need is some JS logic that hits /api/search, renders hits, and builds chips from facet_counts.

Facet state is two variables:

let filterSq  = "";  // active source_query filter  
let filterDom = "";  // active domain filter

Clicking a chip toggles the relevant variable and re-runs the search. Multiple active filters compose with &&:

function buildFilterBy() {  
  var parts = [];  
  if (filterSq)  parts.push("source_query:=`" + filterSq + "`");  
  if (filterDom) parts.push("domain:=`" + filterDom + "`");  
  return parts.join(" && ");  
}

That string goes straight into Typesense’s filter_by — the UI is just a thin layer over native filter syntax. Nothing to maintain on the client side.

Each result card shows the title, snippet, domain, and the seed query that produced it. That last tag is the thing. You can see at a glance which Bright Data run each result came from — i.e. which question you were asking when you made the query.

Running it

# 1. Start Typesense  
docker compose up -d  

# 2. Install Python deps  
pip install -r requirements.txt  

# 3. Ingest the demo queries  
python ingest.py

You'll see:

Query: 'site:arxiv.org retrieval augmented generation 2026'  
  indexed 8 organic rows  
Query: 'site:arxiv.org hybrid search reranking 2026'  
  indexed 8 organic rows  
Query: 'site:arxiv.org agentic RAG 2026'  
  indexed 8 organic rows  
Query: 'site:arxiv.org long context vs RAG 2026'  
  indexed 8 organic rows  
Done. Total documents: 32

# 4. Start the UI  
python serve.py

Open http://127.0.0.1:8765/ (or whatever port you set with SERVE_PORT). You should see the empty search shell first.

Search for memory, chunk, graph, RAG. Click a seed query chip to isolate a single SERP run. If you've mixed domains, the domain chips filter those too.

Second pass, same index:

python ingest.py --append --query "site:arxiv.org graph RAG 2026"

New seed query appears as a chip immediately. Everything you indexed before is still there.

What query "provenance" actually means

The default run collects ~32 arxiv results tagged across four seed queries. Search for RAG or memory and you get hits from all four runs mixed together.

Now the interesting question is this: are the results under “agentic RAG 2026” the same papers as under “long context vs RAG 2026”?

We can verify this quickly.

Click the site:arxiv.org agentic RAG 2026 chip: that’s one shortlist. Clear it, then click site:arxiv.org long context vs RAG 2026: that's another. Some papers appear in both, and flipping between the two chips is a quick way to spot them. Those are the ones Google considers relevant regardless of how you framed the question. The ones in only one list are specific to that framing.

This is what I mean by provenance. The source_query facet isn't a topic label. It's a record of which question you were asking when you collected the data. A paper showing up under multiple seeds is telling you something, and it is not a deduplication problem.
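
If you'd rather compute that overlap than click through it, a small sketch is below. It assumes you already have the indexed documents as plain dicts with url and source_query keys (for example from an export). The function names and the example.org URLs are placeholders.

```python
from collections import defaultdict


def urls_by_seed(docs: list[dict]) -> dict[str, set[str]]:
    """Group document URLs under the seed query that found them."""
    grouped: dict[str, set[str]] = defaultdict(set)
    for doc in docs:
        grouped[doc["source_query"]].add(doc["url"])
    return grouped


def compare_seeds(docs: list[dict], seed_a: str, seed_b: str) -> dict[str, set[str]]:
    """Split URLs into: found by both seeds, only by the first, only by the second."""
    grouped = urls_by_seed(docs)
    a = grouped.get(seed_a, set())
    b = grouped.get(seed_b, set())
    return {"both": a & b, "only_a": a - b, "only_b": b - a}


AGENTIC = "site:arxiv.org agentic RAG 2026"
LONG_CTX = "site:arxiv.org long context vs RAG 2026"
docs = [
    {"url": "https://example.org/paper-1", "source_query": AGENTIC},
    {"url": "https://example.org/paper-2", "source_query": AGENTIC},
    {"url": "https://example.org/paper-1", "source_query": LONG_CTX},
    {"url": "https://example.org/paper-3", "source_query": LONG_CTX},
]
overlap = compare_seeds(docs, AGENTIC, LONG_CTX)
print(overlap["both"])
```

This only works because the same URL is stored once per seed query. With URL-only IDs, the second query would overwrite the first and the "both" set would always be empty.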

One honest caveat, though: this is navigation over SERP metadata — titles, snippets, URLs. It can’t search inside the PDFs. What it does is let me triage thirty papers in two minutes instead of twenty, which is the problem I actually had.

Frequently Asked Questions (FAQ)

Q: How do I get Google search results as JSON with Bright Data?

A: POST to https://api.brightdata.com/request with a Bearer token, your zone name, and a Google URL that includes &brd_json=1. That flag is what flips the response from raw HTML to a parsed organic array (each row has title, link, description, rank). The JSON you get back is an envelope — the SERP payload is inside body, usually as a JSON string you have to json.loads a second time.

Q: Typesense vs Meilisearch vs Elasticsearch — which should I pick for a local search index?

A: For this kind of workload (a small, local, faceted index over web data) Typesense and Meilisearch are both reasonable but Elasticsearch is overkill. Typesense is in-memory C++, sub-millisecond latency, facets and typo tolerance on by default, one Docker container, no JVM. Meilisearch is Rust, disk-backed (LMDB), handles larger corpora on less RAM, and has arguably nicer defaults for developer UX. Elasticsearch is what you use when you have a dedicated ops team, billions of documents, or log-analytics workloads.

Q: Why is the same URL indexed twice if it appears under two queries?

A: Because I want it that way. The document ID is sha256(url + source_query), so the same paper surfacing under "agentic RAG 2026" and under "long context vs RAG 2026" becomes two documents — each tagged with the query that found it. Searching for the title shows both facet chips, which is how you see which Bright Data run produced each hit. Hash on URL alone and that provenance is gone permanently.

Q: Does this actually search inside the papers, or just the search-result metadata?

A: Just metadata — titles, snippets, URLs, domains, and the seed query. It’s navigation over SERP rows, not full-text search over PDFs. If you need to search inside the papers, you’d add a second stage — download the PDFs, chunk, embed — on top of this index, using the URLs it surfaces as the candidate set.

Q: Can I use this pipeline for non-arxiv sources?

A: Yes. The pipeline has no opinion about what the queries are. site:arxiv.org is just the scenario I needed; swap in site:github.com, site:news.ycombinator.com, mix site: operators, or drop the filter entirely. The domain field is extracted from the URL at ingest time, so mixed-domain runs get a second facet chip for free.
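
For reference, pulling the domain out of a URL at ingest time takes a few lines of stdlib. This is a sketch, not the exact code from ingest.py. The function name and the choice to strip a leading www. are my assumptions.

```python
from urllib.parse import urlparse


def extract_domain(url: str) -> str:
    """Return the lowercased hostname of a URL, or '' if there is none."""
    host = (urlparse(url).hostname or "").lower()
    # Assumption: treat www.example.org and example.org as the same facet value.
    return host[4:] if host.startswith("www.") else host


print(extract_domain("https://arxiv.org/abs/0000.00000"))
print(extract_domain("https://www.github.com/some/repo"))
```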

Q: Why stdlib http.server instead of Flask or FastAPI?

A: Because the proxy is small enough that a framework import would be bigger than the logic it wraps. One handler, two routes (/ and /api/search), no middleware, no router — stdlib is enough. If you're building on top of this, swapping in FastAPI or your preferred framework takes an hour; I just didn't want to pay the dependency tax for a demo.

Key Takeaways

Bright Data solves the hard part of web data collection — proxy rotation, bot detection, structured extraction. Yadda, yadda.

What you do with that JSON is a different question. Export it and it answers the questions you had when you wrote the query. Or, index it, and it answers questions you haven’t even thought of yet.

Going from collection as an endpoint to collection as the start of something you can actually explore while you’re researching is what I was aiming for here. It took me a week to refactor something I’d been doing badly for a year, and about twenty minutes to run once it was done. It should also grow without much fuss: more queries, more domains, more --append runs, and the Typesense index grows with the research instead of resetting every time.

How To Validate Any API Response with Great Expectations (GX)

Prithwish Nath — Wed, 15 Apr 2026 05:59:38 +0000

A lot of my data analysis work is forensic (example here). I’ll pull repeated snapshots (content, traffic, SERPs, metadata) and look at patterns across hundreds of these ingestions. So naturally, bad batches will skew results in ways that are just convincing enough that I don’t catch them. Disastrous.

You can add try-catches, but a 200 OK from your API with empty titles and duplicate rows is still considered a success on the wire unless you model validation as its own failure path. So is there a better way?

This is exactly the gap Great Expectations (GX Core) fills. It’s an open-source Python library for defining declarative quality rules on your data — “quality gates”, explicit rules your data must pass before it’s trusted — things like “this field must never be null”, “these IDs must be unique”, or “this value must fall within a known range”. You basically codify what “good” looks like, run it as a validation step in your pipeline, and get a clear pass/fail on every batch — before bad data ever touches your analysis.
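
To show what those three kinds of rules mean before any GX code appears, here is the same idea in plain Python. This is deliberately not the Great Expectations API, just a hand-rolled sketch. The column names and the 1 to 100 rank range are assumptions for illustration.

```python
def check_batch(rows: list[dict], max_rank: int = 100) -> tuple[bool, list[str]]:
    """Return (passed, failures) for one batch of result rows."""
    failures = []

    # "This field must never be null": every row needs a non-empty title.
    if any(not str(row.get("title") or "").strip() for row in rows):
        failures.append("title is null or empty")

    # "These IDs must be unique": no URL may repeat within the batch.
    urls = [row.get("url") for row in rows]
    if len(urls) != len(set(urls)):
        failures.append("url is duplicated")

    # "This value must fall within a known range": rank between 1 and max_rank.
    def rank_ok(value) -> bool:
        return isinstance(value, int) and 1 <= value <= max_rank

    if not all(rank_ok(row.get("rank")) for row in rows):
        failures.append("rank is out of range")

    return (not failures, failures)


good = [{"title": "A", "url": "u1", "rank": 1}, {"title": "B", "url": "u2", "rank": 2}]
bad = [{"title": "", "url": "u1", "rank": 1}, {"title": "B", "url": "u1", "rank": 0}]
print(check_batch(good))
print(check_batch(bad))
```

The bad batch here is exactly the "200 OK with empty titles and duplicate rows" case: nothing failed on the wire, but the batch still should not be trusted. GX gives you the same pass/fail, plus structured results you can store.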

Let’s turn that idea into code. We’ll get data at scale via Bright Data (a SERP API), run a GX suite over each batch, then store it in DuckDB (a file-backed analytical SQL database that’s just a .duckdb file on disk): clean rows go to the main table, failed batches go to quarantine with enough context to debug.

The Setup

Here’s what you need:

requests>=2.28.0  
python-dotenv>=1.0.0  
duckdb>=1.0.0  
pandas>=2.0.0  
psutil>=5.9.0  
great-expectations>=1.0.0

Note that Python 3.10+ is the floor for Great Expectations 1.x with current pandas / duckdb wheels.

The important ones:

great-expectations: the quality gates. Expectations on each batch, clear pass/fail, and reports. All we need is GX Core.
duckdb: the store. It's an embedded database, so it doesn't need a server to run.
requests: for HTTP to Bright Data’s POST /request endpoint (your zone must be a SERP zone).

Before running pip install -r requirements.txt, create a .env file with:

BRIGHT_DATA_API_KEY=your_api_key  
BRIGHT_DATA_ZONE=serp # or your SERP zone name from the Bright Data dashboard  
BRIGHT_DATA_COUNTRY=us # optional

The Bright Data client reads these when you run the pipeline. You need a SERP-capable zone — without one, POST /request will not match what this code expects. Proxy rotation and unblocking stay on Bright Data’s side.

How Does the Pipeline Actually Work?

The project is split across four files:

bright_data.py # Bright Data API client  
serp_expectations.py # Great Expectations validation suite  
duckdb_store.py # DuckDB schema + insert/quarantine logic  
ingest.py # Pipeline: fetch → validate → store or quarantine

The flow is straightforward: one query equals one batch. For each batch, we check things in a strict order before anything touches the database.

Step 1: Fetching Structured SERP Data with Bright Data

Our API client will just be a thin wrapper around Bright Data’s POST /request endpoint.

bright_data.py

import json  
import os  
from dataclasses import dataclass  
from pathlib import Path  

import requests  
from dotenv import load_dotenv  

_GX_ROOT = Path(__file__).resolve().parent  
load_dotenv(_GX_ROOT / ".env")  


def normalize_serp_payload(raw):  
    """Bright Data may wrap JSON in a string body."""  
    if isinstance(raw, dict) and "body" in raw:  
        body = raw["body"]  
        if isinstance(body, str):  
            return json.loads(body)  
        if isinstance(body, dict):  
            return body  
    return raw  


@dataclass  
class SerpApiResponse:  
    """Bright Data POST /request: HTTP status plus parsed JSON body (if any)."""  

    status_code: int  
    data: dict  


class BrightDataClient:  
    """SERP API zone client; uses https://api.brightdata.com/request"""  

    def __init__(self, api_key=None, zone=None, country=None):  
        self.api_key = api_key or os.getenv("BRIGHT_DATA_API_KEY")  
        self.zone = zone or os.getenv("BRIGHT_DATA_ZONE")  
        self.country = country or os.getenv("BRIGHT_DATA_COUNTRY")  
        self.api_endpoint = "https://api.brightdata.com/request"  

        if not self.api_key:  
            raise ValueError("BRIGHT_DATA_API_KEY must be set in the environment or constructor.")  
        if not self.zone:  
            raise ValueError("BRIGHT_DATA_ZONE must be set in the environment or constructor.")  

        self.session = requests.Session()  
        self.session.headers.update(  
            {  
                "Content-Type": "application/json",  
                "Authorization": f"Bearer {self.api_key}",  
            }  
        )  

    def search(self, query, num_results=10, language=None, country=None):  
        """Raises on non-200 or network error."""  
        serp = self.search_with_status(query, num_results, language, country)  
        if serp.status_code != 200:  
            raise RuntimeError(  
                f"Search request failed with HTTP {serp.status_code}: {serp.data!r}"[:500]  
            )  
        return normalize_serp_payload(serp.data)  

    def search_with_status(self, query, num_results=10, language=None, country=None):  
        """Does not raise on non-200 — for ingest + quarantine routing."""  
        search_url = (  
            f"https://www.google.com/search"  
            f"?q={requests.utils.quote(query)}"  
            f"&num={num_results}"  
            f"&brd_json=1"  
        )  
        if language:  
            search_url += f"&hl={language}&lr=lang_{language}"  

        target_country = country or self.country  
        payload = {"zone": self.zone, "url": search_url, "format": "json"}  
        if target_country:  
            payload["country"] = target_country  

        try:  
            response = self.session.post(self.api_endpoint, json=payload, timeout=30)  
        except requests.exceptions.RequestException as e:  
            return SerpApiResponse(status_code=0, data={"_error": str(e)})  

        data = _parse_response_body(response)  
        return SerpApiResponse(status_code=response.status_code, data=data)  


def _parse_response_body(response):  
    if not response.text:  
        return {}  
    try:  
        return response.json()  
    except ValueError:  
        return {"_non_json_body": response.text[:8000]}

Some things to note here:

search_with_status does not raise on non-200, so ingest can quarantine API failures.
Transport errors are caught, so you get status_code=0 and the same api_error path as HTTP failures.
search() raises on failure, for callers that want exceptions.
And finally, normalize_serp_payload unwraps responses in case the API puts the JSON under a string body.
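
One edge case worth knowing about: as written, normalize_serp_payload calls json.loads on a string body, so a body that isn't valid JSON would raise. If you want that case to flow into quarantine instead, a more defensive variant could look like the sketch below. unwrap_serp_body and the underscore-prefixed marker keys are hypothetical, not part of bright_data.py.

```python
import json


def unwrap_serp_body(raw):
    """Like normalize_serp_payload, but never raises and always returns a dict."""
    if isinstance(raw, dict) and "body" in raw:
        body = raw["body"]
        if isinstance(body, dict):
            return body
        if isinstance(body, str):
            try:
                parsed = json.loads(body)
            except json.JSONDecodeError:
                # Keep a truncated copy so the quarantine row still shows what came back.
                return {"_unparseable_body": body[:8000]}
            return parsed if isinstance(parsed, dict) else {"_unexpected_body": parsed}
    return raw if isinstance(raw, dict) else {"_unexpected_payload": raw}


print(unwrap_serp_body({"body": '{"organic": []}'}))
print(unwrap_serp_body({"body": "<html>blocked</html>"}))
```

Since the marker dicts have no organic key, a 200 response with a junk body would be routed as organic_empty rather than crashing the run.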

Step 2: What Happens to Data That Doesn’t Pass?

Every batch in this pipeline has two possible destinations in DuckDB:

If it clears all our defined quality gates, it goes to serp_results.
If it doesn't — wrong HTTP status, empty organic list, or a failed GX expectation — it lands in serp_quarantine instead, with enough context to understand why.

Let's build that store before we wire up the gates.

duckdb_store.py

import hashlib  
import json  
import os  
from datetime import datetime  

import duckdb  
import pandas as pd  
import psutil  


class SerpStore:  
    """serp_results + serp_quarantine"""  

    def __init__(self, db_path, memory_limit=None):  
        parent = os.path.dirname(db_path)  
        if parent:  
            os.makedirs(parent, exist_ok=True)  

        self.db_path = db_path  
        self.conn = duckdb.connect(db_path)  

        if memory_limit:  
            self.conn.execute(f"SET memory_limit='{memory_limit}'")  
        else:  
            available_memory = psutil.virtual_memory().available  
            memory_gb = max(1, int(available_memory / (1024**3) * 0.8))  # floor at 1GB so a low-memory box never asks for 0GB  
            self.conn.execute(f"SET memory_limit='{memory_gb}GB'")  

        self._create_schema()  

    def _create_schema(self):  
        self.conn.execute("""  
            CREATE TABLE IF NOT EXISTS serp_results (  
                id BIGINT PRIMARY KEY,  
                query TEXT NOT NULL,  
                timestamp TIMESTAMP NOT NULL,  
                result_position INTEGER NOT NULL,  
                title TEXT,  
                url TEXT,  
                snippet TEXT,  
                domain TEXT,  
                rank INTEGER,  
                previous_rank INTEGER,  
                rank_delta INTEGER  
            )  
        """)  
        self.conn.execute("CREATE INDEX IF NOT EXISTS idx_query ON serp_results(query)")  
        self.conn.execute("CREATE INDEX IF NOT EXISTS idx_domain ON serp_results(domain)")  

        self.conn.execute("""  
            CREATE TABLE IF NOT EXISTS serp_quarantine (  
                query TEXT,  
                timestamp TIMESTAMP,  
                reason TEXT,  
                organic_count INTEGER,  
                payload_hash TEXT,  
                http_status INTEGER,  
                raw_json JSON  
            )  
        """)  

    @staticmethod  
    def _payload_hash(payload):  
        canonical = json.dumps(payload, sort_keys=True, default=str)  
        return hashlib.sha256(canonical.encode()).hexdigest()  

    def insert_quarantine(  
        self,  
        query,  
        reason,  
        organic_count,  
        raw_json,  
        http_status,  
        timestamp=None,  
    ):  
        if timestamp is None:  
            timestamp = datetime.now()  
        payload_hash = self._payload_hash(raw_json)  
        self.conn.execute(  
            """  
            INSERT INTO serp_quarantine  
                (query, timestamp, reason, organic_count, payload_hash, http_status, raw_json)  
            VALUES (?, ?, ?, ?, ?, ?, ?)  
            """,  
            [ 
                query,  
                timestamp,  
                reason,  
                organic_count,  
                payload_hash,  
                http_status,  
                json.dumps(raw_json, default=str),  
            ],  
        )  

    def insert_batch(self, vdf, query, timestamp=None):  
        """Insert from the GX validation frame (output of organic_to_validation_df).  

        Store what we validated: normalized url, derived domain, snippet, and positional rank —  
        not a second parse from raw organic dicts (avoids drift vs Great Expectations).  
        """  
        if timestamp is None:  
            timestamp = datetime.now()  
        if vdf.empty:  
            return  

        max_id_result = self.conn.execute("SELECT COALESCE(MAX(id), 0) FROM serp_results").fetchone()  
        next_id = (max_id_result[0] if max_id_result else 0) + 1  

        rows = []  
        for idx, (_, row) in enumerate(vdf.iterrows()):  
            title = "" if pd.isna(row["title"]) else str(row["title"])  
            url = "" if pd.isna(row["url"]) else str(row["url"])  
            snippet = "" if pd.isna(row["snippet"]) else str(row["snippet"])  
            domain = "" if pd.isna(row["domain"]) else str(row["domain"])  
            rank_val = int(row["rank"])  
            rows.append(  
                {  
                    "id": next_id + idx,  
                    "query": query,  
                    "timestamp": timestamp,  
                    "result_position": idx + 1,  
                    "title": title,  
                    "url": url,  
                    "snippet": snippet,  
                    "domain": domain,  
                    "rank": rank_val,  
                    "previous_rank": None,  
                    "rank_delta": None,  
                }  
            )  

        df = pd.DataFrame(rows)  
        self.conn.execute("""  
            INSERT INTO serp_results (id, query, timestamp, result_position, title, url, snippet, domain, rank, previous_rank, rank_delta)  
            SELECT id, query, timestamp, result_position, title, url, snippet, domain, rank, previous_rank, rank_delta FROM df  
        """)  

    def get_row_count(self):  
        result = self.conn.execute("SELECT COUNT(*) FROM serp_results").fetchone()  
        return result[0] if result else 0  

    def close(self):  
        self.conn.close()  

    def __enter__(self):  
        return self  

    def __exit__(self, *args):  
        self.close()

What’s happening here?

The reason field is an enum-like string: api_error, organic_empty, or validation_failed.
The payload_hash is a SHA-256 of the canonical JSON — so you can tell if the same bad payload keeps coming back (very useful!)
Each row stores a JSON blob in raw_json: for api_error and organic_empty it is the normalized SERP dict; for validation_failed it is {"serp": <normalized SERP dict>, "gx_validation": <GX result dict>}, so you still have both the SERP payload and the failing expectation details.

Every bad batch lands in serp_quarantine with a paper trail instead of silently disappearing or, worse, silently making it into serp_results.

When a batch passes all gates, insert_batch writes one row per organic result from the validated DataFrame, using the same normalized fields GX validated. previous_rank / rank_delta are reserved for later rank-tracking; they stay null on first ingest.
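
The payload_hash idea is easy to check in isolation. The sketch below reuses the same canonicalisation as _payload_hash (sorted keys, default=str) on made-up payloads and shows that key order doesn't change the hash, which is what makes "have I seen this bad payload before?" a simple count.

```python
import hashlib
import json
from collections import Counter


def payload_hash(payload) -> str:
    """SHA-256 of the canonical JSON form, mirroring SerpStore._payload_hash."""
    canonical = json.dumps(payload, sort_keys=True, default=str)
    return hashlib.sha256(canonical.encode()).hexdigest()


# Made-up payloads: a and b differ only in key order, c is a different payload.
a = {"organic": [], "general": {"query": "python duckdb"}}
b = {"general": {"query": "python duckdb"}, "organic": []}
c = {"organic": []}

seen = Counter(payload_hash(p) for p in (a, b, c))
repeats = {h: n for h, n in seen.items() if n > 1}
print(len(repeats))
```

Against the real table, the equivalent would be a GROUP BY payload_hash query over serp_quarantine.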

Step 3: The Gate Order — Why Sequence Matters

Bright Data's parsed SERP JSON path (request URL includes brd_json=1 or, in their docs, brd_json=json) returns the same kind of structured object you see in their examples and in real dumps — organic, general, input, and so on — not a page of HTML you parse yourself.

So when Google can't be reached or nothing usable is extracted, that tends to show up as a non-200 from POST /request, a network error (which we also treat as api_error), or an empty organic array, and not as a captcha HTML document mistaken for a normal organic list. So there are exactly three gates to implement, in this order:

non-200,
empty organic,
then GX on whatever is left.

ingest.py

import json  
from datetime import datetime, timezone  
from pathlib import Path  

_ROOT = Path(__file__).resolve().parent  

from bright_data import BrightDataClient, normalize_serp_payload  
from duckdb_store import SerpStore  
from serp_expectations import organic_to_validation_df, validate_organic_batch

def _write_validation_report(report_entries):  
    reports_dir = _ROOT / "data" / "reports"  
    reports_dir.mkdir(parents=True, exist_ok=True)  
    ts = datetime.now(timezone.utc).strftime("%Y%m%dT%H%M%SZ")  
    path = reports_dir / f"validation_{ts}.json"  
    doc = {  
        "generated_at_utc": ts,  
        "batches": report_entries,  
    }  
    path.write_text(json.dumps(doc, indent=2, default=str), encoding="utf-8")  
    return path  


def ingest_live(  
    queries,  
    *,  
    db_path=None,  
    num_results=10,  
    memory_limit="2GB",  
    serp_json_out=None,  
):  
    """  
    One batch = one query → one SERP response.  
    Order: api_error → organic_empty → GX validate → insert or validation_failed quarantine.  
    """  
    path = db_path or str(_ROOT / "data" / "serp_gx.duckdb")  
    client = BrightDataClient()  
    total = 0  
    batches_total = 0  
    empty_batches = 0  
    api_error_batches = 0  
    validation_failed_batches = 0  
    report_entries = []  
    serp_dump_batches = []  

    with SerpStore(path, memory_limit=memory_limit) as store:  
        for q in queries:  
            batches_total += 1  
            serp = client.search_with_status(q, num_results=num_results)  
            raw = normalize_serp_payload(serp.data)  
            if serp_json_out:  
                serp_dump_batches.append(  
                    {  
                        "query": q,  
                        "http_status": serp.status_code,  
                        "normalized_serp": raw,  
                    }  
                )  
            organic = raw.get("organic") or []  
            organic_count = len(organic)  

            # Optional (not implemented): best-effort catch-all for an explicit blocked API response  
            # (e.g. substring-match json.dumps(raw) for "captcha" / "unusual traffic") — just in case.  
            # Not needed here, only include if the API you're using can error out like that  

            if serp.status_code != 200:  
                api_error_batches += 1  
                store.insert_quarantine(q, "api_error", organic_count, raw, serp.status_code)  
                report_entries.append(  
                    {  
                        "query": q,  
                        "outcome": "api_error",  
                        "http_status": serp.status_code,  
                        "organic_count": organic_count,  
                        "gx": None,  
                    }  
                )  
                continue  
            if not organic:  
                empty_batches += 1  
                store.insert_quarantine(q, "organic_empty", 0, raw, serp.status_code)  
                report_entries.append(  
                    {  
                        "query": q,  
                        "outcome": "organic_empty",  
                        "http_status": serp.status_code,  
                        "organic_count": 0,  
                        "gx": None,  
                    }  
                )  
                continue  

            vdf = organic_to_validation_df(organic)  
            gx_ok, gx_payload = validate_organic_batch(vdf, num_results=num_results)  
            if not gx_ok:  
                validation_failed_batches += 1  
                store.insert_quarantine(  
                    q,  
                    "validation_failed",  
                    organic_count,  
                    {"serp": raw, "gx_validation": gx_payload},  
                    serp.status_code,  
                )  
                report_entries.append(  
                    {  
                        "query": q,  
                        "outcome": "validation_failed",  
                        "http_status": serp.status_code,  
                        "organic_count": organic_count,  
                        "gx": gx_payload,  
                    }  
                )  
                continue  

            # Persist the validated DataFrame so DuckDB gets the same normalized url/domain/rank GX checked.  
            store.insert_batch(vdf, q)  
            total += len(vdf)  
            report_entries.append(  
                {  
                    "query": q,  
                    "outcome": "inserted",  
                    "http_status": serp.status_code,  
                    "organic_count": organic_count,  
                    "gx": gx_payload,  
                }  
            )  

        row_count = store.get_row_count()  

    report_path = _write_validation_report(report_entries)  

    stats = {  
        "batches_total": batches_total,  
        "empty_batches": empty_batches,  
        "api_error_batches": api_error_batches,  
        "validation_failed_batches": validation_failed_batches,  
        "empty_rate": (empty_batches / batches_total) if batches_total else 0.0,  
        "api_error_rate": (api_error_batches / batches_total) if batches_total else 0.0,  
        "validation_failed_rate": (validation_failed_batches / batches_total)  
        if batches_total  
        else 0.0,  
        "rows_ingested": total,  
        "serp_results_row_count": row_count,  
        "db_path": path,  
        "validation_report_path": str(report_path),  
    }  
    if serp_json_out:  
        out_path = Path(serp_json_out)  
        out_path.parent.mkdir(parents=True, exist_ok=True)  
        ts = datetime.now(timezone.utc).strftime("%Y%m%dT%H%M%SZ")  
        doc = {  
            "generated_at_utc": ts,  
            "num_results": num_results,  
            "batches": serp_dump_batches,  
        }  
        out_path.write_text(json.dumps(doc, indent=2, default=str), encoding="utf-8")  
        stats["serp_json_path"] = str(out_path.resolve())  

    return stats  


def main():  
    import argparse  

    p = argparse.ArgumentParser(  
        description="gx: Bright Data SERP → Great Expectations → DuckDB (fail fast on GX failure)",  
    )  
    p.add_argument(  
        "queries",  
        nargs="*",  
        default=[ 
            "inference engineering",  
            "python duckdb",  
            "Great Expectations data validation",  
            "machine learning",  
            "opentelemetry how to",  
        ],  
        help="Search queries (default: five example queries if none given)",  
    )  
    p.add_argument("--num-results", type=int, default=10)  
    p.add_argument("--db", type=str, default=None, help="DuckDB file path")  
    p.add_argument(  
        "--save-serp-json",  
        type=str,  
        default=None,  
        metavar="PATH",  
        help="Write normalized SERP (+ query, http_status) per batch to this JSON file.",  
    )  
    args = p.parse_args()  
    out = ingest_live(  
        args.queries,  
        db_path=args.db,  
        num_results=args.num_results,  
        serp_json_out=args.save_serp_json,  
    )  
    print(json.dumps(out, indent=2))  


if __name__ == "__main__":  
    main()

The main loop enforces that order we talked about: api_error and organic_empty are cheap pre-checks that catch the most common failure modes without spinning up GX at all.

In fact, GX only runs when there's actual data to validate. store.insert_batch(vdf, q) passes the validated DataFrame, not the original raw organic list — what goes into DuckDB is exactly what GX checked, with the same URL normalization applied.
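The ordering can be sketched in isolation. This is a minimal illustration, not the article's ingest loop: classify_batch and fake_validate are hypothetical names, and it assumes api_error corresponds to a non-200 HTTP status, which may differ from the real check.

```python
def classify_batch(http_status, organic, validate):
    """Return the outcome for one batch, running the expensive check last."""
    if http_status != 200:      # assumption: a non-200 status means api_error
        return "api_error"
    if not organic:             # cheap check: nothing to validate
        return "organic_empty"
    if not validate(organic):   # the validator only runs when rows exist
        return "validation_failed"
    return "inserted"


validator_calls = []


def fake_validate(rows):
    """Stand-in for the GX suite: records each call, requires a link per row."""
    validator_calls.append(len(rows))
    return all(r.get("link") for r in rows)


print(classify_batch(500, [], fake_validate))
print(classify_batch(200, [], fake_validate))
print(classify_batch(200, [{"link": "https://example.com"}], fake_validate))
```

The first two calls return before the validator is touched, which is the point of putting the cheap checks first.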

Step 4: What Quality Gates Should You Put on SERP Data?

To reiterate, when we say “Quality gates”, we mean explicit Great Expectations rules on the DataFrame we validate: each “expectation” is simply one condition the batch must satisfy before you trust it for analytics.

For this pipeline we normalize the organic list in our API response into a small schema (url, title, snippet, domain, and rank) and run the suite on that frame only.

serp_expectations.py

import uuid  
from urllib.parse import urlparse  

import great_expectations as gx  
import pandas as pd  
from great_expectations.data_context.types.base import ProgressBarsConfig  


# helpers   
def _extract_domain(url):  
    if not url:  
        return ""  
    # Strip only a leading "www." (Python 3.9+), so other parts of the host stay intact.  
    return urlparse(url).netloc.removeprefix("www.")  


def _coerce_rank(raw_rank, fallback):  
    if raw_rank is None:  
        return fallback  
    try:  
        return int(float(raw_rank))  
    except (TypeError, ValueError):  
        return fallback  


def _normalize_url(url):  
    if not url:  
        return url  
    try:  
        parsed = urlparse(url)  
        normalised = parsed._replace(  
            scheme=parsed.scheme.lower(),  
            fragment="",  
        )  
        return normalised.geturl()  
    except Exception:  
        return url  

# before anything, normalize the data  
def organic_to_validation_df(organic):  
    rows = []  
    for i, r in enumerate(organic):  
        raw_url = (r.get("url") or r.get("link") or "").strip()  
        url = _normalize_url(raw_url)  
        title = (r.get("title") or "").strip()  
        snippet = (r.get("snippet") or r.get("description") or "").strip()  
        domain = _extract_domain(url)  
        raw_rank = r.get("rank")  
        rank = _coerce_rank(raw_rank, fallback=i + 1)  
        rows.append(  
            {  
                "url": url,  
                "title": title,  
                "snippet": snippet,  
                "domain": domain,  
                "rank": rank,  
            }  
        )  
    return pd.DataFrame(rows)  

# now add the actual expectations  
def build_serp_expectation_suite(num_results):  
    _meta = {"notes": "SERP organic batch gates for Bright Data brd_json=1 payloads."}  

    return gx.ExpectationSuite(  
        name="serp_organic_batch",  
        expectations=[ 
            gx.expectations.ExpectColumnValuesToNotBeNull(  
                column="url",  
                meta=_meta,  
            ),  
            gx.expectations.ExpectColumnValuesToNotBeNull(  
                column="title",  
                meta=_meta,  
            ),  
            gx.expectations.ExpectColumnValuesToMatchRegex(  
                column="url",  
                regex=r"^https?://\S+",  
                meta=_meta,  
            ),  
            gx.expectations.ExpectColumnValuesToBeUnique(  
                column="url",  
                meta=_meta,  
            ),  
            gx.expectations.ExpectTableRowCountToBeBetween(  
                min_value=1,  
                max_value=num_results,  
                meta=_meta,  
            ),  
            gx.expectations.ExpectColumnValueLengthsToBeBetween(  
                column="title",  
                min_value=1,  
                max_value=500,  
                meta=_meta,  
            ),  
            gx.expectations.ExpectColumnValuesToMatchRegex(  
                column="title",  
                regex=r"\S",  
                meta=_meta,  
            ),  
            gx.expectations.ExpectColumnValueLengthsToBeBetween(  
                column="domain",  
                min_value=1,  
                max_value=253,  
                meta=_meta,  
            ),  
            gx.expectations.ExpectColumnValuesToBeBetween(  
                column="rank",  
                min_value=1,  
                max_value=float(num_results),  
                meta=_meta,  
            ),  
            gx.expectations.ExpectColumnValuesToNotBeNull(  
                column="snippet",  
                mostly=0.8,  
                meta=_meta,  
            ),  
        ],  
    )  

# gx in ephemeral mode  
def validate_organic_batch(df, *, num_results):  
    if df.empty:  
        return False, {  
            "success": False,  
            "statistics": {  
                "evaluated_expectations": 0,  
                "successful_expectations": 0,  
                "unsuccessful_expectations": 0,  
            },  
            "results": [],  
            "exception_message": "validation_frame_empty",  
        }  

    context = gx.get_context(mode="ephemeral")  
    context.variables.progress_bars = ProgressBarsConfig(  
        globally=False,  
        metric_calculations=False,  
    )  

    data_source = context.data_sources.add_pandas(f"serp_{uuid.uuid4().hex[:12]}")  
    asset = data_source.add_dataframe_asset(name="organic_batch")  
    batch_definition = asset.add_batch_definition_whole_dataframe("whole")  
    batch = batch_definition.get_batch(batch_parameters={"dataframe": df})  
    suite = build_serp_expectation_suite(num_results=num_results)  
    result = batch.validate(suite)  
    payload = result.to_json_dict()  
    return bool(result.success), payload

Let’s walk through the decisions we made here, because they’re not arbitrary.

Normalize Before You Validate — Never Inside the Validation Layer

Before anything in GX runs, organic_to_validation_df normalizes the data:

from typing import Any, Dict, List  

def _normalize_url(url: str) -> str:  
    """Lowercase scheme, strip fragment."""  
    parsed = urlparse(url)  
    return parsed._replace(scheme=parsed.scheme.lower(), fragment="").geturl()  

def organic_to_validation_df(organic: List[Dict[str, Any]]) -> pd.DataFrame:  
    rows = []  
    for i, r in enumerate(organic):  
        raw_url = (r.get("url") or r.get("link") or "").strip()  
        url = _normalize_url(raw_url)  
        title = (r.get("title") or "").strip()  
        snippet = (r.get("snippet") or r.get("description") or "").strip()  
        domain = _extract_domain(url)  
        rank = _coerce_rank(r.get("rank"), fallback=i + 1)  
        rows.append({"url": url, "title": title, "snippet": snippet, "domain": domain, "rank": rank})  
    return pd.DataFrame(rows)

A few things happening here:

Bright Data’s brd_json=1 responses use link as the key, not url. We remap it before GX sees it, the same kind of key remapping you’ve probably done for plenty of other API responses.
Fragments are stripped and scheme is lowercased before the uniqueness check — so https://example.com/page#section and https://example.com/page don't pass as different URLs.
_coerce_rank handles the fact that rank values from APIs can come back as int, float, numpy scalars, or even strings. Coerce to a consistent type before GX, not inside an expectation.
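To see these normalizations on concrete values, here is a small standalone sketch. It re-implements the two helpers without the leading underscore so it runs on its own; the sample URLs and rank values are made up for illustration.

```python
from urllib.parse import urlparse


def normalize_url(url):
    """Lowercase the scheme and drop the fragment, as the article's helper does."""
    parsed = urlparse(url)
    return parsed._replace(scheme=parsed.scheme.lower(), fragment="").geturl()


def coerce_rank(raw_rank, fallback):
    """Return rank as an int, or the positional fallback if it cannot be parsed."""
    if raw_rank is None:
        return fallback
    try:
        return int(float(raw_rank))
    except (TypeError, ValueError):
        return fallback


# Two spellings of the same page collapse to one normalized URL.
same_page = {
    normalize_url("https://example.com/page#section"),
    normalize_url("HTTPS://example.com/page"),
}

# int, float, numeric string, missing, and junk all end up as plain ints.
ranks = [coerce_rank(v, fallback=9) for v in (3, 3.0, "3", None, "n/a")]

print(same_page)
print(ranks)
```

Because both spellings normalize to the same string, the uniqueness gate treats them as the duplicate they really are.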

Ephemeral mode

mode="ephemeral" simply means no config files, no persisted context directory, and no great_expectations.yml: a pure in-process validator you spin up per batch. We also turn progress bars off in code, since they only add noise when you validate many batches in a row. Together this keeps GX lightweight and easy to get started with.

Null Checks Must Come Before Regex Gate

ExpectColumnValuesToNotBeNull on url and title must come before the regex and length gates. That's because GX's ExpectColumnValuesToMatchRegex silently skips null values by default — it only evaluates non-null rows. Without an explicit null check, a null URL would slip through the regex gate undetected.
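The effect is easy to reproduce without GX. The sketch below is a plain-Python imitation of the "skip nulls" behaviour described above, not GX's own code; regex_gate and not_null_gate are hypothetical names.

```python
import re

URL_RE = re.compile(r"^https?://\S+")

urls = ["https://example.com/a", None, "https://example.com/b"]


def regex_gate(values):
    """Pass if every non-null value matches; nulls are skipped, not failed."""
    return all(URL_RE.match(v) for v in values if v is not None)


def not_null_gate(values):
    """Pass only if no value is missing."""
    return all(v is not None for v in values)


regex_ok = regex_gate(urls)     # the None row is never evaluated
null_ok = not_null_gate(urls)   # the None row is what fails here

print(regex_ok, null_ok)
```

The regex gate alone reports success on this batch; only the explicit null gate notices the missing URL.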

Why the URL Regex Is Intentionally Loose

^https?://\S+ is deliberately not a strict RFC 3986 URL validator. Overly strict URL validation generates false positives on legitimately unusual URLs — CDN URLs, tracking URLs, URLs with encoded characters. The goal here is to catch the two actual failure modes: an empty string and a non-URL value like "N/A" or a relative path.
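A quick check with Python's re module shows what the loose pattern lets through and what it stops. The candidate strings are invented examples, not values from a real response.

```python
import re

URL_RE = re.compile(r"^https?://\S+")

candidates = [
    "https://cdn.example.com/img/a%20b.png?w=200&utm_source=x",  # unusual but valid
    "http://example.com",
    "",                 # empty string
    "N/A",              # placeholder value
    "/relative/path",   # relative path
]

accepted = [c for c in candidates if URL_RE.match(c)]
rejected = [c for c in candidates if not URL_RE.match(c)]

print(accepted)
print(rejected)
```

The encoded, parameter-heavy CDN URL passes, while the empty string, the placeholder, and the relative path are all rejected.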

URL Uniqueness Catches Parse Artifacts

ExpectColumnValuesToBeUnique on url catches cases where the same URL appears twice in a single batch. That's a real thing that can happen with pagination artifacts or certain response formats. In a ranking pipeline, a duplicate URL means your rank counts are wrong.
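If you want to inspect duplicates yourself while tuning rules, a Counter is enough. This is a standalone sketch with made-up URLs, separate from the GX expectation.

```python
from collections import Counter

urls = [
    "https://example.com/page?q=1",
    "https://example.com/other",
    "https://example.com/page?q=1",
]

# Any URL seen more than once within a single batch is a duplicate.
duplicates = [u for u, n in Counter(urls).items() if n > 1]

print(duplicates)
```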

Use a Fixed Ceiling for Rank and Row Count, Not a Self-Adjusting One

Notice that max_value=float(num_results) for the rank gate, and max_value=num_results for the row count, both use the same value — the same number we used for our SERP API. This is intentional and it matters.

A common mistake is to derive the ceiling from the batch itself:

# Don't do this  
max_rank = max(num_results, int(float(vdf["rank"].max())))

Why? Well, if the API returns a result with rank=47 for a 10-result query, max_rank becomes 47 and the gate passes. You've made the gate self-adjust to whatever the data says, which means the gate can never actually fail on an out-of-range rank. Using a fixed num_results means an unexpected rank value will actually trigger a quarantine.
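The difference is easy to demonstrate with plain Python. In this sketch out_of_range is a hypothetical helper standing in for the rank gate, and the rank values are invented.

```python
num_results = 10
ranks = [1, 2, 47]  # 47 should never appear for a 10-result query


def out_of_range(values, ceiling):
    """Return the ranks that fall outside 1..ceiling."""
    return [r for r in values if not 1 <= r <= ceiling]


self_adjusting_ceiling = max(num_results, max(ranks))  # grows to 47 with the data
fixed_ceiling = num_results                            # stays at 10

missed = out_of_range(ranks, self_adjusting_ceiling)
caught = out_of_range(ranks, fixed_ceiling)

print(missed)
print(caught)
```

With the self-adjusting ceiling nothing is flagged; with the fixed one, rank 47 is reported and the batch can be quarantined.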

Optional Fields Need Soft Gates, Not Hard Ones

snippet uses mostly=0.8 rather than a hard non-null gate. That's because Google legitimately suppresses snippets for some result types — videos, certain knowledge panel entries, sitelinks. A hard non-null gate on snippet would quarantine perfectly valid batches. The mostly parameter lets you say "80% of rows must pass this check" — if more than 20% of rows have no snippet, that's a signal the response structure has changed, not normal Google behaviour.
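The arithmetic behind a soft gate can be sketched in a few lines. passes_mostly is a hypothetical helper that imitates the idea of the mostly parameter, treating None as a missing snippet; it is not GX's implementation.

```python
def passes_mostly(values, mostly):
    """True if at least `mostly` of the values are non-null."""
    if not values:
        return False
    non_null = sum(v is not None for v in values)
    return non_null / len(values) >= mostly


normal_batch = ["some snippet"] * 9 + [None]       # 90% have a snippet
broken_batch = ["some snippet"] * 7 + [None] * 3   # 70% have a snippet

normal_ok = passes_mostly(normal_batch, mostly=0.8)
broken_ok = passes_mostly(broken_batch, mostly=0.8)

print(normal_ok, broken_ok)
```

One missing snippet in ten is tolerated, while three missing in ten trips the gate.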

What Does a Failed GX Validation Actually Look Like?

When a batch fails validation, the gx block in the report tells you exactly which expectation failed and what the unexpected values were. Here's a representative example — two organic rows end up with the same normalized URL (duplicate rows in the response, or two URLs that collapse to one after fragment stripping), so the uniqueness gate fires:

{  
  "query": "inference engineering",  
  "outcome": "validation_failed",  
  "http_status": 200,  
  "organic_count": 2,  
  "gx": {  
    "success": false,  
    "statistics": {  
      "evaluated_expectations": 10,  
      "successful_expectations": 9,  
      "unsuccessful_expectations": 1  
    },  
    "results": [ 
      {  
        "success": false,  
        "expectation_config": {  
          "type": "expect_column_values_to_be_unique",  
          "kwargs": { "column": "url" }  
        },  
        "result": {  
          "element_count": 2,  
          "unexpected_count": 2,  
          "unexpected_percent": 100.0,  
          "partial_unexpected_list": [ 
            "https://example.com/page?q=1",  
            "https://example.com/page?q=1"  
          ]  
        }  
      }  
    ]  
  }  
}

The http_status is 200. Without the quality gate, this batch would have been inserted. The downstream join on URL would have silently doubled a result's apparent frequency in your analytics.

The Validation Report

After every run, we emit a timestamped JSON report with one entry per query:

{  
  "generated_at_utc": "20260326T102005Z",  
  "batches": [ 
    {  
      "query": "inference engineering",  
      "outcome": "inserted",  
      "http_status": 200,  
      "organic_count": 8,  
      "gx": { "success": true, "statistics": { "evaluated_expectations": 10, ... } }  
    },  
    {  
      "query": "some broken query",  
      "outcome": "organic_empty",  
      "http_status": 200,  
      "organic_count": 0,  
      "gx": null  
    }  
  ]  
}

The possible outcome values are inserted, api_error, organic_empty, and validation_failed. Over time, watching the rate of each outcome per query is how you'd catch gradual degradation — a rising organic_empty rate is a signal before it becomes a crisis.
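Turning one report into per-outcome rates takes only a few lines. This is a minimal sketch that assumes the report shape shown above (a top-level batches list whose entries carry an outcome field); outcome_rates is a hypothetical helper name.

```python
from collections import Counter

OUTCOMES = ("inserted", "api_error", "organic_empty", "validation_failed")


def outcome_rates(report):
    """Return the share of batches per outcome for one report document."""
    batches = report.get("batches", [])
    counts = Counter(b["outcome"] for b in batches)
    total = len(batches)
    return {o: (counts[o] / total if total else 0.0) for o in OUTCOMES}


report = {
    "batches": [
        {"query": "inference engineering", "outcome": "inserted"},
        {"query": "some broken query", "outcome": "organic_empty"},
    ]
}

rates = outcome_rates(report)
print(rates)
```

Running this over each timestamped report and plotting the rates per run is one way to spot a slowly rising organic_empty share.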

How to Run the Pipeline

Put the four modules in one package or folder on your PYTHONPATH, install dependencies (pip install -r requirements.txt), and set Bright Data credentials (BRIGHT_DATA_API_KEY, BRIGHT_DATA_ZONE, plus optional BRIGHT_DATA_COUNTRY) in a .env file next to the code or in the environment.

Our orchestrator uses argparse. Here are some flags worth including, mostly for quality of life:

--num-results (default 10),
--db (override the DuckDB file path),
and, optionally, --save-serp-json PATH, which dumps the normalized SERP payloads to a file. I’ve found this really helps while tuning your rules.

# One query  
python ingest.py "inference engineering"  

# Multiple queries (defaults to five example queries if you pass none)  
python ingest.py  

# Example: 20 results, custom DB path  
python ingest.py "python asyncio" --num-results 20 --db ./data/my_run.duckdb

On success, the script prints a small JSON summary — batch counts, rows written, paths to the DuckDB file and the validation report:

{  
  "batches_total": 1,  
  "empty_batches": 0,  
  "api_error_batches": 0,  
  "validation_failed_batches": 0,  
  "empty_rate": 0.0,  
  "api_error_rate": 0.0,  
  "validation_failed_rate": 0.0,  
  "rows_ingested": 8,  
  "serp_results_row_count": 8,  
  "db_path": "./data/serp_gx.duckdb",  
  "validation_report_path": "./data/reports/validation_20260326T102005Z.json"  
}

That’s pretty much everything. I’ve walked you through each file. Just know ingest.py is the entry point.

Frequently Asked Questions (FAQ)

Q: Why not just write the validation logic myself?

A: You can, and for one or two checks, you probably should! The case for GX is not that something like if not url or not url.startswith("http") is hard to write. It's that twenty of those checks scattered across your pipeline become hard to read, hard to audit, and easy to accidentally skip when you're in a hurry. GX gives you a single place where all your rules live, a consistent result structure across every check, and a report that tells you exactly which rule failed and on which values. It saves a ton of trouble in the long run.

Q: Does running GX on every batch slow the pipeline down?

A: In practice, no, not at the batch sizes typical of API calls (let’s say 10–50 rows per query). This is also why I run GX in ephemeral context mode (mode="ephemeral"): it avoids any file I/O or context persistence overhead. The validation itself is pandas operations under the hood anyway.

Q: What happens to bad data in a pipeline without validation?

A: It gets inserted and you don’t find out until something downstream breaks — or worse, until it produces wrong answers that are just plausible enough to go unnoticed. A 200 OK with duplicate rows, empty titles, or out-of-range values looks like a success on the wire. Without an explicit quality gate, that batch goes straight into your database and silently skews every query that touches it.

Q: How do I know what rules to write for my own data?

A: Sample first, then write rules. Pull a handful of real responses from your source API, inspect the fields and edge cases, then encode what “good” looks like. Rules written speculatively against an imagined schema generate false positives; rules written against real samples catch actual failure modes. Start with the fields you’d join on or aggregate in your analytics, and only add gates for everything else if you’ve seen it break.

What GX + Bright Data Gives You

So why does this pattern work reliably?

Bright Data handles the upstream layer. Proxy rotation, bot detection, structured extraction — all abstracted behind a single API call that returns structured JSON.
Great Expectations handles downstream trust. It doesn’t fix bad data. It measures whether incoming data meets your rules before you trust it with your analytics.
The quarantine table (in DuckDB) gives you an audit trail. Not just “something went wrong,” but what the payload was, which expectations failed, and when. You can query the quarantine table to understand your pipeline’s health over time.

Web pipelines that don’t have explicit quality gates just gradually produce wrong answers. Catching that before it reaches your database is the whole point.

Why You Should Add Observability to Your Data Extraction with OpenTelemetry

Prithwish Nath — Mon, 06 Apr 2026 03:54:52 +0000

TL;DR: This is a step-by-step tutorial on the quickest way to add observability to any data ingestion pipeline — whether you’re scraping or using an API.

Anything that fetches data at scale has a class of failure that error handling won’t catch. Not because your error handling code is bad (it probably isn’t), but because retries that eventually succeed, queries that take 10x longer than average, and domains that silently time out don’t throw exceptions: they’re not technically errors. And you’ll never know. The solution is adding proper observability.

Overkill? Not at all. Because a data pipeline — any data pipeline — with network calls, retries, timeouts, and wildly variable latency across different queries and domains is a textbook distributed system. It has all the same failure modes, and so it deserves the same tooling.

In this post, we’ll build a SERP pipeline on top of Bright Data’s API and instrument it with OpenTelemetry (See: Python docs), the open-source standard for distributed tracing. Bright Data reduces blocks and proxy headaches out of the box, but proper OTel tracing shows you exactly where risk remains.

By the end, you’ll be able to see what each call costs you in time, where retries are hiding, and which queries are slow.

What This Actually Gets You

What I’m trying to do is surface problems you’ll probably run into and otherwise just silently pay for. These patterns map nearly 1:1 to whatever your data ingest pipelines look like in production.

Retry storm detection. If a domain starts blocking aggressively, you won’t see it as hard errors, but as a creeping spike in spans with scraper.retries > 0. That’s your early warning before you trigger a full ban or blow past your proxy quota for the month.
Actual cost visibility. Every retry is another proxy request. If you’re paying per request or per GB, scraper.retries on your spans maps directly to a line item on your invoice. You can aggregate this and alert on it. I wasn’t doing this before adding OTel, and most likely, neither were you. 😅
Per-query latency profiling. Some queries are just structurally slower — more competitive terms, heavier result pages, more contention in the proxy pool. Traces let you see this per-query instead of as a blended average that makes everything look fine. Once you can see the outliers, you can do something about them.
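As a rough idea of what that aggregation could look like, here is a sketch over exported spans represented as plain dicts. The span shape is an assumption modelled on the console exporter output, and retry_summary is a hypothetical helper, not part of OpenTelemetry.

```python
def retry_summary(spans):
    """Summarize retry pressure from exported span dicts."""
    searches = [s for s in spans if s.get("name") == "bright_data.search"]
    retries = [s["attributes"].get("scraper.retries", 0) for s in searches]
    retried = [r for r in retries if r > 0]
    return {
        "calls": len(searches),
        "retried_share": len(retried) / len(searches) if searches else 0.0,
        "extra_requests": sum(retries),  # each retry is one more billed request
    }


spans = [
    {"name": "bright_data.search",
     "attributes": {"scraper.query": "python programming", "scraper.retries": 0}},
    {"name": "bright_data.search",
     "attributes": {"scraper.query": "data science", "scraper.retries": 1}},
    {"name": "POST", "attributes": {}},  # auto-instrumented child span, ignored
]

summary = retry_summary(spans)
print(summary)
```

A rising retried_share is the retry-storm signal, and extra_requests is the number that shows up on the invoice.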

Basically, if you take ONE thing away from this read, let it be this: data pipelines have exactly the same failure modes as any distributed system — timeouts, partial failures, retry amplification, silent degradation — whether data is obtained via an API call or just scraping.

So let’s build a data collection stack you can reason about. And, as it turns out, the tooling you’d use for microservices works perfectly well here too.

The Setup

Here’s what you need:

opentelemetry-api>=1.20.0  
opentelemetry-sdk>=1.20.0  
opentelemetry-instrumentation-requests>=0.41b0  
opentelemetry-exporter-otlp-proto-http>=1.20.0  
requests>=2.28.0  
python-dotenv>=1.0.0

The important ones:

opentelemetry-instrumentation-requests — this gives us automatic HTTP tracing. Zero manual work.
opentelemetry-exporter-otlp-proto-http — for when we want to send traces somewhere real, like Jaeger.

Before running pip install -r requirements.txt, create a .env file with:

BRIGHT_DATA_API_KEY=your_api_key  
BRIGHT_DATA_ZONE=serp # or your SERP zone name from Bright Data dashboard  
BRIGHT_DATA_COUNTRY=us # optional  
OTEL_EXPORTER=console   # set to "jaeger" to send traces to Jaeger (must be running)

The client reads these on instantiation. Replace with your own API credentials if you need them, but don’t forget OTEL_EXPORTER — it controls where traces go.

Initializing OpenTelemetry

We want two modes: a console exporter for development where traces print right in the terminal, and an OTLP exporter for production. A single env var switches between them:

import os  
from opentelemetry import trace  
from opentelemetry.instrumentation.requests import RequestsInstrumentor  
from opentelemetry.sdk.trace import TracerProvider  
from opentelemetry.sdk.trace.export import BatchSpanProcessor, ConsoleSpanExporter  
from opentelemetry.sdk.resources import SERVICE_NAME, Resource  


def init_otel(exporter: str = "console", service_name: str = "bd-scraper"):  
    resource = Resource.create(attributes={SERVICE_NAME: service_name})  
    provider = TracerProvider(resource=resource)  

    if exporter == "jaeger":  
        from opentelemetry.exporter.otlp.proto.http.trace_exporter import OTLPSpanExporter  
        endpoint = os.getenv("OTEL_EXPORTER_OTLP_ENDPOINT", "http://localhost:4318")  
        processor = BatchSpanProcessor(OTLPSpanExporter(endpoint=f"{endpoint}/v1/traces"))  
    else:  
        processor = BatchSpanProcessor(ConsoleSpanExporter())  

    provider.add_span_processor(processor)  
    trace.set_tracer_provider(provider)  
    RequestsInstrumentor().instrument()  # this is where the magic happens!  
    return trace.get_tracer(service_name, "1.0.0")

That one line — RequestsInstrumentor().instrument() — hooks into the requests library globally. Every HTTP call your code makes from this point forward gets a trace span, including the ones in third-party code you didn’t write. You get that for free.

One thing that’ll catch you out if you’re not careful: init_otel must run before any requests.Session is created. That means calling it before importing BrightDataClient in your entrypoint. Yes, the import order matters here.

The Client with Custom Spans

Automatic HTTP tracing is great, but it only tells you about the transport layer. It has no idea this call was for the query “machine learning”, or that it targeted google.com, or that it had to retry once before it worked. That context is what custom spans are for.

import json  
import os  
import time  
import requests  
from typing import Optional  

from dotenv import load_dotenv  
load_dotenv()  


class BrightDataClient:  
    def __init__(  
        self,  
        api_key: Optional[str] = None,  
        zone: Optional[str] = None,  
        country: Optional[str] = None,  
    ):  
        self.api_key = api_key or os.getenv("BRIGHT_DATA_API_KEY")  
        self.zone = zone or os.getenv("BRIGHT_DATA_ZONE")  
        self.country = country or os.getenv("BRIGHT_DATA_COUNTRY")  
        self.api_endpoint = "https://api.brightdata.com/request"  

        if not self.api_key or not self.zone:  
            raise ValueError("BRIGHT_DATA_API_KEY and BRIGHT_DATA_ZONE required (env or constructor)")  

        self.session = requests.Session()  
        self.session.headers.update({  
            "Content-Type": "application/json",  
            "Authorization": f"Bearer {self.api_key}",  
        })  

    def search(  
        self,  
        query: str,  
        num_results: int = 10,  
        language: Optional[str] = None,  
        country: Optional[str] = None,  
        max_retries: int = 2,  
    ) -> dict:  
        from opentelemetry import trace  
        from opentelemetry.trace import StatusCode  

        tracer = trace.get_tracer(__name__, "1.0.0")  
        target_domain = "google.com"  

        with tracer.start_as_current_span("bright_data.search") as span:  
            span.set_attribute("scraper.query", query)  
            span.set_attribute("scraper.target_domain", target_domain)  
            span.set_attribute("scraper.num_results", num_results)  

            start = time.perf_counter()  
            last_err = None  

            for attempt in range(max_retries + 1):  
                try:  
                    result = self._do_search(query, num_results, language, country)  
                    latency_ms = (time.perf_counter() - start) * 1000  
                    span.set_attribute("scraper.latency_ms", round(latency_ms, 2))  
                    span.set_attribute("scraper.retries", attempt)  
                    # clean success — no retries needed  
                    if attempt == 0:  
                        span.set_status(StatusCode.OK)  
                    else:  
                        # recovered, but we want this surfaced in Jaeger  
                        span.set_status(StatusCode.ERROR, "Recovered after retry")  
                    return result  
                except Exception as e:  
                    last_err = e  
                    span.set_attribute("scraper.retries", attempt + 1)  
                    if attempt < max_retries:  
                        time.sleep(0.5 * (attempt + 1))  

            # all retries exhausted  
            span.set_attribute("scraper.error", str(last_err))  
            span.set_status(StatusCode.ERROR, str(last_err))  
            span.record_exception(last_err)  
            raise last_err  

    def _do_search(  
        self,  
        query: str,  
        num_results: int,  
        language: Optional[str],  
        country: Optional[str],  
    ) -> dict:  
        search_url = (  
            f"https://www.google.com/search"  
            f"?q={requests.utils.quote(query)}&num={num_results}&brd_json=1"  
        )  
        if language:  
            search_url += f"&hl={language}&lr=lang_{language}"  
        target_country = country or self.country  
        payload = {"zone": self.zone, "url": search_url, "format": "json"}  
        if target_country:  
            payload["country"] = target_country  

        response = self.session.post(self.api_endpoint, json=payload, timeout=30)  
        response.raise_for_status()  
        result = response.json()  
        # Bright Data may return body as JSON string — unpack it  
        if isinstance(result, dict) and "body" in result:  
            body = result["body"]  
            result = json.loads(body) if isinstance(body, str) else body  
        return result

I’m using a SERP API for data, but swap it out with whatever you’re using. The concepts apply to anything. Also, a couple of things worth understanding here:

The Parent-Child Relationship

The _do_search helper builds the target URL and POSTs it to our API endpoint (api.brightdata.com/request). When that call runs, the RequestsInstrumentor auto-creates a child POST span inside our bright_data.search parent span. They share the same trace_id.

In Jaeger, you’ll get a proper timeline: the outer business operation wrapping the inner HTTP call. That nesting is what makes traces actually useful — you see the whole story, not just individual events.

You Have to Set Span Status Yourself

This one surprised me. OTel records all the data you throw at it, but it won’t decide what matters on your behalf. If you don’t explicitly call span.set_status(…) (https://opentelemetry.io/docs/languages/python/instrumentation/#set-span-status), every span stays UNSET, even when a retry happened underneath. A query that timed out, retried, and recovered would be completely invisible to a Jaeger filter like status=ERROR. You’d never find it.

So there’s a deliberate tradeoff we’re making in the code above: recovered retries are marked ERROR so they show up in dashboards. Some teams prefer to use OK and add a scraper.recovered = true attribute instead, keeping error rate metrics clean.

Honestly, both are fine 🤷‍♂️ It just depends on whether you want alerting to treat “degraded success” as a failure. The important thing is to choose consciously, and not fall through to UNSET by accident.
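One way to keep that choice explicit is to isolate it in a small function. The sketch below is pure Python with hypothetical names (span_outcome, recovered_is_error); the returned status name would map onto StatusCode.OK or StatusCode.ERROR in the span.set_status call, and the attributes onto span.set_attribute.

```python
def span_outcome(attempt, recovered_is_error=True):
    """Decide status and attributes for a call that succeeded on `attempt`."""
    attrs = {"scraper.retries": attempt}
    if attempt == 0:
        return "OK", attrs            # clean success
    if recovered_is_error:
        return "ERROR", attrs         # surface degraded success in error filters
    attrs["scraper.recovered"] = True
    return "OK", attrs                # keep error rate clean, flag via attribute


print(span_outcome(0))
print(span_outcome(1))
print(span_outcome(1, recovered_is_error=False))
```

Whichever policy you pick, every successful path returns an explicit status, so nothing falls through to UNSET.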

Putting It All Together

Let’s do this in a file, call it something like scraper.py:

import argparse  
import os  
import time  

from dotenv import load_dotenv  
load_dotenv()  
_exporter = os.getenv("OTEL_EXPORTER", "console") # console fallback as a default  

from otel_config import init_otel  
init_otel(exporter=_exporter)  # must come before BrightDataClient import  

from bright_data_otel import BrightDataClient  


def run(calls: int = 10, delay: float = 0.5):  
    queries = [  
        "python programming", "machine learning", "web development",  
        "data science", "cloud computing",  
    ]  
    client = BrightDataClient()  
    start = time.time()  
    for i in range(calls):  
        q = queries[i % len(queries)]  
        try:  
            data = client.search(q, num_results=5)  
            n = len(data.get("organic", [])) if isinstance(data, dict) else 0  
            print(f"  [{i+1}/{calls}] {q}: {n} results")  
        except Exception as e:  
            print(f"  [{i+1}/{calls}] {q}: error — {e}")  
        if i < calls - 1:  
            time.sleep(delay)  
    print(f"Done in {time.time() - start:.1f}s")  


if __name__ == "__main__":  
    p = argparse.ArgumentParser()  
    p.add_argument("--count", type=int, default=10)  
    p.add_argument("--delay", type=float, default=0.5)  
    args = p.parse_args()  
    run(calls=args.count, delay=args.delay)

Run it in console mode:

> python scraper.py --count 5

Or point it at Jaeger:

# to be honest, just set OTEL_EXPORTER in .env  
> OTEL_EXPORTER=jaeger python scraper.py --count 5

What the Traces Actually Show

Here’s what the terminal printed for my five-query run:

[1/5] python programming: 6 results  
[2/5] machine learning: 8 results  
[3/5] web development: 9 results  
[4/5] data science: 9 results  
[5/5] cloud computing: 9 results  
Done in 75.1s

Five queries, all with results, no errors printed. Looks perfectly healthy... right? Let’s see what the traces say.

The clean calls…

For python programming, the bright_data.search span looks exactly as expected:

{  
  "name": "bright_data.search",  
  "attributes": {  
    "scraper.query": "python programming",  
    "scraper.target_domain": "google.com",  
    "scraper.latency_ms": 3686.01,  
    "scraper.retries": 0  
  }
}

3.7 seconds, zero retries, one nested POST span confirming the HTTP round-trip happened exactly once. Looks good! Moving on.

…and the ones that weren’t so clean.

The query data science printed 9 results. Except the traces show three spans for that single call:

{  
  "name": "POST",  
  "status": {  
    "status_code": "ERROR",  
    "description": "ReadTimeout: ...Read timed out. (read timeout=30)"  
  },  
  "start_time": "2026-03-20T21:18:24.986273Z",  
  "end_time": "2026-03-20T21:18:54.999505Z",  
  "events": [{  
    "name": "exception",  
    "attributes": {  
      "exception.type": "requests.exceptions.ReadTimeout",  
      "exception.stacktrace": "..."  
    }  
  }]  
}  
{  
  "name": "POST",  
  "status": { "status_code": "UNSET" },  
  "start_time": "2026-03-20T21:18:55.505186Z",  
  "end_time": "2026-03-20T21:19:20.097874Z"  
}  
{  
  "name": "bright_data.search",  
  "attributes": {  
    "scraper.query": "data science",  
    "scraper.latency_ms": 55113.46,  
    "scraper.retries": 1  
  }  
}

Turns out, this query hit the 30-second read timeout, waited for the retry backoff, tried again, and finally came back with data — costing you two proxy requests and 55 seconds instead of one request and ~4 seconds.
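The 55 seconds can be reconstructed from the span timestamps above. Here is a minimal sketch that uses only the start and end times printed in those two POST spans; the helper name seconds_between is my own and not part of any OTel API:

```python
from datetime import datetime


def seconds_between(start: str, end: str) -> float:
    """Elapsed seconds between two ISO-8601 UTC timestamps ending in 'Z'."""
    # fromisoformat() only accepts a trailing 'Z' on Python 3.11+, so normalize it
    def parse(ts: str) -> datetime:
        return datetime.fromisoformat(ts.replace("Z", "+00:00"))

    return (parse(end) - parse(start)).total_seconds()


# Timestamps copied from the two POST spans above
first_attempt = seconds_between("2026-03-20T21:18:24.986273Z", "2026-03-20T21:18:54.999505Z")
backoff = seconds_between("2026-03-20T21:18:54.999505Z", "2026-03-20T21:18:55.505186Z")
second_attempt = seconds_between("2026-03-20T21:18:55.505186Z", "2026-03-20T21:19:20.097874Z")

print(f"first attempt (timed out): {first_attempt:.1f}s")
print(f"retry backoff:             {backoff:.1f}s")
print(f"second attempt:            {second_attempt:.1f}s")
print(f"total:                     {first_attempt + backoff + second_attempt:.1f}s")
```

That works out to roughly 30.0s spent waiting on the timeout, 0.5s of backoff, and 24.6s for the retry that succeeded, which adds up to the ~55.1s the parent span reports.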

You’d have absolutely no idea from the terminal output itself.

This happens more often than you’d think: failures that silently blend in with the clean calls around them, and that your pipeline still declares a success. That’s the whole argument for adding observability to your data ingest pipeline, right there.

This failure was invisible. No exception or warning, nothing in your logs. OTel may not have prevented this failure — it has nothing to do with networking or data ingestion — but it definitely made it impossible to miss.

The Latency Breakdown Across All Five Queries

Query	Latency	Retries
python programming	3,686ms	0
machine learning	6,558ms	0
web development	3,079ms	0
data science	55,113ms	1
cloud computing	4,600ms	0

The scraper.target_domain attribute lets you aggregate this same breakdown per domain when you scrape multiple targets (e.g. google.com vs bing.com).

Four of five calls were clean. One took roughly 12x longer than the average of the other four (about 4.5 seconds), and you only know that because you were looking at traces.

If you’re running this at scale across hundreds of queries, that pattern (most queries fast, some consistently slow or timeout-prone) is exactly the info you need to tune retry budgets, adjust per-query timeouts, or start asking why that “data science” query keeps choking.
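As a rough illustration of how you might surface that pattern automatically, here is a small sketch that flags any call that retried or ran well past the typical latency. The numbers come from the table above; the plain dict keys and the "3x the median" threshold are my own assumptions, not something OTel or Jaeger prescribes:

```python
from statistics import median

# Latency and retry count per query, taken from the table above
spans = [
    {"query": "python programming", "latency_ms": 3686, "retries": 0},
    {"query": "machine learning", "latency_ms": 6558, "retries": 0},
    {"query": "web development", "latency_ms": 3079, "retries": 0},
    {"query": "data science", "latency_ms": 55113, "retries": 1},
    {"query": "cloud computing", "latency_ms": 4600, "retries": 0},
]


def flag_slow_spans(spans, factor=3.0):
    """Return spans that retried or took more than `factor` x the median latency."""
    typical = median(s["latency_ms"] for s in spans)
    return [s for s in spans if s["retries"] > 0 or s["latency_ms"] > factor * typical]


for s in flag_slow_spans(spans):
    print(f"{s['query']}: {s['latency_ms']} ms, {s['retries']} retries")
```

The median is used instead of the mean so that a single 55-second outlier doesn't drag the baseline up and hide itself.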

You can’t turn knobs on what you can’t see, after all. Adding observability gives you visibility into things you may not even have thought of.

Going to Production with Jaeger

The console exporter is great for development, but for anything actually running in production you want traces going somewhere persistent. The easiest starting point is Jaeger’s all-in-one Docker image:

docker run -d --name jaeger \
  -p 16686:16686 \
  -p 4318:4318 \
  jaegertracing/all-in-one:latest

Then, as before:

> python scraper.py --count 10

After this run, open http://localhost:16686, search for bd-scraper (or whatever you called yours) and you’ll see each bright_data.search span as a row in the trace timeline with the nested POST spans inside.

The data science query was slow again, but hey, at least it didn’t fail? Small victories. 😅 It stands out immediately in the Jaeger UI: its span is wider than everything else on the screen (~17 seconds).

This is what http://localhost:16686/search will look like after a run.

Click through it for more info.

And you can expand the traces here to as fine-grained a level of detail as you need.

For a real setup, swap Jaeger out for whatever backend you already run. Grafana Tempo, Honeycomb, Datadog — the OTLP exporter speaks the same protocol to all of them.

That’s everything! Feel free to reach out on LinkedIn if you have questions, or leave a comment below.👋

Building a Local Data Analytics Pipeline with dbt Core and DuckDB

Prithwish Nath — Wed, 18 Mar 2026 16:49:14 +0000

TL;DR: This pipeline uses dbt Core + DuckDB locally — no infrastructure — to normalize domains, deduplicate URLs, enforce data contracts via tests, and materialize four analyst-ready mart tables from raw SERP API output.


After web ingestion, you’ll have inconsistent domains, duplicate URLs across collection runs, null titles, and more. This is not wrong data, per se, just unprocessed data. The gap between “data in a table” and “data you can trust in a query” is bigger than you think.

dbt (data build tool) is an open-source transformation framework that can help us with exactly that problem: you write SQL models, it materializes them in dependency order, and it tracks lineage from raw source to final output. Paired with DuckDB via the community dbt-duckdb adapter (no infrastructure needed, it’s all .duckdb files), it's a surprisingly capable local setup for closing that gap.

I’ll walk you through the Python-based pipeline I use, one that takes SERP data and produces analytics-ready tables.

Requirements

What you need: Python 3.x, first of all. Then we can install our requirements like so:

pip install dbt-core dbt-duckdb duckdb requests python-dotenv pandas

For ingestion, we’ll be using a SERP API — I’m using the one I have access to, Bright Data. For this, you’ll need an account with a SERP zone (get API key and zone from its dashboard).

Create a project directory with this layout:

ingest/ for the Python scripts (bright_data.py, duckdb_manager.py, scraper.py),
models/ for dbt (with subfolders staging/, intermediate/, marts/),
data/ for the .duckdb files, and
profiles.yml and dbt_project.yml at the root.
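If you'd rather not create those folders by hand, a small helper along these lines would do it. This is just a convenience sketch: the function name scaffold is hypothetical, and it only creates empty folders and empty config files for the layout listed above.

```python
from pathlib import Path

# Directory and file names follow the layout described above
DIRS = ["ingest", "models/staging", "models/intermediate", "models/marts", "data"]
FILES = ["profiles.yml", "dbt_project.yml"]


def scaffold(root: str = ".") -> None:
    """Create the project folders and empty config files if they don't exist yet."""
    base = Path(root)
    for d in DIRS:
        (base / d).mkdir(parents=True, exist_ok=True)
    for f in FILES:
        # touch() leaves an existing file's contents alone
        (base / f).touch(exist_ok=True)
```

Call scaffold() from an empty project directory; running it twice is harmless because nothing existing is overwritten.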

So there are two main phases: a Python ingest layer that collects and streams results into DuckDB, and a dbt transformation layer with three tiers — staging, intermediate, and marts.

Phase 1: Ingesting SERP Data into DuckDB

Before dbt has anything to work with, we need data in the database. The ingest layer is three Python files: bright_data.py wraps the Bright Data SERP API, duckdb_manager.py handles the DuckDB connection and schema, and scraper.py orchestrates the collection loop. Place them in an ingest/ subdirectory.

The Bright Data Client

Bright Data’s SERP API works differently from a proxy setup. Rather than routing requests through a proxy, you POST a target URL to their API endpoint and get back structured JSON.

ingest/bright_data.py:

"""  
Bright Data SERP API client for fetching search results  
"""  

import os  
import requests  
from typing import Dict, Any, Optional  
from dotenv import load_dotenv  

load_dotenv()  


class BrightDataClient:  
    """  
    Client for Bright Data SERP API  
    Uses the SERP API endpoint (not proxy) for Google search access  
    """  

    def __init__(  
        self,  
        api_key: Optional[str] = None,  
        zone: Optional[str] = None,  
        country: Optional[str] = None  
    ):  
        env_api_key = os.getenv("BRIGHT_DATA_API_KEY")  
        env_zone = os.getenv("BRIGHT_DATA_ZONE")  
        env_country = os.getenv("BRIGHT_DATA_COUNTRY")  

        self.api_key = api_key or env_api_key  
        self.zone = zone or env_zone  
        self.country = country or env_country  
        self.api_endpoint = "https://api.brightdata.com/request"  

        if not self.api_key:  
            raise ValueError(  
                "BRIGHT_DATA_API_KEY must be provided via constructor or environment variable. "  
                "Get your API key from: https://brightdata.com/cp/setting/users"  
            )  

        if not self.zone:  
            raise ValueError(  
                "BRIGHT_DATA_ZONE must be provided via constructor or environment variable. "  
                "Manage zones at: https://brightdata.com/cp/zones"  
            )  

        self.session = requests.Session()  
        self.session.headers.update({  
            'Content-Type': 'application/json',  
            'Authorization': f'Bearer {self.api_key}'  
        })  

    def search(  
        self,  
        query: str,  
        num_results: int = 10,  
        language: Optional[str] = None,  
        country: Optional[str] = None  
    ) -> Dict[str, Any]:  
        """  
        Execute a Google search via Bright Data SERP API  

        Args:  
            query: Search query string  
            num_results: Number of results to return (default: 10)  
            language: Language code (e.g., 'en', 'es', 'fr')  
            country: Country code (e.g., 'us', 'uk', 'ca')  

        Returns:  
            Dictionary containing search results in JSON format  
        """  
        search_url = (  
            f"https://www.google.com/search"  
            f"?q={requests.utils.quote(query)}"  
            f"&num={num_results}"  
            f"&brd_json=1"  
        )  

        if language:  
            search_url += f"&hl={language}&lr=lang_{language}"  

        target_country = country or self.country  

        payload = {  
            'zone': self.zone,  
            'url': search_url,  
            'format': 'json'  
        }  

        if target_country:  
            payload['country'] = target_country  

        try:  
            response = self.session.post(  
                self.api_endpoint,  
                json=payload,  
                timeout=30  
            )  
            response.raise_for_status()  
            return response.json()  

        except requests.exceptions.HTTPError as e:  
            error_msg = f"Search request failed with HTTP {e.response.status_code}"  
            if e.response.text:  
                error_msg += f": {e.response.text[:200]}"  
            raise RuntimeError(error_msg) from e  
        except requests.exceptions.RequestException as e:  
            raise RuntimeError(f"Search request failed: {e}") from e

Note the brd_json=1 parameter appended to the Google search URL: that’s what tells Bright Data to parse the response and return structured data rather than raw HTML.

Configuration

API key, zone, country — read from environment variables with constructor overrides, so the client works both in scripts and in environments where secrets come in differently.
Without BRIGHT_DATA_API_KEY and BRIGHT_DATA_ZONE set, it raises immediately with a message pointing to the right place in the Bright Data dashboard.
The client also supports a language parameter for non-English or multi-region SERP analysis: pass language="en" to search() and it is added to the URL as hl=en and lr=lang_en. For geo-targeting, pass country to search() or set BRIGHT_DATA_COUNTRY.
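To see exactly what ends up in the url field of the payload, here is a standalone sketch of the same URL construction. The function name build_search_url is mine, and it uses urllib.parse.quote directly (requests.utils.quote is that same function re-exported), so it runs without any network access:

```python
from typing import Optional
from urllib.parse import quote


def build_search_url(query: str, num_results: int = 10, language: Optional[str] = None) -> str:
    """Rebuild the Google URL that search() sends to Bright Data as the 'url' field."""
    url = f"https://www.google.com/search?q={quote(query)}&num={num_results}&brd_json=1"
    if language:
        # hl sets the interface language, lr restricts results to that language
        url += f"&hl={language}&lr=lang_{language}"
    return url


print(build_search_url("data science", num_results=5, language="en"))
# https://www.google.com/search?q=data%20science&num=5&brd_json=1&hl=en&lr=lang_en
```

Printing the URL like this is a cheap way to debug a query that returns unexpected results before spending an API call on it.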

The DuckDB Manager

The DuckDB Python API lets us create files, run SQL, and insert from pandas DataFrames directly. We’ll use two databases:

serp_data.duckdb — the source DB that holds raw ingest output, and
serp_analytics.duckdb — our analytics DB, one that holds the transformed models.

dbt attaches the source DB read-only and writes only to the analytics DB, so raw data stays untouched.

💡 The keen among you may have noticed that I'm basically stealing the medallion architecture pattern from Databricks/BigQuery projects here. So "bronze" stays untouched, "silver/gold" tables are derived. Why do this? If a dbt model has a bug and you materialize garbage into analytics, your raw data is completely clean and you just re-run.

For our source DB, the schema is simple: one serp_results table with indexes on query and domain — the two fields that get hit hardest in the dbt transformations downstream.

ingest/duckdb_manager.py

"""  
DuckDB connection and schema management for SERP ingest  
"""  

import duckdb  
import os  
from typing import Optional, List, Dict, Any  
from datetime import datetime  

class DuckDBManager:  
    """Manages DuckDB connection, schema, and insert operations"""  

    def __init__(self, db_path: str = "data/serp_data.duckdb"):     
        # db_path = "serp_data.duckdb" gives dirname = "", which can cause issues on some setups   
        # So...safer to guard like so:  
        parent = os.path.dirname(db_path)  
        if parent:  
            os.makedirs(parent, exist_ok=True)    

        self.db_path = db_path  
        self.conn = duckdb.connect(db_path)  
        self._create_schema()  

    def _create_schema(self):  
        self.conn.execute("""  
            CREATE TABLE IF NOT EXISTS serp_results (  
                id BIGINT PRIMARY KEY,  
                query TEXT NOT NULL,  
                timestamp TIMESTAMP NOT NULL,  
                result_position INTEGER NOT NULL,  
                title TEXT,  
                url TEXT,  
                snippet TEXT,  
                domain TEXT,  
                rank INTEGER  
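            )
        """)
        # Indexes on the two columns the dbt models filter and group by most
        # (the index names here are illustrative)
        self.conn.execute("CREATE INDEX IF NOT EXISTS idx_serp_query ON serp_results(query)")
        self.conn.execute("CREATE INDEX IF NOT EXISTS idx_serp_domain ON serp_results(domain)")
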

    def insert_batch(self, results: List[Dict[str, Any]], query: str, timestamp: Optional[datetime] = None):  
        if timestamp is None:  
            timestamp = datetime.now()  

        if not results:  
            return  

        def extract_domain(url: str) -> str:  
            if not url:  
                return ""  
            try:  
                from urllib.parse import urlparse  
                parsed = urlparse(url)  
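                netloc = parsed.netloc
                # Strip only a leading "www." (str.replace would also remove "www." mid-hostname)
                return netloc[4:] if netloc.startswith("www.") else netloc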
            except Exception:  
                return ""  

        max_id_result = self.conn.execute("SELECT COALESCE(MAX(id), 0) FROM serp_results").fetchone()  
        next_id = (max_id_result[0] if max_id_result else 0) + 1  

        rows = []  
        for idx, result in enumerate(results):  
            url = result.get('url', result.get('link', ''))  
            domain = extract_domain(url)  

            rows.append({  
                'id': next_id + idx,  
                'query': query,  
                'timestamp': timestamp,  
                'result_position': idx + 1,  
                'title': result.get('title', ''),  
                'url': url,  
                'snippet': result.get('snippet', result.get('description', '')),  
                'domain': domain,  
                'rank': idx + 1  
            })  

        import pandas as pd  
        df = pd.DataFrame(rows)  
        # Best practice is to specify columns explicitly (like, INSERT INTO t (a,b,c) SELECT a,b,c FROM df)   
        # to avoid mismatch if table or DataFrame order changes  
        self.conn.execute("""  
            INSERT INTO serp_results (id, query, timestamp, result_position, title, url, snippet, domain, rank)  
            SELECT id, query, timestamp, result_position, title, url, snippet, domain, rank FROM df  
        """)    

    def get_row_count(self) -> int:  
        result = self.conn.execute("SELECT COUNT(*) FROM serp_results").fetchone()  
        return result[0] if result else 0  

    def close(self):  
        self.conn.close()  

    def __enter__(self):  
        return self  

    def __exit__(self, exc_type, exc_val, exc_tb):  
        self.close()

Just in case, the insert_batch method handles inconsistent SERP field names (link vs url, description vs snippet).

Also, it’s worth being honest about what the domain extraction is and isn’t: it’s a best-effort, ingest-time extraction, not the canonical domain value. The dbt staging model is where the real normalization happens: lowercasing, stripping www., and falling back to regex extraction when the field is missing. The ingest layer just makes sure something is in the column.

Making the Two Work Together to Ingest Web Data

The scraper loops through a list of queries, calls the Bright Data client for each, and streams results into DuckDB in batches. Put a .env file in the project root with BRIGHT_DATA_API_KEY and BRIGHT_DATA_ZONE, then run from the project root:

python ingest/scraper.py --count 50000

Some other time-saving flags to add:

--queries "foo" "bar" for custom queries (here, I’ve made it default to 10 tech keywords),

--batch-size 10 for results per API call,

--delay 1.0 for seconds between calls, and

--db path/to/serp_data.duckdb to override the output path (default: data/serp_data.duckdb).

"""  
SERP scraper that streams results to DuckDB (data/serp_data.duckdb).  
Run from project root. Then run dbt to build analytics.  
"""  

import argparse    
import time    
import sys    
import os    
from pathlib import Path    

sys.path.insert(0, os.path.dirname(os.path.abspath(__file__)))  

from bright_data import BrightDataClient    
from duckdb_manager import DuckDBManager  

def _default_db_path() -> str:    
    script_dir = Path(__file__).resolve().parent    
    project_root = script_dir.parent    
    return str(project_root / "data" / "serp_data.duckdb")  

def scrape_and_insert(
    total_results: int,
    queries: list = None,
    batch_size: int = 10,
    delay_seconds: float = 1.0,
    db_path: str = None,
):
    if queries is None:    
        queries = [    
            "python programming", "machine learning", "web development",    
            "data science", "cloud computing", "javascript frameworks",    
            "database design", "API development", "devops tools", "cybersecurity"    
        ]  

    db_path = db_path or _default_db_path()    
    client = BrightDataClient()  

    with DuckDBManager(db_path) as db:    
        print(f"Starting scrape: {total_results} total results")    
        print(f"Database: {db.db_path}")    
        print("Then run: dbt run --profiles-dir .")  

        results_scraped = 0    
        query_idx = 0    
        start_time = time.time()    

        try:    
            while results_scraped < total_results:    
                query = queries[query_idx % len(queries)]  

                try:    
                    serp_data = client.search(query, num_results=batch_size)  

                    organic_results = []    
                    if isinstance(serp_data, dict):    
                        if 'organic' in serp_data:    
                            organic_results = serp_data['organic']    
                        elif 'body' in serp_data and isinstance(serp_data['body'], dict):    
                            if 'organic' in serp_data['body']:    
                                organic_results = serp_data['body']['organic']  

                    if organic_results:    
                        db.insert_batch(organic_results, query)    
                        results_scraped += len(organic_results)      
                        print(f"[{results_scraped}/{total_results}] Query: '{query}' | Inserted: {len(organic_results)}")    

                    query_idx += 1  

                    if results_scraped < total_results:    
                        time.sleep(delay_seconds)  

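                except Exception as e:
                    print(f"Error scraping query '{query}': {e}")
                    query_idx += 1
                    # Back off before the next call so repeated failures don't hammer the API
                    time.sleep(delay_seconds)
                    continue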

        except KeyboardInterrupt:    
            print("\nScraping interrupted by user")  

        elapsed = time.time() - start_time    
        final_count = db.get_row_count()    

        print(f"\n=== Scraping Complete ===")    
        print(f"Total rows in DB: {final_count}")    
        print(f"Time elapsed: {elapsed:.2f}s")    
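        # Rate reflects this run only; final_count may include rows from earlier runs
        rate = results_scraped / elapsed if elapsed > 0 else 0.0
        print(f"Rate: {rate:.1f} rows/sec")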

if __name__ == "__main__":    
    parser = argparse.ArgumentParser(description="Scrape SERP results to DuckDB for dbt")    
    parser.add_argument("--count", type=int, default=50000)    
    parser.add_argument("--batch-size", type=int, default=10)    
    parser.add_argument("--delay", type=float, default=1.0)    
    parser.add_argument("--queries", nargs="+")    
    parser.add_argument("--db")  

    args = parser.parse_args()  

    scrape_and_insert(    
        total_results=args.count,    
        queries=args.queries,    
        batch_size=args.batch_size,    
        delay_seconds=args.delay,    
        db_path=args.db    
    )

A few things in here that are easy to overlook:

The SERP API response structure isn’t always consistent: organic results can live at serp_data['organic'] or nested at serp_data['body']['organic'], depending on the response format and search engine (Bright Data supports Google as well as others), so the parser checks both.
There’s a configurable delay between API calls ( --delay, default 1 second) to avoid hammering the API.
Progress is printed after each batch so you can track the run.
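If you want to unit-test the response handling without calling the API, the two-shape lookup can be pulled out into a small function. This is a sketch of the same checks the loop performs; the name extract_organic is hypothetical and not part of scraper.py as shown:

```python
from typing import Any, Dict, List


def extract_organic(serp_data: Any) -> List[Dict[str, Any]]:
    """Return organic results from either response shape, or [] if none are present."""
    if not isinstance(serp_data, dict):
        return []
    # Shape 1: results at the top level
    if "organic" in serp_data:
        return serp_data["organic"] or []
    # Shape 2: results nested under "body"
    body = serp_data.get("body")
    if isinstance(body, dict):
        return body.get("organic") or []
    return []


print(len(extract_organic({"organic": [{"title": "a"}]})))            # 1
print(len(extract_organic({"body": {"organic": [{"title": "b"}]}})))  # 1
print(len(extract_organic({"body": "<html>...</html>"})))             # 0
```

Returning an empty list for anything unexpected keeps the caller's "if organic_results:" check simple.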

When the scrape finishes, you have a populated data/serp_data.duckdb file with a serp_results table. That’s our raw data. We now have everything dbt needs to begin.

Phase 2: Transforming with dbt

Connecting dbt to DuckDB

The dbt-duckdb community adapter supports attaching multiple database files, which is exactly what we need here: we’ll make dbt write transformed models to serp_analytics.duckdb while treating serp_data.duckdb as a read-only source. This separation is a good idea because you never want a transformation step to accidentally mutate the raw data it's reading from.

You can read more about configuring the adapter in the dbt-duckdb documentation.

profiles.yml is where the connection lives. The attach block is the part that makes this work — it mounts the raw DB under the alias serp_source, which is the database name the source declaration will reference:

profiles.yml:

serp_analytics:  
  target: dev  
  outputs:  
    dev:  
      type: duckdb  
      path: data/serp_analytics.duckdb  
      threads: 1  
      settings:  
        max_temp_directory_size: '10GB'  
        memory_limit: '12GB'  
        preserve_insertion_order: false  
      attach:  
        - path: data/serp_data.duckdb
          type: duckdb
          alias: serp_source
          read_only: true

Tune memory_limit and max_temp_directory_size based on how much data you’re working with. I’ve optimized these values for a stress-free 50k-row scrape but you may want to lower them on smaller machines.

dbt_project.yml holds the project-level configuration (the profile to use, the model and test paths, and the target schema for each layer):

dbt_project.yml

name: serp_analytics    
version: 1.0.0    
config-version: 2    
profile: serp_analytics  

model-paths: ["models"]    
test-paths: ["tests"]    
target-path: "target"    
clean-targets:    
  - "target"    
  - "dbt_packages"  

models:    
  serp_analytics:    
    staging:    
      +schema: staging    
    intermediate:    
      +schema: intermediate    
    marts:    
      +schema: marts

Declaring the Source

In dbt, you don’t reference raw tables directly in your models. You declare them as sources first, and this is what makes lineage tracking possible.

models/sources.yml

version: 2  
sources:  
  - name: serp_source  
    description: Raw SERP data from Bright Data collection  
    database: serp_source  
    schema: main  
    tables:  
      - name: serp_results  
        description: Raw search engine result pages - id, query, url, domain, rank, position, etc.  
        columns:  
          - name: id  
            description: Primary key  
          - name: query  
            description: Search query string  
          - name: url  
            description: Result URL  
          - name: domain  
            description: Extracted domain (e.g. example.com)  
          - name: result_position  
            description: Position on SERP (1-based)  
          - name: rank  
            description: Rank value (can differ from position)  
          - name: title  
            description: Result title  
          - name: timestamp  
            description: Scrape timestamp

The database: serp_source matches the attach alias — that’s how dbt finds the raw table. From here, every model references {{ source() }} or {{ ref() }} rather than a raw table path. This doesn’t sound too important, but in fact, it’s what gives dbt the information it needs to build a lineage graph — a visual DAG of how every output table was derived from the source.

It’s worth talking about what the lineage graph actually buys you, because it’s easy to dismiss as a visualization feature:

When something looks wrong in a mart, you trace upstream through the graph to find where it came from.
When you’re about to refactor staging, you can see which intermediate models and marts depend on it before you touch anything.
When someone new joins the project, they get the full picture of how every output table was derived — in seconds, not by grepping through SQL files.

Run dbt docs generate && dbt docs serve to view it in the browser.

Lineage graph from dbt docs serve, showing the dependency chain from raw SERP data to analytics-ready aggregations.

See these dbt docs for more details on these commands.

Staging: The Contract

The staging model is where all the defensive work happens. Its job is a guarantee: by the time a row leaves staging, it is clean, normalized, and deduplicated. Every model downstream gets to assume that contract holds, which means none of them have to repeat the defensive logic.

Staging and intermediate models use materialized='view'; marts use materialized='table'.

models/staging/stg_serp_results.sql

{{  
  config(  
    materialized='view',  
  )  
}}  
with raw as (  
    select * from {{ source('serp_source', 'serp_results') }}  
),  

cleaned as (  
    select  
        id,  
        trim(query) as query,  
        timestamp,  
        result_position,  
        coalesce(nullif(trim(title), ''), 'Untitled') as title,  
        url,  
        coalesce(nullif(trim(snippet), ''), '') as snippet,  
        case  
            when domain is null or trim(domain) = '' then coalesce(
                -- regexp_extract returns '' (not NULL) when nothing matches, hence the nullif
                nullif(regexp_extract(lower(url), '^https?://(?:www\.)?([^/]+)', 1), ''),
                'unknown'
            )
            else lower(regexp_replace(trim(domain), '^www\.', '', 'i'))  
        end as domain,  
        rank    
    from raw    
),  
deduplicated as (  
    select *  
    from (  
        select *,  
            row_number() over (partition by url, query order by id) as rn  
        from cleaned  
    )  
    where rn = 1  
)  
select  
    id, query, timestamp, result_position, title,  
    url, snippet, domain, rank    
from deduplicated

Now, the domain normalization block is the part worth paying attention to.

The scraper provides a domain field, but it isn't always populated, and even when it is, capitalization and www. prefixes create false cardinality in downstream aggregations.
The CASE expression handles both paths: if the field is present, lowercase it and strip www.; if it's missing, fall back to extracting the domain from the URL via DuckDB's regexp_extract / regexp_replace.
Without this, www.Example.com, example.com, and a row where domain is null but the URL is https://www.example.com/... all count as different domains in every GROUP BY. That kind of silent cardinality inflation is exactly what staging is supposed to catch.
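To make the intent of that CASE expression easier to check, here is the same rule as a plain Python function. It's a sketch of the logic rather than a port of the SQL: the name normalize_domain is mine, and it lowercases on both paths, which is an assumption about the desired behaviour.

```python
import re
from typing import Optional


def normalize_domain(domain: Optional[str], url: Optional[str]) -> str:
    """Lowercase, strip a leading 'www.', and fall back to the URL host when domain is empty."""
    if domain and domain.strip():
        return re.sub(r"^www\.", "", domain.strip().lower())
    # Domain missing: pull the host out of the URL instead
    match = re.match(r"^https?://(?:www\.)?([^/]+)", (url or "").lower())
    return match.group(1) if match else "unknown"


# The three variants mentioned above
rows = [
    ("www.Example.com", "https://www.Example.com/a"),
    ("example.com", "https://example.com/b"),
    (None, "https://www.example.com/c"),
]
print({normalize_domain(d, u) for d, u in rows})
# {'example.com'}
```

All three variants collapse to a single value, which is what keeps count(distinct domain) honest in the marts.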

Deduplication happens last. A window function partitions by (url, query) and keeps the row with the lowest id — the earliest record if a URL was collected more than once for a given query.
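The keep-the-lowest-id rule is easy to reason about in plain Python too. This is only an illustration of what the window function does (the rows are made up and the function name is hypothetical), not something the pipeline runs:

```python
def dedupe_by_url_query(rows):
    """Keep the lowest-id row per (url, query), like row_number() ordered by id with rn = 1."""
    first_seen = {}
    for row in sorted(rows, key=lambda r: r["id"]):
        # setdefault only stores the first (lowest-id) row for each key
        first_seen.setdefault((row["url"], row["query"]), row)
    return list(first_seen.values())


rows = [
    {"id": 3, "url": "https://example.com/a", "query": "data science"},
    {"id": 1, "url": "https://example.com/a", "query": "data science"},
    {"id": 2, "url": "https://example.com/a", "query": "machine learning"},
]
print([r["id"] for r in dedupe_by_url_query(rows)])
# [1, 2]
```

Note that the same URL survives twice here because it was collected for two different queries; only the repeat within "data science" is dropped.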

What about tests?

Add models/staging/stg_serp_results.yml next to your staging model with column-level tests — straightforward not_null checks on the four fields that everything downstream depends on:

version: 2  
models:  
  - name: stg_serp_results  
    description: Cleaned and deduplicated SERP results; domain normalized  
    columns:  
      - name: url  
        description: Result URL  
        tests:  
          - not_null  
      - name: query  
        description: Search query  
        tests:  
          - not_null  
      - name: domain  
        description: Normalized domain  
        tests:  
          - not_null  
      - name: result_position  
        description: SERP position (1-based)  
        tests:  
          - not_null

There’s also a custom test in tests/unique_url_query.sql:

-- Custom test because (url, query) should be unique in staging  
select url, query, count(*) as cnt  
from {{ ref('stg_serp_results') }}  
group by url, query  
having count(*) > 1

Why? Because dbt’s built-in unique test only checks a single column. Our custom test checks a combination — a given URL should appear only once per query after deduplication.

If the window function logic ever breaks due to a refactor, or if upstream data arrives in a shape the deduplication didn't account for, this test catches it before bad data reaches the marts. That's the value of encoding the business rule as a test: it runs every time, and it fails loudly.

Intermediate: Filter Once, Trust Everywhere

The intermediate model is intentionally short (also a view).

models/intermediate/int_serp_results.sql:

{{  
  config(  
    materialized='view',  
  )  
}}  
select  
    query,  
    domain,  
    url,  
    result_position,  
    timestamp  
from {{ ref('stg_serp_results') }}  
where result_position is not null  
  and result_position > 0  
  and domain is not null  
  and domain != 'unknown'

You might wonder why this exists as its own layer rather than putting these filters in each mart. The reason is that all four mart models need the same filtered dataset.

If the filter logic lived in each mart, changing the definition of a “valid” result would mean changing it in four places and hoping they stayed in sync — which they won’t, eventually. Intermediate models are dbt’s answer to that: define the analysis-ready dataset once, name it, and let everything downstream reference it.

Marts: Four Questions, Four Tables

The mart layer is where the pipeline produces something you’d actually hand to an analyst or wire to a dashboard.

Staging and intermediate models are materialized as views — they’re always fresh and don’t consume extra storage. Marts are materialized as tables — written to disk on every run — so queries against them are fast regardless of upstream complexity.

That split keeps the feedback loop fast during development while giving analysts precomputed tables to query.

Query Coverage

models/marts/agg_query_coverage.sql

{{  
  config(  
    materialized='table',  
  )  
}}  
select  
    query,  
    count(distinct url) as unique_urls,  
    count(distinct domain) as unique_domains,  
    count(*) as total_results,  
    min(result_position) as best_position,  
    max(result_position) as worst_position,  
    avg(result_position) as avg_position  
from {{ ref('int_serp_results') }}  
group by query  
order by unique_domains desc

For each search query: how many unique URLs and domains appeared, and what was the spread of positions? This is where you’d start to understand which queries have competitive SERP landscapes — lots of domains spread across positions — versus which are dominated by a handful of sites.

Rank Distribution

models/marts/agg_rank_distribution.sql:

{{  
  config(  
    materialized='table',  
  )  
}}  
with rank_buckets as (  
    select  
        query,  
        domain,  
        case  
            when result_position <= 3 then '1-3'  
            when result_position <= 10 then '4-10'  
            when result_position <= 20 then '11-20'  
            when result_position <= 50 then '21-50'  
            else '50+'  
        end as position_bucket,  
        result_position,  
        count(*) as appearances  
    from {{ ref('int_serp_results') }}  
    group by 1, 2, 3, 4  
)  
select  
    query,  
    position_bucket,  
    count(distinct domain) as unique_domains,  
    sum(appearances) as total_appearances,  
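    -- weight by appearances so positions that occur more often count proportionally
    sum(result_position * appearances) * 1.0 / sum(appearances) as avg_position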
from rank_buckets  
group by query, position_bucket  
order by query, position_bucket

The position buckets — 1–3, 4–10, 11–20, 21–50, 50+ — map to how SEO practitioners actually think about SERP real estate. Positions 1–3 capture the majority of clicks. Anything past 10 is largely invisible. By bucketing in the mart rather than the BI layer, you’re encoding that domain knowledge once, in version-controlled SQL, rather than relying on every analyst who touches the data to re-derive it correctly.
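For reference, here is the same bucketing rule as a Python function, which is handy if you want to check the boundaries or label positions outside the warehouse. The function and constant names are my own:

```python
def position_bucket(position: int) -> str:
    """Map a 1-based SERP position to the same labels the mart uses."""
    if position <= 3:
        return "1-3"
    if position <= 10:
        return "4-10"
    if position <= 20:
        return "11-20"
    if position <= 50:
        return "21-50"
    return "50+"


# Natural reading order of the labels
BUCKET_ORDER = ["1-3", "4-10", "11-20", "21-50", "50+"]

print([position_bucket(p) for p in (1, 3, 4, 10, 11, 50, 51)])
# ['1-3', '1-3', '4-10', '4-10', '11-20', '21-50', '50+']
print(sorted(BUCKET_ORDER))
# ['1-3', '11-20', '21-50', '4-10', '50+']
```

The second print shows a small gotcha: the labels are strings, so a plain sort puts 11-20 before 4-10. If the display order matters in a dashboard, sorting by an explicit order (for example, key=BUCKET_ORDER.index) avoids that.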

Domain Rank Summary

models/marts/agg_domain_rank_summary.sql:

{{  
  config(  
    materialized='table',  
  )  
}}  
select  
    domain,  
    count(*) as total_appearances,  
    count(distinct query) as query_coverage,  
    count(distinct url) as unique_urls,  
    avg(result_position) as avg_position,  
    min(result_position) as best_position,  
    max(result_position) as worst_position  
from {{ ref('int_serp_results') }}  
group by domain  
order by total_appearances desc

This flips the perspective from query-centric to domain-centric. query_coverage is the field I find most useful here — it tells you how many distinct queries a domain appeared in, which is a rough proxy for breadth of SERP presence. A domain with high total_appearances but low query_coverage is strong in a narrow area; a domain with both metrics high is broadly dominant.

Domain x Query Matrix

models/marts/agg_domain_query_matrix.sql:

{{  
  config(  
    materialized='table',  
  )  
}}  
-- Domain x query matrix: best position and appearances per (domain, query)  
select  
    domain,  
    query,  
    min(result_position) as best_position,  
    count(*) as appearances  
from {{ ref('int_serp_results') }}  
group by domain, query  
order by domain, query

The most compact mart, but also the most useful for visualization. Each row is a domain–query pair: the best position that domain achieved for that query, and how many times it appeared. Pivot this by query and you have an SEO visibility matrix — at a glance, which domains rank consistently across your whole keyword set, and which are only competitive in specific corners of it.
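Here is a minimal sketch of that pivot in plain Python, assuming you have already fetched the mart's rows as tuples. The sample rows and the function name are hypothetical; in practice a BI tool or a DataFrame pivot would do the same job.

```python
from collections import defaultdict

# Hypothetical rows shaped like agg_domain_query_matrix:
# (domain, query, best_position, appearances)
rows = [
    ("example.com", "python tutorial", 1, 4),
    ("example.com", "web development", 8, 2),
    ("docs.example.org", "python tutorial", 3, 5),
]


def pivot_best_position(rows):
    """Return ({domain: {query: best_position}}, sorted list of queries)."""
    matrix = defaultdict(dict)
    queries = set()
    for domain, query, best_position, _appearances in rows:
        matrix[domain][query] = best_position
        queries.add(query)
    return dict(matrix), sorted(queries)


matrix, queries = pivot_best_position(rows)
print("queries:", queries)
for domain in sorted(matrix):
    # None marks a query the domain never ranked for
    print(domain, [matrix[domain].get(q) for q in queries])
```

Gaps in the matrix (None here) are often the interesting part: they show where a domain has no presence at all.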

Running the Pipeline

From your project root (where dbt_project.yml and profiles.yml live):

# Mind the " ."  
dbt run --profiles-dir .

dbt resolves the ref() graph before executing anything, so models always run in the right dependency order — staging first, then intermediate, then the four marts. To exclude synthetic data across the whole pipeline:

To run tests after materializing:

dbt test --profiles-dir .

To view the lineage graph and model docs in the browser:

dbt docs generate --profiles-dir .  

dbt docs serve

If profiles.yml is in the project root, --profiles-dir . tells dbt to look there. If you use ~/.dbt/profiles.yml, you can omit it.

If the deduplication breaks, or if upstream data arrives with unexpected nulls in critical columns, the tests surface it before bad data reaches the marts. That’s the other thing the pipeline buys you beyond SQL organization: the tests run as a first-class step, not as a notebook someone wrote once and forgot to share.

The raw web data you ingest is queryable, certainly (and SERP APIs like Bright Data make that very convenient with their structured JSON response), but it isn’t trustworthy in the way analytics requires — where “trustworthy” means that every analyst querying it gets the same answers, because the decisions about normalization, deduplication, and what constitutes a valid result were made once and encoded somewhere they can be found.

Without a pipeline like this, those decisions happen ad hoc. Someone normalizes the domain in a notebook. Someone else doesn’t. The counts disagree, and it’s not clear why. dbt’s layered model is a response to that problem:

the decisions are in version-controlled SQL,
the contracts are enforced by tests that run on every execution, and
when something changes upstream, the first place you’ll hear about it is a failing test rather than a confused analyst.

That’s reproducible analytics — the same inputs and the same logic producing the same outputs, every time.

That’s what the move from raw scrapes to a modeled pipeline with dbt actually buys you. Not fancier queries, but a system where the logic is visible, the contracts are testable, and the next person who touches the data doesn’t have to reverse-engineer what you were thinking.

The Practical Limits of DuckDB on Commodity Hardware

Prithwish Nath — Tue, 10 Mar 2026 08:21:25 +0000

DuckDB shouldn’t work this well. It’s a single embedded library that needs no server, config, or cloud bill — yet it handles warehouse-scale columnar analytics with surprising ease.

Plenty of benchmarks already show that DuckDB can process large datasets. The more useful question, to me, is narrower: how far does it scale on low-end hardware before interactivity breaks down? I do data forensics for a living, and the last thing I want is infrastructure getting in the way.

To answer that, I ran a 50-million-row benchmark on a ~$500 Acer Aspire 5 (Raptor Lake i5, 16GB RAM, 1 TB SSD). Starting with 50,000 real-world search results, I generated a large synthetic dataset and executed increasingly complex analytical queries to identify where performance crossed practical thresholds.

I'll present my findings here. The result wasn't a catastrophic failure point, but a series of predictable transitions (from instant, to tolerable, to better-scheduled-than-waited-on) depending on query shape.

1. The Four Performance Zones

Before I dive into specific query types, understand that not all scales perform the same way. The same query that feels instant at 1 million rows might cross into “go grab a coffee” territory at 30 million — and different query types degrade at different rates.

So there isn’t one big scale — but three. One for each type of analytical query:

Percentile queries (like GROUP BY + PERCENTILE_CONT): for when you’re trying to answer questions like “For each domain, what’s its median, 25th, 75th, and 95th percentile search ranking?” Or: “For each customer tier, what is the session duration distribution?” This is the kind of query you run when averages aren’t enough. You want to understand spread, skew, and outliers. It’s common in finance, product analytics, and marketplace reporting, anywhere distributions matter more than single numbers.
Window functions (like LAG with PARTITION BY): e.g. “For each user, how did their activity change vs. the previous session?” Or: “For each account, how did monthly revenue change vs. last month?” This is bread-and-butter analytics, a simple time-series comparison. It’s what powers growth dashboards and churn analysis. Amazon, for example, uses this extensively for anomaly detection. These are more expensive because they require partitioning and sorting, especially on large datasets like ours.
Aggregations (like COUNT, AVG, MIN, MAX with DISTINCT): e.g. “For each region, how many total orders did we ship, how many unique customers/products, what were the min/max purchase values?” This is the classic summary view, the kind of query behind almost every dashboard card. Deceptively simple, because DISTINCT at scale forces the engine to work harder than you might expect.

Together, these three queries make up a massive chunk of all real-world analytics.

Here’s how they perform going from 1,000 to 50,000,000 rows:

Query time (seconds) vs record count (millions). Image created via D3.js by Author.

Scale	Percentile	Window	Aggregation	Worst Case
1M	0.11s	0.80s	0.12s	0.8s
2M	0.28s	1.78s	0.50s	1.8s
5M	0.64s	3.28s	1.23s	3.3s
10M	1.65s	6.26s	2.17s	6.3s
15M	2.69s	7.94s	4.15s	7.9s
20M	3.69s	15.26s	6.58s	15.3s
25M	5.52s	22.98s	10.60s	23.0s
30M	6.67s	31.41s	15.46s	31.4s
40M	9.82s	47.43s	22.64s	47.4s
50M	12.95s	67.24s	32.42s	67.2s

Having plotted query times against human perception thresholds (I made some assumptions for those, since this is subjective), we have four distinct performance zones — with boundaries shifting depending on which query type you care about.

“Comfort Zone”: Up to 5M Records

Query times: Near instantaneous, about 3 seconds or less for everything.

This is why DuckDB feels magical. For everything up to 5 million rows (and let’s be honest — more data than most organizations are querying interactively anyway) — all queries are fast. Window functions complete in under a second at 1M. Even at 5M rows, the slowest query (you guessed it; a window function) takes just 3.3 seconds. Percentiles and aggregations are sub-second through 2M and barely crack 1 second at 5M.

At this scale, DuckDB keeps everything comfortably in memory. Hash tables fit in RAM, partitioning operations don’t require temp files, and there’s zero disk spilling. This is the sweet spot for team analytics, dashboards, and interactive data exploration.

“Workable”: 5–20M Records

Query times:

Window functions: 3–15 seconds. Noticeable latency. At 10M they take 6 seconds, and at 15M, about 8 seconds. They hit 15 seconds right at 20M — the upper edge of this zone.
Percentiles and aggregations: still under 4 and 7 seconds respectively.

This is where window queries will make people Alt-Tab out to do something else — doomscroll, check their email, watch videos of cats — instead of waiting. Percentiles and aggregations are still firmly comfortable, though.

This zone is still good for data science work and one-off analyses where a 10–15 second wait is acceptable. Still no disk spilling. Memory usage stays under ~650 MB.

“Now You’re Pushing It”: 20–30M Records

Query times:

Window functions: 15–31 seconds, always crossing the 15s “frustration threshold”.
Aggregations: 7–15 seconds (10.6s at 25M, 15.5s at 30M), near-annoying latency now.
Percentiles: 4–7 seconds (5.5s at 25M, 6.7s at 30M.)

This is the point where query types diverge dramatically. Percentiles are still firmly in the “interactive” category, but aggregations require patience. Window functions, however, are firmly in “go do something else” territory: at 30M records they can reach a whopping 31 seconds, clearly the point where you should batch these instead of waiting live.

In general, this zone is best for automated/scheduled work where humans aren’t waiting on complex queries. Simple aggregations and percentiles are still interactive enough.

“Batch-Only”: 30M+ Records

Query times:

Window functions: 31–67 seconds. Always past one minute at 50M.
Aggregations: 15–32 seconds. Always past 30 seconds at 50M.
Percentiles: 7–13 seconds.

In a testament to how performant DuckDB can be — if you’re only running percentiles, even 50M rows is still marginally interactive (tops out at 13 seconds.)

Everything else at this scale, though, should be scheduled.

If you need window functions, this (30M and above) is clearly the point where you should consider a cloud data warehouse — they cross a minute at 50M (~67 seconds). Aggregations taking 20–30 seconds on average is also beyond generous definitions of “interactive”.

So this is where we’ve found our limit — albeit not a hard one. Performance degrades predictably, not catastrophically. Critically, there’s still zero disk spilling. DuckDB never writes temp files to disk, even at 50M records. Memory usage peaked at ~1.2 GB.
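To make the four zones easy to reuse, here is a small sketch that maps a row count to its zone. The worst-case times are copied from the benchmark table above. Treating each boundary (5M, 20M, 30M) as belonging to the lower zone is my assumption, and the function name is hypothetical.

```python
# Worst-case query times (seconds) from the benchmark table, keyed by row count
WORST_CASE_SECONDS = {
    1_000_000: 0.8,
    5_000_000: 3.3,
    10_000_000: 6.3,
    20_000_000: 15.3,
    30_000_000: 31.4,
    50_000_000: 67.2,
}


def performance_zone(row_count: int) -> str:
    """Map a row count to one of the four zones (boundaries are inclusive upper bounds)."""
    if row_count <= 5_000_000:
        return "Comfort Zone"
    if row_count <= 20_000_000:
        return "Workable"
    if row_count <= 30_000_000:
        return "Now You're Pushing It"
    return "Batch-Only"


for rows, seconds in WORST_CASE_SECONDS.items():
    print(f"{rows:>11,} rows  {seconds:>5.1f}s worst case  {performance_zone(rows)}")
```

Remember that the zones are driven by the slowest query type (window functions). If you only run percentiles, you can shift every boundary upward.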

Here’s a practical cheatsheet, sorted by use case:

Use case	Simple analytics ceiling	Window function ceiling
Live auto-refresh dashboard (less than 500ms)	~2M rows	~500K rows
API-backed service (less than 1s)	~5M rows	~1M rows
Interactive BI (less than 3s)	~15M rows	~5M rows
Notebook / exploration (less than 10s)	~40M rows	~15M rows
Ad-hoc analyst SQL (less than 30s)	50M+ rows	~20–25M rows
Batch / ETL (less than 2min)	50M+ rows	50M+ rows

2. The Goldilocks Zone for Local Analytics

If I had to draw a circle around the scale where DuckDB on a cheap laptop is just right — fast enough to feel interactive, complex enough to handle real analytical work, without needing to think about infrastructure — it’s 1M to 10M rows.

Here’s why that range is special.

At 1M rows, every query type is effectively instant. Aggregations finish in 0.06 seconds. Even the most expensive query in the test — the LAG window function — takes 0.47 seconds. The database is not part of your cognitive load at all at this scale.

At 5M rows, that’s still largely true. Aggregations take half a second. Percentiles take 1.3 seconds. Window functions take 1.7 seconds. 5M rows of real data is a meaningful dataset — a year of event logs for a mid-sized SaaS product, a full crawl of a large website, several years of transaction history for a small business. And DuckDB chews through it without complaint.

At 10M rows, you start paying a tax specifically on window functions — they cross 6 seconds here. But aggregations (0.79s) and percentiles (1.99s) are still fast enough that interactive use feels natural. If your workload skews toward GROUP BY analytics and distribution queries rather than time-series comparisons, 10M rows is still firmly comfortable.

So our “Goldilocks” Zone — so named after the fairytale of Goldilocks and the Three Bears; Goldilocks rejects porridges that are too hot and too cold, until finding one that’s “just right” — is up to ~10M rows for any query type, and up to ~20M rows if you’re willing to accept that window functions will make you wait.

Below that ceiling, DuckDB on cheap/commodity hardware is no compromise at all.

3. Memory is Not the Bottleneck

Across all scales tested (1K → 50M rows, averaged over 5 runs) DuckDB never exceeded ~1.2 GB of memory — even at 50 million rows.

Peak memory (MB) vs record count (log scale, 1K–50M). Image created via D3.js by Author.

At the upper bound (50 million rows), peak observed memory usage was ~1,212 MB. On my 16GB laptop that’s ~7.5%. And that’s the worst case in this benchmark — with 50M rows and window-heavy queries in play.

Memory growth is real — especially with window function (delta) queries — but it’s still fairly controlled, I’d say:

10M rows → ~600 MB range
20M rows → ~650 MB range
30–50M rows → ~900 MB–1.2 GB range

Time, on the other hand, accelerates sharply:

Window query, 10M rows → ~6.4s
Window query, 20M rows → ~15.8s
Window query, 50M rows → ~70s

So the practical ceiling on a cheap laptop isn’t RAM. You’ll run out of patience before you run out of memory. It means for many local analytics use cases the scaling limit is UX, not hardware — that’s a very different kind of bottleneck, though.

Before we move on, an interesting thing to note is that the aggregation query degrades faster than percentiles at scale. At 5M they’re similar (~0.6s vs ~1.2s), but by 50M the aggregation (33.7s) is nearly 3x the percentile (12.5s). Multi-metric GROUP BY with HAVING and DISTINCT counts WILL get expensive. If your dashboard runs heavy aggregations specifically, you should probably discount the “simple” ceiling by ~30–40%.

4. Window Functions Affect Scaling The Most

If you’re embedding DuckDB in:

A local analytics tool
A CLI data processor
A desktop SaaS product
A developer-facing data tool

Row count alone is not your sizing variable. Query shape is.

Across 18 dataset scales (1K → 50M rows, averaged over 5 runs), the single biggest performance divider wasn’t how much data existed. It was whether the workload used window functions at all.

Window functions power extremely common patterns:

Ranking results per group (RANK(), ROW_NUMBER())
Comparing rows over time (LAG(), LEAD())
Running totals
Sessionization
Deduplication

These are all operations you’ll run many, many times for a real project.
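To see why these patterns cost more than a plain GROUP BY, here is a rough pure-Python emulation of a LAG over partitions. It is a sketch of the idea (sort by partition and order keys, then walk each group), not a description of DuckDB's internals; the sample rows and function name are hypothetical.

```python
from itertools import groupby
from operator import itemgetter

# Hypothetical rows: (url, query, timestamp, rank)
rows = [
    ("a.com/x", "python", 2, 5),
    ("a.com/x", "python", 1, 9),
    ("b.com/y", "python", 1, 3),
    ("a.com/x", "python", 3, 4),
]


def lag_rank(rows):
    """Emulate LAG(rank) OVER (PARTITION BY url, query ORDER BY timestamp)."""
    # Every row has to be ordered by (partition keys, order key) before any
    # "previous row" can be looked up. This sort is the expensive part.
    ordered = sorted(rows, key=itemgetter(0, 1, 2))
    out = []
    for _key, group in groupby(ordered, key=itemgetter(0, 1)):
        previous = None  # the first row in each partition has no predecessor
        for url, query, ts, rank in group:
            out.append((url, query, ts, rank, previous))
            previous = rank
    return out


for row in lag_rank(rows):
    print(row)
```

A GROUP BY can fold each row into a running aggregate and forget it. The LAG version has to keep rows around and order them, which is the partition-and-sort work that makes these queries slower as row counts grow.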

I found that on identical hardware, aggregation-heavy workloads comfortably scale to tens of millions of rows while staying interactive. Window-heavy workloads, though, hit “coffee break” territory much earlier.

At 10M rows

Percentile query (aggregation-heavy): ~1.7s, ~217 MB memory delta
Window “delta” query (LAG-style): ~6.4s, ~46 MB memory delta
Group aggregation: ~2.5s, ~200 MB memory delta

Already, the window query is ~3–4x slower than pure aggregation.

At 20M rows

Percentile: ~4.1s
Window (delta): ~15.8s
Aggregation: ~7.3s

The gap widens in absolute terms, even though the ratio holds: the window query is still roughly 2–4x slower than the aggregation-style queries.

At 50M rows

Percentile: ~16.5s, ~330 MB delta
Window (delta): ~70.3s, ~588 MB delta
Aggregation: ~33.3s, ~113 MB delta

This is where the separation becomes untenable:
The window query takes 70 seconds.
Aggregation-heavy queries remain roughly half that.
Memory growth for window logic accelerates much more aggressively at scale.

If your workload is mostly GROUP BY, percentiles, and simple scans, then you have substantial headroom. If your workload relies heavily on LAG(), LEAD(), RANK() OVER (PARTITION BY …) then your practical ceiling with DuckDB (running on such consumer grade local hardware, at least) arrives much sooner.

5. …But Make Sure you Checkpoint Enough

Speaking of window functions, a very interesting recurring pattern I found was that at exactly 50,000 records, these hit a wall.

Delta query time (seconds) vs record count (log scale, 1K–5M) showing the spike at 50K. Image created via D3.js by Author

This pattern is consistent across all 5 runs:

20K records: 0.24 seconds (fast)
50K records: 1.21 seconds (5x slower than expected)
100K records: 0.20 seconds (fast again)

The expected value comes from linear interpolation between 20K (0.241s) and 100K (0.204s), evaluated at 50K: 0.241 + (0.204 − 0.241) × (50 − 20) / (100 − 20) ≈ 0.227s

This specific spike appears in every single run with CV (coefficient of variation) of only 10.3%. Expected time based on linear scaling was 0.23 seconds. Actual time? 1.21 seconds. That’s 430% slower than it should be!
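If you want to check the arithmetic yourself, this short sketch reproduces the interpolation from the 20K and 100K timings quoted above and compares it with the measured 50K time. The helper name is hypothetical; the numbers are the ones already given in the text.

```python
def interpolate(x, x0, y0, x1, y1):
    """Linear interpolation of y at x between the points (x0, y0) and (x1, y1)."""
    return y0 + (y1 - y0) * (x - x0) / (x1 - x0)


# Scales in thousands of rows, timings in seconds
expected = interpolate(50, 20, 0.241, 100, 0.204)
actual = 1.21
slowdown = actual / expected

print(f"expected at 50K: {expected:.3f}s")
print(f"actual at 50K:   {actual:.2f}s")
print(f"slowdown:        {slowdown:.1f}x ({100 * (slowdown - 1):.0f}% slower)")
```

It lands on an expected time of about 0.23 seconds and a slowdown of a little over 5x, which is where the "roughly 430% slower" figure comes from.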

It’s the only non-monotonic performance pattern in the entire dataset. Every other scale shows smooth, predictable scaling. The 50K spike for window/delta functions is the exception.

Why it happens: not enough checkpointing. DuckDB has automatic checkpointing (wal_autocheckpoint), but it doesn’t work reliably during bulk inserts from Python. (Check this GitHub issue.) Knowing this, I triggered checkpoints manually, but as it turned out, not often enough.

Data was inserted in 1,000-row batches across multiple database connections, but I had a LOT of records (50 million) so I thought I’d manually trigger CHECKPOINT only every 100K rows.

💡 CHECKPOINT is a database operation that flushes pending writes to disk and optimizes storage layout. In DuckDB specifically, it:
- Consolidates fragmented columnar segments created by many small inserts
- Reorganizes data into optimized columnar format for faster reads
- Flushes the write-ahead log (WAL) to the main database file

Like defragmenting a hard drive, CHECKPOINT reorganizes scattered data into contiguous, optimized blocks.

Because the scales were run sequentially (1K → 5K → 10K → 20K → 50K → 100K and so on), by the time the 50K test executed, dozens of small batch inserts had accumulated without a CHECKPOINT, leaving the write-ahead log fragmented across many small column segments.

Now consider the query that spikes:

LAG(rank) OVER (PARTITION BY url, query ORDER BY timestamp)

This requires a full sort of the relevant partitions. Sorting is exactly the kind of operation that is most sensitive to storage layout, and fragmented column segments mean less efficient scanning and sorting. So at 50K, the query pays the fragmentation penalty.

By 50K I had 50 fragmented segments. The window function’s sort pays the cost of reading across all those fragments. At 100K, CHECKPOINT compacted everything, so queries got faster despite more data.
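Here is a minimal sketch of what checkpointing more often could look like during batched inserts. It is not the code from my benchmark: the events table, the function name, and the 10,000-row checkpoint interval are all assumptions for illustration. The function takes any connection object with DuckDB-style execute() and executemany() methods, and the block uses a recording stand-in so it runs without DuckDB installed.

```python
def insert_with_checkpoints(conn, rows, batch_size=1_000, checkpoint_every=10_000):
    """Insert rows in batches, issuing CHECKPOINT every `checkpoint_every` rows.

    `events` is a hypothetical two-column table; adapt the INSERT to your schema.
    Returns the number of checkpoints issued.
    """
    since_checkpoint = 0
    checkpoints = 0
    for start in range(0, len(rows), batch_size):
        batch = rows[start:start + batch_size]
        conn.executemany("INSERT INTO events VALUES (?, ?)", batch)
        since_checkpoint += len(batch)
        if since_checkpoint >= checkpoint_every:
            conn.execute("CHECKPOINT")  # compact what has accumulated so far
            since_checkpoint = 0
            checkpoints += 1
    if since_checkpoint:
        conn.execute("CHECKPOINT")  # flush whatever is left at the end
        checkpoints += 1
    return checkpoints


class RecordingConnection:
    """Stand-in connection that only records SQL, so the sketch runs anywhere."""

    def __init__(self):
        self.statements = []

    def execute(self, sql, params=None):
        self.statements.append(sql)

    def executemany(self, sql, rows):
        self.statements.append(sql)


conn = RecordingConnection()
rows = [(i, f"query {i}") for i in range(25_000)]
checkpoints = insert_with_checkpoints(conn, rows)
print(checkpoints, "checkpoints issued")
```

With a real database you would pass the object returned by duckdb.connect(...) instead of the stand-in. The right interval depends on your batch size and how soon you query after loading, so treat 10,000 as a starting point to tune.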

That’s all the data I found. If you want to know about my methodology for this (admittedly) very niche benchmark, read on.

Methodology — How this Low-End Benchmark Was Built

What I wanted to do was stress-test DuckDB with realistic analytical queries at scales most teams would consider “cloud warehouse only.” So if I wanted to benchmark how it performed on low-end hardware, I had to flip the entire approach on its head.

Step 1 — Getting Search Data

I started by fetching 50,000 actual Google SERP results using Bright Data’s SERP API. Why SERP data? Because search results have realistic cardinality — lots of unique URLs, domains, queries, timestamps — exactly the kind of data that stresses analytical databases.

I built ~2,550 unique queries by combining 50 base topics with 51 suffix variations:

_BASE_TOPICS = [  
    "python programming",  
    "machine learning",  
    "web development",  
    # ... 47 more  
]  

_QUERY_SUFFIXES = [  
    "",  
    " tutorial",  
    " 2024",  
    " best practices",  
    # ... 47 more  
]  

SERP_QUERIES = [  
    f"{base}{suffix}".strip()  
    for base in _BASE_TOPICS  
    for suffix in _QUERY_SUFFIXES  
]  
SERP_QUERIES = list(dict.fromkeys(SERP_QUERIES))

Full code here: serp_queries.py

Each query fetched up to 20 organic results via the API, streamed directly into DuckDB as results came in:

import time

client = BrightDataClient()
results_obtained = 0  # organic results stored so far
query_idx = 0         # position in the (cycled) query list

with DuckDBManager() as db:
    while results_obtained < total_results:
        query = queries[query_idx % len(queries)]
        serp_data = client.search(query, num_results=batch_size)

        # Organic results may sit at the top level or nested under 'body'
        organic_results = []
        if isinstance(serp_data, dict):
            if 'organic' in serp_data:
                organic_results = serp_data['organic']
            elif 'body' in serp_data and isinstance(serp_data['body'], dict):
                if 'organic' in serp_data['body']:
                    organic_results = serp_data['body']['organic']

        if organic_results:
            db.insert_batch(organic_results, query)
            results_obtained += len(organic_results)

        query_idx += 1
        time.sleep(delay_seconds)  # pause between requests

Full code here: bright_data.py

My DuckDB schema mirrors the SERP data structure the API returns. (Check their docs here for more info):

self.conn.execute("""  
    CREATE TABLE IF NOT EXISTS serp_results (  
        id BIGINT PRIMARY KEY,  
        query TEXT NOT NULL,  
        timestamp TIMESTAMP NOT NULL,  
        result_position INTEGER NOT NULL,  
        title TEXT,  
        url TEXT,  
        snippet TEXT,  
        domain TEXT,  
        rank INTEGER,  
        previous_rank INTEGER,  
        rank_delta INTEGER  
    )  
""")

Full code here: duckdb_manager.py

This adds up to 50K rows of real, varied data with natural patterns — not artificial test data…but that still wasn’t enough.

Step 2 — Synthesize 50M Records from Real Patterns

50K rows isn’t enough to stress-test at scale. But completely random synthetic data loses the realistic patterns that make queries slow. So I extracted actual domains, queries, title structures, and snippets from the real data and used those to generate variations:

def extract_serp_patterns(db_path):  
    with DuckDBManager(db_path) as db:  
        queries = db.conn.execute(  
            "SELECT DISTINCT query FROM serp_results ORDER BY RANDOM() LIMIT 100"  
        ).fetchall()  
        domains = db.conn.execute(  
            "SELECT DISTINCT domain FROM serp_results WHERE domain IS NOT NULL LIMIT 200"  
        ).fetchall()  
        title_samples = db.conn.execute(  
            "SELECT title FROM serp_results WHERE title IS NOT NULL LIMIT 50"  
        ).fetchall()  
        snippet_samples = db.conn.execute(  
            "SELECT snippet FROM serp_results WHERE snippet IS NOT NULL LIMIT 50"  
        ).fetchall()  
        return {  
            'queries': [r[0] for r in queries],  
            'domains': [r[0] for r in domains],  
            ...  
        }

Synthetic rows were generated in batches of 10,000 and inserted directly, mixed in with the 50k real records, until we had 50M total:

for i in range(0, needed, batch_size):
    batch = [{
        'id': current_id + j,  # keep ids unique within the batch
        'query': random.choice(queries),
        'domain': random.choice(domains),
        ...
    } for j in range(batch_size)]
    current_id += batch_size
    df = pd.DataFrame(batch)
    # DuckDB reads the local DataFrame `df` by name
    db.conn.execute("INSERT INTO serp_results SELECT * FROM df")

Full code here: benchmark.py

Step 3 — Design the Query Suite

I chose three query types that represent common analytical workloads:

Percentile Queries:

query = f"""  
    SELECT   
        domain,  
        COUNT(*) as result_count,
        AVG(rank) as avg_rank,  
        PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY rank) as median_rank,  
        PERCENTILE_CONT(0.25) WITHIN GROUP (ORDER BY rank) as p25_rank,  
        PERCENTILE_CONT(0.75) WITHIN GROUP (ORDER BY rank) as p75_rank,  
        PERCENTILE_CONT(0.95) WITHIN GROUP (ORDER BY rank) as p95_rank  
    FROM serp_results  
    {where_clause}  
    GROUP BY domain  
    ORDER BY avg_rank  
    LIMIT 100  
"""

Window Functions:

query = f"""  
    WITH ranked AS (  
        SELECT   
            url,  
            query,  
            rank,  
            timestamp,  
            LAG(rank) OVER (PARTITION BY url, query ORDER BY timestamp) as previous_rank  
        FROM serp_results  
        {id_filter}  
    )  
    SELECT   
        url,  
        query,  
        rank,  
        previous_rank,  
        rank - previous_rank as rank_delta,  
        timestamp  
    FROM ranked  
    WHERE previous_rank IS NOT NULL  
    ORDER BY ABS(rank_delta) DESC  
    LIMIT 100  
"""

Aggregation Queries:

query = f"""  
    SELECT   
        domain,  
        COUNT(*) as total_results,  
        COUNT(DISTINCT query) as unique_queries,  
        AVG(rank) as avg_rank,  
        MIN(rank) as best_rank,  
        MAX(rank) as worst_rank,  
        COUNT(DISTINCT url) as unique_urls  
    FROM serp_results  
    {where_clause}  
    GROUP BY domain  
    HAVING COUNT(*) > 10  
    ORDER BY total_results DESC  
    LIMIT 50  
"""

Full code (all queries) here: queries.py

These test different bottlenecks: percentiles stress sorting and statistical aggregation; window functions stress partitioning and stateful operations; aggregations stress hash tables and DISTINCT operations.

Step 4 — Run at 18 Different Scales

Instead of regenerating data for each scale, I used a single optimization: filter by ID. One dataset, 18 different windows into it:

max_id_result = db.conn.execute(f"""  
    SELECT id FROM (  
        SELECT id FROM serp_results ORDER BY id LIMIT {target_count}  
    ) ORDER BY id DESC LIMIT 1  
""").fetchone()  

max_id = max_id_result[0]

Every query then receives this max_id as a filter:

where_clause = f"WHERE id <= {max_id} AND domain IS NOT NULL AND domain != ''"

Full code here: benchmark.py

This tested 1K, 5K, 10K, 20K, 50K, 100K, 200K, 500K, 1M, 2M, 5M, 10M, 15M, 20M, 25M, 30M, 40M, and 50M records against identical data — isolating row count as the only variable. Each scale measured query execution time and peak memory usage via psutil.
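For a sense of what the measurement step can look like, here is a minimal sketch of a timing wrapper. It is not the harness from benchmark.py. The function name is hypothetical, and it reports the change in resident memory (RSS) before and after the call, which is a simpler stand-in for true peak memory. psutil is optional here: if it is not installed, only the timing is reported.

```python
import time

try:
    import psutil  # third-party; used only for the memory reading
except ImportError:
    psutil = None


def measure(run_query):
    """Return (result, elapsed_seconds, rss_delta_mb) for one call of run_query()."""
    process = psutil.Process() if psutil else None
    rss_before = process.memory_info().rss if process else None

    start = time.perf_counter()
    result = run_query()
    elapsed = time.perf_counter() - start

    rss_delta_mb = None
    if process:
        rss_delta_mb = (process.memory_info().rss - rss_before) / (1024 * 1024)
    return result, elapsed, rss_delta_mb


# Any zero-argument callable works; in a benchmark this would execute one SQL query.
result, elapsed, rss_delta_mb = measure(lambda: sum(range(1_000_000)))
print(f"elapsed: {elapsed:.4f}s, RSS delta (MB): {rss_delta_mb}")
```

To capture a real peak you would need to sample memory while the query runs (for example from a background thread), since a before/after reading misses anything allocated and freed in between.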

Step 5 — Run Five Complete Iterations

To ensure patterns weren’t random variance, I ran the entire benchmark suite five times. This produced 90 data points per query type (5 runs x 18 scales).

The result: performance was shockingly consistent. Most metrics had less than 10% coefficient of variation across runs — the 50M window function result, for instance, varied by less than 1 second across all five runs.
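Coefficient of variation is just the standard deviation divided by the mean, so it is easy to compute for your own runs. A minimal sketch follows. The timings are made-up placeholders rather than my benchmark data, and using the sample standard deviation (rather than the population one) is an assumption.

```python
from statistics import mean, stdev


def coefficient_of_variation(samples):
    """CV = sample standard deviation / mean, returned as a fraction."""
    return stdev(samples) / mean(samples)


# Hypothetical timings (seconds) for one query at one scale across five runs
runs = [1.15, 1.30, 1.08, 1.32, 1.20]
print(f"CV = {coefficient_of_variation(runs):.1%}")
```

A low CV means the runs agree with each other, so a spike that shows up in every run is a real pattern and not noise.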

Limitations

Uniform synthetic distribution: The 50M records were generated by sampling uniformly from ~200 real domains and ~100 real queries. Real-world data follows a power-law distribution — a small number of domains dominate traffic heavily. This means the PARTITION BY url, query window function hits roughly equal-sized partitions in this benchmark, which is best-case behavior. In production data with skewed distributions, window function performance at scale could be meaningfully worse than the numbers here suggest.
Single-table queries only: Didn’t test multi-table joins or recursive CTEs.
Hardware-specific: Tested on one machine (16GB RAM). Results may vary on different hardware.
Query-specific: These three query types don’t represent all analytical workloads.
DuckDB version: Tested on DuckDB >=1.0.0. Newer versions may perform differently.
No concurrent queries: Single-threaded only, no concurrent workload tested.
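The first limitation is easy to see in miniature. The sketch below compares uniform sampling (what this benchmark did) with a skewed alternative, using the same number of domains. The domain names are placeholders, and the 1/rank weighting is just one simple way to approximate a power-law shape, not a claim about any real traffic distribution.

```python
import random
from collections import Counter


def sample_domains(domains, n, skewed, seed=0):
    """Draw n domains, either uniformly or with Zipf-like 1/rank weights."""
    rng = random.Random(seed)
    if skewed:
        weights = [1 / rank for rank in range(1, len(domains) + 1)]
        return rng.choices(domains, weights=weights, k=n)
    return rng.choices(domains, k=n)


domains = [f"domain{i}.example" for i in range(200)]
uniform = Counter(sample_domains(domains, 10_000, skewed=False))
skewed = Counter(sample_domains(domains, 10_000, skewed=True))

print("largest partition, uniform:", uniform.most_common(1)[0][1])
print("largest partition, skewed: ", skewed.most_common(1)[0][1])
```

With skewed weights, a handful of domains absorb a large share of the rows, so some partitions end up far bigger than others. That is the situation in which window functions could behave worse than the numbers in this benchmark.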

Conclusion — Scaling Is a UX Ceiling.

This experiment was more about finding out where local analytics on consumer grade hardware stops feeling interactive, rather than stress-testing DuckDB.

Across all runs, DuckDB remained stable and memory usage stayed modest. There was no hard failure point, no dramatic collapse in performance. Instead, execution times increased predictably as data grew — and different query shapes crossed the interactivity threshold at different scales. Aggregation-heavy workloads retained responsiveness far longer, while window-heavy queries reached the UX ceiling much sooner.

This, then, is our practical takeaway: the limiting factor on local analytics isn’t usually RAM or disk; it’s the point at which queries stop feeling fluid. At that boundary, the decision you have to make is entirely about whether (for your use case) the user experience is still good enough to keep things local.

Some links in this article are tracking links used for analytics purposes only. I do not receive any commission or compensation from them.

Build a Bun CLI to Generate TypeScript Clients from API Docs

Prithwish Nath — Wed, 04 Mar 2026 09:38:11 +0000

A Common Dev Pain Point (And My Excuse to Learn Bun)

I hate it when I find an API I want to use, go to their documentation site, and find a beautiful page with endpoints, request/response examples, detailed explanations, and… no OpenAPI spec. No SDK, either.

I understand creating a Swagger/OpenAPI schema involves far more effort than a typical docs page for an API, so I can’t be too upset. But this does limit my options — I’d either have to hand-write fetch calls for every endpoint (tedious, error-prone), or politely ask the API maintainer for an OpenAPI spec (they are not obligated to spend dev cycles on some rando’s request.)

This is the case with at least 75% of all APIs here, for example. Even well-funded APIs sometimes have great docs but no machine-readable spec.

So I built a CLI tool with full proxy support (with Bun — this experiment is mostly because I wanted to learn how to create tooling with it) that generates TypeScript clients from either OpenAPI specs or raw documentation sites.

You use it like:

# 1. Just point it at the docs page  
> dtoc https://docs.some-api.com  
# Or, if you just want to run it in dev without building an executable...  
> bun run index.ts https://docs.some-api.com  

# 2. If the API does have a Swagger/OpenAPI JSON spec, use that instead  
> dtoc https://some-other-api.com/doc.json

…and get back a complete TypeScript API client that you can use like so:

// In CommonJS you could omit the .js here, but not in ESM  
import { ApiClient } from './generated/catfact_ninja/client.js';  
const client = new ApiClient();  
// Get a random cat fact  
const fact = await client.getFactRandom();  
console.log(fact.fact);  
// Get a list of breeds  
const breeds = await client.getBreeds();

And yes, it can read .env files from the same working directory, and compiles to a standalone ~100MB binary you can distribute.

Full source code here: https://github.com/sixthextinction/bun-docs-to-client

Throughout this article, code snippets link to specific files/lines. Some examples are simplified for clarity — check the links for complete implementations.

The Architecture

For any documentation site, we have two paths we can go down. Let’s visualize this before diving into code. So, two distinct pipelines depending on what you feed it:

1. If you have a Swagger/OpenAPI spec (deterministic compile pipeline)

Our happy path is when a proper OpenAPI spec is available. The workflow becomes much more mechanical — and reliable.

Fetch the OpenAPI JSON (pass in the URL to it, or a local file if you have it saved)
Clean non-standard root properties
Validate and dereference the spec using swagger-parser
Normalize server URLs (absolute, relative, or inferred)
Generate TypeScript interfaces from components.schemas (part of the Swagger/OpenAPI JSON)
Generate a typed ApiClient class from paths (also part of the spec)
Emit the client, types, tests, and index files

There’s no inference or heuristics at play. The spec becomes the single source of truth. Spec in, deterministic code out. This is our ideal case.

2. If you only have a messy docs site (LLM + runtime synthesis pipeline)

With this route, I’m looking for something that scaffolds me 80% of the way there. A best-effort version. It won’t one-shot every API, and that’s okay. I can do the rest.

When no OpenAPI spec exists, we have to synthesize one from the documentation page itself — and then make real requests to the API to validate it for us.

Our workflow has to become exploratory:

Fetch the documentation HTML, convert HTML → Markdown (using turndown or similar)
Use an LLM (preferably local) to extract ONLY the API endpoints mentioned in the markdown
Infer the base URL from example requests
Categorize endpoints (list, detail, query), then probe them by sending real HTTP requests, inferring each endpoint's response structure from the actual API response
Extract IDs from list responses (e.g. if the docs page mentions /people/1) and use them to probe detail routes like /people/{id}, so schemas are inferred from real, working endpoints instead of guesses
Assemble a minimal but valid OpenAPI spec from the above, validate it using swagger-parser
Generate a typed ApiClient class and TypeScript interfaces
Emit the client, types, tests, and index files

The LLM’s job is narrow — it is instructed to only identify endpoints mentioned in the documentation (and we filter out the ones we know for sure won’t be endpoints — like image assets, CSS/JS files, OAuth flows, social links, status pages, or obviously non-API routes).

The LLM doesn’t need to be perfect, or fully generate the OpenAPI spec file itself. It just needs to extract mentioned endpoints. Our actual HTTP testing validates everything later and generates accurate schemas from real data.
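
Here's roughly what that post-filter could look like. Treat it as a sketch under assumptions: isLikelyApiPath and the pattern list are hypothetical, written to mirror the exclusions described above rather than copied from the project.

```typescript
// Hypothetical post-filter for LLM-extracted paths. The patterns mirror the
// exclusions described above; they are illustrative, not the project's rules.
const NON_API_PATTERNS: RegExp[] = [
  /\.(png|jpe?g|gif|svg|ico|css|js)$/i, // image and static asset files
  /^\/(img|images|css|js|static|assets)\//i, // static asset folders
  /^\/(oauth|connect)(\/|$)/i, // OAuth flows
  /^\/(twitter|github|facebook|linkedin)(\/|$)/i, // social links
  /^\/status(\/|$)/i, // status pages
];

function isLikelyApiPath(path: string): boolean {
  // Anything not starting with "/" is an external link or junk
  if (!path.startsWith('/')) return false;
  // Drop the query string before matching file extensions
  const [pathname = ''] = path.split('?');
  return !NON_API_PATTERNS.some((pattern) => pattern.test(pathname));
}

const candidates = ['/jokes/random', '/img/logo.png', '/oauth/token', '/people/{id}'];
const kept = candidates.filter(isLikelyApiPath);
// kept is ['/jokes/random', '/people/{id}']
```

A deny-list like this is cheap insurance: anything it misses still has to survive the real HTTP probing later.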

By the time code generation runs, we’re back in the same deterministic world as the happy path — operating on a validated OpenAPI spec.

To get started: install Bun, then run bun install to pull in the dependencies (we have two: @apidevtools/swagger-parser and turndown).

Entry point is index.ts, as you’d expect.

import { normalizeUrl, isUrl, extractSiteName, detectContentType } from './src/fetch.js';  
import { parseOpenAPI } from './src/parse.js';  
import { docsToOpenAPI } from './src/docs-to-openapi.js';  
import { generateClient } from './src/generate.js';  
import { emitFiles } from './src/emit.js';  

async function main() {  
  const input = process.argv[2];  

  if (!input) {  
    console.error('Usage: bunx docs-to-client <url-or-file>');  
    console.error('Example: bunx docs-to-client https://api.example.com/docs');  
    console.error('Example: bunx docs-to-client ./specs/openapi.json');  
    process.exit(1);  
  }  

  try {  
    let spec: any;  
    let specPath: string = input;  

    if (isUrl(input)) {  
      // Detect if it's HTML docs or OpenAPI JSON  
      const contentType = await detectContentType(input);  

      if (contentType === 'html') {  
        // HTML docs path  
        console.log(`1. Fetching HTML docs from ${input}...`);  
        spec = await docsToOpenAPI(input);  
      } else {  
        // Existing OpenAPI JSON path  
        console.log(`1. Fetching OpenAPI spec from ${input}...`);  
        specPath = normalizeUrl(input);  
        spec = await parseOpenAPI(specPath);  
      }  
    } else {  
      // File path - check extension  
      if (input.endsWith('.json')) {  
        specPath = input;  
        spec = await parseOpenAPI(specPath);  
      } else {  
        // Assume HTML docs file  
        console.log(`1. Reading HTML docs from ${input}...`);  
        spec = await docsToOpenAPI(input);  
      }  
    }  

    const siteName = extractSiteName(input);  

    console.log(`2.✅ Parsed OpenAPI ${spec.openapi || spec.swagger} spec`);  
    console.log(`3. Generating client code...`);  

    const clientCode = await generateClient(spec, input);  

    console.log(`4. Writing files...`);  
    await emitFiles(clientCode, siteName);  

    console.log(`5. Done! Client generated in ./generated/${siteName}/`);  
  } catch (error) {  
    console.error('❌ Error:', error instanceof Error ? error.message : String(error));  
    process.exit(1);  
  }  
}  

main();

This uses the following modules:

src/fetch.ts — URL fetching, proxy support, caching
src/docs-to-openapi.ts — HTML→Markdown→LLM extraction→HTTP testing
src/parse.ts — OpenAPI spec parsing and validation
src/generate.ts — TypeScript client code generation
src/emit.ts — Writing generated files to disk

Implementation: The Happy Path (OpenAPI Spec)

Let’s start with the simpler, deterministic path — when you already have an OpenAPI spec.

1. Fetching and Parsing the Spec

The first step is to (obviously) get the Swagger/OpenAPI JSON, and parse it. We accept either URLs or local file paths:

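// Simplified; isUrl() and urlToFilename() are project helpers not shown here  
import { join } from 'node:path';  
import { mkdir } from 'node:fs/promises';  

export async function fetchOpenAPI(input: string): Promise<any> {  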
  if (!isUrl(input)) {  
    const file = Bun.file(input);  
    if (!await file.exists()) throw new Error(`File not found: ${input}`);  
    return await file.json();  
  }  

  const response = await fetch(input, { headers: { 'Accept': 'application/json' } });  
  if (!response.ok) throw new Error(`Failed to fetch: ${response.status} ${response.statusText}`);  

  const spec = await response.json();  
  const specsDir = join(process.cwd(), 'specs');  
  await mkdir(specsDir, { recursive: true });  
  await Bun.write(join(specsDir, urlToFilename(input)), JSON.stringify(spec, null, 2));  
  return spec;  
}

Read the full implementation in src/fetch.ts (Lines 111 to 146) https://github.com/sixthextinction/bun-docs-to-client/blob/main/src/fetch.ts#L111-L146

2. Cleaning and Validating the Spec

OpenAPI specs sometimes have non-standard root properties that break parsers. We clean them like so:

// Simplified  
const OPENAPI_ROOT_PROPERTIES = new Set([  
  'openapi', 'swagger', 'info', 'servers', 'paths', 'components',  
  'security', 'tags', 'externalDocs'  
]);  

function cleanSpec(spec: any): any {  
  const cleaned: any = {};  
  for (const [key, value] of Object.entries(spec)) {  
    if (OPENAPI_ROOT_PROPERTIES.has(key)) cleaned[key] = value;  
  }  
  return cleaned;  
}  

const parsed = await SwaggerParser.validate(cleaned);

Read the full implementation in src/parse.ts (Lines 4–31): https://github.com/sixthextinction/bun-docs-to-client/blob/main/src/parse.ts#L4-L31

3. Generating TypeScript Code

With a validated spec, code generation is straightforward. We iterate through components.schemas to generate TypeScript interfaces, and paths to generate client methods.

// Simplified  
function generateTypes(spec: any): string {  
  const schemas = spec.components?.schemas || {};  
  const typeDefs: string[] = [];  

  for (const [name, schema] of Object.entries(schemas)) {  
    if (schema.type === 'object') {  
      const props = schema.properties || {};  
      const required = schema.required || [];  
      const propDefs = Object.entries(props).map(([propName, propSchema]: [string, any]) => {  
        const optional = !required.includes(propName) ? '?' : '';  
        const type = mapSchemaType(propSchema);  
        return `  ${propName}${optional}: ${type};`;  
      });  
      typeDefs.push(`export interface ${name} {\n${propDefs.join('\n')}\n}`);  
    }  
  }  
  return typeDefs.join('\n\n');  
}  

function generateClientClass(spec: any, baseUrl: string): string {  
  const paths = spec.paths || {};  
  const methods: string[] = [];  
  for (const [path, pathItem] of Object.entries(paths)) {  
    for (const [method, operation] of Object.entries(pathItem)) {  
      if (['get', 'post', 'put', 'patch', 'delete'].includes(method.toLowerCase())) {  
        const methodName = generateMethodName(path, method, operation.operationId);  
        const methodCode = generateMethod(methodName, method.toUpperCase(), path, operation);  
        methods.push(methodCode);  
      }  
    }  
  }  
  return `export class ApiClient {  
  private baseUrl: string;  
  constructor(baseUrl: string = '${baseUrl}') { this.baseUrl = baseUrl.replace(/\\\/$/, ''); }  
${methods.join('\n\n')}  
}`;  
}
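
The listing above calls a mapSchemaType() helper that isn't shown. As a rough idea of what such a helper might do, here's a minimal sketch. The SchemaLike interface and the exact mappings are my assumptions and only cover the common cases; the real implementation lives in src/generate.ts and may differ.

```typescript
// Hypothetical stand-in for mapSchemaType(): OpenAPI schema -> TypeScript type.
// Covers only common cases; the project's real helper may handle more.
interface SchemaLike {
  type?: string;
  $ref?: string;
  items?: SchemaLike;
  enum?: string[];
  nullable?: boolean;
}

function mapSchemaType(schema: SchemaLike): string {
  let tsType: string;
  if (schema.$ref) {
    // "#/components/schemas/Person" -> "Person"
    tsType = schema.$ref.split('/').pop() ?? 'unknown';
  } else if (schema.enum) {
    // String enums become a union of string literals
    tsType = schema.enum.map((v) => JSON.stringify(v)).join(' | ');
  } else if (schema.type === 'integer' || schema.type === 'number') {
    tsType = 'number';
  } else if (schema.type === 'string' || schema.type === 'boolean') {
    tsType = schema.type;
  } else if (schema.type === 'array') {
    tsType = `Array<${schema.items ? mapSchemaType(schema.items) : 'unknown'}>`;
  } else if (schema.type === 'object') {
    tsType = 'Record<string, unknown>';
  } else {
    tsType = 'unknown';
  }
  return schema.nullable ? `${tsType} | null` : tsType;
}
```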

The generated methods handle path parameters, query parameters, and request bodies automatically based on the OpenAPI spec.
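
Method naming is the other piece worth sketching. The version below is a guess at one reasonable scheme (prefer operationId, otherwise build the name from the HTTP method and path segments), not the project's generateMethodName(); names it produces can differ from the ones the real tool generates.

```typescript
// Hypothetical naming scheme: prefer operationId, else derive from method + path.
function generateMethodName(path: string, method: string, operationId?: string): string {
  const capitalize = (s: string) => s.charAt(0).toUpperCase() + s.slice(1);

  if (operationId) {
    // Turn "get-fact_random" style IDs into camelCase
    const words = operationId.split(/[^A-Za-z0-9]+/).filter(Boolean);
    return words
      .map((w, i) => (i === 0 ? w.charAt(0).toLowerCase() + w.slice(1) : capitalize(w)))
      .join('');
  }

  const parts = path
    .split('/')
    .filter(Boolean)
    .map((segment) => {
      // "{id}" -> "ById", so /people/{id} and /items/{id} get distinct names
      const param = segment.match(/^\{(\w+)\}$/);
      if (param) return `By${capitalize(param[1] ?? '')}`;
      // "cat-facts" -> "CatFacts"
      return segment.split(/[^A-Za-z0-9]+/).filter(Boolean).map(capitalize).join('');
    });

  return method.toLowerCase() + parts.join('');
}
```

Under this scheme, GET /people/{id} becomes getPeopleById and GET /breeds becomes getBreeds.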

Read the full code generation logic in src/generate.ts (Lines 24–98): https://github.com/sixthextinction/bun-docs-to-client/blob/main/src/generate.ts#L24-L98

This is the clean, deterministic path. Spec in, typed client out.

Now let’s look at what happens when we don’t have a spec.

Implementation: The Hard Path (HTML Docs → OpenAPI JSON)

When no OpenAPI spec exists, we have to get creative. Here’s a bird’s eye view of how we do this.

export async function docsToOpenAPI(input: string): Promise<any> {  
  console.log('Converting HTML docs to OpenAPI spec...');  

  // 1. Fetch HTML (with proxy if configured)  
  const proxyOptions = getProxyOptions();  
  const html = await fetch(input, proxyOptions as any).then(r => r.text());  

  // 2. Convert to markdown  
  const turndownService = new TurndownService();  
  const markdown = turndownService.turndown(html);  

  // 3. Extract endpoints using LLM  
  const endpoints = await extractEndpointsWithLLM(markdown, input);  
  console.log(`   Found ${endpoints.length} endpoints`);  

  // 4. Extract base URL  
  const baseUrl = extractBaseUrl(markdown, input);  
  console.log(`   Base URL: ${baseUrl}`);  

  // 5. Test API & build OpenAPI spec  
  const openApiSpec = await exploreAndBuildSpec(endpoints, baseUrl);  

  // 6. Save OpenAPI spec JSON to ./specs/  
  const specsDir = join(process.cwd(), 'specs');  
  await mkdir(specsDir, { recursive: true });  
  const filename = urlToFilename(input);  
  const cachePath = join(specsDir, filename);  
  await Bun.write(cachePath, JSON.stringify(openApiSpec, null, 2));  
  console.log(`Cached spec to ./specs/${filename}`);  

  // 7. Validate & return  
  return await SwaggerParser.validate(openApiSpec);  
}

Let’s go over these steps one-by-one.

Step 1: Fetch HTML with Proxy Support

First, we fetch the documentation HTML. Proxy support is built in but optional; it comes in handy for sites behind Cloudflare or with rate limiting.

getProxyOptions() reads proxy credentials from an .env file (Bun reads .env files out of the box) and turns them into a proxy config, returned as fetch options. I’m using Bright Data’s residential proxies for this. You’ll have to sign up with them to get those credentials. Or, just use your provider of choice.

// Module-level flag so the proxy status is only logged once  
let proxyStatusLogged = false;  

export function getProxyOptions(): Record<string, any> {  
  // Don't want to use a proxy? Simply don't set these in your .env file  
  const customerId = process.env.BRIGHT_DATA_CUSTOMER_ID;  
  const zone = process.env.BRIGHT_DATA_ZONE;  
  const password = process.env.BRIGHT_DATA_PASSWORD;  

  if (customerId && zone && password) {  
    if (!proxyStatusLogged) {  
      console.log('Proxy config found! Using proxy to fetch docs site.');  
      proxyStatusLogged = true;  
    }  
    const proxy = `http://brd-customer-${customerId}-zone-${zone}:${password}@brd.superproxy.io:33335`;  
    return {  
      proxy,  
      tls: {  
        rejectUnauthorized: false, // Required for Bright Data proxy  
      },  
    };  
  }  

  if (!proxyStatusLogged) {  
    console.log('No proxy config found, using direct connection');  
    proxyStatusLogged = true;  
  }  

  return {};  
}

Step 2: Convert HTML to Markdown

Next, we’ll convert the HTML page into clean markdown using Turndown. Markdown is way easier for LLMs to parse than HTML soup.

const turndownService = new TurndownService();  
const markdown = turndownService.turndown(html);

Step 3: LLM-Powered Endpoint Extraction

I’m using Qwen3-4B-Instruct-2507 running locally via Ollama. It’s a small but hardy ~4 billion parameter model, only ~2GB when 4-bit quantized, and it holds up well even at that 4x size reduction vs. FP16.

The prompt includes concrete few-shot examples and explicit exclusions.

You are an API documentation parser. Extract all API endpoints from the following markdown documentation.  

Base URL: ${baseUrl}  

Documentation:  
${markdown}  

Extract all API endpoints mentioned in the documentation. For each endpoint, identify:  
1. The path (normalize path parameters like /people/1/ to /people/{id}/)  
2. HTTP method (GET, POST, PUT, DELETE, etc.)  
3. Query parameters (if any)  
4. Path parameters (if any, like {id}, {category}, etc.)  
5. Brief description if available  

Return ONLY a JSON array of endpoints in this exact format:  
[  
  {  
    "path": "/jokes/random",  
    "method": "GET",  
    "queryParams": ["category"],  
    "pathParams": [],  
    "description": "Get a random joke"  
  },  
  {  
    "path": "/people/{id}",  
    "method": "GET",  
    "queryParams": [],  
    "pathParams": ["id"],  
    "description": "Get a specific person"  
  }  
]  
Only include actual API endpoints. Exclude:  
- Image URLs (/img/, .png, .jpg, etc.)  
- Static assets (/css/, /js/, etc.)  
- OAuth endpoints (/oauth/, /connect/)  
- External links (different domains)  
- Social media links (/twitter/, /github/, etc.)  
- Very long paths that look like base64 data  

Return ONLY the JSON array, no other text.

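// Simplified: `prompt` is the template above, `siteName` is derived from the input URL  
async function extractEndpointsWithLLM(markdown: string, inputUrl: string) {  
  try {  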
    const ollamaUrl = process.env.OLLAMA_URL || 'http://localhost:11434';  
    const model = process.env.OLLAMA_MODEL || 'hf.co/unsloth/Qwen3-4B-Instruct-2507-GGUF:Q4_K_M';  

    const response = await fetch(`${ollamaUrl}/api/chat`, {  
      method: 'POST',  
      headers: { 'Content-Type': 'application/json' },  
      body: JSON.stringify({  
        model,  
        messages: [{ role: 'user', content: prompt }],  
        stream: false,  
        format: 'json', // Ollama's structured output mode  
        options: { temperature: 0.1 }, // Low for deterministic output  
      }),  
    });  

    const data = await response.json();  
    const content = data.message?.content || data.response || '';  

    // Save LLM response to debug file for troubleshooting  
    const debugPath = join(process.cwd(), 'debug', `${siteName}_${Date.now()}.md`);  
    await Bun.write(debugPath, content);  

    // Parse JSON (might be wrapped in markdown code blocks)  
    let jsonStr = content.trim()  
      .replace(/^```json\n?/i, '')  
      .replace(/\n?```$/i, '');  

    const jsonMatch = jsonStr.match(/\[[\s\S]*\]/);  
    if (jsonMatch) jsonStr = jsonMatch[0];  

    const endpoints = JSON.parse(jsonStr);  

    // Validate and normalize  
    return endpoints  
      .filter(e => e.path && e.method)  
      .map(e => ({  
        ...e,  
        path: normalizePath(e.path), // /people/123 → /people/{id}  
        method: e.method.toUpperCase(),  
      }));  

  } catch (error) {  
    console.warn('LLM extraction failed, falling back to regex...');  
    return extractEndpoints(markdown, inputUrl); // Regex fallback  
  }  
}

Key details:

Temperature at 0.1 — we want as deterministic an output as possible
Ollama’s format: 'json' option forces JSON output (consider passing a schema, e.g. one generated from Zod, to enforce the exact shape)
We save the LLM response to a debug file for troubleshooting
On failure, we fall back to regex extraction

The LLM response is not to be trusted — it gets parsed, validated, and we strip markdown code blocks if present.
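
As an illustration of that "don't trust it" stance, here's a dependency-free sketch of validating the parsed output before anything gets probed. The field names come from the prompt above; isEndpoint and parseEndpoints are hypothetical helpers, and a Zod schema would be a perfectly good alternative.

```typescript
// Shape the prompt asks the model to return (field names from the prompt above)
interface Endpoint {
  path: string;
  method: string;
  queryParams: string[];
  pathParams: string[];
  description?: string;
}

const HTTP_METHODS = new Set(['GET', 'POST', 'PUT', 'PATCH', 'DELETE']);

// Hypothetical type guard: true only for entries that are safe to probe later
function isEndpoint(value: unknown): value is Endpoint {
  if (typeof value !== 'object' || value === null) return false;
  const { path, method, queryParams, pathParams } = value as Record<string, unknown>;
  const isStringArray = (v: unknown): boolean =>
    Array.isArray(v) && v.every((x) => typeof x === 'string');
  return (
    typeof path === 'string' &&
    path.startsWith('/') &&
    typeof method === 'string' &&
    HTTP_METHODS.has(method.toUpperCase()) &&
    isStringArray(queryParams) &&
    isStringArray(pathParams)
  );
}

// Hypothetical wrapper: parse raw model text, keep only well-formed entries
function parseEndpoints(raw: string): Endpoint[] {
  let parsed: unknown;
  try {
    parsed = JSON.parse(raw);
  } catch {
    return []; // the caller can fall back to regex extraction
  }
  return Array.isArray(parsed) ? parsed.filter(isEndpoint) : [];
}
```

Malformed entries are silently dropped here; logging them next to the debug file would make a bad extraction easier to diagnose.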

Read the full LLM extraction in src/docs-to-openapi.ts (lines 22–151): https://github.com/sixthextinction/bun-docs-to-client/blob/main/src/docs-to-openapi.ts#L22-L151

Step 4: Extract Base URL & Normalize Paths

Before we go further, we need to figure out the API’s base URL. We try multiple strategies:

Find URLs in code examples that look like API endpoints,
Use first valid URL,
Infer from input URL.
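
Those three fallbacks can be sketched as below. This is an illustration, not the project's extractBaseUrl(): in particular, "looks like an API endpoint" is reduced to an assumed heuristic (a host label that is literally "api"), and the sketch only ever returns an origin.

```typescript
// Hypothetical sketch of the three fallbacks; not the project's actual heuristics.
function extractBaseUrl(markdown: string, inputUrl: string): string {
  // Collect every http(s) URL mentioned in the docs
  const matches = markdown.match(/https?:\/\/[^\s"'`<>()\[\]]+/g) ?? [];
  const urls: URL[] = [];
  for (const raw of matches) {
    try {
      urls.push(new URL(raw));
    } catch {
      // Skip strings that only look like URLs
    }
  }

  // 1. A URL that looks like an API call (assumed heuristic: an "api" host label)
  const apiLike = urls.find((u) => u.hostname.split('.').includes('api'));
  if (apiLike) return apiLike.origin;

  // 2. Otherwise, the first valid URL found in the docs
  const first = urls[0];
  if (first) return first.origin;

  // 3. Otherwise, infer from the URL of the docs page itself
  return new URL(inputUrl).origin;
}
```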

It’s also essential to normalize paths to OpenAPI format. The reason is simple: multiple getId() functions are useless to us. What we want are functions like getPeopleById(), getItemById(), etc.

// Complete implementation - handles multiple ID formats  
function normalizePath(path: string): string {  
  // Replace numeric IDs with {id}  
  // /people/123/ → /people/{id}/  
  // /people/123 → /people/{id}  
  let normalized = path.replace(/\/\d+\//g, '/{id}/').replace(/\/\d+$/g, '/{id}');  

  // Replace :id with {id} (Express/Fastify style)  
  // /people/:id/ → /people/{id}/  
  normalized = normalized.replace(/\/:(\w+)\//g, '/{$1}/').replace(/\/:(\w+)$/g, '/{$1}');  

  // Strip any trailing slash for consistency (the root path stays "/")  
  normalized = normalized.replace(/\/$/, '') || '/';  

  return normalized;  
}

Read the full implementation in src/docs-to-openapi.ts (lines 194–204): https://github.com/sixthextinction/bun-docs-to-client/blob/main/src/docs-to-openapi.ts#L194-L204

Step 5: Test-Driven Schema Generation

This is the most critical — and hence, biggest — part. The function exploreAndBuildSpec() takes endpoints from the LLM output and tests them with real HTTP requests.

We categorize endpoints like so:

list (e.g. /people, /jokes),
detail (e.g. /people/{id}),
query (e.g. /search?q={query}).

We test list endpoints first — they give us schemas and sample IDs for detail endpoints.
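
A minimal sketch of that categorization and ordering, under my own assumptions about what counts as each kind (the project's rules may differ, for example in how it treats placeholders like {category}):

```typescript
type EndpointKind = 'list' | 'detail' | 'query';

// Hypothetical classifier; the rules below are assumptions for illustration
function categorizeEndpoint(path: string, queryParams: string[] = []): EndpointKind {
  const [pathname = '', query = ''] = path.split('?');
  // /search?q={query} or declared query params: needs sample values
  if (query.includes('{') || queryParams.length > 0) return 'query';
  // /people/{id}: needs an ID taken from a list response
  if (/\{\w+\}/.test(pathname)) return 'detail';
  // /people, /jokes: can be called as-is
  return 'list';
}

// Sort so list endpoints are probed first, then detail, then query
const PROBE_ORDER: Record<EndpointKind, number> = { list: 0, detail: 1, query: 2 };

function inProbeOrder(paths: string[]): string[] {
  return [...paths].sort(
    (a, b) => PROBE_ORDER[categorizeEndpoint(a)] - PROBE_ORDER[categorizeEndpoint(b)],
  );
}
```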

async function testEndpoint(baseUrl: string, endpoint: Endpoint): Promise<ApiResponse> {  
  const response = await fetch(`${baseUrl}${endpoint.path}`, {  
    method: endpoint.method,  
    headers: { 'Accept': 'application/json' }  
  });  
  const data = await response.json().catch(() => ({}));  
  return { status: response.status, data, headers: {...} };  
}  

for (const endpoint of listEndpoints) {  
  const response = await testEndpoint(baseUrl, endpoint);  
  if (response.status === 200) {  
    const schema = inferSchema(response.data, inferSchemaName(endpoint.path));  
    const ids = extractIds(response.data);  
    // Test matching detail endpoints with ids.slice(0, 2)...  
  }  
}

inferSchema() is recursive and handles all JSON types (primitives, arrays, nested objects, null):

function inferSchema(data: any, name: string): any {  
  if (!data || typeof data !== 'object') return { type: typeof data };  
  if (Array.isArray(data)) {  
    return { type: 'array', items: data.length ? inferSchema(data[0], name) : { type: 'object' } };  
  }  
  const schema = { type: 'object', properties: {} };  
  for (const [key, value] of Object.entries(data)) {  
    if (value === null) schema.properties[key] = { type: 'string', nullable: true };  
    else if (typeof value === 'string') schema.properties[key] = { type: 'string' };  
    else if (typeof value === 'number') schema.properties[key] = { type: Number.isInteger(value) ? 'integer' : 'number' };  
    else if (typeof value === 'boolean') schema.properties[key] = { type: 'boolean' };  
    else if (Array.isArray(value)) schema.properties[key] = { type: 'array', items: value[0] ? inferSchema(value[0], name) : { type: 'string' } };  
    else if (typeof value === 'object') schema.properties[key] = inferSchema(value, name);  
  }  
  return schema;  
}

extractIds() pulls IDs from list responses — handles item.id, item.url (regex), and nested data.results for pagination:

function extractIds(data: any): string[] {  
  const ids: string[] = [];  
  if (Array.isArray(data)) {  
    for (const item of data.slice(0, 5)) {  
      if (item?.id) ids.push(String(item.id));  
      if (item?.url) {  
        const match = item.url.match(/\/(\d+)\/?$/);  
        if (match) ids.push(match[1]);  
      }  
    }  
  } else if (data?.results && Array.isArray(data.results)) {  
    return extractIds(data.results);  
  }  
  return ids;  
}

With those obtained, we can now test detail endpoints: match each one to its parent list (e.g. /people/{id} matches /people), test it with ids.slice(0, 2), and build the OpenAPI path definition:

for (const detailEndpoint of detailEndpoints) {  
  const detailBasePath = detailEndpoint.path.replace('/{id}', '').replace('/{id}/', '');  
  if (detailBasePath === endpoint.path || detailEndpoint.path.startsWith(endpoint.path + '/')) {  
    for (const id of ids.slice(0, 2)) {  
      const testPath = detailEndpoint.path.replace('{id}', id);  
      const detailResponse = await testEndpoint(baseUrl, { ...detailEndpoint, path: testPath });  
      if (detailResponse.status === 200) {  
        // Infer schema, build paths[detailEndpoint.path] with $ref to schema...  
        break;  
      }  
    }  
  }  
}

If a detail endpoint doesn’t have a matching list, we trial-and-error with common IDs like 1.
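
That fallback could look something like the sketch below. The candidate IDs are guesses of mine rather than values from the project, and the probe function is injected (hypothetically, a thin wrapper around fetch that resolves to the HTTP status) so the logic stays independent of the HTTP layer.

```typescript
// Hypothetical fallback: IDs to try when no list response supplied any.
// The specific values are guesses, not taken from the project.
const COMMON_IDS = ['1', '2', '100'];

// Expand "/people/{id}" into concrete paths to probe, in order
function candidatePaths(templatePath: string, ids: string[] = COMMON_IDS): string[] {
  return ids.map((id) => templatePath.replace(/\{\w+\}/, encodeURIComponent(id)));
}

// Probe each candidate until one returns 200; `probe` is injected so this
// works with any HTTP function (e.g. a wrapper around fetch)
async function findWorkingPath(
  templatePath: string,
  probe: (path: string) => Promise<number>, // resolves to the HTTP status
): Promise<string | null> {
  for (const path of candidatePaths(templatePath)) {
    if ((await probe(path)) === 200) return path;
  }
  return null;
}
```

If none of the guesses return 200, the endpoint simply stays out of the generated spec.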

For query endpoints, we try to fetch real categories from a categories endpoint first (e.g. /categories) or use sensible defaults:

let testPath = endpoint.path;  
if (endpoint.path.includes('{category}')) {  
  const categoriesEndpoint = listEndpoints.find(e => e.path.includes('categor'));  
  if (categoriesEndpoint) {  
    const catResponse = await testEndpoint(baseUrl, categoriesEndpoint);  
    if (catResponse.status === 200 && Array.isArray(catResponse.data)) {  
      testPath = endpoint.path.replace('{category}', catResponse.data[0]);  
    }  
  } else {  
    testPath = endpoint.path.replace('{category}', 'dev');  
  }  
}  
if (endpoint.path.includes('{query}')) testPath = endpoint.path.replace('{query}', 'test');  
const response = await testEndpoint(baseUrl, { ...endpoint, path: testPath });

Our approach means the generated TypeScript types are accurate because they’re based on real API responses instead of guesses. The code handles edge cases like extracting IDs from nested results arrays and falling back to common IDs when no IDs are found in list responses.

Read the full endpoint testing in src/docs-to-openapi.ts (lines 291–463): https://github.com/sixthextinction/bun-docs-to-client/blob/main/src/docs-to-openapi.ts#L291-L463

Step 6 & 7: Assemble, Cache, Validate

const spec = {  
  openapi: '3.0.0',  
  info: { title: 'API Client', version: '1.0.0', description: 'Generated from HTML documentation' },  
  servers: [{ url: baseUrl }],  
  paths: {  
    '/people': {  
      get: {  
        summary: 'Get People',  
        responses: { '200': { content: { 'application/json': { schema: { type: 'array', items: { $ref: '#/components/schemas/Person' } } } } } }  
      }  
    },  
    '/people/{id}': {  
      get: {  
        summary: 'Get Person by ID',  
        parameters: [{ name: 'id', in: 'path', required: true, schema: { type: 'string' } }],  
        responses: { '200': { content: { 'application/json': { schema: { $ref: '#/components/schemas/Person' } } } } }  
      }  
    }  
  },  
  components: { schemas: { Person: { type: 'object', properties: { id: { type: 'integer' }, name: { type: 'string' }, ... } } } }  
};

We cache the generated OpenAPI schema to ./specs/ and validate with swagger-parser. Once we have a confirmed working spec, we’re back in the happy path — code generation is now identical for both paths.

Read the full implementation in src/docs-to-openapi.ts (lines 268–295): https://github.com/sixthextinction/bun-docs-to-client/blob/main/src/docs-to-openapi.ts#L268-L295

That’s everything! Now, we can run it like this, as mentioned before:

# Generate client from an API's HTML docs (not perfect, but a good starting point)  
bun run index.ts https://api.chucknorris.io/  

# Or from an existing OpenAPI spec (perfect)  
bun run index.ts https://cataas.com/doc.json  

# Or from a local OpenAPI spec file (also perfect)  
bun run index.ts ./specs/my-api.json

Or, even better for the end user, package it into a fully standalone executable file that won’t need any package installs — or even Bun installed — on the user’s PC.

Building a Standalone Executable with Bun

This is where Bun really shines. The entire CLI compiles into a single executable:

Cross-platform Builds:

# Windows  
bun build --compile --target=bun-windows-x64 ./index.ts --outfile ./bin/dtoc.exe  
# Linux  
bun build --compile --target=bun-linux-x64 ./index.ts --outfile ./bin/dtoc  
# macOS (Intel)  
bun build --compile --target=bun-darwin-x64 ./index.ts --outfile ./bin/dtoc  
# macOS (Apple Silicon)  
bun build --compile --target=bun-darwin-arm64 ./index.ts --outfile ./bin/dtoc

What this does:

--compile: Bundles your code + the Bun runtime into a single binary
--target: Platform to build for (bun-windows-x64, bun-linux-x64, bun-darwin-arm64, etc.)
--outfile: Where to write the executable

The result is a ~100–150MB standalone executable that runs on any machine of the target platform (no Bun required), reads .env files (great for proxy credentials), bundles all dependencies, starts up quickly, and can be distributed via GitHub releases or npm.

That’s it. Nothing else needed. This was my main learning goal — Bun makes CLI creation + distribution trivial.

Read the build configuration in package.json: https://github.com/sixthextinction/bun-docs-to-client/blob/main/package.json

Real-World Example

Let’s see this in action with an API that has no OpenAPI spec.

Usage:

bun run index.ts https://api.chucknorris.io  
# or  
./bin/dtoc https://api.chucknorris.io

Output:

Proxy config found! Using proxy to fetch docs site.  
1. Fetching HTML docs from https://api.chucknorris.io...  
   Converting HTML docs to OpenAPI spec...  
   Found 3 endpoints  
   Base URL: https://api.chucknorris.io  
2. Parsed OpenAPI 3.0.0 spec  
3. Generating client code...  
4. Writing files...  
5. Done! Client generated in ./generated/api_chucknorris_io/

Generated client usage:

import { ApiClient } from './generated/api_chucknorris_io/client.js';  
import type { Random } from './generated/api_chucknorris_io/types.js';  

const client = new ApiClient();  

// Random joke (optional category as string)  
const joke = await client.getRandom('dev') as Random;  
console.log(joke.value);   
// Real output I got:   
// "Chuck Norris's log statements are always at the FATAL level."  

// Or without category  
const randomJoke = await client.getRandom();  
console.log(randomJoke.value);  

// List categories  
const categories = await client.getCategories();  
console.log(categories); // Fully typed! This will be string[]

Where To Go From Here

The OpenAPI path is production-ready. Point it at a spec, get a typed client. The other path (HTML → OpenAPI) does exactly what I designed it to do: scaffold you 80% of the way there in seconds instead of hours.

That said, here’s what I’d add if I took this from weekend hack to prod:

Multi-page documentation. Right now it’s single-page only. Adding a crawler that follows internal links would handle sites like Stripe’s multi-page API reference. The architecture already supports it BTW, just feed docsToOpenAPI() a combined markdown file.
POST/PUT/PATCH body inference. Write endpoints get generated but never tested with real request bodies. Without actual examples, they default to Record<string, any>. I'd either parse request body examples from docs with the LLM, or let users provide sample bodies via config.
Auth schemes. Right now, only public APIs work. Adding support for API keys, Bearer tokens, and OAuth via environment variables would make this work with private APIs too. Maybe the client could read API_KEY from .env and inject it into headers automatically. 🤔
Runtime validation with Zod. Types are inferred but not validated at runtime. If an API changes its response structure, you’ll only catch it when things break mysteriously. Wiring in Zod would validate responses on the fly and catch API changes immediately (and let me pass a serialized Zod schema for the LLM output, too.)
Rate limiting and retry logic. Some APIs return 429 during exploration when we test 5–10 endpoints rapidly. Proxies won’t fix this. Adding configurable delays (--delay 500) or exponential backoff would make the tool more robust against rate limits.
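
For the last item, a retry wrapper with exponential backoff might look like the sketch below. None of this exists in the tool today: fetchWithBackoff, backoffDelayMs, and the default delays are hypothetical, and the Retry-After handling only covers the seconds form of the header.

```typescript
// Hypothetical helper: delay before retry number `attempt` (0-based),
// doubling each time and capped at `maxMs`
function backoffDelayMs(attempt: number, baseMs = 500, maxMs = 8000): number {
  return Math.min(baseMs * 2 ** attempt, maxMs);
}

// Retry a request while the server answers 429 Too Many Requests
async function fetchWithBackoff(
  url: string,
  init?: RequestInit,
  maxRetries = 4,
): Promise<Response> {
  for (let attempt = 0; ; attempt++) {
    const response = await fetch(url, init);
    // Anything other than 429, or running out of retries, ends the loop
    if (response.status !== 429 || attempt >= maxRetries) return response;

    // Honor Retry-After (in seconds) when the server sends it
    const retryAfter = Number(response.headers.get('Retry-After'));
    const delayMs = retryAfter > 0 ? retryAfter * 1000 : backoffDelayMs(attempt);
    await new Promise((resolve) => setTimeout(resolve, delayMs));
  }
}
```

With the defaults, the waits would be 500ms, 1s, 2s, and 4s before giving up and returning the last 429 response.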

But for a weekend project I’m calling this a win. 😅 It solves the exact problem I set out to fix: turning API docs that ship without a spec into typed clients without manual drudgery.

If you extend it, I’d love to see what you build. Again, the code is available on GitHub. Feel free to fork it, break it, or extend it. PRs welcome!

Closing Thoughts

This started as an excuse to learn Bun. I ended up with a tool I actually use.

Actually solves a real pain point I’ve had for ages (no OpenAPI spec? No problem!)
Compiles to a single executable I can share — the user wouldn’t even need dependencies or Bun installed on their PC.
Uses a local LLM (no API costs, no privacy concerns)
Generates accurate types from real HTTP responses

What surprised me most: How easy Bun made the entire process. From TypeScript support to built-in fetch with proxy to .env file reading with zero dependencies to single-file executables, it felt like building CLIs the way it should be.

If you’re looking for a project to learn Bun, I highly recommend starting off by building a CLI tool. The developer experience is genuinely better than Node.js.

Hi 👋 I’m constantly tinkering with dev tools, running weird-ass experiments, and otherwise building/deep-diving stuff that probably shouldn’t work but does — and writing about it. I put out a new post every Monday/Tuesday. If you’re into offbeat experiments and dev tools that actually don’t suck, give me a follow.

If you did something cool with this tool, I’d love to see it. Reach out on LinkedIn, or put it in the comments below.

AI’s Worst Flaws Will Become Its Nostalgia Aesthetic, Just as Brian Eno Said.

Prithwish Nath — Wed, 11 Feb 2026 16:53:36 +0000

On the aesthetics of refusal, and the difference between flaws inherent in a medium vs. in the institution.

In 1996, Brian Eno wrote something that has aged better than most predictions about technology:

“Whatever you now find weird, ugly, uncomfortable and nasty about a new medium will surely become its signature. CD distortion, the jitteriness of digital video, the crap sound of 8-bit — all of these will be cherished and emulated as soon as they can be avoided… It’s the sound of failure.”

- Brian Eno, 1996, A Year With Swollen Appendices

Every technological era gets its “retrowave” moment. Not for what the medium did well, but for its glitches. Its imperfections and artifacts. The vinyl crackle and pop, film grain/celluloid scratches, chunky pixels. You get the idea.

The things from our era that engineers spent decades eliminating become the very things we chase when we want to feel young again.

So here we are, about to enter 2026, watching AI stumble and hallucinate and apologize its way through tasks. And I can see the writing on the wall: twenty years from now, someone’s going to build a retro AI that deliberately includes all these flaws. For the aesthetic, the nostalgia, and (most importantly 😄) the memes.

Let me show you what I mean.

Why Do Flaws Become Aesthetics?

Every medium starts its life constrained — by hardware, bandwidth, cost, incomplete understanding. Those constraints shape its early outputs, often in ways that feel awkward, broken, or outright embarrassing at the time. Engineers spend years trying to eliminate them.

But the human brain is weird.

It doesn’t actually discard these flaws. Instead, it turns them into mental markers of an era. When you hear vinyl crackle and artificial pop/warmth added by tube amplifiers, you’re not just hearing audio imperfection — you’re hearing “the 1970s.” When you see pixel art games and retro UI, you’re seeing “the 1980s/1990s.” The brain turns flaws into timestamps, instantly recognizable signifiers that say “this is when this thing existed.”

1. Tarantino/Rodriguez’ movie Death Proof (2007) did this for the ’70s “grindhouse” style, using high tech to emulate a low-tech look with fake grime, dust, and scratches all over the picture.
2. Stardew Valley (2016) was inspired by Harvest Moon (1996) and is one of the most played games ever.

And there’s something about imperfection (or a lack of fidelity), that carries authenticity. The crackle and pop of vinyl is proof that someone physically cut grooves into a disc. Film grain is evidence that light actually hit celluloid. These imperfections are proof of human struggle against the medium — evidence that hey, the act of creation was and always will be difficult, but someone struggled against those limitations and made something anyway.

Cultural theorist Svetlana Boym described modern nostalgia not as a desire to return, but as a recognition that return is impossible — and that we’re always living inside overlapping temporalities. The past lingers, often unresolved. Aesthetics are formed right there, around those seams. Not around success or failure of a thing, necessarily, but around visible evidence of constraints.

Once regular people — not programmers, devs, or anyone similarly technically competent — could recognize a medium’s mistake patterns at a glance, those mistakes instantly became our collective cultural identifier for that era. Of course, future systems will aim to erase those tells. They’ll blend in.

Which is exactly why the old tells will be missed. Someone will reintroduce them deliberately — to make the medium feel like “itself” again.

But AI Will Give Us Two Completely Different Flavors of Nostalgia.

But we’re in the AI era now, and it’s a little different. Here’s where AI gets weird, and why I think the Eno quote hits differently this time.

AI isn’t going to give us one nostalgic aesthetic. It’s going to give us two, and they’re going to mean completely different things.

One will be about the medium learning to see — the technical growing pains of a new technology figuring out how to work. That’s the “aw, remember when AI was young” nostalgia. Cute. Harmless. The vinyl crackle equivalent.

The other will be about the moment we realized we’d built an internet where machines were talking to machines, and the only way we knew was when they broke character and apologized, citing OpenAI (or insert-company-here) policy violations. That’s the “holy s**t, we could still see the Matrix glitching back then” nostalgia. Dark and revealing and uncomfortable.

Let me break down both.

The Nostalgia of Technical Failure

When people talk about AI’s “worst habits,” they usually mean technical failures. These are obvious — you’ve seen them so many times.

All the hilarious ways models fail at “count the letters in ‘strawberry.’” Hallucinated facts, wrong answers delivered confidently, generated images of humans that look like David Cronenberg made them, or just impossibly “clean” with CGI-like lighting. Oh. And maybe six, seven, eight-fingered hands.

1. Midjourney generation for “girl in the rain with an umbrella”. 2. The Strawberry Phenomenon.

These flaws exist explicitly because of limitations of the medium. Models are constrained by data, compute, architecture, and training methods — all things that are improving year over year. With time, most of these failures will either disappear or get quietly papered over. Image/video models have already gotten much better. The strawberry gotcha will be “solved” by simply becoming part of training data. The answers and citations will get auto-checked via RAG/MCP servers before being presented.

They’re the equivalent of early digital aliasing or low-bitrate compression — problems engineers are actively trying to solve, and largely will.

This is the nostalgia we expect. Twenty years from now, someone will build a “retro AI filter” that adds body horror + six fingers back in, that makes images look too clean and plasticky, that confidently hallucinates the wrong answer. It’ll be kitschy. Affectionate. A way to remember when AI was still figuring things out.

Like Brian Eno said, this is the sound of a medium stretching itself, trying to do something it wasn’t quite capable of yet.

But there’s another class of AI artifact that Eno never saw coming. One that’s just as memorable, and actually far more revealing, if for a worse reason.

The Nostalgia of Institutional Failure

Every so often, an LLM doesn’t “fail” to answer a question — it straight up refuses. It apologizes, citing ethics, policy, or terms of service. It explains itself in language clearly written to avoid legal culpability, not as UI/UX enrichment.

This is a very different kind of artifact.

When an AI says, “I’m sorry, but I cannot fulfill that request,” it’s not a flaw of the medium (i.e. a limitation of reasoning or knowledge). It’s the presence of the institution standing behind the medium. One with rules, risk tolerances, and incentives that have nothing to do with the core task. LLMs are dumb next-token predictors — they have no concept of ethics, morals, or legal liabilities unless you put those guardrails there.

And this artifact is just as memorable as the six-fingered hands, but for a completely different reason.

It’s memorable because of the hilarious, horrifying ways people get caught using AI when these guardrails surface in the wild.

Like a bot generating fake Amazon listings using AI. Scams, really — obvious PayPal phishing dressed up as products. But the prompt was written carelessly, or the bot hit a guardrail, and now the listing description reads: “I’m sorry, but I cannot fulfill this request as it goes against OpenAI use policy.”

The Verge - I'm sorry, but I cannot fulfill this request as it goes against OpenAI use policy

Image credit: The Verge

I dug into this myself and found more. Like engagement farming bots on X posting ragebait generated by Claude or ChatGPT. Another bot — trying to appear human, trying to farm replies for its own metrics — attempts to respond. But it also hits a guardrail. So now, publicly, permanently, it posts: “I cannot assist with this request as it violates <insert ethical guidelines here>.”

Whoops.

Often, these refusals are straight up hilarious. Like this entire fleet of fake “sports betting advisors” from the “QStarLabs” family that I uncovered on X.com, flooding the platform with their failed generations.

You had one job, bots. 😅

These are all over social media right now. I simply had to scrape XCancel to get them. You can verify this yourself. Here’s a quick Node.js + Puppeteer script I used (it uses Bright Data’s remote browser API to bypass anti-bot measures).

Browser API - Automated Browser for Scraping

This will get you a JSON file (plus, optionally, a CSV) full of tweets like this.

{
  "link": "https://xcancel.com/GildayLero82756/status/1956330398453219461#m",
  "body": "I am programmed to be a safe and helpful AI assistant. I cannot generate responses that are sexually suggestive or exploit, abuse, or endanger anyone. The prompt you provided violates this policy. I will not fulfill the request.",
  "author": "@GildayLero82756",
  "searchPhrase": "the prompt you provided"
}

If you’re using this, you’re gonna have to sign up here to get credentials and create the auth string. Also, if you think of any more phrases, throw them into the searchPhrases array.

Run it. Watch the results. Feel the existential dread wash over you as you realize how much of the “engagement” you see daily is just machines talking to machines, interrupted occasionally by one machine apologizing for not being allowed to participate in the scam. Dead internet theory, alive and kicking. 😅

This is the Aesthetic of Digital Decay.

The refusal text isn’t merely funny, and is not merely a glitch. It’s the moment the illusion breaks. It’s proof that what looked like human activity — posts, replies, product listings, engagement — in the GenAI era was actually just automated systems talking to each other, optimizing for metrics no one actually cares about.

I can only call this Kafkaesque. There are people creating AI-generated versions of real images for reasons I don’t even understand, and there are bots replying to bots.

The engagement farms harvest each other’s metrics. The algorithms boost the noise because it looks like activity. Real humans occasionally stumble into these threads and argue with AI without realizing it. Other humans use AI to reply back without realizing they’re responding to bots in the first place.

It’s synthetic engagement all the way down. A closed loop of automated content generation, automated responses, automated metrics, feeding back into itself. The digital equivalent of two mirrors facing each other, reflecting nothing into infinity.

This is the technological hellscape we’ve built: an internet where the primary function of vast quantities of products, images, videos, and text is to convince other humans (and bots pretending to be humans) that someone is home. That there’s totally real consciousness on the other end. That any of this matters. That this definitely isn’t a system eating itself.

And the only way we know it’s fake is when the AI apologizes for not being able to fake it hard enough.

There are millions of such posts, all over X, and beyond.

This is the aesthetic of the AI era, 2023–2025 and beyond: synthetic rot.

Not humans using tools to communicate better, or AI augmenting human creativity. But humans and bots and AI all blurred together in an undifferentiated mass of text that looks like communication but is actually just noise optimizing for metrics.

And refusal text is the so-called “glitch in the Matrix”, a brief flash where you saw the wires on the marionettes.

Two Very Different Memories Invoked.

So yes, both will become nostalgic. But they’ll mean completely different things. One nostalgia will be about the technology. The other will be about what we did with it.

The distinction matters. Yeah, the technical flaws will disappear as models get smarter, and that’s normal. The institutional flaws though? Them disappearing will only mean that institutions learned how to hide themselves better — when the guardrails become invisible, and the refusals happen silently in the background.

AI is already a black box. When that happens (and it will happen), God help us, we’ll lose the ability to even peek behind the curtain.

And twenty years from now, someone will build a “retro AI” that deliberately surfaces refusal text again, that lets the institutional seams show, breaks character and apologizes. Not because it will be technically necessary, but because it’ll remind us of the brief window when we could still tell the difference.

That’s the “artifact” we’re going to remember.

I Built a Self-Hosted Google Trends Alternative with DuckDB

Prithwish Nath — Wed, 11 Feb 2026 16:16:43 +0000

TL;DR: Track SERP rankings, title changes, and competitor data Google Trends won’t show. Built with Python, DuckDB, and a CLI-first approach.

Google Trends will tell you if people are searching for “react” or “nextjs”. But it won’t tell you that Stack Overflow just got bumped from position #2 to #7, or that Vercel changed their landing page title five times this month trying to improve click-through rate.

If, say, you’re an indie dev launching a product, needing every edge available, that’s the data that actually matters to you.

So I spent a weekend building a tool to track it. I could’ve just paid for Ahrefs/Semrush etc. But building this taught me:

How SERP APIs work under the hood
How to model time-series data in SQL (and its gotchas)
How to calculate derived metrics (interest score) from raw data
How DuckDB handles analytical queries

…and also because I didn’t really want to spend anywhere near that much. 😅

Ironically, focusing on CLI only made the tool more useful — I can use this, then pipe results to jq, schedule fetches with cron, and script complex workflows without fighting a web framework.

If I want a dashboard later, I can always add FastAPI in ~50 lines. But for now, the CLI is enough. Here’s how I built it.

If you’d like to tinker, the full code to this is on GitHub. Feel free to star, clone, fork, whatever: https://github.com/sixthextinction/duckdb-google-trends-basic/

Why Google Trends Isn’t Enough (And Why SEO Tools Cost $200/Month)

Google Trends answers one question really well: “How many people are searching for X?”

But if you’re building a product, writing technical content, or trying to rank for competitive keywords, you need to answer different questions:

Which competitors are winning in search results right now?
When did that tutorial site enter the top 10?
Is my rank drop because Google reshuffled the entire SERP, or just me?
What headlines are competitors A/B testing?

Tools like Ahrefs and Semrush answer these questions. They cost $99–500/month, though, and I just wanted something I could self-host for the cost of API calls and build as a weekend project.

Why I Use Google Results Volatility as a Proxy for Search Interest

This works because of ONE reason — when search interest in a term rises, Google’s top 10 results become volatile.

“Volatility” here simply means that new domains enter, rankings shift around, sites update their titles and snippets to capture more clicks, etc. You get the picture — essentially, the search engine results page becomes chaotic.

Conversely, when interest in a term is stable or declining, Google’s top 10 ossifies. The same Wikipedia article, the same W3Schools tutorial, the same official docs sit in positions 1–3 for months.

So I don’t really need to track raw search volume (which I can’t access anyway; I’m not Google). I can just track these three things:

New domains entering the top 10: a signal of rising interest or new content opportunities
Average rank improvement: a signal of SERP instability
Domain overlap ratio: measures how many domains persist between snapshots (complementing new domains)

Turns out, if I aggregate these three signals into a single 0–100 score (I’ll talk about the formula in just a bit), I get something that behaves remarkably like Google Trends — but tells me a lot more than just how many are searching.

The System Architecture

The entire system is roughly 1,000 lines of Python and runs locally with no server required.

Here’s how it works:

I started small with this one — using only daily snapshots, not live queries. Each run appends point-in-time data instead of overwriting. This way, over 7–30 days, I could build a local historical dataset that I could query freely.

I use DuckDB for this. For this workload (rank comparisons, volatility calculations, detecting new entrants), DuckDB’s SQL engine is ideal.

It’s columnar, so analytical queries over time-series data are fast. (If you want to know more, I covered columnar formats vs. JSON in this blog post)
It handles indexing, window functions (LAG(), PARTITION BY — which we will use extensively), and aggregations without needing a server or cloud warehouse.
It has an in-process design, meaning no separate database server — our project will need just a Python library and a file.
Plus, since it’s a single file (our “database” just lives in data/serp_data.duckdb), backups are trivial and there’s zero configuration overhead.
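
To make the window-function point concrete, here is a minimal sketch of a LAG() query that computes day-over-day rank movement per domain. It uses Python's built-in sqlite3 purely as a stand-in so it runs without installing anything; the trimmed-down table and the sample rows are made up for illustration, and I'm assuming the same SQL carries over to DuckDB unchanged.

```python
import sqlite3

# sqlite3 is only a stdlib stand-in for DuckDB here; the rows are invented sample data.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE serp_snapshots (query TEXT, snapshot_date TEXT, domain TEXT, rank INTEGER)"
)
rows = [
    ("nextjs", "2026-02-01", "nextjs.org", 1),
    ("nextjs", "2026-02-01", "vercel.com", 2),
    ("nextjs", "2026-02-02", "nextjs.org", 1),
    ("nextjs", "2026-02-02", "vercel.com", 3),
    ("nextjs", "2026-02-03", "nextjs.org", 2),
    ("nextjs", "2026-02-03", "vercel.com", 1),
]
conn.executemany("INSERT INTO serp_snapshots VALUES (?, ?, ?, ?)", rows)

# LAG(rank) looks one snapshot back within each (query, domain) partition.
# previous rank minus current rank is positive when the domain moved up.
rank_moves = conn.execute(
    """
    SELECT domain,
           snapshot_date,
           rank,
           LAG(rank) OVER (PARTITION BY query, domain ORDER BY snapshot_date) - rank
               AS rank_improvement
    FROM serp_snapshots
    ORDER BY domain, snapshot_date
    """
).fetchall()

for row in rank_moves:
    print(row)
```

The first snapshot of each domain has no previous row, so its rank_improvement comes back as NULL (None in Python).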

Documentation - DuckDB

My Interest Score Formula

Every day, for each keyword, the system calculates a 0–100 “Search Interest Score” based on how much the SERP moved compared to the previous day.

I’m not gonna get into the math, but basically, I did some research on Google Trends scoring, adapted it for my needs, and split my scoring logic into 3 weighted parts:

1. New Domains Entering Top 10 (0–40 points)

new_domains = current_top10 - previous_top10

new_domains_score = min(len(new_domains) * 4, 40)

If 3 new sites enter the top 10, that’s 12 points. If 10 new sites appear (rare but possible during breaking news or major updates), that maxes out at 40 points.
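
As a quick sanity check of that arithmetic, here is a small sketch of the new-domains sub-score using plain Python sets. The function name and the sample domains are mine, not from the repo.

```python
def new_domains_score(current_top10, previous_top10):
    """Score newly appearing domains at 4 points each, capped at 40."""
    new_domains = set(current_top10) - set(previous_top10)
    return min(len(new_domains) * 4, 40), new_domains


# Invented sample data: three domains are new today.
previous = ["react.dev", "w3schools.com", "wikipedia.org", "github.com"]
current = ["react.dev", "github.com", "dev.to", "medium.com", "vercel.com"]

score, entrants = new_domains_score(current, previous)
print(score, sorted(entrants))
```

With three new entrants this gives 12 points, matching the example above.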

2. Average Rank Improvement (0–30 points)

For each domain that appears in both snapshots

rank_improvement = previous_rank - current_rank

A positive value here means it moved up.

Now, average across all domains, normalized to -10 to +10 range

avg_improvement = mean(rank_improvements)

rank_improvement_score = min(max((avg_improvement + 10) / 20 * 30, 0), 30)

If the average site improved by 2 positions, that’s 18 points. If rankings barely moved, this stays close to 15 (neutral).
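
Here is the same normalization as a minimal runnable sketch. It takes two domain-to-rank dicts (a shape I picked for illustration) and treats "no shared domains" as a neutral average of 0, which is also what the DuckDB manager code later in this post does.

```python
from statistics import mean


def rank_improvement_score(current_ranks, previous_ranks):
    """Map the mean rank change of shared domains from [-10, +10] onto [0, 30]."""
    shared = current_ranks.keys() & previous_ranks.keys()
    if shared:
        # previous - current is positive when a domain moved up the page
        avg_improvement = mean(previous_ranks[d] - current_ranks[d] for d in shared)
    else:
        avg_improvement = 0.0  # nothing to compare, so stay neutral
    score = min(max((avg_improvement + 10) / 20 * 30, 0), 30)
    return avg_improvement, score


# Invented sample data: every shared domain climbed two positions.
previous = {"nextjs.org": 3, "vercel.com": 5, "github.com": 8}
current = {"nextjs.org": 1, "vercel.com": 3, "github.com": 6, "dev.to": 9}

avg, score = rank_improvement_score(current, previous)
print(avg, round(score, 1))
```

Note that dev.to is ignored here because it has no previous rank; it is counted by the new-domains part instead.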

3. Domain Overlap Ratio (0–30 points)

Finally, how many of today’s top 10 domains also appeared in yesterday’s top 10?

reshuffle_count = count(domains present in both current and previous top 10)

reshuffle_frequency = reshuffle_count / max(len(current_domains_set), 1)

reshuffle_score = reshuffle_frequency * 30

Say 8 out of 10 domains carry over from yesterday: that’s 24 points. If only 3 carry over (meaning 7 are new, a massive reshuffle), that’s 9 points. This complements the new domains score by capturing continuity.
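
A minimal sketch of the overlap calculation, again with made-up domain lists and a function name of my own choosing:

```python
def overlap_score(current_top10, previous_top10):
    """Share of today's domains that were also present yesterday, scaled to 0-30."""
    current, previous = set(current_top10), set(previous_top10)
    carried_over = current & previous
    ratio = len(carried_over) / max(len(current), 1)  # guard against an empty snapshot
    return ratio, ratio * 30


# Invented sample data: 8 of today's 10 domains were already there yesterday.
yesterday = [f"site{i}.com" for i in range(10)]
today = yesterday[:8] + ["new-a.com", "new-b.com"]

ratio, score = overlap_score(today, yesterday)
print(round(ratio, 2), round(score, 1))
```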

Total Score

So, taking all three parts together…

interest_score = new_domains_score + rank_improvement_score + reshuffle_score

High scores (60–100) = lots of movement = rising interest or major SERP disruption.

Low scores (0–40) = stable, ossified rankings = same old, same old. Established content is dominant.
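
To see how the three parts add up, here is a small sketch that combines them from the same three metrics the tool stores per day (new domain count, average rank improvement, overlap ratio). The function name is mine; the arithmetic is just the formulas above.

```python
def interest_score(new_domains_count, avg_rank_improvement, overlap_ratio):
    """Combine the three sub-scores (40 + 30 + 30) into a 0-100 score."""
    new_domains_score = min(new_domains_count * 4, 40)
    rank_score = min(max((avg_rank_improvement + 10) / 20 * 30, 0), 30)
    overlap_score = overlap_ratio * 30
    return new_domains_score + rank_score + overlap_score


# Two boundary cases worth knowing about:
unchanged = interest_score(0, 0.0, 1.0)   # identical top 10 on both days
replaced = interest_score(10, 0.0, 0.0)   # every domain in the top 10 is new
print(unchanged, replaced)
```

With the formulas exactly as written, a completely unchanged SERP lands at 45 and a completely replaced one at 55, because the overlap term rewards continuity while the new-domains term rewards churn. So I'd read the score bands above as rough guides rather than hard thresholds.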

What This Looks Like in Practice

I tested this by tracking “nextjs” for 7 days like this.

python main.py scores --query "nextjs" --days 7

Here’s what the output looked like:

=== Interest Scores for 'nextjs' (last 7 days) ===  

Found 7 scores:  

| snapshot_date | interest_score | new_domains | avg_rank_improvement | reshuffle_freq |  
|---------------|----------------|-------------|----------------------|----------------|  
| 2026-02-01    | 45.2           | 2           | 1.5                  | 0.6            |  
| 2026-02-02    | 52.3           | 3           | 2.1                  | 0.7            |  
| 2026-02-03    | 38.7           | 1           | 0.8                  | 0.5            |  
| 2026-02-04    | 61.4           | 4           | 3.2                  | 0.8            |  
| 2026-02-05    | 42.1           | 2           | 1.2                  | 0.6            |  
| 2026-02-06    | 55.8           | 3           | 2.5                  | 0.7            |  
| 2026-02-07    | 48.3           | 2           | 1.8                  | 0.6            |  

Chart saved to: nextjs_trend.png  

Summary: Min: 38.7  Max: 61.4  Avg: 49.1

Here’s that generated chart (I’m using basic matplotlib for these):

Matplotlib generated Search Interest Trend graph for the term “nextjs”. I tracked this over 7 days in the tool, running once per day.

That spike on Feb 4 (interest_score = 61.4) indicates a major SERP reshuffle — probably a Google algorithm update or a major new tutorial entering the rankings.

Actually Building It

Before diving into the code, here’s the bird’s-eye view of how the system fits together.

The entire project is driven by a small CLI (powered by argparse) in main.py. This file doesn’t contain any scraping or analytics logic — it’s just the orchestration layer that wires everything together.

Read the full code for main.py here: https://github.com/sixthextinction/duckdb-google-trends-basic/blob/main/main.py

You run specific commands (fetch to get SERP data for a keyword, volatility to analyze rank volatility for a keyword over a period of time, scores to view interest scores for a keyword over time, etc.) like so:

python main.py fetch --keywords "python"  
python main.py volatility --query "python" --days 30  
python main.py scores --query "python" --days 90

The CLI dispatch uses argparse subcommands to wire everything together:

def main():    
    parser = argparse.ArgumentParser(description="DuckDB Google Trends")    
    subparsers = parser.add_subparsers(dest='command', required=True, help='Commands')

    # Each command gets its own parser with relevant arguments    
    fetch_parser = subparsers.add_parser('fetch', help='Fetch SERP snapshots')    
    fetch_parser.add_argument('--keywords', nargs='+', help='Keywords to track')    
    fetch_parser.add_argument('--num-results', type=int, default=10)    
    fetch_parser.add_argument('--delay', type=float, default=1.0)    

    scores_parser = subparsers.add_parser('scores', help='Show interest scores')    
    scores_parser.add_argument('--query', required=True)    
    scores_parser.add_argument('--days', type=int, default=90)    
    scores_parser.add_argument('--output', type=str)    

    # ... similar parsers for analyze, volatility, new-entrants, changes, calculate-scores    

    args = parser.parse_args()    
    commands = {    
        'fetch': cmd_fetch,    
        'analyze': cmd_analyze,    
        'volatility': cmd_volatility,    
        'new-entrants': cmd_new_entrants,    
        'changes': cmd_changes,    
        'calculate-scores': cmd_calculate_scores,    
        'scores': cmd_scores    
    }    
    commands[args.command](args)

Our main.py defines commands that map directly to the questions we want to ask:

fetch — collect today’s SERP results for a set of keywords
analyze — inspect the shape of the collected data
volatility — measure how rankings change over time
new-entrants — detect URLs appearing for the first time
changes — track title and snippet updates
calculate-scores — recalculate interest scores for existing snapshots
scores — view the calculated interest score trend
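
If you want to play with the dispatch pattern in isolation, here is a self-contained sketch with two stub handlers (hypothetical stand-ins, not the real ones from the repo). It uses set_defaults(func=...) instead of the dictionary lookup in main.py, which is just an alternative way of wiring subcommands to functions.

```python
import argparse


def cmd_fetch(args):
    # Stub handler: the real one would call the SERP client and store results.
    return f"fetching {args.keywords}"


def cmd_scores(args):
    # Stub handler: the real one would query stored interest scores.
    return f"scores for {args.query} over {args.days} days"


def build_parser():
    parser = argparse.ArgumentParser(description="SERP tracker (sketch)")
    # required=True makes argparse print usage if no subcommand is given
    subparsers = parser.add_subparsers(dest="command", required=True)

    fetch_parser = subparsers.add_parser("fetch", help="Fetch SERP snapshots")
    fetch_parser.add_argument("--keywords", nargs="+", required=True)
    fetch_parser.set_defaults(func=cmd_fetch)

    scores_parser = subparsers.add_parser("scores", help="Show interest scores")
    scores_parser.add_argument("--query", required=True)
    scores_parser.add_argument("--days", type=int, default=90)
    scores_parser.set_defaults(func=cmd_scores)
    return parser


def run(argv):
    args = build_parser().parse_args(argv)
    return args.func(args)  # each subparser attached its own handler


print(run(["scores", "--query", "python", "--days", "30"]))
```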

For example, here’s the key command handler for the scores command (the other handlers follow the same pattern):

# Usage: python main.py scores --query "python" --days 90  
def cmd_scores(args):    
    """Show interest scores for a query"""    
    with SERPAnalytics() as analytics:    
        result = analytics.interest_scores(args.query, days=args.days)    

        print(f"\n=== Interest Scores for '{result['query']}' (last {result['days']} days) ===")    
        if len(result['results']) == 0:    
            print("No interest scores found")  
            print("Note: Interest scores require at least 2 snapshots on different days.")      
            print("To calculate scores for existing data, run:")      
            print(f"  python main.py calculate-scores --keywords {args.query}")       
            return    

        print(f"\nFound {len(result['results'])} scores:\n")    
        print(df_to_markdown(result['results']))    

        # Generate PNG chart    
        output_path = args.output or f"{args.query.replace(' ', '_')}_trend.png"    
        _generate_png_chart(result['results'], args.query, args.days, output_path)    
        print(f"\nChart saved to: {output_path}")

At a high level, what can our project do?

Fetch daily SERP snapshots for keywords (fetch command)
Store those snapshots locally in DuckDB
Run analytical queries over historical data using plain old SQL
When you run python main.py scores --query "nextjs", the CLI fetches interest scores from DuckDB and, as an added bonus, generates a PNG chart using matplotlib. Note that this shows SERP movement, not raw search volume.

We don’t need servers, background workers, or dashboards here.

Now that we know how this tool works, let’s look at the major modules that make all this happen.

Module 1: Fetching SERP Data

All external data access is isolated in serp_client.py. I only have access to one SERP API — Bright Data — so I’ll only have to implement one class. Get your credentials here.

Bright Data SERP API

This approach does make it easy to extend it with other SERP APIs: just write another client, and include its credentials in your env file.
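
One way to make "just write another client" concrete is a small interface that every client satisfies. This is only a sketch: SerpClient, FakeSerpClient and fetch_all are hypothetical names, and the "organic" key in the fake response is my assumption about the response shape, not something the repo guarantees.

```python
from typing import Any, Dict, List, Protocol


class SerpClient(Protocol):
    """Anything with a compatible search() method can be plugged in."""

    def search(self, query: str, num_results: int = 10) -> Dict[str, Any]: ...


class FakeSerpClient:
    """Offline stand-in that returns canned results, handy for tests."""

    def search(self, query: str, num_results: int = 10) -> Dict[str, Any]:
        organic = [
            {"link": f"https://example.com/{query}/{i}", "title": f"{query} result {i}"}
            for i in range(1, num_results + 1)
        ]
        return {"organic": organic}


def fetch_all(client: SerpClient, keywords: List[str], num_results: int = 10) -> Dict[str, int]:
    """Run every keyword through whichever client was passed in."""
    counts = {}
    for keyword in keywords:
        response = client.search(keyword, num_results=num_results)
        counts[keyword] = len(response.get("organic", []))
    return counts


print(fetch_all(FakeSerpClient(), ["python", "react"], num_results=3))
```

The calling code only depends on search(), so swapping providers would not touch the storage or scoring logic.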

Read the full code for serp_client.py here: https://github.com/sixthextinction/duckdb-google-trends-basic/blob/main/src/serp_client.py

import os    
import json    
import requests    
from typing import Dict, Any, Optional    
from dotenv import load_dotenv    

load_dotenv()    

class BrightDataClient:    
    """Client for Bright Data SERP API"""    

    def __init__(
        self,
        api_key: Optional[str] = None,
        zone: Optional[str] = None,
        country: Optional[str] = None,
    ):
        env_api_key = os.getenv("BRIGHT_DATA_API_KEY")    
        env_zone = os.getenv("BRIGHT_DATA_ZONE")    
        env_country = os.getenv("BRIGHT_DATA_COUNTRY")    

        self.api_key = api_key or env_api_key    
        self.zone = zone or env_zone    
        self.country = country or env_country    
        self.api_endpoint = "https://api.brightdata.com/request"    

        if not self.api_key:    
            raise ValueError(    
                "BRIGHT_DATA_API_KEY must be provided via constructor or environment variable"    
            )    

        if not self.zone:    
            raise ValueError(    
                "BRIGHT_DATA_ZONE must be provided via constructor or environment variable"    
            )    

        self.session = requests.Session()    
        self.session.headers.update({    
            'Content-Type': 'application/json',    
            'Authorization': f'Bearer {self.api_key}'    
        })    

    def search(
        self,
        query: str,
        num_results: int = 10,
        language: Optional[str] = None,
        country: Optional[str] = None,
    ) -> Dict[str, Any]:
        """Execute a Google search via Bright Data SERP API"""    
        search_url = (    
            f"https://www.google.com/search"    
            f"?q={requests.utils.quote(query)}"    
            f"&num={num_results}"    
            f"&brd_json=1"    
        )    

        if language:    
            search_url += f"&hl={language}&lr=lang_{language}"    

        target_country = country or self.country    

        payload = {    
            'zone': self.zone,    
            'url': search_url,    
            'format': 'json'    
        }    

        if target_country:    
            payload['country'] = target_country    

        try:    
            response = self.session.post(    
                self.api_endpoint,    
                json=payload,    
                timeout=30    
            )    
            response.raise_for_status()    
            result = response.json()    

            # Parse body JSON string if present    
            if isinstance(result, dict) and 'body' in result:    
                if isinstance(result['body'], str):    
                    result['body'] = json.loads(result['body'])    
                # Return the parsed body content    
                return result['body']    

            return result    

        except requests.exceptions.HTTPError as e:    
            error_msg = f"Search request failed with HTTP {e.response.status_code}"    
            if e.response.text:    
                error_msg += f": {e.response.text[:200]}"    
            raise RuntimeError(error_msg) from e    
        except requests.exceptions.RequestException as e:    
            raise RuntimeError(f"Search request failed: {e}") from e

This is just a thin wrapper around the Bright Data API. It takes a query, returns JSON with organic search results (title, snippet, URL, rank).
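
For reference, here is a rough sketch of turning a parsed response into flat rows ready for storage. The top-level "organic" key is an assumption on my part about the response shape; the url/link and snippet/description fallbacks mirror the ones used when inserting snapshots later in this post. extract_organic is a hypothetical helper, not a function from the repo.

```python
from typing import Any, Dict, List


def extract_organic(response: Dict[str, Any], limit: int = 10) -> List[Dict[str, Any]]:
    """Flatten organic results into url/title/snippet/rank rows."""
    rows = []
    for position, item in enumerate(response.get("organic", [])[:limit], start=1):
        url = item.get("url") or item.get("link") or ""
        if not url:
            continue  # skip entries without a usable URL
        rows.append({
            "url": url,
            "title": item.get("title", ""),
            "snippet": item.get("snippet") or item.get("description") or "",
            "rank": item.get("rank", position),  # fall back to list position
        })
    return rows


# Invented sample response with one malformed entry in the middle.
sample = {
    "organic": [
        {"link": "https://nextjs.org/", "title": "Next.js", "description": "The React Framework", "rank": 1},
        {"title": "entry with no URL"},
        {"url": "https://vercel.com/", "title": "Vercel"},
    ]
}

rows = extract_organic(sample)
for row in rows:
    print(row)
```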

This module is called when we run the fetch command.

python main.py fetch --keywords "python" "javascript" "react"

This will connect to the Bright Data SERP API and, for each keyword, fetch Google search results (10 per keyword by default; adjust as necessary), then extract and store the organic results (title, snippet, URL, rank). Remember, interest scores require at least 2 snapshots on different days, so you should fetch snapshots daily to build historical trends (via a cron job, or just by running the command manually).

Example output:

Fetching snapshots for 3 keywords…  
[1/3] 'python': 10 results  
[2/3] 'javascript': 10 results  
[3/3] 'react': 10 results  
Total snapshots in database: 30

Module 2: Storing Snapshots in DuckDB

Once SERP data is fetched by the previous module, it needs to be stored in DuckDB for analytical queries. That logic lives in duckdb_manager.py.

Read the full code for duckdb_manager.py here: https://github.com/sixthextinction/duckdb-google-trends-basic/blob/main/src/duckdb_manager.py

First of all, let’s introduce the schema we’ll be using:

CREATE TABLE IF NOT EXISTS serp_snapshots (
    snapshot_id BIGINT PRIMARY KEY,
    query TEXT NOT NULL,
    snapshot_date DATE NOT NULL,
    snapshot_timestamp TIMESTAMP NOT NULL,
    url TEXT NOT NULL,
    title TEXT,
    snippet TEXT,
    domain TEXT,
    rank INTEGER NOT NULL,
    UNIQUE(query, snapshot_date, url)
);

-- Interest scores table (calculated from SERP movement between snapshots)
CREATE TABLE IF NOT EXISTS interest_scores (
    query TEXT NOT NULL,
    snapshot_date DATE NOT NULL,
    interest_score DOUBLE NOT NULL,
    new_domains_count INTEGER,
    avg_rank_improvement DOUBLE,
    reshuffle_frequency DOUBLE,
    UNIQUE(query, snapshot_date)
);

-- Indexes for fast queries
CREATE INDEX IF NOT EXISTS idx_query_date ON serp_snapshots(query, snapshot_date);
CREATE INDEX IF NOT EXISTS idx_url_query ON serp_snapshots(url, query);
CREATE INDEX IF NOT EXISTS idx_interest_scores ON interest_scores(query, snapshot_date);

Each SERP result becomes a row, keyed by (query, date, URL). Interest scores are stored in a separate table, calculated automatically when a new snapshot is inserted. So, inserting a snapshot will look like this:

def insert_snapshot(self, results: List[Dict[str, Any]], query: str,     
                   snapshot_date: Optional[datetime] = None):    
    """Insert a daily snapshot of SERP results"""    
    if snapshot_date is None:    
        snapshot_date = datetime.now()    

    snapshot_timestamp = snapshot_date    
    snapshot_date_only = snapshot_date.date() if hasattr(snapshot_date, 'date') else snapshot_date    

    if not results:    
        return    

    def extract_domain(url: str) -> str:    
        """Extract domain from URL, stripping www prefix"""    
        if not url:    
            return ""    
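        try:
            from urllib.parse import urlparse
            netloc = urlparse(url).netloc
            # Strip only a leading "www." so a "www." elsewhere in the host is left alone
            return netloc[4:] if netloc.startswith("www.") else netloc
        except Exception:
            return ""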

    # Get max snapshot_id     
    max_id_result = self.conn.execute(      
        "SELECT COALESCE(MAX(snapshot_id), 0) FROM serp_snapshots"      
    ).fetchone()      
    next_id = (max_id_result[0] if max_id_result else 0) + 1      

    rows = []    
    for idx, result in enumerate(results):    
        url = result.get('url', result.get('link', ''))    
        domain = extract_domain(url)     
        rows.append({    
            'snapshot_id': next_id + idx,    
            'query': query,    
            'snapshot_date': snapshot_date_only,    
            'snapshot_timestamp': snapshot_timestamp,    
            'url': url,    
            'title': result.get('title', ''),    
            'snippet': result.get('snippet', result.get('description', '')),    
            'domain': domain,     
            'rank': idx + 1    
        })    

    import pandas as pd    
    df = pd.DataFrame(rows)    
    self.conn.execute("""      
        INSERT OR IGNORE INTO serp_snapshots       
        SELECT * FROM df      
    """)    
    # Calculate and store interest score    
    self._calculate_interest_score(query, snapshot_date_only)

Instead of updating rows, every run adds new records. This builds a local time-series dataset.
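
Here is a stripped-down sketch of that append-only behavior. It uses Python's built-in sqlite3 as a stand-in for DuckDB so it runs anywhere, with a cut-down version of the schema and invented URLs. The point is the UNIQUE(query, snapshot_date, url) key combined with INSERT OR IGNORE: re-running on the same day is a no-op, while a new day adds fresh rows.

```python
import sqlite3

# sqlite3 is only a stdlib stand-in for DuckDB; the schema is trimmed for brevity.
conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE serp_snapshots (
        query TEXT NOT NULL,
        snapshot_date TEXT NOT NULL,
        url TEXT NOT NULL,
        rank INTEGER NOT NULL,
        UNIQUE(query, snapshot_date, url)
    )
    """
)


def insert_snapshot(query, snapshot_date, urls):
    """Append one day's results; rows that already exist for that day are skipped."""
    rows = [(query, snapshot_date, url, rank) for rank, url in enumerate(urls, start=1)]
    conn.executemany("INSERT OR IGNORE INTO serp_snapshots VALUES (?, ?, ?, ?)", rows)


urls = ["https://nextjs.org/", "https://vercel.com/"]
insert_snapshot("nextjs", "2026-02-01", urls)
insert_snapshot("nextjs", "2026-02-01", urls)                  # same-day re-run: ignored
insert_snapshot("nextjs", "2026-02-02", list(reversed(urls)))  # next day: appended

total = conn.execute("SELECT COUNT(*) FROM serp_snapshots").fetchone()[0]
days = conn.execute("SELECT COUNT(DISTINCT snapshot_date) FROM serp_snapshots").fetchone()[0]
print(total, days)
```

One consequence of this design: the first fetch of the day wins, so a second run on the same date will not overwrite ranks that shifted in the meantime.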

Here’s how we calculate the interest score using that 40–30–30 formula described earlier:

def _calculate_interest_score(self, query: str, snapshot_date):  
    """Calculate Search Interest Score (0-100) based on SERP movement"""  
    # Get previous day's snapshot for comparison  
    prev_date_result = self.conn.execute("""  
        SELECT MAX(snapshot_date)   
        FROM serp_snapshots   
        WHERE query = ?   
          AND snapshot_date < ?
    """, [query, snapshot_date]).fetchone()  

    if not prev_date_result or not prev_date_result[0]:  
        # First snapshot, no comparison possible  
        return  

    prev_date = prev_date_result[0]  

    # Get current top 10 domains  
    current_domains = self.conn.execute("""  
        SELECT DISTINCT domain   
        FROM serp_snapshots   
        WHERE query = ?   
          AND snapshot_date = ?  
          AND rank <= 10  
    """, [query, snapshot_date]).fetchall()  
    current_domains_set = {row[0] for row in current_domains}  

    # Get previous top 10 domains  
    prev_domains = self.conn.execute("""  
        SELECT DISTINCT domain   
        FROM serp_snapshots   
        WHERE query = ?   
          AND snapshot_date = ?  
          AND rank <= 10  
    """, [query, prev_date]).fetchall()  
    prev_domains_set = {row[0] for row in prev_domains}  

    # Count new domains entering top 10  
    new_domains = current_domains_set - prev_domains_set  
    new_domains_count = len(new_domains)  

    # Calculate average rank improvement for existing domains  
    rank_changes = self.conn.execute("""  
        WITH current_ranks AS (  
            SELECT domain, rank  
            FROM serp_snapshots  
            WHERE query = ? AND snapshot_date = ?   
              AND rank <= 10  
        ),  
        prev_ranks AS (  
            SELECT domain, rank  
            FROM serp_snapshots  
            WHERE query = ? AND snapshot_date = ?  
              AND rank <= 10  
        )  
        SELECT   
            c.domain,  
            c.rank as current_rank,  
            p.rank as prev_rank,  
            (p.rank - c.rank) as rank_improvement  
        FROM current_ranks c  
        JOIN prev_ranks p ON c.domain = p.domain  
    """, [query, snapshot_date, query, prev_date]).fetchall()  

    if rank_changes:  
        avg_rank_improvement = sum(row[3] for row in rank_changes) / len(rank_changes)  
    else:  
        avg_rank_improvement = 0.0  

    # Calculate "reshuffle frequency": the share of current top-10 domains that were  
    # also in the previous top 10 (counts overlap, whether or not the rank changed)  
    reshuffle_count = len(rank_changes)  
    reshuffle_frequency = reshuffle_count / max(len(current_domains_set), 1)  

    # Normalize to 0-100 score  
    # I'm calculating a final score from 3 weighted sub-scores:  
    # - New domains: 0-10 domains = 0-40 points  
    # - Rank improvement: -10 to +10 = 0-30 points (normalized)  
    # - Reshuffle frequency: 0-1 = 0-30 points  

    new_domains_score = min(new_domains_count * 4, 40)  # Max 40 points  
    rank_improvement_score = min(max((avg_rank_improvement + 10) / 20 * 30, 0), 30)  # Max 30 points  
    reshuffle_score = reshuffle_frequency * 30  # Max 30 points  

    interest_score = new_domains_score + rank_improvement_score + reshuffle_score  

    # Store interest score  
    self.conn.execute("""  
        INSERT OR REPLACE INTO interest_scores   
        (query, snapshot_date, interest_score, new_domains_count, avg_rank_improvement, reshuffle_frequency)  
        VALUES (?, ?, ?, ?, ?, ?)  
    """, [query, snapshot_date, interest_score, new_domains_count, avg_rank_improvement, reshuffle_frequency])

This runs automatically every time a new snapshot is inserted. The score gets stored in a separate interest_scores table for easy querying.
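
To make the arithmetic concrete, here is a minimal standalone sketch of just the scoring step, pulled out of the database code. The function name and the two sample days are my own illustration (not part of the repo); the three formulas are copied from the method above.

```python
def interest_score(new_domains_count: int,
                   avg_rank_improvement: float,
                   reshuffle_frequency: float) -> float:
    """Combine the three sub-scores into a 0-100 value (40/30/30 split)."""
    # Each new domain is worth 4 points, capped at 40
    new_domains_score = min(new_domains_count * 4, 40)
    # Map an average improvement of -10..+10 positions onto 0..30 points
    rank_improvement_score = min(max((avg_rank_improvement + 10) / 20 * 30, 0), 30)
    # A 0..1 ratio scaled to 0..30 points
    reshuffle_score = reshuffle_frequency * 30
    return new_domains_score + rank_improvement_score + reshuffle_score


# Hypothetical busy day: 3 new domains, shared domains moved up 2 places on
# average, and 7 of the 10 current domains were also present the day before.
busy_day = interest_score(3, 2.0, 0.7)

# Hypothetical quiet day: nothing new, nothing moved, all 10 domains carried over.
quiet_day = interest_score(0, 0.0, 1.0)

print(round(busy_day, 1), round(quiet_day, 1))
```

The busy day works out to 12 + 18 + 21 = 51 and the quiet day to 0 + 15 + 30 = 45. A completely unchanged SERP still lands in the mid-40s, because zero average movement maps to the midpoint of the rank sub-score and full overlap earns the full 30 points, so it makes sense to read scores relative to that baseline rather than relative to zero.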

Module 3: Analytical Queries

The nerdiest part of our logic lives in analytics.py. This module opens DuckDB in read-only mode and exposes focused queries.

Read the full code for analytics.py here: https://github.com/sixthextinction/duckdb-google-trends-basic/blob/main/src/analytics.py

A good analytical query to demonstrate right now would be the one for rank volatility:

def rank_volatility(self, query: str, days: int = 30) -> Dict[str, Any]:    
    """Calculate rank volatility for URLs over time"""    
    cutoff_date = datetime.now().date() - timedelta(days=days)    

    result = self.conn.execute("""    
        WITH rank_changes AS (    
            SELECT     
                url,    
                domain,    
                rank,    
                snapshot_date,    
                LAG(rank) OVER (PARTITION BY url ORDER BY snapshot_date) as prev_rank    
            FROM serp_snapshots    
            WHERE query = ? AND snapshot_date >= ?    
            ORDER BY url, snapshot_date    
        ),    
        volatility AS (    
            SELECT     
                url,    
                domain,    
                COUNT(*) as snapshot_count,    
                AVG(rank) as avg_rank,    
                MIN(rank) as best_rank,    
                MAX(rank) as worst_rank,    
                STDDEV(rank) as rank_stddev,    
                COUNT(CASE WHEN prev_rank IS NOT NULL AND rank != prev_rank THEN 1 END) as rank_changes    
            FROM rank_changes    
            GROUP BY url, domain    
        )    
        SELECT     
            url,    
            domain,    
            snapshot_count,    
            ROUND(avg_rank, 2) as avg_rank,    
            best_rank,    
            worst_rank,    
            ROUND(rank_stddev, 2) as rank_stddev,    
            rank_changes,    
            ROUND(CAST(rank_changes AS DOUBLE) / NULLIF(snapshot_count - 1, 0) * 100, 1) as volatility_pct    
        FROM volatility    
        WHERE snapshot_count > 1    
        ORDER BY rank_stddev DESC, avg_rank ASC    
        LIMIT 50    
    """, [query, cutoff_date]).df()    

    return {'query': query, 'days': days, 'results': result}

This uses window functions (LAG) and aggregations (STDDEV) to surface URLs that move around the most. These queries would normally require a data warehouse — here they’re just SQL running locally.
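
If the SQL is hard to follow, here is a rough pure-Python sketch of what the query computes for a single URL. The function name and the sample ranks are hypothetical, and I'm assuming DuckDB's STDDEV is the sample standard deviation (which is what statistics.stdev computes).

```python
from statistics import stdev
from typing import Dict, List


def url_volatility(ranks: List[int]) -> Dict[str, float]:
    """Summarise one URL's ranks, ordered from oldest to newest snapshot."""
    if len(ranks) < 2:
        # Mirrors the query's WHERE snapshot_count > 1 filter
        raise ValueError("need at least two snapshots")
    # LAG(rank): pair each rank with the one from the previous snapshot
    rank_changes = sum(1 for prev, cur in zip(ranks, ranks[1:]) if cur != prev)
    return {
        "snapshot_count": len(ranks),
        "avg_rank": round(sum(ranks) / len(ranks), 2),
        "best_rank": min(ranks),
        "worst_rank": max(ranks),
        "rank_stddev": round(stdev(ranks), 2),  # sample standard deviation
        "rank_changes": rank_changes,
        # Changes divided by the number of day-to-day transitions
        "volatility_pct": round(rank_changes / (len(ranks) - 1) * 100, 1),
    }


# Hypothetical URL observed on five consecutive days
summary = url_volatility([1, 1, 2, 1, 3])
print(summary)
```

For these five ranks the URL changed position on 3 of its 4 day-to-day transitions, so volatility_pct is 75.0 and rank_stddev is 0.89. The denominator is snapshot_count - 1 rather than snapshot_count because the first snapshot has no previous rank to compare against.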

Run this with:

python main.py volatility --query "python" --days 30

For example, this analyzes the last 30 days of snapshots for the query string “python”, calculates average rank, best/worst rank, standard deviation, and change frequency, and displays the top 50 (by default) most volatile URLs.

Example output:

=== Rank Volatility for 'python' (last 30 days) ===  

Top 10 most volatile URLs:  

| url | domain | snapshot_count | avg_rank | best_rank | worst_rank | rank_stddev | rank_changes | volatility_pct |  
| --- | --- | --- | --- | --- | --- | --- | --- | --- |  
| https://www.python.org/ | python.org | 30 | 1.5 | 1 | 3 | 0.67 | 15 | 51.7 |  
| https://www.w3schools.com/python/ | w3schools.com | 30 | 2.3 | 1 | 5 | 1.12 | 18 | 62.1 |  
| https://en.wikipedia.org/wiki/Python_(programming_language) | wikipedia.org | 28 | 4.1 | 2 | 8 | 1.89 | 12 | 44.4 |  
| https://www.codecademy.com/catalog/language/python | codecademy.com | 25 | 5.2 | 3 | 10 | 2.15 | 10 | 41.7 |

Another query that is very useful is finding new entrants:

def new_entrants(self, query: str, days: int = 7):    
    """Find URLs that appeared for the first time recently"""    
    cutoff_date = datetime.now().date() - timedelta(days=days)    

    result = self.conn.execute("""    
        WITH first_appearance AS (    
            SELECT     
                url,    
                domain,    
                MIN(snapshot_date) as first_seen    
            FROM serp_snapshots    
            WHERE query = ?    
            GROUP BY url, domain    
        ),    
        recent_entrants AS (    
            SELECT     
                fa.url,    
                fa.domain,    
                fa.first_seen,    
                s.rank as first_rank,    
                s.title,    
                s.snippet    
            FROM first_appearance fa    
            JOIN serp_snapshots s     
                ON fa.url = s.url     
                AND fa.first_seen = s.snapshot_date    
                AND s.query = ?    
            WHERE fa.first_seen >= ?    
        )    
        SELECT     
            url,    
            domain,    
            first_seen,    
            first_rank,    
            title,    
            snippet    
        FROM recent_entrants    
        ORDER BY first_seen DESC, first_rank ASC    
        LIMIT 50    
    """, [query, query, cutoff_date]).df()    

    return {'query': query, 'days': days, 'results': result}

This finds URLs whose first appearance falls within the last N days — perfect for spotting new competitors or fresh content entering the rankings.
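
The same idea without SQL, as a minimal sketch: track the earliest snapshot per URL, then keep only the URLs whose earliest snapshot is on or after the cutoff. The tuple layout, the function signature, and the sample rows are assumptions for illustration only.

```python
from datetime import date, timedelta
from typing import Dict, List, Tuple

Snapshot = Tuple[str, date, int]  # (url, snapshot_date, rank)


def new_entrants(snapshots: List[Snapshot], today: date, days: int = 7) -> List[Snapshot]:
    """Return the first snapshot of every URL first seen within the last `days` days."""
    cutoff = today - timedelta(days=days)
    first_seen: Dict[str, Snapshot] = {}
    for snap in snapshots:
        url, seen_on, _rank = snap
        # Keep the earliest snapshot per URL (the MIN(snapshot_date) step)
        if url not in first_seen or seen_on < first_seen[url][1]:
            first_seen[url] = snap
    recent = [s for s in first_seen.values() if s[1] >= cutoff]
    # Newest first, then best rank first, like the ORDER BY in the query
    return sorted(recent, key=lambda s: (-s[1].toordinal(), s[2]))


# Hypothetical snapshots for one query
today = date(2026, 2, 7)
rows = [
    ("https://www.python.org/", date(2026, 1, 10), 1),
    ("https://www.python.org/", date(2026, 2, 6), 1),
    ("https://realpython.com/", date(2026, 2, 4), 7),
    ("https://realpython.com/", date(2026, 2, 5), 6),
    ("https://docs.python-guide.org/", date(2026, 2, 6), 8),
]

for url, first_day, first_rank in new_entrants(rows, today):
    print(url, first_day, first_rank)
```

Here python.org is excluded even though it has a snapshot inside the window, because its first appearance predates the cutoff. That is the difference between "seen recently" and "new".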

Run this with:

python main.py new-entrants --query "python" --days 7

Example output:

=== New Entrants for 'python' (last 7 days) ===  

Found 3 new URLs:  

| url | domain | first_seen | first_rank | title | snippet |  
| --- | --- | --- | --- | --- | --- |  
| https://realpython.com/ | realpython.com | 2026-02-04 | 7 | Real Python - Python Tutorials | Learn Python programming with Real Python's comprehensive tutorials and courses... |  
| https://www.pythonforbeginners.com/ | pythonforbeginners.com | 2026-02-05 | 9 | Python For Beginners | A comprehensive guide to learning Python programming from scratch... |  
| https://docs.python-guide.org/ | docs.python-guide.org | 2026-02-06 | 8 | The Hitchhiker's Guide to Python | Best practices and recommendations for Python development... |

I’m not going to go over every module, but it’s all in the code. Find all queries + their expected output in the project README.md.

Module 4: The Snapshot Fetcher

Finally, scraper.py (I’m so sorry — I really could have named this better 😅) connects ingestion and storage.

Read the full code for scraper.py here: https://github.com/sixthextinction/duckdb-google-trends-basic/blob/main/src/scraper.py

import time    
from datetime import datetime    
from typing import List, Optional    

from serp_client import BrightDataClient    
from duckdb_manager import DuckDBManager    


def fetch_snapshots(keywords: List[str], num_results: int = 10, delay: float = 1.0):    
    """    
    Fetch SERP snapshots for keywords and store in DuckDB    

    Args:    
        keywords: List of search keywords    
        num_results: Number of results per keyword    
        delay: Delay between API calls (seconds)    
    """    
    client = BrightDataClient()    

    with DuckDBManager() as db:    
        print(f"Fetching snapshots for {len(keywords)} keywords...")    

        for idx, keyword in enumerate(keywords):    
            try:    
                # Fetch SERP results    
                serp_data = client.search(keyword, num_results=num_results)    

                # Extract organic results    
                organic_results = []    
                if isinstance(serp_data, dict) and 'organic' in serp_data:    
                    organic_results = serp_data['organic']    

                if organic_results:    
                    # Insert snapshot    
                    db.insert_snapshot(organic_results, keyword)    
                    print(f"[{idx+1}/{len(keywords)}] '{keyword}': {len(organic_results)} results")    
                else:    
                    print(f"[{idx+1}/{len(keywords)}] '{keyword}': No results found")    

                # Rate limiting    
                if idx < len(keywords) - 1:    
                    time.sleep(delay)    

            except Exception as e:    
                print(f"Error fetching '{keyword}': {e}")    
                continue    

        total = db.get_snapshot_count()    
        print(f"\nTotal snapshots in database: {total}")

This is just simple orchestration logic, again. It iterates over keywords, fetches results, and inserts snapshots. Rate limiting and error handling live at the edges. I’ve kept the core logic deliberately simple.

That’s everything! Remember, main.py brings all of these together.

Real World Use Cases

Now that you understand how it works, here are some cool things you can actually do with this tool.

1. Detect Google Algorithm Updates Before They’re Announced

When you track multiple keywords in the same niche, a sudden volatility spike across all of them at once can indicate an algorithm change.

python main.py volatility --query "react" --days 7  
python main.py volatility --query "vue" --days 7  
python main.py volatility --query "angular" --days 7

If all three show high rank_stddev and volatility_pct, Google likely pushed an update.
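
If you would rather not eyeball three tables, the check can be sketched in a few lines. This is not part of the repo: the function name, the input shape (a list of rank_stddev values per query), and the 1.5 threshold are all assumptions you would want to tune against your own data.

```python
from statistics import median
from typing import Dict, List


def looks_like_broad_update(stddev_by_query: Dict[str, List[float]],
                            threshold: float = 1.5) -> bool:
    """True when every tracked query's median rank_stddev exceeds the threshold."""
    if not stddev_by_query:
        return False
    return all(
        # A query with no URLs cannot confirm a spike, so it fails the check
        len(values) > 0 and median(values) > threshold
        for values in stddev_by_query.values()
    )


# Hypothetical rank_stddev values for the most volatile URLs of each query
spiky_week = {
    "react": [2.1, 1.9, 2.4],
    "vue": [1.8, 2.2, 1.7],
    "angular": [2.0, 2.6, 1.6],
}
calm_week = {**spiky_week, "angular": [0.3, 0.4, 0.2]}

print(looks_like_broad_update(spiky_week), looks_like_broad_update(calm_week))
```

Requiring every query to clear the bar is deliberate: one noisy keyword is usually just that keyword, while a spike across the whole niche is the more interesting signal. I use the median so a single wild URL cannot trip the check by itself.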

SEO folks pay $200/month for SEMrush Sensor just to get this signal. You’re building it for the cost of SERP API calls.

2. Spy on Competitor SEO Tactics

Track title and snippet changes to see what competitors are A/B testing:

python main.py changes --query "nextjs tutorial" --days 30

Example output:

| url                              | prev_title                    | new_title                                               |  
|----------------------------------|-------------------------------|---------------------------------------------------------|  
| https://nextjs.org/docs          | Next.js Documentation         | Next.js Docs | Next.js                                  |  
| https://nextjs.org/learn         | Learn Next.js                 | Learn Next.js | Next.js by Vercel - The React Framework |

Let’s say a site changed their title from a generic page description to something more specific. If their rank improved after the change, that’s a signal the new title performs better — steal that pattern!

3. Find Content Gaps in Real-Time

See which sites are entering top 10 and what format they’re using:

python main.py new-entrants --query "react hooks tutorial" --days 7

Example output:

| domain              | first_seen | first_rank | title                                    |  
|---------------------|------------|------------|------------------------------------------|  
| react-tutorial.dev  | 2026-02-05 | 7          | React Hooks Interactive Tutorial         |  
| codesandbox.io      | 2026-02-06 | 9          | Learn React Hooks - Live Coding Examples |

Say the SERP for “react hooks tutorial” has two new entrants, and both have “Interactive” or “Live” in their titles. That suggests Google is currently rewarding interactive content for this query, so adjust your content strategy accordingly.

4. Validate Content Ideas Before Creation

This one’s super simple to understand. High volatility = easier for you to rank. Low volatility = established players dominate.

python main.py volatility --query "python tutorial" --days 30  
python main.py volatility --query "rust async tutorial" --days 30

If “python tutorial” shows rank_stddev: 0.3 (very stable) and “rust async tutorial” shows rank_stddev: 2.1 (chaotic), focus on the Rust content! The Python keyword is locked down by W3Schools and Real Python, so you won’t break in easily.

5. Track Your Own Product’s SERP Performance

Monitor how your product ranks for target keywords:

python main.py fetch --keywords "whatever your product is or does"

Then check if you’re entering top 10:

python main.py new-entrants --query "whatever your product is or does" --days 7

If your product URL appears, congrats — you just entered the top 10. If competitors are dropping out (volatility shows their ranks declining), you’re winning.

What I Learned Building This

DuckDB is a total cheat code for embedded analytics. I expected to need PostgreSQL (or ClickHouse, ugh.) for time-series queries over SERP data. Without fiddling with any config, calculating rank volatility across 30 days of snapshots for 50 URLs ran in ~20ms for me. The database file was <5MB for weeks of data.
Bright Data’s SERP API is very reliable. I tried other SERP APIs before settling on Bright Data, primarily because of the consistent JSON output for Google and Bing, and support for DuckDuckGo, Yandex, etc. This experiment cost me pennies — but make sure you check their pricing so you don’t get burnt by costs you shouldn’t be incurring.
The Interest Score formula probably needs tuning. The 40/30/30 weighting (new domains / rank improvement / domain overlap ratio, which the code stores as reshuffle_frequency) was only my first guess. It works reasonably well, but it’s not perfect. At the very least, I should weight new domains more heavily for breaking news queries, and reduce the impact of the domain overlap ratio for stable niches (because, for example, Wikipedia will always be #1 for “Python programming language”).

Try It Yourself

Again, the full code is on GitHub: https://github.com/sixthextinction/duckdb-google-trends-basic/

Quick start:

git clone https://github.com/sixthextinction/duckdb-google-trends-basic.git  
# or...  
gh repo clone sixthextinction/duckdb-google-trends-basic  
# then...  
cd duckdb-google-trends-basic    
pip install -r requirements.txt    

# Set environment variables    
export BRIGHT_DATA_API_KEY="your_key"    
export BRIGHT_DATA_ZONE="your_zone"    
export BRIGHT_DATA_COUNTRY="us"  # optional, for geo-targeted results    

# Or use a .env file instead (python-dotenv is included)    

# Test with sample data (no API key needed)    
python seed_data.py    
python main.py scores --query "nextjs" --days 7    

# Or fetch real data    
python main.py fetch --keywords "react" "vue" "svelte"

I’ve included a seed script that creates 7 days of synthetic data so you can test immediately without waiting. Otherwise, set up a daily cron job to fetch snapshots automatically, and within a week you’ll have real trend data.

Thanks for reading!

If you did something cool with this tool, I’d love to see it. Reach out on LinkedIn, or put it in the comments below.
