I’ve been reviewing some old RAG code I wrote a year ago and boy has it not aged well. For context: frequently, I’ll need to do what I call a fast “literature” pass. What’s the latest opinion on long context vs. RAG? What’s new for agentic retrieval? Where does hybrid search fit in 2026?
I depend heavily on arxiv papers for this. Using a SERP API (Bright Data) for Google, running site:arxiv.org plus a well-framed query gets me recent, relevant papers. I’d run four or five of these, collate the results…and then inevitably get bogged down scrolling JSON, opening new tabs, running grep. Just godawful UX for what is, at its core, a search problem, not a data problem.
So I spent this past week refactoring it all, and eventually ended up building a local faceted search surface over real web data.
- Fetch organic Google results (site:arxiv.org + query)
- Index them with Typesense (a FOSS, lightning-fast, local-first Algolia alternative)
- Add a few lines of Python server code to proxy queries from the browser
This gives me a live, filterable index where I now quickly search across all my query runs at once, see which papers surfaced under which query angle, and spot overlaps in seconds instead of minutes.
This pattern should work for any research domain where Google is a better discovery layer than the source’s own search, so I’m open sourcing it and writing this up. I hope it’s useful!
Find the full code here:
Prerequisites
I wasn’t kidding about keeping the Python side minimal. We’ll only need Docker and three Python packages (no frameworks):
requests>=2.28.0
python-dotenv>=1.0.0
typesense>=0.21.0
The only one worth mentioning is the Typesense Python client — it handles schema creation, JSONL import, and search. The other two are bog-standard Requests and python-dotenv.
Typesense itself runs in Docker Compose. Let’s make it one container with a persistent volume that survives restarts:
docker-compose.yml
services:
  typesense:
    image: typesense/typesense:26.0
    restart: unless-stopped
    ports:
      - "8108:8108"
    volumes:
      - typesense-data:/data
    command: >
      --data-dir /data
      --api-key devtypesense
      --listen-port 8108
      --enable-cors
volumes:
  typesense-data:
# and then you can do
docker compose up -d
.env
BRIGHT_DATA_API_KEY=your_api_key
BRIGHT_DATA_ZONE=serp
BRIGHT_DATA_COUNTRY=us
TYPESENSE_API_KEY=devtypesense
The TYPESENSE_API_KEY can be anything really — it just has to match the --api-key flag in the compose file. I'll explain why the browser never sees it when we get to serve.py.
Bright Data credentials come from your account. If you’re swapping in another SERP API, this is the only file you’d change.
How the pieces fit together
We have four files:
bright_data_serp.py # Bright Data SERP client
ingest.py # fetch → transform → upsert into Typesense
serve.py # /api/search proxy + static file server
static/index.html # search UI
ingest.py sends queries to Bright Data, maps each organic result to a Typesense document, and bulk-imports the batch. After that, serve.py sits between the browser and Typesense — authenticated calls go out, plain JSON comes back. The browser never talks to Typesense directly.
Let’s go through each.
How to get structured SERP data from Bright Data
Our client will just POST to https://api.brightdata.com/request with a Bearer token, a zone name, and a Google URL string.
Critically, you have to include brd_json=1. Without it you get raw HTML. With it, you get a parsed organic JSON array — each row has title, link, description, rank, and usually more.
bright_data_serp.py
import json
import os
import time
from typing import Any, Dict, Optional
import requests
from dotenv import load_dotenv
load_dotenv()
def limit_organic(data: Dict[str, Any], max_results: int) -> Dict[str, Any]:
"""Keep at most ``max_results`` organic rows. Google/Bright Data often ignore ``&num=``; slice client-side."""
    if max_results <= 0:
        return data
    organic = data.get("organic")
    if isinstance(organic, list) and len(organic) > max_results:
        return {**data, "organic": organic[:max_results]}
return data
class BrightDataSERPClient:
def __init__(
self,
api_key: Optional[str] = None,
zone: Optional[str] = None,
country: Optional[str] = None,
):
self.api_key = api_key or os.getenv("BRIGHT_DATA_API_KEY")
self.zone = zone or os.getenv("BRIGHT_DATA_ZONE")
self.country = country or os.getenv("BRIGHT_DATA_COUNTRY")
self.api_endpoint = "https://api.brightdata.com/request"
if not self.api_key:
raise ValueError("BRIGHT_DATA_API_KEY is required.")
if not self.zone:
raise ValueError("BRIGHT_DATA_ZONE is required.")
self.session = requests.Session()
self.session.headers.update(
{
"Content-Type": "application/json",
"Authorization": f"Bearer {self.api_key}",
}
)
def search(
self,
query: str,
num_results: int = 10,
language: Optional[str] = None,
country: Optional[str] = None,
max_retries: int = 2,
) -> Dict[str, Any]:
last_err: Optional[Exception] = None
for attempt in range(max_retries + 1):
try:
return self._do_search(query, num_results, language, country)
except Exception as e:
last_err = e
                if attempt < max_retries:
                    time.sleep(0.5 * (attempt + 1))
assert last_err is not None
raise last_err
def _do_search(
self,
query: str,
num_results: int,
language: Optional[str],
country: Optional[str],
    ) -> Dict[str, Any]:
# Omit &num=: deprecated by Google (Bright Data strips it); use limit_organic after fetch.
search_url = (
f"https://www.google.com/search"
f"?q={requests.utils.quote(query)}"
f"&brd_json=1"
)
if language:
search_url += f"&hl={language}&lr=lang_{language}"
target_country = country or self.country
payload: Dict[str, Any] = {
"zone": self.zone,
"url": search_url,
"format": "json",
}
if target_country:
payload["country"] = target_country
response = self.session.post(self.api_endpoint, json=payload, timeout=60)
response.raise_for_status()
result = response.json()
if not isinstance(result, dict):
raise RuntimeError(f"Bright Data unexpected response type: {type(result)}")
inner_status = result.get("status_code")
if inner_status is not None and inner_status != 200:
raise RuntimeError(f"Bright Data SERP status_code={inner_status}")
if "body" in result:
body = result["body"]
if isinstance(body, str):
if not body.strip():
raise RuntimeError("Bright Data SERP empty body")
result = json.loads(body)
else:
result = body
elif "organic" not in result:
raise RuntimeError("Bright Data response missing 'body' and 'organic'")
return limit_organic(result, num_results)
A very common gotcha: no matter how intuitive it might feel, do not put &num= on that search URL to request N results like this:
search_url = (
f"https://www.google.com/search"
f"?q={requests.utils.quote(query)}"
f"&num=50"
f"&brd_json=1"
)
Google deprecated the num parameter for ordinary web search back in September 2025. Now, you typically get about one page of organics (~10). So we'll have to cap rows in code with limit_organic(..., num_results) — slice organic after the response, not via the URL.
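To make the client-side cap concrete, here is limit_organic (the same logic as in the client above) run against a synthetic SERP dict — the row contents and the extra key are invented for illustration:

```python
from typing import Any, Dict

def limit_organic(data: Dict[str, Any], max_results: int) -> Dict[str, Any]:
    # Same logic as in bright_data_serp.py: slice the organic list client-side.
    if max_results <= 0:
        return data
    organic = data.get("organic")
    if isinstance(organic, list) and len(organic) > max_results:
        return {**data, "organic": organic[:max_results]}
    return data

# Synthetic payload: 30 organic rows, plus an unrelated key that must survive.
serp = {"organic": [{"rank": i} for i in range(1, 31)], "general": {"query": "demo"}}

capped = limit_organic(serp, 10)
assert len(capped["organic"]) == 10            # capped to 10 rows
assert capped["general"] == {"query": "demo"}  # other keys pass through untouched
assert len(serp["organic"]) == 30              # original dict is not mutated
```

Note that the function returns a new dict rather than mutating its input, so a caller can keep the full payload around if it needs it.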
With "format": "json", the JSON you parse from the HTTP response is an envelope, not the SERP object itself: status_code, headers, and body. The real SERP payload is inside body, usually as a JSON string you must json.loads again.
That means only a 200 from api.brightdata.com is not enough: check the inner status_code (e.g. 401 → empty body). The client rejects non-200 inner status, empty body, and missing organic after unwrap so ingest doesn’t silently index nothing.
result = response.json()
inner = result.get("status_code")
if inner is not None and inner != 200:
raise RuntimeError(f"Bright Data SERP status_code={inner}")
if "body" in result:
body = result["body"]
if isinstance(body, str):
if not body.strip():
raise RuntimeError("Bright Data SERP empty body")
result = json.loads(body)
else:
result = body
# ...
return limit_organic(result, num_results)
If you skip the unwrap and pass the top-level dict to organic_to_documents, there's no organic key — and with no check, you get an empty index and no error message. It just silently indexes nothing. (Ask me how I know.🙃)
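The unwrap is easy to sanity-check without hitting the API. Here is a synthetic envelope run through the same unwrap logic as the client — the envelope field names match the "format": "json" shape described above, but the SERP contents are made up:

```python
import json

def unwrap_serp(result: dict) -> dict:
    # Mirror of the client's unwrap: reject bad inner status, decode the body string.
    inner = result.get("status_code")
    if inner is not None and inner != 200:
        raise RuntimeError(f"Bright Data SERP status_code={inner}")
    if "body" in result:
        body = result["body"]
        if isinstance(body, str):
            if not body.strip():
                raise RuntimeError("Bright Data SERP empty body")
            return json.loads(body)
        return body
    if "organic" not in result:
        raise RuntimeError("Bright Data response missing 'body' and 'organic'")
    return result

# The SERP payload arrives double-encoded: a JSON string inside a JSON envelope.
envelope = {
    "status_code": 200,
    "body": json.dumps({"organic": [{"title": "Demo paper", "link": "https://example.org"}]}),
}
serp = unwrap_serp(envelope)
assert "organic" in serp
assert serp["organic"][0]["title"] == "Demo paper"
```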
Finally, our client retries with a short backoff — 0.5s * (attempt + 1) — so a transient failure on one query doesn't kill the whole run.
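With the default max_retries=2, that schedule works out to half a second after the first failure and a full second after the second:

```python
max_retries = 2
delays = [0.5 * (attempt + 1) for attempt in range(max_retries)]
assert delays == [0.5, 1.0]  # attempt 0 -> 0.5s, attempt 1 -> 1.0s; a third failure raises
```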
How to design the Typesense Schema
Typesense needs a collection before anything can go in. The schema maps directly to the shape of an organic SERP result — I didn’t add any fields I wasn’t already getting for free:
ingest.py
# Fetches Google SERP via Bright Data THEN indexes organic results into Typesense.
# Use --append to upsert into an existing index.
# Use --query and/or --queries-file to override the built-in demo query list.
import argparse
import hashlib
import json
import os
import time
from pathlib import Path
from typing import Any, Dict, List
from urllib.parse import urlparse
import typesense
from dotenv import load_dotenv
from typesense.exceptions import ObjectNotFound
from bright_data_serp import BrightDataSERPClient
load_dotenv()
COLLECTION = "serp_results"
# Some obvious "RAG and retrieval" topics
DEFAULT_QUERIES = [
"site:arxiv.org retrieval augmented generation 2026",
"site:arxiv.org hybrid search reranking 2026",
"site:arxiv.org agentic RAG 2026",
"site:arxiv.org long context vs RAG 2026",
]
def typesense_client() -> typesense.Client:
return typesense.Client(
{
"nodes": [
{
"host": os.getenv("TYPESENSE_HOST", "localhost"),
"port": os.getenv("TYPESENSE_PORT", "8108"),
"protocol": os.getenv("TYPESENSE_PROTOCOL", "http"),
}
],
"api_key": os.environ["TYPESENSE_API_KEY"],
"connection_timeout_seconds": 30,
}
)
def collection_schema() -> Dict[str, Any]:
return {
"name": COLLECTION,
"fields": [
{"name": "title", "type": "string"},
{"name": "url", "type": "string"},
{"name": "snippet", "type": "string", "optional": True},
{"name": "source_query", "type": "string", "facet": True},
{"name": "domain", "type": "string", "facet": True},
{"name": "position", "type": "int32"},
],
"default_sorting_field": "position",
}
def organic_to_documents(data: Dict[str, Any], source_query: str) -> List[Dict[str, Any]]:
organic = data.get("organic")
if not isinstance(organic, list):
return []
out: List[Dict[str, Any]] = []
for i, row in enumerate(organic):
if not isinstance(row, dict):
continue
url = row.get("link") or row.get("url") or ""
if not url:
continue
title = (row.get("title") or "")[:8000]
snippet = (row.get("description") or row.get("snippet") or "") or ""
snippet = snippet[:16000]
pos = row.get("rank") or row.get("position") or (i + 1)
try:
position = int(pos)
except (TypeError, ValueError):
position = i + 1
domain = urlparse(url).netloc or ""
doc_id = hashlib.sha256(f"{url}\t{source_query}".encode()).hexdigest()
out.append(
{
"id": doc_id,
"title": title,
"url": url,
"snippet": snippet,
"source_query": source_query,
"domain": domain,
"position": position,
}
)
return out
def ensure_collection(client: typesense.Client, *, recreate: bool) -> None:
if recreate:
try:
client.collections[COLLECTION].delete()
except ObjectNotFound:
pass
client.collections.create(collection_schema())
return
try:
client.collections[COLLECTION].retrieve()
except ObjectNotFound:
client.collections.create(collection_schema())
def load_queries(args: argparse.Namespace) -> List[str]:
queries: List[str] = []
if args.queries_file:
text = Path(args.queries_file).read_text(encoding="utf-8")
for line in text.splitlines():
line = line.strip()
if not line or line.startswith("#"):
continue
queries.append(line)
extra = args.queries or []
queries.extend(extra)
if not queries:
return list(DEFAULT_QUERIES)
return queries
def main() -> None:
p = argparse.ArgumentParser(description="Ingest Bright Data SERP into Typesense.")
p.add_argument(
"--num-results",
type=int,
default=8,
help="Max organic rows to index per query after fetch (Google ignores &num=; we slice client-side).",
)
p.add_argument(
"--delay",
type=float,
default=0.6,
help="Seconds between Bright Data requests.",
)
p.add_argument(
"--append",
action="store_true",
help="Do not drop the collection; create it only if missing. Use for multiple ingest runs into one index.",
)
p.add_argument(
"--query",
action="append",
dest="queries",
metavar="Q",
help="SERP query string (repeatable). Default: built-in demo queries if no --queries-file/--query.",
)
p.add_argument(
"--queries-file",
type=str,
default=None,
help="Path to a file with one query per line (# and blank lines ignored).",
)
args = p.parse_args()
client = typesense_client()
ensure_collection(client, recreate=not args.append)
bd = BrightDataSERPClient()
all_docs: List[Dict[str, Any]] = []
query_list = load_queries(args)
for q in query_list:
print(f"Query: {q!r}")
try:
raw = bd.search(q, num_results=args.num_results)
except Exception as e:
print(f" error: {e}")
continue
docs = organic_to_documents(raw, q)
print(f" indexed {len(docs)} organic rows")
all_docs.extend(docs)
time.sleep(args.delay)
if not all_docs:
print("No documents to import. Check Bright Data credentials and SERP response.")
return
jsonl = "\n".join(json.dumps(d, ensure_ascii=False) for d in all_docs)
imp = client.collections[COLLECTION].documents.import_(jsonl, {"action": "upsert"})
# import_ returns one JSON object per line
errors = [line for line in imp.split("\n") if line and '"success":false' in line]
if errors:
print("Import reported errors (first few):", errors[:3])
print(f"Done. Total documents: {len(all_docs)}")
if __name__ == "__main__":
main()
Two fields have facet: True: source_query and domain. These are what the filter chips in the UI are built on. source_query is the exact string sent to the SERP API — i.e. not a label you add later, the actual query. domain is extracted from the URL at ingest time.
Both become filterable for free here, which is a huge win for us.
Also, default_sorting_field: "position" means results come back in the same order Google returned them. I do want that as a default — it's the ranking signal I'm using Bright Data to get in the first place.
Some Common Gotchas
When you’re mapping organic results to documents, the first question is how to generate document IDs. The move that feels right is to simply hash the URL — deduplicate on URL, one document per link.
Don’t listen to that instinct. Don’t do this:
doc_id = hashlib.sha256(url.encode()).hexdigest()
What you should do is bake the query into the ID so the same link under two Bright Data runs is two documents, each tagged with the query that surfaced it:
doc_id = hashlib.sha256(f"{url}\t{source_query}".encode()).hexdigest()
The ID is sha256(url + source_query), so the same paper appearing under two different queries becomes two separate documents. Search for a paper title and both facet chips show up — you can see exactly which of your Bright Data runs found it. If you hash on URL alone, that's gone permanently. The index looks cleaner but you've thrown away the only thing that makes the source_query facet meaningful.
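You can see the effect directly — same URL, two seed queries, two distinct IDs. The URL and query strings here are placeholders:

```python
import hashlib

def doc_id(url: str, source_query: str) -> str:
    # Same scheme as ingest.py: the query is part of the document identity.
    return hashlib.sha256(f"{url}\t{source_query}".encode()).hexdigest()

url = "https://arxiv.org/abs/0000.00000"  # placeholder URL
id_a = doc_id(url, "site:arxiv.org agentic RAG 2026")
id_b = doc_id(url, "site:arxiv.org long context vs RAG 2026")

assert id_a != id_b  # one document per (url, query) pair
assert doc_id(url, "site:arxiv.org agentic RAG 2026") == id_a  # stable across runs
```

The second assertion is why upserts work across --append runs: re-ingesting the same query updates documents in place instead of duplicating them.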
One more thing that will ruin your day if you miss it: Bright Data returns link in most payloads but url in some, and the snippet text may arrive under either description or snippet depending on the response. Handle both; otherwise some batches index with blank snippets, with no errors or warnings:
url = row.get("link") or row.get("url") or ""
snippet = (row.get("description") or row.get("snippet") or "")[:16000]
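A quick check with both row shapes (the field values are invented):

```python
def normalize(row: dict) -> dict:
    # Accept either naming convention Bright Data uses across payloads.
    return {
        "url": row.get("link") or row.get("url") or "",
        "snippet": (row.get("description") or row.get("snippet") or "")[:16000],
    }

row_a = {"link": "https://example.org/a", "description": "desc text"}
row_b = {"url": "https://example.org/b", "snippet": "snippet text"}

assert normalize(row_a) == {"url": "https://example.org/a", "snippet": "desc text"}
assert normalize(row_b) == {"url": "https://example.org/b", "snippet": "snippet text"}
```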
The difference between a snapshot and a corpus
ingest.py runs in two modes:
python ingest.py # drops and recreates the collection
python ingest.py --append # creates only if missing, then upserts
Running without --append wipes and recreates the collection every time — that's probably fine for exploration, throwaway by design. --append creates the collection only if it doesn't exist, then upserts into it.
That matters because let’s say I have a scenario where I ran the default four queries on Monday. Thursday I wanted to add site:arxiv.org graph RAG 2026 to the same index — compare it against what I'd already collected rather than start over. With --append, the new results land alongside the originals and the new seed query shows up as a chip immediately. Without it, I'd be choosing between Monday's index and Thursday's.
That’s what I meant by “collect once, query many times” — the index accumulates and doesn’t reset or get overwritten each time.
Custom queries work inline or from a file:
python ingest.py --append --query "site:arxiv.org graph RAG 2026"
python ingest.py --append --queries-file my_queries.txt
Keeping the Typesense API key server-side
You could point the browser straight at Typesense and skip serve.py entirely. The problem is that Typesense's API key is an admin key — the same one that can drop your collection. Put it in client-side JS and anyone who opens devtools has it.
So serve.py is just a proxy. The browser calls /api/search, the server makes the authenticated Typesense request and JSON comes back.
I kept it as stdlib [http.server](https://docs.python.org/3/library/http.server.html) — no Flask or FastAPI. Adding a framework to wrap thirty lines of routing is honestly just adding a dependency for the sake of having a dependency. If you want to build on top of this, swapping in your preferred framework takes an hour.
The search parameters passed to Typesense are set once.
serve.py
import json
import os
import urllib.parse
from http.server import BaseHTTPRequestHandler, HTTPServer
from pathlib import Path
import typesense
from dotenv import load_dotenv
load_dotenv()
STATIC = Path(__file__).resolve().parent / "static"
COLLECTION = "serp_results"
PORT = int(os.getenv("SERVE_PORT", "8765"))
def client() -> typesense.Client:
return typesense.Client(
{
"nodes": [
{
"host": os.getenv("TYPESENSE_HOST", "localhost"),
"port": os.getenv("TYPESENSE_PORT", "8108"),
"protocol": os.getenv("TYPESENSE_PROTOCOL", "http"),
}
],
"api_key": os.environ["TYPESENSE_API_KEY"],
"connection_timeout_seconds": 10,
}
)
class Handler(BaseHTTPRequestHandler):
_ts: typesense.Client | None = None
@classmethod
def typesense(cls) -> typesense.Client:
if cls._ts is None:
cls._ts = client()
return cls._ts
    def log_message(self, fmt: str, *args: object) -> None:
print(f"[{self.address_string()}] {fmt % args}")
def do_GET(self) -> None:
parsed = urllib.parse.urlparse(self.path)
if parsed.path == "/api/search":
self._search(parsed.query)
return
if parsed.path == "/" or parsed.path == "/index.html":
self._file(STATIC / "index.html", "text/html; charset=utf-8")
return
self.send_error(404, "Not found")
def _file(self, path: Path, content_type: str) -> None:
if not path.is_file():
self.send_error(404, "Not found")
return
data = path.read_bytes()
self.send_response(200)
self.send_header("Content-Type", content_type)
self.send_header("Content-Length", str(len(data)))
self.end_headers()
self.wfile.write(data)
def _search(self, query: str) -> None:
qs = urllib.parse.parse_qs(query)
q = (qs.get("q") or [""])[0].strip()
fq = (qs.get("filter_by") or [""])[0].strip()
if not q:
payload = {
"hits": [],
"found": 0,
"facet_counts": [],
"q": q,
}
self._json(payload)
return
# Text search spans four stored fields (see ingest schema). Weights tune BM25-style
# ranking: a term in the title should matter more than the same term buried in the
# snippet, and more than an incidental match in the URL or domain string.
# Order MUST match query_by — Typesense applies weights positionally.
query_by = "title,snippet,url,domain"
query_by_weights = "4,3,1,1" # so titles are more important than snippets, which are more important than urls, which are more important than domains
params: dict = {
"q": q,
"query_by": query_by,
"query_by_weights": query_by_weights,
"facet_by": "source_query,domain",
"max_facet_values": 40,
"per_page": 25,
}
if fq:
params["filter_by"] = fq
try:
result = self.typesense().collections[COLLECTION].documents.search(params)
except Exception as e:
self.send_response(500)
self.send_header("Content-Type", "application/json")
self.end_headers()
self.wfile.write(json.dumps({"error": str(e)}).encode())
return
self._json(result)
def _json(self, obj: object) -> None:
data = json.dumps(obj, ensure_ascii=False).encode("utf-8")
self.send_response(200)
self.send_header("Content-Type", "application/json; charset=utf-8")
self.send_header("Content-Length", str(len(data)))
self.end_headers()
self.wfile.write(data)
def main() -> None:
server = HTTPServer(("127.0.0.1", PORT), Handler)
print(f"SERP demo UI: http://127.0.0.1:{PORT}/")
server.serve_forever()
if __name__ == "__main__":
main()
query_by_weights runs in the same order as query_by. A match in title outscores the same match in snippet, which outscores a match in url or domain. That nudges ranking toward "this is what the page is about" rather than "this word appears somewhere in the metadata" — no embeddings, no extra service, just the standard keyword-search lever.
domain being in query_by is a small trick: searching arxiv.org directly returns everything from that domain. Useful when you've mixed sources in one index, and costs nothing.
facet_by returns counts alongside every search response — the UI builds chips from those without a second request.
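The facet_counts array has one entry per field in facet_by, each carrying a counts list of {value, count} pairs. A sketch of turning that into chip data — the response values here are invented, but the nesting follows Typesense's documented shape:

```python
# One entry per faceted field; "counts" holds the distinct values and their hit counts.
facet_counts = [
    {"field_name": "source_query", "counts": [
        {"value": "site:arxiv.org agentic RAG 2026", "count": 8},
        {"value": "site:arxiv.org hybrid search reranking 2026", "count": 8},
    ]},
    {"field_name": "domain", "counts": [{"value": "arxiv.org", "count": 16}]},
]

# Build chip data per field: a list of (label, count) tuples.
chips = {
    f["field_name"]: [(c["value"], c["count"]) for c in f["counts"]]
    for f in facet_counts
}
assert chips["domain"] == [("arxiv.org", 16)]
assert len(chips["source_query"]) == 2
```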
One UX detail I cared about: if a facet filter produces zero results, the UI reruns the query without filter_by, keeps the chips populated from those broader counts, and tells you that your filters might be hiding matches. You don’t want a blank screen with zero explanation, do you? 🙃
Typesense Facets in Vanilla JavaScript
The UI can just be regular JavaScript/CSS. I don’t need to go into too much detail, frontend UI design isn’t the point of this post. All you need is some sort of JS logic that hits /api/search, renders hits, and builds chips from facet_counts.
Facet state is two variables:
let filterSq = ""; // active source_query filter
let filterDom = ""; // active domain filter
Clicking a chip toggles the relevant variable and re-runs the search. Multiple active filters compose with &&:
function buildFilterBy() {
var parts = [];
if (filterSq) parts.push("source_query:=`" + filterSq + "`");
if (filterDom) parts.push("domain:=`" + filterDom + "`");
return parts.join(" && ");
}
That string goes straight into Typesense’s filter_by — the UI is just a thin layer over native filter syntax. Nothing to maintain on the client side.
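The same composition is easy to check server-side. This is a Python mirror of the JS above, parameterized instead of reading globals, purely for illustration:

```python
def build_filter_by(filter_sq: str = "", filter_dom: str = "") -> str:
    # Mirrors the JS: backtick-quote values, join active filters with &&.
    parts = []
    if filter_sq:
        parts.append(f"source_query:=`{filter_sq}`")
    if filter_dom:
        parts.append(f"domain:=`{filter_dom}`")
    return " && ".join(parts)

assert build_filter_by() == ""
assert build_filter_by("site:arxiv.org agentic RAG 2026") == "source_query:=`site:arxiv.org agentic RAG 2026`"
assert build_filter_by("q", "arxiv.org") == "source_query:=`q` && domain:=`arxiv.org`"
```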
Each result card shows the title, snippet, domain, and the seed query that produced it. That last tag is the thing. You can see at a glance which Bright Data run each result came from — i.e. which question you were asking when you made the query.
Running it
# 1. Start Typesense
docker compose up -d
# 2. Install Python deps
pip install -r requirements.txt
# 3. Ingest the demo queries
python ingest.py
You'll see:
Query: 'site:arxiv.org retrieval augmented generation 2026'
indexed 8 organic rows
Query: 'site:arxiv.org hybrid search reranking 2026'
indexed 8 organic rows
Query: 'site:arxiv.org agentic RAG 2026'
indexed 8 organic rows
Query: 'site:arxiv.org long context vs RAG 2026'
indexed 8 organic rows
Done. Total documents: 32
# 4. Start the UI
python serve.py
Open http://127.0.0.1:8765/ (or whatever you set with SERVE_PORT). You should see the empty search shell first:
Search for memory, chunk, graph, RAG. Click a seed query chip to isolate a single SERP run. If you've mixed domains, the domain chips filter those too.
Second pass, same index:
python ingest.py --append --query "site:arxiv.org graph RAG 2026"
New seed query appears as a chip immediately. Everything you indexed before is still there.
What query "provenance" actually means
The default run collects ~32 arxiv results tagged across four seed queries. Search for RAG or memory and you get hits from all four runs mixed together.
Now the interesting question is this: are the results under “agentic RAG 2026” the same papers as under “long context vs RAG 2026”?
We can verify this quickly.
Click the site:arxiv.org agentic RAG 2026 chip: that’s one shortlist. Clear it, then click site:arxiv.org long context vs RAG 2026: another. Some papers appear in both lists, and this view makes them easy to spot. Those are the ones Google considers relevant regardless of how you framed the question. The ones in only one list are specific to that framing.
This is what I mean by provenance. The source_query facet isn’t a topic label; it’s a record of which question you were asking when you collected the data. A paper showing up under multiple seeds is telling you something, not presenting a deduplication problem.
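If you want the overlap programmatically rather than by clicking chips, the indexed documents already carry everything needed — the rows here are invented, but they follow the ingest schema:

```python
from collections import defaultdict

# Each indexed document keeps the seed query that surfaced it.
docs = [
    {"url": "https://arxiv.org/abs/0000.00001", "source_query": "agentic RAG"},
    {"url": "https://arxiv.org/abs/0000.00001", "source_query": "long context vs RAG"},
    {"url": "https://arxiv.org/abs/0000.00002", "source_query": "agentic RAG"},
]

queries_by_url = defaultdict(set)
for d in docs:
    queries_by_url[d["url"]].add(d["source_query"])

# Papers that surfaced under more than one framing of the question.
overlap = sorted(u for u, qs in queries_by_url.items() if len(qs) > 1)
assert overlap == ["https://arxiv.org/abs/0000.00001"]
```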
One honest caveat, though: this is navigation over SERP metadata — titles, snippets, URLs. It can’t search inside the PDFs. What it does is let me triage thirty papers in two minutes instead of twenty, which is the problem I actually had.
Frequently Asked Questions (FAQ)
Q: How do I get Google search results as JSON with Bright Data?
A: POST to https://api.brightdata.com/request with a Bearer token, your zone name, and a Google URL that includes &brd_json=1. That flag is what flips the response from raw HTML to a parsed organic array (each row has title, link, description, rank). The JSON you get back is an envelope — the SERP payload is inside body, usually as a JSON string you have to json.loads a second time.
Q: Typesense vs Meilisearch vs Elasticsearch — which should I pick for a local search index?
A: For this kind of workload (a small, local, faceted index over web data) Typesense and Meilisearch are both reasonable but Elasticsearch is overkill. Typesense is in-memory C++, sub-millisecond latency, facets and typo tolerance on by default, one Docker container, no JVM. Meilisearch is Rust, disk-backed (LMDB), handles larger corpora on less RAM, and has arguably nicer defaults for developer UX. Elasticsearch is what you use when you have a dedicated ops team, billions of documents, or log-analytics workloads.
Q: Why is the same URL indexed twice if it appears under two queries?
A: Because I want it that way. The document ID is sha256(url + source_query), so the same paper surfacing under "agentic RAG 2026" and under "long context vs RAG 2026" becomes two documents — each tagged with the query that found it. Searching for the title shows both facet chips, which is how you see which Bright Data run produced each hit. Hash on URL alone and that provenance is gone permanently.
Q: Does this actually search inside the papers, or just the search-result metadata?
A: Just metadata — titles, snippets, URLs, domains, and the seed query. It’s navigation over SERP rows, not full-text search over PDFs. If you need to search inside the papers, you’d add a second stage — download the PDFs, chunk, embed — on top of this index, using the URLs it surfaces as the candidate set.
Q: Can I use this pipeline for non-arxiv sources?
A: Yes. The pipeline has no opinion about what the queries are. site:arxiv.org is just the scenario I needed; swap in site:github.com, site:news.ycombinator.com, mix site: operators, or drop the filter entirely. The domain field is extracted from the URL at ingest time, so mixed-domain runs get a second facet chip for free.
Q: Why stdlib _http.server_ instead of Flask or FastAPI?
A: Because the proxy is small enough that a framework import would be bigger than the logic it wraps. One handler, two routes (/ and /api/search), no middleware, no router — stdlib is enough. If you're building on top of this, swapping in FastAPI or your preferred framework takes an hour; I just didn't want to pay the dependency tax for a demo.
Key Takeaways
Bright Data solves the hard part of web data collection — proxy rotation, bot detection, structured extraction. Yadda, yadda.
What you do with that JSON is a different question. Export it and it answers the questions you had when you wrote the query. Or, index it, and it answers questions you haven’t even thought of yet.
Moving from a collection as an endpoint to a collection as the start of something you can actually explore while researching is what I was going for here. It took me a week to refactor something I’d been doing badly for a year, and about twenty minutes to run once it was done. It scales well, too: more queries, more domains, more --append runs, and the Typesense index grows with the research instead of resetting every time.