Prithwish Nath

Posted on • Originally published at Medium

I Built a $0 Search Engine on Real Web Data (No Algolia or Elasticsearch)

I’ve been reviewing some old RAG code I wrote a year ago and boy has it not aged well. For context: frequently, I’ll need to do what I call a fast “literature” pass. What’s the latest opinion on long context vs. RAG? What’s new for agentic retrieval? Where does hybrid search fit in 2026?

I depend heavily on arxiv papers for this. Using a SERP API (Bright Data) for Google, running site:arxiv.org plus a well-framed query gets me recent, relevant papers. I’d run four or five of these, collate the results…and then inevitably get bogged down scrolling JSON, opening new tabs, running grep. Just godawful UX for what is, at its core, a search problem, not a data problem.

So I spent this past week refactoring it all, and eventually ended up building a local faceted search surface over real web data.

  • Fetch organic Google results (site:arxiv.org + query)
  • Index them with Typesense (a FOSS, lightning-fast, local-first Algolia alternative)
  • Proxy queries from the browser with a few lines of Python server code

This gives me a live, filterable index where I now quickly search across all my query runs at once, see which papers surfaced under which query angle, and spot overlaps in seconds instead of minutes.

[Screenshot: faceted search over indexed SERP rows, with keyword query, seed-query chips, domain chip, and result cards with provenance]

This pattern should work for any research domain where Google is a better discovery layer than the source’s own search, so I’m open sourcing it and writing this up. I hope it’s useful!

Find the full code here:

GitHub - sixthextinction/typesense: POC for local-first Algolia-style search but FOSS.

Prerequisites


I wasn’t kidding about keeping the Python side minimal. We’ll only need Docker and three Python packages (no frameworks):

requests>=2.28.0  
python-dotenv>=1.0.0  
typesense>=0.21.0

The only one worth mentioning is the Typesense Python client — it handles schema creation, JSONL import, and search. The other two are bog-standard Requests and python-dotenv.

Typesense itself runs in Docker Compose. Let’s make it one container with a persistent volume that survives restarts:

docker-compose.yml

services:  
  typesense:  
    image: typesense/typesense:26.0  
    restart: unless-stopped  
    ports:  
      - "8108:8108"  
    volumes:  
      - typesense-data:/data  
    command: >  
      --data-dir /data  
      --api-key devtypesense  
      --listen-port 8108  
      --enable-cors  

volumes:  
  typesense-data:
# and then you can do  
docker compose up -d

.env

BRIGHT_DATA_API_KEY=your_api_key  
BRIGHT_DATA_ZONE=serp  
BRIGHT_DATA_COUNTRY=us  
TYPESENSE_API_KEY=devtypesense

The TYPESENSE_API_KEY can be anything really — it just has to match the --api-key flag in the compose file. I'll explain why the browser never sees it when we get to serve.py.

Bright Data credentials come from your account. If you’re swapping in another SERP API, this is the only file you’d change.

How the pieces fit together


We have four files:

bright_data_serp.py   # Bright Data SERP client  
ingest.py             # fetch → transform → upsert into Typesense  
serve.py              # /api/search proxy + static file server  
static/index.html     # search UI

ingest.py sends queries to Bright Data, maps each organic result to a Typesense document, and bulk-imports the batch. After that, serve.py sits between the browser and Typesense — authenticated calls go out, plain JSON comes back. The browser never talks to Typesense directly.

Let’s go through each.

How to get structured SERP data from Bright Data


Our client will just POST to https://api.brightdata.com/request with a Bearer token, a zone name, and a Google URL string.

Critically, you have to include brd_json=1. Without it you get raw HTML. With it, you get a parsed organic JSON array — each row has title, link, description, rank, and usually more.

bright_data_serp.py

import json  
import os  
import time  
from typing import Any, Dict, Optional  

import requests  
from dotenv import load_dotenv  

load_dotenv()  


def limit_organic(data: Dict[str, Any], max_results: int) -> Dict[str, Any]:  
    """Keep at most ``max_results`` organic rows. Google/Bright Data often ignore ``&num=``; slice client-side."""  
    if max_results <= 0:  
        return data  
    organic = data.get("organic")  
    if isinstance(organic, list) and len(organic) > max_results:  
        return {**data, "organic": organic[:max_results]}  
    return data  


class BrightDataSERPClient:  
    def __init__(  
        self,  
        api_key: Optional[str] = None,  
        zone: Optional[str] = None,  
        country: Optional[str] = None,  
    ):  
        self.api_key = api_key or os.getenv("BRIGHT_DATA_API_KEY")  
        self.zone = zone or os.getenv("BRIGHT_DATA_ZONE")  
        self.country = country or os.getenv("BRIGHT_DATA_COUNTRY")  
        self.api_endpoint = "https://api.brightdata.com/request"  

        if not self.api_key:  
            raise ValueError("BRIGHT_DATA_API_KEY is required.")  
        if not self.zone:  
            raise ValueError("BRIGHT_DATA_ZONE is required.")  

        self.session = requests.Session()  
        self.session.headers.update(  
            {  
                "Content-Type": "application/json",  
                "Authorization": f"Bearer {self.api_key}",  
            }  
        )  

    def search(  
        self,  
        query: str,  
        num_results: int = 10,  
        language: Optional[str] = None,  
        country: Optional[str] = None,  
        max_retries: int = 2,  
    ) -> Dict[str, Any]:  
        last_err: Optional[Exception] = None  
        for attempt in range(max_retries + 1):  
            try:  
                return self._do_search(query, num_results, language, country)  
            except Exception as e:  
                last_err = e  
                if attempt < max_retries:  
                    time.sleep(0.5 * (attempt + 1))  
        assert last_err is not None  
        raise last_err  

    def _do_search(  
        self,  
        query: str,  
        num_results: int,  
        language: Optional[str],  
        country: Optional[str],  
    ) -> Dict[str, Any]:  
        # Omit &num=: deprecated by Google (Bright Data strips it); use limit_organic after fetch.  
        search_url = (  
            f"https://www.google.com/search"  
            f"?q={requests.utils.quote(query)}"  
            f"&brd_json=1"  
        )  
        if language:  
            search_url += f"&hl={language}&lr=lang_{language}"  
        target_country = country or self.country  
        payload: Dict[str, Any] = {  
            "zone": self.zone,  
            "url": search_url,  
            "format": "json",  
        }  
        if target_country:  
            payload["country"] = target_country  

        response = self.session.post(self.api_endpoint, json=payload, timeout=60)  
        response.raise_for_status()  
        result = response.json()  
        if not isinstance(result, dict):  
            raise RuntimeError(f"Bright Data unexpected response type: {type(result)}")  
        inner_status = result.get("status_code")  
        if inner_status is not None and inner_status != 200:  
            raise RuntimeError(f"Bright Data SERP status_code={inner_status}")  
        if "body" in result:  
            body = result["body"]  
            if isinstance(body, str):  
                if not body.strip():  
                    raise RuntimeError("Bright Data SERP empty body")  
                result = json.loads(body)  
            else:  
                result = body  
        elif "organic" not in result:  
            raise RuntimeError("Bright Data response missing 'body' and 'organic'")  
        return limit_organic(result, num_results)

A very common gotcha: no matter how intuitive it might feel, do not put &num= on that search URL to request N results like this:

search_url = (  
    f"https://www.google.com/search"  
    f"?q={requests.utils.quote(query)}"  
    f"&num=50"  
    f"&brd_json=1"  
)

Google deprecated the num parameter for ordinary web search back in September 2025. Now, you typically get about one page of organics (~10). So we'll have to cap rows in code with limit_organic(..., num_results) — slice organic after the response, not via the URL.

With "format": "json", the JSON you parse from the HTTP response is an envelope, not the SERP object itself: status_code, headers, and body. The real SERP payload is inside body, usually as a JSON string you must json.loads again.

That means a 200 from api.brightdata.com alone is not enough: check the inner status_code (e.g. 401 → empty body). The client rejects a non-200 inner status, an empty body, and a missing organic key after unwrap, so ingest doesn’t silently index nothing.

result = response.json()  
inner = result.get("status_code")  
if inner is not None and inner != 200:  
    raise RuntimeError(f"Bright Data SERP status_code={inner}")  
if "body" in result:  
    body = result["body"]  
    if isinstance(body, str):  
        if not body.strip():  
            raise RuntimeError("Bright Data SERP empty body")  
        result = json.loads(body)  
    else:  
        result = body  
# ...  
return limit_organic(result, num_results)

If you skip the unwrap and pass the top-level dict to organic_to_documents, there's no organic key — and with no check, you get an empty index and no error message. It just silently indexes nothing. (Ask me how I know.🙃)

Finally, our client retries with a short backoff — 0.5s * (attempt + 1) — so a transient failure on one query doesn't kill the whole run.

How to design the Typesense Schema


Typesense needs a collection before anything can go in. The schema maps directly to the shape of an organic SERP result — I didn’t add any fields I wasn’t already getting for free:

ingest.py

# Fetches Google SERP via Bright Data THEN indexes organic results into Typesense.  
# Use --append to upsert into an existing index.   
# Use --query and/or --queries-file to override the built-in demo query list.  

import argparse  
import hashlib  
import json  
import os  
import time  
from pathlib import Path  
from typing import Any, Dict, List  
from urllib.parse import urlparse  

import typesense  
from dotenv import load_dotenv  
from typesense.exceptions import ObjectNotFound  

from bright_data_serp import BrightDataSERPClient  

load_dotenv()  

COLLECTION = "serp_results"  

# Some obvious "RAG and retrieval" topics  
DEFAULT_QUERIES = [  
    "site:arxiv.org retrieval augmented generation 2026",  
    "site:arxiv.org hybrid search reranking 2026",  
    "site:arxiv.org agentic RAG 2026",  
    "site:arxiv.org long context vs RAG 2026",  
]  


def typesense_client() -> typesense.Client:  
    return typesense.Client(  
        {  
            "nodes": [  
                {  
                    "host": os.getenv("TYPESENSE_HOST", "localhost"),  
                    "port": os.getenv("TYPESENSE_PORT", "8108"),  
                    "protocol": os.getenv("TYPESENSE_PROTOCOL", "http"),  
                }  
            ],  
            "api_key": os.environ["TYPESENSE_API_KEY"],  
            "connection_timeout_seconds": 30,  
        }  
    )  


def collection_schema() -> Dict[str, Any]:  
    return {  
        "name": COLLECTION,  
        "fields": [  
            {"name": "title", "type": "string"},  
            {"name": "url", "type": "string"},  
            {"name": "snippet", "type": "string", "optional": True},  
            {"name": "source_query", "type": "string", "facet": True},  
            {"name": "domain", "type": "string", "facet": True},  
            {"name": "position", "type": "int32"},  
        ],  
        "default_sorting_field": "position",  
    }  


def organic_to_documents(  
    data: Dict[str, Any], source_query: str  
) -> List[Dict[str, Any]]:  
    organic = data.get("organic")  
    if not isinstance(organic, list):  
        return []  
    out: List[Dict[str, Any]] = []  
    for i, row in enumerate(organic):  
        if not isinstance(row, dict):  
            continue  
        url = row.get("link") or row.get("url") or ""  
        if not url:  
            continue  
        title = (row.get("title") or "")[:8000]  
        snippet = (row.get("description") or row.get("snippet") or "") or ""  
        snippet = snippet[:16000]  
        pos = row.get("rank") or row.get("position") or (i + 1)  
        try:  
            position = int(pos)  
        except (TypeError, ValueError):  
            position = i + 1  
        domain = urlparse(url).netloc or ""  
        doc_id = hashlib.sha256(f"{url}\t{source_query}".encode()).hexdigest()  
        out.append(  
            {  
                "id": doc_id,  
                "title": title,  
                "url": url,  
                "snippet": snippet,  
                "source_query": source_query,  
                "domain": domain,  
                "position": position,  
            }  
        )  
    return out  


def ensure_collection(client: typesense.Client, *, recreate: bool) -> None:  
    if recreate:  
        try:  
            client.collections[COLLECTION].delete()  
        except ObjectNotFound:  
            pass  
        client.collections.create(collection_schema())  
        return  
    try:  
        client.collections[COLLECTION].retrieve()  
    except ObjectNotFound:  
        client.collections.create(collection_schema())  


def load_queries(args: argparse.Namespace) -> List[str]:  
    queries: List[str] = []  
    if args.queries_file:  
        text = Path(args.queries_file).read_text(encoding="utf-8")  
        for line in text.splitlines():  
            line = line.strip()  
            if not line or line.startswith("#"):  
                continue  
            queries.append(line)  
    extra = args.queries or []  
    queries.extend(extra)  
    if not queries:  
        return list(DEFAULT_QUERIES)  
    return queries  


def main() -> None:  
    p = argparse.ArgumentParser(description="Ingest Bright Data SERP into Typesense.")  
    p.add_argument(  
        "--num-results",  
        type=int,  
        default=8,  
        help="Max organic rows to index per query after fetch (Google ignores &num=; we slice client-side).",  
    )  
    p.add_argument(  
        "--delay",  
        type=float,  
        default=0.6,  
        help="Seconds between Bright Data requests.",  
    )  
    p.add_argument(  
        "--append",  
        action="store_true",  
        help="Do not drop the collection; create it only if missing. Use for multiple ingest runs into one index.",  
    )  
    p.add_argument(  
        "--query",  
        action="append",  
        dest="queries",  
        metavar="Q",  
        help="SERP query string (repeatable). Default: built-in demo queries if no --queries-file/--query.",  
    )  
    p.add_argument(  
        "--queries-file",  
        type=str,  
        default=None,  
        help="Path to a file with one query per line (# and blank lines ignored).",  
    )  
    args = p.parse_args()  

    client = typesense_client()  
    ensure_collection(client, recreate=not args.append)  

    bd = BrightDataSERPClient()  
    all_docs: List[Dict[str, Any]] = []  
    query_list = load_queries(args)  

    for q in query_list:  
        print(f"Query: {q!r}")  
        try:  
            raw = bd.search(q, num_results=args.num_results)  
        except Exception as e:  
            print(f"  error: {e}")  
            continue  
        docs = organic_to_documents(raw, q)  
        print(f"  indexed {len(docs)} organic rows")  
        all_docs.extend(docs)  
        time.sleep(args.delay)  

    if not all_docs:  
        print("No documents to import. Check Bright Data credentials and SERP response.")  
        return  

    jsonl = "\n".join(json.dumps(d, ensure_ascii=False) for d in all_docs)  
    imp = client.collections[COLLECTION].documents.import_(jsonl, {"action": "upsert"})  
    # import_ returns one JSON object per line  
    errors = [line for line in imp.split("\n") if line and '"success":false' in line]  
    if errors:  
        print("Import reported errors (first few):", errors[:3])  
    print(f"Done. Total documents: {len(all_docs)}")  


if __name__ == "__main__":  
    main()

Two fields have facet: True: source_query and domain. These are what the filter chips in the UI are built on. source_query is the exact string sent to the SERP API — i.e. not a label you add later, the actual query. domain is extracted from the URL at ingest time.

Both become filterable for free here, which is a huge win for us.

Also, default_sorting_field: "position" means results come back in the same order Google returned them. I do want that as a default — it's the ranking signal I'm using Bright Data to get in the first place.

Some Common Gotchas


When you’re mapping organic results to documents, the first question is how to generate document IDs. The move that feels right is to simply hash the URL — deduplicate on URL, one document per link.

Don’t listen to that instinct. Don’t do this:

doc_id = hashlib.sha256(url.encode()).hexdigest()

What you should do is bake the query into the ID so the same link under two Bright Data runs is two documents, each tagged with the query that surfaced it:

doc_id = hashlib.sha256(f"{url}\t{source_query}".encode()).hexdigest()

The ID is sha256(url + source_query), so the same paper appearing under two different queries becomes two separate documents. Search for a paper title and both facet chips show up — you can see exactly which of your Bright Data runs found it. If you hash on URL alone, that's gone permanently. The index looks cleaner but you've thrown away the only thing that makes the source_query facet meaningful.
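Here’s the scheme in miniature: a toy demo of the ID logic from ingest.py (the paper URL below is made up):

```python
import hashlib

def doc_id(url: str, source_query: str) -> str:
    # Same scheme as ingest.py: hash the URL plus the query that surfaced it.
    return hashlib.sha256(f"{url}\t{source_query}".encode()).hexdigest()

url = "https://arxiv.org/abs/2401.00001"  # hypothetical paper URL

# One paper, two seed queries -> two distinct documents, each keeping
# its own source_query facet value.
id_a = doc_id(url, "site:arxiv.org agentic RAG 2026")
id_b = doc_id(url, "site:arxiv.org long context vs RAG 2026")
print(id_a != id_b)  # True
```

Hash the URL alone and those two rows would collapse into one upserted document, keeping only whichever query happened to run last.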

One more thing that will ruin your day if you miss it: Bright Data returns link in most payloads but url in some, and description and snippet both map to the snippet field depending on the response. Handle both, or some batches will index with blank snippets and give you no errors or warnings:

url     = row.get("link") or row.get("url") or ""  
snippet = (row.get("description") or row.get("snippet") or "")[:16000]

The difference between a snapshot and a corpus


ingest.py runs in two modes:

python ingest.py            # drops and recreates the collection  
python ingest.py --append   # creates only if missing, then upserts

Running without --append wipes and recreates the collection every time — that's probably fine for exploration, throwaway by design. --append creates the collection only if it doesn't exist, then upserts into it.

That matters in a scenario like this: I ran the default four queries on Monday. On Thursday, I want to add site:arxiv.org graph RAG 2026 to the same index — compare it against what I'd already collected rather than start over. With --append, the new results land alongside the originals and the new seed query shows up as a chip immediately. Without it, I'd be choosing between Monday's index and Thursday's.

That’s what I meant by “collect once, query many times” — the index accumulates and doesn’t reset or get overwritten each time.

Custom queries work inline or from a file:

python ingest.py --append --query "site:arxiv.org graph RAG 2026"  
python ingest.py --append --queries-file my_queries.txt
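For reference, a queries file is just one query per line; comment lines and blank lines are skipped by load_queries(). The file name and queries here are examples:

```
# my_queries.txt — one SERP query per line
site:arxiv.org graph RAG 2026

# blank lines and comment lines like these are ignored
site:arxiv.org multimodal retrieval 2026
```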

Keeping the Typesense API key server-side


You could point the browser straight at Typesense and skip serve.py entirely. The problem is that Typesense's API key is an admin key — the same one that can drop your collection. Put it in client-side JS and anyone who opens devtools has it.

So serve.py is just a proxy. The browser calls /api/search, the server makes the authenticated Typesense request and JSON comes back.

I kept it as stdlib [http.server](https://docs.python.org/3/library/http.server.html) — no Flask or FastAPI. Adding a framework to wrap thirty lines of routing is honestly just adding a dependency for the sake of having a dependency. If you want to build on top of this, swapping in your preferred framework takes an hour.

The search parameters passed to Typesense are set once, server-side, so every browser query runs with the same fields, weights, and facets.

serve.py

import json  
import os  
import urllib.parse  
from http.server import BaseHTTPRequestHandler, HTTPServer  
from pathlib import Path  

import typesense  
from dotenv import load_dotenv  

load_dotenv()  

STATIC = Path(__file__).resolve().parent / "static"  
COLLECTION = "serp_results"  
PORT = int(os.getenv("SERVE_PORT", "8765"))  


def client() -> typesense.Client:  
    return typesense.Client(  
        {  
            "nodes": [  
                {  
                    "host": os.getenv("TYPESENSE_HOST", "localhost"),  
                    "port": os.getenv("TYPESENSE_PORT", "8108"),  
                    "protocol": os.getenv("TYPESENSE_PROTOCOL", "http"),  
                }  
            ],  
            "api_key": os.environ["TYPESENSE_API_KEY"],  
            "connection_timeout_seconds": 10,  
        }  
    )  


class Handler(BaseHTTPRequestHandler):  
    _ts: typesense.Client | None = None  

    @classmethod  
    def typesense(cls) -> typesense.Client:  
        if cls._ts is None:  
            cls._ts = client()  
        return cls._ts  

    def log_message(self, fmt: str, *args: object) -> None:  
        print(f"[{self.address_string()}] {fmt % args}")  

    def do_GET(self) -> None:  
        parsed = urllib.parse.urlparse(self.path)  
        if parsed.path == "/api/search":  
            self._search(parsed.query)  
            return  
        if parsed.path == "/" or parsed.path == "/index.html":  
            self._file(STATIC / "index.html", "text/html; charset=utf-8")  
            return  
        self.send_error(404, "Not found")  

    def _file(self, path: Path, content_type: str) -> None:  
        if not path.is_file():  
            self.send_error(404, "Not found")  
            return  
        data = path.read_bytes()  
        self.send_response(200)  
        self.send_header("Content-Type", content_type)  
        self.send_header("Content-Length", str(len(data)))  
        self.end_headers()  
        self.wfile.write(data)  

    def _search(self, query: str) -> None:  
        qs = urllib.parse.parse_qs(query)  
        q = (qs.get("q") or [""])[0].strip()  
        fq = (qs.get("filter_by") or [""])[0].strip()  

        if not q:  
            payload = {  
                "hits": [],  
                "found": 0,  
                "facet_counts": [],  
                "q": q,  
            }  
            self._json(payload)  
            return  

        # Text search spans four stored fields (see ingest schema). Weights tune BM25-style  
        # ranking: a term in the title should matter more than the same term buried in the  
        # snippet, and more than an incidental match in the URL or domain string.  
        # Order MUST match query_by — Typesense applies weights positionally.  
        query_by = "title,snippet,url,domain"  
        query_by_weights = "4,3,1,1" # so titles are more important than snippets, which are more important than urls, which are more important than domains  

        params: dict = {  
            "q": q,  
            "query_by": query_by,  
            "query_by_weights": query_by_weights,  
            "facet_by": "source_query,domain",  
            "max_facet_values": 40,  
            "per_page": 25,  
        }  
        if fq:  
            params["filter_by"] = fq  

        try:  
            result = self.typesense().collections[COLLECTION].documents.search(params)  
        except Exception as e:  
            self.send_response(500)  
            self.send_header("Content-Type", "application/json")  
            self.end_headers()  
            self.wfile.write(json.dumps({"error": str(e)}).encode())  
            return  

        self._json(result)  

    def _json(self, obj: object) -> None:  
        data = json.dumps(obj, ensure_ascii=False).encode("utf-8")  
        self.send_response(200)  
        self.send_header("Content-Type", "application/json; charset=utf-8")  
        self.send_header("Content-Length", str(len(data)))  
        self.end_headers()  
        self.wfile.write(data)  


def main() -> None:  
    server = HTTPServer(("127.0.0.1", PORT), Handler)  
    print(f"SERP demo UI: http://127.0.0.1:{PORT}/")  
    server.serve_forever()  


if __name__ == "__main__":  
    main()

query_by_weights runs in the same order as query_by. A match in title outscores the same match in snippet, which outscores a match in url or domain. That nudges ranking toward "this is what the page is about" rather than "this word appears somewhere in the metadata" — no embeddings, no extra service, just the standard keyword-search lever.

domain being in query_by is a small trick: searching arxiv.org directly returns everything from that domain. Useful when you've mixed sources in one index, and costs nothing.

facet_by returns counts alongside every search response — the UI builds chips from those without a second request.
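For illustration, here’s roughly how you’d fold facet_counts into chip data. The sample response below is hand-written, not a captured payload, and trimmed to the fields this logic reads:

```python
# Trimmed, hand-written stand-in for the facet_counts array in a
# Typesense search response (real responses carry more fields).
sample_response = {
    "facet_counts": [
        {
            "field_name": "source_query",
            "counts": [
                {"value": "site:arxiv.org agentic RAG 2026", "count": 8},
                {"value": "site:arxiv.org hybrid search reranking 2026", "count": 5},
            ],
        },
        {
            "field_name": "domain",
            "counts": [{"value": "arxiv.org", "count": 13}],
        },
    ],
}

def chips(response: dict, field: str) -> list[tuple[str, int]]:
    # Pull (value, count) pairs for one faceted field out of a search response.
    for facet in response.get("facet_counts", []):
        if facet["field_name"] == field:
            return [(c["value"], c["count"]) for c in facet["counts"]]
    return []

print(chips(sample_response, "source_query"))
# [('site:arxiv.org agentic RAG 2026', 8), ('site:arxiv.org hybrid search reranking 2026', 5)]
```

The UI does the same walk in JS; one search response is enough to render both chip rows with live counts.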

One UX detail I cared about: if a facet filter produces zero results, the UI reruns the query without filter_by, keeps the chips populated from those broader counts, and tells you that your filters might be hiding matches. You don’t want a blank screen with zero explanation, do you? 🙃
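The actual fallback lives in the UI’s JS, but the decision is simple enough to sketch as a pure function. The function name and return shape here are mine, not from the repo:

```python
def pick_render_state(filtered: dict, unfiltered: dict) -> dict:
    # filtered: the /api/search response WITH filter_by applied.
    # unfiltered: the same query re-run WITHOUT filter_by.
    if filtered.get("found", 0) > 0:
        return {"hits": filtered["hits"], "facets": filtered["facet_counts"], "notice": None}
    # Zero hits under the active chips: render no results, but keep the
    # chips alive from the broader counts and say why the screen is empty.
    return {
        "hits": [],
        "facets": unfiltered.get("facet_counts", []),
        "notice": "No matches with the current filters: they may be hiding results.",
    }

state = pick_render_state(
    {"found": 0, "hits": [], "facet_counts": []},
    {"found": 12, "hits": ["..."], "facet_counts": [{"field_name": "domain", "counts": []}]},
)
print(state["notice"] is not None)  # True
```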

Typesense Facets in Vanilla JavaScript


The UI can just be regular JavaScript/CSS. I won’t go into too much detail here; frontend UI design isn’t the point of this post. All you need is some JS logic that hits /api/search, renders hits, and builds chips from facet_counts.

Facet state is two variables:

let filterSq  = "";  // active source_query filter  
let filterDom = "";  // active domain filter

Clicking a chip toggles the relevant variable and re-runs the search. Multiple active filters compose with &&:

function buildFilterBy() {  
  var parts = [];  
  if (filterSq)  parts.push("source_query:=`" + filterSq + "`");  
  if (filterDom) parts.push("domain:=`" + filterDom + "`");  
  return parts.join(" && ");  
}

That string goes straight into Typesense’s filter_by — the UI is just a thin layer over native filter syntax. Nothing to maintain on the client side.

Each result card shows the title, snippet, domain, and the seed query that produced it. That last tag is the thing. You can see at a glance which Bright Data run each result came from — i.e. which question you were asking when you made the query.

Running it


# 1. Start Typesense  
docker compose up -d  

# 2. Install Python deps  
pip install -r requirements.txt  

# 3. Ingest the demo queries  
python ingest.py

You'll see:

Query: 'site:arxiv.org retrieval augmented generation 2026'  
  indexed 8 organic rows  
Query: 'site:arxiv.org hybrid search reranking 2026'  
  indexed 8 organic rows  
Query: 'site:arxiv.org agentic RAG 2026'  
  indexed 8 organic rows  
Query: 'site:arxiv.org long context vs RAG 2026'  
  indexed 8 organic rows  
Done. Total documents: 32
# 4. Start the UI  
python serve.py

Open http://127.0.0.1:8765/ (or whatever you set with SERVE_PORT). You should see the empty search shell first:

[Screenshot: landing state, with the search box and short explainer before you run a query]

Search for memory, chunk, graph, RAG. Click a seed query chip to isolate a single SERP run. If you've mixed domains, the domain chips filter those too.

Second pass, same index:

python ingest.py --append --query "site:arxiv.org graph RAG 2026"

New seed query appears as a chip immediately. Everything you indexed before is still there.

What query "provenance" actually means


The default run collects ~32 arxiv results tagged across four seed queries. Search for RAG or memory and you get hits from all four runs mixed together.

[Screenshot: same keywords, narrowed to one seed query via a facet chip; the shortlist is the papers that surfaced under that SERP API run]

Now the interesting question is this: are the results under “agentic RAG 2026” the same papers as under “long context vs RAG 2026”?

We can verify this quickly.

Click the site:arxiv.org agentic RAG 2026 chip — that’s one shortlist. Clear it, then click site:arxiv.org long context vs RAG 2026 — that’s another. Some papers appear in both, and you can inspect that overlap in seconds. Those are the ones Google considers relevant regardless of how you framed the question. The ones in only one list are specific to that framing.

This is what I mean by provenance. The source_query facet isn’t a topic label; it’s a record of which question you were asking when you collected the data. A paper showing up under multiple seeds is telling you something, not presenting a deduplication problem.
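If you’d rather compute the overlap than click through chips, the same data supports it in a few lines. A sketch that groups documents by URL, where the docs list is a toy stand-in for what you’d export from the serp_results collection:

```python
from collections import defaultdict

# Toy stand-in for documents exported from the serp_results collection.
docs = [
    {"url": "https://arxiv.org/abs/1", "source_query": "site:arxiv.org agentic RAG 2026"},
    {"url": "https://arxiv.org/abs/1", "source_query": "site:arxiv.org long context vs RAG 2026"},
    {"url": "https://arxiv.org/abs/2", "source_query": "site:arxiv.org agentic RAG 2026"},
]

by_url: dict[str, set[str]] = defaultdict(set)
for d in docs:
    by_url[d["url"]].add(d["source_query"])

# URLs that surfaced under more than one seed query: the framing-independent papers.
overlap = {url: queries for url, queries in by_url.items() if len(queries) > 1}
print(list(overlap))  # ['https://arxiv.org/abs/1']
```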

One honest caveat, though: this is navigation over SERP metadata — titles, snippets, URLs. It can’t search inside the PDFs. What it does is let me triage thirty papers in two minutes instead of twenty, which is the problem I actually had.

Frequently Asked Questions (FAQ)


Q: How do I get Google search results as JSON with Bright Data?

A: POST to https://api.brightdata.com/request with a Bearer token, your zone name, and a Google URL that includes &brd_json=1. That flag is what flips the response from raw HTML to a parsed organic array (each row has title, link, description, rank). The JSON you get back is an envelope — the SERP payload is inside body, usually as a JSON string you have to json.loads a second time.

Q: Typesense vs Meilisearch vs Elasticsearch — which should I pick for a local search index?

A: For this kind of workload (a small, local, faceted index over web data) Typesense and Meilisearch are both reasonable but Elasticsearch is overkill. Typesense is in-memory C++, sub-millisecond latency, facets and typo tolerance on by default, one Docker container, no JVM. Meilisearch is Rust, disk-backed (LMDB), handles larger corpora on less RAM, and has arguably nicer defaults for developer UX. Elasticsearch is what you use when you have a dedicated ops team, billions of documents, or log-analytics workloads.

Q: Why is the same URL indexed twice if it appears under two queries?

A: Because I want it that way. The document ID is sha256(url + source_query), so the same paper surfacing under "agentic RAG 2026" and under "long context vs RAG 2026" becomes two documents — each tagged with the query that found it. Searching for the title shows both facet chips, which is how you see which Bright Data run produced each hit. Hash on URL alone and that provenance is gone permanently.

Q: Does this actually search inside the papers, or just the search-result metadata?

A: Just metadata — titles, snippets, URLs, domains, and the seed query. It’s navigation over SERP rows, not full-text search over PDFs. If you need to search inside the papers, you’d add a second stage — download the PDFs, chunk, embed — on top of this index, using the URLs it surfaces as the candidate set.

Q: Can I use this pipeline for non-arxiv sources?

A: Yes. The pipeline has no opinion about what the queries are. site:arxiv.org is just the scenario I needed; swap in site:github.com, site:news.ycombinator.com, mix site: operators, or drop the filter entirely. The domain field is extracted from the URL at ingest time, so mixed-domain runs get a second facet chip for free.

Q: Why stdlib _http.server_ instead of Flask or FastAPI?

A: Because the proxy is small enough that a framework import would be bigger than the logic it wraps. One handler, two routes (/ and /api/search), no middleware, no router — stdlib is enough. If you're building on top of this, swapping in FastAPI or your preferred framework takes an hour; I just didn't want to pay the dependency tax for a demo.

Key Takeaways


Bright Data solves the hard part of web data collection — proxy rotation, bot detection, structured extraction. Yadda, yadda.

What you do with that JSON is a different question. Export it and it answers only the questions you had when you wrote the query. Index it, and it answers questions you haven’t even thought of yet.

Going from a collection as an endpoint to a collection as the start of something you can actually explore while researching is what I was going for here. It took me a week to refactor something I’d been doing badly for a year, and about twenty minutes to run once it was done. And it scales: more queries, more domains, more --append runs, and the Typesense index grows with the research instead of resetting every time.
