We added web search to a local RAG pipeline. It looked straightforward.
It wasn’t.
What followed:
- broken assumptions about scoring
- a safety model that behaved differently than we expected
- a “temporary” bypass that stuck around way too long
- and a cross-encoder that happily ranked navigation menus instead of content
Here’s what actually broke, and how we ended up fixing it.
TL;DR
- Treat web retrieval as another retrieval leg—not a special case
- Safety needs layers, but the baseline can’t be configurable away
- Fix score normalization before touching thresholds
- Pre-filter aggressively before the expensive steps
- Any workaround will outlive the issue it was meant to patch
1. Adding a fourth retrieval leg
The starting point was a pretty standard hybrid RAG setup: vector search (ChromaDB), BM25, and a graph index, all merged via Reciprocal Rank Fusion.
Adding web search felt like an obvious extension: just another retriever. Fetch results, wrap them, feed them into RRF, then let the same reranking and filtering pipeline handle everything.
That part worked fine. Web results flowed through the system exactly like local ones.
The important part, in retrospect, was this:
Nothing downstream needed to know whether a document came from the web or from the local corpus. That kept the architecture clean—but it also meant that any flaws in scoring or filtering would affect everything at once.
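To make the "just another leg" idea concrete, here is a minimal sketch of Reciprocal Rank Fusion over four ranked lists. The `Doc` type, the `source` labels, and the constant `k = 60` are assumptions for illustration, not our actual code. The fusion step never branches on `source`.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Doc:
    doc_id: str
    text: str
    source: str  # hypothetical label: "vector", "bm25", "graph" or "web"


def rrf_merge(legs: list[list[Doc]], k: int = 60) -> list[Doc]:
    """Fuse ranked lists: each doc scores sum(1 / (k + rank)) over all legs."""
    scores: dict[str, float] = defaultdict(float)
    by_id: dict[str, Doc] = {}
    for leg in legs:
        for rank, doc in enumerate(leg, start=1):
            scores[doc.doc_id] += 1.0 / (k + rank)
            by_id.setdefault(doc.doc_id, doc)  # keep first copy of a duplicate
    ordered = sorted(scores, key=lambda doc_id: scores[doc_id], reverse=True)
    return [by_id[doc_id] for doc_id in ordered]


vector_leg = [Doc("a", "local text a", "vector"), Doc("b", "local text b", "vector")]
bm25_leg = [Doc("b", "local text b", "bm25"), Doc("a", "local text a", "bm25")]
graph_leg = [Doc("a", "local text a", "graph")]
web_leg = [Doc("w1", "web snippet", "web")]  # the fourth leg is just another list

merged = rrf_merge([vector_leg, bm25_leg, graph_leg, web_leg])
```

The web leg is one more list in `legs`. That is the clean part, and also why a scoring bug downstream hits every source at once.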
2. Safety isn’t just “a filter”
One thing became clear quickly: web retrieval isn’t just retrieval. It has a side effect, because you’re turning a user query into a network call.
Once you look at it that way, safety becomes part of system correctness.
We ended up with a layered set of safeguards:
- hard blocks for categories that should never pass
- pattern filters for injection and obvious attacks
- a lightweight intent classifier for borderline cases
- query sanitization before sending anything externally
The key decision was where the baseline lives. We kept it in code, not config. Config can only tighten things—never relax them.
That wasn’t about compliance as much as predictability. If behavior can change silently via config, it becomes difficult to reason about what the system will actually do in production.
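The "config can only tighten" rule can be sketched as a union: the baseline is a constant in code, and config may only append to it. The patterns and function names below are hypothetical placeholders, not our real rules.

```python
import re
from typing import Optional

# Baseline lives in code. These patterns are illustrative assumptions only.
_BASELINE_BLOCK_PATTERNS: tuple = (
    r"ignore (all )?previous instructions",
    r"system prompt",
)


def build_block_patterns(config_extra: Optional[list] = None) -> list:
    """Baseline plus config additions. There is no code path that removes
    a baseline pattern, so config can tighten but never relax."""
    patterns = list(_BASELINE_BLOCK_PATTERNS) + list(config_extra or [])
    return [re.compile(p, re.IGNORECASE) for p in patterns]


def is_blocked(query: str, patterns: list) -> bool:
    """True if any block pattern matches the query."""
    return any(p.search(query) for p in patterns)


# An empty config still yields the full baseline.
default_patterns = build_block_patterns()
tightened = build_block_patterns(["internal-hostname"])
```

Because the function only ever concatenates, reading the code tells you the minimum behavior in production, whatever the config file says.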
3. The two-booleans trap
We made the classic mistake of representing one concept with two booleans:
- _ALLOW_WEB_SEARCH
- _WEB_SEARCH_DRY_RUN
We wanted three states:
- off
- dry run
- live
Two booleans give you four combinations. One of them didn’t mean anything, but it could still be set.
We eventually replaced this with a single explicit mode flag. Once the rest of the pipeline stabilized and the safeguards were in place, dry-run mode was no longer necessary, so we removed it.
The broader lesson still holds: control flow should make invalid states impossible to express.
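A single mode flag might look like the sketch below. It shows the three-state version from before dry run was removed. The names `WebSearchMode` and `parse_mode` are made up for this example, and falling back to `OFF` on unknown input is an assumption of the sketch, not something described above.

```python
from enum import Enum


class WebSearchMode(Enum):
    OFF = "off"
    DRY_RUN = "dry_run"
    LIVE = "live"


def parse_mode(raw: str) -> WebSearchMode:
    """Parse a config string. Unknown values fail closed to OFF (assumption)."""
    try:
        return WebSearchMode(raw.strip().lower())
    except ValueError:
        return WebSearchMode.OFF


def should_call_network(mode: WebSearchMode) -> bool:
    """Only LIVE issues a real request. DRY_RUN would log the query instead."""
    return mode is WebSearchMode.LIVE
```

With an enum there are exactly three values to handle, and the meaningless fourth combination cannot be written down at all.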
4. Filtering before reranking
Once web results were in the system, the next issue showed up in performance.
Cross-encoder reranking is expensive, and web results are noisy. Running a batch of mediocre snippets through it just to discard most of them later was wasteful.
We added two lightweight filters before reranking:
- BM25 over the snippets themselves
- cosine similarity between query and snippet embeddings, using the embedding model
Nothing fancy—just cheap ways to discard obviously weak candidates early.
That reduced load on the cross-encoder and improved output quality at the same time. Probably the lowest-effort, highest-impact change in the whole pipeline.
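Here is a rough sketch of the BM25 half of that pre-filter, written in plain Python so it runs without dependencies. The whitespace tokenizer, the `k1` and `b` defaults, and the `keep` cutoff are assumptions. A cosine filter would have the same shape, with embedding similarity in place of `bm25_scores`.

```python
import math
from collections import Counter


def tokenize(text: str) -> list:
    """Naive lowercase whitespace tokenizer (assumption; use a real one in practice)."""
    return text.lower().split()


def bm25_scores(query: str, snippets: list, k1: float = 1.5, b: float = 0.75) -> list:
    """Score each snippet against the query with Okapi BM25 over the snippet pool."""
    docs = [tokenize(s) for s in snippets]
    n = len(docs)
    if n == 0:
        return []
    avg_len = (sum(len(d) for d in docs) / n) or 1.0
    doc_freq = Counter(term for d in docs for term in set(d))
    query_terms = set(tokenize(query))
    scores = []
    for d in docs:
        tf = Counter(d)
        score = 0.0
        for term in query_terms:
            if term not in tf:
                continue
            idf = math.log(1 + (n - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5))
            length_norm = k1 * (1 - b + b * len(d) / avg_len)
            score += idf * tf[term] * (k1 + 1) / (tf[term] + length_norm)
        scores.append(score)
    return scores


def prefilter(query: str, snippets: list, keep: int = 5) -> list:
    """Keep the top `keep` snippets, dropping any with no lexical overlap at all."""
    ranked = sorted(zip(bm25_scores(query, snippets), snippets),
                    key=lambda pair: pair[0], reverse=True)
    return [snippet for score, snippet in ranked[:keep] if score > 0.0]
```

Only what survives `prefilter` goes on to the cross-encoder, so navigation-style snippets with zero overlap never cost a forward pass.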
5. The bypass that stayed too long
Early on, we let web results skip the rerank threshold entirely.
The reasoning seemed fine: search engines already rank results, and snippets often look worse to a cross-encoder than full documents.
In practice, this was compensating for something else—our normalization was wrong.
Once we fixed normalization, the bypass became a problem:
- low-quality snippets passed straight through
- they crowded out better local results
- the model started favoring short, clean web content over more relevant local data
We removed the bypass and introduced a separate threshold for web results instead.
The important bit wasn’t the solution—it was realizing that the bypass had outlived its purpose.
A workaround introduced for a broken system will still be there after the system is fixed, unless you remove it deliberately.
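In code, replacing the bypass with a per-source threshold is a small change. The sketch below assumes scores are already normalized to the 0..1 range. The numeric values and names are placeholders, not the thresholds we use.

```python
# Hypothetical values: tune against your own data.
LOCAL_THRESHOLD = 0.30
WEB_THRESHOLD = 0.55


def passes_threshold(score: float, source: str) -> bool:
    """Every result is thresholded; web results just use their own cutoff."""
    threshold = WEB_THRESHOLD if source == "web" else LOCAL_THRESHOLD
    return score >= threshold
```

The point is that there is no branch left where a result skips the check entirely.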
6. Fixing normalization
This was the biggest issue.
Cross-encoders output raw logits, and we initially applied a sigmoid to map them into (0, 1). It looked principled.
It didn’t work.
Technical content, especially text that isn’t written in a Q&A style, ended up with low scores everywhere. After thresholding, nothing survived, and the system quietly fell back to the LLM’s internal knowledge.
What worked instead was splitting the approach:
- local documents: min–max normalization, relative within the result pool
- web results: sigmoid, treated as absolute
This isn’t mathematically perfect, but it matched the behavior we actually needed in practice:
- local results compete with each other
- web results are judged on their own merit
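The split can be sketched like this: min-max over the local pool, sigmoid per web result. The function names and the `"web"` source label are assumptions. Scoring an all-equal local pool as 1.0 is a choice made for this sketch, not something the approach above prescribes.

```python
import math


def sigmoid(x: float) -> float:
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def min_max(values: list) -> list:
    """Scale values to [0, 1] relative to the pool they came from."""
    if not values:
        return []
    lo, hi = min(values), max(values)
    if hi == lo:
        return [1.0] * len(values)  # assumption: an all-equal pool scores 1.0
    return [(v - lo) / (hi - lo) for v in values]


def normalize(logits: list, sources: list) -> list:
    """Local results: min-max within the local pool. Web results: sigmoid."""
    out = [0.0] * len(logits)
    local_idx = [i for i, s in enumerate(sources) if s != "web"]
    for i, value in zip(local_idx, min_max([logits[i] for i in local_idx])):
        out[i] = value
    for i, s in enumerate(sources):
        if s == "web":
            out[i] = sigmoid(logits[i])
    return out


scores = normalize([-6.0, -2.0, -4.0, 0.0], ["vector", "bm25", "graph", "web"])
# scores == [0.0, 1.0, 0.5, 0.5]
```

In this example the best local document has a logit of -2.0. A plain sigmoid would score it at about 0.12, well below most thresholds, while min-max puts it at 1.0 because it is the best of its pool.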
7. What the cross-encoder actually sees
At one point we enabled full-page fetching for web results, thinking it would improve relevance.
It didn’t.
The cross-encoder only sees the first ~512 tokens of its input, and everything after that is truncated. For a web page, that’s usually:
- headers
- navigation
- metadata
The useful paragraph is often much further down.
So the model was effectively ranking page chrome.
The fix was simple:
- always keep the original search snippet
- use the snippet for reranking
- optionally send the full page to the LLM
Different stages need different inputs. That became obvious only after trying to unify them.
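One way to keep those inputs separate is to let the result object carry both texts and expose one accessor per stage. `WebResult` and its method names are hypothetical, a sketch of the idea rather than our data model.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class WebResult:
    url: str
    snippet: str                     # always kept from the search provider
    full_text: Optional[str] = None  # only set when page fetching is enabled

    def rerank_text(self) -> str:
        """The cross-encoder always gets the snippet, never the page."""
        return self.snippet

    def llm_text(self) -> str:
        """The LLM gets the full page if we have it, else the snippet."""
        return self.full_text or self.snippet


result = WebResult(
    url="https://example.com/article",
    snippet="RRF combines ranked lists by summing reciprocal ranks.",
    full_text="Home | About | Login ... RRF combines ranked lists by summing ...",
)
```

Fetching the page can then never change what the reranker scores, which is what went wrong when we tried to unify the two.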
8. Intent filtering, narrowly scoped
The intent classifier ended up being a small guardrail.
It doesn’t try to be perfect—it just catches things that slip past the other safeguards:
- entity + action combinations
- queries that are syntactically clean but clearly problematic
Two constraints kept it predictable:
- the baseline is fixed in code
- config can only add to it or tighten it
It’s not a replacement for other safety measures—just an additional layer.
9. Things we didn’t optimize (yet)
There are a few obvious extensions we didn’t pursue immediately:
- multiple search providers
- translating queries before filtering
- multi-query expansion for web search
All of those have tradeoffs, especially once network side-effects are involved. We chose to stabilize the core pipeline first.
The one-liner
Web search isn’t a feature you bolt on. It changes the behavior of the entire pipeline.
Once queries turn into network calls, scoring, filtering, and safeguards all become coupled in ways they weren’t before. Most of the complexity came from making that interaction predictable.
Curious how others handle this:
- Do you normalize cross-encoder scores globally or per-query?
- Has anyone had success calibrating logits instead of mixing normalization strategies?
If you want to dig into the implementation: