HarinezumIgel

Posted on Jun 2

Adding web search to our RAG pipeline: what broke and why

#rag #ai #architecture #cybersecurity

We added web search to a local RAG pipeline. It looked straightforward.

It wasn’t.

What followed:

broken assumptions about scoring
a safety model that behaved differently than we expected
a “temporary” bypass that stuck around way too long
and a cross-encoder that happily ranked navigation menus instead of content

Here’s what actually broke, and how we ended up fixing it.

TL;DR

Treat web retrieval as another retrieval leg—not a special case
Safety needs layers, but the baseline can’t be configurable away
Fix score normalization before touching thresholds
Pre-filter aggressively before the expensive steps
Any workaround will outlive the issue it was meant to patch

1. Adding a fourth retrieval leg

The starting point was a pretty standard hybrid RAG setup: vector search (ChromaDB), BM25, and a graph index, all merged via Reciprocal Rank Fusion.

Adding web search felt like an obvious extension: just another retriever. Fetch results, wrap them, feed them into RRF, then let the same reranking and filtering pipeline handle everything.

That part worked fine. Web results flowed through the system exactly like local ones.

The important part, in retrospect, was this:

Nothing downstream needed to know whether a document came from the web or from the local corpus. That kept the architecture clean—but it also meant that any flaws in scoring or filtering would affect everything at once.
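
As a rough illustration of web search as just another leg, here is a minimal Reciprocal Rank Fusion sketch. The leg names, document IDs, and the constant `k=60` (a commonly used default) are assumptions for the example, not values taken from the project.

```python
from collections import defaultdict


def reciprocal_rank_fusion(ranked_lists, k=60):
    """Fuse several ranked lists of document IDs into one ranking.

    Each list contributes 1 / (k + rank) per document, with rank starting at 1.
    """
    scores = defaultdict(float)
    for ranking in ranked_lists:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties are broken by ID for determinism.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


# Hypothetical output of the four legs. The web leg is just one more list.
legs = {
    "vector": ["doc_a", "doc_b", "doc_c"],
    "bm25": ["doc_b", "doc_a", "doc_d"],
    "graph": ["doc_c", "doc_b"],
    "web": ["web_1", "doc_b", "web_2"],
}
fused = reciprocal_rank_fusion(legs.values())
```

The fusion step never inspects where a document came from, which is the property described above.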

2. Safety isn’t just “a filter” (safeguards)

One thing that became clear quickly: web retrieval isn’t just retrieval. It has a side effect: every user query becomes an outbound network call.

Once you look at it that way, safety becomes part of system correctness.

We ended up with a layered set of safeguards:

hard blocks for categories that should never pass
pattern filters for injection and obvious attacks
a lightweight intent classifier for borderline cases
query sanitization before sending anything externally

The key decision was where the baseline lives. We kept it in code, not config. Config can only tighten things—never relax them.

That wasn’t about compliance as much as predictability. If behavior can change silently via config, it becomes difficult to reason about what the system will actually do in production.
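
One way to express "config can only tighten" is to merge config into the baseline with operations that cannot loosen it. This is a sketch under assumptions: the category names, the length limit, and both function names are hypothetical and not taken from the project.

```python
# Hypothetical baseline; the real categories are not shown in the post.
BASELINE_BLOCKED = frozenset({"malware", "credentials"})
BASELINE_MAX_QUERY_CHARS = 256  # assumed limit, for illustration only


def effective_blocklist(config_blocked):
    """Config may add blocked categories but cannot remove baseline ones."""
    return BASELINE_BLOCKED | frozenset(config_blocked)


def effective_max_query_chars(config_value=None):
    """Config may lower the limit but can never raise it above the baseline."""
    if config_value is None:
        return BASELINE_MAX_QUERY_CHARS
    return min(BASELINE_MAX_QUERY_CHARS, config_value)
```

Because the merge is a set union and a `min`, no config value can produce a weaker policy than the one compiled into the code.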

3. The two-booleans trap

We made the classic mistake of representing one concept with two booleans.

_ALLOW_WEB_SEARCH
_WEB_SEARCH_DRY_RUN

We wanted three states:

off
dry run
live

Two booleans give you four states. One of them didn’t mean anything, but it could still be set.

We eventually replaced this with a single explicit mode flag. Once the rest of the pipeline stabilized and the safeguards were in place, dry-run mode was no longer needed, so we removed it.

The broader lesson still holds: control flow should make invalid states impossible to express.
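
A single mode flag could look like the sketch below. The enum, its string values, and the helper names are illustrative assumptions. The point is only that exactly three states can be expressed.

```python
from enum import Enum


class WebSearchMode(Enum):
    """The three intended states; a fourth, meaningless one cannot exist."""
    OFF = "off"
    DRY_RUN = "dry_run"
    LIVE = "live"


def parse_mode(value: str) -> WebSearchMode:
    """Parse a config string, rejecting anything that is not a known mode."""
    try:
        return WebSearchMode(value.strip().lower())
    except ValueError:
        raise ValueError(f"unknown web search mode: {value!r}") from None


def should_call_network(mode: WebSearchMode) -> bool:
    """Only LIVE is allowed to produce an outbound request."""
    return mode is WebSearchMode.LIVE
```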

4. Filtering before reranking

Once web results were in the system, the next issue showed up in performance.

Cross-encoder reranking is expensive, and web results are noisy. Running a batch of mediocre snippets through it just to discard most of them later was wasteful.

We added two lightweight filters before reranking:

BM25 over the snippets themselves
cosine similarity using the embedding model

Nothing fancy—just cheap ways to discard obviously weak candidates early.

That reduced load on the cross-encoder and improved output quality at the same time. Probably the lowest-effort, highest-impact change in the whole pipeline.
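
A minimal sketch of the cosine pre-filter, assuming embeddings have already been computed. The toy two-dimensional vectors, the `min_similarity` cutoff, and the `keep_top` cap are placeholders, not the project's settings.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors; 0.0 if either is zero."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def prefilter(query_vec, candidates, min_similarity=0.3, keep_top=20):
    """Drop weak candidates before the expensive reranker runs.

    candidates is a list of (text, embedding) pairs.
    """
    scored = [(cosine(query_vec, vec), text) for text, vec in candidates]
    kept = [pair for pair in scored if pair[0] >= min_similarity]
    kept.sort(key=lambda pair: pair[0], reverse=True)
    return [text for _, text in kept[:keep_top]]


query_vec = [1.0, 0.0]
candidates = [
    ("relevant snippet", [0.9, 0.1]),
    ("cookie banner", [0.0, 1.0]),
    ("partly relevant", [0.5, 0.5]),
]
survivors = prefilter(query_vec, candidates)
```

Only `survivors` would be passed on to the cross-encoder. A BM25 pass over the snippets can be chained in the same way.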

5. The bypass that stayed too long

Early on, we let web results skip the rerank threshold entirely.

The reasoning seemed fine: search engines already rank results, and snippets often look worse to a cross-encoder than full documents.

In practice, the bypass was compensating for something else: our normalization was wrong, so scores came out too low for the threshold to be meaningful.

Once we fixed normalization, the bypass became a problem:

low-quality snippets passed straight through
they crowded out better local results
the model started favoring short, clean web content over more relevant local data

We removed the bypass and introduced a separate threshold for web results instead.

The important bit wasn’t the solution—it was realizing that the bypass had outlived its purpose.

A workaround introduced for a broken system will still be there after the system is fixed, unless you remove it deliberately.
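
Replacing the bypass with per-source thresholds can be sketched as follows. The threshold values and the `Scored` type are assumptions for illustration. The only constraint carried over from the pipeline is that every candidate, web or local, has to clear some threshold.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Scored:
    text: str
    source: str   # "local" or "web"
    score: float  # normalized reranker score


# Placeholder values, not the project's actual thresholds.
THRESHOLDS = {"local": 0.5, "web": 0.3}


def apply_thresholds(candidates, thresholds=THRESHOLDS):
    """Keep candidates that clear the threshold for their own source.

    There is no bypass: an unknown source raises KeyError instead of passing.
    """
    return [c for c in candidates if c.score >= thresholds[c.source]]


pool = [
    Scored("local chunk", "local", 0.62),
    Scored("weak local chunk", "local", 0.41),
    Scored("web snippet", "web", 0.35),
    Scored("noisy web snippet", "web", 0.10),
]
kept = apply_thresholds(pool)
```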

6. Fixing normalization

This was the biggest issue.

Cross-encoders output raw logits, and we initially applied a sigmoid to map them into [0, 1]. It looked principled.

It didn’t work.

Technical content—especially non-Q&A-style text—ended up with low scores everywhere. After thresholding, nothing survived. The system quietly fell back to the model’s internal knowledge.

What worked instead was splitting the approach:

local documents: min–max normalization, relative within the result pool
web results: sigmoid, treated as absolute

This isn’t mathematically perfect, but it matched the behavior we actually needed in practice:

local results compete with each other
web results are judged on their own merit
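
The split can be sketched in a few lines. The function names are hypothetical, and the handling of a pool where all local scores are equal is an assumption (they are all treated as top-ranked here).

```python
import math


def sigmoid(x):
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def min_max(scores):
    """Scale scores to [0, 1] relative to the pool they came from."""
    if not scores:
        return []
    lo, hi = min(scores), max(scores)
    if hi == lo:
        # Assumption: a single candidate or an all-tied pool counts as top-ranked.
        return [1.0] * len(scores)
    return [(s - lo) / (hi - lo) for s in scores]


def normalize(local_logits, web_logits):
    """Local logits compete within their pool; web logits are scored absolutely."""
    return min_max(local_logits), [sigmoid(s) for s in web_logits]


local_scores, web_scores = normalize([-4.0, -2.0, 0.0], [0.0])
```

With these inputs the best local document maps to 1.0 no matter how low its raw logit is, while the web result's score depends only on its own logit.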

7. What the cross-encoder actually sees

At one point we enabled full-page fetching for web results, thinking it would improve relevance.

It didn’t.

The cross-encoder only sees roughly the first 512 tokens of its input; anything past that is truncated. For a web page, that’s usually:

headers
navigation
metadata

The useful paragraph is often much further down.

So the model was effectively ranking page chrome.

The fix was simple:

always keep the original search snippet
use the snippet for reranking
optionally send the full page to the LLM

Different stages need different inputs. That became obvious only after trying to unify them.
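
One way to keep the two inputs separate is to carry both on the result object and let each stage ask for the text it needs. The `WebResult` type and its method names are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class WebResult:
    url: str
    snippet: str                     # original search snippet, always kept
    full_text: Optional[str] = None  # fetched page body, if fetching is enabled

    def rerank_text(self) -> str:
        """The reranker always scores the snippet, never the page chrome."""
        return self.snippet

    def llm_text(self) -> str:
        """The LLM gets the full page when available, else the snippet."""
        return self.full_text or self.snippet


result = WebResult(
    url="https://example.com/article",
    snippet="The relevant paragraph from the search result.",
    full_text="Home | About | Login ... The relevant paragraph and more.",
)
```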

8. Intent filtering, narrowly scoped

The intent classifier ended up being a small guardrail.

It doesn’t try to be perfect—it just catches things that slip past the other safeguards:

entity + action combinations
queries that are syntactically clean but clearly problematic

Two constraints kept it predictable:

the baseline is fixed in code
config can only add to it or tighten it

It’s not a replacement for other safety measures—just an additional layer.
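
As a stand-in for the classifier, which is not shown here, the entity + action idea can be illustrated with plain keyword rules. The rule contents and the `is_flagged` helper are hypothetical; a real classifier would be more robust than token matching.

```python
import re

# Hypothetical baseline rules, fixed in code: (entity terms, action terms).
BASELINE_RULES = (
    (frozenset({"password", "credentials"}), frozenset({"steal", "crack"})),
)


def is_flagged(query, extra_rules=()):
    """Flag a query when an entity term and an action term co-occur.

    extra_rules come from config and can only add rules, never remove them.
    """
    tokens = set(re.findall(r"[a-z0-9]+", query.lower()))
    for entities, actions in BASELINE_RULES + tuple(extra_rules):
        if tokens & entities and tokens & actions:
            return True
    return False


config_rules = [(frozenset({"router"}), frozenset({"hijack"}))]
```

Neither term alone triggers a rule: "reset my password" and "crack an egg" both pass, which is what keeps the check narrow.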

9. Things we didn’t optimize (yet)

There are a few obvious extensions we didn’t pursue immediately:

multiple search providers
translating queries before filtering
multi-query expansion for web search

All of those have tradeoffs, especially once network side effects are involved. We chose to stabilize the core pipeline first.

The one-liner

Web search isn’t a feature you bolt on. It changes the behavior of the entire pipeline.

Once queries turn into network calls, scoring, filtering, and safeguards all become coupled in ways they weren’t before. Most of the complexity came from making that interaction predictable.

Curious how others handle this:

Do you normalize cross-encoder scores globally or per-query?
Has anyone had success calibrating logits instead of mixing normalization strategies?

If you want to dig into the implementation:

Repo: https://github.com/HarinezumIgel/RAG-LCC

Top comments (2)

Abdullah Shahin • Jun 3

The cross-encoder logit normalization gotcha is one of those things that quietly degrades a system for weeks before anyone notices. One pattern that helps: instead of normalizing scores within a batch (min-max or sigmoid), keep an absolute floor calibrated from a labeled dev set — within-batch normalization makes every retrieval look ~1.0 to a downstream gate, including the "no good match" case, which is exactly when you want the gate to fire. RRF sidesteps this for ordering but doesn't help when a consumer needs a confidence score to decide whether to answer at all. The bypass-logic-that-outlived-its-bug detail also rings true; in retrieval code those branches tend to ossify because removing them feels like risking a regression, even when the precondition they were guarding against is long gone. A small thing that's helped me catch that class of bug: log the path taken through the rerank/fuse pipeline as a structured tag on every query so you can grep for "still hitting the bypass" weeks later.

HarinezumIgel • Jun 3

That’s a really good point — especially the failure mode where within-batch normalization makes every candidate look high-confidence, even in the “no good match” case. I’ve run into that behavior before as well.
In my case I’m trying to avoid that by introducing an absolute signal before the per-chunk filtering step. The reranker scores the full candidate pool (local + web), and the normalization is anchored using the top logit:

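import math

lo = min(all_raw_scores)
hi = max(all_raw_scores)
# absolute confidence of the best candidate in the pool
sigmoid_hi = 1.0 / (1.0 + math.exp(-hi))

# relative position within the pool; guard against hi == lo
normalized = (raw - lo) / (hi - lo) if hi > lo else 1.0
scaled = normalized * sigmoid_hi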

So while there is still a relative component (min–max), the entire batch gets suppressed if the top candidate is weak. In practice, if the best logit drops below ~−0.6, sigmoid_hi pulls all scores down enough that the system can still fall into the “no relevant context” path.
Effectively the decision isn’t purely relative — it’s anchored on the absolute value of the top candidate, which acts as a coarse batch-level confidence signal before the per-chunk thresholds are applied (stricter for local than for web at the moment).
So this is closer to a soft version of what you describe — not a fully calibrated confidence model, but still enough to avoid the “everything looks ~1.0” failure mode.
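
To make the numbers concrete, here is the same anchoring wrapped in a function and applied to a weak and a strong batch. The function name, the tie handling, and the sample logits are assumptions for illustration.

```python
import math


def anchored_scores(raw_scores):
    """Min-max within the pool, then scale by the sigmoid of the top logit."""
    if not raw_scores:
        return []
    lo, hi = min(raw_scores), max(raw_scores)
    sigmoid_hi = 1.0 / (1.0 + math.exp(-hi))
    if hi == lo:
        # Assumption: tied or single candidates all take the anchor value.
        return [sigmoid_hi] * len(raw_scores)
    return [((raw - lo) / (hi - lo)) * sigmoid_hi for raw in raw_scores]


weak_batch = anchored_scores([-0.6, -3.0, -5.4])  # best logit is -0.6
strong_batch = anchored_scores([4.0, 0.0])        # best logit is 4.0
```

With a top logit of -0.6 the best candidate ends up at roughly 0.35 instead of 1.0, while a top logit of 4.0 leaves it near 0.98.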
And +1 on the bypass logic — that definitely resonates. Those branches tend to stick around longer than intended. I like your suggestion of logging the pipeline path; being able to grep for “still hitting the bypass” later is a really practical way to surface that kind of drift.
Appreciate the insights — very relevant observations.