DEV Community

Konstantin Vasserman

Posted on • Originally published at amgix.io

Hybrid Search Won't Help You If Your Keyword Search is Broken

The common wisdom is that vector similarity (semantic) search produces better results than keyword search alone, but that the best one-shot retrieval results (no re-ranking) come from hybrid search, which combines keyword and semantic search. Does this common understanding hold up under scrutiny? Is semantic search always better than keyword search? Is hybrid search always better than semantic?

How Did We End Up in This Rabbit Hole

Recently, we noticed that one of the popular search engines advertised itself as the "most relevant hybrid search engine" (or something along those lines). Or maybe we imagined this, because we can't find the ad anymore.

Whether the ad was real or we dreamed it up is not really important. What's important is that it sparked a question: "what would make a search engine more relevant at hybrid search compared to others?" After all, given the same ML model, the semantic search should produce about the same results on any engine (accounting for the differences in vector search params, etc.). So the only differentiators in the hybrid search results would be the quality of keyword search and the fusion math.

Given that most engines default to RRF (Reciprocal Rank Fusion) as their (often only) fusion algorithm, this led us to the next question: "how does the relevance of the keyword search affect the relevance of the hybrid search?"
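
For readers who have not seen RRF written out, here is a minimal sketch of the textbook form: each document's fused score is the sum of 1 / (k + rank) over every result list it appears in. The function name, the sample document ids, and the constant k = 60 (the value commonly quoted for RRF) are our assumptions for illustration; individual engines may use different constants or add weights.

```python
from collections import defaultdict


def rrf_fuse(rankings, k=60):
    """Fuse ranked lists of doc ids with textbook Reciprocal Rank Fusion.

    `rankings` is a list of ranked lists (best hit first).
    `k` is the smoothing constant; 60 is an assumed, commonly quoted default.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            # Only the rank matters; the raw engine score is ignored.
            scores[doc_id] += 1.0 / (k + rank)
    # Sort by fused score, breaking ties by doc id for determinism.
    return sorted(scores, key=lambda d: (-scores[d], d))


# Hypothetical result lists for a single query.
keyword_hits = ["d3", "d1", "d7"]
semantic_hits = ["d1", "d5", "d3"]
fused = rrf_fuse([keyword_hits, semantic_hits])
print(fused)
```

Because RRF looks only at ranks, a keyword list that ranks irrelevant documents highly contributes just as much to the fused score as a good one, which is why the quality of the keyword side matters.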

Luckily, search relevance can be measured. This post follows the rabbit hole.

Is Semantic Always Better than Keyword Search?

Before we get into the nitty-gritty details, let us answer the first question from our introduction: no, semantic search is not always better. It depends on your data. Semantic search is better on Natural Language (NL) datasets, but as we've seen in our early benchmarks of the PC Parts dataset, models can struggle with identifier-heavy data (part numbers, SKUs, special chars, etc.). There is simply not enough semantic meaning in those datasets for a model to produce meaningful inferences about relevance. Keyword search alone does significantly better with this sort of data.

Is Hybrid Search Always Better than Semantic?

Hypothesis

Whether hybrid search is better than semantic search alone depends on the quality/relevance of the keyword search.

How to Test the Hypothesis

To test our hypothesis, we need to compare the relevance of hybrid search results between systems with significantly different keyword search quality. Fortunately for us, such systems exist.

About nDCG@10

Throughout this document we will use nDCG@10 as the metric for search relevance. nDCG@k (normalized Discounted Cumulative Gain at k) is a metric that measures the quality of ranking results. It considers both the relevance of retrieved documents and their position in the result list. A perfect ranking would have an nDCG of 1.0. Higher values indicate better search effectiveness. nDCG@10 measures the quality of top 10 results.
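
To make the metric concrete, here is a minimal sketch of nDCG@k for a single query. It assumes the linear-gain form (relevance divided by log2 of the position plus one); some tools use an exponential gain instead, so this is an illustration of the idea, not necessarily the exact evaluator behind the numbers in this post. The function names and the sample judgments are ours.

```python
import math


def dcg(gains):
    """Discounted cumulative gain: position i (0-based) is discounted by log2(i + 2)."""
    return sum(g / math.log2(i + 2) for i, g in enumerate(gains))


def ndcg_at_k(ranked_ids, qrels, k=10):
    """nDCG@k for one query.

    ranked_ids: doc ids returned by the engine, best first.
    qrels: dict of doc id -> graded relevance (unjudged docs count as 0).
    """
    gains = [qrels.get(doc_id, 0) for doc_id in ranked_ids[:k]]
    # The ideal ranking puts the most relevant judged docs first.
    ideal_gains = sorted(qrels.values(), reverse=True)[:k]
    ideal_dcg = dcg(ideal_gains)
    return dcg(gains) / ideal_dcg if ideal_dcg > 0 else 0.0


# Hypothetical judgments: two relevant documents for the query.
qrels = {"a": 1, "b": 1}
perfect = ndcg_at_k(["a", "b", "x"], qrels)    # both relevant docs on top
imperfect = ndcg_at_k(["a", "x", "b"], qrels)  # one relevant doc pushed down
print(perfect, round(imperfect, 4))
```

A benchmark score is then the mean of this per-query value over all test queries.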

In another set of benchmarks we measured keyword-only search relevance of Typesense, Meilisearch, Elasticsearch, and Amgix, on a number of BEIR datasets. Here is the summary of the nDCG@10 results for the tested datasets:

| BEIR Dataset | Docs | BM25 Baseline | Typesense | Meilisearch | Elasticsearch | Amgix |
|---|---|---|---|---|---|---|
| SciFact | 5K | 0.6650 | 0.3386 | 0.3616 | 0.6953 | 0.6637 |
| Quora | 523K | 0.7890 | 0.2521 | 0.2547 | 0.8058 | 0.8018 |
| NQ | 2.6M | 0.3290 | 0.0364 | 0.0490 | 0.3116 | 0.3121 |

BM25 baseline numbers are from Pyserini BEIR Regressions

As you can see, there are some significant disparities in search relevance between two groups of engines: Typesense and Meilisearch on one end of the spectrum, Elasticsearch and Amgix on the other. While Elasticsearch and Amgix consistently score in the ballpark of BM25 baseline, Typesense and Meilisearch results are about half as relevant on the small SciFact dataset and get progressively worse on bigger datasets. By the time we get to the NQ dataset with 2.6M documents, Typesense and Meilisearch relevance numbers suggest that they return almost no relevant results.

Test Setup

To test our hypothesis that keyword search relevance is a significant variable in hybrid search quality, we will compare the hybrid search results between Meilisearch and Amgix on the SciFact dataset. SciFact is small (saving us indexing time), and it is the dataset on which Meilisearch keyword search scored highest.

Both systems will use the same ML model: BAAI/bge-small-en-v1.5. Documents will contain the same flat representation of the text: {{doc.title}}\n{{doc.text}}. Meilisearch uses all defaults for keyword and model inference. Amgix uses the Full Text (full_text) tokenizer with all defaults. We'll measure nDCG@10 at multiple settings of what Meilisearch calls the "semantic ratio" (the weight of semantic search results in the fusion logic): from 0 (pure keyword search) to 1 (pure semantic search). In Amgix we will also compare two fusion modes: RRF (default) and Linear fusion.
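
To illustrate what a semantic ratio does in a linear fusion, here is a generic sketch. This is not the Meilisearch or Amgix implementation, neither of which is described in this post. It assumes each result list is min-max normalized to [0, 1] and then blended as (1 - ratio) * keyword + ratio * semantic; the function names and sample scores are hypothetical.

```python
def min_max(scores):
    """Rescale a dict of doc id -> raw score into the range [0, 1]."""
    if not scores:
        return {}
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        # All scores equal: treat every hit as equally strong.
        return {doc_id: 1.0 for doc_id in scores}
    return {doc_id: (s - lo) / (hi - lo) for doc_id, s in scores.items()}


def linear_fuse(keyword_scores, semantic_scores, semantic_ratio):
    """Blend two score sets; semantic_ratio=0 is keyword only, 1 is model only.

    Assumption: min-max normalization before blending. Real engines may
    normalize differently or not at all.
    """
    if not 0.0 <= semantic_ratio <= 1.0:
        raise ValueError("semantic_ratio must be between 0 and 1")
    kw = min_max(keyword_scores)
    sem = min_max(semantic_scores)
    fused = {
        doc_id: (1.0 - semantic_ratio) * kw.get(doc_id, 0.0)
        + semantic_ratio * sem.get(doc_id, 0.0)
        for doc_id in kw.keys() | sem.keys()
    }
    # Docs found by only one side still appear, with a zero from the other side.
    return sorted(fused, key=lambda d: (-fused[d], d))


# Hypothetical raw scores: BM25-like on the left, cosine-like on the right.
keyword_scores = {"d1": 12.0, "d2": 7.0, "d3": 2.0}
semantic_scores = {"d2": 0.91, "d3": 0.88, "d4": 0.60}

for ratio in (0.0, 0.5, 1.0):
    print(ratio, linear_fuse(keyword_scores, semantic_scores, ratio))
```

Sweeping the ratio from 0 to 1, as the benchmark does, shows how the top of the list shifts from the keyword winner to the semantic winner.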

Results

| Semantic Ratio | Search Type | BM25 Baseline | Meilisearch | Amgix (RRF) | Amgix (Linear) |
|---|---|---|---|---|---|
| 0.00 | Keywords Only | 0.6790 | 0.3767 | 0.6918 | 0.6918 |
| 0.25 | Hybrid | | 0.5880 | 0.7080 | 0.7258 |
| 0.50 | Hybrid | | 0.7171 | 0.7365 | 0.7437 |
| 0.75 | Hybrid | | 0.7190 | 0.7342 | 0.7340 |
| 1.00 | Model Only | | 0.7190 | 0.7200 | 0.7200 |

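The best score for each engine/configuration is 0.7190 for Meilisearch (reached at semantic ratios 0.75 and 1.00), 0.7365 for Amgix (RRF), and 0.7437 for Amgix (Linear).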

Discussion

Things to note in the results:

  • The keyword-only results are familiar to us from the previous benchmarks. Amgix scores a little above the baseline, while Meilisearch produces about half the relevance. Nothing new here.

  • Model Only (semantic) search numbers are also predictable: the difference between 0.7190 and 0.7200 is insignificant and probably due to the vector search tuning parameters in the systems. The same model produces essentially the same relevance numbers, which is what we would expect.

  • Hybrid space is where things get interesting:

    • In all configurations tested, hybrid search improved relevance compared to the pure keyword search results. The jump from 0.3767 to 0.5880, and then to 0.7171, for Meilisearch is a huge improvement over its keyword-only numbers. Amgix numbers improved too, but not as dramatically, since the keyword baseline was already fairly high.
    • But let's now look at how the hybrid numbers compare to the pure semantic (model only) numbers. Meilisearch hybrid relevance never exceeded the relevance of the model alone. In fact, looking at these numbers, one can make an argument that introducing their keyword search into the mix (at least with this one dataset tested) doesn't make search results better.
    • The Amgix story is different: in 5 out of 6 hybrid measurements, the hybrid configuration outperformed pure semantic search. The only exception is the RRF config with a semantic ratio of 0.25, where the system scored 0.7080 (probably because of the outsized weight given to the keyword signal).
    • Amgix hybrid scores of 0.7365 (RRF) and 0.7437 (Linear) are meaningful improvements (about +2.3% and +3.3%, relative to the model-only score of 0.7200) over pure semantic search relevance.
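
The size of the hybrid gain over pure semantic search can be checked directly from the results table. The sketch below uses the scores reported above and expresses each gain relative to the model-only score; dividing by the hybrid score instead gives slightly smaller percentages. The helper name is ours.

```python
# (best hybrid nDCG@10, model-only nDCG@10), copied from the results table.
results = {
    "Meilisearch": (0.7190, 0.7190),
    "Amgix (RRF)": (0.7365, 0.7200),
    "Amgix (Linear)": (0.7437, 0.7200),
}


def relative_gain(score, baseline):
    """Percentage change of `score` relative to `baseline`."""
    return (score - baseline) / baseline * 100.0


gains = {
    name: round(relative_gain(hybrid, model_only), 1)
    for name, (hybrid, model_only) in results.items()
}

for name, gain in gains.items():
    print(f"{name}: {gain:+.1f}% vs model only")
```

In absolute terms the Amgix gains are 0.0165 and 0.0237 nDCG@10 points, while the best Meilisearch hybrid score only matches its model-only score.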

Conclusion

In this test, with this dataset, with these configuration settings, it seems like our hypothesis is correct. The quality of your keyword search may make the difference between the hybrid search relevance being an improvement over pure semantic search or not.

The devil is always in the details, but if your keyword search is broken, including it in a hybrid search may not give you the improvements you are looking for.
