Prithvi S

Posted on Jun 9

BM25 Scoring in Elasticsearch: Why Your Search Results Rank the Way They Do

#elasticsearch #search #database #analytics

Ever typed a query into your app and wondered why result #3 outranked result #2? Or why a document with 50 mentions of a keyword sits below one with only 5? The answer lives in the scoring algorithm - and in Elasticsearch, that algorithm is BM25.

Most engineers know Elasticsearch uses BM25. Fewer understand how it actually works. This post is for the curious ones. We will break down the probabilistic relevance framework behind every Elasticsearch query, explain why it beats the older TF-IDF approach, and show you how to tune it for your specific data.

Why This Matters

Search ranking is not magic. It is math - and the math matters because:

Bad scoring hurts user experience: When the wrong document ranks first, users click more, convert less, and trust your search less.
Tuning is free performance: No hardware changes, no re-architecture. Just understanding the parameters already in your index.
Debugging is impossible without understanding: The _explain API spews JSON that only makes sense if you know what BM25 is doing.

I have spent years building search systems at Cloudera, from log analytics to product search. BM25 is the default in every Elasticsearch cluster I have touched - and it is worth understanding.

From TF-IDF to BM25: A Brief History

If you learned search in a computer science course, you probably started with TF-IDF. Term Frequency times Inverse Document Frequency. Simple, elegant, and limited.

Textbook TF-IDF has two fundamental problems in real-world search:

Linear term frequency: If a document has 100 occurrences of a term, TF-IDF says it is 10x more relevant than one with 10 occurrences. In practice, the 100th occurrence adds almost no new relevance signal. Diminishing returns exist, but TF-IDF ignores them.
No document length normalization: Longer documents naturally contain more terms. Without correction, a 10,000-word Wikipedia article will dominate over a 200-word focused answer - even when the short document is exactly what the user wants.

BM25 (Best Match 25, from the Okapi system in the 1990s) solves both. It is a probabilistic relevance model, not a heuristic. It asks: "What is the probability that this document is relevant, given this query?" and builds the math from there.

Elasticsearch switched from TF-IDF to BM25 as the default similarity algorithm in version 5.0. If you are on any modern version, you are using BM25 whether you know it or not.

The BM25 Formula, Decomposed

The full formula looks intimidating. Let us strip it to what actually matters:

score = IDF * (term_frequency_saturated) * boost

Three components. Let us walk through each.

1. IDF: Inverse Document Frequency

IDF = ln(1 + (N - n + 0.5) / (n + 0.5))

Where:

N = total number of documents in the index
n = number of documents containing the term
ln = natural logarithm

What it means: Rare terms are more valuable. If "kubernetes" appears in 3 documents out of 10,000, those 3 documents are probably highly relevant to a query containing "kubernetes". If "the" appears in 9,999 documents, it contributes almost nothing to relevance scoring.

The + 0.5 in the formula prevents division by zero and smooths the curve. The ln(1 + ...) ensures the score stays positive and grows sub-linearly.
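To make the formula concrete, here is a minimal Python sketch that plugs in the two examples above. The function name bm25_idf is an illustrative choice, not an Elasticsearch API, and the document counts are the hypothetical ones from the text.

```python
import math


def bm25_idf(total_docs: int, docs_with_term: int) -> float:
    """IDF as written above: ln(1 + (N - n + 0.5) / (n + 0.5))."""
    return math.log(1 + (total_docs - docs_with_term + 0.5) / (docs_with_term + 0.5))


# "kubernetes": appears in 3 of 10,000 documents (rare term).
rare = bm25_idf(10_000, 3)
# "the": appears in 9,999 of 10,000 documents (near-universal term).
common = bm25_idf(10_000, 9_999)

print(f"rare term idf:   {rare:.3f}")
print(f"common term idf: {common:.5f}")
```

The rare term lands close to 8 while the near-universal term lands close to zero, which is exactly the gap IDF is meant to create.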

2. Term Frequency (Saturated)

This is where BM25 diverges from TF-IDF. The formula is:

TF_component = (f * (k1 + 1)) / (f + k1 * (1 - b + b * (dl / avgdl)))

Where:

f = raw term frequency (how many times the term appears in this document)
k1 = saturation parameter (default 1.2)
b = length normalization parameter (default 0.75)
dl = document length (in terms)
avgdl = average document length across the index

The key insight: The numerator f * (k1 + 1) grows linearly with frequency, but the denominator f + k1 * (...) grows too. The result is a curve that saturates:

Term Frequency	Linear TF (TF-IDF)	BM25 TF Component (k1=1.2, dl=avgdl)
1	1.0	1.00
5	5.0	1.77
10	10.0	1.96
50	50.0	2.15
100	100.0	2.17

At 100 occurrences, linear TF says 100x the relevance of a single occurrence. BM25 says roughly 2.2x, and the component can never exceed k1 + 1 = 2.2 with the default k1. The 100th repetition is barely adding signal. This mirrors human judgment: a document that mentions "docker" 100 times is not 10 times more relevant than one that mentions it 10 times.
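The saturation is easy to check by hand. The sketch below evaluates the TF component for a document of exactly average length (dl = avgdl, an assumption that makes the length factor 1) with the default k1 and b. The helper name bm25_tf and the length of 100 terms are hypothetical.

```python
def bm25_tf(freq: float, k1: float = 1.2, b: float = 0.75,
            doc_len: float = 100.0, avg_doc_len: float = 100.0) -> float:
    """TF component: (f * (k1 + 1)) / (f + k1 * (1 - b + b * dl / avgdl))."""
    length_factor = 1 - b + b * (doc_len / avg_doc_len)
    return (freq * (k1 + 1)) / (freq + k1 * length_factor)


# Linear TF grows with f; the BM25 component flattens out.
for f in (1, 5, 10, 50, 100):
    print(f"f={f:>3}  linear={f:>5.1f}  bm25={bm25_tf(f):.2f}")
```

However large f gets, the value climbs toward k1 + 1 and never passes it, because the denominator grows with f as well.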

3. The k1 Parameter: Controlling Saturation

k1 (default 1.2) controls how quickly the saturation curve flattens:

Low k1 (0.1-0.5): Rapid saturation. A document with 5 occurrences scores almost as high as one with 50. Good for fields where repetition is not informative (e.g., boilerplate text, categories).
High k1 (1.5-2.0): Slow saturation. Later occurrences still add noticeable score. Good for content-heavy fields where depth matters (e.g., blog posts, documentation).
k1 = 0: Pure binary relevance. One occurrence counts; everything after is ignored.
k1 = infinity: Approaches TF-IDF behavior (linear).

In practice, the default 1.2 works well for general text. But knowing when to tune it is what separates working search from great search.
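One way to get a feel for k1 is to ask how much extra score a document with 50 occurrences earns over one with 5. This is a small sketch under the assumption that both documents have average length, so the length factor drops out. The function names are illustrative only.

```python
def saturated_tf(freq: float, k1: float) -> float:
    """TF component with dl == avgdl, so the length factor is 1."""
    return (freq * (k1 + 1)) / (freq + k1)


def gain_from_5_to_50(k1: float) -> float:
    """Ratio of the TF component at 50 occurrences to that at 5."""
    return saturated_tf(50, k1) / saturated_tf(5, k1)


for k1 in (0.0, 0.3, 1.2, 2.0):
    print(f"k1={k1}: 50 occurrences score {gain_from_5_to_50(k1):.2f}x of 5 occurrences")
```

At k1 = 0 the ratio is exactly 1 (binary matching), and it widens as k1 grows, which is the "slow saturation" behavior described above.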

4. The b Parameter: Document Length Normalization

b (default 0.75) controls how aggressively BM25 penalizes long documents:

length_factor = (1 - b + b * (dl / avgdl))

b = 0: No length normalization. A 10,000-word document and a 100-word document compete on equal footing. Good for short, structured fields like product SKUs or log tags.
b = 1: Full normalization. Long documents are heavily penalized. Good for mixed-length content where brevity signals quality (e.g., Q&A, support tickets).
b = 0.75 (default): A balanced compromise. The 10,000-word doc gets a moderate penalty, but not enough to bury it if it is genuinely relevant.

Why this matters: Without length normalization, your Wikipedia clone will always outrank your focused blog post. The user searching "kubernetes ingress tutorial" wants the 500-word guide, not the 5,000-word encyclopedia entry.
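Here is a rough sketch of that trade-off. It compares a 500-word guide against a 5,000-word entry for a single query term. The term counts (3 and 10) and the average length of 1,000 terms are assumptions picked for illustration, not measurements.

```python
def tf_component(freq: float, doc_len: float, avg_doc_len: float,
                 k1: float = 1.2, b: float = 0.75) -> float:
    """BM25 TF component, including the length factor (1 - b + b * dl / avgdl)."""
    length_factor = 1 - b + b * (doc_len / avg_doc_len)
    return (freq * (k1 + 1)) / (freq + k1 * length_factor)


AVG_DOC_LEN = 1000  # assumed index-wide average length, in terms

for b in (0.0, 0.75, 1.0):
    guide = tf_component(freq=3, doc_len=500, avg_doc_len=AVG_DOC_LEN, b=b)
    entry = tf_component(freq=10, doc_len=5000, avg_doc_len=AVG_DOC_LEN, b=b)
    winner = "guide" if guide > entry else "encyclopedia entry"
    print(f"b={b}: guide={guide:.2f}  entry={entry:.2f}  -> {winner}")
```

With b = 0 the longer entry wins on raw repetition. With the default b = 0.75 the shorter guide comes out ahead, even though it mentions the term less often.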

5. Boost: Query-Time Control

Elasticsearch allows boosting at query time:

{
  "query": {
    "multi_match": {
      "query": "kubernetes deployment",
      "fields": ["title^3", "content^1", "tags^2"]
    }
  }
}

The ^3 multiplies the title field's BM25 score by 3. This is not changing the underlying similarity algorithm - it is applying a coefficient after the fact. But it is the most common way engineers influence ranking without reindexing.
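If you build queries in application code, the field^boost strings are easy to generate from a weight map. The helper below is a hypothetical convenience function, not part of any Elasticsearch client. It only assembles the same request body shown above.

```python
import json


def multi_match_query(text: str, field_boosts: dict[str, float]) -> dict:
    """Build a multi_match body, encoding each boost as "field^boost"."""
    fields = [f"{name}^{boost:g}" for name, boost in field_boosts.items()]
    return {"query": {"multi_match": {"query": text, "fields": fields}}}


body = multi_match_query(
    "kubernetes deployment",
    {"title": 3, "content": 1, "tags": 2},
)
print(json.dumps(body, indent=2))
```

Keeping the weights in one map makes it simple to experiment with different boosts without touching the index.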

BM25 in Practice: Three Real-World Scenarios

Scenario 1: E-Commerce Product Search

You have products with titles ("Nike Air Max 90"), descriptions (500 words), and specifications (structured key-value pairs). The challenge: users type short queries like "running shoes" and expect the most relevant product first.

Tuning approach:

Title field: k1=1.2, b=0.3 (short, precise, no heavy length penalty needed)
Description field: k1=1.2, b=0.75 (standard)
Specifications field: k1=0.3, b=0.0 (structured, repetition is noise)

Query-time boost: title^3, description^1, specs^2 (specs matter because "size 10" is a strong signal, but title wins for general queries).
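As a sketch, the per-field tuning above could be turned into an index creation body like this. The similarity names (title_bm25 and so on) are made up for the example, and mapping every field as text is an assumption. Real specification data may be better served by keyword or structured fields.

```python
import json

# (k1, b) per field, taken from the tuning approach above.
FIELD_PARAMS = {
    "title": (1.2, 0.3),
    "description": (1.2, 0.75),
    "specifications": (0.3, 0.0),
}


def build_index_body(field_params: dict[str, tuple[float, float]]) -> dict:
    """Define one named BM25 similarity per field and attach it in the mapping."""
    similarity = {
        f"{field}_bm25": {"type": "BM25", "k1": k1, "b": b}
        for field, (k1, b) in field_params.items()
    }
    properties = {
        field: {"type": "text", "similarity": f"{field}_bm25"}
        for field in field_params
    }
    return {
        "settings": {"index": {"similarity": similarity}},
        "mappings": {"properties": properties},
    }


index_body = build_index_body(FIELD_PARAMS)
print(json.dumps(index_body, indent=2))
```

The printed JSON has the same shape as a PUT /my-index request body, so it can be pasted into a console to try out.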

Scenario 2: Log Analytics (Observability)

You are searching application logs. Documents are semi-structured JSON: message, service_name, level, timestamp. The query: level:ERROR AND message:"connection timeout".

Tuning approach:

message field: k1=1.5, b=0.5 (repetition of "timeout" across multiple log lines is meaningful - saturation should be slower)
service_name: k1=0.1, b=0.0 (binary match - either the service is relevant or not)
level: k1=0.0, b=0.0 (exact match, no scoring needed - use a filter)

Key insight: In log search, exact matches on structured fields often matter more than free-text relevance. Use filter contexts for level and service_name to skip scoring entirely and leverage caching.
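A minimal sketch of that query shape, assuming level and service_name are mapped as keyword fields and message as text. The builder function is hypothetical. Only the message phrase is scored, and the structured fields go into filter context.

```python
import json
from typing import Optional


def log_search_query(phrase: str, level: str,
                     service_name: Optional[str] = None) -> dict:
    """Score only the message phrase; structured fields are unscored filters."""
    filters = [{"term": {"level": level}}]
    if service_name is not None:
        filters.append({"term": {"service_name": service_name}})
    return {
        "query": {
            "bool": {
                "must": [{"match_phrase": {"message": phrase}}],
                "filter": filters,
            }
        }
    }


log_query = log_search_query("connection timeout", level="ERROR")
print(json.dumps(log_query, indent=2))
```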

Scenario 3: Documentation & Knowledge Base

Technical documentation with long-form articles. Users search "how to configure TLS in Elasticsearch" and expect the setup guide, not the security architecture overview.

Tuning approach:

title field: k1=0.8, b=0.3 (titles are short and precise - saturation should be fast)
content field: k1=1.5, b=0.9 (long articles, heavy length normalization - but slow saturation because depth matters)
headings field: k1=0.6, b=0.2 (section headings are like mini-titles)

Query-time: headings^4, title^3, content^1 (a matching heading is usually the best signal).

Customizing BM25 in Your Index

Here is how to set custom BM25 parameters at index creation:

PUT /my-index
{
  "settings": {
    "index": {
      "similarity": {
        "my_custom_bm25": {
          "type": "BM25",
          "k1": 1.5,
          "b": 0.6
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "title": {
        "type": "text",
        "similarity": "my_custom_bm25"
      }
    }
  }
}

And here is how to change the default similarity for all text fields:

PUT /my-index
{
  "settings": {
    "index": {
      "similarity": {
        "default": {
          "type": "BM25",
          "k1": 1.2,
          "b": 0.75
        }
      }
    }
  }
}

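Important: Plan similarity during index design, not after you have 10 million documents. A mapping update cannot switch an existing field to a different similarity, so that change means reindexing. The parameters of a similarity already defined in the index settings (including the default) can still be changed, but only by closing the index, updating the settings, and reopening it.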

Debugging Scores with the `_explain` API

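The most powerful tool for understanding BM25 is score explanation. The _explain API covers a single document; setting "explain": true on a search request, as here, explains every hit: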

GET /my-index/_search
{
  "explain": true,
  "query": {
    "match": {
      "title": "kubernetes"
    }
  }
}

The response breaks down every component of the score for every matched document. Abridged, and with description strings that vary between Elasticsearch versions, the breakdown looks like this:

"details": [
  {
    "description": "idf, computed as log(1 + (N - n + 0.5) / (n + 0.5))",
    "value": 2.734
  },
  {
    "description": "tfNorm, computed as (f * (k1 + 1)) / (f + k1 * (1 - b + b * dl / avgdl))",
    "value": 0.912
  }
]

When a stakeholder asks "why did this document rank first?", _explain is your answer. It turns scoring from a black box into an auditable decision tree.
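Explanation trees get deep quickly, so it helps to flatten them into indented lines. The sketch below assumes each node has value, description, and an optional details list, which is the shape shown above. The sample tree is hand-written for illustration and is not real cluster output.

```python
def flatten_explanation(node: dict, depth: int = 0) -> list[str]:
    """Walk an explanation tree depth-first, indenting children by two spaces."""
    lines = [f"{'  ' * depth}{node['value']:.3f}  {node['description']}"]
    for child in node.get("details", []):
        lines.extend(flatten_explanation(child, depth + 1))
    return lines


# Hypothetical explanation node; in a search response this would come from
# each hit's "_explanation" field.
sample = {
    "value": 2.493,
    "description": "score, product of idf and tfNorm",
    "details": [
        {"value": 2.734, "description": "idf", "details": []},
        {"value": 0.912, "description": "tfNorm", "details": []},
    ],
}

print("\n".join(flatten_explanation(sample)))
```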

Five Common BM25 Pitfalls (and How to Avoid Them)

1. Not Tuning for Content Type

The default k1=1.2, b=0.75 works for generic text. It fails for:

Short structured fields (titles, SKUs, tags) - use lower b
Very long documents (legal contracts, books) - use higher b
Repetitive content (logs, social media feeds) - use lower k1

Fix: Analyze your document length distribution. If 90% of docs are under 100 words, length differences among them are mostly noise, and b=0.75 weights that noise too heavily.
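A quick way to start that analysis is to see what length factor typical and extreme documents get. This sketch takes a list of document lengths (in terms) that you would export from your own data. The sample list is invented: nine short documents and one long outlier.

```python
import statistics


def length_factor_summary(doc_lengths: list[int], b: float = 0.75) -> dict:
    """Length factor (1 - b + b * dl / avgdl) for the median and longest document."""
    avgdl = statistics.mean(doc_lengths)

    def factor(doc_len: float) -> float:
        return 1 - b + b * (doc_len / avgdl)

    return {
        "avgdl": avgdl,
        "median_factor": factor(statistics.median(doc_lengths)),
        "max_factor": factor(max(doc_lengths)),
    }


lengths = [80] * 9 + [2000]  # hypothetical corpus
summary = length_factor_summary(lengths)
print(summary)
print(length_factor_summary(lengths, b=0.3))
```

A factor below 1 raises the TF component and a factor above 1 lowers it. When a few outliers drag avgdl far from the typical document, lowering b pulls both factors back toward 1.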

2. Over-Boosting Into Noise

{
  "fields": ["title^10", "content^1"]
}

A 10x boost on title means a single title match can outweigh any amount of repetition in the content, because the content score saturates while the boost keeps multiplying. If the title is "The Ultimate Guide to Everything", that is a bad outcome. The user wants content, not clickbait titles.

Fix: Start with ^1 to ^3 boosts. Use _explain to verify the math before going higher.
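The arithmetic behind that advice can be sketched in a few lines. This assumes a single query term, the same IDF in both fields (set to 1.0 for simplicity), and documents of average length. None of these will hold exactly on real data, so treat the numbers as illustrative.

```python
def saturated_tf(freq: float, k1: float = 1.2) -> float:
    """TF component with dl == avgdl (length factor of 1)."""
    return (freq * (k1 + 1)) / (freq + k1)


def field_score(freq: float, boost: float, idf: float = 1.0) -> float:
    """Per-field score as boost * idf * tf."""
    return boost * idf * saturated_tf(freq)


title_score = field_score(freq=1, boost=10)       # one title match, boosted 10x
content_score = field_score(freq=1000, boost=1)   # heavy repetition in content

print(f"title^10, 1 occurrence:       {title_score:.2f}")
print(f"content^1, 1000 occurrences:  {content_score:.2f}")
print(f"title^3, 1 occurrence:        {field_score(freq=1, boost=3):.2f}")
```

Under these assumptions the content field tops out near k1 + 1 = 2.2, so a ^10 title match cannot be overtaken by repetition alone, while a ^3 boost keeps the two fields in the same range.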

3. Ignoring the Filter Context

Using must + match for exact-match fields like status:active or category:electronics wastes compute. These belong in a filter clause, which skips scoring entirely and is cacheable.

{
  "query": {
    "bool": {
      "must": [
        { "match": { "description": "wireless headphones" } }
      ],
      "filter": [
        { "term": { "category": "electronics" } },
        { "term": { "status": "active" } }
      ]
    }
  }
}

The filter terms do not affect BM25 score. They just prune the candidate set. This is faster and cleaner.

4. Assuming BM25 Is Enough for Semantic Search

BM25 is lexical. It matches exact terms or their stemmed variants. It does not understand that "k8s" and "kubernetes" are the same thing, or that "docker container" and "containerization" are related concepts.

For semantic search, you need dense vectors (embeddings) alongside BM25. Elasticsearch 8.15+ supports this natively with hybrid search and Reciprocal Rank Fusion (RRF).

5. Benchmarking With Wrong Settings

Do not benchmark search quality on an index with refresh_interval: -1 (disabled refresh) and then complain that results are stale. BM25 scores depend on the index state - document counts, average length, term frequencies. All of these change as documents are added or removed.

Fix: Benchmark on a production-like index. If you are testing on a 1,000-document dev index, your avgdl and IDF values will not match the 10-million-document production reality.

The Modern Context: BM25 + Vector Search

In 2025 and 2026, the most interesting search systems are not purely lexical or purely semantic. They are hybrid. BM25 handles the exact-term matching that vectors miss. Dense vectors (from models like E5, OpenAI embeddings, or ELSER) handle the conceptual similarity that BM25 cannot reach.

Elasticsearch supports this with the retrievers API and RRF (Reciprocal Rank Fusion):

GET /my-index/_search
{
  "retriever": {
    "rrf": {
      "retrievers": [
        {
          "standard": {
            "query": {
              "match": {
                "content": "kubernetes deployment strategies"
              }
            }
          }
        },
        {
          "knn": {
            "field": "content_vector",
            "query_vector": [...],
            "k": 10,
            "num_candidates": 100
          }
        }
      ]
    }
  }
}

BM25 finds documents containing the exact terms. kNN finds documents that are conceptually similar. RRF merges the rankings without requiring score normalization. The result is search that is both precise and intelligent.
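RRF itself is simple enough to sketch in plain Python. Each document earns 1 / (rank_constant + rank) from every ranking it appears in, and the sums decide the final order. The constant of 60 is the commonly used default. The document IDs and rankings below are invented, and this sketch ignores details such as window sizes that a real implementation handles.

```python
def reciprocal_rank_fusion(rankings: list[list[str]],
                           rank_constant: int = 60) -> list[str]:
    """Merge ranked lists of doc IDs by summing 1 / (rank_constant + rank)."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (rank_constant + rank)
    # Highest fused score first.
    return sorted(scores, key=lambda doc_id: scores[doc_id], reverse=True)


bm25_ranking = ["a", "b", "c"]  # hypothetical lexical results
knn_ranking = ["c", "a", "d"]   # hypothetical vector results

fused = reciprocal_rank_fusion([bm25_ranking, knn_ranking])
print(fused)
```

Only ranks are used, never raw scores, which is why BM25 scores and vector similarities can be combined without normalizing either one.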

This is the architecture behind modern RAG (Retrieval-Augmented Generation) pipelines. The LLM needs the most relevant context, and BM25 + vectors together deliver better retrieval than either alone.

When to Use What: A Decision Framework

Use Case	Primary Scoring	Secondary/Tuning
E-commerce product search	BM25 with field boosts	k1=1.2, b=0.3 for titles
Log analytics	Filter + BM25 on message	k1=1.5 for repeated error patterns
Documentation/knowledge base	BM25 with heading boosts	k1=1.5, b=0.9 for long content
Semantic search (Q&A)	Dense vectors + RRF	BM25 for exact matches, vectors for meaning
Structured data search (exact match)	Filter context	Skip BM25 entirely
Legal/contract search	BM25 with phrase queries	k1=0.8, b=0.8 for mixed-length legal text

Conclusion

BM25 is not just a default setting. It is a probabilistic relevance model that makes specific, tunable trade-offs about term frequency saturation and document length normalization. Understanding those trade-offs lets you build search that ranks the right documents for your specific data and users.

The key takeaways:

BM25 saturates term frequency: The 50th keyword mention adds less than the 5th. This is a feature, not a bug.
Length normalization matters: Longer documents are penalized unless you tune b for your content distribution.
The defaults are just a starting point: k1=1.2, b=0.75 works for generic English text. Your data is probably not generic.
Use _explain for debugging: When rankings look wrong, the math is there. You just need to read it.
Hybrid is the future: BM25 for lexical precision, vector search for semantic understanding. Together they power modern RAG systems.

The next time someone asks why a document ranked where it did, you will have an answer grounded in probability theory, not guesswork.

I'm Prithvi S, Staff Software Engineer at Cloudera and open-source enthusiast. I build search systems, data pipelines, and the occasional distributed system. Follow my work on GitHub: https://github.com/iprithv

References

Elasticsearch Similarity Module: https://www.elastic.co/guide/en/elasticsearch/reference/current/index-modules-similarity.html
BM25 Original Paper: Robertson, S.E., et al. "Okapi at TREC-3." NIST Special Publication, 1995.
Elasticsearch 8.15 Retrievers API: https://www.elastic.co/guide/en/elasticsearch/reference/current/retriever.html
Hybrid Search with RRF: https://www.elastic.co/guide/en/elasticsearch/reference/current/rrf.html
