BM25 Scoring in Elasticsearch: Why Your Search Results Rank the Way They Do
Ever typed a query into your app and wondered why result #3 outranked result #2? Or why a document with 50 mentions of a keyword sits below one with only 5? The answer lives in the scoring algorithm - and in Elasticsearch, that algorithm is BM25.
Most engineers know Elasticsearch uses BM25. Fewer understand how it actually works. This post is for the curious ones. We will break down the probabilistic relevance framework behind every Elasticsearch query, explain why it beats the older TF-IDF approach, and show you how to tune it for your specific data.
Why This Matters
Search ranking is not magic. It is math - and the math matters because:
- Bad scoring hurts user experience: When the wrong document ranks first, users click more, convert less, and trust your search less.
- Tuning is free performance: No hardware changes, no re-architecture. Just understanding the parameters already in your index.
- Debugging is impossible without understanding: The _explain API spews JSON that only makes sense if you know what BM25 is doing.
I have spent years building search systems at Cloudera, from log analytics to product search. BM25 is the default in every Elasticsearch cluster I have touched - and it is worth understanding.
From TF-IDF to BM25: A Brief History
If you learned search in a computer science course, you probably started with TF-IDF. Term Frequency times Inverse Document Frequency. Simple, elegant, and limited.
TF-IDF has two fundamental problems in real-world search:
Linear term frequency: If a document has 100 occurrences of a term, TF-IDF says it is 10x more relevant than one with 10 occurrences. In practice, the 100th occurrence adds almost no new relevance signal. Diminishing returns exist, but TF-IDF ignores them.
No document length normalization: Longer documents naturally contain more terms. Without correction, a 10,000-word Wikipedia article will dominate over a 200-word focused answer - even when the short document is exactly what the user wants.
BM25 (Best Match 25, from the Okapi system in the 1990s) solves both. It is a probabilistic relevance model, not a heuristic. It asks: "What is the probability that this document is relevant, given this query?" and builds the math from there.
Elasticsearch switched from TF-IDF to BM25 as the default similarity algorithm in version 5.0. If you are on any modern version, you are using BM25 whether you know it or not.
The BM25 Formula, Decomposed
The full formula looks intimidating. Let us strip it to what actually matters:
score = IDF * (term_frequency_saturated) * boost
Three components. Let us walk through each.
1. IDF: Inverse Document Frequency
IDF = ln(1 + (N - n + 0.5) / (n + 0.5))
Where:
- N = total number of documents in the index
- n = number of documents containing the term
- ln = natural logarithm
What it means: Rare terms are more valuable. If "kubernetes" appears in 3 documents out of 10,000, those 3 documents are probably highly relevant to a query containing "kubernetes". If "the" appears in 9,999 documents, it contributes almost nothing to relevance scoring.
The + 0.5 in the formula prevents division by zero and smooths the curve. The ln(1 + ...) ensures the score stays positive and grows sub-linearly.
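To make the rare-versus-common contrast concrete, here is a minimal sketch that evaluates the IDF formula above for the two cases in the text. The function name and the 10,000-document index are illustrative assumptions, not anything Elasticsearch exposes.

```python
import math


def bm25_idf(total_docs: int, docs_with_term: int) -> float:
    """IDF as written above: ln(1 + (N - n + 0.5) / (n + 0.5))."""
    return math.log(1 + (total_docs - docs_with_term + 0.5) / (docs_with_term + 0.5))


# Hypothetical 10,000-document index, matching the example in the text.
rare = bm25_idf(10_000, 3)        # "kubernetes" appears in 3 documents
common = bm25_idf(10_000, 9_999)  # "the" appears in 9,999 documents

print(f"rare term IDF:   {rare:.3f}")
print(f"common term IDF: {common:.5f}")
```

The rare term ends up with an IDF several orders of magnitude larger than the near-universal one, which is why stopword-like terms barely move the final score.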
2. Term Frequency (Saturated)
This is where BM25 diverges from TF-IDF. The formula is:
TF_component = (f * (k1 + 1)) / (f + k1 * (1 - b + b * (dl / avgdl)))
Where:
- f = raw term frequency (how many times the term appears in this document)
- k1 = saturation parameter (default 1.2)
- b = length normalization parameter (default 0.75)
- dl = document length (in terms)
- avgdl = average document length across the index
The key insight: The numerator f * (k1 + 1) grows linearly with frequency, but the denominator f + k1 * (...) grows too. The result is a curve that saturates:
| Term Frequency | TF-IDF term frequency (linear) | BM25 TF component (k1=1.2, b=0.75, dl = avgdl) |
|---|---|---|
| 1 | 1.0 | 1.00 |
| 5 | 5.0 | 1.77 |
| 10 | 10.0 | 1.96 |
| 50 | 50.0 | 2.15 |
| 100 | 100.0 | 2.17 |
At 100 occurrences, a linear term frequency says 100x the relevance of a single occurrence. BM25 says roughly 2.2x, and the component can never exceed k1 + 1 = 2.2 no matter how often the term repeats. The 100th repetition is barely adding signal. This mirrors human judgment: a document that mentions "docker" 100 times is not 100 times more relevant than one that mentions it 10 times.
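The saturation is easy to check by evaluating the formula directly. This is a small sketch, assuming the default parameters and a document of exactly average length (so the length term equals 1); the function name is made up for illustration.

```python
def bm25_tf(freq: float, k1: float = 1.2, b: float = 0.75,
            doc_len: float = 1.0, avg_doc_len: float = 1.0) -> float:
    """Saturated TF component: f * (k1 + 1) / (f + k1 * (1 - b + b * dl / avgdl))."""
    length_factor = 1 - b + b * (doc_len / avg_doc_len)
    return freq * (k1 + 1) / (freq + k1 * length_factor)


# Compare raw (linear) term frequency with the saturated BM25 component.
for f in (1, 5, 10, 50, 100):
    print(f"f={f:>3}  linear={f:>5.1f}  bm25={bm25_tf(f):.2f}")
```

With these assumptions the component approaches, but never reaches, k1 + 1 as the frequency grows.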
3. The k1 Parameter: Controlling Saturation
k1 (default 1.2) controls how quickly the saturation curve flattens:
- Low k1 (0.1-0.5): Rapid saturation. A document with 5 occurrences scores almost as high as one with 50. Good for fields where repetition is not informative (e.g., boilerplate text, categories).
- High k1 (1.5-2.0): Slow saturation. Later occurrences still add noticeable score. Good for content-heavy fields where depth matters (e.g., blog posts, documentation).
- k1 = 0: Pure binary relevance. One occurrence counts; everything after is ignored.
- k1 = infinity: Approaches TF-IDF behavior (linear).
In practice, the default 1.2 works well for general text. But knowing when to tune it is what separates working search from great search.
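One way to get a feel for k1 before touching an index is to look at how much of the maximum possible TF score a document has already reached after a few occurrences. The sketch below assumes a document of average length (length term equal to 1); the helper names are hypothetical.

```python
def tf_component(freq: float, k1: float, length_factor: float = 1.0) -> float:
    """Saturated TF component for a given k1 (average-length document by default)."""
    return freq * (k1 + 1) / (freq + k1 * length_factor)


def share_of_ceiling(freq: float, k1: float) -> float:
    """Fraction of the maximum TF score (k1 + 1) already reached at `freq`."""
    return tf_component(freq, k1) / (k1 + 1)


for k1 in (0.0, 0.3, 1.2, 2.0):
    one = share_of_ceiling(1, k1)
    five = share_of_ceiling(5, k1)
    print(f"k1={k1}: 1 occurrence -> {one:.0%}, 5 occurrences -> {five:.0%}")
```

A low k1 hands out nearly the full score on the first occurrence, while a higher k1 leaves room for later occurrences to keep contributing.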
4. The b Parameter: Document Length Normalization
b (default 0.75) controls how aggressively BM25 penalizes long documents:
length_factor = (1 - b + b * (dl / avgdl))
- b = 0: No length normalization. A 10,000-word document and a 100-word document compete on equal footing. Good for short, structured fields like product SKUs or log tags.
- b = 1: Full normalization. Long documents are heavily penalized. Good for mixed-length content where brevity signals quality (e.g., Q&A, support tickets).
- b = 0.75 (default): A balanced compromise. The 10,000-word doc gets a moderate penalty, but not enough to bury it if it is genuinely relevant.
Why this matters: Without length normalization, your Wikipedia clone will tend to outrank your focused blog post simply because it contains more words. The user searching "kubernetes ingress tutorial" wants the 500-word guide, not the 5,000-word encyclopedia entry.
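The effect of b is easiest to see by scoring the same term frequency in a short and a long document. This sketch assumes a hypothetical average length of 1,000 terms and three occurrences of the term in each document; only b changes between runs.

```python
def length_factor(doc_len: float, avg_doc_len: float, b: float = 0.75) -> float:
    """Length term from the formula above: 1 - b + b * (dl / avgdl)."""
    return 1 - b + b * (doc_len / avg_doc_len)


def tf_component(freq: float, doc_len: float, avg_doc_len: float,
                 k1: float = 1.2, b: float = 0.75) -> float:
    """Saturated TF component including length normalization."""
    return freq * (k1 + 1) / (freq + k1 * length_factor(doc_len, avg_doc_len, b))


AVG_LEN = 1_000  # hypothetical average document length, in terms

for b in (0.0, 0.75, 1.0):
    short = tf_component(3, doc_len=500, avg_doc_len=AVG_LEN, b=b)
    long_ = tf_component(3, doc_len=5_000, avg_doc_len=AVG_LEN, b=b)
    print(f"b={b}: 500-term doc={short:.2f}, 5,000-term doc={long_:.2f}")
```

At b = 0 the two documents score identically; as b rises, the gap in favor of the shorter document widens.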
5. Boost: Query-Time Control
Elasticsearch allows boosting at query time:
{
  "query": {
    "multi_match": {
      "query": "kubernetes deployment",
      "fields": ["title^3", "content^1", "tags^2"]
    }
  }
}
The ^3 multiplies the title field's BM25 score by 3. This is not changing the underlying similarity algorithm - it is applying a coefficient after the fact. But it is the most common way engineers influence ranking without reindexing.
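To reason about boosts on paper, it helps to model how the per-field scores are combined. A multi_match query defaults to the best_fields type, which takes the best-scoring field and optionally adds a fraction (tie_breaker) of the others. The sketch below is a simplified model of that combination, with made-up per-field scores; it is not Elasticsearch's implementation.

```python
from typing import Dict


def best_fields_score(field_scores: Dict[str, float],
                      boosts: Dict[str, float],
                      tie_breaker: float = 0.0) -> float:
    """Best boosted field score plus tie_breaker times the remaining field scores."""
    boosted = [score * boosts.get(field, 1.0) for field, score in field_scores.items()]
    if not boosted:
        return 0.0
    best = max(boosted)
    return best + tie_breaker * (sum(boosted) - best)


# Hypothetical raw BM25 scores for one document, before boosts are applied.
scores = {"title": 1.4, "content": 2.9, "tags": 0.8}
boosts = {"title": 3.0, "content": 1.0, "tags": 2.0}

# The boosted title score (1.4 * 3) beats the unboosted content score (2.9).
print(best_fields_score(scores, boosts))
```

In this model a modest title match overtakes a stronger content match purely because of the coefficient, which is exactly the lever the ^3 syntax gives you.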
BM25 in Practice: Three Real-World Scenarios
Scenario 1: E-Commerce Product Search
You have products with titles ("Nike Air Max 90"), descriptions (500 words), and specifications (structured key-value pairs). The challenge: users type short queries like "running shoes" and expect the most relevant product first.
Tuning approach:
- Title field: k1=1.2, b=0.3 (short, precise, no heavy length penalty needed)
- Description field: k1=1.2, b=0.75 (standard)
- Specifications field: k1=0.3, b=0.0 (structured, repetition is noise)
Query-time boost: title^3, description^1, specs^2 (specs matter because "size 10" is a strong signal, but title wins for general queries).
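As a rough sketch, the per-field tuning above could be expressed as an index body like the one built below. The field names (title, description, specifications) and the similarity names (title_bm25, specs_bm25) are assumptions made for illustration; the script only prints the JSON you would send when creating the index.

```python
import json


def bm25(k1: float, b: float) -> dict:
    """Build a BM25 similarity definition for the index settings."""
    return {"type": "BM25", "k1": k1, "b": b}


index_body = {
    "settings": {"index": {"similarity": {
        "title_bm25": bm25(1.2, 0.3),   # hypothetical name
        "specs_bm25": bm25(0.3, 0.0),   # hypothetical name
    }}},
    "mappings": {"properties": {
        "title": {"type": "text", "similarity": "title_bm25"},
        # No explicit similarity: keeps the index default (k1=1.2, b=0.75).
        "description": {"type": "text"},
        "specifications": {"type": "text", "similarity": "specs_bm25"},
    }},
}

print(json.dumps(index_body, indent=2))
```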
Scenario 2: Log Analytics (Observability)
You are searching application logs. Documents are semi-structured JSON: message, service_name, level, timestamp. The query: level:ERROR AND message:"connection timeout".
Tuning approach:
- message field: k1=1.5, b=0.5 (repetition of "timeout" across multiple log lines is meaningful - saturation should be slower)
- service_name: k1=0.1, b=0.0 (binary match - either the service is relevant or not)
- level: k1=0.0, b=0.0 (exact match, no scoring needed - use a filter)
Key insight: In log search, exact matches on structured fields often matter more than free-text relevance. Use filter contexts for level and service_name to skip scoring entirely and leverage caching.
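A minimal sketch of that split between filtering and scoring is shown below. It assumes level and service_name are mapped as keyword fields (so term queries match exactly) and that message is a text field; the helper name is hypothetical.

```python
import json
from typing import Optional


def log_search(level: str, phrase: str, service: Optional[str] = None) -> dict:
    """Build a bool query: structured fields as filters, free text scored by BM25."""
    filters = [{"term": {"level": level}}]
    if service is not None:
        filters.append({"term": {"service_name": service}})
    return {
        "query": {
            "bool": {
                # Filter clauses do not contribute to the score.
                "filter": filters,
                # Only the message phrase is scored.
                "must": [{"match_phrase": {"message": phrase}}],
            }
        }
    }


print(json.dumps(log_search("ERROR", "connection timeout"), indent=2))
```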
Scenario 3: Documentation & Knowledge Base
Technical documentation with long-form articles. Users search "how to configure TLS in Elasticsearch" and expect the setup guide, not the security architecture overview.
Tuning approach:
- title field: k1=0.8, b=0.3 (titles are short and precise - saturation should be fast)
- content field: k1=1.5, b=0.9 (long articles, heavy length normalization - but slow saturation because depth matters)
- headings field: k1=0.6, b=0.2 (section headings are like mini-titles)
Query-time: headings^4, title^3, content^1 (a matching heading is usually the best signal).
Customizing BM25 in Your Index
Here is how to set custom BM25 parameters at index creation:
PUT /my-index
{
  "settings": {
    "index": {
      "similarity": {
        "my_custom_bm25": {
          "type": "BM25",
          "k1": 1.5,
          "b": 0.6
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "title": {
        "type": "text",
        "similarity": "my_custom_bm25"
      }
    }
  }
}
And here is how to change the default similarity for all text fields:
PUT /my-index
{
  "settings": {
    "index": {
      "similarity": {
        "default": {
          "type": "BM25",
          "k1": 1.2,
          "b": 0.75
        }
      }
    }
  }
}
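Important: Similarity is not something you can casually change later. The similarity assigned to a field in the mapping cannot be changed once the field exists, so switching a field to a different similarity means reindexing. The parameters of a similarity that is already defined in the settings (including the default) can be updated, but only by closing the index, changing the settings, and reopening it. Plan this during index design, not after you have 10 million documents.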
Debugging Scores with the _explain API
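The most powerful tool for understanding BM25 is explain output. You can get it from the _explain API for a single document, or, as here, by setting explain: true on a search request: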
GET /my-index/_search
{
  "explain": true,
  "query": {
    "match": {
      "title": "kubernetes"
    }
  }
}
The response breaks down every component of the score for every matched document, under each hit's _explanation field. The exact wording and nesting vary by Elasticsearch version, but you will see entries along these lines:
"details": [
  {
    "description": "idf, computed as log(1 + (N - n + 0.5) / (n + 0.5))",
    "value": 2.734
  },
  {
    "description": "tfNorm, computed as (f * (k1 + 1)) / (f + k1 * (1 - b + b * dl / avgdl))",
    "value": 0.912
  }
]
When a stakeholder asks "why did this document rank first?", _explain is your answer. It turns scoring from a black box into an auditable decision tree.
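Explain output is a nested tree of value, description, and details nodes, which gets hard to read in raw JSON. A small helper like the one below can flatten it into indented lines. The sample tree reuses the two values from the excerpt above; the parent node and its description are invented for illustration.

```python
from typing import List


def flatten_explanation(node: dict, depth: int = 0) -> List[str]:
    """Render an explanation tree as indented 'value  description' lines."""
    lines = [f"{'  ' * depth}{node['value']:.3f}  {node['description']}"]
    for child in node.get("details", []):
        lines.extend(flatten_explanation(child, depth + 1))
    return lines


# Hypothetical explanation tree; a real one comes from a hit's _explanation field.
sample = {
    "value": 2.734 * 0.912,
    "description": "score, product of idf and tfNorm (illustrative)",
    "details": [
        {"value": 2.734, "description": "idf", "details": []},
        {"value": 0.912, "description": "tfNorm", "details": []},
    ],
}

print("\n".join(flatten_explanation(sample)))
```

Printing the trees for two competing hits side by side is usually enough to see which factor (IDF, term frequency, or length) decided the ranking.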
Five Common BM25 Pitfalls (and How to Avoid Them)
1. Not Tuning for Content Type
The default k1=1.2, b=0.75 works for generic text. It fails for:
- Short structured fields (titles, SKUs, tags) - use lower b
- Very long documents (legal contracts, books) - use higher b
- Repetitive content (logs, social media feeds) - use lower k1
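Fix: Analyze your document length distribution relative to avgdl. If most documents are short and close to the same length (say, 90% under 100 words), length differences carry little signal and b=0.75 may be over-penalizing the slightly longer ones.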
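One possible way to run that analysis is sketched below. It assumes you have already collected per-document lengths in terms (for example by counting analyzed tokens offline); the sample lengths and the function name are invented. It reports how far the length term swings between the 10th and 90th percentile for a given b.

```python
import statistics
from typing import Dict, Sequence


def length_report(doc_lengths: Sequence[float], b: float = 0.75) -> Dict[str, float]:
    """Summarize the length distribution and the resulting BM25 length term."""
    avg = statistics.fmean(doc_lengths)
    deciles = statistics.quantiles(doc_lengths, n=10)  # 9 cut points
    p10, p90 = deciles[0], deciles[-1]

    def factor(dl: float) -> float:
        # Length term from the BM25 formula: 1 - b + b * (dl / avgdl)
        return 1 - b + b * (dl / avg)

    return {"avgdl": avg, "p10": p10, "p90": p90,
            "factor_p10": factor(p10), "factor_p90": factor(p90)}


# Hypothetical corpus: mostly short documents plus a few very long ones.
lengths = [80, 90, 95, 100, 105, 110, 120, 130, 150, 4000, 6000]
for key, value in length_report(lengths).items():
    print(f"{key}: {value:.2f}")
```

If factor_p10 and factor_p90 are close together, b is doing very little; if they are far apart, it is worth checking with explain output whether long documents are being pushed down more than you intend.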
2. Over-Boosting Into Noise
{
  "fields": ["title^10", "content^1"]
}
A 10x boost on title means a single title match can outweigh 20 content matches. If the title is "The Ultimate Guide to Everything", that is a bad outcome. The user wants content, not clickbait titles.
Fix: Start with ^1 to ^3 boosts. Use _explain to verify the math before going higher.
3. Ignoring the Filter Context
Using must + match for exact-match fields like status:active or category:electronics wastes compute. These belong in a filter clause, which skips scoring entirely and is cacheable.
{
  "query": {
    "bool": {
      "must": [
        { "match": { "description": "wireless headphones" } }
      ],
      "filter": [
        { "term": { "category": "electronics" } },
        { "term": { "status": "active" } }
      ]
    }
  }
}
The filter terms do not affect BM25 score. They just prune the candidate set. This is faster and cleaner.
4. Assuming BM25 Is Enough for Semantic Search
BM25 is lexical. It matches exact terms or their stemmed variants. It does not understand that "k8s" and "kubernetes" are the same thing, or that "docker container" and "containerization" are related concepts.
For semantic search, you need dense vectors (embeddings) alongside BM25. Elasticsearch 8.15+ supports this natively with hybrid search and Reciprocal Rank Fusion (RRF).
5. Benchmarking With Wrong Settings
Do not benchmark search quality on an index with refresh_interval: -1 (disabled refresh) and then complain that results are stale. BM25 scores depend on the index state - document counts, average length, term frequencies. All of these change as documents are added or removed.
Fix: Benchmark on a production-like index. If you are testing on a 1,000-document dev index, your avgdl and IDF values will not match the 10-million-document production reality.
The Modern Context: BM25 + Vector Search
In 2025 and 2026, the most interesting search systems are not purely lexical or purely semantic. They are hybrid. BM25 handles the exact-term matching that vectors miss. Dense vectors (from models like E5 or OpenAI embeddings) and learned sparse models such as ELSER handle the conceptual similarity that BM25 cannot reach.
Elasticsearch supports this with the retrievers API and RRF (Reciprocal Rank Fusion):
GET /my-index/_search
{
  "retriever": {
    "rrf": {
      "retrievers": [
        {
          "standard": {
            "query": {
              "match": {
                "content": "kubernetes deployment strategies"
              }
            }
          }
        },
        {
          "knn": {
            "field": "content_vector",
            "query_vector": [...],
            "k": 10,
            "num_candidates": 100
          }
        }
      ]
    }
  }
}
BM25 finds documents containing the exact terms. kNN finds documents that are conceptually similar. RRF merges the rankings without requiring score normalization. The result is search that is both precise and intelligent.
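The reason RRF needs no score normalization is that it only looks at ranks. A minimal sketch of the idea follows: each document earns 1 / (rank_constant + rank) from every list it appears in, and the sums decide the final order. The document IDs are made up, and 60 is used as the rank constant because that is the commonly cited default; this illustrates the formula and is not Elasticsearch's code.

```python
from collections import defaultdict
from typing import Dict, List, Sequence


def reciprocal_rank_fusion(rankings: Sequence[Sequence[str]],
                           rank_constant: int = 60) -> List[str]:
    """Merge ranked lists of document IDs using Reciprocal Rank Fusion."""
    scores: Dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (rank_constant + rank)
    # Highest fused score first; ties broken by ID for a stable order.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


bm25_hits = ["a", "b", "c"]  # hypothetical lexical ranking
knn_hits = ["c", "a", "d"]   # hypothetical vector ranking

print(reciprocal_rank_fusion([bm25_hits, knn_hits]))
```

Documents that appear near the top of both lists rise above documents that are ranked highly by only one retriever, regardless of how the raw BM25 and vector scores compare.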
This is the architecture behind modern RAG (Retrieval-Augmented Generation) pipelines. The LLM needs the most relevant context, and BM25 + vectors together deliver better retrieval than either alone.
When to Use What: A Decision Framework
| Use Case | Primary Scoring | Secondary/Tuning |
|---|---|---|
| E-commerce product search | BM25 with field boosts | k1=1.2, b=0.3 for titles |
| Log analytics | Filter + BM25 on message | k1=1.5 for repeated error patterns |
| Documentation/knowledge base | BM25 with heading boosts | k1=1.5, b=0.9 for long content |
| Semantic search (Q&A) | Dense vectors + RRF | BM25 for exact matches, vectors for meaning |
| Structured data search (exact match) | Filter context | Skip BM25 entirely |
| Legal/contract search | BM25 with phrase queries | k1=0.8, b=0.8 for mixed-length legal text |
Conclusion
BM25 is not just a default setting. It is a probabilistic relevance model that makes specific, tunable trade-offs about term frequency saturation and document length normalization. Understanding those trade-offs lets you build search that ranks the right documents for your specific data and users.
The key takeaways:
- BM25 saturates term frequency: The 50th keyword mention adds less than the 5th. This is a feature, not a bug.
- Length normalization matters: Longer documents are penalized unless you tune b for your content distribution.
- The defaults are just a starting point: k1=1.2, b=0.75 works for generic English text. Your data is probably not generic.
- Use _explain for debugging: When rankings look wrong, the math is there. You just need to read it.
- Hybrid is the future: BM25 for lexical precision, vector search for semantic understanding. Together they power modern RAG systems.
The next time someone asks why a document ranked where it did, you will have an answer grounded in probability theory, not guesswork.
I'm Prithvi S, Staff Software Engineer at Cloudera and open-source enthusiast. I build search systems, data pipelines, and the occasional distributed system. Follow my work on GitHub: https://github.com/iprithv
References
- Elasticsearch Similarity Module: https://www.elastic.co/guide/en/elasticsearch/reference/current/index-modules-similarity.html
- BM25 Original Paper: Robertson, S.E., et al. "Okapi at TREC-3." NIST Special Publication, 1995.
- Elasticsearch 8.15 Retrievers API: https://www.elastic.co/guide/en/elasticsearch/reference/current/retriever.html
- Hybrid Search with RRF: https://www.elastic.co/guide/en/elasticsearch/reference/current/rrf.html