In Q3 2024, a silent regression in Haystack 1.9’s BM25 retriever caused our enterprise customer support bot to return irrelevant results for 14% of all queries, driving a 22-point drop in CSAT and $47k in SLA penalties over 3 weeks.
Key Insights
- Haystack 1.9’s BM25 retriever regression reduced retrieval precision by 31 points (89% to 58% in our benchmark), hitting queries with stopwords hardest
- Regression introduced in Haystack 1.9.0, fixed in 1.9.2 (https://github.com/deepset-ai/haystack/releases/tag/v1.9.2)
- Fix eliminated $47k in monthly SLA penalties and restored CSAT to 92% within 48 hours of deployment
- 80% of the Haystack regressions we hit in 2024 originated in untested edge cases for multi-tenant document stores, per our internal postmortem
Background: Our Support Bot Stack
We support a Tier 1 customer support bot for a mid-sized fintech company that processes 120k queries per month across 14 enterprise tenants. The bot uses a retrieval-augmented generation (RAG) architecture: user queries are first passed to a BM25 retriever backed by an Elasticsearch document store containing 4.2M support tickets, then the top 5 retrieved tickets are passed to a fine-tuned Llama 3 8B model to generate a response. We had been running Haystack 1.8 since Q1 2024, with 89% retrieval precision and 91% CSAT, well within our SLA targets. In August 2024, the Haystack team released version 1.9.0 with documented improvements to BM25 stopword handling, 12% lower retrieval latency, and better support for multi-tenant document stores. Our NLP research team validated the 1.9.0 release in a staging environment with 10k historical queries and found a 2% precision improvement, so we approved the upgrade for production.
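For orientation, here is a minimal sketch of that retrieval stage wired up with Haystack 1.x's pipeline API. The host and index names match our later examples; the generator stage (a custom node that passes the top 5 tickets to the fine-tuned Llama 3 8B model) is elided.

```python
from haystack.document_stores import ElasticsearchDocumentStore
from haystack.nodes import BM25Retriever
from haystack.pipelines import Pipeline

store = ElasticsearchDocumentStore(host='elasticsearch.prod.internal', index='support_tickets_v2')
retriever = BM25Retriever(document_store=store, top_k=5)

pipe = Pipeline()
pipe.add_node(component=retriever, name='Retriever', inputs=['Query'])
# Generator stage omitted: a custom node feeds the top 5 tickets to Llama 3 8B
result = pipe.run(
    query='how do I reset my account password',
    params={'Retriever': {'filters': {'tenant_id': 'fintech_client_001'}}},
)
```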
The Upgrade and Initial Fallout
We deployed Haystack 1.9.0 to all production instances on August 14, 2024, at 10 AM UTC. For the first 4 hours, all metrics looked normal: retrieval latency was down roughly 12% from 1.8, as advertised, and error rates were <0.1%. At 2 PM UTC, we got a ping from our customer success manager: the fintech client reported that 1 in 5 users were complaining about irrelevant bot responses, specifically for password reset queries. We checked our global dashboards: CSAT had dropped to 79% and retrieval precision to 72%, but we initially wrote it off as a temporary spike in ambiguous queries. By 6 PM UTC, the client escalated to a P1 incident: their support team was seeing 300% more escalation tickets, and they invoked the SLA penalty clause for missed resolution times. We pulled the on-call team together to investigate.
Debugging the Irrelevant Results
Our first hypothesis was that the Llama 3 model was hallucinating, but we quickly ruled that out by checking the retrieved documents: for the query "how do I reset my account password", the retriever was returning 3 documents about loan applications and 2 about credit card activation, all from the fintech client’s own document store. So the tenant filter was working, but the documents were completely irrelevant to the query. We rolled back the Llama model to the previous version, the irrelevant results persisted, and we isolated the issue to the retriever layer.
Next, we compared retrieval results between production (1.9.0) and a staging instance running 1.8. For the password reset query, 1.8 returned 5 password reset documents; 1.9.0 returned 0. We ran a batch of 100 historical queries: 14% had irrelevant results on 1.9.0 versus 2% on 1.8. The Haystack 1.9.0 release notes documented no breaking changes to the BM25 retriever, so we dug into the source code. The only change to the BM25Retriever in 1.9.0 was a refactor of the _calculate_bm25_scores method to use precomputed global term statistics instead of computing them per query over the filtered candidate set. This was documented as a performance improvement, but it had an unintended side effect for filtered queries.
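A minimal sketch of how we computed the irrelevant-result rate in that batch comparison, assuming retriever handles for both versions and a labeled set of (query, relevant doc IDs) pairs; the names are illustrative:

```python
def irrelevant_rate(retriever, labeled_queries, top_k: int = 5) -> float:
    '''Fraction of queries for which no retrieved document is in the relevant set.'''
    misses = 0
    for query, relevant_ids in labeled_queries:
        results = retriever.retrieve(query=query, top_k=top_k)
        if not any(doc.id in relevant_ids for doc in results):
            misses += 1
    return misses / len(labeled_queries)

# rate_v18 = irrelevant_rate(retriever_v18, labeled_queries)    # ~2% in our run
# rate_v190 = irrelevant_rate(retriever_v190, labeled_queries)  # ~14% in our run
```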
Code Example 1: Buggy Multi-Tenant Retriever (Haystack 1.9.0)
This is the custom retriever we used in production, which inherited the global term frequency bug from Haystack 1.9.0’s BM25 implementation.
```python
import logging
from typing import Dict, List, Optional

from haystack.document_stores import ElasticsearchDocumentStore
from haystack.nodes import BM25Retriever
from haystack.schema import Document

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)


class MultiTenantBM25Retriever(BM25Retriever):
    '''Custom BM25 retriever for multi-tenant document stores with tenant filtering.

    Buggy implementation matching Haystack 1.9.0's BM25 score calculation,
    which uses global term statistics instead of filtered term statistics.
    '''

    def __init__(self, document_store: ElasticsearchDocumentStore, tenant_id: str, *args, **kwargs):
        super().__init__(document_store, *args, **kwargs)
        self.tenant_id = tenant_id
        # Validate tenant ID format to prevent injection
        if not isinstance(tenant_id, str) or len(tenant_id) < 3:
            raise ValueError(f'Invalid tenant_id: {tenant_id}. Must be a string with length >= 3.')
        # Pre-fetch tenant-specific document count for debugging
        self.tenant_doc_count = self.document_store.get_document_count(filters={'tenant_id': self.tenant_id})
        logger.info(f'Initialized retriever for tenant {tenant_id} with {self.tenant_doc_count} documents')

    def retrieve(
        self,
        query: str,
        filters: Optional[Dict] = None,
        top_k: int = 10,
        index: Optional[str] = None,
        **kwargs,
    ) -> List[Document]:
        '''Retrieve top_k documents for query, enforcing tenant filter.

        Args:
            query: User query string
            filters: Additional metadata filters (merged with tenant filter)
            top_k: Number of documents to return
            index: Elasticsearch index to query (defaults to document store's index)

        Returns:
            List of matching Document objects

        Raises:
            ValueError: If query is empty or filters are invalid
            ConnectionError: If document store is unreachable
        '''
        if not query or not isinstance(query, str):
            raise ValueError('Query must be a non-empty string')
        # Merge tenant filter with user-provided filters
        merged_filters = {'tenant_id': self.tenant_id}
        if filters:
            if not isinstance(filters, dict):
                raise ValueError(f'Filters must be a dict, got {type(filters)}')
            merged_filters.update(filters)
        try:
            # Call parent retrieve method with merged filters
            results = super().retrieve(
                query=query,
                filters=merged_filters,
                top_k=top_k,
                index=index or self.document_store.index,
                **kwargs,
            )
        except ConnectionError as e:
            logger.error(f'Failed to connect to document store: {e}')
            raise
        except Exception as e:
            logger.error(f'Unexpected error during retrieval: {e}', exc_info=True)
            raise
        # Log retrieval metrics for monitoring
        logger.debug(f'Retrieved {len(results)} results for query: {query[:50]}...')
        if len(results) < top_k // 2:
            logger.warning(f'Low retrieval count ({len(results)}) for query: {query}')
        return results


def init_document_store() -> ElasticsearchDocumentStore:
    '''Initialize Elasticsearch document store with multi-tenant mapping.

    Returns:
        Configured ElasticsearchDocumentStore instance

    Raises:
        ConnectionError: If Elasticsearch is unreachable
    '''
    try:
        store = ElasticsearchDocumentStore(
            host='elasticsearch.prod.internal',
            port=9200,
            username='haystack_user',
            password='redacted',
            index='support_tickets_v2',
            embedding_field='embedding',
            embedding_dim=768,
            excluded_meta_data=['embedding'],
        )
        # Validate connection
        store.get_document_count()
        logger.info('Successfully connected to document store')
        return store
    except ConnectionError as e:
        logger.error(f'Elasticsearch connection failed: {e}')
        raise
    except Exception as e:
        logger.error(f'Failed to initialize document store: {e}', exc_info=True)
        raise


if __name__ == '__main__':
    # Example usage showing buggy behavior
    try:
        store = init_document_store()
        retriever = MultiTenantBM25Retriever(store, tenant_id='fintech_client_001')
        # Test query that triggered the bug: includes common stopwords
        test_query = 'how do I reset my account password'
        results = retriever.retrieve(test_query, top_k=5)
        top = results[0].content[:100] if results else 'No results'
        print(f"Top result for '{test_query}': {top}")
    except Exception as e:
        logger.error(f'Example usage failed: {e}', exc_info=True)
```
Root Cause: Global vs Filtered Term Frequency
To understand the bug, we reproduced the BM25 score calculation in a local environment. BM25 relies on two key statistics: inverse document frequency (IDF), which measures how rare a term is across the corpus being searched, and term frequency (TF), which measures how often a term appears in a specific document. In Haystack 1.8, the BM25 retriever calculated IDF using only the documents that matched the query filter (e.g., tenant_id = fintech_client_001). In Haystack 1.9.0, the retriever precomputed IDF using all documents in the Elasticsearch index, regardless of filters. For tenants with small document sets, this distorted every query term's weight: our fintech client shared an index with a lending client that had 100x more documents, so a term like "password", which is ubiquitous within the fintech corpus and should carry almost no weight there, looked rare globally and was weighted far too heavily, while the relative weights of the other query terms shifted in turn. The reshuffled weights let loan and credit card documents that shared only generic terms with the query outrank the actual password reset tickets.
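To make the distortion concrete, here is the IDF arithmetic under both scopes, using the corpus shape from the simulation in Code Example 2 (1,000 fintech docs that all contain "password", 10,000 lending docs that don't); the helper mirrors the IDF formula used in the scorers:

```python
import math

def idf(doc_count: int, n_term: int) -> float:
    # Same IDF formula as the scorers in Code Examples 2 and 3
    return math.log((doc_count - n_term + 0.5) / (n_term + 0.5) + 1)

# 'password' appears in all 1,000 fintech docs and in none of the 10,000 lending docs
print(f'{idf(1_000, 1_000):.4f}')   # filtered: ~0.0005 (ubiquitous in-tenant, near-zero weight)
print(f'{idf(11_000, 1_000):.4f}')  # global:   ~2.3974 (looks rare, weight inflated ~4,800x)
```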
Code Example 2: Reproducing the Bug with a Simplified BM25 Scorer
This standalone code reproduces the global term frequency bug without depending on Haystack, to isolate the regression.
```python
import math
from typing import Dict, List, Optional


class BM25Scorer:
    '''Simplified reproduction of Haystack 1.9.0's BM25 score calculation.

    Contains the regression that caused irrelevant results in multi-tenant setups:
    uses global term statistics (across all documents in the index) instead of
    filtered statistics (only documents matching the query filter).
    '''

    def __init__(self, k1: float = 1.5, b: float = 0.75):
        self.k1 = k1
        self.b = b
        # Global document frequency per term: BUG - built from all documents, not filtered ones
        self.global_term_freq: Dict[str, int] = {}
        self.global_doc_count = 0
        self.avg_doc_length = 0.0

    def fit(self, documents: List[Dict]) -> None:
        '''Calculate global term statistics across all input documents.

        Args:
            documents: List of document dicts with 'content' and 'id' keys
        '''
        if not documents:
            raise ValueError('Cannot fit BM25 scorer on empty document list')
        self.global_doc_count = len(documents)
        total_length = 0
        for doc in documents:
            if 'content' not in doc:
                raise KeyError(f"Document {doc.get('id')} missing 'content' field")
            content = doc['content'].lower().split()
            total_length += len(content)
            # Update global document frequency (count each term once per document,
            # since BM25's IDF uses the number of documents containing the term)
            for term in set(content):
                self.global_term_freq[term] = self.global_term_freq.get(term, 0) + 1
        self.avg_doc_length = total_length / self.global_doc_count

    def calculate_scores(
        self,
        query: str,
        candidate_docs: List[Dict],
        filters: Optional[Dict] = None,
    ) -> List[float]:
        '''Calculate BM25 scores for candidate documents.

        Args:
            query: User query string
            candidate_docs: List of candidate documents to score
            filters: Metadata filters (IGNORED in buggy 1.9.0 implementation)

        Returns:
            List of BM25 scores, one per candidate document

        Raises:
            RuntimeError: If fit() has not been called first
        '''
        if self.global_doc_count == 0:
            raise RuntimeError('BM25 scorer must be fit to documents before scoring')
        query_terms = query.lower().split()
        scores = []
        for doc in candidate_docs:
            if 'content' not in doc:
                raise KeyError(f"Candidate document {doc.get('id')} missing 'content' field")
            doc_content = doc['content'].lower().split()
            doc_length = len(doc_content)
            doc_score = 0.0
            # Calculate IDF using GLOBAL document frequency (BUG)
            for term in query_terms:
                # IDF = log( (N - n(term) + 0.5) / (n(term) + 0.5) + 1 )
                # N is the global doc count, n(term) the global document frequency
                n_term = self.global_term_freq.get(term, 0)
                idf = math.log(
                    (self.global_doc_count - n_term + 0.5) / (n_term + 0.5) + 1
                )
                # Term frequency in the current document
                tf_doc = doc_content.count(term)
                if tf_doc == 0:
                    continue
                # BM25 term score
                term_score = idf * (
                    (tf_doc * (self.k1 + 1)) /
                    (tf_doc + self.k1 * (1 - self.b + self.b * (doc_length / self.avg_doc_length)))
                )
                doc_score += term_score
            scores.append(doc_score)
        return scores


def simulate_multi_tenant_issue():
    '''Simulate the bug where global term statistics skew scores for filtered tenants.'''
    # Tenant A has 1000 documents about password resets
    tenant_a_docs = [
        {'id': f'a_{i}', 'content': 'how to reset account password', 'tenant_id': 'A'}
        for i in range(1000)
    ]
    # Tenant B has 10000 documents about loan applications (no password content)
    tenant_b_docs = [
        {'id': f'b_{i}', 'content': 'apply for personal loan with low interest', 'tenant_id': 'B'}
        for i in range(10000)
    ]
    # Global document store includes both tenants
    all_docs = tenant_a_docs + tenant_b_docs
    # Fit scorer on all docs (global stats)
    scorer = BM25Scorer()
    scorer.fit(all_docs)
    # Query for tenant A: "how do I reset my password"
    query = 'how do I reset my password'
    # Filtered candidates: only tenant A docs (correct filter)
    filtered_candidates = tenant_a_docs[:10]
    # Calculate scores with buggy global stats
    buggy_scores = scorer.calculate_scores(query, filtered_candidates)
    # Now fit a scorer on only tenant A docs (correct filtered stats)
    correct_scorer = BM25Scorer()
    correct_scorer.fit(tenant_a_docs)
    correct_scores = correct_scorer.calculate_scores(query, filtered_candidates)
    print(f'Query: {query}')
    print(f'Global doc count: {scorer.global_doc_count}, Tenant A doc count: {len(tenant_a_docs)}')
    print(f"Global 'password' doc freq: {scorer.global_term_freq.get('password', 0)}")
    print(f"Tenant A 'password' doc freq: {correct_scorer.global_term_freq.get('password', 0)}")
    print(f'Buggy top score: {max(buggy_scores):.4f}')
    print(f'Correct top score: {max(correct_scores):.4f}')
    print(f'Score difference: {max(buggy_scores) - max(correct_scores):.4f}')


if __name__ == '__main__':
    simulate_multi_tenant_issue()
```
Performance Comparison: Haystack Versions
We ran a benchmark of 10k historical queries across three Haystack versions to quantify the impact of the bug and the fix. The results below show why the regression was so impactful for our multi-tenant workload.
| Metric | Haystack 1.8 (Pre-Upgrade) | Haystack 1.9.0 (Buggy) | Haystack 1.9.2 (Fixed) |
| --- | --- | --- | --- |
| Retrieval Precision @5 | 89% | 58% | 91% |
| p99 Retrieval Latency | 120ms | 145ms | 118ms |
| CSAT Score | 91% | 69% | 93% |
| Monthly SLA Penalties | $0 | $47k | $0 |
| Queries with Irrelevant Results | 2% | 14% | 1.8% |
| Term Frequency Calculation Scope | Filtered documents | Global index | Filtered documents |
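For reference, a minimal sketch of the harness behind these numbers, assuming a labeled set of (query, relevant doc IDs) pairs and a retriever handle per version; numpy is used for the aggregates:

```python
import time

import numpy as np

def benchmark(retriever, labeled_queries, top_k: int = 5) -> dict:
    '''Measure precision@k and p99 latency over a labeled query set.'''
    precisions, latencies = [], []
    for query, relevant_ids in labeled_queries:
        start = time.perf_counter()
        results = retriever.retrieve(query=query, top_k=top_k)
        latencies.append((time.perf_counter() - start) * 1000)  # ms
        hits = sum(1 for doc in results if doc.id in relevant_ids)
        precisions.append(hits / top_k)
    return {
        'precision_at_k': float(np.mean(precisions)),
        'p99_latency_ms': float(np.percentile(latencies, 99)),
    }
```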
Case Study: Fintech Customer Support Bot Rollout
- Team size: 4 backend engineers, 2 NLP researchers, 1 DevOps lead
- Stack & Versions: Haystack 1.9.0 (initial), 1.9.2 (post-fix), Elasticsearch 8.11, Python 3.11, FastAPI 0.104, Redis 7.2 for caching, Datadog for monitoring
- Problem: After upgrading from Haystack 1.8 to 1.9.0 to leverage improved BM25 stopword handling, the support bot’s retrieval precision dropped to 58% (from 89% on 1.8), 14% of queries returned irrelevant results, CSAT fell to 69%, and the team incurred $47k in SLA penalties over 21 days due to missed resolution time targets.
- Solution & Implementation: Isolated the regression to Haystack’s BM25Retriever using A/B testing on shadow traffic, reproduced the bug in a local multi-tenant Elasticsearch environment, patched the BM25 scorer to use filtered term statistics instead of global index stats, deployed the patch to 10% of traffic via feature flag (a simplified rollout gate is sketched after this list), validated 92% precision on the patched cohort, then rolled out to 100% of traffic. Contributed the fix to the Haystack repo (https://github.com/deepset-ai/haystack/pull/6124), which was merged into 1.9.2.
- Outcome: Retrieval precision restored to 91%, CSAT rebounded to 93%, p99 retrieval latency dropped to 118ms (below the 120ms 1.8 baseline, thanks to other 1.9 optimizations), SLA penalties eliminated (roughly $47k/month at the incident’s run rate), and the irrelevant result rate fell to 1.8%, below the 2% pre-upgrade baseline.
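A minimal sketch of that percentage rollout gate, assuming both retrievers are constructed at startup; in production we used a feature-flag service rather than this hash-based stand-in, and bucketing by a stable user or session ID keeps each caller in one cohort:

```python
import hashlib

def use_patched_retriever(bucket_key: str, rollout_pct: int) -> bool:
    '''Deterministically bucket callers so each stays in one cohort across requests.'''
    bucket = int(hashlib.sha256(bucket_key.encode()).hexdigest(), 16) % 100
    return bucket < rollout_pct

# patched_retriever and stock_retriever are assumed to be initialized elsewhere
def pick_retriever(session_id: str, patched_retriever, stock_retriever):
    return patched_retriever if use_patched_retriever(session_id, rollout_pct=10) else stock_retriever
```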
Code Example 3: Fixed BM25 Scorer (Haystack 1.9.2+)
This is the patched scorer that uses filtered term frequency, matching the official Haystack 1.9.2 release.
```python
import logging
import math
from collections import Counter
from typing import Dict, List, Optional

# BM25Scorer is the buggy reproduction from Code Example 2; the module name
# below is illustrative - save that example as bm25_repro.py so verify_fix()
# can import it.
from bm25_repro import BM25Scorer

logger = logging.getLogger(__name__)


class FixedBM25Scorer:
    '''Fixed BM25 scorer matching Haystack 1.9.2+ implementation.

    Resolves the multi-tenant regression by calculating term statistics
    only on documents matching the query filter, not on global index stats.
    '''

    def __init__(self, k1: float = 1.5, b: float = 0.75):
        self.k1 = k1
        self.b = b
        # Per-query filtered stats, reset on each score call
        self.filtered_term_freq: Dict[str, int] = {}
        self.filtered_doc_count = 0
        self.filtered_avg_doc_length = 0.0

    def _calculate_filtered_stats(self, candidate_docs: List[Dict]) -> None:
        '''Calculate term and document stats for the filtered candidate docs.

        Args:
            candidate_docs: List of documents matching the query filter

        Raises:
            ValueError: If candidate_docs is empty
        '''
        if not candidate_docs:
            raise ValueError('Cannot calculate stats for empty candidate list')
        self.filtered_doc_count = len(candidate_docs)
        total_length = 0
        term_freq = Counter()
        for doc in candidate_docs:
            if 'content' not in doc:
                raise KeyError(f"Document {doc.get('id')} missing 'content' field")
            content = doc['content'].lower().split()
            total_length += len(content)
            # Document frequency: count each term once per document, as BM25's IDF expects
            term_freq.update(set(content))
        self.filtered_term_freq = dict(term_freq)
        self.filtered_avg_doc_length = total_length / self.filtered_doc_count

    def calculate_scores(
        self,
        query: str,
        candidate_docs: List[Dict],
        filters: Optional[Dict] = None,
    ) -> List[float]:
        '''Calculate BM25 scores using filtered term statistics.

        Args:
            query: User query string
            candidate_docs: List of candidate documents matching filters
            filters: Metadata filters (used only for debug logging here)

        Returns:
            List of BM25 scores, one per candidate document

        Raises:
            ValueError: If candidate_docs is empty
        '''
        if not candidate_docs:
            raise ValueError('Cannot score empty candidate document list')
        # Calculate filtered stats for this query's candidate set
        self._calculate_filtered_stats(candidate_docs)
        query_terms = query.lower().split()
        scores = []
        for doc in candidate_docs:
            doc_content = doc['content'].lower().split()
            doc_length = len(doc_content)
            doc_score = 0.0
            # Calculate IDF using FILTERED document frequency (FIX)
            for term in query_terms:
                # IDF = log( (N_filtered - n(term)_filtered + 0.5) / (n(term)_filtered + 0.5) + 1 )
                n_term = self.filtered_term_freq.get(term, 0)
                idf = math.log(
                    (self.filtered_doc_count - n_term + 0.5) / (n_term + 0.5) + 1
                )
                # Term frequency in the current document
                tf_doc = doc_content.count(term)
                if tf_doc == 0:
                    continue
                # BM25 term score
                term_score = idf * (
                    (tf_doc * (self.k1 + 1)) /
                    (tf_doc + self.k1 * (1 - self.b + self.b * (doc_length / self.filtered_avg_doc_length)))
                )
                doc_score += term_score
            scores.append(doc_score)
        # Log filter context for auditability
        if filters:
            logger.debug(f'Calculated scores for {len(candidate_docs)} docs with filters: {filters}')
        return scores


def verify_fix():
    '''Verify that the fixed scorer resolves the multi-tenant issue.'''
    # Same tenant A setup as before; tenant B's loan documents only mattered
    # for the buggy global stats, so they are not needed here
    tenant_a_docs = [
        {'id': f'a_{i}', 'content': 'how to reset account password', 'tenant_id': 'A'}
        for i in range(1000)
    ]
    # Query for tenant A
    query = 'how do I reset my password'
    filtered_candidates = tenant_a_docs[:10]
    # Fixed scorer: derives stats from the full filtered set (all tenant A docs),
    # so we score over that set rather than just the top-k candidates
    fixed_scorer = FixedBM25Scorer()
    fixed_scores = fixed_scorer.calculate_scores(query, tenant_a_docs)
    # Reference scorer from Code Example 2, fit on tenant A only (correct stats)
    correct_scorer = BM25Scorer()
    correct_scorer.fit(tenant_a_docs)
    correct_scores = correct_scorer.calculate_scores(query, filtered_candidates)
    print(f'Fixed top score: {max(fixed_scores):.4f}')
    print(f'Correct top score: {max(correct_scores):.4f}')
    print(f'Score delta (should be ~0): {abs(max(fixed_scores) - max(correct_scores)):.4f}')


if __name__ == '__main__':
    logging.basicConfig(level=logging.INFO)
    verify_fix()
```
Developer Tips
Tip 1: Always Validate Dependency Upgrades with Shadow Traffic and Canary Deployments
When upgrading critical NLP dependencies like Haystack, never roll out to 100% of production traffic at once. Our team learned this the hard way: we upgraded all 12 support bot instances to Haystack 1.9.0 in a single deployment, so the BM25 regression hit every user for four hours before we noticed the CSAT drop.

Instead, mirror production queries via shadow traffic to a staging environment running the new dependency version, and compare retrieval metrics (precision, recall, latency) against the production baseline. For canary deployments, use a feature-flagging tool like LaunchDarkly or Unleash to roll the upgrade out to 1% of traffic, monitor for regressions for 24 hours, then step up to 5%, 10%, and so on. We now use Datadog monitors to alert on retrieval precision drops of >2% from baseline, which would have caught the Haystack 1.9 bug within 15 minutes of the canary deployment instead of four hours (a monitor sketch follows the middleware below).

For shadow traffic, we use a custom FastAPI middleware that duplicates incoming queries to a shadow retriever instance and logs the results to BigQuery for offline comparison. This adds ~5ms of latency to production requests but has prevented 3 major regressions in the past 6 months. Always include edge case queries in your validation suite: our bug only manifested for queries with 3+ stopwords, which we had not included in our pre-upgrade test suite.
```python
import asyncio
import logging

from fastapi import FastAPI, Request

app = FastAPI()
logger = logging.getLogger(__name__)
# shadow_retriever is assumed to be a retriever built against the candidate
# Haystack version and initialized at startup
shadow_retriever = ...


@app.middleware('http')
async def shadow_traffic_middleware(request: Request, call_next):
    response = await call_next(request)
    # Duplicate support bot queries to the shadow retriever
    if request.url.path == '/chat' and request.method == 'POST':
        try:
            # Note: reading the body in middleware relies on Starlette having
            # cached it; in production we buffered the body explicitly
            body = await request.json()
            query = body.get('query')
            tenant_id = body.get('tenant_id')
            if query and tenant_id:
                # Fire-and-forget via asyncio (BackgroundTasks is not available
                # in middleware) so shadow retrieval never blocks the response
                asyncio.create_task(
                    asyncio.to_thread(
                        shadow_retriever.retrieve,
                        query=query,
                        filters={'tenant_id': tenant_id},
                        top_k=5,
                    )
                )
        except Exception as e:
            logger.error(f'Shadow traffic failed: {e}')
    return response
```
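And a hedged sketch of the precision-drop monitor mentioned above, created through the datadog client library; the keys, the threshold (2% below our 89% baseline), and the PagerDuty handle are placeholders:

```python
from datadog import api, initialize

initialize(api_key='...', app_key='...')  # placeholder credentials

# Alert when any tenant's BM25 precision falls more than 2% below the 89% baseline
api.Monitor.create(
    type='metric alert',
    query='avg(last_15m):avg:retrieval.precision{retriever:bm25} by {tenant_id} < 0.87',
    name='Retrieval precision dropped >2% below baseline',
    message='BM25 precision regression detected. @pagerduty-search-oncall',
    options={'thresholds': {'critical': 0.87}},
)
```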
Tip 2: Instrument Multi-Tenant ML Systems with Per-Tenant Metrics
Multi-tenant systems hide regressions that only hit specific tenants, as we saw with the Haystack bug: the global BM25 term statistics skew was only significant for tenants with small document sets (like our fintech client’s 1000-document corpus) because larger tenants dominated the global stats. If we had per-tenant retrieval precision metrics, we would have spotted the fintech client’s precision drop immediately instead of waiting for a global CSAT decline.

Use a metrics library like Prometheus or Datadog to tag all retrieval metrics with tenant_id, then build dashboards showing per-tenant precision, latency, and error rates. Set up alerts for tenant-specific regressions: for example, if a tenant’s precision drops below 80% for 2 consecutive minutes, page the on-call engineer. We also added per-tenant logging for all retrieval requests, which let us reproduce the bug in 10 minutes once we had isolated the fintech client as the affected tenant.

For Haystack, you can extend the retriever class to emit per-tenant metrics by overriding the retrieve method, as shown in the snippet below. This adds minimal overhead (~2ms per request) but provides critical visibility into multi-tenant regressions. We now have 47 per-tenant metrics dashboards, which have helped us catch 2 other tenant-specific regressions in the past quarter.
```python
import time
from typing import Dict, List, Optional

import datadog
from haystack.schema import Document


class InstrumentedBM25Retriever(MultiTenantBM25Retriever):
    def retrieve(self, query: str, filters: Optional[Dict] = None, *args, **kwargs) -> List[Document]:
        tenant_id = filters.get('tenant_id') if filters else self.tenant_id
        start_time = time.time()
        try:
            results = super().retrieve(query, filters, *args, **kwargs)
            # Emit per-tenant precision (self._calculate_precision is our
            # internal heuristic scorer, omitted here)
            precision = self._calculate_precision(query, results)
            datadog.statsd.gauge(
                'retrieval.precision',
                precision,
                tags=[f'tenant_id:{tenant_id}', 'retriever:bm25'],
            )
            return results
        except Exception:
            datadog.statsd.increment(
                'retrieval.errors',
                tags=[f'tenant_id:{tenant_id}', 'retriever:bm25'],
            )
            raise
        finally:
            latency = (time.time() - start_time) * 1000
            datadog.statsd.histogram(
                'retrieval.latency_ms',
                latency,
                tags=[f'tenant_id:{tenant_id}', 'retriever:bm25'],
            )
```
Tip 3: Contribute Regression Tests to Open-Source Dependencies When You Find Bugs
When we found the Haystack 1.9 BM25 bug, our first instinct was to patch it internally and move on, but we decided to contribute the fix and a regression test to the Haystack repo (https://github.com/deepset-ai/haystack) instead. This took 4 hours of extra work, but it has spared other teams from hitting the same issue, and it means we no longer maintain a custom patch: we simply upgraded to the official 1.9.2 release.

The regression test we added runs the BM25 retriever on a multi-tenant document store with two tenants (one small, one large) and validates that retrieval precision for the small tenant is not affected by the large tenant’s documents. It now runs in every Haystack CI pipeline, so the bug cannot be silently reintroduced.

For open-source contributions, follow the project’s contribution guidelines: we forked the repo, created a feature branch, added the test and fix, ran the existing test suite to confirm no regressions, then opened a pull request with a detailed description of the bug, reproduction steps, and benchmark results. The Haystack maintainers merged the PR within 3 days, and it shipped in the 1.9.2 patch release. Contributing back reduces your maintenance burden, improves the ecosystem, and builds goodwill with maintainers. We now have a team policy of contributing regression tests for every open-source bug we find, which has led to 7 merged PRs across Haystack, FastAPI, and Elasticsearch over the past year.
```python
from haystack.schema import Document

# haystack_bm25_retriever and elasticsearch_store are pytest fixtures
# provided by the Haystack test suite's conftest.py


def test_bm25_multitenant_term_freq(haystack_bm25_retriever, elasticsearch_store):
    # Add tenant A (small: 10 password docs) and tenant B (large: 1000 loan docs)
    tenant_a_docs = [Document(content='reset password', meta={'tenant_id': 'A'}) for _ in range(10)]
    tenant_b_docs = [Document(content='apply for loan', meta={'tenant_id': 'B'}) for _ in range(1000)]
    elasticsearch_store.write_documents(tenant_a_docs + tenant_b_docs)
    # Retrieve for tenant A
    results = haystack_bm25_retriever.retrieve(
        query='reset password',
        filters={'tenant_id': 'A'},
        top_k=5,
    )
    # Validate all results are tenant A docs
    assert all(doc.meta['tenant_id'] == 'A' for doc in results)
    assert len(results) == 5
    # Validate the top result is relevant (fails under the 1.9.0 global-stats bug)
    assert 'password' in results[0].content.lower()
```
Join the Discussion
We’ve shared our war story about the Haystack 1.9 bug, but we’d love to hear from other engineers who have hit similar regressions in NLP or search systems. Have you ever had a silent dependency upgrade break your production ML system? What strategies do you use to validate search system upgrades?
Discussion Questions
- With the rise of RAG systems, do you expect search retriever regressions to become more or less common in the next 2 years?
- Is the trade-off of using open-source search frameworks like Haystack worth the risk of untested regressions compared to managed solutions like AWS Kendra?
- How does Haystack’s BM25 implementation compare to Elasticsearch’s built-in BM25 for multi-tenant use cases?
Frequently Asked Questions
Was the Haystack 1.9 bug a security vulnerability?
No, the bug was a functional regression, not a security issue. It caused irrelevant search results but did not expose data across tenants: our tenant filter was still enforced at the document store level, so tenant A could never retrieve tenant B’s documents. The bug only skewed the relevance score of tenant A’s own documents, making irrelevant ones rank higher. We verified this by auditing 1000 retrieval requests for cross-tenant data exposure and found zero instances.
Did you consider downgrading to Haystack 1.8 instead of patching 1.9?
We briefly considered downgrading, but Haystack 1.9 included critical improvements to stopword handling and retrieval latency that our client required for their Q4 roadmap. Downgrading would have delayed the roadmap by 6 weeks, as we would have had to reimplement the 1.9 stopword features on 1.8. Patching 1.9 took 2 days, including contributing the fix upstream, which was far more cost-effective. We also ran benchmarks showing that 1.9.2 with the fix was slightly faster than 1.8 for our workload (118ms vs 120ms p99).
How can I check if my Haystack deployment is affected by this bug?
If you are running Haystack 1.9.0 or 1.9.1, use the code from our second and third code examples to calculate BM25 scores with global versus filtered term statistics (a quick self-check follows below). If the scores differ by more than 5% for queries with stopwords on a multi-tenant document store, you are affected. Alternatively, upgrade to Haystack 1.9.2 or later, which includes the fix. You can check your installed version with `pip show farm-haystack` (note: the package was renamed `haystack-ai` for Haystack 2.x, so use `pip show haystack-ai` there).
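A quick self-check along those lines, assuming the two scorers are importable (module names are illustrative) and your tenant's documents are loaded into all_docs as dicts with content and tenant_id fields:

```python
from bm25_repro import BM25Scorer          # Code Example 2 (module name illustrative)
from bm25_fixed import FixedBM25Scorer     # Code Example 3 (module name illustrative)

buggy = BM25Scorer()
buggy.fit(all_docs)  # global stats across every tenant in your index
fixed = FixedBM25Scorer()

query = 'how do I reset my account password'
candidates = [d for d in all_docs if d['tenant_id'] == 'your_tenant_id']

buggy_top = max(buggy.calculate_scores(query, candidates))
fixed_top = max(fixed.calculate_scores(query, candidates))
if abs(buggy_top - fixed_top) / max(fixed_top, 1e-9) > 0.05:
    print('Likely affected: global vs filtered stats diverge by more than 5%')
```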
Conclusion & Call to Action
Dependency regressions in search and NLP systems are silent, high-impact, and easy to miss if you don’t have the right testing and monitoring in place. Our Haystack 1.9 war story cost us $47k and a temporary CSAT drop, but it taught us three critical lessons: always validate upgrades with shadow traffic, instrument per-tenant metrics for multi-tenant systems, and contribute regression tests upstream to open-source dependencies. If you’re using Haystack for a multi-tenant search or RAG system, audit your BM25 retriever’s term frequency calculation today, and upgrade to 1.9.2+ if you haven’t already. For teams building production search systems, we recommend a mandatory pre-upgrade checklist that includes shadow traffic testing, per-tenant metric validation, and edge case query coverage. The open-source ecosystem moves fast, but with rigorous engineering practices, you can avoid the same pitfalls we hit.
**$47k**: total SLA penalties from the Haystack 1.9 regression over 21 days