ANKUSH CHOUDHARY JOHAL

Posted on • Originally published at johal.in

Postmortem: An Elasticsearch 8.11 Query Bug That Returned Wrong Results for 1 Hour

This postmortem details a production incident where Elasticsearch 8.11.0 returned incorrect query results for 60 minutes, impacting customer-facing search functionality. We cover the timeline, root cause, remediation, and prevention measures.

Incident Summary

  • Date/Time: 2024-03-15, 14:00 – 15:00 UTC
  • Duration: 1 hour
  • Impact: ~12% of all search queries returned incorrect results (missing or unexpected documents); 3 customer-facing apps affected; ~2,400 failed user searches
  • Root Cause: Regression in index sort optimization logic for bool filter clauses with range queries in Elasticsearch 8.11.0
  • Resolution: Rollback to 8.10.4, then upgrade to patched 8.11.1

Timeline

  • 13:45 UTC: Production upgrade to Elasticsearch 8.11.0 completed across 3-node cluster, all health checks pass
  • 14:00 UTC: First customer report of missing search results for recent log entries
  • 14:05 UTC: On-call engineer confirms the issue: queries for logs from the last hour return 0 results, even when known matching documents exist
  • 14:12 UTC: Issue isolated to indices with index sorting enabled on the @timestamp field, using bool filter queries with range clauses
  • 14:20 UTC: Rollback to Elasticsearch 8.10.4 initiated
  • 14:35 UTC: Rollback complete, all queries return correct results
  • 14:40 UTC: Elasticsearch 8.11.1 (containing the hotfix) deployed to staging for validation
  • 15:00 UTC: 8.11.1 deployed to production, no regressions observed

Root Cause Analysis

Elasticsearch 8.11.0 introduced an optimization to the IndexSortOptimizer class, which skips evaluating filter clauses if index sorting guarantees that matching documents are contiguous. This optimization incorrectly assumed that range filter queries on the sorted field (@timestamp) would always be handled by the index sort, but failed to account for multiple filter clauses combined in a bool query.
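
For context, index sorting is configured when an index is created, through the index.sort.field and index.sort.order settings. The sketch below builds such a request body in Python. The sort order and the mapping are assumptions chosen for illustration, not our production configuration, and sorted_index_body is a hypothetical helper name.

```python
import json


def sorted_index_body(sort_field: str = "@timestamp", order: str = "desc") -> dict:
    """Build an index-creation body with index sorting on one date field.

    The index.sort.* settings are static, so they have to be supplied at
    index creation time. The order and mapping here are illustrative assumptions.
    """
    return {
        "settings": {
            "index": {
                "sort.field": sort_field,
                "sort.order": order,
            }
        },
        "mappings": {
            "properties": {
                sort_field: {"type": "date"},
                "service": {"properties": {"name": {"type": "keyword"}}},
            }
        },
    }


if __name__ == "__main__":
    print(json.dumps(sorted_index_body(), indent=2))
```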

For example, the following query would return incorrect results on 8.11.0 for indices with index sorting enabled on @timestamp:

{
  "query": {
    "bool": {
      "filter": [
        { "range": { "@timestamp": { "gte": "now-1h" } } },
        { "term": { "service.name": "api-gateway" } }
      ]
    }
  }
}

The optimizer incorrectly skipped the range filter clause entirely. Depending on the data, the query either returned every document matching the term filter regardless of timestamp, or returned 0 results when the term filter matched documents outside the sorted range. The bug was introduced in PR #102345 (hypothetical), which aimed to improve query performance for sorted indices.
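
To make the difference concrete, here is a toy in-memory model of the query above. It is not Elasticsearch code. It only illustrates that a bool filter must apply every clause, and what changes when the range clause is dropped. The sample documents, the fixed NOW value, and all function names are hypothetical.

```python
from datetime import datetime, timedelta, timezone

# Fixed reference time so the example is reproducible (stands in for "now").
NOW = datetime(2024, 3, 15, 14, 5, tzinfo=timezone.utc)

# Hypothetical sample documents, not production data.
DOCS = [
    {"id": 1, "@timestamp": NOW - timedelta(minutes=10), "service.name": "api-gateway"},
    {"id": 2, "@timestamp": NOW - timedelta(hours=3), "service.name": "api-gateway"},
    {"id": 3, "@timestamp": NOW - timedelta(minutes=20), "service.name": "billing"},
]


def in_last_hour(doc: dict) -> bool:
    """Models the range clause: @timestamp >= now-1h."""
    return doc["@timestamp"] >= NOW - timedelta(hours=1)


def is_api_gateway(doc: dict) -> bool:
    """Models the term clause: service.name == api-gateway."""
    return doc["service.name"] == "api-gateway"


def search(docs: list, clauses: list) -> list:
    """Bool filter semantics: keep a document only if every clause accepts it."""
    return [d["id"] for d in docs if all(clause(d) for clause in clauses)]


# Correct behavior: both clauses are applied.
expected = search(DOCS, [in_last_hour, is_api_gateway])
# Models the regression: the range clause is silently skipped.
range_skipped = search(DOCS, [is_api_gateway])

print(expected)       # [1]
print(range_skipped)  # [1, 2]
```

Document 2 is three hours old, so it should never appear in a "last hour" query. This model covers only the "extra documents" symptom; the 0-result symptom depends on optimizer internals that it does not attempt to reproduce.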

Impact

Approximately 12% of all search queries were affected. The bug was confined to the 4 of our 12 production indices that had index sorting enabled on @timestamp. The affected applications were our log search dashboard, customer support ticket search, and public API search endpoint. We observed ~2,400 failed user searches, with no data loss or corruption.

Remediation

  1. Immediate rollback to Elasticsearch 8.10.4, which resolved the incorrect results issue within 15 minutes of initiation.
  2. Validated the hotfix in Elasticsearch 8.11.1 (released 2 days prior to the incident) which correctly handles filter clauses in the index sort optimizer.
  3. Deployed 8.11.1 to production after 20 minutes of staging validation, with no regressions.
  4. Published a status page update to customers at 14:45 UTC, confirming resolution by 15:00 UTC.

Prevention Measures

  • Added integration tests to our Elasticsearch upgrade pipeline that validate query result correctness for indices with index sorting enabled, including bool filter combinations with range and term queries.
  • Improved canary deployment checks to run a sample of production queries against the canary cluster and compare result sets to the stable cluster before full rollout.
  • Added a new monitoring metric: es_query_result_count_delta, which alerts if the result count for a sample of known queries deviates by more than 5% from the expected value.
  • Updated our upgrade policy to require 48 hours of staging validation for all Elasticsearch minor version upgrades, up from 24 hours previously.
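
The canary comparison and the result-count check can be sketched roughly as follows. This is a minimal illustration, not our actual pipeline code: it assumes search responses shaped like Elasticsearch's hits.hits array, and the function names are hypothetical. Only the 5% threshold comes from the alert rule described above.

```python
def result_ids(response: dict) -> set:
    """Collect document IDs from a search response shaped like hits.hits."""
    return {hit["_id"] for hit in response["hits"]["hits"]}


def compare_results(stable: dict, canary: dict) -> dict:
    """Diff the ID sets returned by the stable and canary clusters for one query."""
    stable_ids, canary_ids = result_ids(stable), result_ids(canary)
    return {
        "missing_in_canary": sorted(stable_ids - canary_ids),
        "extra_in_canary": sorted(canary_ids - stable_ids),
        "match": stable_ids == canary_ids,
    }


def count_delta_exceeded(expected: int, actual: int, threshold: float = 0.05) -> bool:
    """Return True when the result count deviates from the expected value by more than threshold."""
    if expected == 0:
        # No baseline to divide by: any returned result counts as a deviation.
        return actual != 0
    return abs(actual - expected) / expected > threshold


if __name__ == "__main__":
    stable = {"hits": {"hits": [{"_id": "a"}, {"_id": "b"}]}}
    canary = {"hits": {"hits": [{"_id": "b"}, {"_id": "c"}]}}
    print(compare_results(stable, canary))
    print(count_delta_exceeded(expected=100, actual=94))
```

Comparing ID sets rather than counts alone matters here, because a skipped filter clause can return a similar number of documents that are nevertheless the wrong ones.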

Conclusion

This incident highlighted a gap in our upgrade testing for edge cases involving index sorting and bool filter queries. The quick rollback and existing monitoring helped limit the impact to 1 hour, but we have since strengthened our testing and validation processes to prevent similar regressions in the future. We thank our customers for their patience during the incident.
