Postmortem: An Elasticsearch 8.11 Query Bug That Returned Wrong Results for 1 Hour
This postmortem details a production incident where Elasticsearch 8.11.0 returned incorrect query results for 60 minutes, impacting customer-facing search functionality. We cover the timeline, root cause, remediation, and prevention measures.
Incident Summary
- Date/Time: 2024-03-15, 14:00 – 15:00 UTC
- Duration: 1 hour
- Impact: ~12% of all search queries returned incomplete results; 3 customer-facing apps affected; ~2400 failed user searches
- Root Cause: Regression in index sort optimization logic for bool filter clauses with range queries in Elasticsearch 8.11.0
- Resolution: Rollback to 8.10.4, then upgrade to patched 8.11.1
Timeline
- 13:45 UTC: Production upgrade to Elasticsearch 8.11.0 completed across the 3-node cluster; all health checks passed
- 14:00 UTC: First customer report of missing search results for recent log entries
- 14:05 UTC: On-call engineer confirms the issue: queries for logs from the last hour return 0 results even though matching documents are known to exist
- 14:12 UTC: Issue isolated to indices with index sorting enabled on the @timestamp field, using bool filter queries with range clauses
- 14:20 UTC: Rollback to Elasticsearch 8.10.4 initiated
- 14:35 UTC: Rollback complete, all queries return correct results
- 14:40 UTC: Elasticsearch 8.11.1 (containing hotfix) deployed to staging, validated
- 15:00 UTC: 8.11.1 deployed to production, no regressions observed
Root Cause Analysis
Elasticsearch 8.11.0 introduced an optimization to the IndexSortOptimizer class, which skips evaluating filter clauses if index sorting guarantees that matching documents are contiguous. This optimization incorrectly assumed that range filter queries on the sorted field (@timestamp) would always be handled by the index sort, but failed to account for multiple filter clauses combined in a bool query.
For example, the following query would return incorrect results on 8.11.0 for indices with index sorting enabled on @timestamp:
{
  "query": {
    "bool": {
      "filter": [
        { "range": { "@timestamp": { "gte": "now-1h" } } },
        { "term": { "service.name": "api-gateway" } }
      ]
    }
  }
}
The optimizer incorrectly skipped the range filter clause entirely. Queries therefore returned every document matching the term filter regardless of timestamp or, in some cases, 0 results when the term filter matched documents outside the sorted range. The bug was introduced in PR #102345 (hypothetical), which aimed to improve query performance for sorted indices.
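To make the failure mode concrete, here is a toy model in plain Python. It is not Elasticsearch's code: the `Doc` type, both functions, and the `billing` service are hypothetical, and the sketch only models the first symptom described above (the range clause being dropped so that stale documents leak into the results).

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass(frozen=True)
class Doc:
    doc_id: str
    timestamp: datetime
    service: str


def correct_filter(docs, since, service):
    """Apply both filter clauses: range on timestamp AND term on service."""
    return [d for d in docs if d.timestamp >= since and d.service == service]


def buggy_filter(docs, since, service):
    """Toy model of the regression: the range clause is skipped entirely."""
    return [d for d in docs if d.service == service]


# Fixed "now" so the example is deterministic.
now = datetime(2024, 3, 15, 14, 5, tzinfo=timezone.utc)
docs = [
    Doc("a", now - timedelta(minutes=10), "api-gateway"),  # recent, matches
    Doc("b", now - timedelta(hours=3), "api-gateway"),     # too old
    Doc("c", now - timedelta(minutes=5), "billing"),       # wrong service
]
since = now - timedelta(hours=1)  # equivalent of "gte": "now-1h"

expected = [d.doc_id for d in correct_filter(docs, since, "api-gateway")]
observed = [d.doc_id for d in buggy_filter(docs, since, "api-gateway")]
print(expected)  # ['a']
print(observed)  # ['a', 'b']
```

A reference filter like `correct_filter` can double as a test oracle: any engine result that differs from it for the same documents points to a correctness regression rather than a performance change.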
Impact
Approximately 12% of all search queries were affected: those that targeted the 4 of our 12 production indices with index sorting enabled on @timestamp. The affected applications were our log search dashboard, customer support ticket search, and public API search endpoint. We observed ~2400 failed user searches, with no data loss or corruption.
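One way to find the exposed indices is to inspect each index's sort settings. The sketch below assumes the nested response shape of the get index settings API (`settings.index.sort.field`, which may be a string or a list); the function name and the sample index names are hypothetical.

```python
def indices_sorted_on(settings_response, field="@timestamp"):
    """Return the names of indices whose index sort includes `field`.

    `settings_response` is assumed to look like the parsed JSON body of
    GET /_settings: {index_name: {"settings": {"index": {...}}}}.
    """
    matches = []
    for index_name, body in settings_response.items():
        sort_cfg = body.get("settings", {}).get("index", {}).get("sort", {})
        sort_fields = sort_cfg.get("field", [])
        if isinstance(sort_fields, str):
            # A single sort field is reported as a plain string.
            sort_fields = [sort_fields]
        if field in sort_fields:
            matches.append(index_name)
    return sorted(matches)


# Hypothetical sample data, not our real index names.
sample_settings = {
    "logs-gateway": {
        "settings": {"index": {"sort": {"field": "@timestamp", "order": "desc"}}}
    },
    "logs-app": {
        "settings": {
            "index": {
                "sort": {
                    "field": ["service.name", "@timestamp"],
                    "order": ["asc", "desc"],
                }
            }
        }
    },
    "tickets": {"settings": {"index": {"number_of_shards": "1"}}},
}

print(indices_sorted_on(sample_settings))  # ['logs-app', 'logs-gateway']
```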
Remediation
- Immediate rollback to Elasticsearch 8.10.4, which resolved the incorrect results issue within 15 minutes of initiation.
- Validated the hotfix in Elasticsearch 8.11.1 (released 2 days prior to the incident), which correctly handles filter clauses in the index sort optimizer.
- Deployed 8.11.1 to production after 20 minutes of staging validation, with no regressions.
- Published a status page update to customers at 14:45 UTC, confirming resolution by 15:00 UTC.
Prevention Measures
- Added integration tests to our Elasticsearch upgrade pipeline that validate query result correctness for indices with index sorting enabled, including bool filter combinations with range and term queries.
- Improved canary deployment checks to run a sample of production queries against the canary cluster and compare result sets to the stable cluster before full rollout.
- Added a new monitoring metric: es_query_result_count_delta, which alerts if the result count for a sample of known queries deviates by more than 5% from the expected value.
- Updated our upgrade policy to require 48 hours of staging validation for all Elasticsearch minor version upgrades, up from 24 hours previously.
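The canary comparison and the result count check can be sketched as two small pure functions. This is a minimal illustration, not our production implementation: only the metric name es_query_result_count_delta and the 5% threshold come from the measures above, while the function names and sample values are hypothetical, and fetching the hits from each cluster is left out.

```python
def result_count_delta_pct(expected_count, observed_count):
    """Absolute deviation of observed from expected, as a percentage."""
    if expected_count == 0:
        # No baseline to divide by: any hit at all is an unbounded deviation.
        return 0.0 if observed_count == 0 else float("inf")
    return abs(observed_count - expected_count) * 100.0 / expected_count


def should_alert(expected_count, observed_count, threshold_pct=5.0):
    """True when the deviation exceeds the alert threshold (5% by default)."""
    return result_count_delta_pct(expected_count, observed_count) > threshold_pct


def compare_hits(stable_ids, canary_ids):
    """Diff the document IDs returned by the stable and canary clusters."""
    stable, canary = set(stable_ids), set(canary_ids)
    return {
        "missing": sorted(stable - canary),     # stable returned, canary did not
        "unexpected": sorted(canary - stable),  # canary returned, stable did not
    }


# Hypothetical sample: the canary drops two of three expected hits.
diff = compare_hits(["a", "b", "c"], ["a"])
print(diff)                   # {'missing': ['b', 'c'], 'unexpected': []}
print(should_alert(3, 1))     # True
print(should_alert(100, 97))  # False
```

Comparing document IDs rather than counts alone matters here: a regression that swaps correct hits for stale ones can leave the count unchanged while the result set is wrong.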
Conclusion
This incident highlighted a gap in our upgrade testing for edge cases involving index sorting and bool filter queries. The quick rollback and existing monitoring helped limit the impact to 1 hour, but we have since strengthened our testing and validation processes to prevent similar regressions in the future. We thank our customers for their patience during the incident.