Ankush Choudhary Johal

Originally published at johal.in

Postmortem: A Redis 8.0 OOM Error Caused Our E-Commerce Site to Crash on Black Friday

Published: December 2, 2024 | Incident Date: November 29, 2024 | Authors: Site Reliability Engineering Team

Executive Summary

On November 29, 2024 (Black Friday), our e-commerce platform experienced a total outage lasting 47 minutes from 09:12 to 09:59 UTC, caused by an Out-Of-Memory (OOM) error in our Redis 8.0 cluster. The incident prevented 92% of active shoppers from completing checkouts, resulting in an estimated $2.1M in lost revenue. This postmortem details the incident timeline, root cause, resolution steps, and actionable lessons for engineering teams running Redis 8.0 in production.

Incident Timeline (All Times UTC)

  • 08:45: Black Friday traffic spikes to 3x normal peak load; Redis 8.0 cluster memory usage hits 78% of allocated maxmemory.
  • 09:02: Engineering team receives PagerDuty alert for Redis node memory usage exceeding 85% threshold.
  • 09:08: On-call SRE confirms Redis 8.0 primary node is at 94% memory utilization; attempts to reclaim memory via redis-cli MEMORY PURGE and manual key deletion fail to free memory in time, due to Redis 8.0's new lazy eviction default for large objects.
  • 09:12: Primary Redis node triggers OOM killer; cluster fails over to secondary node, which immediately hits OOM as it syncs unevicted keys from the failed primary.
  • 09:14: All Redis cluster nodes are OOM-killed; application servers lose all cache and session state, returning 502/503 errors to users.
  • 09:22: Incident declared SEV-1; full incident response team assembles.
  • 09:35: Team identifies Redis 8.0's new Native JSON module (enabled for product catalog caching) as the primary memory consumer, with 12GB of unexpired JSON blobs stored without TTL.
  • 09:42: Temporary fix applied (runtime commands sketched after this timeline): scale the Redis cluster from 3 to 6 nodes, doubling total maxmemory, and set maxmemory-policy allkeys-lru, replacing the effectively noeviction behavior explained under Root Cause Analysis below.
  • 09:59: All services restored; cache warmup completes by 10:12 UTC.
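For reference, the eviction-policy half of the 09:42 mitigation can be applied at runtime without a restart. A minimal sketch using redis-py (the hostname is a placeholder, and the node scaling itself happened through our infrastructure tooling, not shown here):

```python
import redis

# Placeholder connection details; run once per cluster node.
r = redis.Redis(host="redis-node-1.internal", port=6379, decode_responses=True)

# Check what the node is currently running with.
print(r.config_get("maxmemory-policy"))  # e.g. {'maxmemory-policy': 'volatile-lru'}

# Allow LRU eviction across ALL keys, not just those carrying a TTL.
r.config_set("maxmemory-policy", "allkeys-lru")

# Persist the runtime change into redis.conf so it survives a restart
# (requires the server to have been started with a config file).
r.config_rewrite()
```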

Root Cause Analysis

We upgraded our Redis deployment from 7.2 to 8.0 on November 15, 2024, to leverage two new Redis 8.0 features:

  1. Native JSON support (v2.0 module bundled with Redis 8.0) for caching product catalog data with nested attributes (write path illustrated after this list).
  2. Redis Query Engine (RQE) for real-time personalized recommendation queries against cached user behavior data.
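For context, catalog writes through the bundled JSON module looked roughly like the sketch below (redis-py's JSON command interface; the key name and document shape are our illustrations, not the actual schema). Note that this is the pre-incident write path: no TTL is attached, which becomes root cause 2 below.

```python
import redis

r = redis.Redis(host="localhost", port=6379)

# Illustrative document; real catalog entries were larger nested objects.
product = {
    "sku": "SKU-48213",
    "name": "Wireless Headphones",
    "price": 129.99,
    "attributes": {"color": "black", "battery_hours": 30},
}

# JSON.SET at the root path (requires the JSON module on the server).
# Pre-incident, writes stopped here: no EXPIRE, so keys never aged out.
r.json().set("product:SKU-48213", "$", product)

# Nested reads use JSONPath instead of deserializing the whole blob.
price = r.json().get("product:SKU-48213", "$.price")  # -> [129.99]
```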

Two misconfigurations and one unexpected behavior in Redis 8.0 combined to cause the OOM:

  1. Default maxmemory-policy change: We incorrectly assumed Redis 8.0's default maxmemory-policy was noeviction (matching 7.2) and did not explicitly set the policy in our Redis config. Redis 8.0 actually defaults to volatile-lru only when the Redis Query Engine is enabled, but we had not set any TTLs on JSON keys, so no keys were volatile, making the eviction policy effectively noeviction.
  2. Missing TTLs on JSON cache keys: The product catalog team stored full JSON product objects in Redis 8.0 without TTLs, assuming the cache would evict old entries automatically. Since no keys had TTLs, the volatile-lru policy could not evict any keys, leading to unbounded memory growth (an audit sketch after this list shows how to detect this).
  3. Redis 8.0 memory overhead for JSON: The Native JSON module in Redis 8.0 uses a new binary storage format that has 22% higher memory overhead than the JSON-as-string storage we used in Redis 7.2. We did not adjust our maxmemory allocation to account for this overhead, so our 16GB per node allocation was insufficient for the same key count.
  4. Lazy eviction failure: When memory hit 94%, we attempted to manually evict keys, but Redis 8.0's default lazy eviction for objects larger than 4KB (a new 8.0 feature) meant eviction commands were queued asynchronously and could not free memory fast enough to prevent OOM.
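Root causes 1 and 2 compound: under any volatile-* policy, a key without a TTL is never an eviction candidate, so the policy silently degrades to noeviction. A short audit along these lines (hypothetical product: prefix; SCAN-based so it does not block the server) would have caught the trap before Black Friday:

```python
import redis

r = redis.Redis(host="localhost", port=6379, decode_responses=True)

# What policy is this node actually running?
policy = r.config_get("maxmemory-policy")["maxmemory-policy"]

# TTL returns -1 for keys that exist but have no expiration set. Under a
# volatile-* policy, such keys can never be evicted.
no_ttl = total = 0
for key in r.scan_iter(match="product:*", count=1000):
    total += 1
    if r.ttl(key) == -1:
        no_ttl += 1

print(f"policy={policy}: {no_ttl}/{total} product keys have no TTL")
if policy.startswith("volatile-") and no_ttl == total:
    print("WARNING: no evictable keys; policy is effectively noeviction")
```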

Resolution Steps

We executed the following steps to restore service:

  1. Scaled the Redis cluster horizontally from 3 nodes (16GB maxmemory each) to 6 nodes (16GB each), doubling total available memory to 96GB.
  2. Explicitly set maxmemory-policy allkeys-lru in all Redis configs to ensure keys are evicted regardless of TTL when memory is full.
  3. Added a mandatory 1-hour TTL to all JSON product catalog keys, with a background job to refresh hot keys before expiration (sketched after this list).
  4. Disabled Redis 8.0's lazy eviction via lazyfree-lazy-eviction no so that evictions free memory synchronously, allowing immediate reclamation during incidents.
  5. Deployed a Redis Exporter sidecar to all nodes to surface per-module memory usage (JSON, RQE, core) in our Grafana dashboards, which we had not configured pre-incident.
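A minimal sketch of step 3, assuming redis-py; fetch_product is a stand-in for our catalog service lookup, and the hot-key list and refresh window are illustrative:

```python
import redis

r = redis.Redis(host="localhost", port=6379)

CATALOG_TTL = 3600     # mandatory 1-hour TTL on every catalog key
REFRESH_WINDOW = 300   # refresh hot keys expiring within 5 minutes


def fetch_product(sku: str) -> dict:
    # Stand-in for the real catalog service call.
    return {"sku": sku}


def cache_product(sku: str, product: dict) -> None:
    """Write a catalog entry with the mandatory TTL attached."""
    key = f"product:{sku}"
    r.json().set(key, "$", product)
    r.expire(key, CATALOG_TTL)


def refresh_hot_keys(hot_skus: list[str]) -> None:
    """Background job: re-cache hot keys before their TTL runs out."""
    for sku in hot_skus:
        ttl = r.ttl(f"product:{sku}")
        # -2 means the key already expired; 0..REFRESH_WINDOW means it is about to.
        if ttl == -2 or 0 <= ttl < REFRESH_WINDOW:
            cache_product(sku, fetch_product(sku))
```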

Impact Assessment

  • Downtime: 47 minutes total; 92% of active users affected.
  • Revenue Loss: Estimated $2.1M in unprocessed orders; 14% of Black Friday total revenue.
  • User Trust: 7% increase in support tickets related to checkout failures; 3% increase in churn among first-time Black Friday shoppers.
  • Internal Impact: 12 SRE and engineering staff pulled from planned Black Friday feature work to respond to the incident.

Lessons Learned

What Went Wrong

  • Insufficient load testing of Redis 8.0 upgrade: We tested functional correctness but not memory utilization under peak load.
  • Reliance on default Redis 8.0 configuration: We did not audit new 8.0 defaults (eviction policy, lazy eviction, JSON memory overhead) before deploying to production.
  • Poor observability: We did not have per-module memory metrics for Redis 8.0, so we could not identify the JSON module as the memory hog until 23 minutes into the incident (a stopgap sampling approach is sketched after this list).
  • Missing TTLs on cache keys: The product team did not follow our caching standards (mandatory TTLs for all non-session keys) for the new JSON cache.
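As a stopgap until per-module dashboards land, attributing memory by key prefix with MEMORY USAGE over a bounded SCAN sample (prefixes and sample size here are illustrative; MEMORY USAGE is itself an estimate that includes per-key overhead) gets an answer in minutes rather than 23:

```python
from collections import defaultdict

import redis

r = redis.Redis(host="localhost", port=6379, decode_responses=True)

# Headline number first.
print("used_memory_human:", r.info("memory")["used_memory_human"])

# Attribute memory to key prefixes over a bounded sample.
SAMPLE_LIMIT = 10_000
usage: dict[str, int] = defaultdict(int)
for i, key in enumerate(r.scan_iter(count=1000)):
    if i >= SAMPLE_LIMIT:
        break
    prefix = key.split(":", 1)[0]
    usage[prefix] += r.memory_usage(key) or 0

for prefix, total in sorted(usage.items(), key=lambda kv: -kv[1]):
    print(f"{prefix:>12}: {total / 1048576:.1f} MiB (sampled)")
```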

What Went Right

  • Redis cluster failover triggered as designed, although it could not save us because the OOM condition propagated to the secondaries during sync.
  • Incident response team assembled within 8 minutes of SEV-1 declaration.
  • We had recent Redis cluster backups, so we did not lose persistent data (though cache state was lost).

Action Items

| Action Item | Owner | Deadline | Status |
| --- | --- | --- | --- |
| Audit all Redis 8.0 configs for non-default production settings; document all custom configs in internal wiki | SRE Team | Dec 9, 2024 | In Progress |
| Add mandatory TTL enforcement to all Redis write paths via custom Redis module (client-side interim sketch below) | Backend Team | Dec 16, 2024 | Not Started |
| Run full peak-load stress test on Redis 8.0 cluster with 2x Black Friday traffic | QA Team | Dec 23, 2024 | Not Started |
| Add per-module Redis memory metrics to all production dashboards | Observability Team | Dec 12, 2024 | In Progress |
| Create Redis 8.0 upgrade runbook with pre-flight checks for memory, eviction, and module overhead | SRE Team | Dec 9, 2024 | In Progress |
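The real TTL enforcement will live server-side as a custom module, but the policy is easy to state client-side in the meantime. A hypothetical interim wrapper (the names and the session-key exemption are our illustration, following the caching standard that only session keys may skip TTLs):

```python
import redis

SESSION_PREFIX = "session:"  # per our caching standard, only session keys may skip TTLs


class TTLEnforcedCache:
    """Client-side interim guard: refuse non-session writes without a TTL."""

    def __init__(self, client: redis.Redis):
        self._r = client

    def set_json(self, key: str, doc: dict, ttl_seconds: int | None = None) -> None:
        if ttl_seconds is None and not key.startswith(SESSION_PREFIX):
            raise ValueError(f"refusing to write {key!r} without a TTL")
        self._r.json().set(key, "$", doc)
        if ttl_seconds is not None:
            self._r.expire(key, ttl_seconds)


cache = TTLEnforcedCache(redis.Redis(host="localhost", port=6379))
cache.set_json("product:SKU-48213", {"price": 129.99}, ttl_seconds=3600)
```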

Conclusion

This incident highlights the risks of upgrading critical infrastructure components like Redis without thorough testing of new defaults and resource utilization. Redis 8.0's new features deliver significant value for our use case, but we failed to account for their memory footprint and eviction behavior. By implementing the action items above, we aim to prevent similar OOM incidents during peak traffic events in 2025.

Have questions about our Redis 8.0 setup? Reach out to the SRE team at sre-team@example.com.
