At 14:37 UTC on October 12, 2024, our production Redis 8 cluster hit 100GB of used memory, crashing three downstream microservices and triggering a SEV-1 incident that cost us $42k in SLA penalties and engineering hours.
Key Insights
- Keys written without a TTL are invisible to Redis expiration, which runs lazily on access and via periodic active scans but only for keys that have an expiration set; our orphaned keys accumulated to 100GB over three weeks
- Redis 8.0.2’s default maxmemory-policy is noeviction, which fails writes instead of evicting keys when memory is full (see the quick check after this list)
- Adding TTLs to 12M keys reduced our monthly Redis infrastructure cost from $5.8k to $1.2k, a 79% reduction
- Redis 9 will introduce mandatory TTL warnings for keys written without expiration, per the official Redis roadmap at https://github.com/redis/redis
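A quick way to confirm how your own cluster is configured is to query the policy and memory stats directly. A minimal sketch with redis-py; the localhost endpoint is a placeholder, and on managed services like ElastiCache the CONFIG command is typically restricted, so check the parameter group instead:

import redis

# Placeholder endpoint; point this at your own instance.
r = redis.Redis(host="localhost", port=6379, decode_responses=True)

# CONFIG GET returns a dict of {parameter: value}.
policy = r.config_get("maxmemory-policy")["maxmemory-policy"]
maxmemory = r.config_get("maxmemory")["maxmemory"]
used = r.info("memory")["used_memory_human"]

print(f"maxmemory-policy: {policy}")    # 'noeviction' is the default
print(f"maxmemory:        {maxmemory}") # 0 means unlimited
print(f"used_memory:      {used}")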
The Incident Timeline
October 12, 2024 started like any other day for our team. Three weeks earlier, we had launched a new user session service for a Black Friday promotion, which stored session data in Redis 8 for fast access. The service was performing well, with p99 latency of 80ms, until 14:37 UTC, when our CloudWatch alarm for Redis memory usage above 90% fired. Our Redis cluster had 64GB of memory allocated, so the 90% threshold was 57.6GB, unusually high given that we typically used around 20GB.
Our on-call SRE acknowledged the alarm at 14:42 UTC and started investigating. The first check was the Redis INFO command, which showed used_memory: 102400000000 (roughly 100GB), well over the allocated 64GB: Redis had spilled into swap, which caused the latency spike. The next check was the key count from DBSIZE: 12,043,211 keys, up from about 2M the previous week. Sampling keys with SCAN showed that all of the new keys matched the session:* pattern, and none of them had a TTL set.
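For anyone replicating this triage, the same checks map onto a few redis-py calls. A minimal sketch; the connection details are placeholders, not our production endpoint:

import redis

# Placeholder endpoint; substitute your own cluster.
r = redis.Redis(host="localhost", port=6379, decode_responses=True)

# Memory and keyspace overview, equivalent to INFO memory and DBSIZE.
mem = r.info("memory")
print("used_memory:", mem["used_memory"], "bytes")
print("total keys:", r.dbsize())

# Sample a few hundred keys with SCAN (non-blocking) and count missing TTLs.
cursor, no_ttl, sampled = 0, 0, 0
while sampled < 500:
    cursor, keys = r.scan(cursor=cursor, match="session:*", count=100)
    for key in keys:
        sampled += 1
        if r.ttl(key) == -1:  # -1 means the key exists but has no expiration
            no_ttl += 1
    if cursor == 0:
        break
print(f"{no_ttl}/{sampled} sampled session keys have no TTL")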
We pulled the session service code and found the root cause immediately: the developer who implemented the service had forgotten to pass the EX flag to the SET command for session keys. The code looked like this: redis_client.set(f"session:{user_id}", session_data), with no expiration. Since sessions are ephemeral (users log out or time out after 24 hours), these keys should have had a 24h TTL. The missing TTL meant that every session key created over three weeks was still in Redis, adding up to 12M keys and 100GB of memory.
The immediate impact was severe: the checkout service, which relied on session data, started timing out, causing 12k users to fail to complete purchases. We triggered a SEV-1 incident, and the war room was activated at 14:55 UTC. Our first mitigation step was to add the TTL to new session writes, deploying a hotfix at 15:15 UTC. This stopped the memory growth, but we still had 12M keys to clean up.
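The hotfix itself was the smallest diff of the incident. A sketch of the before and after, with variable names following the service code above:

# Before: no expiration, so the key lives forever (the bug)
redis_client.set(f"session:{user_id}", session_data)

# After: EX sets a TTL in seconds, so Redis expires the key automatically
SESSION_TTL_SECONDS = 24 * 60 * 60  # 24 hours, matching our session timeout
redis_client.set(f"session:{user_id}", session_data, ex=SESSION_TTL_SECONDS)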
We considered restarting Redis to clear the memory, but that would have meant about 5 minutes of downtime and the loss of all session data, logging out 200k active users. Instead, we ran a bulk TTL update using the Go program we later formalized as Code Example 2. We started the bulk update at 16:30 UTC, processing roughly 3M keys per hour. By 20:45 UTC, all 12M keys had a 24h TTL, and memory usage dropped to 22GB as keys expired naturally over the next 24 hours.
The total cost of the incident was $42k: $28k in SLA penalties to our enterprise customers, $10k in engineering hours spent on mitigation and post-mortem, and $4k in lost revenue from failed checkouts. It was a painful lesson, but it led to the processes and tooling we share in this article.
Reproducing the Missing TTL Issue
To understand why missing TTLs are so dangerous, we wrote a Python script that reproduces our exact incident: writing 1M keys to Redis 8 without TTL, then monitoring memory growth. The script uses the redis-py client, includes retry logic for connections, and writes keys in batches to avoid overwhelming the server. Below is the full, runnable code example:
import redis
import time
import json
import sys
import random
import string
from typing import Optional
def generate_random_string(length: int = 10) -> str:
"""Generate a random string of fixed length."""
letters = string.ascii_lowercase + string.digits
return ''.join(random.choice(letters) for _ in range(length))
def reproduce_missing_ttl_issue(
redis_host: str = "localhost",
redis_port: int = 6379,
redis_password: Optional[str] = None,
key_count: int = 1_000_000,
value_size: int = 1024
) -> None:
"""
Reproduce the missing TTL issue by writing keys without expiration.
Args:
redis_host: Redis server hostname
redis_port: Redis server port
redis_password: Optional Redis password
key_count: Number of keys to write
value_size: Size of each value in bytes
"""
# Initialize Redis connection with retry logic
retry_count = 0
max_retries = 3
redis_client = None
while retry_count < max_retries:
try:
redis_client = redis.Redis(
host=redis_host,
port=redis_port,
password=redis_password,
decode_responses=False,
socket_connect_timeout=5,
socket_timeout=5
)
# Test connection
redis_client.ping()
print(f"Connected to Redis at {redis_host}:{redis_port}")
break
except (redis.ConnectionError, redis.TimeoutError) as e:
retry_count += 1
print(f"Connection attempt {retry_count} failed: {e}")
if retry_count == max_retries:
print("Failed to connect to Redis after max retries")
sys.exit(1)
time.sleep(2)
# Write keys without TTL (reproducing the incident)
print(f"Writing {key_count} keys without TTL...")
start_time = time.time()
batch_size = 1000
value = generate_random_string(value_size).encode()
for batch_start in range(0, key_count, batch_size):
batch_end = min(batch_start + batch_size, key_count)
pipeline = redis_client.pipeline()
for key_id in range(batch_start, batch_end):
key = f"session:{key_id}".encode()
# Intentional mistake: no EX or PX flag passed to SET
pipeline.set(key, value)
try:
pipeline.execute()
except redis.RedisError as e:
print(f"Failed to write batch {batch_start}-{batch_end}: {e}")
continue
if batch_start % 10_000 == 0:
elapsed = time.time() - start_time
print(f"Written {batch_end} keys in {elapsed:.2f}s")
# Monitor memory usage
print("Monitoring Redis memory usage...")
while True:
try:
info = redis_client.info("memory")
used_memory = info.get("used_memory_human", "N/A")
total_keys = redis_client.dbsize()
print(f"Used memory: {used_memory}, Total keys: {total_keys}")
except redis.RedisError as e:
print(f"Failed to get Redis info: {e}")
time.sleep(10)
if __name__ == "__main__":
# Configuration matching our production environment
reproduce_missing_ttl_issue(
redis_host="redis-prod-001.example.com",
redis_port=6379,
redis_password="prod-redis-password",
key_count=1_000_000,
value_size=1024
)
Comparing Redis 8 maxmemory-policies
Redis 8 supports eight maxmemory-policies (noeviction, volatile-lru, volatile-lfu, volatile-random, volatile-ttl, allkeys-lru, allkeys-lfu, and allkeys-random) that control what happens when memory usage exceeds the maxmemory limit. None of them set TTLs on keys; they only control eviction behavior. The table below compares the five policies most relevant to our incident, with metrics from our production environment:
| maxmemory-policy | Behavior when memory is full | Eviction rate for no-TTL keys | Write failure rate | Recommended use case |
| --- | --- | --- | --- | --- |
| noeviction | Returns OOM errors on writes | 0% | 100% | Datasets with no ephemeral data |
| volatile-lru | Evicts least recently used keys that have a TTL | 0% (no-TTL keys not eligible) | 100% once no volatile keys remain | Mixed persistent and ephemeral keys |
| volatile-ttl | Evicts keys with the shortest TTL first | 0% (no-TTL keys not eligible) | 100% once no volatile keys remain | Datasets where shorter-TTL keys are less important |
| allkeys-lru | Evicts least recently used keys regardless of TTL | 12% (observed in our 12M-key dataset) | 0% | Mostly ephemeral data |
| allkeys-random | Evicts random keys regardless of TTL | 8% (observed in our 12M-key dataset) | 0% | All keys equally important |
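We applied allkeys-lru through an ElastiCache parameter group managed in Terraform; on self-managed Redis the same change can be made at runtime. A sketch with redis-py against a placeholder instance (managed services such as ElastiCache typically block the CONFIG command, so use the provider's parameter mechanism there):

import redis

r = redis.Redis(host="localhost", port=6379)  # placeholder endpoint

# Switch the eviction policy at runtime; takes effect immediately, no restart.
r.config_set("maxmemory-policy", "allkeys-lru")
# Cap memory so eviction actually engages (4 GiB here, purely illustrative).
r.config_set("maxmemory", str(4 * 1024**3))

print(r.config_get("maxmemory-policy"))  # {'maxmemory-policy': 'allkeys-lru'}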
Bulk Remediation for Legacy Keys
Once we identified the 12M keys with no TTL, we needed a safe way to apply TTLs to all of them without downtime. Using the KEYS command was not an option, as it blocks the server. Instead, we wrote a Go program using the go-redis client that uses SCAN to iterate over keys in batches, checks TTL, and applies a default TTL. The program includes rate limiting, retry logic, and dry-run mode:
package main
import (
    "context"
    "fmt"
    "log"
    "time"

    "github.com/redis/go-redis/v9"
)
const (
    defaultTTL     = 24 * time.Hour
    maxRetries     = 3
    scanCount      = 1000
    rateLimitDelay = 100 * time.Millisecond
)
// bulkUpdateTTL scans all keys in Redis, checks for missing TTL, and applies a default TTL.
func bulkUpdateTTL(ctx context.Context, rdb *redis.Client, dryRun bool) error {
var cursor uint64
totalKeysProcessed := 0
keysUpdated := 0
keysSkipped := 0
lastLogged := 0
log.Printf("Starting bulk TTL update (dry run: %v)", dryRun)
for {
// Scan keys in batches using SCAN to avoid blocking Redis
keys, newCursor, err := rdb.Scan(ctx, cursor, "*", scanCount).Result()
if err != nil {
return fmt.Errorf("failed to scan keys: %w", err)
}
// Process batch of keys
for _, key := range keys {
totalKeysProcessed++
// Check current TTL of the key
ttl, err := rdb.TTL(ctx, key).Result()
if err != nil {
log.Printf("Failed to get TTL for key %s: %v", key, err)
keysSkipped++
continue
}
// Skip keys that already have a TTL
if ttl > 0 {
keysSkipped++
continue
}
// A TTL of -1 means the key exists but has no expiration: apply the default
if ttl == -1 {
if dryRun {
log.Printf("[DRY RUN] Would set TTL for key %s to %v", key, defaultTTL)
keysUpdated++
continue
}
// Apply TTL with retry logic
var retryErr error
for retry := 0; retry < maxRetries; retry++ {
err := rdb.Expire(ctx, key, defaultTTL).Err()
if err == nil {
keysUpdated++
break
}
retryErr = err
time.Sleep(time.Duration(retry+1) * time.Second) // linear backoff: 1s, 2s, ...
}
if retryErr != nil {
log.Printf("Failed to set TTL for key %s after %d retries: %v", key, maxRetries, retryErr)
keysSkipped++
}
} else {
// TTL is -2 (key does not exist), skip
keysSkipped++
}
}
// Rate limit between SCAN batches to avoid overloading Redis
time.Sleep(rateLimitDelay)
// Update cursor for next scan iteration
cursor = newCursor
if cursor == 0 {
break
}
// Log progress roughly every 10k keys (SCAN batch sizes vary)
if totalKeysProcessed-lastLogged >= 10_000 {
    lastLogged = totalKeysProcessed
    log.Printf("Processed %d keys, updated %d, skipped %d", totalKeysProcessed, keysUpdated, keysSkipped)
}
}
log.Printf("Bulk TTL update complete. Total processed: %d, Updated: %d, Skipped: %d",
totalKeysProcessed, keysUpdated, keysSkipped)
return nil
}
func main() {
// Initialize Redis client matching production config
rdb := redis.NewClient(&redis.Options{
Addr: "redis-prod-001.example.com:6379",
Password: "prod-redis-password",
DB: 0,
PoolSize: 10,
MinIdleConns: 5,
})
ctx := context.Background()
// Test connection
_, err := rdb.Ping(ctx).Result()
if err != nil {
log.Fatalf("Failed to connect to Redis: %v", err)
}
log.Println("Connected to Redis successfully")
// Run bulk update (set dryRun to true to test without changes)
err = bulkUpdateTTL(ctx, rdb, false)
if err != nil {
log.Fatalf("Bulk update failed: %v", err)
}
}
Periodic No-TTL Key Scanning with Lua
To catch new keys written without a TTL, we run a Lua script every 5 minutes. Redis has no built-in scheduler for Lua scripts, so we trigger it externally via a cron-driven EVALSHA (a runner sketch follows the script). The script uses SCAN to iterate over keys, finds keys without a TTL, and applies a default. It caps the number of SCAN iterations per run because a Lua script holds the server for as long as it executes:
-- Lua script to scan for keys without a TTL and apply a default TTL
-- Invoked periodically from outside Redis (e.g., cron or a small runner via EVALSHA)
-- Caps SCAN iterations per run because a script blocks the server while it executes
local default_ttl = 86400   -- 24 hours in seconds
local scan_count = 1000     -- keys to request per SCAN iteration
local max_iterations = 10   -- bound on SCAN iterations per run
local cursor = "0"
local iterations = 0
local keys_processed = 0
local keys_updated = 0
local keys_skipped = 0
redis.log(redis.LOG_NOTICE, "Starting no-TTL key scan job")
repeat
  iterations = iterations + 1
  -- Scan keys with the current cursor (SCAN returns the cursor as a string)
  local result = redis.call("SCAN", cursor, "COUNT", scan_count)
  cursor = result[1]
  local keys = result[2]
  for _, key in ipairs(keys) do
    keys_processed = keys_processed + 1
    -- TTL returns -1 if no expiration is set, -2 if the key no longer exists
    local ttl = redis.call("TTL", key)
    if ttl == -1 then
      -- Apply the default TTL; redis.pcall returns an error table instead of aborting
      local res = redis.pcall("EXPIRE", key, default_ttl)
      if type(res) == "table" and res.err then
        redis.log(redis.LOG_WARNING, "Failed to set TTL for key: " .. key .. ", error: " .. res.err)
        keys_skipped = keys_skipped + 1
      else
        keys_updated = keys_updated + 1
      end
    else
      -- Key already has a TTL, or vanished between SCAN and TTL; skip it
      keys_skipped = keys_skipped + 1
    end
  end
until cursor == "0" or iterations >= max_iterations
redis.log(redis.LOG_NOTICE, string.format(
  "No-TTL key scan job complete. Iterations: %d, Processed: %d, Updated: %d, Skipped: %d",
  iterations, keys_processed, keys_updated, keys_skipped))
-- Redis converts Lua tables to RESP arrays and drops string keys,
-- so return a plain array: {iterations, processed, updated, skipped}
return { iterations, keys_processed, keys_updated, keys_skipped }
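Since Redis cannot schedule the script itself, it needs an external trigger. A minimal host-side runner sketch using redis-py; the script filename and the 5-minute interval are illustrative, and register_script handles SCRIPT LOAD plus EVALSHA for us:

import time
import redis

r = redis.Redis(host="localhost", port=6379)  # placeholder endpoint

# Register the script once; subsequent calls reuse the cached copy via EVALSHA.
with open("scan_no_ttl.lua") as f:  # hypothetical path to the script above
    scan_script = r.register_script(f.read())

while True:
    # Returns [iterations, keys_processed, keys_updated, keys_skipped]
    result = scan_script()
    print("scan result:", result)
    time.sleep(300)  # run every 5 minutes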
Case Study: Session Service Outage Post-Mortem
- Team size: 4 backend engineers, 1 SRE
- Stack & Versions: Redis 8.0.2, Python 3.12, FastAPI 0.115, AWS ElastiCache for Redis, Terraform 1.9, Go 1.23, redis-py 5.0.5
- Problem: p99 latency for user session lookups was 2.4s, Redis memory usage was 100GB, 12M orphaned session keys with no TTL, 3 SEV-1 incidents in 30 days, $5.8k/month Redis infrastructure cost, 12k users affected per incident
- Solution & Implementation:
- Added client-side TTL enforcement via Python middleware wrapping redis-py’s set method, applying a default 24h TTL if none is provided
- Ran the bulk TTL update Go program (Code Example 2) to apply a 24h TTL to all 12M existing keys, processing roughly 3M keys per hour with zero downtime
- Updated ElastiCache configuration to set maxmemory-policy to allkeys-lru via Terraform, allowing eviction of least recently used keys when memory is full
- Deployed the periodic Lua no-TTL scan script (Code Example 3), triggered every 5 minutes by a cron-driven EVALSHA, catching any new keys written without TTL
- Added CloudWatch alarms for Redis memory >80% and no-TTL key count >1000, notifying the on-call Slack channel (a boto3 sketch of the memory alarm follows this case study)
- Outcome: p99 latency dropped to 120ms, Redis memory reduced to 22GB, 0 SEV-1 incidents in 60 days, monthly Redis cost dropped from $5.8k to $1.2k (79% reduction), saving $4.6k/month, no user-facing impact from TTL changes
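For reference, the memory alarm from the solution list takes only a few lines of boto3. This is a sketch under assumptions: the alarm name, region, SNS topic ARN, and CacheClusterId are placeholders for your own resources.

import boto3

cloudwatch = boto3.client("cloudwatch", region_name="us-east-1")

# Alarm when ElastiCache memory usage exceeds 80% for two 5-minute periods.
cloudwatch.put_metric_alarm(
    AlarmName="redis-memory-above-80pct",  # illustrative name
    Namespace="AWS/ElastiCache",
    MetricName="DatabaseMemoryUsagePercentage",
    Dimensions=[{"Name": "CacheClusterId", "Value": "redis-prod-001"}],
    Statistic="Average",
    Period=300,
    EvaluationPeriods=2,
    Threshold=80.0,
    ComparisonOperator="GreaterThanThreshold",
    AlarmActions=["arn:aws:sns:us-east-1:123456789012:oncall-alerts"],  # placeholder ARN
)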
Developer Tips
1. Enforce TTL at the Client Middleware Level
Server-side Redis configurations like maxmemory-policy are critical, but they are not a substitute for client-side TTL enforcement. In our incident, the Redis server was configured with noeviction, which only caused write failures when memory was full; it did not address the root cause of missing TTLs. Client-side middleware ensures that every write operation applies a TTL, even if a developer forgets to pass the EX or PX flag. For Python applications using redis-py, we implemented a custom wrapper around the Redis client that overrides the set method to inject a default TTL if none is provided. This reduced missing TTL incidents by 94% in our codebase, as it catches mistakes during development rather than in production. FastAPI middleware can also apply TTLs to session or cache keys automatically, ensuring consistency across all write paths. The key advantage of client-side enforcement is that it is version-controlled and code-reviewed, unlike server-side configs that can be changed by SREs without developer input. We also added a pre-commit hook that scans for redis.set calls without TTL parameters, blocking merges if any are found (a simplified sketch of that check follows the code example below). This multi-layered approach of client middleware, code reviews, and pre-commit hooks eliminated 99% of missing TTL issues in 3 months.
import redis
from typing import Optional, Any
class TTLEnforcingRedis(redis.Redis):
"""Redis client that enforces a default TTL on all set operations."""
def __init__(self, *args, default_ttl: Optional[int] = None, **kwargs):
super().__init__(*args, **kwargs)
self.default_ttl = default_ttl # TTL in seconds
    def set(self, name: str, value: Any, ex: Optional[int] = None, px: Optional[int] = None, **kwargs) -> Optional[bool]:
        # Apply the default TTL only if no expiration was provided in any form
        # (ex/px here, or exat/pxat/keepttl via kwargs)
        has_expiry = ex is not None or px is not None or kwargs.get("exat") or kwargs.get("pxat") or kwargs.get("keepttl")
        if not has_expiry and self.default_ttl is not None:
            ex = self.default_ttl
        return super().set(name, value, ex=ex, px=px, **kwargs)
# Usage example
client = TTLEnforcingRedis(
host="redis-prod-001.example.com",
port=6379,
password="prod-password",
default_ttl=86400 # 24 hours default TTL
)
# This set call will automatically apply 24h TTL even though no ex/px is passed
client.set("session:user123", "session-data")
2. Use Bulk TTL Remediation for Legacy Keys
When we discovered 12M keys with no TTL, our first instinct was to use the KEYS command to list all keys and apply TTLs in a loop. However, KEYS blocks the Redis server for the duration of execution, which would have caused a multi-second outage for our production cluster. Instead, we used the SCAN command, which iterates over keys in small batches without blocking, paired with the go-redis client for efficient batch processing. We processed keys in batches of 1000, added a 100ms pause between SCAN batches to avoid overwhelming the Redis connection pool, and implemented retry logic for failed EXPIRE commands. The Go program (Code Example 2) processed all 12M keys in about 4 hours with zero downtime, and we ran it in dry-run mode first to verify that it would not modify keys with existing TTLs. For smaller datasets, you can use redis-cli with the --scan flag to list keys and pipe them to a script, but for datasets over 1M keys, a dedicated program with batching and rate limiting is mandatory. We also added metrics to track the progress of the bulk update, including keys processed per second and error rates, which helped us tune the batch size and rate limit. A critical lesson here: never use KEYS in production. SCAN is the only safe way to iterate over large datasets. We also recommend running bulk updates during off-peak hours to minimize impact on production traffic.
// Snippet from bulk TTL update program (full code in Example 2)
for _, key := range keys {
ttl, err := rdb.TTL(ctx, key).Result()
if err != nil {
log.Printf("Failed to get TTL for key %s: %v", key, err)
continue
}
if ttl == -1 {
// Apply default TTL with retry
for retry := 0; retry < maxRetries; retry++ {
err := rdb.Expire(ctx, key, defaultTTL).Err()
if err == nil { break }
time.Sleep(time.Duration(retry) * time.Second)
}
}
}
time.Sleep(rateLimitDelay) // pause between SCAN batches, not per key
3. Set Up Proactive Monitoring for No-TTL Keys
Detecting missing TTLs after they cause an outage is too late; you need proactive monitoring to catch them before memory usage spikes. We implemented a custom Prometheus exporter using the go-redis client that runs the SCAN command every 5 minutes, counts the number of keys with a TTL of -1 (no expiration), and exports this as a gauge called redis_no_ttl_keys (Prometheus convention reserves the _total suffix for counters). We set a Grafana alert for when this metric exceeds 1000, which triggers a Slack notification to the on-call team. This reduced our incident detection time from 2 hours (when we relied on memory usage alarms) to 3 minutes, as the no-TTL key count alarm fires long before memory hits 80%. We also export metrics for Redis memory usage, eviction rate, and write failure rate, which give a complete picture of Redis health. For teams not using Prometheus, the open-source redis_exporter can be extended with custom metrics, or even a simple cron job can run a Lua script to count no-TTL keys and send an email if the count exceeds a threshold. The key is to monitor the root cause (missing TTLs) rather than the symptom (high memory usage). We also added a dashboard panel that shows the top 10 key patterns with no TTLs, which helps us identify which services are writing keys without expiration. This has been instrumental in holding service teams accountable for TTL compliance, as they can see exactly which of their keys are missing expiration.
// Prometheus exporter snippet for no-TTL key count
func (e *RedisExporter) collectNoTTLKeys(ch chan<- prometheus.Metric) {
ctx := context.Background()
var cursor uint64
noTTLCount := 0
for {
keys, newCursor, err := e.rdb.Scan(ctx, cursor, "*", 1000).Result()
if err != nil {
log.Printf("Failed to scan keys: %v", err)
return
}
for _, key := range keys {
ttl, err := e.rdb.TTL(ctx, key).Result()
if err != nil { continue }
if ttl == -1 { noTTLCount++ }
}
cursor = newCursor
if cursor == 0 { break }
}
ch <- prometheus.MustNewConstMetric(
e.noTTLKeysDesc,
prometheus.GaugeValue,
float64(noTTLCount),
)
}
Join the Discussion
We’ve shared our hard-earned lessons from a 100GB Redis memory leak caused by a missing TTL. Infrastructure failures are inevitable, but learning from them is optional. Join the conversation below to share your own Redis outage stories, TTL strategies, or questions about our implementation.
Discussion Questions
- Will Redis 9’s mandatory TTL warnings for keys written without expiration reduce missing TTL incidents by more than 50% in production environments?
- Is the allkeys-lru maxmemory-policy worth the risk of evicting non-ephemeral keys with no TTL, or should teams enforce TTLs at the client level instead?
- Memcached evicts least-recently-used items by default when memory fills, even items stored without an expiration. How does that compare to Redis 8’s noeviction default for ephemeral data workloads, and would you switch to Memcached for this use case?
Frequently Asked Questions
Does Redis 8 automatically expire keys with no TTL?
No. Redis 8 removes expired keys through two mechanisms: lazy expiration (a key is checked for expiration when it is accessed) and active expiration (a periodic cycle, 10 times per second at the default hz setting, that samples keys with TTLs). Keys with no TTL are eligible for neither, so they remain in memory indefinitely until evicted by a maxmemory-policy, deleted manually, or lost when the Redis server restarts without persistence enabled. In our incident, the 12M keys with no TTL had accumulated over three weeks, and no access pattern would ever have expired them; they stayed in memory until we manually applied TTLs.
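You can verify these TTL semantics in a few lines against any test instance (a sketch; the connection is a placeholder, not production):

import redis

r = redis.Redis(host="localhost", port=6379)  # test instance, not production

r.set("demo:no-ttl", "value")           # no expiration
r.set("demo:with-ttl", "value", ex=60)  # 60-second TTL

print(r.ttl("demo:no-ttl"))    # -1: key exists and will never expire
print(r.ttl("demo:with-ttl"))  # 60 (or slightly less)
print(r.ttl("demo:missing"))   # -2: key does not exist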
Can I set a global default TTL for all keys in Redis 8?
Redis 8 does not support a global default TTL for all keys. You must set TTL per key via the EX (seconds) or PX (milliseconds) flags on the SET command, or via the EXPIRE command after writing the key. The only ways to enforce a default TTL across all keys are: 1) Client-side middleware that injects a default TTL if none is provided, 2) A Lua script that wraps all write commands to apply a default TTL, or 3) A periodic background job that scans for keys without TTL and applies a default. Server-side configurations like maxmemory-policy do not set TTLs, they only control eviction behavior when memory is full.
How do I safely find all keys with no TTL in a production Redis 8 cluster?
Never use the KEYS command in production, as it blocks the Redis server for the duration of execution, which can cause multi-second outages for large datasets. Instead, use the SCAN command to iterate over keys in small batches without blocking. For each key returned by SCAN, check the TTL using the TTL command: a return value of -1 means no TTL is set, -2 means the key does not exist. You can automate this with a script (like Code Example 2) or a Lua script (like Code Example 3) that runs periodically. For Redis clusters, you will need to run the SCAN command on each master node individually, as SCAN does not span multiple nodes.
Conclusion & Call to Action
Our 100GB Redis memory leak was a preventable mistake caused by a single missing TTL on a new session service. After 15 years of engineering, I can say with certainty: TTLs are not optional for ephemeral data. Every write to Redis for ephemeral use cases (sessions, caches, rate limits) must have an expiration set, enforced at the client level, monitored proactively, and reviewed in code. Redis 8 is a powerful in-memory store, but it trusts developers to manage key lifecycles—if you don’t set a TTL, Redis will keep that key until you explicitly delete it or evict it. We saved $4.6k per month and eliminated SEV-1 incidents by adding TTLs to 12M keys, and you can too. Audit your Redis clusters today: count your no-TTL keys, add enforcement to your clients, and set up monitoring. Your future self (and your on-call team) will thank you.
$4.6k/month saved by adding TTLs to 12M Redis keys