Beyond TTL: Explicit Cache Invalidation

#systemdesign #caching #distributedsystems #backendengineering

Most engineers default to TTL for cache invalidation. But a simple Time-To-Live allows stale reads for up to the full TTL window after every write, which is unacceptable for critical data that demands immediate read-after-write consistency.

Relying solely on TTL can lead to frustrating production incidents and customer trust issues. Imagine a user updating their profile picture, only for their friends to see the old one for another five minutes. Or worse, a customer checks their bank balance after a transfer, sees the old amount, and concludes the transaction failed. These aren't edge cases; they're common failures when caching is poorly integrated with data updates, directly impacting user experience and potentially costing real money.

For data that demands read-after-write consistency, you cannot rely on time-based expiry, because the entry stays stale until the timer runs out no matter when the write happened. You need explicit invalidation. The application must actively remove or update a cache entry whenever the underlying data changes in the source of truth (typically a database).

The most robust and common pattern is "Cache Aside with explicit invalidation." Here’s the flow:

1. Read: Application tries to read data from the cache.
- Cache Hit: Return data immediately.
- Cache Miss: Application reads data from the database, stores it in the cache, then returns it.
2. Write (Update/Delete): Application writes (updates or deletes) data directly to the database.
3. Invalidate: Immediately after the successful database write, the application explicitly deletes the corresponding entry from the cache. This ensures the next read will be a cache miss, fetching the fresh data from the database.

With this strategy, any read that starts after the invalidation completes misses the cache and fetches fresh data from the database. A read that arrives in the short window between the database commit and the cache delete can still return the old value. For strict read-after-write consistency, the invalidation must therefore complete before the write is acknowledged to the client.
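The read, write, and invalidate steps above can be sketched in a few lines of Python. This is a minimal illustration, not a production implementation: the `CacheAsideStore` class is hypothetical, plain dictionaries stand in for the database and the cache, and `db_reads` exists only to make cache misses visible.

```python
from typing import Dict, Optional


class CacheAsideStore:
    """Hypothetical cache-aside wrapper; dicts stand in for the DB and the cache."""

    def __init__(self, database: Dict[str, str]) -> None:
        self.database = database
        self.cache: Dict[str, str] = {}
        self.db_reads = 0  # counts cache misses that reached the database

    def read(self, key: str) -> Optional[str]:
        if key in self.cache:
            return self.cache[key]        # cache hit
        self.db_reads += 1
        value = self.database.get(key)    # cache miss: read the source of truth
        if value is not None:
            self.cache[key] = value       # populate the cache for later reads
        return value

    def write(self, key: str, value: str) -> None:
        self.database[key] = value        # 1. persist to the source of truth
        self.cache.pop(key, None)         # 2. explicit invalidation (no-op if absent)
```

Calling `write` and then `read` for the same key always goes back to the database once, which is the behavior the pattern relies on. Deleting rather than updating the cache entry keeps the write path simple and avoids caching a value that is never read.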

Here's an ASCII diagram for a single application instance interacting with a single cache instance:

+--------+      +----------+      +----------+
| Client | <--> |    App   | <--> | Database |
+--------+      +----------+      +----------+
                    | ^
                    | | (1. Cache Miss -> Read DB)
                    v | (2. Store in Cache)
                  +-----+
                  |Cache|
                  +-----+
                      ^
                      | (3. Invalidate on Write)

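In a distributed system with multiple application instances and multiple cache nodes, explicit invalidation becomes more complex. With a sharded cache (e.g., a Redis cluster or a Memcached fleet), each key lives on one shard, so a single delete reaches it. The harder case is when copies of the same entry exist in several places, such as in-process caches inside each application instance or independent cache clusters per region. Deleting one copy is not enough; you need to ensure every cache holding that entry invalidates it. This is typically handled by broadcasting invalidations through a distributed messaging system (like Kafka, RabbitMQ, or Redis Pub/Sub) or by deriving them from an event stream of database changes.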

+--------+      +----------+      +----------+
| Client | <--> |  App 1   | <--> | Database |
+--------+      +----------+      +----------+
                     |
                     | (1. Publish invalidation after DB write)
                     v
            +---------------------+
            |    Invalidation     |
            |     Message Bus     |
            |    (e.g., Kafka)    |
            +---------------------+
               |       |       |
               |       |       | (2. Subscribers delete the key)
               v       v       v
            +-----+ +-----+ +-----+
            |Cache| |Cache| |Cache|
            | Node| | Node| | Node|
            |  A  | |  B  | |  C  |
            +-----+ +-----+ +-----+

When App 1 writes to the database, it also publishes an invalidation message (e.g., "invalidate key 'user:123'") to the Message Bus. All cache nodes, or services that manage cache nodes, subscribe to this bus and delete the specific key from their local cache. The result is eventual consistency across cache copies: a reader can still see the old value until the message arrives, but that delay is usually short (often tens of milliseconds within a region for most message buses).
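The fan-out described above can be simulated in-process. The sketch below assumes a hypothetical `InvalidationBus` class that stands in for Kafka, RabbitMQ, or Redis Pub/Sub; it delivers messages synchronously, whereas a real bus delivers them asynchronously and may redeliver them, which is why the handler is written to be idempotent.

```python
from typing import Callable, Dict, List


class InvalidationBus:
    """Hypothetical in-process stand-in for a real message bus."""

    def __init__(self) -> None:
        self._subscribers: List[Callable[[str], None]] = []

    def subscribe(self, handler: Callable[[str], None]) -> None:
        self._subscribers.append(handler)

    def publish(self, key: str) -> None:
        # Synchronous delivery for illustration; real buses are asynchronous.
        for handler in self._subscribers:
            handler(key)


class CacheNode:
    """A local cache that deletes keys when it hears an invalidation."""

    def __init__(self, name: str, bus: InvalidationBus) -> None:
        self.name = name
        self.entries: Dict[str, str] = {}
        bus.subscribe(self.invalidate)

    def invalidate(self, key: str) -> None:
        # Idempotent: deleting a missing key is a no-op, so redelivery is safe.
        self.entries.pop(key, None)


def write_and_invalidate(
    database: Dict[str, str], bus: InvalidationBus, key: str, value: str
) -> None:
    database[key] = value   # 1. write to the source of truth
    bus.publish(key)        # 2. tell every subscribed cache to drop the key
```

Publishing only the key, not the new value, keeps messages small and means a late or duplicated message can never install stale data; the worst case is an extra cache miss.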

Uber's Profile Service Invalidation

Consider Uber's profile service. When a driver updates their vehicle registration or a rider changes their payment method, this critical information needs to be instantly consistent across various downstream services (e.g., matching engine, billing, support).

Uber likely employs a pattern similar to "Cache Aside with explicit invalidation" for such high-priority data. When a write occurs to the underlying persistent storage (e.g., a sharded MySQL database or a NoSQL store), the service responsible for that write explicitly invalidates the relevant key in its distributed cache (e.g., a large Redis cluster). This invalidation isn't just a local operation; it's propagated across regions and cache replicas.

For example, a Profile Update Service might:

1. Persist data to MySQL (typically takes 5-10ms).
2. On successful commit, publish an event to a Kafka topic like profile_updates (adds <1ms to latency for publishing).
3. A dedicated Cache Invalidation Service (or individual services with their own caches) subscribes to profile_updates.
4. Upon receiving an event for user:123, it issues a DELETE user:123 command to the Redis cluster (typically <1ms for a local delete).

This means stale data is purged from caches within tens of milliseconds inside a region; propagation to geographically dispersed data centers takes longer, bounded by cross-region replication lag. Other services that cache user profiles subscribe to the same profile_updates topic so that they invalidate their local caches too. Because the invalidation is asynchronous, this does not give strict read-after-write consistency: it shrinks the stale window from the TTL duration to the propagation delay. A pattern like this would let such a service keep a high cache hit ratio (for example, above 95%) for frequently accessed data while keeping critical updates visible almost immediately.
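The "publish on successful commit" step in the flow above leaves a gap: if the process crashes after the commit but before the publish, the invalidation event is lost. A transactional outbox closes that gap by writing the event to a table in the same database transaction as the data change. The sketch below uses SQLite from the Python standard library purely for illustration; the table names, the `update_profile` and `relay_outbox` functions, and the `publish` callback are all hypothetical.

```python
import sqlite3
from typing import Callable


def create_schema(conn: sqlite3.Connection) -> None:
    conn.execute(
        "CREATE TABLE profiles (user_id TEXT PRIMARY KEY, payment_method TEXT)"
    )
    conn.execute(
        "CREATE TABLE outbox ("
        "id INTEGER PRIMARY KEY AUTOINCREMENT, "
        "cache_key TEXT NOT NULL, "
        "published INTEGER NOT NULL DEFAULT 0)"
    )
    conn.commit()


def update_profile(conn: sqlite3.Connection, user_id: str, payment_method: str) -> None:
    # One transaction: the data change and the invalidation event commit
    # together or roll back together.
    with conn:
        conn.execute(
            "INSERT OR REPLACE INTO profiles (user_id, payment_method) VALUES (?, ?)",
            (user_id, payment_method),
        )
        conn.execute(
            "INSERT INTO outbox (cache_key) VALUES (?)", (f"user:{user_id}",)
        )


def relay_outbox(conn: sqlite3.Connection, publish: Callable[[str], None]) -> int:
    """Publish pending events in order; returns how many rows were pending."""
    rows = conn.execute(
        "SELECT id, cache_key FROM outbox WHERE published = 0 ORDER BY id"
    ).fetchall()
    for row_id, cache_key in rows:
        publish(cache_key)  # if this raises, the row stays pending for a retry
        with conn:
            conn.execute("UPDATE outbox SET published = 1 WHERE id = ?", (row_id,))
    return len(rows)
```

A relay like this delivers each event at least once, so a consumer may see the same key twice. That is harmless here because deleting a cache key is idempotent.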

Common Mistakes

Assuming TTL is enough for everything. TTL is great for less critical data where eventual consistency is acceptable (e.g., trending topics, non-essential recommendations). But for user-generated content, financial data, or critical configuration, TTL is a recipe for stale data and customer complaints. Engineers often apply a blanket TTL policy without differentiating data consistency needs.
Forgetting about race conditions during invalidation. A classic race between a reader and a writer:
- Reader misses the cache and reads the old value from the DB.
- Writer updates the DB.
- Writer invalidates (deletes) the cache entry.
- Reader, still holding the old value, stores it in the cache.
- Result: the cache now contains the old data again, and nothing removes it until the next write or expiry. Deleting from the cache after a successful DB write keeps this window small, because the reader's DB read and cache store have to straddle the entire write. Some systems instead use a "delete-then-write" pattern, but that widens the window: any read between the delete and the commit repopulates the cache with the old value, and it also causes extra cache misses and DB load. More robust solutions involve versioning (e.g., store a version with each cache entry and reject any write older than what the cache has already seen) plus a short TTL as a safety net. Separately, the transactional outbox pattern makes the invalidation event atomic with the DB write, so the invalidation itself is never lost.
No retry mechanism for invalidation. Network issues, cache server failures, or transient errors can cause an invalidation request to fail. If your application doesn't retry invalidation, the entry stays stale until it is evicted or the key is written again, leading to hard-to-debug issues. Always build in retry logic with exponential backoff for cache operations.
Blindly invalidating everything. When one object changes, some engineers flush a whole key namespace or every related object. This can lead to a drastic drop in cache hit ratio, increased load on the database, and degraded performance. Design your cache keys and invalidation strategy to be as granular as possible. If only a small field within a large user profile changes, invalidate only that specific user's profile, not all profiles.
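The versioning idea from the race-condition discussion can be sketched as a cache that remembers the newest version it has been told about and refuses anything older. This is an illustrative in-memory model under stated assumptions: `VersionedCache`, `invalidate`, and `set_if_fresh` are hypothetical names, versions are assumed to be monotonically increasing integers assigned by the database, and a real cache would need to expire the remembered versions eventually.

```python
from typing import Dict, Optional, Tuple


class VersionedCache:
    """Hypothetical cache that rejects values older than the latest invalidation."""

    def __init__(self) -> None:
        self._entries: Dict[str, Tuple[int, str]] = {}
        self._min_version: Dict[str, int] = {}

    def get(self, key: str) -> Optional[Tuple[int, str]]:
        return self._entries.get(key)

    def invalidate(self, key: str, version: int) -> None:
        """Record that anything older than `version` is stale and drop such an entry."""
        self._min_version[key] = max(version, self._min_version.get(key, 0))
        current = self._entries.get(key)
        if current is not None and current[0] < version:
            del self._entries[key]

    def set_if_fresh(self, key: str, version: int, value: str) -> bool:
        """Store the value unless the cache already knows about a newer version."""
        if version < self._min_version.get(key, 0):
            return False  # a writer invalidated past this version: stale read
        current = self._entries.get(key)
        if current is not None and current[0] > version:
            return False  # a newer value is already cached
        self._entries[key] = (version, value)
        return True
```

In the race above, the writer calls `invalidate(key, 2)` after committing version 2, so the slow reader's later `set_if_fresh(key, 1, ...)` is rejected instead of reinstalling the old value.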

Interview Angle

When an interviewer asks about caching, especially about invalidation, they're looking for your understanding of consistency models and practical trade-offs beyond simple TTL.

Common Follow-up Questions:

"How do you achieve read-after-write consistency in a distributed cache?"
- Strong Answer: "We'd use a 'Cache Aside' pattern. On a write, the application first updates the database and then explicitly invalidates (deletes) the relevant key from the cache. This forces the next read to go to the database, ensuring fresh data. For distributed caches, this invalidation would be propagated via a message bus like Kafka or Redis Pub/Sub to all relevant cache nodes or services."
"What if the cache invalidation fails?"
- Strong Answer: "This is a critical failure scenario. The application must implement robust retry logic with exponential backoff for cache invalidation operations. If invalidation still fails after several attempts, we'd log the error, alert monitoring systems, and rely on a short TTL on that key as a safety net, accepting the temporary consistency trade-off. For extremely critical data, you might even consider a transactional outbox pattern, where the invalidation event is written to an outbox table in the same DB transaction as the data change and a separate relay publishes it, so the event cannot be lost."
"How do you handle this with multiple cache replicas/regions?"
- Strong Answer: "We'd use a publish/subscribe mechanism. The service performing the write publishes an invalidation event to a message queue (e.g., Kafka). All cache instances (or cache management services) in different regions or data centers subscribe to this queue and, upon receiving the event, explicitly delete the corresponding key from their local cache. This ensures eventual consistency for cache invalidation across the distributed system, typically within tens of milliseconds."
"What are the trade-offs of explicit invalidation vs. TTL?"
- Strong Answer: "Explicit invalidation keeps the stale window after a write very small, which is crucial for critical data, but adds complexity to the write path (extra cache operation, retry logic, distributed messaging). It also demands careful key management. TTL is simpler to implement, requires less write path complexity, and is suitable for data where eventual consistency is acceptable. However, it allows stale reads for up to the TTL duration and is prone to unnoticed consistency issues for frequently updated data. The choice depends on the data's consistency requirements and performance profile."
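The retry logic mentioned in the answers above might look like the following. It is a minimal sketch under assumptions: `invalidate_with_retry` is a hypothetical helper, `delete` is whatever function removes a key from your cache, and failures are assumed to surface as `ConnectionError`. The `sleep` parameter is injectable only so the delays can be observed in tests.

```python
import random
import time
from typing import Callable


def invalidate_with_retry(
    delete: Callable[[str], None],
    key: str,
    max_attempts: int = 4,
    base_delay: float = 0.05,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Try to delete a cache key, backing off exponentially between attempts.

    Returns True on success and False once all attempts are exhausted, at
    which point the caller should log, alert, and rely on a fallback TTL.
    """
    for attempt in range(max_attempts):
        try:
            delete(key)
            return True
        except ConnectionError:
            if attempt == max_attempts - 1:
                break  # no point sleeping after the final attempt
            delay = base_delay * (2 ** attempt)        # 0.05, 0.10, 0.20, ...
            sleep(delay + random.uniform(0, delay))    # jitter avoids retry storms
    return False
```

Retrying inline like this adds latency to the write path, so the attempt count and base delay should stay small; longer recovery is better handled by an asynchronous queue of failed invalidations.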

