Today, I'm sharing a Distributed In-Memory Cache architecture specifically designed for Heavy-Read systems, complete with a demo and benchmark.
Disclaimer: The "pattern" described below is synthesized from a production architecture implemented at my company. It has been modified and "AI-refined" for public sharing.
The Challenge: Scaling Read-Heavy Applications
Imagine hitting a database or even a Redis instance with 1 million requests per second (10^6 Req/s). That isn't sustainable without significant investment: do the math, and traditional vertical scaling (beefing up the DB/Redis servers) gets incredibly expensive, fast.
Consider an e-commerce application:
- A Write API handles product creation/updates (sellers posting products).
- A Read API handles product viewing (buyers browsing products).
Since the ratio of product views to product posts is usually massive, it's prudent to separate these functionalities into at least two distinct microservices: a Write Service and a Read Service. This separation ensures that if the Write Service experiences an outage, the Read Service (the critical path for user experience) remains unaffected.
So, how do we handle the intense read load?
The Proposed Architecture: Event-Driven Cache Synchronization
Here is the high-level approach to decouple the Read Service from the main data store:
When the seller successfully creates or updates a product (Write Service):
- The Write Service performs the database operation.
- It then publishes an event (the updated data) to a message broker (e.g., Redis Pub/Sub).
- All instances (pods) of the Read Service subscribe and listen for this event.
- Upon receiving the event, each Read Service pod updates its own local In-Memory Cache.
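To make that flow concrete, here is a minimal sketch of the Write Service side in Go with go-redis v9. The channel name, the `updateEvent` envelope (key, ETag, body), the ETag hash choice, and the `Product` type are assumptions for illustration, not the exact shapes used in production or in the demo repo.

```go
package main

import (
	"context"
	"crypto/sha256"
	"encoding/json"
	"fmt"
	"log"

	"github.com/redis/go-redis/v9"
)

// Product is an illustrative domain type; the real schema will differ.
type Product struct {
	ID    string `json:"id"`
	Name  string `json:"name"`
	Price int64  `json:"price"`
}

// updateEvent is the envelope fanned out to Read pods: the cache key, a
// precomputed ETag, and the already-serialized product bytes.
type updateEvent struct {
	Key  string          `json:"key"`
	ETag string          `json:"etag"`
	Body json.RawMessage `json:"body"`
}

const updatesChannel = "product-updates" // assumed channel name

func publishProductUpdate(ctx context.Context, rdb *redis.Client, p Product) error {
	// 1. Persist to the primary database first (omitted in this sketch).

	// 2. Serialize once, on the write path, so Read pods never have to.
	body, err := json.Marshal(p)
	if err != nil {
		return err
	}

	// 3. Precompute the ETag here too (used by optimization #3 below).
	etag := fmt.Sprintf("\"%x\"", sha256.Sum256(body))

	evt, err := json.Marshal(updateEvent{Key: "product:" + p.ID, ETag: etag, Body: body})
	if err != nil {
		return err
	}

	// 4. Fan the event out to every Read Service pod via Pub/Sub.
	return rdb.Publish(ctx, updatesChannel, evt).Err()
}

func main() {
	rdb := redis.NewClient(&redis.Options{Addr: "localhost:6379"})
	defer rdb.Close()

	p := Product{ID: "p-1", Name: "Mechanical Keyboard", Price: 4900}
	if err := publishProductUpdate(context.Background(), rdb, p); err != nil {
		log.Fatal(err)
	}
}
```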
At first glance, this might seem standard, but the real power is in the implementation details that significantly reduce overhead.
The "Secret Sauce": Eliminating CPU and I/O Bottlenecks
After intense monitoring (metrics, APM, etc.), we noticed unusual spikes in service overhead during peak load. Reviewing the code and benchmarking led us to the following optimizations:
1. Zero-Copy Cache Update
When the Write Service publishes an update event to Redis, it first serializes (marshals) the object into raw bytes.
Object -> Serialize -> Bytes -> Publish to Redis
Crucially, when the Read Service pods listen for this event, they do not deserialize (unmarshal) the object back. They simply take the raw bytes and store them directly into the local In-Memory Cache.
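Here is a sketch of the subscriber side under the same assumptions as the publisher snippet above: only the thin envelope is decoded (to learn the cache key and ETag), while the product payload itself stays a `json.RawMessage` and goes into the local cache as raw bytes, never unmarshaled into a struct.

```go
package main

import (
	"context"
	"encoding/json"
	"log"
	"sync"

	"github.com/redis/go-redis/v9"
)

// cacheEntry holds exactly what the HTTP layer needs later: pre-serialized
// JSON bytes plus the ETag computed by the Write Service.
type cacheEntry struct {
	Body []byte
	ETag string
}

// localCache is the per-pod in-memory cache; a production version would
// add LFU/LRU eviction and a size limit.
var localCache sync.Map // map[string]cacheEntry

// updateEvent mirrors the envelope from the publisher sketch.
type updateEvent struct {
	Key  string          `json:"key"`
	ETag string          `json:"etag"`
	Body json.RawMessage `json:"body"`
}

func subscribeUpdates(ctx context.Context, rdb *redis.Client) {
	sub := rdb.Subscribe(ctx, "product-updates")
	defer sub.Close()

	for msg := range sub.Channel() {
		var evt updateEvent
		// Only the thin envelope is decoded; evt.Body keeps the original bytes.
		if err := json.Unmarshal([]byte(msg.Payload), &evt); err != nil {
			log.Printf("bad update event: %v", err)
			continue
		}
		// Store the raw product bytes directly -- no unmarshal into a struct.
		localCache.Store(evt.Key, cacheEntry{Body: evt.Body, ETag: evt.ETag})
	}
}

func main() {
	rdb := redis.NewClient(&redis.Options{Addr: "localhost:6379"})
	defer rdb.Close()
	subscribeUpdates(context.Background(), rdb)
}
```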
2. Bypassing the Standard Framework Overhead
We initially used a framework (like Go Fiber) for our Read Service. However, we switched to using Go's native net/http library.
The key optimization: the Read Service directly serves the raw bytes retrieved from the In-Memory Cache as the HTTP response body.
Local Cache (Bytes) -> net/http -> HTTP Response (Bytes)
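A minimal `net/http` handler in that spirit, reusing the `cacheEntry` shape from the subscriber sketch; the routing and key scheme are illustrative:

```go
package main

import (
	"log"
	"net/http"
	"strings"
	"sync"
)

// Same shapes as in the subscriber sketch; in a real service they live in
// the same package and the Pub/Sub goroutine populates localCache.
type cacheEntry struct {
	Body []byte
	ETag string // used by optimization #3 below
}

var localCache sync.Map // map[string]cacheEntry

func productHandler(w http.ResponseWriter, r *http.Request) {
	// Illustrative routing: GET /products/p-1 -> cache key "product:p-1".
	id := strings.TrimPrefix(r.URL.Path, "/products/")

	v, ok := localCache.Load("product:" + id)
	if !ok {
		http.NotFound(w, r) // or bootstrap from Redis/DB on a miss
		return
	}
	entry := v.(cacheEntry)

	w.Header().Set("Content-Type", "application/json")
	// The cached value is already JSON bytes: nothing is marshaled on the read path.
	w.Write(entry.Body)
}

func main() {
	http.HandleFunc("/products/", productHandler)
	log.Fatal(http.ListenAndServe(":8080", nil))
}
```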
By combining steps 1 and 2, we have essentially eliminated all CPU-bound serialization/deserialization and I/O-bound database/remote cache lookups from the critical read path.
3. Leveraging HTTP 304 Caching (Etag)
To minimize network overhead, we implemented HTTP 304 caching.
- The Write Service calculates the ETag (a hash of the object) when it publishes the event.
- This ETag is passed along to the Read Service pods.
- The Read Service uses this ETag in its responses. If the client sends an If-None-Match header with a matching ETag, the Read Service returns a 304 Not Modified, saving bandwidth and processing time.
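In the handler sketch above this is only a few extra lines, because the ETag arrived with the event and is never computed on the read path (note that this exact-match check is a simplification: it ignores weak validators and multi-value If-None-Match lists):

```go
// Inside productHandler from the previous sketch, replacing the final
// Content-Type/Write lines. The ETag was computed by the Write Service
// and travelled with the event, so nothing is hashed here.
w.Header().Set("ETag", entry.ETag)
if r.Header.Get("If-None-Match") == entry.ETag {
	// The client already holds this exact version: status + headers only.
	w.WriteHeader(http.StatusNotModified)
	return
}
w.Header().Set("Content-Type", "application/json")
w.Write(entry.Body)
```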
4. Response Compression (Gzip/Brotli)
While optional (we temporarily skipped it due to client-side complexity), ideally Gzip/Brotli compression is also handled at the Write Service. The compressed bytes are then stored in the cache, further reducing network overhead without impacting the Read Service's CPU budget.
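If you do add it, the compression happens once on the write path, next to the marshal, so the cached and published bytes are already encoded and the Read Service only adds a Content-Encoding: gzip header. A rough sketch, not something we currently run:

```go
package main

import (
	"bytes"
	"compress/gzip"
	"fmt"
	"log"
)

// gzipBytes compresses the already-serialized product once, on the write
// path. The Read Service would cache and serve these bytes as-is, adding a
// "Content-Encoding: gzip" header (plus a fallback for clients that don't
// accept gzip). Sketch only; our production setup currently skips this step.
func gzipBytes(b []byte) ([]byte, error) {
	var buf bytes.Buffer
	zw, err := gzip.NewWriterLevel(&buf, gzip.BestCompression)
	if err != nil {
		return nil, err
	}
	if _, err := zw.Write(b); err != nil {
		return nil, err
	}
	if err := zw.Close(); err != nil { // Close flushes the gzip trailer
		return nil, err
	}
	return buf.Bytes(), nil
}

func main() {
	raw := []byte(`{"id":"p-1","name":"Mechanical Keyboard","price":4900}`)
	gz, err := gzipBytes(raw)
	if err != nil {
		log.Fatal(err)
	}
	// Compression only pays off for realistic payload sizes, not this tiny sample.
	fmt.Printf("raw: %d bytes, gzipped: %d bytes\n", len(raw), len(gz))
}
```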
Architectural Implications
Here are the key takeaways from adopting this pattern:
- Database and Remote Cache are removed from the Critical Path: The DB and Redis are only used for event pub/sub and as a fallback when a new pod needs to bootstrap its cache (see the sketch after this list). This allows for significantly lower capacity requirements for these vertical components.
- Massive Horizontal Scaling: Scaling Read Service pods (horizontal scaling) is orders of magnitude cheaper and easier than scaling the DB/Redis (vertical scaling). In our production environment, each pod can handle 60,000 Req/s, meaning <20 pods are needed for the 1 million Req/s target.
- CQRS Pattern Alignment: This architecture inherently follows the Command Query Responsibility Segregation (CQRS) principle by distinctly separating the data update (Command/Write) path from the data retrieval (Query/Read) path.
- Nano-Service Mindset: By isolating and optimizing a single, high-traffic read operation to this degree, it approaches the design philosophy of a "nano-service," similar to an AWS Lambda function handling a single task.
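For completeness, here is a hypothetical sketch of the bootstrap fallback mentioned in the first point, reusing the localCache/cacheEntry shapes from the earlier sketches. It assumes the Write Service also stores the serialized product in Redis under the same key it publishes, which is an assumption of this sketch rather than something shown above.

```go
// getOrBootstrap slots into the Read Service next to productHandler and the
// subscriber sketch: it serves from the local cache and, on a miss (typically
// right after a pod starts), seeds the cache once from Redis. The ETag could
// be stored alongside the body or recomputed here; omitted for brevity.
func getOrBootstrap(ctx context.Context, rdb *redis.Client, key string) ([]byte, error) {
	if v, ok := localCache.Load(key); ok {
		return v.(cacheEntry).Body, nil
	}
	body, err := rdb.Get(ctx, key).Bytes()
	if err != nil {
		return nil, err // redis.Nil here means the product really doesn't exist
	}
	localCache.Store(key, cacheEntry{Body: body})
	return body, nil
}
```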
The Core Architecture Flow
Key Strengths
| Feature | Benefit |
|---|---|
| Ultra-Low Latency | Sub-millisecond P99 response times on cache hits. |
| Linear Scaling | Adding more Reader pods scales throughput proportionally. |
| Real-Time Sync | Auto-propagation of updates via Pub/Sub with minimal overhead. |
| CPU/Memory Efficiency | Minimal serialization/deserialization and smart caching (LFU/LRU). |
Demo Benchmark Results
You can find the implementation and demo on GitHub: https://github.com/huykn/distributed-cache
The demo includes three comparative endpoints. Results on my test machine (4 threads / 400 concurrent users):
| Endpoint | Path | P99 | What it does |
|---|---|---|---|
| Fast path | /post | 13ms | Local cache + raw bytes (no marshal) |
| Redis baseline | /post-redis | 30ms | Reads JSON directly from Redis |
| Local + marshal | /post-marshal | 15ms | Local cache + json.Marshal on read |
The difference between the raw bytes approach (/post) and the one that requires JSON marshalling (/post-marshal) clearly shows the value of eliminating CPU-bound serialization.
Real Metrics: Production Scale and Efficiency
Beyond the low-latency benchmarks, the true value of this architecture is seen in its production efficiency and scalability:
Production Resource Utilization
| Component | Specification | Purpose / Notes |
|---|---|---|
| Total Load | 1 million Req/s | Aggregate throughput for the Heavy-Read API. |
| Reader Pod (k8s) | ~60,000 Req/s | Throughput achieved by a single, standard Kubernetes pod. |
| Total Reader Pods | ~20 pods | Number of pods needed to handle the peak 1M Req/s load. |
| Redis Instance | 1× r7i.xlarge (AWS) | Used only for Pub/Sub events and occasional cache-miss bootstrap. |
What About Stale Data?
The heavy-read example focuses on speed and architecture. In real systems you also need to guard against stale data (out-of-order messages, missed invalidations, network partitions).
For a full, code-level treatment of this problem (versioned entries, detector, and OnSetLocalCache-based validation), see:
That example shows how to reject stale updates, detect stale entries on Get(), and handle Redis failures.
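The linked example defines its own API; as a generic illustration of the versioning idea, each pod can refuse to overwrite a newer entry with an older one by comparing a monotonically increasing version (for example a DB sequence number or updated_at timestamp) carried in the event:

```go
package main

import (
	"fmt"
	"sync"
)

// versionedEntry pairs the cached bytes with a monotonically increasing
// version supplied by the Write Service (e.g. a DB sequence number or
// updated_at in nanoseconds). Illustrative only; the linked example has
// its own entry type and detector.
type versionedEntry struct {
	Version int64
	Body    []byte
}

var (
	mu    sync.Mutex
	cache = map[string]versionedEntry{}
)

// applyUpdate accepts an incoming event only if it is newer than what the
// pod already holds, so late or re-delivered messages cannot roll the
// local cache back to an older state.
func applyUpdate(key string, version int64, body []byte) bool {
	mu.Lock()
	defer mu.Unlock()
	if cur, ok := cache[key]; ok && cur.Version >= version {
		return false // stale or duplicate event: drop it
	}
	cache[key] = versionedEntry{Version: version, Body: body}
	return true
}

func main() {
	fmt.Println(applyUpdate("product:p-1", 2, []byte(`{"price":4900}`))) // true: newer write wins
	fmt.Println(applyUpdate("product:p-1", 1, []byte(`{"price":3900}`))) // false: older version rejected
}
```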
Conclusion
The Distributed In-Memory Cache pattern is more than just a performance tweak; it is a fundamental shift in how we handle read traffic in event-driven microservices.
This approach prioritizes Availability (A) and Partition Tolerance (P) over strict Consistency (C), placing it on the AP side of the CAP theorem. There is a brief window of inconsistency while an event propagates, so before adopting this pattern, make sure your business logic can tolerate minor, temporary data staleness.
Found this valuable? Give the GitHub repository a star and a watch! Your support encourages me to write the next deep dive, on the "Heavy-Write API".

