Khúc Ngọc Huy

1M+ Req/s Heavy-Read API in Go. Production Lessons Learned

Today, I'm sharing a Distributed In-Memory Cache architecture specifically designed for Heavy-Read systems, complete with a demo and benchmark.

Disclaimer: The "pattern" described below is synthesized from a production architecture implemented at my company. It has been modified and "AI-refined" for public sharing.


The Challenge: Scaling Read-Heavy Applications

Imagine hitting a database or even a Redis instance with 1 million requests per second (10^6 Req/s). That isn't sustainable without significant investment: do the math, and traditional vertical scaling (beefing up the DB/Redis servers) gets incredibly expensive, fast.

Consider an e-commerce application:

  • A Write API handles product creation/updates (sellers posting products).
  • A Read API handles product viewing (buyers browsing products).

Since the ratio of product views to product posts is usually massive, it's prudent to separate these functionalities into at least two distinct microservices: a Write Service and a Read Service. This separation ensures that if the Write Service experiences an outage, the Read Service (the critical path for user experience) remains unaffected.

So, how do we handle the intense read load?

The Proposed Architecture: Event-Driven Cache Synchronization

Here is the high-level approach to decouple the Read Service from the main data store:

When the seller successfully creates or updates a product (Write Service):

  1. The Write Service performs the database operation.
  2. It then publishes an event (the updated data) to a message broker (e.g., Redis Pub/Sub).
  3. All instances (pods) of the Read Service subscribe and listen for this event.
  4. Upon receiving the event, each Read Service pod updates its own local In-Memory Cache.
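Concretely, steps 1 and 2 can be as small as a single publish call after the database transaction commits. Below is a minimal sketch of the Write Service half, assuming the go-redis v9 client, a `product-updates:<id>` channel convention, and a `Product` type — all illustrative names, not the demo repo's actual API:

```go
package writesvc

import (
	"context"
	"encoding/json"

	"github.com/redis/go-redis/v9"
)

type Product struct {
	ID    string `json:"id"`
	Name  string `json:"name"`
	Price int64  `json:"price"`
}

// publishProductUpdate runs in the Write Service right after the database
// write commits. The product is serialized exactly once here; the raw bytes
// are then pushed onto a Pub/Sub channel that every Read Service pod
// subscribes to.
func publishProductUpdate(ctx context.Context, rdb *redis.Client, p Product) error {
	payload, err := json.Marshal(p)
	if err != nil {
		return err
	}
	// The channel name carries the product ID so subscribers can key their
	// local cache without ever unmarshalling the payload.
	return rdb.Publish(ctx, "product-updates:"+p.ID, payload).Err()
}
```

The subscriber half (steps 3 and 4) is sketched in the next section, because that is where the interesting optimization lives.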

At first glance, this might seem standard, but the real power is in the implementation details that significantly reduce overhead.


The "Secret Sauce": Eliminating CPU and I/O Bottlenecks

After intense monitoring (metrics, APM, etc.), we noticed unusual spikes in service overhead during peak load. Reviewing the code and benchmarking led us to four key optimizations:

1. Zero-Copy Cache Update

When the Write Service publishes an update event to Redis, it first serializes (marshals) the object into raw bytes.

Object -> Serialize -> Bytes -> Publish to Redis

Crucially, when the Read Service pods listen for this event, they do not deserialize (unmarshal) the object back. They simply take the raw bytes and store them directly into the local In-Memory Cache.
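Here is a minimal sketch of that subscriber, reusing the go-redis v9 client and the `product-updates:<id>` channel convention from the snippet above (how the cache key travels with the event is my assumption; the demo repo may do this differently):

```go
package readsvc

import (
	"context"
	"strings"
	"sync"

	"github.com/redis/go-redis/v9"
)

var (
	mu    sync.RWMutex
	cache = map[string][]byte{} // product ID -> pre-serialized JSON bytes
)

// listenForUpdates runs in every Read Service pod. The payload is stored
// exactly as it arrived off the wire: no json.Unmarshal, no re-encoding.
func listenForUpdates(ctx context.Context, rdb *redis.Client) {
	sub := rdb.PSubscribe(ctx, "product-updates:*")
	defer sub.Close()

	for msg := range sub.Channel() {
		id := strings.TrimPrefix(msg.Channel, "product-updates:")
		mu.Lock()
		cache[id] = []byte(msg.Payload) // raw bytes in, raw bytes out later
		mu.Unlock()
	}
}
```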

2. Bypassing the Standard Framework Overhead

We initially used a framework (like Go Fiber) for our Read Service. However, we switched to using Go's native net/http library.

The key optimization: the Read Service directly serves the raw bytes retrieved from the In-Memory Cache as the HTTP response body.

Local Cache (Bytes) -> net/http -> HTTP Response (Bytes)

By combining steps 1 and 2, we have essentially eliminated all CPU-bound serialization/deserialization and I/O-bound database/remote cache lookups from the critical read path.
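On the request path, that reduces each cache hit to a map lookup plus one `Write` of the cached bytes. A sketch with plain net/http (route and helper names are mine, and a production version would use an LRU/LFU cache rather than a bare map):

```go
package readsvc

import (
	"net/http"
	"sync"
)

var (
	mu    sync.RWMutex
	cache = map[string][]byte{} // filled by the Pub/Sub listener
)

// handleGetProduct serves a product straight from the local in-memory cache:
// no DB call, no Redis call, and no JSON encoding on the request path.
func handleGetProduct(w http.ResponseWriter, r *http.Request) {
	id := r.URL.Query().Get("id")

	mu.RLock()
	body, ok := cache[id]
	mu.RUnlock()

	if !ok {
		http.NotFound(w, r) // or fall back to Redis/DB to bootstrap the entry
		return
	}

	w.Header().Set("Content-Type", "application/json")
	w.Write(body) // cached bytes go out as-is
}

func newMux() *http.ServeMux {
	mux := http.NewServeMux()
	mux.HandleFunc("/products", handleGetProduct)
	return mux
}
```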

3. Leveraging HTTP 304 Caching (Etag)

To minimize network overhead, we implemented HTTP 304 caching.

  • The Write Service calculates the ETag (a hash of the object) when it publishes the event.
  • This ETag is passed along to the Read Service pods.
  • The Read Service uses this ETag in its responses. If the client sends an If-None-Match header with the matching ETag, the Read Service returns a 304 Not Modified, saving bandwidth and processing time.
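Both halves of that flow are small. A sketch, using SHA-256 as one reasonable choice of hash (the actual hash and header plumbing in the repo may differ):

```go
package readsvc

import (
	"crypto/sha256"
	"encoding/hex"
	"net/http"
)

// etagFor is what the Write Service computes once, right after marshalling,
// and ships alongside the payload in the Pub/Sub event.
func etagFor(payload []byte) string {
	sum := sha256.Sum256(payload)
	return `"` + hex.EncodeToString(sum[:8]) + `"` // quoted strong ETag
}

// serveWithETag is the read-path half: if the client already holds the
// current version, answer 304 and skip the body entirely.
func serveWithETag(w http.ResponseWriter, r *http.Request, etag string, body []byte) {
	w.Header().Set("ETag", etag)
	if r.Header.Get("If-None-Match") == etag {
		w.WriteHeader(http.StatusNotModified)
		return
	}
	w.Header().Set("Content-Type", "application/json")
	w.Write(body)
}
```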

4. Response Compression (Gzip/Brotli)

While optional (and we temporarily skipped it due to client-side complexity), the ideal setup also handles Gzip/Brotli compression at the Write Service. The compressed bytes are then stored in the cache, further reducing network overhead without impacting the Read Service's CPU budget.
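The idea, sketched below, is to gzip the payload once at publish time so Read pods can cache and serve the compressed bytes with `Content-Encoding: gzip`; this only works if every client can accept gzip, which is exactly the client-side complexity mentioned above. This is an illustration of the approach, not what the demo currently ships:

```go
package writesvc

import (
	"bytes"
	"compress/gzip"
)

// compressOnce gzips the serialized payload a single time, on the write
// path. Read Service pods then cache and serve these bytes directly with
// Content-Encoding: gzip, spending zero CPU on compression per request.
func compressOnce(payload []byte) ([]byte, error) {
	var buf bytes.Buffer
	zw := gzip.NewWriter(&buf)
	if _, err := zw.Write(payload); err != nil {
		return nil, err
	}
	if err := zw.Close(); err != nil {
		return nil, err
	}
	return buf.Bytes(), nil
}
```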


Architectural Implications

Here are the key takeaways from adopting this pattern:

  • Database and Remote Cache are removed from the Critical Path: The DB and Redis are only used for event pub/sub and as a fallback when a new pod needs to bootstrap its cache (see the sketch after this list). This allows for significantly lower capacity requirements for these vertical components.
  • Massive Horizontal Scaling: Scaling Read Service pods (horizontal scaling) is orders of magnitude cheaper and easier than scaling the DB/Redis (vertical scaling). In our production environment, each pod can handle 60,000 Req/s, meaning <20 pods are needed for the 1 million Req/s target.
  • CQRS Pattern Alignment: This architecture inherently follows the Command Query Responsibility Segregation (CQRS) principle by distinctly separating the data update (Command/Write) path from the data retrieval (Query/Read) path.
  • Nano-Service Mindset: By isolating and optimizing a single, high-traffic read operation to this degree, it approaches the design philosophy of a "nano-service," similar to an AWS Lambda function handling a single task.
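For the bootstrap fallback mentioned in the first point, a new pod (or a local cache miss) can lazily pull the pre-serialized bytes from Redis and seed its own cache. A rough sketch, assuming the Write Service also stores each product as JSON under a `product:<id>` key (an assumption on my part, consistent with the demo's /post-redis endpoint but not guaranteed to match the repo):

```go
package readsvc

import (
	"context"
	"errors"
	"sync"

	"github.com/redis/go-redis/v9"
)

var (
	mu    sync.RWMutex
	cache = map[string][]byte{} // the same local cache the Pub/Sub listener fills
)

// bootstrapEntry handles a local-cache miss: fetch the pre-serialized bytes
// from Redis (falling back to the DB would work the same way), seed the
// local cache, and return the bytes for the current request.
func bootstrapEntry(ctx context.Context, rdb *redis.Client, id string) ([]byte, error) {
	body, err := rdb.Get(ctx, "product:"+id).Bytes()
	if errors.Is(err, redis.Nil) {
		return nil, errors.New("product not found")
	}
	if err != nil {
		return nil, err
	}

	mu.Lock()
	cache[id] = body
	mu.Unlock()
	return body, nil
}
```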

The Core Architecture Flow

[Architecture diagram]

Key Strengths

| Feature | Benefit |
| --- | --- |
| Ultra-Low Latency | Sub-millisecond P99 response times on cache hits. |
| Linear Scaling | Adding more Reader pods scales throughput proportionally. |
| Real-Time Sync | Auto-propagation of updates via Pub/Sub with minimal overhead. |
| CPU/Memory Efficiency | Minimal serialization/deserialization and smart caching (LFU/LRU). |

Demo Benchmark Results

You can find the implementation and demo on GitHub: https://github.com/huykn/distributed-cache

The demo includes three comparative endpoints. Results on my test machine (4 threads / 400 concurrent users):

| Endpoint | Path | P99 | What it does |
| --- | --- | --- | --- |
| Fast path | /post | 13ms | Local cache + raw bytes (no marshal) |
| Redis baseline | /post-redis | 30ms | Reads JSON directly from Redis |
| Local + marshal | /post-marshal | 15ms | Local cache + json.Marshal on read |

The difference between the raw bytes approach (/post) and the one that requires JSON marshalling (/post-marshal) clearly shows the value of eliminating CPU-bound serialization.


Real Metrics: Production Scale and Efficiency

Beyond the low-latency benchmarks, the true value of this architecture is seen in its production efficiency and scalability:

Production Resource Utilization

| Component | Specification | Purpose / Notes |
| --- | --- | --- |
| Total Load | 1 million Req/s | Aggregate throughput for the Heavy-Read API. |
| Reader Pod (k8s) | ~60,000 Req/s | Throughput achieved by a single, standard Kubernetes pod. |
| Total Reader Pods | ~20 pods | The maximum number of pods needed to handle the peak 1M Req/s load. |
| Redis Instance | 1× r7i.xlarge (AWS) | Used only for Pub/Sub events and occasional cache-miss bootstrap. |

[Screenshot: real CCU]

[Screenshot: real users]


What About Stale Data?

The heavy-read example focuses on speed and architecture. In real systems you also need to guard against stale data (out-of-order messages, missed invalidations, network partitions).

For a full, code-level treatment of this problem (versioned entries, detector, and OnSetLocalCache-based validation), see:

That example shows how to reject stale updates, detect stale entries on Get(), and handle Redis failures.
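As a rough illustration of the versioning idea only (my own minimal sketch, not the OnSetLocalCache API from the linked example): each event carries a monotonically increasing version assigned by the Write Service, and a pod applies an update only if it is newer than what it already holds.

```go
package readsvc

import "sync"

// versionedEntry pairs the cached bytes with a version (e.g. a DB row
// version or update timestamp) assigned by the Write Service.
type versionedEntry struct {
	version uint64
	body    []byte
}

var (
	vmu    sync.Mutex
	vcache = map[string]versionedEntry{}
)

// setIfNewer applies an update only when its version is strictly newer, so
// out-of-order or replayed events can never overwrite fresher data.
func setIfNewer(key string, version uint64, body []byte) bool {
	vmu.Lock()
	defer vmu.Unlock()
	if cur, ok := vcache[key]; ok && cur.version >= version {
		return false // stale or duplicate event: drop it
	}
	vcache[key] = versionedEntry{version: version, body: body}
	return true
}
```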


Conclusion

The Distributed In-Memory Cache pattern is more than just a performance tweak; it is a fundamental shift in how we handle read traffic in event-driven microservices.

This approach prioritizes Availability (A) and Partition Tolerance (P) over strict Consistency (C); it sits on the AP side of the CAP theorem. There is a brief window of inconsistency while each event propagates, so the data is only eventually consistent. When adopting this, make sure your business logic can tolerate minor, temporary data inconsistencies.


Found this valuable? Give the GitHub repository a star and a watch! Your support encourages me to write the next deep dive, on the "Heavy-Write API".
