
lowkey dev


How I Saved My System Through Peak Season

Introduction: Peak Season and the Challenge Ahead

The travel season was here, and the atmosphere at our company was hotter than the sun outside. Our system—the heartbeat of all operations—was about to face peak traffic 8–10 times higher than usual. I opened my laptop and accessed the dashboard like a normal user, but immediately felt the pressure: everything was slow and laggy, each click sent a flurry of requests that were hard to control.

Every analytics table, every chart was a potential “CPU and memory bomb.” The server was under stress, and an OOM (Out of Memory) crash was almost guaranteed if traffic kept spiking. This marked the start of my journey to save the system, where every decision would directly affect the user experience.


Investigating the Frontend: The Tip of the Iceberg

Opening the browser DevTools (F12), I saw hundreds of requests continuously hitting endpoints, many fetching entire customer, transaction, and payment tables. The dashboard tried to compute everything in real time, but CPU and memory jumped with every click.

I applied lazy loading for non-critical data, cached some tables temporarily in localStorage, and sacrificed a little smoothness in UX. Instantly, the dashboard became more responsive, the backend felt lighter. But I knew this was just the tip of the iceberg—the real danger was lurking deeper.
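To make that concrete, here is a minimal sketch of the kind of localStorage table cache I mean. The endpoint name, cache key, and five-minute TTL are illustrative, not our exact values:

```typescript
// Minimal sketch of a localStorage table cache (key names and TTL are illustrative).
type Cached<T> = { data: T; storedAt: number };

const TABLE_CACHE_TTL_MS = 5 * 60 * 1000; // accept dashboard data up to 5 minutes old

async function fetchTableWithCache<T>(endpoint: string): Promise<T> {
  const key = `table-cache:${endpoint}`;
  const raw = localStorage.getItem(key);

  if (raw) {
    const cached: Cached<T> = JSON.parse(raw);
    // Serve the cached copy if it is still fresh enough for a dashboard view.
    if (Date.now() - cached.storedAt < TABLE_CACHE_TTL_MS) {
      return cached.data;
    }
  }

  // Cache miss or stale entry: hit the backend once and store the result.
  const response = await fetch(endpoint);
  const data: T = await response.json();
  const entry: Cached<T> = { data, storedAt: Date.now() };
  localStorage.setItem(key, JSON.stringify(entry));
  return data;
}
```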


Investigating the Backend: Where the Pressure Truly Lies

The frontend, though, was only the tip of the iceberg. I opened the server logs, enabled APM, and tracked slow queries and profiling metrics. Many endpoints computed analytics in real time over massive tables. Read-heavy queries were unoptimized, fetching all the data on every dashboard load and sending CPU and memory into overdrive.

I tried precomputing heavy metrics and storing them in Redis. Initially, data was a few minutes behind real-time, making me anxious, but the dashboard ran smoothly and the backend stabilized. A clear trade-off: sacrificing some accuracy to save the system. Redis hit rates increased, and I felt both relief and tension.
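The precompute job itself was conceptually simple. A rough sketch of the idea, assuming ioredis and a stand-in for the real aggregation query (the metric names and refresh interval are illustrative):

```typescript
// Sketch of a scheduled precompute job (ioredis assumed; metric names,
// the aggregation helper, and the intervals are illustrative).
import Redis from "ioredis";

const redis = new Redis();

async function computeRevenueMetrics(): Promise<{ total: number; byRegion: Record<string, number> }> {
  // Stand-in for the heavy MySQL aggregation the dashboard used to run on every load.
  return { total: 0, byRegion: {} };
}

async function precomputeDashboardMetrics(): Promise<void> {
  const metrics = await computeRevenueMetrics();

  // TTL slightly longer than the refresh interval, so a late job
  // never leaves the dashboard staring at an empty cache.
  await redis.set("dashboard:revenue", JSON.stringify(metrics), "EX", 10 * 60);
}

// Refresh every 5 minutes: data lags real time a little, but reads become O(1).
setInterval(() => {
  precomputeDashboardMetrics().catch((err) => console.error("precompute failed", err));
}, 5 * 60 * 1000);
```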


CQRS and Read-Heavy Queries: A Long-Term Solution

The read-heavy queries continued to stress the server. I tried scaling MySQL, adding replicas, increasing RAM—but memory spikes still occurred. I decided to implement CQRS, separating write and read operations, using OpenSearch to serve read-heavy queries.

Data synchronization was complex, logic was intricate, but the dashboard finally responded fast and reliably. Complexity increased—more services in the codebase, listeners syncing data, added monitoring for OpenSearch, Redis, and MySQL. Yet, heavy analytics tables now ran smoothly, CPU and memory no longer jumped wildly.
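To show the shape of it, here is a simplified sketch of a sync listener that projects a write-side event into the OpenSearch read model. The OpenSearch JS client is assumed, and the event shape and index name are made up for the example:

```typescript
// Sketch of a read-model sync listener (OpenSearch JS client assumed;
// the event shape and index name are illustrative).
import { Client } from "@opensearch-project/opensearch";

const search = new Client({ node: "http://localhost:9200" });

interface TransactionCreatedEvent {
  id: string;
  customerId: string;
  amount: number;
  createdAt: string;
}

// Called whenever the write side (MySQL) commits a new transaction.
// The read side is eventually consistent: dashboards query OpenSearch,
// never the transactional tables.
async function onTransactionCreated(event: TransactionCreatedEvent): Promise<void> {
  await search.index({
    index: "transactions-read-model",
    id: event.id,
    body: {
      customerId: event.customerId,
      amount: event.amount,
      createdAt: event.createdAt,
    },
  });
}
```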


Precomputing the Dashboard: Sacrificing Realtime

The most critical analytics tables, if computed in real time, would push the already stressed server toward a crash. I precomputed the results and stored them in Redis. When peak traffic hit, the dashboard ran smoothly, though the data was no longer fully real-time. I remember clicking through the dashboard and seeing charts lag by a few minutes: a trade-off worth accepting to keep the system alive.

Exports and dashboard queries now returned lightning-fast data from Redis, CPU dropped from 95% to 60%, and memory stabilized.
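For context, the read path ends up looking roughly like this: the endpoint only ever touches the precomputed snapshot in Redis. Express and ioredis are assumed, and the route and key names are illustrative:

```typescript
// Sketch of the dashboard read path: serve the precomputed snapshot from Redis.
// (Express and ioredis assumed; route and key names are illustrative.)
import express from "express";
import Redis from "ioredis";

const app = express();
const redis = new Redis();

app.get("/api/dashboard/revenue", async (_req, res) => {
  const cached = await redis.get("dashboard:revenue");
  if (cached) {
    // Typical case during peak: an O(1) Redis read, MySQL is never touched.
    res.json(JSON.parse(cached));
    return;
  }

  // Rare fallback: the cache is cold, ask the client to retry after the next precompute run.
  res.status(503).json({ error: "metrics are being recomputed, try again shortly" });
});

app.listen(3000);
```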


Cache Promise, Request Coalescing, and Pre-Warming

Before peak traffic, many concurrent requests hitting the same data made Redis and the database shaky. I implemented Cache Promise and request coalescing, merging multiple requests so that only one query actually hit the database. The code became more complex, but the backend stood firm—I felt like we had weathered a storm.
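The core of the coalescing idea fits in a few lines: keep a map of in-flight promises per cache key, and let every concurrent caller await the same one. A simplified sketch (the key and loader names are illustrative):

```typescript
// Sketch of the "cache promise" / request-coalescing wrapper: concurrent callers
// for the same key share one in-flight promise, so only one query actually
// reaches Redis or MySQL.
const inFlight = new Map<string, Promise<unknown>>();

async function coalesce<T>(key: string, loader: () => Promise<T>): Promise<T> {
  const existing = inFlight.get(key);
  if (existing) {
    // Someone is already fetching this key: piggyback on their promise.
    return existing as Promise<T>;
  }

  const promise = loader().finally(() => {
    // Clear the slot once the fetch settles so later requests can refetch.
    inFlight.delete(key);
  });

  inFlight.set(key, promise);
  return promise;
}

// Usage: a hundred concurrent dashboard loads trigger a single database query.
// coalesce("dashboard:revenue", () => loadRevenueFromDatabase());
```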

I also scheduled pre-warming cache jobs. The server absorbed a light load during off-peak hours, but when traffic peaked, data was ready. The dashboard stayed smooth, and the backend calmly handled 8–10x traffic without faltering.
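The pre-warming itself was just a scheduled loop over the hottest cache keys. A rough sketch, assuming node-cron and ioredis (the key list, schedule, and TTL are illustrative):

```typescript
// Sketch of a pre-warming job (node-cron and ioredis assumed; key names,
// the schedule, and the loader are illustrative).
import { schedule } from "node-cron";
import Redis from "ioredis";

const redis = new Redis();

const KEYS_TO_WARM = ["dashboard:revenue", "dashboard:bookings", "dashboard:payments"];

async function warmKey(key: string): Promise<void> {
  // Stand-in for the real loader: each key maps to a heavy aggregation query.
  const data = JSON.stringify({ warmedAt: new Date().toISOString() });
  await redis.set(key, data, "EX", 60 * 60);
}

// Run at 04:00 every day, before morning traffic ramps up.
schedule("0 4 * * *", async () => {
  for (const key of KEYS_TO_WARM) {
    await warmKey(key); // sequential on purpose: keep the off-peak load light
  }
});
```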


Request Prioritization and Selective Querying

Some Excel exports or analytics requests used to slow down critical operations. I implemented bulkhead and request prioritization, ensuring critical requests were processed first. Some analytics exports were slower, but the system remained responsive.
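One way to sketch the bulkhead and prioritization idea is with separate bounded pools per workload, for example with p-queue (assumed here; the concurrency limits and priorities are illustrative):

```typescript
// Sketch of bulkheads + prioritization with p-queue (assumed; limits are illustrative).
import PQueue from "p-queue";

// Separate pools (bulkheads): an export storm can never exhaust the
// capacity reserved for critical user-facing requests.
const criticalQueue = new PQueue({ concurrency: 20 });
const analyticsQueue = new PQueue({ concurrency: 2 });

function runCritical<T>(task: () => Promise<T>): Promise<T> {
  // Higher priority jumps ahead of anything already waiting in this pool.
  return criticalQueue.add(task, { priority: 10 }) as Promise<T>;
}

function runAnalyticsExport<T>(task: () => Promise<T>): Promise<T> {
  return analyticsQueue.add(task, { priority: 0 }) as Promise<T>;
}

// Usage:
// runCritical(() => loadBookingDetails(bookingId));
// runAnalyticsExport(() => buildExcelExport(reportParams));
```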

To avoid OOM, I queried only necessary fields and processed large exports in batches. Real-time data integrity was partially sacrificed, but the server survived, the dashboard remained smooth, and the feeling of victory ran through the system.
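The export then becomes a loop of small, column-selective queries driven by a keyset cursor, so no single query pulls the whole table into memory. A simplified sketch, assuming mysql2 and an illustrative transactions table:

```typescript
// Sketch of a batched export: select only the columns the report needs and
// page through with a keyset cursor. (mysql2 assumed; table and column names
// are illustrative.)
import mysql from "mysql2/promise";

const BATCH_SIZE = 5000;

async function exportTransactions(writeRow: (row: unknown) => Promise<void>): Promise<void> {
  const conn = await mysql.createConnection({ host: "localhost", user: "app", database: "travel" });
  let lastId = 0;

  while (true) {
    // Only the fields the export needs, never SELECT *.
    const [rows] = await conn.query(
      "SELECT id, customer_id, amount, created_at FROM transactions WHERE id > ? ORDER BY id LIMIT ?",
      [lastId, BATCH_SIZE],
    );
    const batch = rows as { id: number; customer_id: string; amount: number; created_at: Date }[];

    if (batch.length === 0) break;

    for (const row of batch) {
      await writeRow(row); // stream each row out instead of buffering the whole export
    }
    lastId = batch[batch.length - 1].id;
  }

  await conn.end();
}
```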


Monitoring and Alerting: Better Safe Than Sorry

During preparation, I set up continuous monitoring: CPU, memory, Redis hits, OpenSearch query latency, successful and failed request counts. I configured alerts for threshold breaches, so we received warnings before the system truly failed.

This way, I didn’t wait for the server to crash to know something was wrong—memory spikes or slow queries were reported immediately, allowing timely intervention.
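On the application side, this mostly meant exposing honest metrics. A minimal sketch with prom-client (assumed; the metric names and buckets are illustrative, and the actual threshold alerts lived in the monitoring stack, not in app code):

```typescript
// Metric definitions the dashboards and alerts were built on
// (prom-client assumed; names and buckets are illustrative).
import { Counter, Histogram } from "prom-client";

export const redisHits = new Counter({
  name: "dashboard_redis_cache_hits_total",
  help: "Dashboard reads served from the Redis precompute cache",
});

export const redisMisses = new Counter({
  name: "dashboard_redis_cache_misses_total",
  help: "Dashboard reads that fell through to the database",
});

export const openSearchLatency = new Histogram({
  name: "opensearch_query_duration_seconds",
  help: "Latency of read-model queries against OpenSearch",
  buckets: [0.05, 0.1, 0.25, 0.5, 1, 2, 5],
});

// Alert rules fire on thresholds (e.g. cache hit rate dropping, OpenSearch
// latency climbing, memory staying high for several minutes); the application's
// only job is to keep these numbers accurate.
```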


Chaos Testing and Load Testing

Before the peak season, my team ran load tests simulating peak traffic and performed chaos testing, intentionally breaking some services. These tests surfaced real problems: redundant caches, piled-up request queues, and potential deadlocks in the OpenSearch sync listeners. They also pushed us to prepare rollback plans, increase replicas, and adjust batch sizes.
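The real runs used a dedicated load-testing tool, but the spirit of the simplest scenario looks like this: hammer the dashboard endpoint with many concurrent simulated users and count failures. The URL and numbers below are illustrative:

```typescript
// Tiny load-generation sketch (illustrative only; the real tests used a proper tool).
const TARGET = "https://staging.example.com/api/dashboard/revenue";
const CONCURRENT_USERS = 200;
const REQUESTS_PER_USER = 50;

async function simulateUser(): Promise<{ ok: number; failed: number }> {
  let ok = 0;
  let failed = 0;
  for (let i = 0; i < REQUESTS_PER_USER; i++) {
    try {
      const res = await fetch(TARGET);
      if (res.ok) ok++;
      else failed++;
    } catch {
      failed++;
    }
  }
  return { ok, failed };
}

async function main(): Promise<void> {
  // 200 simulated users hitting the dashboard endpoint at once, roughly the
  // 8-10x multiplier we expected during peak season.
  const results = await Promise.all(Array.from({ length: CONCURRENT_USERS }, simulateUser));
  const ok = results.reduce((sum, r) => sum + r.ok, 0);
  const failed = results.reduce((sum, r) => sum + r.failed, 0);
  console.log(`completed: ${ok} ok, ${failed} failed`);
}

main();
```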


Rollout & Hotfix During Peak Hours

One night, during the traffic peak, a minor bug in the precomputed dashboard caused data to lag more than usual. I had to apply a hotfix directly in production, deploying carefully step by step while monitoring Redis and OpenSearch. It was tense and stressful, but once everything stabilized, it felt like we had truly survived a data storm.


Conclusion: Lessons Learned

After surviving the peak traffic, the dashboard ran smoothly, the backend was stable, and users were unaffected. Reflecting on the experience, I realized that preparation is everything: setting up monitoring, alerts, load testing, chaos testing, and pre-warming caches beforehand can make the difference between success and disaster.

Equally important is finding the root cause of issues. It’s easy to patch symptoms, but unless you understand the underlying problems—whether it’s read-heavy queries, unoptimized endpoints, or poorly synchronized data—the system will eventually break under stress.

Finally, there’s no perfect solution. Every choice involves trade-offs: sacrificing some UX smoothness, accepting minor delays in real-time data, increasing system complexity. Recognizing these trade-offs and planning for them ahead of time is the key to keeping a system alive during high-pressure peak traffic seasons.
