Feyisayo Lasisi
When Monitoring Saves the Day: How We Optimized Our Production Database Without Increasing Costs

It started like any other day until our monitoring system triggered an alert on the production database.
The alarm indicated that storage utilization on the instance had crossed its configured threshold, meaning free storage was approaching a critical level and required immediate attention.
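An alarm like this boils down to a simple threshold check. A minimal sketch (the 85% threshold and GiB figures here are illustrative, not our actual configuration):

```python
def storage_alarm(allocated_gib: float, free_gib: float,
                  threshold_pct: float = 85.0) -> bool:
    """Return True when used storage crosses the alarm threshold.

    The threshold value is hypothetical; in practice this check runs
    inside the monitoring system, not application code.
    """
    used_pct = (allocated_gib - free_gib) / allocated_gib * 100
    return used_pct >= threshold_pct

# 100 GiB allocated with only 12 GiB free -> 88% used, alarm fires
print(storage_alarm(100, 12))  # True
```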
As part of the investigation, I reviewed other key database metrics to better understand the overall system health. That’s when I noticed something interesting.
The database had been experiencing frequent spikes in CPU utilization, occasionally reaching as high as 89% of available vCPU capacity. These spikes were brief and intermittent, which explained why they had never tripped the CPU alarm.
Memory usage was also relatively high: roughly 80% of freeable memory was being consumed, but not to the point where the system had to fall back on swap. Swap usage sat at only about 1 MB, indicating that memory pressure had not yet begun to affect database performance.
To ensure performance was not already degraded, I checked additional metrics:
  • Read latency

  • Write latency

Both metrics were well within acceptable ranges, confirming that despite the high CPU and memory spikes, database performance remained stable.
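The triage logic above can be sketched as a small function: latency degradation means users are already affected, while high CPU or memory with healthy latency is a warning sign worth investigating. The thresholds below are illustrative assumptions, not our actual alarm settings:

```python
def classify_db_health(cpu_pct: float, mem_used_pct: float, swap_mb: float,
                       read_lat_ms: float, write_lat_ms: float,
                       lat_limit_ms: float = 10.0) -> str:
    """Rough health triage from the metrics reviewed above.

    All thresholds are illustrative; tune them to your workload.
    """
    if read_lat_ms > lat_limit_ms or write_lat_ms > lat_limit_ms:
        return "degraded"   # user-visible impact: act immediately
    if cpu_pct >= 85 or (mem_used_pct >= 80 and swap_mb > 100):
        return "at-risk"    # performance still fine, but investigate
    return "healthy"

# The incident's numbers: 89% CPU spikes, ~80% memory, ~1 MB swap, low latency
print(classify_db_health(89, 80, 1, 2.0, 3.0))  # "at-risk"
```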
Defining the Action Plan

After assessing the situation, I outlined the actions to present to the CTO.
1. Immediate Storage Increase
The first step was straightforward: increase the database storage to prevent reaching a critical capacity limit.
One advantage of using Amazon RDS is that storage scaling can occur without downtime, as the process runs in the background. After reviewing the cost implications, the increase was approved since the cost impact was minimal.
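One detail worth planning for: RDS requires the new allocated storage to be at least 10% larger than the current value, so the requested size may need rounding up. A sketch of that calculation (the sizes and the instance identifier in the comment are hypothetical):

```python
def next_allocated_storage(current_gib: int, desired_gib: int) -> int:
    """Pick a new AllocatedStorage value that satisfies the RDS rule that
    the new size must be at least 10% above the current size."""
    minimum = (current_gib * 11 + 9) // 10  # ceil(current * 1.1), integer math
    return max(desired_gib, minimum)

# e.g. growing a hypothetical 100 GiB instance
new_size = next_allocated_storage(100, 150)
print(new_size)  # 150

# The change itself would then be applied via boto3, roughly:
#   boto3.client("rds").modify_db_instance(
#       DBInstanceIdentifier="prod-db",   # hypothetical identifier
#       AllocatedStorage=new_size,
#       ApplyImmediately=True)
```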

2. Investigate CPU and Memory Spikes
The next challenge was the recurring spikes in CPU and memory utilization.
There were two possible approaches:
Option A: Increase the instance size
Upgrading the database instance to the next size would increase compute and memory capacity. However, because our architecture included both a primary database and a read replica, upgrading both instances would nearly double our RDS cost.
Option B: Optimize database queries
Since performance metrics were still healthy, I recommended first working with the engineering team to optimize database queries that might be inefficient or resource-intensive.
This approach would allow us to improve performance without immediately increasing infrastructure costs.
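The kind of win query optimization can deliver is easy to demonstrate in a standalone example. This sketch uses SQLite purely for portability (our production database and schema were different): adding an index turns a full table scan into an index search, which is exactly the sort of change that tames CPU spikes:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY,"
             " customer_id INTEGER, total REAL)")
conn.executemany("INSERT INTO orders (customer_id, total) VALUES (?, ?)",
                 [(i % 100, i * 1.5) for i in range(1000)])

query = "SELECT total FROM orders WHERE customer_id = 42"

def plan() -> str:
    """Return the query plan's detail text for the query above."""
    return " | ".join(row[3] for row in
                      conn.execute("EXPLAIN QUERY PLAN " + query))

plan_before = plan()
print(plan_before)   # full table scan, e.g. "SCAN orders"

conn.execute("CREATE INDEX idx_orders_customer ON orders (customer_id)")

plan_after = plan()
print(plan_after)    # e.g. "SEARCH orders USING INDEX idx_orders_customer ..."
```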

Safe Execution: Scaling Storage
Before applying the storage increase to the production database, I followed a cautious approach.
As a standard precaution, I scaled the read replica first and verified that the operation completed without unexpected side effects.
I then scaled the primary database's storage, which likewise finished without downtime or disruption.
Both changes were performed during a period of low traffic, allowing us to confirm that the operations did not impact uptime.

Collaboration with Engineering
Following this infrastructure adjustment, I connected with the Software Engineering Team Lead to discuss the observed resource spikes.
A Jira ticket was created and assigned to a developer to investigate and optimize the database queries responsible for the load.
The optimization work was first implemented and tested extensively in the staging environment. Once the improvements showed the desired results, the optimized queries were deployed to production.

The Result
After deploying the query optimizations, we observed a significant reduction in database resource usage:

  • CPU utilization dropped to approximately 60%

  • Memory consumption stabilized

  • Database performance remained strong

Summary
This experience reinforced an important principle in engineering:
Not every performance issue should be solved by scaling infrastructure.
Instead of immediately upgrading the database instance, which would have significantly increased our AWS costs, we focused on observability, analysis, and optimization.
By combining infrastructure scaling where necessary (storage) with application-level improvements (query optimization), we resolved the issue efficiently while keeping operational costs under control.

Final Thought
Good engineering is not just about making systems work.
It’s about making them work efficiently, reliably, and sustainably.
Sometimes the best solution isn't scaling up; it's understanding the system deeply enough to optimize what already exists.
