Mhamad El Itawi for AWS Community Builders

Posted on May 31

We Cut $120,000 from Our Cloud Bill Without Sacrificing Reliability


We were running a cloud-hosted platform on AWS EKS, with EC2 worker nodes managed by us, MongoDB Atlas for NoSQL workloads, AWS RDS for relational databases, and Amazon ElastiCache for Redis for caching and temporary data.

Over time, the infrastructure had grown the way most real systems grow: more services, more data, more backups, more images, more snapshots, and more “temporary” resources that were no longer temporary.

The platform worked, but the cloud bill was higher than it needed to be.

So we started cutting waste, improving the application, and resizing the infrastructure around how the system actually behaved.

The result: around $120,000 in annual savings, without sacrificing reliability.

The Problem Was Not One Big Thing

When we started reviewing the infrastructure, it was clear that there was no single expensive resource causing the entire problem.

The cost came from many places at once.

Some services were using more CPU and memory than they needed. Some microservices did not really need to be separate anymore. Some databases were oversized for their actual usage. Some storage had accumulated over time. Some backups and snapshots were kept longer than necessary. Some resources were simply unused.

That is usually how cloud costs grow.

Not because of one bad decision, but because of hundreds of small decisions that were reasonable at the time and never revisited later.

So instead of looking for one magic fix, we approached the problem from multiple angles: application code, architecture, databases, Kubernetes resources, storage, backups, caching, and non-production environments.

The Optimizations

1. Making the Application Use Fewer Resources

One of the most important parts of the optimization was improving the application itself.

It is easy to treat cloud cost as purely an infrastructure problem, but inefficient code directly affects infrastructure cost. If the application uses too much CPU or memory, the platform needs more pods, larger nodes, bigger instances, and more capacity overall.

We reviewed critical parts of the codebase and focused on reducing CPU and memory usage.

Some of the improvements included:

Reducing unnecessary in-memory object creation
Switching selected workflows to asynchronous and event-driven processing
Optimizing loops and replacing inefficient algorithms
Reducing heavy image-processing operations
Improving background jobs
Avoiding repeated or unnecessary work

These changes reduced CPU and memory usage by around 30%.

That had a direct impact on our AWS EKS environment. Since the workloads became lighter, we improved pod density and reduced the amount of EC2 capacity needed to run the platform.
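As a rough sketch of why lighter workloads need fewer nodes, the calculation below estimates node count from per-pod resource requests. All numbers (pod count, request sizes, allocatable node capacity) are hypothetical and not taken from the platform described here, and the model is deliberately simple: it assumes identical pods and ignores DaemonSets, per-node pod limits, and scheduling fragmentation.

```python
import math

def nodes_needed(pod_count, cpu_request_m, mem_request_mib,
                 node_cpu_m, node_mem_mib):
    """Estimate worker nodes needed when every pod has the same requests."""
    # The number of pods per node is capped by whichever resource runs out first.
    pods_per_node = min(node_cpu_m // cpu_request_m,
                        node_mem_mib // mem_request_mib)
    if pods_per_node == 0:
        raise ValueError("a single pod does not fit on this node size")
    return math.ceil(pod_count / pods_per_node)

# Hypothetical fleet: 120 pods on nodes with 3500m CPU / 14000 MiB allocatable.
nodes_before = nodes_needed(120, 500, 1024, 3500, 14000)
# Same fleet after cutting CPU and memory requests by roughly 30%.
nodes_after = nodes_needed(120, 350, 717, 3500, 14000)

print(nodes_before, nodes_after)  # prints "18 12"
```

In this made-up case CPU is the binding resource, so a 30% smaller CPU request lets each node hold 10 pods instead of 7. Real clusters will see a smaller or larger effect depending on which resource is the bottleneck.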

In other words, better code translated directly into a lower cloud bill.

2. Merging Microservices That Did Not Need to Be Separate

We also reviewed the microservices architecture.

Microservices are useful when they solve real problems: independent scaling, ownership, deployment flexibility, or fault isolation. But when services are too small, too tightly coupled, or always deployed together, they can create unnecessary overhead.

Each service adds cost in different ways. It needs CPU, memory, logging, monitoring, networking, deployments, configuration, and operational support.

After reviewing the system, we identified services that no longer needed to run independently. Some had low traffic. Some were tightly coupled. Others were always released together.

By merging selected microservices, we reduced the number of running containers, decreased inter-service communication overhead, simplified deployments, and reduced Kubernetes resource requests.

This helped make the platform easier to operate and cheaper to run.

3. Optimizing MongoDB Atlas

The platform used MongoDB Atlas for NoSQL workloads, so we reviewed how our clusters were configured and how the application interacted with them.

One of the changes we made was disabling unnecessary multi-write behavior where it was not required by the business use case. Before making the change, we validated the impact carefully to make sure reliability and data consistency would not be affected.

We also implemented autoscaling in MongoDB Atlas. Instead of sizing clusters permanently for peak demand, autoscaling allowed the database capacity to better follow real usage patterns.

This helped us reduce overprovisioning while still keeping the database layer ready for higher-traffic periods.

MongoDB Atlas was an important part of the optimization because database costs can grow quickly when clusters are oversized, write patterns are inefficient, or environments contain more data than they actually need.

4. Reducing Staging Data and Using Smaller Environments

Non-production environments are often ignored in cost reviews, but they can become surprisingly expensive.

Our staging environment had grown larger than it needed to be. It contained more data than required for testing and validation, and the infrastructure around it was closer to production-sized than necessary.

We reduced the amount of data in the staging database and resized the environment based on actual usage.

The goal was not to make staging useless or unrealistic. We still needed enough representative data to test properly. But we did not need to pay for production-like capacity all the time.

This helped reduce costs across database storage, compute, backups, and supporting infrastructure.

5. Reviewing AWS RDS

We also reviewed our AWS RDS relational databases.

Part of the work was upgrading databases to newer supported versions. This was important for security and maintainability, but it also helped with performance and resource efficiency.

Database upgrades can bring improvements in query planning, indexing behavior, memory usage, and general engine performance.

We also reviewed database sizing, backup configuration, retention periods, and actual usage patterns.

The idea was simple: RDS should be sized based on real workload needs, not old assumptions.

6. Cleaning Up the Resources Nobody Thinks About

A big part of the savings came from cleaning up resources that had quietly accumulated over time.

These were not exciting changes, but they mattered.

We cleaned up old container images in the registry. Every deployment creates images, and without proper retention, old images stay forever.
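A retention rule can do this cleanup automatically. The sketch below assumes the registry is Amazon ECR, which the article does not state, and builds a lifecycle policy document in ECR's JSON format. The retention values and the `release` tag prefix are illustrative assumptions, not the settings used on this platform.

```python
import json

def build_lifecycle_policy(untagged_days=14, keep_tagged=30,
                           tag_prefixes=("release",)):
    """Build an ECR lifecycle policy as a plain dict (hypothetical defaults)."""
    return {
        "rules": [
            {
                # Untagged images are usually leftovers from superseded builds.
                "rulePriority": 1,
                "description": f"Expire untagged images older than {untagged_days} days",
                "selection": {
                    "tagStatus": "untagged",
                    "countType": "sinceImagePushed",
                    "countUnit": "days",
                    "countNumber": untagged_days,
                },
                "action": {"type": "expire"},
            },
            {
                # Keep only the newest N images whose tag starts with a given prefix.
                "rulePriority": 2,
                "description": f"Keep only the {keep_tagged} most recent tagged images",
                "selection": {
                    "tagStatus": "tagged",
                    "tagPrefixList": list(tag_prefixes),
                    "countType": "imageCountMoreThan",
                    "countNumber": keep_tagged,
                },
                "action": {"type": "expire"},
            },
        ]
    }

policy = build_lifecycle_policy()
policy_text = json.dumps(policy, indent=2)
print(policy_text)

# Applying it would look roughly like this with boto3 (not executed here):
#   boto3.client("ecr").put_lifecycle_policy(
#       repositoryName="my-service",          # hypothetical repository name
#       lifecyclePolicyText=policy_text)
```

Other registries offer similar retention settings under different names, so the same idea applies even if the exact format differs.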

We reviewed and deleted unnecessary EC2 and database snapshots. Snapshots are important, but keeping every old snapshot forever is expensive and usually unnecessary.
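One way to make that review repeatable is to list snapshots older than a retention window and skip anything explicitly marked to keep. The sketch below works on dictionaries shaped like the entries returned by the EC2 `describe_snapshots` call (`SnapshotId`, `StartTime`, `Tags`). The 30-day window, the `retain` tag, and the sample data are assumptions for illustration only.

```python
from datetime import datetime, timedelta, timezone

def snapshots_to_review(snapshots, retention_days, now=None, keep_tag="retain"):
    """Return IDs of snapshots older than the retention window.

    Snapshots tagged keep_tag=true are never returned, so anything needed
    for compliance or recovery can be protected explicitly.
    """
    now = now or datetime.now(timezone.utc)
    cutoff = now - timedelta(days=retention_days)
    stale = []
    for snap in snapshots:
        tags = {t["Key"]: t["Value"] for t in snap.get("Tags", [])}
        if tags.get(keep_tag) == "true":
            continue
        if snap["StartTime"] < cutoff:
            stale.append(snap["SnapshotId"])
    return stale

# Hypothetical sample data with a fixed "now" so the result is reproducible.
fixed_now = datetime(2024, 6, 1, tzinfo=timezone.utc)
sample = [
    {"SnapshotId": "snap-a", "StartTime": datetime(2024, 1, 1, tzinfo=timezone.utc)},
    {"SnapshotId": "snap-b", "StartTime": datetime(2024, 5, 25, tzinfo=timezone.utc)},
    {"SnapshotId": "snap-c", "StartTime": datetime(2023, 1, 1, tzinfo=timezone.utc),
     "Tags": [{"Key": "retain", "Value": "true"}]},
]
stale_ids = snapshots_to_review(sample, retention_days=30, now=fixed_now)
print(stale_ids)  # prints "['snap-a']"
```

The function only reports candidates. Deleting them is left as a separate, reviewed step, which fits the point that snapshots should not be removed blindly.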

We deleted unused EC2 instances, including forgotten instances that were created for old R&D work, temporary experiments, migrations, or one-off tasks and were never shut down afterward.

We removed unused disks (EBS volumes, in AWS terms) that were no longer attached to active resources.

We released static IPs (Elastic IP addresses) that were allocated but no longer used.
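Unattached disks and unused static IPs are easy to detect from the EC2 API output. The sketch below filters dictionaries shaped like the entries returned by `describe_volumes` and `describe_addresses`: an EBS volume in the `available` state is not attached to any instance, and an Elastic IP address without an `AssociationId` is not associated with anything. The sample IDs are made up.

```python
def unattached_volumes(volumes):
    """Return IDs of EBS volumes that are not attached to any instance."""
    return [v["VolumeId"] for v in volumes if v.get("State") == "available"]

def unassociated_addresses(addresses):
    """Return allocation IDs of Elastic IPs that are allocated but unused."""
    return [a.get("AllocationId", a.get("PublicIp"))
            for a in addresses if "AssociationId" not in a]

# Hypothetical sample data in the shape of the EC2 API responses.
volumes = [
    {"VolumeId": "vol-1", "State": "in-use"},
    {"VolumeId": "vol-2", "State": "available"},
]
addresses = [
    {"AllocationId": "eipalloc-1", "AssociationId": "eipassoc-1"},
    {"AllocationId": "eipalloc-2"},
]

orphan_volumes = unattached_volumes(volumes)
orphan_addresses = unassociated_addresses(addresses)
print(orphan_volumes, orphan_addresses)  # prints "['vol-2'] ['eipalloc-2']"
```

With boto3, the inputs would come from `ec2.describe_volumes()["Volumes"]` and `ec2.describe_addresses()["Addresses"]`. Running a report like this on a schedule keeps forgotten resources from piling up again.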

We cleaned bucket storage by removing old documents, temporary files, duplicate files, exports, and artifacts that were no longer needed.

Individually, some of these items did not look huge. Together, they made a noticeable difference.

This was one of the biggest reminders from the whole process: cloud waste usually hides in boring places.

7. Fixing Amazon ElastiCache for Redis Memory Growth

Amazon ElastiCache for Redis was another area we reviewed.

It was used for caching, sessions, queues, and temporary data. But if temporary data does not have an expiration policy, it slowly becomes permanent data.

We introduced expiration policies for data that did not need to live forever.

This helped reduce memory usage, avoid stale data buildup, and reduce the pressure to use larger cache nodes.

The rule we followed was simple: if the data is temporary, it should have a lifetime.
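A minimal sketch of that rule, assuming a redis-py style client: writes go through a helper that always sets a TTL, and a second helper finds existing keys that have none (Redis reports a TTL of -1 for those). The helper names, key names, and the small in-memory stand-in client are hypothetical, and the stand-in exists only so the example runs without a server.

```python
import fnmatch

def cache_set(client, key, value, ttl_seconds):
    """Write a temporary value that expires on its own."""
    if ttl_seconds <= 0:
        raise ValueError("temporary data needs a positive lifetime")
    # redis-py: set(name, value, ex=seconds) stores the value with a TTL.
    return client.set(key, value, ex=ttl_seconds)

def keys_without_ttl(client, pattern="*"):
    """Return keys matching pattern that have no expiry (TTL of -1)."""
    return [k for k in client.scan_iter(match=pattern) if client.ttl(k) == -1]

class InMemoryClient:
    """Tiny stand-in for a Redis client; it tracks TTLs only, not values."""
    def __init__(self):
        self.ttls = {}  # key -> TTL in seconds, or -1 for "no expiry"

    def set(self, name, value, ex=None):
        self.ttls[name] = ex if ex is not None else -1
        return True

    def ttl(self, name):
        return self.ttls.get(name, -2)  # Redis returns -2 for missing keys

    def scan_iter(self, match="*"):
        return (k for k in list(self.ttls) if fnmatch.fnmatch(k, match))

client = InMemoryClient()
cache_set(client, "session:42", "payload", ttl_seconds=3600)
client.set("session:legacy", "payload")  # written without a lifetime
immortal_keys = keys_without_ttl(client, "session:*")
print(immortal_keys)  # prints "['session:legacy']"
```

Against a real instance, `keys_without_ttl` walks the keyspace with SCAN, so it is better run off-peak or against a replica.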

8. Reviewing Backups and Retention Policies

Backups were another important part of the review.

We did not want to reduce cost by blindly deleting backups. That would be risky and irresponsible.

Instead, we reviewed backup frequency, retention periods, database backups, snapshots, and recovery requirements across production and non-production environments.

The goal was to keep the backups we actually needed and remove unnecessary long-term retention where it did not provide real value.

This helped reduce costs while keeping recovery requirements intact.

Cost optimization should never come at the expense of being able to recover from failure.

9. Using Reserved Instances for Predictable Workloads

After reducing waste and improving utilization, we had a much clearer picture of our real baseline usage.

That made it easier to decide where reserved instances made sense.

Reserved capacity is useful when workloads are predictable, but it should not be the first step. If you reserve capacity before optimizing, you may simply commit to paying for waste.

We first reduced unnecessary usage, then used reserved instances for the workloads that were consistently running.
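The ordering matters because a reservation is paid for whether or not the capacity is used. The arithmetic below sketches the break-even point. The hourly rates are invented for illustration and are not real AWS prices or figures from this project.

```python
HOURS_PER_YEAR = 8760

def break_even_utilization(on_demand_hourly, reserved_effective_hourly):
    """Fraction of the year an instance must run for a reservation to pay off."""
    return reserved_effective_hourly / on_demand_hourly

def annual_saving(on_demand_hourly, reserved_effective_hourly, utilization):
    """Yearly saving from reserving instead of paying on demand.

    utilization is the fraction of hours the instance actually runs.
    A negative result means the reservation costs more than on-demand.
    """
    on_demand_cost = on_demand_hourly * HOURS_PER_YEAR * utilization
    reserved_cost = reserved_effective_hourly * HOURS_PER_YEAR  # paid regardless
    return on_demand_cost - reserved_cost

# Hypothetical rates: $0.10/h on demand, $0.06/h effective when reserved.
threshold = break_even_utilization(0.10, 0.06)
always_on = annual_saving(0.10, 0.06, utilization=1.0)
half_idle = annual_saving(0.10, 0.06, utilization=0.5)

print(round(threshold, 2), round(always_on, 2), round(half_idle, 2))
```

With these made-up rates, the reservation only pays off for an instance that runs more than about 60% of the time. An instance that is idle half the year would cost more reserved than on demand, which is why right-sizing comes first.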

That helped us reduce long-term compute costs while keeping flexibility for variable workloads.

10. Moving More Static Content to CDN

We also improved how static content was delivered.

Instead of serving static files directly from backend workloads, we relied more on CDN delivery.

This reduced load on the application, improved response times for users, and helped reduce backend compute usage.

It was one of those changes that improved both performance and cost efficiency at the same time.

What Actually Made the Difference

The biggest savings did not come from one magic change.

They came from doing many practical things well:

Making the application use around 30% less CPU and memory
Merging services that did not need to be separate
Optimizing MongoDB Atlas usage and autoscaling
Reviewing and right-sizing AWS RDS databases
Reducing staging data and using smaller non-production environments
Cleaning old container images, snapshots, disks, buckets, static IPs, and forgotten EC2 instances from old R&D work or temporary tasks
Adding expiration policies in Amazon ElastiCache for Redis
Reviewing backup and retention policies
Using reserved instances for predictable workloads
Moving more static content to CDN

None of these changes are glamorous on their own.

But together, they made the platform leaner, easier to operate, and significantly cheaper to run.

The result was around $120,000 in annual cloud savings, without sacrificing reliability.

For me, the biggest takeaway was this: cloud cost optimization is not only about infrastructure. It is about engineering discipline across the full system.

Architecture matters. Code efficiency matters. Database sizing matters. Cleanup habits matter. Backup policies matter. Non-production environments matter.

And when all of those things are reviewed together, the impact can be much bigger than expected.
