Moving to or starting with AWS (or any cloud provider) comes with an implicit assumption that your business will pay only for what it uses. Although this is technically true, most businesses ignore the human aspect. More often than not, developers make assumptions while allocating resources and end up with:
- Overprovisioned resources
- Unused resources
In this article, based on our experiences and AWS events/whitepapers, we will outline a few approaches to combat the ramifications of these decisions at any stage of product growth. The earlier we learn to manage them continuously, the more cost savings we can achieve.
Remember that saving costs means nothing if you are spending too much time on it or failing to achieve your business goals.
Cost optimization is not a one-time activity. You cannot throw people at the problem once and hope the results will outlive your business goals. Organizations and the smaller teams within them are more fluid than ever. They are constantly adapting to the external (Porter's Five Forces) and internal forces being applied to the business. Thus it is prudent to treat cost optimization as a continuous process.
One such useful technique is the Deming Cycle. PDCA (Plan-Do-Check-Act) is an iterative four-step management method used in business for the control and continuous improvement of processes and products.
For our cost optimization problem the four steps map to various opportunities and tools provided by AWS.
When developers launch new services or environments, cost optimization should be one of the metrics that is planned and predicted. Without compromising on functionality, a budget should be set along with any predictions on usage. This can help the team decide on the right instances and regions to use from the get-go.
In practice, a time horizon of a quarter works great for new services that have no historical data on usage patterns. AWS also provides auto scaling features that can help in quickly reacting to sudden spikes of usage.
Even if the service is already deployed, minor tweaks to auto scaling capacity (ECS or EC2) will allow you to get started on the cost optimization journey.
- Auto scaling
- Load tests
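The planning step above can be made concrete with a small budget estimate. The following is a minimal sketch, assuming a service with a baseline fleet and a daily peak window; the `ServicePlan` fields, the hourly price and the safety buffer are hypothetical placeholders, not AWS pricing.

```python
from dataclasses import dataclass

HOURS_PER_DAY = 24


@dataclass
class ServicePlan:
    """Predicted usage for one service (all values are assumptions)."""
    name: str
    hourly_price_usd: float    # hypothetical per-instance hourly price
    baseline_instances: int    # instances running outside the peak window
    peak_instances: int        # instances running during the peak window
    peak_hours_per_day: float  # length of the daily peak window


def quarterly_budget(plan: ServicePlan, days: int = 91, buffer: float = 0.15) -> float:
    """Estimate a quarterly budget in USD, padded by a safety buffer."""
    peak_hours = plan.peak_hours_per_day * days
    baseline_hours = (HOURS_PER_DAY - plan.peak_hours_per_day) * days
    instance_hours = (
        plan.baseline_instances * baseline_hours
        + plan.peak_instances * peak_hours
    )
    return round(instance_hours * plan.hourly_price_usd * (1 + buffer), 2)


if __name__ == "__main__":
    plan = ServicePlan("checkout-api", 0.10, 2, 6, 4)
    print(quarterly_budget(plan))
```

Load test results are a natural source for the peak instance count, and the buffer can shrink once a quarter of real usage data is available.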
You cannot improve what you cannot measure.
You cannot tweak the costs of your services unless you know how they are being utilized and what your spend pattern looks like. Identify your biggest spends and focus on them.
The first tool to help you measure is tags. Tags are a crucial and versatile way of categorizing and inspecting the utilization of your cloud resources at every level. Tags can be set on resources such as instances, volumes, EIPs, AMIs, LBs, etc.
Standardize on tags across your teams and business units. Add the bare minimum of dimensions, each carrying meaningful information for the org, such as:
- environment -
- team -
- geography -
These tags will give you different vantage points of your current and forecasted costs.
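A tag standard only helps if it is enforced and then used for reporting. The sketch below assumes a hypothetical inventory of resources, each with a `tags` dictionary and a `monthly_cost_usd` figure; it flags resources missing the three tag keys listed above and rolls up cost by any one tag.

```python
from collections import defaultdict

# Tag keys taken from the list above; the resource records are hypothetical.
REQUIRED_TAGS = ("environment", "team", "geography")


def missing_tags(tags: dict) -> list:
    """Return the required tag keys that are absent or empty."""
    return [key for key in REQUIRED_TAGS if not tags.get(key)]


def cost_by_tag(resources: list, key: str) -> dict:
    """Sum monthly cost per value of one tag; untagged spend gets its own bucket."""
    totals = defaultdict(float)
    for resource in resources:
        value = resource["tags"].get(key) or "untagged"
        totals[value] += resource["monthly_cost_usd"]
    return dict(totals)


if __name__ == "__main__":
    inventory = [
        {"id": "i-aaa", "monthly_cost_usd": 120.0,
         "tags": {"environment": "prod", "team": "payments", "geography": "eu"}},
        {"id": "i-bbb", "monthly_cost_usd": 80.0,
         "tags": {"environment": "staging", "team": "payments"}},
        {"id": "vol-ccc", "monthly_cost_usd": 40.0, "tags": {}},
    ]
    for resource in inventory:
        print(resource["id"], missing_tags(resource["tags"]))
    print(cost_by_tag(inventory, "team"))
```

Keeping an explicit "untagged" bucket makes the size of the unattributed spend visible, which is usually the first number worth driving down.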
Auto scaling plays an important role in mitigating unpredicted service behavior. Although we may plan for it, a sudden spike still needs to be handled without compromising service performance or user experience. Set up auto scaling where possible with EC2, Fargate or ECS containers.
AWS Compute Optimizer recommends optimal AWS resources for your workloads to reduce costs and improve performance by using machine learning to analyze historical utilization metrics. Compute Optimizer is available to you at no additional charge.
The best part about Compute Optimizer is that it not only recommends instance types but also visualizes a what-if scenario. This helps you understand how your workload would have performed on the recommended instance type.
Apart from Compute Optimizer's suggestions, you can also consider cheaper instance alternatives such as the AWS Graviton instances. If the workload can run on ARM, you can often get more performance at a lower monthly price.
If you're running EKS or ECS, you've got a control plane that is orchestrating the data plane. With EKS, you pay $0.10 per hour for each Amazon EKS cluster that you create. You can use a single Amazon EKS cluster to run multiple applications by taking advantage of Kubernetes namespaces and IAM security policies.
Hence, it's a good idea to analyse all your clusters and consolidate them into one or a few, as your architecture permits. You might split them by team or by application type: if you have many machine learning jobs and many web applications, for example, each type can get its own cluster with compute optimized for that workload. The larger the clusters, the more resources you can share and the better you can bin pack workloads onto EC2 instances.
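The control plane fee alone makes the case for consolidation easy to quantify. This is a rough sketch using the $0.10 per hour figure quoted above; the 730 hours per month and the cluster counts are assumptions for illustration.

```python
EKS_HOURLY_FEE_USD = 0.10   # per-cluster fee quoted in the text above
HOURS_PER_MONTH = 730       # assumption: average hours in a month


def control_plane_monthly_cost(clusters: int, hours: int = HOURS_PER_MONTH) -> float:
    """Monthly EKS control plane cost in USD for a number of clusters."""
    return round(clusters * EKS_HOURLY_FEE_USD * hours, 2)


def consolidation_savings(clusters_before: int, clusters_after: int) -> float:
    """Monthly control plane savings from reducing the cluster count."""
    return round(
        control_plane_monthly_cost(clusters_before)
        - control_plane_monthly_cost(clusters_after),
        2,
    )


if __name__ == "__main__":
    # Hypothetical example: eight per-team clusters merged into two.
    print(control_plane_monthly_cost(8))
    print(consolidation_savings(8, 2))
```

This counts only the control plane fee. Any gain from sharing nodes and bin packing workloads comes on top of it and depends on how well the workloads fit together.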
If your architecture is fault tolerant and can restart failed processes or jobs, you can try to move away from on-demand instances to spot instances. This has helped us personally shave off almost 60% of our primary costs. To stay available when spot capacity is reclaimed, we run a hybrid of 20% on-demand and 80% spot instances.
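The effect of a hybrid fleet on the bill is simple arithmetic. The sketch below uses the 20/80 split described above; the spot discount is an assumption, since actual spot prices vary by instance type, region and time.

```python
def blended_hourly_price(on_demand_price: float, spot_share: float,
                         spot_discount: float) -> float:
    """Average hourly price of a fleet that mixes on-demand and spot capacity.

    spot_share:    fraction of the fleet on spot (0.8 for an 80/20 split)
    spot_discount: assumed spot discount versus on-demand (0.75 means 75% off)
    """
    on_demand_part = (1 - spot_share) * on_demand_price
    spot_part = spot_share * on_demand_price * (1 - spot_discount)
    return on_demand_part + spot_part


def blended_savings(spot_share: float, spot_discount: float) -> float:
    """Fraction saved versus running the whole fleet on-demand."""
    return round(spot_share * spot_discount, 4)


if __name__ == "__main__":
    # Hypothetical inputs: $1.00/hour on-demand, 80% spot, 75% assumed discount.
    print(round(blended_hourly_price(1.00, 0.8, 0.75), 4))
    print(blended_savings(0.8, 0.75))
```

Because the on-demand share pays full price, the overall saving is always the spot share multiplied by the spot discount, so raising the spot share matters as much as finding cheaper spot pools.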
Your SLAs need to be met while you're optimizing. To react within your SLA, you must enable alerts and health checks where auto scaling does not help. Using AWS CloudWatch and similar tools, you should create a notification system to handle failures and spikes.
- Auto scaling
- AWS Compute Optimizer
Monitoring is essential for both cost analysis and performance variance. For cost, you have two options: Cost Explorer and Cost and Usage Reports (CUR). Cost Explorer is the more immediate of the two and gives you insight into your current spend. With CUR, the current month’s billing data is delivered to an Amazon S3 bucket that you designate during setup. You can receive hourly, daily or monthly reports that break out your costs by product or resource and by tags that you define yourself. AWS updates the report in your bucket at least once per day.
You can read an in-depth article on AWS Cost Allocation Tags and Cost Reduction.
Set alerts on your spend so that you are notified automatically when your predicted costs are exceeded.
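The idea behind such an alert can be sketched in a few lines. This is a simple linear run-rate projection written for illustration, not the forecasting AWS itself uses; the function names and the sample figures are hypothetical.

```python
import calendar
from datetime import date


def forecast_month_end(month_to_date_usd: float, today: date) -> float:
    """Project month-end spend by extending the average daily spend so far."""
    days_in_month = calendar.monthrange(today.year, today.month)[1]
    daily_rate = month_to_date_usd / today.day
    return daily_rate * days_in_month


def should_alert(month_to_date_usd: float, budget_usd: float, today: date,
                 threshold: float = 1.0) -> bool:
    """True when the projected spend exceeds threshold * budget."""
    return forecast_month_end(month_to_date_usd, today) > budget_usd * threshold


if __name__ == "__main__":
    # Hypothetical: $400 spent by April 10 against a $1,000 monthly budget.
    today = date(2024, 4, 10)
    print(forecast_month_end(400.0, today))
    print(should_alert(400.0, 1000.0, today))
```

A threshold below 1.0, such as 0.8, raises the alert earlier and leaves time to react before the budget is actually exceeded.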
- Cost Explorer
- Usage Reports
- AWS CloudWatch
Once your optimization efforts have yielded results, it is time to make them the norm and practice good hygiene.
First off, after a cycle of activities (quarter, product release, Black Friday sale), clear your unused resources. Identify them by running reports or checking usage, such as ALBs with no target groups, unused EIPs, very old snapshots, S3 buckets that have no new objects being added, unmounted EBS volumes, etc. Remove unhealthy or unused instances that may be up due to target capacity in Spot Requests.
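Part of this cleanup pass can be automated once you have an inventory of resources. The sketch below filters such an inventory for three of the cases mentioned above. The field names follow the EC2 API responses (`State`, `AssociationId`, `StartTime`), while the sample records and the 180-day snapshot cutoff are assumptions.

```python
from datetime import datetime, timedelta, timezone


def find_cleanup_candidates(volumes, addresses, snapshots, now,
                            max_snapshot_age_days=180):
    """Return IDs of resources that look unused and deserve a manual review."""
    cutoff = now - timedelta(days=max_snapshot_age_days)
    return {
        # An EBS volume in the "available" state is not attached to any instance.
        "unattached_volumes": [
            v["VolumeId"] for v in volumes if v["State"] == "available"
        ],
        # An Elastic IP without an AssociationId is not bound to anything.
        "unassociated_eips": [
            a["AllocationId"] for a in addresses if "AssociationId" not in a
        ],
        # Snapshots started before the cutoff count as "very old" here.
        "old_snapshots": [
            s["SnapshotId"] for s in snapshots if s["StartTime"] < cutoff
        ],
    }


if __name__ == "__main__":
    now = datetime(2024, 6, 1, tzinfo=timezone.utc)
    # Hypothetical inventory records.
    volumes = [{"VolumeId": "vol-1", "State": "in-use"},
               {"VolumeId": "vol-2", "State": "available"}]
    addresses = [{"AllocationId": "eipalloc-1", "AssociationId": "eipassoc-1"},
                 {"AllocationId": "eipalloc-2"}]
    snapshots = [{"SnapshotId": "snap-1",
                  "StartTime": datetime(2023, 1, 15, tzinfo=timezone.utc)},
                 {"SnapshotId": "snap-2",
                  "StartTime": datetime(2024, 5, 20, tzinfo=timezone.utc)}]
    print(find_cleanup_candidates(volumes, addresses, snapshots, now))
```

The function only reports candidates. Deleting them is deliberately left as a separate, reviewed step, since an old snapshot or a detached volume may still be someone's only backup.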
Standardize the clusters and ensure new services are launched within the same clusters. Use AWS CDK or CloudFormation to provide templates to developers.
- CloudFormation
- AWS CDK
- AWS CloudWatch
Finance folks who look at the bill once and pay it do not understand the business value of each line item. It is up to the developers to understand the workloads and the compute environment in order to continuously optimize and reduce costs.
Hope you enjoyed reading the article.