Scaling AWS costs to match the business

#aws #cost #netflixoss #cloudzero

I recently wrote a Medium post on cloud native cost optimization, in part to help customers who are currently dealing with rapid, large, and unexpected changes in their businesses due to the impact of COVID-19. Out of the ensuing discussion, a few things emerged. One is that I should try out dev.to for developer-oriented posts, so this is my first post here. Another is that there are benefits and challenges in generating a metric that reports AWS cost per unit of business, so that's the subject of this discussion.

The first challenge is to decide what your business does, and whether there is a dominant metric that measures the value you provide to customers. I was at Netflix in 2011 when we started to build tooling to optimize our AWS spend. Netflix had a very focused business model and measured customer value as the number of "streaming starts per second" (SPS), i.e. the rate at which people decide to start watching a show on Netflix.

We also had an AWS deployment model which tagged and attributed all the entities we created on AWS back to individuals and teams, and produced detailed billing on an hourly basis. Dividing the total AWS cost by SPS produced an hourly average total "cloud cost of value". Digging in further, the cost could be broken down by production delivery vs. test and development vs. data science vs. movie encoding, etc. Individual teams were sent a weekly report showing their own share of the total cost, and how it was trending.
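As a rough illustration of the arithmetic described above, here is a minimal Python sketch. The `HourlyRecord` type, its field names, the category names, and the sample figures are all hypothetical, made up for this example; it is not how the Netflix tooling was implemented.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class HourlyRecord:
    """One hour of billing joined with the business metric (hypothetical shape)."""
    hour: str        # label for the hour, e.g. "hour-1"
    cost_usd: float  # total AWS cost attributed to that hour
    avg_sps: float   # average streaming starts per second in that hour


def cost_per_unit(record: HourlyRecord) -> Optional[float]:
    """Total hourly cost divided by the business metric.

    Returns None when there was no activity, rather than dividing by zero.
    """
    if record.avg_sps <= 0:
        return None
    return record.cost_usd / record.avg_sps


def cost_by_category(costs: Dict[str, float], avg_sps: float) -> Dict[str, float]:
    """Break the same ratio down by spend category (production, test, etc.)."""
    if avg_sps <= 0:
        raise ValueError("avg_sps must be positive")
    return {category: cost / avg_sps for category, cost in costs.items()}


# Illustrative numbers only.
record = HourlyRecord(hour="hour-1", cost_usd=1200.0, avg_sps=400.0)
total_ratio = cost_per_unit(record)
breakdown = cost_by_category({"production": 800.0, "test": 400.0}, record.avg_sps)
```

Tracking the ratio rather than the raw bill means a growing bill is not automatically a problem: what matters is whether cost per unit of business is trending up or down.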

I've found that many customers don't have good tagging and attribution set up, so they find it hard to work out what is driving their AWS bill. The first step is to come up with a percentage attribution metric for the bill, and drive it to cover most of the spend. I'd focus on this as a priority until it's in the 70-90% range, then clean up the rest over time.
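One way to compute such a percentage attribution metric is sketched below. The line-item shape (a dict with `cost_usd` and `tags`) and the `team` tag key are assumptions for illustration, not the format of any particular AWS billing export.

```python
from typing import Dict, List


def attribution_coverage(line_items: List[Dict], tag_key: str = "team") -> float:
    """Percentage of total spend that carries a non-empty value for tag_key.

    Each line item is assumed to look like:
        {"cost_usd": 12.5, "tags": {"team": "search"}}
    Items with a missing or empty tag count as unattributed.
    """
    total = sum(item["cost_usd"] for item in line_items)
    if total == 0:
        return 0.0
    attributed = sum(
        item["cost_usd"]
        for item in line_items
        if item.get("tags", {}).get(tag_key)
    )
    return 100.0 * attributed / total


# Illustrative data only: one tagged item, one untagged, one with an empty tag.
sample_items = [
    {"cost_usd": 70.0, "tags": {"team": "search"}},
    {"cost_usd": 20.0, "tags": {}},
    {"cost_usd": 10.0, "tags": {"team": ""}},
]
coverage = attribution_coverage(sample_items)
```

Weighting by cost rather than by resource count keeps the focus on the large untagged items that dominate the bill.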

For a more typical complex and diverse business, with many points of value delivery, the trick is to pick a dominant expense that scales with customer activity. One of the travel industry customers I've been working with did this, and used the metric to drive a cost reduction program over the last nine months. Amongst many other optimizations, they implemented autoscaling to drive up average utilization. When COVID-19 hit, their customer traffic dropped, and their autoscalers maintained high utilization on a smaller footprint, so their AWS bill for that workload automatically reduced.

AWS is working with many customers and partners to help cost optimize in these uncertain times. After my Medium post, Erik Peterson of CloudZero reached out to me to discuss their product, which implements automatic tagging and allocation of metrics to help SaaS engineering teams continuously optimize their AWS spend. If this sounds interesting, CloudZero Inc and AWS are offering a 20% discount, a $1k credit, and a free 30-day trial with an upfront waste assessment through the end of May.

The original NetflixOSS tool that implemented this was called Ice. Netflix outgrew it and passed it over to Teevity, who maintain their own version.