
Murtaza 🐳 for AWS Community Builders


The Perpetual State of Anxiety When Working in the Cloud

Anyone who works in the cloud knows how it feels to constantly lose sleep over the fear of exceeding the budget. It would not be far-fetched to assume that the sheer power of the cloud alone has burned many startups. The question is, were they ready for the cloud?

Before we move forward, you should know two things about me: I am an avid Reddit explorer, and the cloud is my bread and butter. More often than not, I run into posts from people expressing their dismay over the alarmingly high costs of working in the cloud.


A screenshot of some Reddit posts about cloud billing.

One of the incidents, shared in a post on Reddit and then on Medium by the developers of Milkie Way under the title “We Burnt $72K testing Firebase + Cloud Run and almost went Bankrupt”, is very insightful. The budget alerts could not save them. The cloud is so heavily integrated that the scope of the incident was catastrophic. But what went wrong?

As the load increased after deployment and sanity/smoke tests, Firebase quickly shifted from the free tier to the paid tier, and GCP budget alerts were sent to the team within minutes. The bill rose exponentially, and they could not react fast enough.

Being Proactive Instead of Reactive

Cloud features have evolved and become multi-dimensional and even more complex, something businesses should be careful about. To put it simply, imagine a Lambda function that S3 triggers whenever there is a change in a bucket, and that same Lambda function saves its logs to the same bucket. A recursive loop?


A Demonstration of a Lambda Function that Saves Logs to the Same Bucket that Triggers it.
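To make the loop concrete, here is a minimal sketch of such a handler in Python with boto3; the bucket layout, the `logs/` prefix, and the log format are hypothetical, not taken from any real incident. The prefix guard is what breaks the recursion: without it, every log write would re-trigger the function.

```python
# A minimal sketch of the recursive S3 -> Lambda -> S3 pattern described above.
# Bucket structure, log prefix, and log format are hypothetical.
import json
import boto3

s3 = boto3.client("s3")

LOG_PREFIX = "logs/"  # assumption: log objects are written under this prefix


def handler(event, context):
    for record in event.get("Records", []):
        key = record["s3"]["object"]["key"]

        # Without this guard, the put_object below re-triggers this same
        # function, which writes another log, and so on: a runaway loop.
        if key.startswith(LOG_PREFIX):
            continue

        bucket = record["s3"]["bucket"]["name"]
        s3.put_object(
            Bucket=bucket,  # same bucket that triggered this invocation
            Key=f"{LOG_PREFIX}{key}.log",
            Body=json.dumps({"processed": key}),
        )
```

A prefix/suffix filter on the S3 event notification itself achieves the same protection at the configuration level.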

We all learn from our mistakes, and experimentation is never wrong. However, without proper research and load/capacity tests, nightmares like the one documented in “Chapter 11” of the post shared above become a reality. Anything done afterwards to salvage the damage is a reactive approach.

Besides, not everyone will be as lucky as the Reddit users who shared their stories. So how can we tackle this effectively?

Imagine the case below, where we throttle requests at the API Gateway and allow only a certain number of concurrent Lambda executions. If we rely only on these limited controls, what happens to the requests that the API Gateway blocks? Do they get logged somewhere, or are they discarded? What if those requests were placing orders? Are we losing revenue because our infrastructure cannot handle the load?
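For concreteness, the two blunt controls mentioned above could be applied roughly like this with boto3; the API id, stage name, function name, and limits are placeholders, not values from any real setup.

```python
import boto3

apigw = boto3.client("apigateway")
lam = boto3.client("lambda")

# Throttle every method on the stage (placeholder API id, stage, and limits).
apigw.update_stage(
    restApiId="abc123",
    stageName="prod",
    patchOperations=[
        {"op": "replace", "path": "/*/*/throttling/rateLimit", "value": "100"},
        {"op": "replace", "path": "/*/*/throttling/burstLimit", "value": "200"},
    ],
)

# Cap concurrent executions so a spike cannot fan out without limit.
lam.put_function_concurrency(
    FunctionName="order-handler",          # placeholder function name
    ReservedConcurrentExecutions=20,
)
```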

We have to start thinking differently. We have to treat the infrastructure as part of the application, because infrastructure now behaves like code, and code is something we can control completely.

Mechanisms to Control Scalability

We need a kill switch! Or, to use a more appropriate phrase, we need to control scalability.

There is, however, the matter of how this looks to the end user. We can always design a solution around the end-user experience. For example, we can accept requests into asynchronous queues and let end users know their request is in the pipeline, or we can put up a maintenance page. There are more ways to handle end users than the ones mentioned here.
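As a rough sketch of the asynchronous-queue option, assuming an SQS queue sits between the API and the workers (the queue URL is a placeholder), the front-end handler can simply acknowledge the request and defer the actual work:

```python
import json
import boto3

sqs = boto3.client("sqs")
QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/orders"  # placeholder


def handler(event, context):
    # Queue the request for later processing instead of rejecting it
    # outright when the backend is saturated.
    sqs.send_message(QueueUrl=QUEUE_URL, MessageBody=event.get("body", "{}"))

    # Tell the end user their request is in the pipeline.
    return {
        "statusCode": 202,
        "body": json.dumps({"message": "Your request is in the pipeline."}),
    }
```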

Let us discuss the kill switch now, a proactive approach that anticipates malicious actors or sudden spikes. In brief, a kill switch throttles requests or disables the service that is generating the billing spike. It is best described in ‘Poor man’s kill switch’ by Nicola Apicella, which covers the token bucket algorithm: requests are throttled when the bucket runs out of coins. We can run a capacity test to finalise the number of tokens in the bucket, i.e. the number of requests we can entertain at a time.
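A minimal, self-contained sketch of a token bucket in Python follows; the capacity and refill rate are made-up numbers that a real capacity test would replace.

```python
import threading
import time


class TokenBucket:
    """Refills `rate` tokens per second up to `capacity`; a request is
    allowed only if a token (coin) is available."""

    def __init__(self, capacity: int, rate: float):
        self.capacity = capacity
        self.rate = rate
        self.tokens = float(capacity)
        self.updated = time.monotonic()
        self.lock = threading.Lock()

    def allow(self) -> bool:
        with self.lock:
            now = time.monotonic()
            # Top the bucket up with tokens accrued since the last check.
            self.tokens = min(self.capacity,
                              self.tokens + (now - self.updated) * self.rate)
            self.updated = now
            if self.tokens >= 1:
                self.tokens -= 1
                return True
            return False  # bucket is out of coins: throttle the request


bucket = TokenBucket(capacity=100, rate=10)  # placeholder numbers

if bucket.allow():
    pass  # forward the request downstream
else:
    pass  # throttle: return 429, queue the request, or show a maintenance page
```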

We can also use the token bucket to auto scale the environment. For example, we create CloudWatch alarms to monitor the load; once it reaches the threshold, the infrastructure starts scaling up, after which we can increase the number of coins in the bucket.
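One way such an alarm could be created with boto3 is sketched below; the metric, names, threshold, and SNS topic are all placeholders, and the subscriber behind the topic would be responsible for scaling up and then topping up the bucket.

```python
import boto3

cloudwatch = boto3.client("cloudwatch")

# Placeholder names, threshold, and SNS topic.
cloudwatch.put_metric_alarm(
    AlarmName="api-load-high",
    Namespace="AWS/ApiGateway",
    MetricName="Count",
    Dimensions=[{"Name": "ApiName", "Value": "orders-api"}],
    Statistic="Sum",
    Period=60,
    EvaluationPeriods=3,
    Threshold=5000,
    ComparisonOperator="GreaterThanThreshold",
    AlarmActions=[
        # e.g. an SNS topic whose subscriber scales the infrastructure
        # and then adds more coins to the bucket
        "arn:aws:sns:us-east-1:123456789012:scale-up",
    ],
)
```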

The simple implementation below depicts the fine-grained control we would have in this design, i.e., controlled scalability. We do not throttle any requests at the API Gateway, because we want to handle every one of them. Instead, we keep a bucket full of coins: each request takes a coin from the bucket and recycles it once it has been handled. If we want to allow more concurrent executions, we increase the number of coins in the bucket. Other services scale on their own, assuming auto scaling is enabled.


A Token Bucket Algorithm Gives Us Control over the Number of Requests We Can Entertain at a Given Time.
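One possible interpretation of this coin bucket, sketched with a DynamoDB atomic counter, is shown below; the table name, key, and attribute are hypothetical. Every request takes a coin before it runs and returns it afterwards, so raising the stored `coins` value directly raises the number of concurrent executions we allow.

```python
import boto3
from botocore.exceptions import ClientError

dynamodb = boto3.client("dynamodb")
TABLE = "coin-bucket"  # placeholder table with one item holding the coin count


def take_coin() -> bool:
    """Atomically take one coin; returns False when the bucket is empty."""
    try:
        dynamodb.update_item(
            TableName=TABLE,
            Key={"pk": {"S": "bucket"}},
            UpdateExpression="SET coins = coins - :one",
            ConditionExpression="coins > :zero",
            ExpressionAttributeValues={":one": {"N": "1"}, ":zero": {"N": "0"}},
        )
        return True
    except ClientError as err:
        if err.response["Error"]["Code"] == "ConditionalCheckFailedException":
            return False  # no coins left: hold the request back
        raise


def return_coin() -> None:
    """Recycle the coin once the request has been handled."""
    dynamodb.update_item(
        TableName=TABLE,
        Key={"pk": {"S": "bucket"}},
        UpdateExpression="SET coins = coins + :one",
        ExpressionAttributeValues={":one": {"N": "1"}},
    )
```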

This solution has another use case as well. For example, if we run multiple environments, we can keep the lower ones within budget by maintaining a token bucket for each. Leonti Bielski uses something similar to the approach in the article shared above, except that he bluntly shuts down or terminates the services.

Then again, any solution is better than no solution. Nobody wants to wake up to a huge bill waiting for them.

Conclusion

What we have discussed in this blog is a concept. We need to realise that the cloud is no longer a static environment where you configure the infrastructure by filling in parameters. With serverless and microservice architectures, the cloud is now dynamic, and concepts from the on-premises era no longer hold sway. Even a beginner will realise, after using the cloud for a while, that bills can never be estimated precisely. This problem hits low-budget startups hard and could be a potential sinkhole in enterprises. Untested and poorly provisioned infrastructure is a nightmare waiting to happen.

