CAST AI

Posted on Dec 17, 2021 • Originally published at cast.ai

How to use spot instances during the Christmas frenzy?


You’ve been doing a great job using spot instances for the entire year. But as 2021 is coming to a close, you might have noticed that there’s less inventory of your favorite instance types.

And even when you manage to get your hands on some, they get reclaimed rather quickly. Everyone else needs extra capacity during the holiday season.

Can you even trust spot instances to reduce your costs at such a critical time? Should you run everything on on-demand instances and pay the premium just to keep the application running?

Those are the questions we will answer in this article.

Jump to the section that’s most relevant to you:

Dealing with spot instance interruptions manually is hard
And what if you handled the spot interruptions automatically?
Automation takes many tasks off your plate
Here’s how spot instance automation works in practice
The results may surprise you

Dealing with spot instance interruptions manually is hard

Hyperscalers like AWS, Google Cloud, and Azure have massive data centers scattered around the globe. At some point, they find themselves with spare capacity and they offer it at a discounted price.

The only catch is that the vendor can pull the plug at any time. This makes spot instances generally difficult to manage for production workloads. When getting a spot instance, you have no guarantee on how long it will stay available.

Vendors only share some historical data about past interruptions to help you make a more informed decision, or notify you when the risk of interruption increases for the instance you’re using, which gives you some extra time.

A spot instance is reclaimed on short notice - from 2 minutes on AWS to as little as 30 seconds on Azure and Google Cloud. Is that enough time to drop everything and find a replacement for your instance? You may know a superhuman engineer who’s able to do that, but they surely have more productive tasks to do.

Spinning up a new instance takes time

And even if you identify a virtual machine to move your workload to, creating a new one takes time, so you still face potential downtime.

Running paused machines means extra costs

Alternatively, you could have some paused machines at your disposal that could step in if you happen to lose a spot instance. But this comes at an additional price, and aren’t you in the spot business to save on your cloud costs?

Spot instance availability changes rapidly

Another issue is that the available capacity sold as spot instances can differ a lot based on size, region, time of day, and many other factors - all subject to frequent changes.

The availability of an instance is based on supply and demand. So if you pick the most popular instance types and a market surge like Black Friday occurs, prepare for a nasty surprise.

And what if you handled the spot interruptions automatically?

Using an automation platform addresses all the problems listed above and makes all the difference when it comes to your cloud bill. You can rely on spot instances even during the most intense times of the year.

At CAST AI, we have the Spot Fallback feature that keeps your workloads running even if the cloud provider sells out of capacity and there are no spot instances left for you to use. If no spot instances matching your requirements are available, the autoscaler temporarily adds an on-demand node so your spot-only workloads have a place to run.

Once the inventory of Spot/Preemptible instances is available again, the autoscaler replaces the on-demand nodes used during the fallback with actual spot instances.
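The fallback behavior described above can be sketched in a few lines of Python. This is only an illustration of the decision logic, not the CAST AI implementation: the `Node` type and the `provision` and `reconcile` functions are hypothetical names, and spot availability is reduced to a single boolean.

```python
from dataclasses import dataclass


@dataclass
class Node:
    name: str
    lifecycle: str            # "spot" or "on-demand"
    is_fallback: bool = False  # True if added only because spot was unavailable


def provision(spot_available: bool, name: str) -> Node:
    """Prefer spot; add a temporary on-demand node when no spot capacity matches."""
    if spot_available:
        return Node(name, "spot")
    return Node(name, "on-demand", is_fallback=True)


def reconcile(nodes: list[Node], spot_available: bool) -> list[Node]:
    """Once spot inventory returns, swap fallback nodes back to spot."""
    if not spot_available:
        return nodes
    return [Node(n.name, "spot") if n.is_fallback else n for n in nodes]
```

Regular on-demand nodes are left alone by `reconcile`; only the nodes flagged as fallback are replaced.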

That way you can still get the benefits of spot instances without having to be anxious about the potential downtime.

Automation takes many tasks off your plate

But arguably the best part is that you don’t need to care about provisioning instances. Moving a workload from one spot instance to another is challenging if you do it manually because it’s a multi-step process:

1. Examine the cloud provider offer

You can check the less popular instances - they're less likely to be interrupted and can run for longer periods of time. Before deciding to buy an instance, look at how often it is interrupted (its interruption or eviction rate).

2. Make your bid

Set the maximum price you’re willing to pay for your preferred spot instance. A rule of thumb is to set it at the level of on-demand pricing, so you don’t get interrupted when the price of that instance spikes.

3. Manage spot instances in groups

This lets you request a variety of instance types at the same time, which increases your chances of securing a spot instance. You can get more tips about managing spot instances here: Spot instances: How to reduce AWS, Azure, and GCP costs by 90%
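To show how much bookkeeping the three manual steps involve, here is a rough Python sketch. Everything in it is an assumption made for illustration: the `Offer` type, the `build_spot_request` function, the 10% interruption threshold, and the group size of three are not taken from any cloud provider API.

```python
from dataclasses import dataclass


@dataclass
class Offer:
    instance_type: str
    on_demand_price: float    # USD per hour
    interruption_rate: float  # fraction of instances reclaimed, 0.0 to 1.0


def build_spot_request(offers, max_interruption_rate=0.10, group_size=3):
    # Step 1: keep only instance types that are rarely interrupted.
    stable = [o for o in offers if o.interruption_rate <= max_interruption_rate]
    stable.sort(key=lambda o: o.interruption_rate)
    # Step 3: request several types at once to improve the odds of getting one.
    group = stable[:group_size]
    # Step 2: cap each bid at the on-demand price of that type.
    return [
        {"instance_type": o.instance_type, "max_price": o.on_demand_price}
        for o in group
    ]
```

You would have to rerun this selection whenever prices or interruption rates change, which is the maintenance burden in question.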

To make spot instances work and secure the right mix of instances, you’ll have to dedicate a lot of time and effort to configuration, setup, and maintenance.

A good automation platform will do all of this for you. All you need to do is set up the right policies and see your workloads gracefully moved from one instance to another when an interruption happens. Here’s how.

Here’s how spot instance automation works in practice

CAST AI is an automation platform that uses a mix of tactics to optimize cloud costs and help teams get the performance they need. This approach extends to how spot instances are automated in CAST AI.

The platform doesn’t stick to a predefined list of instance types, but instead scans workloads for spot suitability and matches their requirements automatically. You can edit the list of spot-suitable workloads afterwards.

This flow is a lot faster than selecting the instance types you prefer manually and picking the pods which shouldn’t run on spot instances at all. The AI checks it all for you in minutes.

During the times when everyone is competing for spot instances they know and use, you will be playing in a mostly empty playground of ALL instance types suitable for your application.

Configuring the automation

You can automate spot instances on Kubernetes workloads running on EKS, Kops, AKS, and GKE. Here’s a short overview of how you can configure CAST AI to cover all the scenarios your application might face.

Tolerations

This configuration comes in handy when spot instances are an optional choice for your workload.

When your pod is marked only with tolerations, the Kubernetes scheduler can place the pod on both spot and regular nodes, always picking the most cost-effective resources.
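As a rough illustration, the Python snippet below builds the pod spec fragment for this case. The `tolerations` field and its `key` and `operator` entries are standard Kubernetes; the key `scheduling.cast.ai/spot` and the helper name `with_spot_toleration` are assumptions here, so check the platform documentation for the exact key before using it.

```python
import json

# Assumed taint key for spot nodes; confirm it against the platform docs.
SPOT_KEY = "scheduling.cast.ai/spot"


def with_spot_toleration(pod_spec: dict) -> dict:
    """Return a copy of the pod spec that tolerates the spot taint."""
    spec = dict(pod_spec)
    spec["tolerations"] = list(pod_spec.get("tolerations", [])) + [
        {"key": SPOT_KEY, "operator": "Exists"}
    ]
    return spec


pod_spec = {"containers": [{"name": "web", "image": "nginx"}]}
print(json.dumps(with_spot_toleration(pod_spec), indent=2))
```

A toleration only permits scheduling on spot nodes; it does not require it, which is why the pod can still land on regular nodes.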

Node Selectors

In this configuration, workloads will only use spot instances. The autoscaling mechanism will pick a spot instance whenever your pods require additional capacity in the cluster.
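A spot-only pod spec might look like the fragment built below. `nodeSelector` and `tolerations` are standard Kubernetes fields; the label and taint key `scheduling.cast.ai/spot`, the value `"true"`, and the helper name `spot_only` are assumptions made for this sketch.

```python
import json

# Assumed label and taint key for spot nodes; confirm it against the platform docs.
SPOT_KEY = "scheduling.cast.ai/spot"


def spot_only(pod_spec: dict) -> dict:
    """Return a copy of the pod spec that can run on spot nodes only."""
    spec = dict(pod_spec)
    # The node selector restricts the pod to nodes carrying the spot label.
    spec["nodeSelector"] = {**pod_spec.get("nodeSelector", {}), SPOT_KEY: "true"}
    # The toleration lets the pod onto nodes that carry the spot taint.
    spec["tolerations"] = list(pod_spec.get("tolerations", [])) + [
        {"key": SPOT_KEY, "operator": "Exists"}
    ]
    return spec


pod_spec = {"containers": [{"name": "worker", "image": "busybox"}]}
print(json.dumps(spot_only(pod_spec), indent=2))
```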

Node Affinity

Here we use spot instances if they’re available - if not, the application falls back to on-demand instances.

For example, if a spot instance gets interrupted and the on-demand instances in the cluster have some available capacity, pods that previously ran on that spot instance will be scheduled on the available on-demand resources. Moving your workloads back to spot instances is possible via the Rebalancer feature.
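One way to express "spot if available, on-demand otherwise" in plain Kubernetes is a preferred node affinity combined with a toleration, sketched below in Python. The `preferredDuringSchedulingIgnoredDuringExecution` structure is standard Kubernetes; the key `scheduling.cast.ai/spot` and the helper name `prefer_spot` are assumptions, and the platform may configure this differently.

```python
import json

# Assumed label and taint key for spot nodes; confirm it against the platform docs.
SPOT_KEY = "scheduling.cast.ai/spot"


def prefer_spot(pod_spec: dict, weight: int = 100) -> dict:
    """Return a copy of the pod spec that prefers, but does not require, spot nodes."""
    spec = dict(pod_spec)
    spec["tolerations"] = list(pod_spec.get("tolerations", [])) + [
        {"key": SPOT_KEY, "operator": "Exists"}
    ]
    # Note: this sketch replaces any affinity already present on the pod spec.
    spec["affinity"] = {
        "nodeAffinity": {
            "preferredDuringSchedulingIgnoredDuringExecution": [
                {
                    "weight": weight,  # Kubernetes accepts 1 to 100
                    "preference": {
                        "matchExpressions": [
                            {"key": SPOT_KEY, "operator": "Exists"}
                        ]
                    },
                }
            ]
        }
    }
    return spec


pod_spec = {"containers": [{"name": "api", "image": "nginx"}]}
print(json.dumps(prefer_spot(pod_spec), indent=2))
```

Because the affinity is a preference rather than a requirement, the scheduler can still place the pod on an on-demand node when no spot node has room.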

Spot Reliability

This configuration focuses on minimizing the chance of workload interruptions. The autoscaler identifies which instance types are less likely to be interrupted and you can set a default reliability value to be applied across the entire cluster.

The reliability value is measured by the percentage of reclaimed instances during the trailing month for this instance type.

You can control the reliability value at a more granular level, per workload. For example, you can leave the most cost-efficient value globally and choose more stable instances for specific workloads.
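The sketch below illustrates the idea of a cluster-wide default with a per-workload override. It assumes the reliability value is simply reclaimed instances divided by launched instances over the trailing month, expressed as a percentage; the function names and the shape of the `stats` input are hypothetical.

```python
def interruption_pct(launched: int, reclaimed: int) -> float:
    """Percentage of instances of one type reclaimed during the trailing month."""
    if launched == 0:
        return 0.0
    return 100.0 * reclaimed / launched


def allowed_types(stats, cluster_max_pct, workload_max_pct=None):
    """Instance types within the limit; a per-workload limit overrides the default.

    stats maps instance type -> (launched, reclaimed) for the trailing month.
    """
    limit = cluster_max_pct if workload_max_pct is None else workload_max_pct
    return sorted(
        itype
        for itype, (launched, reclaimed) in stats.items()
        if interruption_pct(launched, reclaimed) <= limit
    )
```

For example, a cluster default of 25 with a workload override of 10 would let most pods use any listed type while restricting the sensitive workload to the types reclaimed at most 10% of the time.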

The results may surprise you

A few months ago, we decided to check whether we were really using the best instances available and were shocked to see this spot instance recommendation: the INF1 instance. Why would CAST AI pick that for us? It’s essentially a supercomputer for high-performance ML inference that costs a lot if you buy it at on-demand pricing.

Funnily enough, the CAST AI platform didn’t go off the rails. The INF1 spot instance was actually cheaper at that point in time than the typical instance we were getting.

So while everyone was manually competing for what they knew, we rented a Ferrari for the price of a Fiat just because automation looked at every single instance type that can do the job.

If we were choosing instances manually, we’d never look into this category. Automation expanded our reach and got us this gem of an instance.
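The effect is easy to reproduce in a toy model: pick the cheapest spot offer that satisfies the workload, once from a hand-picked allowlist and once from every type on offer. The `SpotOffer` type and the `cheapest_fit` function are hypothetical names used only for this sketch.

```python
from dataclasses import dataclass


@dataclass
class SpotOffer:
    instance_type: str
    vcpus: int
    memory_gib: float
    spot_price: float  # USD per hour at this moment


def cheapest_fit(offers, min_vcpus, min_memory_gib, allowed_types=None):
    """Cheapest offer meeting the requirements, or None if nothing fits.

    allowed_types mimics a manual shortlist; None means consider every type.
    """
    candidates = [
        o
        for o in offers
        if o.vcpus >= min_vcpus
        and o.memory_gib >= min_memory_gib
        and (allowed_types is None or o.instance_type in allowed_types)
    ]
    return min(candidates, key=lambda o: o.spot_price, default=None)
```

With a shortlist, an unusual but temporarily cheap type is never even considered; without one, it wins whenever its spot price dips below the familiar options.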

If you are ready to quit competing on the same level everyone’s on - let us know and we’ll happily show you around the CAST AI platform.