Making the most of your Machine Learning budget on Amazon SageMaker

By Julien Simon. Originally published at Medium.

In the last 10 years or so, developers have been incredibly successful at running and scaling all kinds of workloads on AWS. Above all, they’ve learned to make the most of the pay-as-you-go model, making sure they don’t spend a dollar more than necessary.

What do you mean by “we’ve spent all your budget on GPUs already”?

Holding on to the firm belief that Machine Learning (ML) is no different from anything else (sue me), this post will present a number of cost optimization techniques, most of them for Amazon SageMaker, our popular fully-managed ML service. These techniques should help you avoid anti-patterns and keep your bills to the absolute minimum. Whether you spend the savings on fermented beverages or on extra PoC budget is entirely up to you ;)

Data preparation

Every ML project needs a dataset, and sometimes you have to build it from scratch. This usually means cleaning and labeling vast amounts of data, a time-consuming (and costly) task if there ever was one.

You can save money by:

  1. Resisting the urge to build ad-hoc ETL tools running on instance-based services like EC2 or EMR. AWS has a nice selection of fully-managed services that will save you time and money, both on coding and infrastructure management. I’d recommend looking at Amazon Athena (analytics in SQL), AWS Glue (Spark-based ETL) as well as AWS Lake Formation (data lake). The latter is still in preview (feel free to join), and it includes very promising ML capabilities (aka ML transforms) that can automatically link or deduplicate data: more info in this re:Invent 2018 session.
  2. Using Amazon SageMaker Ground Truth and active learning to cut down on data labeling costs. Ground Truth is a new service launched at re:Invent 2018. Not only does it provide intuitive tools to label text, image or custom datasets, it also supports an automatic labeling technique called active learning. In a nutshell, active learning uses manually labeled data to train an ML model, which is in turn capable of labeling data. This can reduce manual labeling by up to 70%: not only will your data get labeled faster, you’ll also save a lot of time and money on human resources.
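
To make the active learning idea a bit more concrete, here is a toy sketch in Python. It is not how Ground Truth works internally: the 1-D nearest-centroid "model", the 0.6 confidence threshold, the sample values and all function names are assumptions picked for illustration. The point is only the loop: train on a small manually labeled seed, auto-label what the model is confident about, and send the rest to humans.

```python
from statistics import mean

def train_centroids(labeled):
    """Fit a toy model: one centroid (mean value) per label."""
    by_label = {}
    for value, label in labeled:
        by_label.setdefault(label, []).append(value)
    return {label: mean(values) for label, values in by_label.items()}

def predict(centroids, value):
    """Return (label, confidence) for a value. Assumes at least two labels.

    Confidence is the margin between the two closest centroids, in [0, 1].
    """
    ranked = sorted(centroids.items(), key=lambda item: abs(value - item[1]))
    (best_label, best_centroid), (_, second_centroid) = ranked[0], ranked[1]
    d1, d2 = abs(value - best_centroid), abs(value - second_centroid)
    total = d1 + d2
    confidence = 0.0 if total == 0 else (d2 - d1) / total
    return best_label, confidence

def split_by_confidence(centroids, unlabeled, threshold=0.6):
    """Auto-label confident samples, route the others to human labelers."""
    auto_labeled, needs_human = [], []
    for value in unlabeled:
        label, confidence = predict(centroids, value)
        if confidence >= threshold:
            auto_labeled.append((value, label))
        else:
            needs_human.append(value)
    return auto_labeled, needs_human

# Hypothetical data: a small manually labeled seed and a pool of unlabeled samples.
seed = [(1.0, "cat"), (2.0, "cat"), (9.0, "dog"), (10.0, "dog")]
unlabeled = [0.5, 1.5, 5.0, 6.0, 9.5, 11.0]

model = train_centroids(seed)
auto_labeled, needs_human = split_by_confidence(model, unlabeled)
print(f"auto-labeled: {len(auto_labeled)}, sent to humans: {len(needs_human)}")
```

In a real loop, the human-labeled samples would be added to the seed and the model retrained, so that fewer and fewer samples need manual work.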


Once you have a dataset to work with, it’s time to start exploring and experimenting. Jupyter notebooks are a popular way to do this, which is why Amazon SageMaker provides fully-managed notebook instances, pre-installed with most of the tools you’ll need.

You can save money by:

  1. Not using notebook instances! You don’t have to use them to work with Amazon SageMaker: you can perfectly well start experimenting on a local machine (or an on-premises server), and then use the SageMaker SDK to fire up training and deployment on AWS.
  2. Stopping notebook instances when you don’t need them. This sounds like an obvious one, but are you really doing it? There’s really no reason to leave notebook instances running: commit your work, stop them and restart them when you need them again. Storage is persisted and you can also use lifecycle configuration to automate package installation or repository synchronization.
  3. Experimenting at small scale and right-sizing. Do you really need the full dataset to start visualising data and evaluating algorithms? Probably not. By working on a small fraction of your dataset, you’ll be able to use smaller notebook instances. Here’s an example: imagine 5 developers working 10 hours a day on their own notebook instance. With ml.t3.xlarge, the daily cost is 5*10*$0.233=$11.65. With ml.c5.2xlarge (more oomph and more RAM to support a large dataset): 5*10*$0.476=$23.80. Twice the cost. Sticking to the smaller instance saves $12.15 per day, or about $365 over a 30-day month (that’s serious beer/PoC money).
  4. Using local mode. When experimenting with the SageMaker SDK, you can use local mode to train and deploy your model on the notebook instance itself, not on managed instances. This has the double benefit of using the same code for experimentation and production, as well as of saving time and money by not firing up instances. All it takes is using the ‘local’ instance type: this is supported on all Deep Learning environments, here’s a TensorFlow example.
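
If you want to play with the right-sizing numbers yourself, here is a small Python sketch that recomputes the notebook example from the hourly prices quoted above. The 30 days of use per month is an assumption on my part, and the function and variable names are made up for this illustration.

```python
# Hourly prices in USD, as quoted in the example above.
HOURLY_PRICE = {"ml.t3.xlarge": 0.233, "ml.c5.2xlarge": 0.476}

def daily_cost(instance_type, developers=5, hours_per_day=10):
    """Daily cost of one notebook instance per developer."""
    return round(developers * hours_per_day * HOURLY_PRICE[instance_type], 2)

small = daily_cost("ml.t3.xlarge")
large = daily_cost("ml.c5.2xlarge")
daily_saving = round(large - small, 2)

# Assumption: the team uses its notebooks 30 days a month.
monthly_saving = round(daily_saving * 30, 2)

print(f"small: ${small}/day, large: ${large}/day")
print(f"saving: ${daily_saving}/day, ${monthly_saving}/month")
```

Plug in your own team size, working hours and instance prices: the gap grows linearly with each of them.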

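Stopping notebook instances is easy to automate. The sketch below uses the `list_notebook_instances` and `stop_notebook_instance` calls of the boto3 SageMaker client. The function name, the `dry_run` flag and the `FakeSageMakerClient` stand-in (there only so that the example runs without AWS credentials) are assumptions of mine, not part of any AWS tool.

```python
def stop_running_notebooks(sagemaker_client, dry_run=True):
    """Stop every notebook instance that is InService and return their names.

    With dry_run=True, nothing is stopped: the names are only collected.
    """
    names = []
    kwargs = {"StatusEquals": "InService"}
    while True:
        page = sagemaker_client.list_notebook_instances(**kwargs)
        for instance in page["NotebookInstances"]:
            name = instance["NotebookInstanceName"]
            if not dry_run:
                sagemaker_client.stop_notebook_instance(NotebookInstanceName=name)
            names.append(name)
        token = page.get("NextToken")
        if not token:
            return names
        kwargs["NextToken"] = token  # fetch the next page of results

class FakeSageMakerClient:
    """Offline stand-in for boto3.client("sagemaker"), for demonstration only."""

    def __init__(self, names):
        self.names = names
        self.stopped = []

    def list_notebook_instances(self, **kwargs):
        return {"NotebookInstances": [{"NotebookInstanceName": n} for n in self.names]}

    def stop_notebook_instance(self, NotebookInstanceName):
        self.stopped.append(NotebookInstanceName)

# Hypothetical instance names.
fake = FakeSageMakerClient(["exploration-1", "exploration-2"])
would_stop = stop_running_notebooks(fake)                        # dry run
actually_stopped = stop_running_notebooks(fake, dry_run=False)   # stops them
print(actually_stopped)
```

Against a real account, you would pass `boto3.client("sagemaker")` instead of the fake client, for example from a scheduled job running every evening.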

Once you’ve found an algorithm that performs nicely on your small-scale dataset, you’ll want to run it on the full dataset. This could potentially run for a while, so please spend wisely!

You can save money by:

  1. Not running long-lasting jobs on notebook instances. Unfortunately, this anti-pattern seems to be pretty common. People pick a large notebook instance (say, ml.p3.2xlarge, $4.284 per hour), fire up a large job, leave it running, forget about it and end up paying for an instance doing nothing for hours once the job is complete! Instead, please run your training jobs on managed instances: thanks to distributed training, you’ll get your results much quicker, and as instances are terminated as soon as training is complete, you will never overpay for training! As a bonus, you won’t be at the mercy of a clean-up script (or an overzealous admin) killing all notebook instances in the middle of the night (“because they’re doing nothing, right?”).
  2. Right-sizing your training instances. More often than not, simple models simply won’t train faster on larger instances, because they don’t benefit from increased hardware parallelism. They may even train slower due to GPU communication overhead! Just like on EC2, start with small instances, scale out first, then scale up. And yes, it’s fun to play with ml.p3.16xlarge instances, but they cost $34.272 per hour, so don’t be foolish :)
  3. Using AWS-provided versions of TensorFlow, Apache MXNet, etc. We have entire teams dedicated to extracting the last bit of performance from Deep Learning libraries on AWS (read this, this, and this). No offence, but if you think you can ‘pip install’ and go faster, your time is probably better invested elsewhere :)
  4. Packing your dataset in RecordIO / TFRecord files. Large datasets made of thousands or even millions of files incur lots of unnecessary IO, which can slow down your training jobs. Packing these files into large record-structured files (say, 100MB each) will make it easier to move your dataset around and distribute it to training instances. Please use TFRecord for TensorFlow, and RecordIO for Apache MXNet and most built-in algorithms.
  5. Streaming large datasets with Pipe Mode. Pipe mode streams your dataset directly from Amazon S3 to your training instances. No copying is involved, which saves on startup time and also lets you work with infinitely large datasets (as they’re not fully loaded in RAM any more). Pipe mode is supported by most built-in algorithms and TensorFlow.
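
To illustrate what "record-structured" means, here is a minimal Python sketch that packs many small samples into a few larger shards. Careful: this is neither the RecordIO nor the TFRecord format. The length-prefixed layout, the function names and the tiny 40-byte shard size are assumptions chosen to keep the example short. In practice, you would use the tools shipped with your framework and aim for shards of around 100MB.

```python
import struct

def pack_records(records, max_shard_bytes):
    """Group byte records into shards of at most max_shard_bytes.

    Each record is stored as a 4-byte little-endian length plus its payload.
    A single record larger than the limit still gets a shard of its own.
    """
    shards, current = [], bytearray()
    for payload in records:
        framed = struct.pack("<I", len(payload)) + payload
        if current and len(current) + len(framed) > max_shard_bytes:
            shards.append(bytes(current))
            current = bytearray()
        current += framed
    if current:
        shards.append(bytes(current))
    return shards

def unpack_records(shard):
    """Yield the payloads stored in a shard, in their original order."""
    offset = 0
    while offset < len(shard):
        (length,) = struct.unpack_from("<I", shard, offset)
        offset += 4
        yield shard[offset:offset + length]
        offset += length

# Ten hypothetical samples, standing in for thousands of small files.
samples = [f"sample-{i}".encode() for i in range(10)]
shards = pack_records(samples, max_shard_bytes=40)
restored = [record for shard in shards for record in unpack_records(shard)]
print(f"{len(samples)} samples packed into {len(shards)} shards")
```

Ten objects become four: the same idea applied to millions of image files is what cuts down on IO and makes the dataset easier to distribute.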


Figuring out the right set of hyperparameters for an ML model can be quite expensive, as techniques like grid search or random search typically involve training hundreds of different models.

You can save (a lot of) money by using automatic model tuning, an ML technique that can quickly and efficiently figure out the optimal set of parameters with a limited number of training jobs (think tens, not hundreds). It’s available for all algorithms: built-in, Deep Learning and custom. I cannot recommend it enough.
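
To get a feel for "tens, not hundreds", here is a quick back-of-the-envelope sketch in Python. The search space and the budget of 30 jobs are hypothetical values that I picked for illustration. The code only counts training jobs: it does not perform any tuning.

```python
from itertools import product

# Hypothetical search space: values are for illustration only.
grid = {
    "learning_rate": [0.001, 0.003, 0.01, 0.03, 0.1],
    "batch_size": [32, 64, 128, 256],
    "num_layers": [2, 3, 4, 5, 6],
    "dropout": [0.1, 0.3, 0.5],
}

# Grid search trains one model per combination of values.
grid_jobs = len(list(product(*grid.values())))

# Hypothetical cap on the number of jobs in an automatic tuning run.
tuning_budget = 30

print(f"grid search: {grid_jobs} jobs, tuning budget: {tuning_budget} jobs")
```

Adding a fifth hyperparameter with four values would push the grid to 1,200 jobs, while the tuning budget stays whatever you decide it to be.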


Now that you’re happy with the performance of your model, it’s time to deploy it to an Amazon SageMaker endpoint serving HTTPS predictions. This part of the ML process is actually where you’ll spend the most, because your models probably need to be available 24/7. Things can go south pretty quickly, especially at scale.

You can save money by:

  1. Deleting unnecessary endpoints. This is an obvious one, but we’re probably all guilty of neglecting it. It’s very easy to deploy models with SageMaker (especially when experimenting), and it’s equally easy to forget to clean up. Please remember that endpoints must be deleted explicitly. You can easily do it with the SageMaker SDK or in the console.
  2. Right-sizing and auto scaling your endpoints. Just like for training, you need to figure out which instance type works best for your application: latency, throughput and cost. Once again, I’d recommend starting small, scaling out first, and then scaling up. Endpoints support auto scaling, so there’s no reason to run a large, under-utilized instance when you could run a dynamic fleet of small ones.
  3. Using batch transform if you don’t need online predictions. Some applications don’t require a live endpoint: they’re perfectly fine with batch prediction running periodically. This is supported by Amazon SageMaker, with the extra benefit that the underlying instances are terminated automatically when the batch has been processed, meaning that you will never overpay for prediction because you left that endpoint running for a week for no good reason…
  4. Using Elastic Inference instead of full-fledged GPU instances. Elastic Inference is a new service launched at re:Invent 2018 that lets you attach fractional GPU acceleration to any Amazon instance (EC2, SageMaker notebook instances and endpoints). For instance, a c5.large instance configured with eia1.medium acceleration will cost you $0.22 an hour. This combination is only 10–15% slower than a p2.xlarge instance, which hosts a dedicated NVIDIA K80 GPU and costs $0.90 an hour. Bottom line: you get a 75% cost reduction for equivalent GPU performance. In this case, you’d save $490 per instance per month… Before deploying to GPU instances, please test Elastic Inference first: it’s probably worth every minute of your time and then some.
  5. Optimizing your ML models for the underlying hardware. Amazon SageMaker Neo is yet another service launched at re:Invent 2018. In a nutshell, the Neo compiler converts models into an efficient common format, which is executed on the device by a compact runtime that uses less than one-hundredth of the resources that a generic framework would traditionally consume. The Neo runtime is optimized for the underlying hardware, using specific instruction sets that help speed up ML inference. How is that saving you money? Well, as your model is now running quite a bit faster, it’s reasonable to expect that you could downsize prediction instances while maintaining acceptable levels of latency and throughput. The best part? It takes a single API call to compile a model and it’s free of charge.
  6. Using inference pipelines instead of multiple endpoints. Some models require predictions to be pre-processed and/or post-processed. Until recently, this meant deploying multiple endpoints (one per step), incurring extra costs and extra latency. With Inference Pipelines, SageMaker is now capable of deploying all steps to the same endpoint, saving money and latency.
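
Cleaning up forgotten endpoints can be scripted too. The sketch below relies on the `list_endpoints` and `delete_endpoint` calls of the boto3 SageMaker client. Everything else is an assumption of mine: the function name, the idea of tagging experimental endpoints with a name prefix, the 7-day age limit and the `FakeSageMakerClient` stand-in that lets the example run offline. What counts as "unnecessary" is your call, so the default is a dry run.

```python
from datetime import datetime, timedelta, timezone

def delete_old_endpoints(sagemaker_client, name_prefix, max_age, now=None, dry_run=True):
    """Delete endpoints whose name starts with name_prefix and that were
    created more than max_age ago. Return the names that matched.

    With dry_run=True, nothing is deleted.
    """
    now = now or datetime.now(timezone.utc)
    matched, kwargs = [], {}
    while True:
        page = sagemaker_client.list_endpoints(**kwargs)
        for endpoint in page["Endpoints"]:
            name = endpoint["EndpointName"]
            if name.startswith(name_prefix) and now - endpoint["CreationTime"] > max_age:
                matched.append(name)
        token = page.get("NextToken")
        if not token:
            break
        kwargs["NextToken"] = token  # fetch the next page of results
    if not dry_run:
        for name in matched:
            sagemaker_client.delete_endpoint(EndpointName=name)
    return matched

class FakeSageMakerClient:
    """Offline stand-in for boto3.client("sagemaker"), for demonstration only."""

    def __init__(self, endpoints):
        self.endpoints = endpoints
        self.deleted = []

    def list_endpoints(self, **kwargs):
        return {"Endpoints": self.endpoints}

    def delete_endpoint(self, EndpointName):
        self.deleted.append(EndpointName)

# Hypothetical endpoints and dates.
now = datetime(2019, 3, 1, tzinfo=timezone.utc)
fake = FakeSageMakerClient([
    {"EndpointName": "experiment-xgboost", "CreationTime": now - timedelta(days=10)},
    {"EndpointName": "experiment-new", "CreationTime": now - timedelta(hours=1)},
    {"EndpointName": "production-churn", "CreationTime": now - timedelta(days=90)},
])
removed = delete_old_endpoints(
    fake, name_prefix="experiment-", max_age=timedelta(days=7), now=now, dry_run=False
)
print(removed)
```

Against a real account, you would pass `boto3.client("sagemaker")`. Note that `delete_endpoint` leaves the endpoint configuration and the model in place, so redeploying later stays easy.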

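As for endpoint auto scaling, it is configured through Application Auto Scaling. Here is a sketch that builds the arguments for its `register_scalable_target` and `put_scaling_policy` calls. The helper function, the endpoint name, the capacity limits and the target of 100 invocations per instance are hypothetical values: please derive the real ones from your own load tests.

```python
def autoscaling_requests(endpoint_name, variant_name="AllTraffic",
                         min_instances=1, max_instances=4,
                         invocations_per_instance=100.0):
    """Build keyword arguments for the two Application Auto Scaling calls
    that enable target tracking on a SageMaker endpoint variant."""
    common = {
        "ServiceNamespace": "sagemaker",
        "ResourceId": f"endpoint/{endpoint_name}/variant/{variant_name}",
        "ScalableDimension": "sagemaker:variant:DesiredInstanceCount",
    }
    # Bounds of the instance fleet backing the variant.
    target = {**common, "MinCapacity": min_instances, "MaxCapacity": max_instances}
    # Scale in or out to keep invocations per instance close to the target value.
    policy = {
        **common,
        "PolicyName": f"{endpoint_name}-invocations-target",
        "PolicyType": "TargetTrackingScaling",
        "TargetTrackingScalingPolicyConfiguration": {
            "TargetValue": invocations_per_instance,
            "PredefinedMetricSpecification": {
                "PredefinedMetricType": "SageMakerVariantInvocationsPerInstance"
            },
        },
    }
    return target, policy

# Hypothetical endpoint name.
target, policy = autoscaling_requests("my-endpoint")
print(target["ResourceId"])
```

To apply the configuration, you would create a client with `boto3.client("application-autoscaling")`, then call `client.register_scalable_target(**target)` followed by `client.put_scaling_policy(**policy)`.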

As you can see, implementing all these tips and best practices could very well slash your ML budget by an order of magnitude. Please give them a try, and let me know how much you saved :) I’d also love to hear about additional techniques that you’ve come up with. Please feel free to share them here.

As always, thank you for reading. Happy to answer questions here or on Twitter.


Julien Simon


Global Evangelist, AI & Machine Learning, Amazon Web Services

