<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Chaim Rand</title>
    <description>The latest articles on DEV Community by Chaim Rand (@crand).</description>
    <link>https://dev.to/crand</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F1238496%2F1f9d3215-1d54-492e-b490-90eef168a787.png</url>
      <title>DEV Community: Chaim Rand</title>
      <link>https://dev.to/crand</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/crand"/>
    <language>en</language>
    <item>
      <title>Streaming Data from Cloud Storage with Mountpoint for Amazon S3</title>
      <dc:creator>Chaim Rand</dc:creator>
      <pubDate>Tue, 11 Feb 2025 13:55:44 +0000</pubDate>
      <link>https://dev.to/aws-builders/streaming-data-from-cloud-storage-with-mountpoint-for-amazon-s3-39p9</link>
      <guid>https://dev.to/aws-builders/streaming-data-from-cloud-storage-with-mountpoint-for-amazon-s3-39p9</guid>
      <description>&lt;h2&gt;
  
  
  A First Look at a New Solution for Mounting Cloud Based Data
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fmiro.medium.com%2Fv2%2Fresize%3Afit%3A700%2F0%2A0iCVeMLVE4xE2_vP" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fmiro.medium.com%2Fv2%2Fresize%3Afit%3A700%2F0%2A0iCVeMLVE4xE2_vP" width="700" height="465"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Photo by &lt;a href="https://unsplash.com/@simonfitall?utm_source=medium&amp;amp;utm_medium=referral" rel="noopener noreferrer"&gt;Simon Fitall&lt;/a&gt; on &lt;a href="https://unsplash.com/?utm_source=medium&amp;amp;utm_medium=referral" rel="noopener noreferrer"&gt;Unsplash&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;These days, AI has become pretty much synonymous with collecting and maintaining large amounts of data. This data is typically stored in a central location and accessed at multiple phases of AI application development. An important factor in designing a data storage solution is the speed and efficiency at which the data can be accessed, as this can have a meaningful impact on the speed and cost of development. In our AI development team, we use cloud object storage services such as &lt;a href="https://aws.amazon.com/s3/" rel="noopener noreferrer"&gt;Amazon S3&lt;/a&gt; to store enormous amounts of data. Consequently, we are obsessed with finding the fastest (and cheapest) ways of consuming cloud-based data for a variety of different scenarios.&lt;/p&gt;

&lt;p&gt;In previous posts (e.g., &lt;a href="https://towardsdatascience.com/streaming-big-data-files-from-cloud-storage-634e54818e75" rel="noopener noreferrer"&gt;here&lt;/a&gt;, &lt;a href="https://towardsdatascience.com/training-in-pytorch-from-amazon-s3-6156d5342d1" rel="noopener noreferrer"&gt;here&lt;/a&gt;, and &lt;a href="https://towardsdatascience.com/training-from-cloud-storage-with-s5cmd-5c8fb5c06056" rel="noopener noreferrer"&gt;here&lt;/a&gt;), we described a number of different tools and techniques for pulling data from the cloud and demonstrated their application to various use cases. It is only natural, then, that with &lt;a href="https://aws.amazon.com/about-aws/whats-new/2023/03/mountpoint-amazon-s3/" rel="noopener noreferrer"&gt;the introduction&lt;/a&gt; of a new option for accessing cloud-based data, we would eagerly set out to explore its capabilities.&lt;/p&gt;

&lt;p&gt;In this post, we will describe our first impressions of &lt;a href="https://github.com/awslabs/mountpoint-s3" rel="noopener noreferrer"&gt;Mountpoint for Amazon S3&lt;/a&gt; - a new open-source solution for interfacing with cloud storage - and assess its performance on two use cases that are of particular interest to us: streaming sequential blocks of relatively large data files (as detailed &lt;a href="https://towardsdatascience.com/streaming-big-data-files-from-cloud-storage-634e54818e75" rel="noopener noreferrer"&gt;here&lt;/a&gt;) and consuming a large number of relatively small data files (as detailed &lt;a href="https://towardsdatascience.com/training-from-cloud-storage-with-s5cmd-5c8fb5c06056" rel="noopener noreferrer"&gt;here&lt;/a&gt;). For the sake of brevity, we will refer to Mountpoint for Amazon S3 simply as Mountpoint.&lt;/p&gt;

&lt;p&gt;Importantly, keep in mind that, as of the time of this writing, Mountpoint remains under &lt;a href="https://github.com/orgs/awslabs/projects/84/views/1" rel="noopener noreferrer"&gt;active development&lt;/a&gt;. You are strongly advised to stay up to date with the latest release of this tool (and all alternative tools) in order to make the most informed design decisions for your AI projects.&lt;/p&gt;

&lt;p&gt;Although it can support other endpoints, Mountpoint prioritizes performance against Amazon S3. As such, the examples below will be run using Amazon's cloud services. However, our choice of cloud service provider - or the mention of any other tools, frameworks, or APIs - should not be viewed as an endorsement over their alternatives. Furthermore, please do not view this post as a replacement for the existing official documentation (e.g., &lt;a href="https://github.com/awslabs/mountpoint-s3" rel="noopener noreferrer"&gt;here&lt;/a&gt; and &lt;a href="https://aws.amazon.com/blogs/storage/the-inside-story-on-mountpoint-for-amazon-s3-a-high-performance-open-source-file-client/" rel="noopener noreferrer"&gt;here&lt;/a&gt;).&lt;/p&gt;

&lt;h1&gt;
  
  
  Yet Another FUSE Based Object Storage Access Solution
&lt;/h1&gt;

&lt;p&gt;While there are many solutions for reading from and writing to file objects in the cloud, we can broadly divide them into two categories:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt; Explicit Data Transfer - Solutions that involve explicitly downloading data from the cloud for reading and uploading data for writing.&lt;/li&gt;
&lt;li&gt; File-System Abstraction - Solutions that abstract cloud storage interactions behind a file-system-style interface, allowing seamless access to cloud-hosted files.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The second approach is often implemented using the &lt;a href="https://en.wikipedia.org/wiki/Filesystem_in_Userspace" rel="noopener noreferrer"&gt;Filesystem in Userspace (FUSE)&lt;/a&gt; interface, enabling cloud-based data buckets to be mounted as local directories. This allows existing applications to interact with cloud storage just as they would with a traditional file system - requiring little to no modification.&lt;/p&gt;
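To make the appeal concrete, here is a minimal sketch of our own (not taken from any of the tools' documentation): the reader below uses only standard Python file I/O, so pointing it at a FUSE mount of a bucket instead of a local directory requires no code changes. The mount path mentioned in the comment is a hypothetical example.

```python
import os
import tempfile

def read_block(base_path: str, key: str, offset: int, size: int) -> bytes:
    """Read `size` bytes at `offset` from a file. The code is identical
    whether base_path is a local directory or a FUSE mount of a bucket."""
    with open(os.path.join(base_path, key), 'rb') as f:
        f.seek(offset)
        return f.read(size)

# Demo against a local directory; with a FUSE solution, base_path would
# simply point at the mount directory (e.g., a hypothetical '/mnt/my-bucket').
with tempfile.TemporaryDirectory() as base_path:
    with open(os.path.join(base_path, 'sample.bin'), 'wb') as f:
        f.write(bytes(range(256)))  # 256 bytes: 0, 1, ..., 255
    block = read_block(base_path, 'sample.bin', offset=16, size=4)
    print(block)  # bytes 16..19
```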

&lt;p&gt;Mountpoint is a new FUSE-based solution written in the &lt;a href="https://www.rust-lang.org/" rel="noopener noreferrer"&gt;Rust&lt;/a&gt; programming language and based on a &lt;a href="https://github.com/cberner/fuser" rel="noopener noreferrer"&gt;Rust version&lt;/a&gt; of the &lt;a href="https://github.com/libfuse/libfuse/" rel="noopener noreferrer"&gt;Linux FUSE library&lt;/a&gt;. See &lt;a href="https://aws.amazon.com/blogs/storage/the-inside-story-on-mountpoint-for-amazon-s3-a-high-performance-open-source-file-client/" rel="noopener noreferrer"&gt;here&lt;/a&gt; for an explanation of the choice of Rust. Other popular tools in the FUSE-based family of solutions are &lt;a href="https://github.com/kahing/goofys" rel="noopener noreferrer"&gt;&lt;em&gt;goofys&lt;/em&gt;&lt;/a&gt; and &lt;a href="https://github.com/s3fs-fuse/s3fs-fuse" rel="noopener noreferrer"&gt;&lt;em&gt;s3fs&lt;/em&gt;&lt;/a&gt;.&lt;/p&gt;

&lt;h1&gt;
  
  
  Using Mountpoint
&lt;/h1&gt;

&lt;p&gt;To install Mountpoint, please follow the guidelines in the &lt;a href="https://docs.aws.amazon.com/AmazonS3/latest/userguide/mountpoint-installation.html" rel="noopener noreferrer"&gt;official documentation&lt;/a&gt;. The usage instructions of Mountpoint can be retrieved by running &lt;em&gt;mount-s3&lt;/em&gt; with the &lt;em&gt;help&lt;/em&gt; flag. The text block below includes the first few lines of the output as well as a few of the many options that allow us to tune the behavior of the client.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Mountpoint for Amazon S3

Usage: mount-s3 [OPTIONS] &amp;lt;BUCKET_NAME&amp;gt; &amp;lt;DIRECTORY&amp;gt;

Arguments:\
  &amp;lt;BUCKET_NAME&amp;gt;\
          Name of bucket to mount

  &amp;lt;DIRECTORY&amp;gt;\
          Directory or FUSE file descriptor to mount the bucket at.

Mount options:\
      --read-only\
          Mount file system in read-only mode

Client options:\
      --maximum-throughput-gbps &amp;lt;N&amp;gt;\
          Maximum throughput in Gbps [default: auto-detected on EC2\
          instances, 10 Gbps elsewhere]

      --max-threads &amp;lt;N&amp;gt;\
          Maximum number of FUSE daemon threads

          [default: 16]

      --part-size &amp;lt;SIZE&amp;gt;\
          Part size for multi-part GET and PUT in bytes

          [default: 8388608]

Caching options:\
      --cache &amp;lt;DIRECTORY&amp;gt;\
          Enable caching of object content to the given directory and set\
          metadata TTL to 60 seconds
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Mountpoint assumes appropriate configuration of your &lt;a href="https://docs.aws.amazon.com/sdkref/latest/guide/access.html" rel="noopener noreferrer"&gt;AWS credentials&lt;/a&gt;. Make sure to be aware of the current &lt;a href="https://github.com/awslabs/mountpoint-s3#current-status" rel="noopener noreferrer"&gt;documented limitations&lt;/a&gt; of Mountpoint, as well as any special &lt;a href="http://github.com/awslabs/mountpoint-s3/blob/main/doc/CONFIGURATION.md" rel="noopener noreferrer"&gt;configuration&lt;/a&gt; that might be required.&lt;/p&gt;
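As a quick illustration, one common way to supply credentials is through environment variables. The values below are placeholders shown purely as a sketch; an IAM role or a profile in your AWS credentials file works just as well.

```shell
# One common way to supply AWS credentials is via environment variables.
# The values below are placeholders -- substitute your own credentials,
# or rely on an IAM role or a profile in your AWS credentials file instead.
export AWS_ACCESS_KEY_ID="YOUR_ACCESS_KEY_ID"
export AWS_SECRET_ACCESS_KEY="YOUR_SECRET_ACCESS_KEY"
export AWS_DEFAULT_REGION="us-east-1"
```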

&lt;p&gt;In the next sections we will demonstrate the use of Mountpoint for Amazon S3 for two different use cases and compare its performance with &lt;em&gt;goofys&lt;/em&gt;. The experiments we will describe were conducted on an &lt;a href="https://aws.amazon.com/ec2/instance-types/c5/" rel="noopener noreferrer"&gt;Amazon EC2 c5.4xlarge&lt;/a&gt; instance (with 16 vCPUs). For the sake of simplicity, we chose an Ubuntu (22.04) &lt;a href="https://aws.amazon.com/releasenotes/aws-deep-learning-ami-gpu-pytorch-2-5-ubuntu-22-04/" rel="noopener noreferrer"&gt;AWS Deep Learning AMI&lt;/a&gt;, preinstalled with Python (3.11) and PyTorch (2.5.1). To install &lt;em&gt;mount-s3&lt;/em&gt; and &lt;em&gt;goofys&lt;/em&gt; we ran the following commands:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# install goofys\
sudo curl -Lo \\
  /usr/local/bin/goofys \\
  https://github.com/kahing/goofys/releases/latest/download/goofys\
sudo chmod +x /usr/local/bin/goofys

# install mount-s3\
wget \\
  https://s3.amazonaws.com/mountpoint-s3-release/latest/x86_64/mount-s3.deb\
sudo dpkg -i mount-s3.deb\
sudo apt-get install -f -y
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;We used the following command lines for mounting and un-mounting:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Mountpoint\
mount-s3 --read-only &amp;lt;s3_bucket_name&amp;gt; &amp;lt;local_path&amp;gt;

# goofys\
goofys -o ro &amp;lt;s3_bucket_name&amp;gt; &amp;lt;local_path&amp;gt;

# disable mount\
fusermount -z -u &amp;lt;local_path&amp;gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Keep in mind that the comparative performance results we will share are very much dependent on the details of the environment in which they were run. Furthermore, it is quite likely that with appropriate tuning of the command line controls we could have improved the performance of both the Mountpoint and &lt;em&gt;goofys&lt;/em&gt; trials. We strongly encourage you to conduct your own experiments before drawing conclusions for your own project.&lt;/p&gt;
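For example, a tuned invocation might raise the number of FUSE daemon threads and the part size using the flags from the help text above. The flag values and the bucket/mount placeholders below are purely illustrative, not recommendations; for safety, this sketch only assembles and prints the command line rather than running it.

```shell
# Hypothetical tuned invocation built from flags shown in the help text
# above: more FUSE daemon threads and a 16 MiB part size for multi-part
# GETs. Bucket and mount directory are placeholders.
BUCKET="your-bucket-name"
MOUNT_DIR="/path/to/mount"
TUNED_CMD="mount-s3 --read-only --max-threads 32 --part-size 16777216 $BUCKET $MOUNT_DIR"
echo "$TUNED_CMD"
```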

&lt;h2&gt;
  
  
  An Important Note About the Costs of Pulling Data from Amazon S3
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://aws.amazon.com/s3/pricing/" rel="noopener noreferrer"&gt;Amazon S3 pricing&lt;/a&gt; consists of several components, one of which is based on the number of API calls (e.g., GET, SELECT, PUT, etc.). When using FUSE-based solutions such as Mountpoint or goofys, these API calls are abstracted away from the user, making it more difficult to directly assess the cost of reading data from Amazon S3 compared to explicitly pulling the data. Additionally, the number of API calls - and their associated costs - can be affected by the choice of command-line options.&lt;/p&gt;

&lt;p&gt;A comparative cost analysis of different methods for streaming data from Amazon S3 is beyond the scope of this post, but conducting such an analysis is highly recommended before selecting the best approach for your needs.&lt;/p&gt;
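That said, a rough back-of-envelope sketch can illustrate how the part size translates into GET-request counts and therefore cost. The price constant below is an illustrative placeholder of our own; always check the Amazon S3 pricing page for current numbers in your region.

```python
# Back-of-envelope estimate of S3 GET-request costs when streaming a
# large file in part-sized chunks. The price is an illustrative
# placeholder -- consult the Amazon S3 pricing page for real numbers.
GET_PRICE_PER_1000 = 0.0004   # illustrative $ per 1,000 GET requests
MB = 1024 * 1024
GB = 1024 * MB

def estimated_get_cost(file_size_bytes: int, part_size_bytes: int) -> float:
    """Number of part-sized GETs needed to read the whole file, times price."""
    num_requests = -(-file_size_bytes // part_size_bytes)  # ceiling division
    return num_requests / 1000 * GET_PRICE_PER_1000

# A 2 GB file read in 8 MiB parts (the default part size from the help
# text above) requires 256 GET requests:
print(estimated_get_cost(2 * GB, 8 * MB))
```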

&lt;h1&gt;
  
  
  Streaming Large Data Files
&lt;/h1&gt;

&lt;p&gt;In our first experiment, we evaluated the performance of traversing a 2 GB binary file stored in the cloud. This file was assumed to contain 2,048 blocks of data (e.g., frames or data samples), each 1 MB in size.&lt;/p&gt;
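For readers who wish to reproduce the setup, a file with this layout can be generated with a short helper of our own (not part of the original experiment code); the resulting file would then be uploaded to S3.

```python
import os

MB = 1024 * 1024

def write_dummy_file(path: str, num_blocks: int = 2048, block_size: int = MB):
    """Create a binary file of num_blocks fixed-size blocks, mirroring the
    2 GB file of 2,048 x 1 MB blocks used in the experiment. Each block is
    filled with its (mod 256) index so blocks are distinguishable."""
    with open(path, 'wb') as f:
        for i in range(num_blocks):
            f.write((i % 256).to_bytes(1, 'little') * block_size)

# Small demo (4 blocks of 1 KB) to keep it cheap; the experiment itself
# used the defaults: num_blocks=2048, block_size=MB.
write_dummy_file('/tmp/dummy.bin', num_blocks=4, block_size=1024)
print(os.path.getsize('/tmp/dummy.bin'))  # 4096
```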

&lt;p&gt;The code block below demonstrates routines for sequentially reading through the file and for sampling data at non-sequential file offsets. Please see our &lt;a href="https://towardsdatascience.com/streaming-big-data-files-from-cloud-storage-634e54818e75" rel="noopener noreferrer"&gt;previous posts&lt;/a&gt; for more details on how we designed the experiment and how we chose the metrics for comparison.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import time

KB = 1024\
MB = KB * KB

def read_sequential(f, t0):\
    t1 = time.time()\
    x = f.read(MB)\
    print(f'time of first sample: {time.time() - t1}')\
    print(f'total to first sample: {time.time() - t0}')\
    t1 = time.time()\
    count = 0\
    while True:\
        x = f.read(MB)\
        if len(x) == 0:\
            break\
        count += 1\
    print(f'time of avg read: {(time.time() - t1)/count}')

def fast_forward(f):\
    t1 = time.time()\
    total = 10\
    for i in range(total):\
        f.seek(i * 100 * MB)\
        t1 = time.time()\
        x = f.read(MB)\
    print(f'time of avg random read: {(time.time() - t1)/total}')

key = '&amp;lt;s3 key&amp;gt;'\
mount_dir = '&amp;lt;local mount&amp;gt;'\
sequential = True # toggle flag to run fast_forward

t0 = time.time()\
with open(f'{mount_dir}/{key}', 'rb') as f:\
    if sequential:\
        read_sequential(f, t0)\
        print(f'total time: {time.time()-t0}')\
    else:\
        fast_forward(f)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;In the table below we compare the results we received.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fy92ccki3hy08b9c6v1a3.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fy92ccki3hy08b9c6v1a3.png" width="700" height="125"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Comparative Results of Pulling 2 GB File from S3 (by Author)&lt;/p&gt;

&lt;p&gt;Although Mountpoint is slightly slower than &lt;em&gt;goofys&lt;/em&gt; in loading the first frame, it outperforms &lt;em&gt;goofys&lt;/em&gt; in all other metrics. For a comparison with other methods for streaming large files from the cloud, please refer to our &lt;a href="https://towardsdatascience.com/streaming-big-data-files-from-cloud-storage-634e54818e75" rel="noopener noreferrer"&gt;previous post&lt;/a&gt;.&lt;/p&gt;

&lt;h1&gt;
  
  
  Consuming a Large Number of Small Files
&lt;/h1&gt;

&lt;p&gt;In our second experiment we assessed the speed of feeding hundreds of thousands of individual cloud-based data samples into a deep learning training environment. The code block below demonstrates the creation of a custom &lt;a href="https://pytorch.org/tutorials/beginner/data_loading_tutorial.html#dataset-class" rel="noopener noreferrer"&gt;PyTorch Dataset&lt;/a&gt; for loading training samples from the local mount. We measured the speed of traversing thousands of image-label pairs, where each file was 1 MB in size. Please see this &lt;a href="https://towardsdatascience.com/training-from-cloud-storage-with-s5cmd-5c8fb5c06056" rel="noopener noreferrer"&gt;previous post&lt;/a&gt; for more details on how we designed the experiment.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;from torch.utils.data import Dataset\
import os\
class SingleSampleDataset(Dataset):\
    def __init__(self):\
        super().__init__()\
        self.base_path = '&amp;lt;local_mount&amp;gt;'

    def __len__(self):\
        return 10000

    def get_from_files(self, image_path, label_path):\
        image_file = open(image_path, 'rb')\
        label_file = open(label_path, 'rb')\
        image = image_file.read()\
        label = label_file.read()\
        image_file.close()\
        label_file.close()\
        return {"image": image, "label": label}

    def __getitem__(self, index: int):\
        image_path = os.path.join(self.base_path, f'{index}.image')\
        label_path = os.path.join(self.base_path, f'{index}.label')\
        return self.get_from_files(image_path, label_path)

def get_dataset():\
    return SingleSampleDataset()

import torch, time\
from statistics import mean, variance\
dataset = get_dataset()\
dl = torch.utils.data.DataLoader(dataset, batch_size=4, num_workers=16)\
stats_lst = []\
t0 = time.perf_counter()\
for batch_idx, batch in enumerate(dl, start=1):\
    t = time.perf_counter() - t0\
    print(f'Iteration {batch_idx} Time {t}')\
    stats_lst.append(t)\
    t0 = time.perf_counter()\
mean_calc = mean(stats_lst[1:])\
var_calc = variance(stats_lst[1:])\
print(f'mean {mean_calc} variance {var_calc}')
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;When downloading large numbers of small files, goofys outperformed Mountpoint, averaging 0.08 seconds per sample compared to Mountpoint's 0.11 seconds. We surmise that the overhead observed at the start of a file in the previous experiment has a more significant impact when dealing with many small files. For results of other methods for consuming large numbers of small files from the cloud, refer to our &lt;a href="https://towardsdatascience.com/training-from-cloud-storage-with-s5cmd-5c8fb5c06056" rel="noopener noreferrer"&gt;previous post&lt;/a&gt;.&lt;/p&gt;

&lt;h1&gt;
  
  
  Summary
&lt;/h1&gt;

&lt;p&gt;It's great to see a new actor in the cloud data streaming space, especially one &lt;a href="https://aws.amazon.com/blogs/storage/the-inside-story-on-mountpoint-for-amazon-s3-a-high-performance-open-source-file-client/" rel="noopener noreferrer"&gt;explicitly intent&lt;/a&gt; on addressing the challenges faced by modern data applications. Some key highlights we found:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  Performance Tuning - Mountpoint includes many controls that allow fine-tuning for improved performance.&lt;/li&gt;
&lt;li&gt;  Large File Streaming - When traversing large files, Mountpoint outperformed the solution we compared it to.&lt;/li&gt;
&lt;li&gt;  Ongoing Enhancements - Unlike other FUSE-based solutions, Mountpoint is under active development and expected to introduce further improvements.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;One area for improvement we identified is mass-downloading many small files, where Mountpoint underperformed compared to &lt;em&gt;goofys&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;As Mountpoint for Amazon S3 continues to evolve, we look forward to seeing it extend and enhance its capabilities.&lt;/p&gt;

</description>
      <category>amazon</category>
      <category>s3</category>
      <category>mountpoint</category>
    </item>
    <item>
      <title>On the Programmability of AWS Trainium and Inferentia</title>
      <dc:creator>Chaim Rand</dc:creator>
      <pubDate>Sun, 03 Nov 2024 14:48:37 +0000</pubDate>
      <link>https://dev.to/crand/on-the-programmability-of-aws-trainium-and-inferentia-4ick</link>
      <guid>https://dev.to/crand/on-the-programmability-of-aws-trainium-and-inferentia-4ick</guid>
      <description>&lt;h2&gt;
  
  
  Accelerating AI/ML Model Training with Custom Operators — Part 4
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fmiro.medium.com%2Fv2%2Fresize%3Afit%3A700%2F0%2AReIcCWndeJTnS-0U" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fmiro.medium.com%2Fv2%2Fresize%3Afit%3A700%2F0%2AReIcCWndeJTnS-0U" width="700" height="427"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Photo by &lt;a href="https://unsplash.com/@bresia?utm_source=medium&amp;amp;utm_medium=referral" rel="noopener noreferrer"&gt;Agata Bres&lt;/a&gt; on &lt;a href="https://unsplash.com/?utm_source=medium&amp;amp;utm_medium=referral" rel="noopener noreferrer"&gt;Unsplash&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;In this post we continue our exploration of the opportunities for runtime optimization of machine learning (ML) workloads through custom operator development. This time, we focus on the tools provided by the &lt;a href="https://awsdocs-neuron.readthedocs-hosted.com/en/latest/index.html" rel="noopener noreferrer"&gt;AWS Neuron SDK&lt;/a&gt; for developing and running new kernels on &lt;a href="https://aws.amazon.com/machine-learning/trainium/" rel="noopener noreferrer"&gt;AWS Trainium&lt;/a&gt; and &lt;a href="https://aws.amazon.com/machine-learning/inferentia/" rel="noopener noreferrer"&gt;AWS Inferentia&lt;/a&gt;. With the rapid development of the low-level model components (e.g., &lt;a href="https://en.wikipedia.org/wiki/Attention_(machine_learning)" rel="noopener noreferrer"&gt;attention layers&lt;/a&gt;) driving the AI revolution, the programmability of the accelerators used for training and running ML models is crucial. Dedicated AI chips, in particular, must offer a worthy alternative to the widely used and highly impactful general-purpose GPU (GPGPU) development frameworks, such as &lt;a href="https://developer.nvidia.com/cuda-toolkit" rel="noopener noreferrer"&gt;CUDA&lt;/a&gt; and &lt;a href="https://triton-lang.org/main/index.html" rel="noopener noreferrer"&gt;Triton&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;In previous posts (e.g., &lt;a href="https://towardsdatascience.com/ai-model-optimization-on-aws-inferentia-and-trainium-cfd48e85d5ac" rel="noopener noreferrer"&gt;here&lt;/a&gt; and &lt;a href="https://towardsdatascience.com/a-first-look-at-aws-trainium-1e0605071970" rel="noopener noreferrer"&gt;here&lt;/a&gt;) we explored the opportunity for building and running ML models on AWS's custom-built AI chips using the dedicated &lt;a href="https://awsdocs-neuron.readthedocs-hosted.com/en/latest/" rel="noopener noreferrer"&gt;AWS Neuron SDK&lt;/a&gt;. In its most recent release of the SDK (version &lt;a href="https://awsdocs-neuron.readthedocs-hosted.com/en/latest/release-notes/index.html#id8" rel="noopener noreferrer"&gt;2.20.0&lt;/a&gt;), AWS introduced the &lt;a href="https://awsdocs-neuron.readthedocs-hosted.com/en/latest/general/nki/index.html" rel="noopener noreferrer"&gt;Neuron Kernel Interface (NKI)&lt;/a&gt; for developing custom kernels for &lt;a href="https://awsdocs-neuron.readthedocs-hosted.com/en/latest/general/arch/neuron-hardware/neuron-core-v2.html" rel="noopener noreferrer"&gt;NeuronCore-v2&lt;/a&gt;, the underlying accelerator powering both &lt;a href="https://aws.amazon.com/machine-learning/trainium/" rel="noopener noreferrer"&gt;Trainium&lt;/a&gt; and &lt;a href="https://aws.amazon.com/machine-learning/inferentia/" rel="noopener noreferrer"&gt;Inferentia2&lt;/a&gt;. The NKI interface joins another API that enables &lt;a href="https://awsdocs-neuron.readthedocs-hosted.com/en/latest/general/arch/neuron-hardware/neuron-core-v2.html" rel="noopener noreferrer"&gt;NeuronCore-v2&lt;/a&gt; programmability, &lt;a href="https://awsdocs-neuron.readthedocs-hosted.com/en/latest/neuron-customops/index.html" rel="noopener noreferrer"&gt;Neuron Custom C++ Operators&lt;/a&gt;. In this post we will explore both opportunities and demonstrate them in action.&lt;/p&gt;

&lt;h2&gt;
  
  
  Disclaimers
&lt;/h2&gt;

&lt;p&gt;Importantly, this post should not be viewed as a substitute for the official &lt;a href="https://awsdocs-neuron.readthedocs-hosted.com/en/latest/index.html" rel="noopener noreferrer"&gt;AWS Neuron SDK documentation&lt;/a&gt;. At the time of this writing the Neuron SDK APIs for custom kernel development are in beta, and may change by the time you read this. The examples we share are intended for demonstrative purposes only. We make no claims as to their optimality, robustness, durability, or accuracy. Please do not view our mention of any platforms, tools, APIs, etc., as an endorsement for their use. The best choices for any project depend on the specifics of the use-case at hand and warrant appropriate investigation and analysis.&lt;/p&gt;

&lt;h1&gt;
  
  
  Developing Custom Kernels for Neuron Cores
&lt;/h1&gt;

&lt;p&gt;Although the list of ML models supported by the Neuron SDK is continuously growing, some operations remain either unsupported or implemented suboptimally. By exposing APIs for Neuron kernel customization, the SDK empowers developers to create and/or optimize the low-level operations that they need, greatly increasing the opportunity for running ML workloads on Trainium and Inferentia.&lt;/p&gt;

&lt;p&gt;As discussed in our &lt;a href="https://towardsdatascience.com/accelerating-ai-ml-model-training-with-custom-operators-163ef2a04b12" rel="noopener noreferrer"&gt;previous posts&lt;/a&gt; in this series, fully leveraging the power of these AI chips requires a detailed understanding of their low-level architecture.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Neuron Core Architecture
&lt;/h2&gt;

&lt;p&gt;The NKI documentation includes a &lt;a href="https://awsdocs-neuron.readthedocs-hosted.com/en/latest/general/nki/trainium_inferentia2_arch.html#" rel="noopener noreferrer"&gt;dedicated section&lt;/a&gt; on the architecture design of NeuronCore-v2 and its implications on custom operator development. Importantly, there are many differences between Neuron cores and their AI accelerator counterparts (e.g., GPUs and TPUs). Optimizing for Neuron cores requires a unique set of strategies and skills.&lt;/p&gt;

&lt;p&gt;Similar to other dedicated AI chips, NeuronCore-v2 includes several internal &lt;a href="https://awsdocs-neuron.readthedocs-hosted.com/en/latest/general/nki/trainium_inferentia2_arch.html#neuroncore-v2-compute-engines" rel="noopener noreferrer"&gt;acceleration engines&lt;/a&gt;, each of which specializes in performing certain types of computations. The engines can be run asynchronously and in parallel. The &lt;a href="https://awsdocs-neuron.readthedocs-hosted.com/en/latest/compiler/index.html" rel="noopener noreferrer"&gt;Neuron Compiler&lt;/a&gt; is responsible for transforming ML models into low-level operations and optimizing the choice of compute engine for each one.&lt;/p&gt;

&lt;p&gt;The &lt;a href="https://awsdocs-neuron.readthedocs-hosted.com/en/latest/general/nki/trainium_inferentia2_arch.html#tensor-engine" rel="noopener noreferrer"&gt;Tensor engine&lt;/a&gt; specializes in matrix multiplication. The &lt;a href="https://awsdocs-neuron.readthedocs-hosted.com/en/latest/general/nki/trainium_inferentia2_arch.html#vector-engine" rel="noopener noreferrer"&gt;Vector&lt;/a&gt; and &lt;a href="https://awsdocs-neuron.readthedocs-hosted.com/en/latest/general/nki/trainium_inferentia2_arch.html#scalar-engine" rel="noopener noreferrer"&gt;Scalar&lt;/a&gt; engines both operate on tensors, with the Vector engine specializing in reduction operations and the Scalar engine in non-linear functions. &lt;a href="https://awsdocs-neuron.readthedocs-hosted.com/en/latest/general/nki/trainium_inferentia2_arch.html#gpsimd-engine" rel="noopener noreferrer"&gt;GpSimd&lt;/a&gt; is a general-purpose engine capable of running arbitrary C/C++ programs. Note that while the &lt;a href="https://awsdocs-neuron.readthedocs-hosted.com/en/latest/general/nki/index.html" rel="noopener noreferrer"&gt;NKI&lt;/a&gt; interface exposes access to all four compute engines, &lt;a href="https://awsdocs-neuron.readthedocs-hosted.com/en/latest/neuron-customops/index.html" rel="noopener noreferrer"&gt;custom C++ operators&lt;/a&gt; are designed specifically for the &lt;a href="https://awsdocs-neuron.readthedocs-hosted.com/en/latest/general/nki/trainium_inferentia2_arch.html#gpsimd-engine" rel="noopener noreferrer"&gt;GpSimd&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;More details on the capabilities of each engine can be found in the architecture documentation. Furthermore, the &lt;a href="https://awsdocs-neuron.readthedocs-hosted.com/en/latest/general/nki/api/nki.isa.html" rel="noopener noreferrer"&gt;NKI Instruction Set Architecture (ISA)&lt;/a&gt; documentation provides details on the engines on which different low-level operations are run.&lt;/p&gt;

&lt;p&gt;Another important aspect of the Neuron chip is its &lt;a href="https://awsdocs-neuron.readthedocs-hosted.com/en/latest/general/nki/trainium_inferentia2_arch.html#data-movement" rel="noopener noreferrer"&gt;memory architecture&lt;/a&gt;. A Neuron device includes three types of memory, HBM, SBUF, and PSUM. An intimate understanding of the capacities and capabilities of each one is crucial for optimal kernel development.&lt;/p&gt;

&lt;p&gt;Given the architecture overview, you might conclude that Neuron kernel development requires a high level of expertise. While this may be true for creating fully optimized kernels that leverage all the capabilities of the Neuron core, our aim is to demonstrate the accessibility, value, and potential of the Neuron custom kernel APIs - even for non-expert developers.&lt;/p&gt;

&lt;h1&gt;
  
  
  Custom NKI Kernels
&lt;/h1&gt;

&lt;p&gt;The &lt;a href="https://awsdocs-neuron.readthedocs-hosted.com/en/latest/general/nki/index.html" rel="noopener noreferrer"&gt;NKI&lt;/a&gt; interface is a Python-level API that exposes the use of the Neuron core compute engines and memory resources to ML developers. The &lt;a href="https://awsdocs-neuron.readthedocs-hosted.com/en/latest/general/nki/getting_started.html" rel="noopener noreferrer"&gt;NKI Getting Started&lt;/a&gt; guide details the setup instructions and provides a soft landing with a simple, "hello world", kernel. The &lt;a href="https://awsdocs-neuron.readthedocs-hosted.com/en/latest/general/nki/programming_model.html" rel="noopener noreferrer"&gt;NKI Programming Model&lt;/a&gt; guide details the three stages of a typical NKI kernel (loading inputs, running operations on the computation engines, and storing outputs) and introduces the NKI Tile and &lt;a href="https://awsdocs-neuron.readthedocs-hosted.com/en/latest/general/nki/programming_model.html#nki-pm-tile" rel="noopener noreferrer"&gt;Tile-based operations&lt;/a&gt;. The &lt;a href="https://awsdocs-neuron.readthedocs-hosted.com/en/latest/general/nki/tutorials.html" rel="noopener noreferrer"&gt;NKI tutorials&lt;/a&gt; demonstrate a variety of NKI kernel sample applications, with each one introducing new core NKI APIs and capabilities. Given the presumed optimality of the sample kernels, one possible strategy for developing new kernels could be to 1) identify a sample that is similar to the operation you wish to implement and then 2) use it as a baseline and iteratively refine and adjust it to achieve the specific functionality you require.&lt;/p&gt;

&lt;p&gt;The &lt;a href="https://awsdocs-neuron.readthedocs-hosted.com/en/latest/general/nki/api/index.html#nki-api-reference" rel="noopener noreferrer"&gt;NKI API Reference Manual&lt;/a&gt; details the Python API for kernel development. With a syntax and semantics that are similar to &lt;a href="https://triton-lang.org/main/index.html" rel="noopener noreferrer"&gt;Triton&lt;/a&gt; and &lt;a href="https://numpy.org/doc/stable/" rel="noopener noreferrer"&gt;NumPy&lt;/a&gt;, the &lt;a href="https://awsdocs-neuron.readthedocs-hosted.com/en/latest/general/nki/api/nki.language.html" rel="noopener noreferrer"&gt;NKI language&lt;/a&gt; definition aims to maximize accessibility and ease of use. However, it is important to note that NKI kernel development is limited to the operations defined in the &lt;a href="https://awsdocs-neuron.readthedocs-hosted.com/en/latest/general/nki/api/nki.html" rel="noopener noreferrer"&gt;NKI&lt;/a&gt; library, which (as of the time of this writing) are fewer and more constrained than in libraries such as &lt;a href="https://triton-lang.org/main/index.html" rel="noopener noreferrer"&gt;Triton&lt;/a&gt; and &lt;a href="https://numpy.org/doc/stable/" rel="noopener noreferrer"&gt;NumPy&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  Toy Example - A GIOU Kernel
&lt;/h2&gt;

&lt;p&gt;As in our &lt;a href="https://towardsdatascience.com/accelerating-ai-ml-model-training-with-custom-operators-163ef2a04b12" rel="noopener noreferrer"&gt;previous posts&lt;/a&gt;, we assess the use of NKI by building a custom implementation of the &lt;a href="https://giou.stanford.edu/" rel="noopener noreferrer"&gt;Generalized Intersection Over Union (GIOU)&lt;/a&gt; operation on a pair of batches of input boxes. Since GIOU involves element-wise operations, we used the &lt;a href="https://awsdocs-neuron.readthedocs-hosted.com/en/latest/general/nki/programming_model.html#tile-size-considerations" rel="noopener noreferrer"&gt;&lt;em&gt;exp&lt;/em&gt; kernel&lt;/a&gt; from the &lt;a href="https://awsdocs-neuron.readthedocs-hosted.com/en/latest/general/nki/programming_model.html" rel="noopener noreferrer"&gt;NKI Programming&lt;/a&gt; guide as a reference point and incorporated the use of NKI's &lt;a href="https://awsdocs-neuron.readthedocs-hosted.com/en/latest/general/nki/programming_model.html#advanced-tensor-indexing" rel="noopener noreferrer"&gt;advanced tensor indexing&lt;/a&gt; in our implementation. To facilitate debugging in a CPU environment, we also added options to run the code using the &lt;a href="https://awsdocs-neuron.readthedocs-hosted.com/en/latest/general/nki/api/generated/nki.simulate_kernel.html#nki.simulate_kernel" rel="noopener noreferrer"&gt;nki.simulate_kernel&lt;/a&gt; and &lt;a href="https://awsdocs-neuron.readthedocs-hosted.com/en/latest/general/nki/api/generated/nki.language.device_print.html" rel="noopener noreferrer"&gt;nki.language.device_print&lt;/a&gt; APIs.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import torch
import neuronxcc.nki as nki
import neuronxcc.nki.language as nl
import numpy as np

simulate = False

try:
    # if torch libraries are installed assume that we are running on Neuron
    import torch_xla.core.xla_model as xm
    import torch_neuronx
    from torch_neuronx import nki_jit

    device = xm.xla_device()

    # empty implementation
    def debug_print(*args, **kwargs):
        pass
except ImportError:
    # if torch libraries are not installed assume that we are running on CPU
    # and program script to use nki simulation
    simulate = True
    nki_jit = nki.trace
    debug_print = nl.device_print
    device = 'cpu'

@nki_jit
def giou_kernel(preds_ptr,
                targets_ptr,
                output_ptr):
    epsilon = 1e-5
    TILE_M = nl.tile_size.pmax  # 128
    TILE_N = nl.tile_size.psum_fmax  # 512
    TILE_N_OUT = TILE_N // 4

    p_1, p_2 = preds_ptr.shape
    t_1, t_2 = targets_ptr.shape
    o_1, o_2 = output_ptr.shape

    #  verify input
    # batch size must be multiple of 128
    assert p_1 % TILE_M == 0
    assert p_1 == t_1
    assert p_1 == o_1
    # num boxes * 4 must be a multiple of 512
    assert p_2 % TILE_N == 0
    assert p_2 == t_2
    assert p_2 // 4 == o_2

    num_tiles_m = p_1 // TILE_M
    num_tiles_n = p_2 // TILE_N

    # Generate tensors for advanced indexing
    i_p = nl.arange(TILE_M)[:, None]
    i_f = nl.arange(TILE_N // 4)[None, :]
    i_f_0 = (4 * i_f)
    i_f_1 = (4 * i_f + 1)
    i_f_2 = (4 * i_f + 2)
    i_f_3 = (4 * i_f + 3)

    # Use affine_range to loop over tiles
    for m in nl.affine_range(num_tiles_m):
        for n in nl.affine_range(num_tiles_n):
            # Load input data from HBM
            preds = nl.load(preds_ptr[m * TILE_M:(m + 1) * TILE_M,
                            n * TILE_N:(n + 1) * TILE_N])
            targets = nl.load(targets_ptr[m * TILE_M:(m + 1) * TILE_M,
                              n * TILE_N:(n + 1) * TILE_N])
            debug_print('preds', preds)
            preds_left = preds[i_p, i_f_0]
            preds_top = preds[i_p, i_f_1]
            preds_right = preds[i_p, i_f_2]
            preds_bottom = preds[i_p, i_f_3]

            gt_left = targets[i_p, i_f_0]
            gt_top = targets[i_p, i_f_1]
            gt_right = targets[i_p, i_f_2]
            gt_bottom = targets[i_p, i_f_3]

            # Compute the area of each box
            area1 = (preds_right - preds_left) * (preds_bottom - preds_top)
            area2 = (gt_right - gt_left) * (gt_bottom - gt_top)

            # Compute the intersection
            left = nl.maximum(preds_left, gt_left)
            top = nl.maximum(preds_top, gt_top)
            right = nl.minimum(preds_right, gt_right)
            bottom = nl.minimum(preds_bottom, gt_bottom)

            inter_w = nl.maximum(right - left, 0)
            inter_h = nl.maximum(bottom - top, 0)
            inter_area = inter_w * inter_h

            union_area = area1 + area2 - inter_area

            iou_val = inter_area / nl.maximum(union_area, epsilon)

            # Compute the smallest enclosing box
            enclose_left = nl.minimum(preds_left, gt_left)
            enclose_top = nl.minimum(preds_top, gt_top)
            enclose_right = nl.maximum(preds_right, gt_right)
            enclose_bottom = nl.maximum(preds_bottom, gt_bottom)

            enclose_w = nl.maximum(enclose_right - enclose_left, 0)
            enclose_h = nl.maximum(enclose_bottom - enclose_top, 0)
            enclose_area = enclose_w * enclose_h

            # Compute GIOU
            delta_area = (enclose_area - union_area)
            enclose_area = nl.maximum(enclose_area, epsilon)
            giou = iou_val - delta_area / enclose_area

            # Store results
            nl.store(output_ptr[m * TILE_M:(m + 1) * TILE_M,
                     n * TILE_N_OUT:(n + 1) * TILE_N_OUT],
                     giou)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
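&lt;p&gt;The advanced tensor indexing used above (the &lt;em&gt;i_f_0&lt;/em&gt; through &lt;em&gt;i_f_3&lt;/em&gt; index tensors) deinterleaves the flattened (left, top, right, bottom) box coordinates. The same access pattern can be sketched in plain NumPy - a CPU-side analogy of our own, not NKI code:&lt;/p&gt;

```python
import numpy as np

# Two samples, two boxes each, flattened to (batch, num_boxes * 4),
# mirroring the layout that the NKI kernel receives.
boxes = np.arange(16, dtype=np.float32).reshape(2, 8)

# NumPy analogue of i_f_0 = 4 * i_f, i_f_1 = 4 * i_f + 1, etc.:
# select every 4th column, starting at offsets 0 through 3.
left   = boxes[:, 0::4]  # x1 coordinates of all boxes
top    = boxes[:, 1::4]  # y1 coordinates
right  = boxes[:, 2::4]  # x2 coordinates
bottom = boxes[:, 3::4]  # y2 coordinates

print(left)  # [[ 0.  4.]
             #  [ 8. 12.]]
```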



&lt;p&gt;To run our GIOU kernel, we generate two batches of random boxes and feed them to our function:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# generate random data in np
np.random.seed(0)
batch_size = 1024
n_boxes = 256
img_size = 256
boxes = []

for i in range(2):
    # Randomly generate box sizes and positions
    box_sizes = np.random.randint(1, img_size, size=(batch_size,n_boxes,2))
    top_left = np.random.randint(0, img_size-1, size=(batch_size,n_boxes,2))
    bottom_right = np.clip(top_left + box_sizes, 0, img_size - 1)

    # Concatenate top-left and bottom-right coordinates
    rand_boxes = np.concatenate((top_left, bottom_right), axis=2)

    boxes.append(rand_boxes.astype(np.float32))

out = np.empty((batch_size, n_boxes), np.float32)

# convert tensors to PyTorch
t_boxes_0 = torch.tensor(boxes[0]).to(device)
t_boxes_1 = torch.tensor(boxes[1]).to(device)
t_out = torch.tensor(out).to(device)

if simulate:
    # the simulation API requires numpy input
    nki.simulate_kernel(giou_kernel,
                        boxes[0].reshape((batch_size, -1)),
                        boxes[1].reshape((batch_size, -1)),
                        out)
else:
    giou_kernel(t_boxes_0.view((batch_size, -1)),
                t_boxes_1.view((batch_size, -1)),
                t_out)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
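&lt;p&gt;When running in simulation mode, it can be helpful to sanity-check the kernel output against an independent reference. The following minimal NumPy implementation of GIOU for a single pair of boxes is our own addition (not part of the original script), mirroring the arithmetic of the kernel above:&lt;/p&gt;

```python
import numpy as np

def giou_reference(box1, box2, epsilon=1e-5):
    # each box is an array of (left, top, right, bottom) coordinates
    area1 = (box1[2] - box1[0]) * (box1[3] - box1[1])
    area2 = (box2[2] - box2[0]) * (box2[3] - box2[1])

    # intersection rectangle (clamped to zero if the boxes are disjoint)
    inter_w = max(min(box1[2], box2[2]) - max(box1[0], box2[0]), 0.0)
    inter_h = max(min(box1[3], box2[3]) - max(box1[1], box2[1]), 0.0)
    inter_area = inter_w * inter_h

    union_area = area1 + area2 - inter_area
    iou = inter_area / max(union_area, epsilon)

    # smallest box enclosing both inputs
    enclose_w = max(box1[2], box2[2]) - min(box1[0], box2[0])
    enclose_h = max(box1[3], box2[3]) - min(box1[1], box2[1])
    enclose_area = max(enclose_w * enclose_h, epsilon)

    return iou - (enclose_area - union_area) / enclose_area

# partially overlapping boxes: IoU = 1/7, enclosing-box penalty = 2/9
print(giou_reference(np.array([0., 0., 2., 2.]),
                     np.array([1., 1., 3., 3.])))  # 1/7 - 2/9 ~ -0.0794
```

&lt;p&gt;Comparing a few entries of the simulated kernel output against this reference provides a quick correctness check before moving to Neuron hardware.&lt;/p&gt;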



&lt;p&gt;To assess the performance of our NKI kernel, we will compare it with the following naive implementation of GIOU in PyTorch:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;def torch_giou(boxes1, boxes2):
    # loosely based on torchvision generalized_box_iou_loss code
    epsilon = 1e-5

    # Compute areas of both sets of boxes
    area1 = (boxes1[...,2]-boxes1[...,0])*(boxes1[...,3]-boxes1[...,1])
    area2 = (boxes2[...,2]-boxes2[...,0])*(boxes2[...,3]-boxes2[...,1])

    # Corners of intersection
    lt = torch.max(boxes1[..., :2], boxes2[..., :2])
    rb = torch.min(boxes1[..., 2:], boxes2[..., 2:])

    # Width and height of intersection
    wh = (rb - lt).clamp(min=0)

    # Area of the intersection
    inter = wh[..., 0] * wh[..., 1]

    # Union of the two boxes
    union = area1 + area2 - inter
    iou = inter / union.clamp(epsilon)

    # Corners of enclosing box
    lti = torch.min(boxes1[..., :2], boxes2[..., :2])
    rbi = torch.max(boxes1[..., 2:], boxes2[..., 2:])

    # Width and height of the enclosing box
    whi = (rbi - lti).clamp(min=0)

    # Area of the enclosing box
    areai = (whi[..., 0] * whi[..., 1]).clamp(epsilon)

    return iou - (areai - union) / areai
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;We use the following benchmarking utility to compare the runtime performance of our two functions:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import time
def benchmark(f, warmup_iters: int = 20, ntrials: int = 100):
    def run(*args, **kwargs):
        # warmup
        for _ in range(warmup_iters):
            f(*args, **kwargs)
        start_time = time.time()
        for _ in range(ntrials):
            f(*args, **kwargs)
        end_time = time.time()
        # Calculate average time per iteration
        avg_time = (end_time - start_time) / ntrials
        return avg_time

    return run

avg_time = benchmark(torch_giou)(t_boxes_0, t_boxes_1)
print(f'torch_giou: {avg_time}')

avg_time = benchmark(giou_kernel)(t_boxes_0.view((batch_size, -1)),
                                  t_boxes_1.view((batch_size, -1)),
                                  t_out)
print(f'giou_kernel: {avg_time}')
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Runtime Environment
&lt;/h2&gt;

&lt;p&gt;We ran our script on an &lt;a href="https://aws.amazon.com/ec2/instance-types/inf2/" rel="noopener noreferrer"&gt;Amazon EC2 inf2.xlarge&lt;/a&gt; instance (containing two &lt;a href="https://awsdocs-neuron.readthedocs-hosted.com/en/latest/general/arch/neuron-hardware/neuron-core-v2.html#neuroncores-v2-arch" rel="noopener noreferrer"&gt;Neuron cores&lt;/a&gt; and four vCPUs). We used the most recent version of the &lt;a href="https://aws.amazon.com/releasenotes/aws-deep-learning-ami-neuron-ubuntu-22-04/" rel="noopener noreferrer"&gt;Deep Learning AMI for Neuron&lt;/a&gt; available at the time of this writing, "Deep Learning AMI Neuron (Ubuntu 22.04) 20241027", with &lt;a href="https://awsdocs-neuron.readthedocs-hosted.com/en/latest/release-notes/index.html#neuron-2-20-1-10-25-2024" rel="noopener noreferrer"&gt;AWS Neuron 2.20.1&lt;/a&gt; and &lt;a href="https://awsdocs-neuron.readthedocs-hosted.com/en/latest/general/appnotes/torch-neuronx/introducing-pytorch-2-1.html" rel="noopener noreferrer"&gt;PyTorch 2.1&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  Results
&lt;/h2&gt;

&lt;p&gt;Our custom GIOU kernel demonstrated an average runtime of 0.211 milliseconds, compared to 0.293 milliseconds for the PyTorch baseline, amounting to a 39% performance boost. Keep in mind that these results are unique to our toy example. Other operators, particularly ones that include matrix multiplications (and utilize the Tensor engine), are likely to exhibit different comparative results.&lt;/p&gt;

&lt;h2&gt;
  
  
  Optimizing NKI Kernel Performance
&lt;/h2&gt;

&lt;p&gt;The next step in our kernel development - beyond the scope of this post - would be to analyze the performance of the GIOU kernel using the dedicated &lt;a href="https://awsdocs-neuron.readthedocs-hosted.com/en/latest/general/nki/neuron_profile_for_nki.html" rel="noopener noreferrer"&gt;Neuron Profiler&lt;/a&gt; in order to identify bottlenecks and optimize our implementation. Please see the &lt;a href="https://awsdocs-neuron.readthedocs-hosted.com/en/latest/general/nki/nki_perf_guide.html#nki-perf-guide" rel="noopener noreferrer"&gt;NKI performance guide&lt;/a&gt; for more details.&lt;/p&gt;

&lt;h1&gt;
  
  
  Neuron Custom C++ Operators
&lt;/h1&gt;

&lt;p&gt;The second method for creating a custom Neuron kernel is to build a C++ operator for the &lt;a href="https://awsdocs-neuron.readthedocs-hosted.com/en/latest/general/nki/trainium_inferentia2_arch.html#gpsimd-engine" rel="noopener noreferrer"&gt;GpSimd engine&lt;/a&gt;. This method is described in the &lt;a href="https://awsdocs-neuron.readthedocs-hosted.com/en/latest/neuron-customops/programming-guide/custom-c%2B%2B-operators-devguide.html#feature-custom-operators-devguide" rel="noopener noreferrer"&gt;Neuron Custom C++ Operators Developer Guide&lt;/a&gt; and demonstrated in the &lt;a href="https://awsdocs-neuron.readthedocs-hosted.com/en/latest/neuron-customops/tutorials/customop-mlp-training.html#neuronx-customop-mlp-tutorial" rel="noopener noreferrer"&gt;Neuron Custom C++ Operators in MLP&lt;/a&gt; and &lt;a href="https://awsdocs-neuron.readthedocs-hosted.com/en/latest/neuron-customops/tutorials/customop-mlp-perf-opt.html#neuronx-customop-mlp-perf" rel="noopener noreferrer"&gt;Neuron Custom C++ Operators Performance Optimization&lt;/a&gt; tutorials.&lt;/p&gt;

&lt;p&gt;Neuron Custom C++ Operators presents an opportunity for "kernel fusion" on the GpSimd engine by facilitating the combination of multiple low-level operations into a single kernel execution. This approach can significantly reduce the overhead associated with: 1) loading multiple individual kernels, and 2) transferring data between different memory regions.&lt;/p&gt;
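&lt;p&gt;The effect of kernel fusion can be illustrated with a simple CPU analogy (ours, not Neuron code): applying two elementwise operations as separate kernels requires two full passes over memory plus an intermediate buffer, whereas a fused kernel applies both operations in a single traversal:&lt;/p&gt;

```python
import numpy as np

x = np.arange(8, dtype=np.float32)

# Unfused: each operation is a separate pass over memory
tmp = x * 2.0        # "kernel" 1: read x, write an intermediate buffer
unfused = tmp + 1.0  # "kernel" 2: read the intermediate, write the output

# Fused: a single pass applies both operations per element,
# with no intermediate buffer (a stand-in for one fused kernel)
fused = np.empty_like(x)
for i in range(x.size):
    fused[i] = x[i] * 2.0 + 1.0

assert np.allclose(unfused, fused)
```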

&lt;h2&gt;
  
  
  Toy Example - A GIOU C++ Kernel
&lt;/h2&gt;

&lt;p&gt;In the code block below we implement a C++ GIOU operator for Neuron and save it to a file named &lt;em&gt;giou.cpp&lt;/em&gt;. Our kernel uses the &lt;a href="https://awsdocs-neuron.readthedocs-hosted.com/en/latest/neuron-customops/api-reference-guide/custom-ops-ref-guide.html#tcm-accessor" rel="noopener noreferrer"&gt;TCM accessor&lt;/a&gt; for optimizing memory read and write performance and applies the &lt;em&gt;multicore&lt;/em&gt; setting in order to &lt;a href="https://awsdocs-neuron.readthedocs-hosted.com/en/latest/neuron-customops/api-reference-guide/custom-ops-ref-guide.html#using-multiple-gpsimd-cores" rel="noopener noreferrer"&gt;use all eight of the GpSimd's internal processors&lt;/a&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;#include &amp;lt;stdint.h&amp;gt;
#include &amp;lt;stdlib.h&amp;gt;
#include &amp;lt;torch/torch.h&amp;gt;
#include &amp;lt;neuron/neuron-utils.hpp&amp;gt;
#include &amp;lt;algorithm&amp;gt;

// input boxes of shape 1024x256x4
// output scores of shape 1024x256
torch::Tensor giou(const torch::Tensor&amp;amp; t_pred,
                   const torch::Tensor&amp;amp; t_target) {
  size_t num_samples = t_pred.sizes()[0];
  size_t num_boxes = t_pred.sizes()[1];
  torch::Tensor t_out = get_dst_tensor();

  // get the number of GpSimd processors (8 in NeuronCoreV2)
  uint32_t cpu_count = get_cpu_count();
  // get index of current processor
  uint32_t cpu_id = get_cpu_id();

  // divide the batch size into 8 partitions
  uint32_t partition = num_samples / cpu_count;

  // use tcm buffers to load and write data
  size_t tcm_in_size = num_boxes*4;
  size_t tcm_out_size = num_boxes;
  float *tcm_pred = (float*)torch::neuron::tcm_malloc(
                                             sizeof(float)*tcm_in_size);
  float *tcm_target = (float*)torch::neuron::tcm_malloc(
                                             sizeof(float)*tcm_in_size);
  float *tcm_output = (float*)torch::neuron::tcm_malloc(
                                             sizeof(float)*tcm_out_size);
  auto t_pred_tcm_acc = t_pred.tcm_accessor();
  auto t_target_tcm_acc = t_target.tcm_accessor();
  auto t_out_tcm_acc = t_out.tcm_accessor();

  // iterate over each of the entries in the partition
  for (size_t i = 0; i &amp;lt; partition; i++) {
    // load the pred and target boxes into local memory
    t_pred_tcm_acc.tensor_to_tcm&amp;lt;float&amp;gt;(tcm_pred,
                                        partition*cpu_id + i*tcm_in_size,
                                        tcm_in_size);
    t_target_tcm_acc.tensor_to_tcm&amp;lt;float&amp;gt;(tcm_target,
                                          partition*cpu_id + i*tcm_in_size,
                                          tcm_in_size);

    // iterate over each of the boxes in the entry
    for (size_t j = 0; j &amp;lt; num_boxes; j++) {
      const float epsilon = 1e-5;
      const float* box1 = &amp;amp;tcm_pred[j * 4];
      const float* box2 = &amp;amp;tcm_target[j * 4];
      // Compute area of each box
      float area1 = (box1[2] - box1[0]) * (box1[3] - box1[1]);
      float area2 = (box2[2] - box2[0]) * (box2[3] - box2[1]);

      // Compute the intersection
      float left = std::max(box1[0], box2[0]);
      float top = std::max(box1[1], box2[1]);
      float right = std::min(box1[2], box2[2]);
      float bottom = std::min(box1[3], box2[3]);

      float inter_w = std::max(right - left, 0.f);
      float inter_h = std::max(bottom - top, 0.f);
      float inter_area = inter_w * inter_h;

      // Compute the union area
      float union_area = area1 + area2 - inter_area;

      // IoU
      float iou_val = inter_area / std::max(union_area, epsilon);

      // Compute the smallest enclosing box
      float enclose_left = std::min(box1[0], box2[0]);
      float enclose_top = std::min(box1[1], box2[1]);
      float enclose_right = std::max(box1[2], box2[2]);
      float enclose_bottom = std::max(box1[3], box2[3]);

      float enclose_w = std::max(enclose_right - enclose_left, 0.f);
      float enclose_h = std::max(enclose_bottom - enclose_top, 0.f);
      float enclose_area = std::max(enclose_w * enclose_h, epsilon);

      float result = iou_val - (enclose_area-union_area)/enclose_area;
      tcm_output[j] = result;
    }

    // write the giou scores of all boxes in the current entry
    t_out_tcm_acc.tcm_to_tensor&amp;lt;float&amp;gt;(tcm_output,
                                       partition*cpu_id + i*tcm_out_size,
                                       tcm_out_size);
  }

  torch::neuron::tcm_free(tcm_pred);
  torch::neuron::tcm_free(tcm_target);
  torch::neuron::tcm_free(tcm_output);
  return t_out;
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;We require a separate &lt;em&gt;shape.cpp&lt;/em&gt; file that defines the output shape of our GIOU function and registers our custom operator with the Neuron library:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;#include &amp;lt;stdint.h&amp;gt;
#include &amp;lt;stdlib.h&amp;gt;
#include &amp;lt;torch/torch.h&amp;gt;
#include "torchneuron/register.h"

torch::Tensor giou_shape(torch::Tensor boxes1, torch::Tensor boxes2) {
    torch::Tensor t_out = torch::zeros({boxes1.sizes()[0],
                                        boxes1.sizes()[1]},
                                       torch::kFloat);
    return t_out;
}

NEURON_LIBRARY(my_ops, m) {
  m.def("giou", &amp;amp;giou_shape, "giou");
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The &lt;em&gt;build.py&lt;/em&gt; script compiles the C++ operator and exposes it as a Python API:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import os
import torch_neuronx
from torch_neuronx.xla_impl import custom_op

custom_op.load(
    name='giou',
    compute_srcs=['giou.cpp'],
    shape_srcs=['shape.cpp'],
    build_directory=os.getcwd(),
    multicore=True,
    verbose=True
)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The compilation script generates a &lt;em&gt;libgiou.so&lt;/em&gt; library containing the implementation of our C++ GIOU operator. In the code block below we load the library and measure the performance of our custom kernel using the benchmarking utility defined above:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;from torch_neuronx.xla_impl import custom_op
custom_op.load_library('libgiou.so')

avg_time = benchmark(torch.ops.my_ops.giou)(t_boxes_0, t_boxes_1)
print(f'C++ giou: {avg_time}')
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Runtime Environment
&lt;/h2&gt;

&lt;p&gt;We used the same Neuron environment from our NKI experiments to compile and test our C++ kernel. Please note the &lt;a href="https://awsdocs-neuron.readthedocs-hosted.com/en/latest/neuron-customops/programming-guide/custom-c%2B%2B-operators-devguide.html#setup-installation" rel="noopener noreferrer"&gt;installation steps&lt;/a&gt; that are required for custom C++ operator development.&lt;/p&gt;

&lt;h2&gt;
  
  
  Results
&lt;/h2&gt;

&lt;p&gt;Our C++ GIOU kernel demonstrated an average runtime of 0.061 milliseconds - nearly five times faster than our baseline implementation. This is presumably a result of "kernel fusion", as discussed above.&lt;/p&gt;
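&lt;p&gt;For reference, the reported speedups follow directly from the measured average runtimes (values taken from the Results sections above):&lt;/p&gt;

```python
# average runtimes reported above, in milliseconds
torch_ms, nki_ms, cpp_ms = 0.293, 0.211, 0.061

nki_boost = torch_ms / nki_ms - 1  # ~0.39 -> "39% performance boost"
cpp_speedup = torch_ms / cpp_ms    # ~4.8  -> "nearly five times faster"

print(f'NKI boost: {nki_boost:.0%}, C++ speedup: {cpp_speedup:.1f}x')
# NKI boost: 39%, C++ speedup: 4.8x
```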

&lt;h1&gt;
  
  
  Conclusion
&lt;/h1&gt;

&lt;p&gt;The table below summarizes the runtime results of our experiments.&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7in6igsvbnj3pk715wy8.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7in6igsvbnj3pk715wy8.png" width="700" height="136"&gt;&lt;/a&gt;Avg time of different GIOU implementations (lower is better) - by Author&lt;/p&gt;

&lt;p&gt;Please keep in mind that these results are specific to the toy example and runtime environment used in this study. The comparative results of other kernels might be very different - depending on the degree to which they can leverage the Neuron core's internal compute engines.&lt;/p&gt;

&lt;p&gt;The table below summarizes some of the differences we observed between the two methods of AWS Neuron kernel customization.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F05i4a5oi7u5zby977ydv.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F05i4a5oi7u5zby977ydv.png" width="700" height="175"&gt;&lt;/a&gt;Comparison between kernel customization tools (by Author)&lt;/p&gt;

&lt;p&gt;Through its high-level Python interface, NKI exposes the power of the Neuron acceleration engines to ML developers in an accessible and user-friendly manner. The low-level C++ Custom Operators library enables even greater programmability, but is limited to the &lt;a href="https://awsdocs-neuron.readthedocs-hosted.com/en/latest/general/nki/trainium_inferentia2_arch.html#gpsimd-engine" rel="noopener noreferrer"&gt;GpSimd engine&lt;/a&gt;. By effectively combining both tools, developers can fully leverage the AWS Neuron architecture's capabilities.&lt;/p&gt;

&lt;h1&gt;
  
  
  Summary
&lt;/h1&gt;

&lt;p&gt;With the AI revolution in full swing, many companies are developing advanced new AI chips to meet the growing demand for compute. While public announcements often highlight these chips' runtime performance, cost savings, and energy efficiency, several core capabilities are essential to make these chips and their software stacks truly viable for ML development. These capabilities include robust debugging tools, performance analysis and optimization utilities, programmability, and more.&lt;/p&gt;

&lt;p&gt;In this post, we focused on the utilities available for programming AWS's homegrown AI accelerators, &lt;a href="https://aws.amazon.com/machine-learning/trainium/" rel="noopener noreferrer"&gt;Trainium&lt;/a&gt; and &lt;a href="https://aws.amazon.com/machine-learning/inferentia/" rel="noopener noreferrer"&gt;Inferentia&lt;/a&gt;, and demonstrated their use in building custom ML operations. These tools empower developers to optimize the performance of their ML models on AWS's AI chips and open up new opportunities for innovation and creativity.&lt;/p&gt;

</description>
      <category>aws</category>
      <category>ec2</category>
      <category>ai</category>
      <category>machinelearning</category>
    </item>
    <item>
      <title>AI Model Optimization on AWS Inferentia and Trainium</title>
      <dc:creator>Chaim Rand</dc:creator>
      <pubDate>Sun, 03 Nov 2024 14:34:56 +0000</pubDate>
      <link>https://dev.to/crand/ai-model-optimization-on-aws-inferentia-and-trainium-1hh0</link>
      <guid>https://dev.to/crand/ai-model-optimization-on-aws-inferentia-and-trainium-1hh0</guid>
      <description>&lt;h2&gt;
  
  
  Tips for accelerating ML with AWS Neuron SDK
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fmiro.medium.com%2Fv2%2Fresize%3Afit%3A700%2F0%2A2Rv2SBFg0AgINmyy" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fmiro.medium.com%2Fv2%2Fresize%3Afit%3A700%2F0%2A2Rv2SBFg0AgINmyy" width="700" height="523"&gt;&lt;/a&gt;&lt;br&gt;
Photo by &lt;a href="https://unsplash.com/@julientromeur?utm_source=medium&amp;amp;utm_medium=referral" rel="noopener noreferrer"&gt;julien Tromeur&lt;/a&gt; on &lt;a href="https://unsplash.com/?utm_source=medium&amp;amp;utm_medium=referral" rel="noopener noreferrer"&gt;Unsplash&lt;/a&gt;&lt;br&gt;
We are in a golden age of AI, with cutting-edge models disrupting industries and poised to transform life as we know it. Powering these advancements are increasingly powerful AI accelerators, such as &lt;a href="https://www.nvidia.com/en-eu/data-center/h100/" rel="noopener noreferrer"&gt;NVIDIA H100 GPUs&lt;/a&gt;, &lt;a href="https://cloud.google.com/tpu" rel="noopener noreferrer"&gt;Google Cloud TPUs&lt;/a&gt;, &lt;a href="https://aws.amazon.com/machine-learning/trainium/" rel="noopener noreferrer"&gt;AWS's Trainium&lt;/a&gt; and &lt;a href="https://aws.amazon.com/machine-learning/inferentia/" rel="noopener noreferrer"&gt;Inferentia&lt;/a&gt; chips, and more. With the growing number of options comes the challenge of &lt;a href="https://towardsdatascience.com/instance-selection-for-deep-learning-7463d774cff0" rel="noopener noreferrer"&gt;selecting the most optimal platform&lt;/a&gt; for our machine learning (ML) workloads -  a crucial decision considering the high costs associated with AI computation. Importantly, a comprehensive assessment of each option necessitates ensuring that we are maximizing its utilization to fully leverage its capabilities.&lt;/p&gt;

&lt;p&gt;In this post, we will review several techniques for optimizing an ML workload on AWS's custom-built AI chips using the &lt;a href="https://awsdocs-neuron.readthedocs-hosted.com/en/latest/index.html" rel="noopener noreferrer"&gt;AWS Neuron SDK&lt;/a&gt;. This continues our ongoing series of posts focused on ML model performance analysis and optimization across various platforms and environments (e.g., see &lt;a href="https://towardsdatascience.com/pytorch-model-performance-analysis-and-optimization-10c3c5822869" rel="noopener noreferrer"&gt;here&lt;/a&gt; and &lt;a href="https://towardsdatascience.com/training-ai-models-on-cpu-3903adc9f388" rel="noopener noreferrer"&gt;here&lt;/a&gt;). While our primary focus will be on an ML training workload and AWS Inferentia2, the techniques discussed are also applicable to AWS Trainium. (Recall that although AWS Inferentia is primarily designed as an AI inference chip, we have &lt;a href="https://towardsdatascience.com/dl-training-on-aws-inferentia-53e103597a03" rel="noopener noreferrer"&gt;previously demonstrated&lt;/a&gt; its effectiveness in training tasks as well.)&lt;/p&gt;

&lt;p&gt;Generally speaking, performance optimization is an iterative process that includes a performance analysis step to appropriately identify performance bottlenecks and resource under-utilization (e.g., see &lt;a href="https://towardsdatascience.com/cloud-ml-performance-checklist-caa51e798002" rel="noopener noreferrer"&gt;here&lt;/a&gt;). However, since the techniques we will discuss are general purpose (i.e., they are potentially applicable to any model, regardless of their performance profile), we defer the discussion on &lt;a href="https://awsdocs-neuron.readthedocs-hosted.com/en/latest/frameworks/torch/torch-neuronx/programming-guide/torch-neuronx-profiling-dev-guide.html" rel="noopener noreferrer"&gt;performance analysis with the Neuron SDK&lt;/a&gt; to a future post.&lt;/p&gt;
&lt;h2&gt;
  
  
  Disclaimers
&lt;/h2&gt;

&lt;p&gt;The code we will share is intended for demonstrative purposes only - we make no claims regarding its accuracy, optimality, or robustness. Please do not view this post as a substitute for the official &lt;a href="https://awsdocs-neuron.readthedocs-hosted.com/en/latest/index.html" rel="noopener noreferrer"&gt;Neuron SDK documentation&lt;/a&gt;. Please do not interpret our mention of any platforms, libraries, or optimization techniques as an endorsement for their use. The best options for you will depend greatly on the specifics of your use-case and will require your own in-depth investigation and analysis.&lt;/p&gt;

&lt;p&gt;The experiments described below were run on an &lt;a href="https://aws.amazon.com/ec2/instance-types/inf2/" rel="noopener noreferrer"&gt;Amazon EC2 inf2.xlarge&lt;/a&gt; instance (containing two &lt;a href="https://awsdocs-neuron.readthedocs-hosted.com/en/latest/general/arch/neuron-hardware/neuron-core-v2.html#neuroncores-v2-arch" rel="noopener noreferrer"&gt;Neuron cores&lt;/a&gt; and four vCPUs). We used the most recent version of the &lt;a href="https://aws.amazon.com/releasenotes/aws-deep-learning-ami-neuron-ubuntu-22-04/" rel="noopener noreferrer"&gt;Deep Learning AMI for Neuron&lt;/a&gt; available at the time of this writing, "Deep Learning AMI Neuron (Ubuntu 22.04) 20240927", with &lt;a href="https://awsdocs-neuron.readthedocs-hosted.com/en/latest/release-notes/index.html" rel="noopener noreferrer"&gt;AWS Neuron 2.20&lt;/a&gt; and &lt;a href="https://awsdocs-neuron.readthedocs-hosted.com/en/latest/general/appnotes/torch-neuronx/introducing-pytorch-2-1.html" rel="noopener noreferrer"&gt;PyTorch 2.1&lt;/a&gt;. See the &lt;a href="https://awsdocs-neuron.readthedocs-hosted.com/en/latest/general/setup/neuron-setup/multiframework/multi-framework-ubuntu22-neuron-dlami.html#setup-ubuntu22-multi-framework-dlami" rel="noopener noreferrer"&gt;SDK documentation&lt;/a&gt; for more details on setup and installation. Keep in mind that the Neuron SDK is under active development and that the APIs we refer to, as well as the runtime measurements we report, may become outdated by the time you read this. Please be sure to stay up-to-date with the latest SDK and documentation available.&lt;/p&gt;
&lt;h1&gt;
  
  
  Toy Model
&lt;/h1&gt;

&lt;p&gt;To facilitate our discussion, we introduce the following simple &lt;a href="https://en.wikipedia.org/wiki/Vision_transformer" rel="noopener noreferrer"&gt;Vision Transformer&lt;/a&gt; (ViT)-backed classification model (based on &lt;a href="https://pypi.org/project/timm/" rel="noopener noreferrer"&gt;timm&lt;/a&gt; version 1.0.10):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;from torch.utils.data import Dataset
import time, os
import torch
import torch_xla.core.xla_model as xm
import torch_xla.distributed.parallel_loader as pl
from timm.models.vision_transformer import VisionTransformer

# use random data
class FakeDataset(Dataset):
  def __len__(self):
    return 1000000

  def __getitem__(self, index):
    rand_image = torch.randn([3, 224, 224], dtype=torch.float32)
    label = torch.tensor(data=index % 1000, dtype=torch.int64)
    return rand_image, label

def train(batch_size=16, num_workers=0):
  # Initialize XLA process group for torchrun
  import torch_xla.distributed.xla_backend
  torch.distributed.init_process_group('xla')

  # multi-processing: ensure each worker has same initial weights
  torch.manual_seed(0)
  dataset = FakeDataset()
  model = VisionTransformer()

  # load model to XLA device
  device = xm.xla_device()
  model = model.to(device)
  optimizer = torch.optim.Adam(model.parameters())
  data_loader = torch.utils.data.DataLoader(dataset,
                                            batch_size=batch_size,
                                            num_workers=num_workers)

  data_loader = pl.MpDeviceLoader(data_loader, device)
  loss_function = torch.nn.CrossEntropyLoss()
  summ = 0
  count = 0
  t0 = time.perf_counter()

  for step, (inputs, targets) in enumerate(data_loader, start=1):
    optimizer.zero_grad()
    outputs = model(inputs)
    loss = loss_function(outputs, targets)
    loss.backward()
    xm.optimizer_step(optimizer)
    batch_time = time.perf_counter() - t0
    if step &amp;gt; 10:  # skip first steps
      summ += batch_time
      count += 1
    t0 = time.perf_counter()
    if step &amp;gt; 500:
      break
  print(f'average step time: {summ/count}')

if __name__ == '__main__':
  train()

# Initialization command:
# torchrun --nproc_per_node=2 train.py
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Running our baseline model on the two cores of our AWS Inferentia instance results in a training speed of 251.98 samples per second.&lt;/p&gt;
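The throughput figures quoted in this post can be related to the average step time that the script prints. The conversion below is our own back-of-the-envelope sketch, assuming the baseline configuration of a per-core batch size of 16 on the two Neuron cores:

```python
# Relate the reported throughput to the per-worker step time printed by
# the script: total samples processed per step divided by the step time.
num_cores = 2          # inf2.xlarge exposes two Neuron cores
batch_size = 16        # per-core batch size in the baseline run
throughput = 251.98    # samples/sec, baseline result quoted above

avg_step_time = num_cores * batch_size / throughput
print(f'implied average step time: {avg_step_time:.4f} sec')  # ~0.1270
```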

&lt;p&gt;In the next sections, we will iteratively apply a number of potential optimization techniques and assess their impact on step time performance. While we won't go into the full details of each method, we will provide references for further reading (e.g., &lt;a href="https://towardsdatascience.com/pytorch-model-performance-analysis-and-optimization-10c3c5822869" rel="noopener noreferrer"&gt;here&lt;/a&gt;). Importantly, the list we will present is not all-inclusive --- there are many techniques beyond what we will cover. We will organize the methods into three categories: PyTorch optimizations, OpenXLA optimizations, and Neuron-specific optimizations. However, the order of presentation is not binding. In fact, some of the techniques are interdependent --- for example, applying the mixed precision optimization may free up enough device memory to enable increasing the batch size.&lt;/p&gt;

&lt;h1&gt;
  
  
  PyTorch Performance Optimizations
&lt;/h1&gt;

&lt;p&gt;In previous posts (e.g., &lt;a href="https://towardsdatascience.com/pytorch-model-performance-analysis-and-optimization-10c3c5822869" rel="noopener noreferrer"&gt;here&lt;/a&gt;) we have covered the topic of PyTorch model performance analysis and optimization on GPU extensively. Many of the techniques we discussed are relevant to other AI accelerators as well. In this section we will revisit a few of these techniques and apply them to AWS Inferentia.&lt;/p&gt;

&lt;h2&gt;
  
  
  Multi-process Data Loading
&lt;/h2&gt;

&lt;p&gt;In &lt;a href="https://pytorch.org/docs/stable/data.html#single-and-multi-process-data-loading" rel="noopener noreferrer"&gt;multi-process data loading&lt;/a&gt;, the input data is prepared in one or more dedicated CPU processes rather than in the same process that runs the training step. This allows the data loading and training to overlap, which can increase system utilization and lead to a significant speed-up. The number of processes is controlled by the &lt;em&gt;num_workers&lt;/em&gt; parameter of the &lt;a href="https://pytorch.org/docs/stable/data.html#torch.utils.data.DataLoader" rel="noopener noreferrer"&gt;PyTorch DataLoader&lt;/a&gt;. In the following block we run our script with &lt;em&gt;num_workers&lt;/em&gt; set to one:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;train(num_workers=1)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This change results in a training speed of 253.56 samples per second for a boost of less than 1%.&lt;/p&gt;

&lt;h2&gt;
  
  
  Batch Size Optimization
&lt;/h2&gt;

&lt;p&gt;Another important hyperparameter that can influence training speed is the training batch size. Often, we have found that increasing the batch size improves system utilization and results in better performance. However, the effects can vary based on the model and platform. In the case of our toy model on AWS Inferentia, we find that running with a batch size of 8 samples per neuron core results in a speed of 265.68 samples per second - roughly 5% faster than a batch size of 16 samples per core.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;train(batch_size=8, num_workers=1)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  PyTorch Automatic Mixed Precision
&lt;/h2&gt;

&lt;p&gt;Another common method for boosting performance is to use lower precision floats such as the 16-bit BFloat16. Importantly, some model components might not be compatible with reduced precision floats. PyTorch's &lt;a href="https://pytorch.org/tutorials/recipes/recipes/amp_recipe.html" rel="noopener noreferrer"&gt;Automatic Mixed Precision (AMP)&lt;/a&gt; mode attempts to match the most appropriate floating point type to each model operation automatically. Although the Neuron compiler offers its own options for employing &lt;a href="https://awsdocs-neuron.readthedocs-hosted.com/en/latest/general/appnotes/neuronx-cc/neuronx-cc-training-mixed-precision.html#mixed-precision-and-performance-accuracy-tuning-neuronx-cc" rel="noopener noreferrer"&gt;mixed precision&lt;/a&gt;, it also &lt;a href="https://awsdocs-neuron.readthedocs-hosted.com/en/latest/frameworks/torch/torch-neuronx/programming-guide/training/pytorch-neuron-programming-guide.html#automatic-mixed-precision" rel="noopener noreferrer"&gt;supports the option of using PyTorch AMP&lt;/a&gt;. In the code block below we include the modifications required to use PyTorch AMP.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;def train(batch_size=16, num_workers=0):
  # Initialize XLA process group for torchrun
  import torch_xla.distributed.xla_backend
  torch.distributed.init_process_group('xla')

  # multi-processing: ensure each worker has same initial weights
  torch.manual_seed(0)
  dataset = FakeDataset()
  model = VisionTransformer()

  # load model to XLA device
  device = xm.xla_device()
  model = model.to(device)
  optimizer = torch.optim.Adam(model.parameters())
  data_loader = torch.utils.data.DataLoader(dataset,
                                            batch_size=batch_size,
                                            num_workers=num_workers)

  data_loader = pl.MpDeviceLoader(data_loader, device)
  loss_function = torch.nn.CrossEntropyLoss()
  summ = 0
  count = 0
  t0 = time.perf_counter()

  for step, (inputs, targets) in enumerate(data_loader, start=1):
    optimizer.zero_grad()

    # use PyTorch AMP (device_type 'cuda' works together with the
    # is_bf16_supported override in the main block below)
    with torch.autocast(dtype=torch.bfloat16, device_type='cuda'):
      outputs = model(inputs)
      loss = loss_function(outputs, targets)
    loss.backward()
    xm.optimizer_step(optimizer)
    batch_time = time.perf_counter() - t0
    if step &amp;gt; 10:  # skip first steps
      summ += batch_time
      count += 1
    t0 = time.perf_counter()
    if step &amp;gt; 500:
      break
  print(f'average step time: {summ/count}')

if __name__ == '__main__':
  # disable Neuron compiler casting so that AMP controls precision
  os.environ["NEURON_CC_FLAGS"] = "--auto-cast=none"
  # make autocast treat BFloat16 as supported on the 'cuda' device type
  torch.cuda.is_bf16_supported = lambda: True
  train(batch_size=8, num_workers=1)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The resultant training speed is 196.64 samples per second, about 26% lower than the &lt;a href="https://awsdocs-neuron.readthedocs-hosted.com/en/latest/general/appnotes/neuronx-cc/neuronx-cc-training-mixed-precision.html#neuronx-cc-training-mixed-precision" rel="noopener noreferrer"&gt;default mixed precision&lt;/a&gt; setting of the Neuron compiler. It's important to note that while this post focuses on performance, in real-world scenarios, we would also need to evaluate the effect of the mixed precision policy we choose on &lt;a href="https://awsdocs-neuron.readthedocs-hosted.com/en/latest/general/appnotes/neuronx-cc/neuronx-cc-training-mixed-precision.html#performance-accuracy-tradeoffs" rel="noopener noreferrer"&gt;model accuracy&lt;/a&gt;.&lt;/p&gt;

&lt;h1&gt;
  
  
  OpenXLA Optimizations
&lt;/h1&gt;

&lt;p&gt;As discussed in a &lt;a href="https://towardsdatascience.com/a-first-look-at-aws-trainium-1e0605071970" rel="noopener noreferrer"&gt;previous post&lt;/a&gt;, Neuron Cores are treated as &lt;a href="https://awsdocs-neuron.readthedocs-hosted.com/en/latest/frameworks/torch/torch-neuronx/programming-guide/training/pytorch-neuron-programming-guide.html#neuron-xla-device" rel="noopener noreferrer"&gt;XLA devices&lt;/a&gt; and the &lt;a href="https://awsdocs-neuron.readthedocs-hosted.com/en/latest/general/quick-start/torch-neuron.html" rel="noopener noreferrer"&gt;torch-neuronx&lt;/a&gt; Python package implements the &lt;a href="https://github.com/pytorch/xla/" rel="noopener noreferrer"&gt;PyTorch/XLA&lt;/a&gt; API. Consequently, any optimization opportunities provided by the OpenXLA framework, and specifically those offered by the PyTorch/XLA API, can be leveraged on AWS Inferentia and Trainium. In this section we consider a few of these opportunities.&lt;/p&gt;

&lt;h2&gt;
  
  
  BFloat16 Precision
&lt;/h2&gt;

&lt;p&gt;OpenXLA supports the option of &lt;a href="https://awsdocs-neuron.readthedocs-hosted.com/en/latest/frameworks/torch/torch-neuronx/programming-guide/training/pytorch-neuron-programming-guide.html#automatic-casting-of-float-tensors-to-bfloat16" rel="noopener noreferrer"&gt;casting all floats to BFloat16&lt;/a&gt; via the XLA_USE_BF16 environment variable, as shown in the code block below:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;if __name__ == '__main__':
  os.environ['XLA_USE_BF16'] = '1'
  train(batch_size=8, num_workers=1)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The resultant training speed is 394.51 samples per second, nearly 50% faster than the speed of the &lt;a href="https://awsdocs-neuron.readthedocs-hosted.com/en/latest/general/appnotes/neuronx-cc/neuronx-cc-training-mixed-precision.html#neuronx-cc-training-mixed-precision" rel="noopener noreferrer"&gt;default mixed precision&lt;/a&gt; option.&lt;/p&gt;

&lt;h2&gt;
  
  
  Multi-process Device Loading
&lt;/h2&gt;

&lt;p&gt;The PyTorch/XLA &lt;a href="https://pytorch.org/xla/master/_modules/torch_xla/distributed/parallel_loader.html#MpDeviceLoader" rel="noopener noreferrer"&gt;MpDeviceLoader&lt;/a&gt; and its internal &lt;a href="https://pytorch.org/xla/master/_modules/torch_xla/distributed/parallel_loader.html" rel="noopener noreferrer"&gt;ParallelLoader&lt;/a&gt;, which are responsible for loading input data onto the accelerator, include a number of parameters for controlling the transfer of data from the host to the device. In the code block below we tune the &lt;a href="https://github.com/pytorch/xla/blob/v2.1.0/torch_xla/distributed/parallel_loader.py#L86" rel="noopener noreferrer"&gt;&lt;em&gt;batches_per_execution&lt;/em&gt;&lt;/a&gt; setting, which determines the number of batches copied to the device for each execution cycle of the &lt;a href="https://pytorch.org/xla/master/_modules/torch_xla/distributed/parallel_loader.html" rel="noopener noreferrer"&gt;ParallelLoader&lt;/a&gt;. By increasing this setting, we aim to reduce the overhead of host-to-device communication:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;data_loader = torch.utils.data.DataLoader(dataset,
                                          batch_size=batch_size,
                                          num_workers=num_workers
                                          )
data_loader = pl.MpDeviceLoader(data_loader,
                                device, batches_per_execution=10)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;As a result of this optimization, the training speed increased to 1,027.39 samples per second, representing an additional 260% speed-up.&lt;/p&gt;

&lt;h2&gt;
  
  
  Torch Compilation with OpenXLA Backend
&lt;/h2&gt;

&lt;p&gt;In previous posts (e.g., &lt;a href="https://towardsdatascience.com/tips-and-tricks-for-upgrading-to-pytorch-2-3127db1d1f3d" rel="noopener noreferrer"&gt;here&lt;/a&gt;), we have demonstrated the potential performance gains from using &lt;a href="https://pytorch.org/tutorials/intermediate/torch_compile_tutorial.html" rel="noopener noreferrer"&gt;PyTorch's graph compilation&lt;/a&gt; offering. Although &lt;a href="https://openxla.org/xla" rel="noopener noreferrer"&gt;OpenXLA&lt;/a&gt; includes its own graph creation and Just-In-Time (JIT) compilation mechanisms, &lt;a href="https://pytorch.org/xla/master/torch_compile.html" rel="noopener noreferrer"&gt;torch.compile&lt;/a&gt; can provide additional acceleration by eliminating the need for tracing the model operations at every step. The following code snippet demonstrates the use of the dedicated &lt;em&gt;openxla&lt;/em&gt; backend for compiling the model:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;model = model.to(device)
model = torch.compile(backend='openxla')
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Although torch.compile is currently &lt;a href="https://awsdocs-neuron.readthedocs-hosted.com/en/latest/release-notes/torch/torch-neuronx/index.html#known-limitations" rel="noopener noreferrer"&gt;not yet supported&lt;/a&gt; by the Neuron SDK, we include its mention in anticipation of its future release.&lt;/p&gt;

&lt;h1&gt;
  
  
  Neuron SDK Optimizations
&lt;/h1&gt;

&lt;p&gt;In this section we consider some of the optimization opportunities offered by the AWS Neuron SDK and, more specifically, by the Neuron compiler.&lt;/p&gt;

&lt;h2&gt;
  
  
  Mixed Precision
&lt;/h2&gt;

&lt;p&gt;The Neuron SDK supports a variety of &lt;a href="https://awsdocs-neuron.readthedocs-hosted.com/en/latest/general/appnotes/neuronx-cc/neuronx-cc-training-mixed-precision.html#mixed-precision-and-performance-accuracy-tuning-neuronx-cc" rel="noopener noreferrer"&gt;mixed precision&lt;/a&gt; settings. In the code block below we program the compiler to cast all floats to BFloat16 via the &lt;em&gt;NEURON_CC_FLAGS&lt;/em&gt; environment variable.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;if __name__ == '__main__':
  os.environ["NEURON_CC_FLAGS"] = "--auto-cast all --auto-cast-type bf16"
  train(batch_size=8, num_workers=1)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This results (unsurprisingly) in a similar training speed to the OpenXLA BFloat16 experiment described above.&lt;/p&gt;

&lt;h2&gt;
  
  
  FP8
&lt;/h2&gt;

&lt;p&gt;One of the unique features of NeuronCoreV2 is its support of the eight-bit floating point type, fp8_e4m3. The code block below demonstrates how to configure the Neuron compiler to automatically cast all floating-point operations to FP8:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;if __name__ == '__main__':
 os.environ["NEURON_CC_FLAGS"] = "--auto-cast all --auto-cast-type fp8_e4m3"
 train(batch_size=8, num_workers=1)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;While FP8 can accelerate training in some cases, maintaining stable convergence can be more challenging than when using BFloat16 due to its reduced precision and dynamic range. Please see our &lt;a href="https://towardsdatascience.com/accelerating-pytorch-training-workloads-with-fp8-5a5123aec7d7" rel="noopener noreferrer"&gt;previous post&lt;/a&gt; for more on the potential benefits and challenges of FP8 training.&lt;/p&gt;

&lt;p&gt;In the case of our model, using FP8 actually harms runtime performance compared to BFloat16, reducing the training speed to 940.36 samples per second.&lt;/p&gt;

&lt;h2&gt;
  
  
  Compiler Optimizations
&lt;/h2&gt;

&lt;p&gt;The &lt;a href="https://awsdocs-neuron.readthedocs-hosted.com/en/latest/compiler/neuronx-cc/api-reference-guide/neuron-compiler-cli-reference-guide.html" rel="noopener noreferrer"&gt;Neuron compiler&lt;/a&gt; includes a number of controls for optimizing the runtime performance of the compiled graph. Two key settings are &lt;em&gt;model-type&lt;/em&gt; and &lt;em&gt;opt-level&lt;/em&gt;. The &lt;em&gt;model-type *setting applies optimizations tailored to specific model architectures, such as transformers, while the *opt-level *setting allows for balancing compilation time against runtime performance. In the code block below, we program the *model-type&lt;/em&gt; setting to &lt;em&gt;tranformer&lt;/em&gt; and the &lt;em&gt;opt-level&lt;/em&gt; setting to the highest performance option. We further specify the &lt;em&gt;target&lt;/em&gt; runtime device, &lt;em&gt;inf2&lt;/em&gt;, to ensure that the model is optimized for the target device.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;if __name__ == '__main__':
  os.environ['XLA_USE_BF16'] = '1'
  os.environ["NEURON_CC_FLAGS"] = "--model-type transformer " \
                                  "--optlevel 3 " \
                                  "--target inf2"
  train(batch_size=8, num_workers=1)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The above configuration resulted in a training speed of 1093.25 samples per second, amounting to a modest 6% improvement.&lt;/p&gt;

&lt;h1&gt;
  
  
  Results
&lt;/h1&gt;

&lt;p&gt;We summarize the results of our experiments in the table below. Keep in mind that the effect of each of the optimization methods we discussed will depend greatly on the model and the runtime environment.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fk8jmnhs2gxoz4agf8z0h.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fk8jmnhs2gxoz4agf8z0h.png" width="700" height="255"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Experiment Results (by Author)&lt;/p&gt;

&lt;p&gt;The techniques we employed resulted in a 435% performance boost compared to our baseline experiment. It is likely that additional acceleration could be achieved by revisiting and fine-tuning some of the methods we discussed, or by applying other optimization techniques not covered in this post.&lt;/p&gt;

&lt;p&gt;Our goal has been to introduce some of the available optimization strategies and demonstrate their potential impact on runtime performance. However, in a real-world scenario, we would need to assess how each of these optimizations impacts our model convergence. In some cases, adjustments to the model configuration may be necessary to ensure optimal performance without sacrificing accuracy. Additionally, using a performance profiler to identify bottlenecks and measure system resource utilization is essential for guiding and informing our optimization activities.&lt;/p&gt;

&lt;h1&gt;
  
  
  Summary
&lt;/h1&gt;

&lt;p&gt;Nowadays, we are fortunate to have a wide variety of systems on which to run our ML workloads. No matter which platform we choose, our goal is to maximize its capabilities. In this post, we focused on AWS Inferentia and reviewed several techniques for accelerating ML workloads running on it. Be sure to check out our &lt;a href="https://chaimrand.medium.com/" rel="noopener noreferrer"&gt;other posts&lt;/a&gt; for more optimization strategies across various AI accelerators.&lt;/p&gt;

</description>
      <category>aws</category>
      <category>ec2</category>
      <category>ai</category>
      <category>machinelearning</category>
    </item>
    <item>
      <title>Training AI Models on CPU on AWS EC2</title>
      <dc:creator>Chaim Rand</dc:creator>
      <pubDate>Wed, 04 Sep 2024 07:09:43 +0000</pubDate>
      <link>https://dev.to/crand/training-ai-models-on-cpu-on-aws-ec2-4o4p</link>
      <guid>https://dev.to/crand/training-ai-models-on-cpu-on-aws-ec2-4o4p</guid>
      <description>&lt;h2&gt;
  
  
  Revisiting CPU for ML in an Era of GPU Scarcity
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--PcF8MYqj--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://miro.medium.com/v2/resize:fit:700/0%2ATf0e2-5_s5L2MZGZ" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--PcF8MYqj--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://miro.medium.com/v2/resize:fit:700/0%2ATf0e2-5_s5L2MZGZ" width="700" height="467"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Photo by &lt;a href="https://unsplash.com/@quinoal?utm_source=medium&amp;amp;utm_medium=referral" rel="noopener noreferrer"&gt;Quino Al&lt;/a&gt; on &lt;a href="https://unsplash.com/?utm_source=medium&amp;amp;utm_medium=referral" rel="noopener noreferrer"&gt;Unsplash&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The recent successes in AI are often attributed to the emergence and evolution of the GPU. The GPU's architecture, which typically includes thousands of multi-processors, high-speed memory, dedicated tensor cores, and more, is particularly well-suited to meet the intensive demands of AI/ML workloads. Unfortunately, the rapid growth in AI development has led to a surge in the demand for GPUs, making them difficult to obtain. As a result, ML developers are increasingly exploring alternative hardware options for training and running their models. In previous posts, we discussed the possibility of training on dedicated AI ASICs such as &lt;a href="https://towardsdatascience.com/tpu-training-6eb84100d138" rel="noopener noreferrer"&gt;Google Cloud TPU&lt;/a&gt;, &lt;a href="https://towardsdatascience.com/training-on-aws-with-habana-gaudi-3126e183048" rel="noopener noreferrer"&gt;Habana Gaudi&lt;/a&gt;, and &lt;a href="https://towardsdatascience.com/a-first-look-at-aws-trainium-1e0605071970" rel="noopener noreferrer"&gt;AWS Trainium&lt;/a&gt;. While these options offer significant cost-saving opportunities, they do not suit all ML models and can, like the GPU, also suffer from limited availability. In this post we return to the good old-fashioned CPU and revisit its relevance to ML applications. Although CPUs are generally less suited to ML workloads compared to GPUs, they are much easier to acquire. The ability to run (at least some of) our workloads on CPU could have significant implications on development productivity.&lt;/p&gt;

&lt;p&gt;In previous posts (e.g., &lt;a href="https://towardsdatascience.com/overcoming-data-preprocessing-bottlenecks-with-tensorflow-data-service-nvidia-dali-and-other-d6321917f851" rel="noopener noreferrer"&gt;here&lt;/a&gt;) we emphasized the importance of analyzing and optimizing the runtime performance of AI/ML workloads as a means of accelerating development and minimizing costs. While this is crucial regardless of the compute engine used, the profiling tools and optimization techniques can vary greatly between platforms. In this post, we will discuss some of the performance optimization options that pertain to CPU. Our focus will be on &lt;a href="https://www.intel.com/content/www/us/en/products/details/processors/xeon.html" rel="noopener noreferrer"&gt;Intel® Xeon® CPU&lt;/a&gt; processors (with &lt;a href="https://www.intel.com/content/www/us/en/architecture-and-technology/avx-512-overview.html" rel="noopener noreferrer"&gt;Intel® AVX-512&lt;/a&gt;) and on the PyTorch (version 2.4) framework (although similar techniques can be applied to other CPUs and frameworks, as well). More specifically, we will run our experiments on an &lt;a href="https://aws.amazon.com/ec2/instance-types/c7i/" rel="noopener noreferrer"&gt;Amazon EC2 c7i&lt;/a&gt; instance with an &lt;a href="https://docs.aws.amazon.com/dlami/" rel="noopener noreferrer"&gt;AWS Deep Learning AMI&lt;/a&gt;. Please do not view our choice of cloud platform, CPU version, ML framework, or any other tool or library that we mention as an endorsement of it over its alternatives.&lt;/p&gt;

&lt;p&gt;Our goal will be to demonstrate that although ML development on CPU may not be our first choice, there are ways to "soften the blow" and - in some cases - perhaps even make it a viable alternative.&lt;/p&gt;

&lt;h2&gt;
  
  
  Disclaimers
&lt;/h2&gt;

&lt;p&gt;Our intention in this post is to demonstrate just a few of the ML optimization opportunities available on CPU. Contrary to most of the online tutorials on the topic of ML optimization on CPU, we will focus on a training workload rather than an inference workload. There are a number of optimization tools focused specifically on inference that we will not cover (e.g., see &lt;a href="https://pytorch.org/tutorials/intermediate/torchserve_with_ipex.html" rel="noopener noreferrer"&gt;here&lt;/a&gt; and &lt;a href="https://pytorch.org/blog/accelerated-cpu-inference/" rel="noopener noreferrer"&gt;here&lt;/a&gt;).&lt;/p&gt;

&lt;p&gt;Please do not view this post as a replacement of the official documentation on any of the tools or techniques that we mention. Keep in mind that given the rapid pace of AI/ML development, some of the content, libraries, and/or instructions that we mention may become outdated by the time you read this. Please be sure to refer to the most up-to-date documentation available.&lt;/p&gt;

&lt;p&gt;Importantly, the impact of the optimizations that we discuss on runtime performance is likely to vary greatly based on the model and the details of the environment (e.g., see the high degree of variance between models on the official PyTorch &lt;a href="http://github.com/pytorch/pytorch/issues/93531#issuecomment-1457373890" rel="noopener noreferrer"&gt;TorchInductor CPU Inference Performance Dashboard&lt;/a&gt;). The comparative performance numbers we will share are specific to the toy model and runtime environment that we will use. Be sure to reevaluate all of the proposed optimizations on your own model and runtime environment.&lt;/p&gt;

&lt;p&gt;Lastly, our focus will be solely on throughput performance (as measured in samples per second) - not on training convergence. However, it should be noted that some optimization techniques (e.g., batch size tuning, mixed precision, and more) could have a negative effect on the convergence of certain models. In some cases, this can be overcome through appropriate hyperparameter tuning.&lt;/p&gt;

&lt;h1&gt;
  
  
  Toy Example - ResNet-50
&lt;/h1&gt;

&lt;p&gt;We will run our experiments on a simple image classification model with a &lt;a href="https://pytorch.org/vision/main/models/generated/torchvision.models.resnet50" rel="noopener noreferrer"&gt;ResNet-50&lt;/a&gt; backbone (from &lt;a href="https://arxiv.org/abs/1512.03385" rel="noopener noreferrer"&gt;Deep Residual Learning for Image Recognition&lt;/a&gt;). We will train the model on a fake dataset. The full training script appears in the code block below (loosely based on &lt;a href="https://github.com/intel/intel-extension-for-pytorch/blob/main/examples/cpu/training/python-scripts/distributed_data_parallel_training.py" rel="noopener noreferrer"&gt;this example&lt;/a&gt;):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import torch

import torchvision

from torch.utils.data import Dataset, DataLoader

import time

# A dataset with random images and labels

class FakeDataset(Dataset):

    def __len__(self):

        return 1000000

    def __getitem__(self, index):

        rand_image = torch.randn([3, 224, 224], dtype=torch.float32)

        label = torch.tensor(data=index % 10, dtype=torch.uint8)

        return rand_image, label

train_set = FakeDataset()

batch_size=128

num_workers=0

train_loader = DataLoader(

    dataset=train_set,

    batch_size=batch_size,

    num_workers=num_workers

)

model = torchvision.models.resnet50()

criterion = torch.nn.CrossEntropyLoss()

optimizer = torch.optim.SGD(model.parameters())

model.train()

t0 = time.perf_counter()

summ = 0

count = 0

for idx, (data, target) in enumerate(train_loader):

    optimizer.zero_grad()

    output = model(data)

    loss = criterion(output, target)

    loss.backward()

    optimizer.step()

    batch_time = time.perf_counter() - t0

    if idx &amp;gt; 10:  # skip first steps

        summ += batch_time

        count += 1

    t0 = time.perf_counter()

    if idx &amp;gt; 100:

        break

print(f'average step time: {summ/count}')

print(f'throughput: {count*batch_size/summ}')
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Running this script on a c7i.2xlarge (with 8 vCPUs) and the &lt;a href="https://download.pytorch.org/whl/cpu" rel="noopener noreferrer"&gt;CPU&lt;/a&gt; version of PyTorch 2.4, results in a throughput of 9.12 samples per second. For the sake of comparison, we note that the throughput of the same (unoptimized) script on an &lt;a href="https://aws.amazon.com/ec2/instance-types/g5/" rel="noopener noreferrer"&gt;Amazon EC2 g5.2xlarge&lt;/a&gt; instance (with 1 GPU and 8 vCPUs) is 340 samples per second. Taking into account the &lt;a href="https://aws.amazon.com/ec2/pricing/on-demand/" rel="noopener noreferrer"&gt;comparative costs&lt;/a&gt; of these two instance types ($0.357 per hour for a c7i.2xlarge and $1.212 for a g5.2xlarge, as of the time of this writing), we find that training on the GPU instance gives roughly eleven(!!) times better price/performance. Based on these results, the preference for using GPUs to train ML models is very well founded. Let's assess some of the possibilities for reducing this gap.&lt;/p&gt;
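The eleven-fold figure can be reproduced from the quoted numbers (instance prices and throughputs as reported above):

```python
# Cost per sample on each instance type, using the prices and
# throughputs quoted above.
c7i_price, c7i_throughput = 0.357, 9.12   # $/hr, samples/sec
g5_price, g5_throughput = 1.212, 340.0    # $/hr, samples/sec

cost_c7i = c7i_price / (c7i_throughput * 3600)   # $ per sample
cost_g5 = g5_price / (g5_throughput * 3600)      # $ per sample
print(f'GPU price/performance advantage: {cost_c7i / cost_g5:.1f}x')  # ~11.0x
```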

&lt;h1&gt;
  
  
  PyTorch Performance Optimizations
&lt;/h1&gt;

&lt;p&gt;In this section we will explore some basic methods for increasing the runtime performance of our training workload. Although you may recognize some of these from our &lt;a href="https://towardsdatascience.com/pytorch-model-performance-analysis-and-optimization-10c3c5822869" rel="noopener noreferrer"&gt;post&lt;/a&gt; on GPU optimization, it is important to highlight a significant difference between training optimization on CPU and GPU platforms. On GPU platforms much of our effort was dedicated to maximizing the parallelization between (the training data preprocessing on) the CPU and (the model training on) the GPU. On CPU platforms all of the processing occurs on the CPU and our goal will be to allocate its resources most effectively.&lt;/p&gt;
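One basic knob for such resource allocation (our own illustration, not part of the training script above) is PyTorch's intra-op thread pool, which bounds how many cores each CPU op may use:

```python
import torch

# Inspect and set the intra-op thread pool that PyTorch uses for CPU
# ops; by default it typically matches the number of physical cores.
torch.set_num_threads(4)     # hypothetical value for an 8-vCPU host
print(torch.get_num_threads())
```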

&lt;h2&gt;
  
  
  Batch Size
&lt;/h2&gt;

&lt;p&gt;Increasing the training batch size can potentially increase performance by reducing the frequency of the model parameter updates. (On GPUs it has the added benefit of reducing the overhead of CPU-GPU transactions such as kernel loading.) However, while on GPU we aimed for a batch size that would maximize the utilization of the GPU memory, the same strategy might hurt performance on CPU. For reasons beyond the scope of this post, CPU memory behavior is more complicated, and the best approach for discovering the optimal batch size may be trial and error. Keep in mind that changing the batch size could affect training convergence.&lt;/p&gt;

&lt;p&gt;The table below summarizes the throughput of our training workload for a few (arbitrary) choices of batch size:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F744z2ab70dhoo3nkj948.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F744z2ab70dhoo3nkj948.png" width="263" height="119"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Training Throughput as Function of Batch Size (by Author)&lt;/p&gt;

&lt;p&gt;Contrary to our findings on GPU, on the c7i.2xlarge instance type our model appears to prefer lower batch sizes.&lt;/p&gt;
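&lt;p&gt;A trial-and-error sweep of this kind is easy to automate. The helper below is a hypothetical sketch (not part of the original script): it times a user-supplied training-step function and reports samples per second, discarding a few warm-up iterations:&lt;/p&gt;

```python
import time

def measure_throughput(step_fn, batch_size, warmup=5, steps=20):
    """Run `steps` timed training steps and return throughput in samples/sec.

    `step_fn(batch_size)` is assumed to execute one training step
    (forward, backward, and optimizer update) on a single batch.
    """
    for _ in range(warmup):          # discard warm-up iterations
        step_fn(batch_size)
    start = time.perf_counter()
    for _ in range(steps):
        step_fn(batch_size)
    elapsed = time.perf_counter() - start
    return steps * batch_size / elapsed

# Sweep a few candidate batch sizes and keep the fastest, e.g.:
#   best = max([32, 64, 128, 256],
#              key=lambda b: measure_throughput(train_step, b))
```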

&lt;h2&gt;
  
  
  Multi-process Data Loading
&lt;/h2&gt;

&lt;p&gt;A common technique on GPUs is to &lt;a href="https://pytorch.org/docs/stable/data.html#single-and-multi-process-data-loading" rel="noopener noreferrer"&gt;assign multiple processes&lt;/a&gt; to the data loader so as to reduce the likelihood of starvation of the GPU. On GPU platforms, a general rule of thumb is to set the number of workers according to the number of CPU cores. However, on CPU platforms, where the model training uses the same resources as the data loader, this approach could backfire. Once again, the best approach for choosing the optimal number of workers may be trial and error. The table below shows the average throughput for different choices of &lt;em&gt;num_workers&lt;/em&gt;:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fuun4zzp5b4p1ms4fxwph.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fuun4zzp5b4p1ms4fxwph.png" width="397" height="217"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Training Throughput as Function of the Number of Data Loading Workers (by Author)&lt;/p&gt;
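&lt;p&gt;For reference, the number of workers is controlled through the &lt;em&gt;num_workers&lt;/em&gt; argument of the PyTorch DataLoader. A minimal sketch (the synthetic dataset is a stand-in for a real one):&lt;/p&gt;

```python
import torch
from torch.utils.data import DataLoader, Dataset

class FakeDataset(Dataset):
    """Synthetic stand-in for a real image dataset."""
    def __len__(self):
        return 10000

    def __getitem__(self, index):
        return torch.randn([3, 224, 224]), index % 10

# On a CPU-only platform the worker processes compete with the training
# computation for the same cores, so sweep a few small values of
# num_workers rather than defaulting to the core count as on GPU.
train_loader = DataLoader(FakeDataset(), batch_size=32, num_workers=2)
```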

&lt;h2&gt;
  
  
  Mixed Precision
&lt;/h2&gt;

&lt;p&gt;Another popular technique is to use lower precision floating point datatypes such as &lt;code&gt;torch.float16&lt;/code&gt; or &lt;code&gt;torch.bfloat16&lt;/code&gt;, with the dynamic range of &lt;code&gt;torch.bfloat16&lt;/code&gt; generally considered to be more amenable to ML training. Naturally, reducing the datatype precision can have adverse effects on convergence and should be done carefully. PyTorch comes with &lt;a href="https://pytorch.org/docs/stable/amp.html" rel="noopener noreferrer"&gt;torch.amp&lt;/a&gt;, an automatic mixed precision package for optimizing the use of these datatypes. Intel® AVX-512 includes &lt;a href="https://pytorch.org/blog/empowering-pytorch-on-intel-xeon-scalable-processors-with-bfloat16/" rel="noopener noreferrer"&gt;support for the bfloat16&lt;/a&gt; datatype. The modified training step appears below:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;for idx, (data, target) in enumerate(train_loader):

    optimizer.zero_grad()

    with torch.amp.autocast('cpu',dtype=torch.bfloat16):

        output = model(data)

        loss = criterion(output, target)

    loss.backward()

    optimizer.step()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The throughput following this optimization is 24.34 samples per second, an increase of 86%!!&lt;/p&gt;
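&lt;p&gt;To see the mechanism at work: under a CPU autocast context, operations on PyTorch's bfloat16 cast list (such as matrix multiplication) run in the lower precision automatically, while other operations keep their original dtype:&lt;/p&gt;

```python
import torch

a = torch.randn(4, 4)  # float32 inputs
b = torch.randn(4, 4)

with torch.amp.autocast('cpu', dtype=torch.bfloat16):
    c = a @ b          # matmul is on the autocast list, runs in bfloat16
    d = a + b          # pointwise add is not, stays in float32

print(c.dtype, d.dtype)
```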

&lt;h2&gt;
  
  
  Channels Last Memory Format
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://pytorch.org/tutorials/intermediate/memory_format_tutorial.html" rel="noopener noreferrer"&gt;Channels last memory format&lt;/a&gt; is a beta-level optimization (at the time of this writing), pertaining primarily to vision models, that supports storing four dimensional (NCHW) tensors in memory such that the channels are the last dimension. This results in all of the data of each pixel being stored together. This optimization pertains primarily to vision models. &lt;a href="https://intel.github.io/intel-extension-for-pytorch/cpu/latest/tutorials/performance_tuning/tuning_guide.html#channels-last" rel="noopener noreferrer"&gt;Considered to be more "friendly to Intel platforms"&lt;/a&gt;, this memory format is &lt;a href="https://pytorch.org/blog/accelerating-pytorch-vision-models-with-channels-last-on-cpu/" rel="noopener noreferrer"&gt;reported&lt;/a&gt; boost the performance of a ResNet-50 on an &lt;a href="https://www.intel.com/content/www/us/en/products/details/processors/xeon.html" rel="noopener noreferrer"&gt;Intel® Xeon® CPU&lt;/a&gt;. The adjusted training step appears below:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;for idx, (data, target) in enumerate(train_loader):

    data = data.to(memory_format=torch.channels_last)

    optimizer.zero_grad()

    with torch.amp.autocast('cpu',dtype=torch.bfloat16):

        output = model(data)

        loss = criterion(output, target)

    loss.backward()

    optimizer.step()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The resulting throughput is 37.93 samples per second, an additional 56% improvement and a cumulative speedup of roughly 4.16x over our baseline experiment. We are on a roll!!&lt;/p&gt;
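&lt;p&gt;Note that the training step above converts only the input tensors. The PyTorch memory-format tutorial also converts the model itself, so that the convolution weights are stored in channels-last order; the layout then propagates through convolution outputs. A minimal sketch (with a single conv layer standing in for the full model):&lt;/p&gt;

```python
import torch

# Converting a module rewrites its 4D weight tensors to channels-last
conv = torch.nn.Conv2d(3, 8, kernel_size=3).to(memory_format=torch.channels_last)

x = torch.randn(4, 3, 32, 32).to(memory_format=torch.channels_last)
y = conv(x)
# The channels-last layout is preserved through the convolution
assert y.is_contiguous(memory_format=torch.channels_last)
```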

&lt;h2&gt;
  
  
  Torch Compilation
&lt;/h2&gt;

&lt;p&gt;In a &lt;a href="https://towardsdatascience.com/tips-and-tricks-for-upgrading-to-pytorch-2-3127db1d1f3d" rel="noopener noreferrer"&gt;previous post&lt;/a&gt; we covered the virtues of PyTorch's support for &lt;a href="https://pytorch.org/docs/stable/generated/torch.compile.html" rel="noopener noreferrer"&gt;graph compilation&lt;/a&gt; and its potential impact on runtime performance. Contrary to the default eager execution mode in which each operation is run independently (a.k.a., "eagerly"), the &lt;a href="https://pytorch.org/docs/stable/generated/torch.compile.html" rel="noopener noreferrer"&gt;compile&lt;/a&gt; API converts the model into an intermediate computation graph which is then JIT-compiled into low-level machine code in a manner that is optimal for the underlying training engine. The API supports compilation via different backend libraries and with multiple configuration options. Here we will limit our evaluation to the &lt;em&gt;default&lt;/em&gt; (TorchInductor) backend and the &lt;a href="https://github.com/intel/intel-extension-for-pytorch" rel="noopener noreferrer"&gt;&lt;em&gt;ipex&lt;/em&gt;&lt;/a&gt; backend from the &lt;a href="https://pytorch.org/tutorials/recipes/recipes/intel_extension_for_pytorch.html" rel="noopener noreferrer"&gt;Intel® Extension for PyTorch&lt;/a&gt;, a library with dedicated optimizations for Intel hardware. Please see the &lt;a href="https://intel.github.io/intel-extension-for-pytorch/index.html#installation?platform=cpu&amp;amp;version=v2.4.0%2bcpu&amp;amp;os=linux%2fwsl2&amp;amp;package=pip" rel="noopener noreferrer"&gt;documentation&lt;/a&gt; for appropriate installation and usage instructions. The updated model definition appears below:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import intel_extension_for_pytorch as ipex

model = torchvision.models.resnet50()

backend='inductor' # optionally change to 'ipex'

model = torch.compile(model, backend=backend)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;In the case of our toy model, the impact of torch compilation is only apparent when the "channels last" optimization is disabled (an increase of ~27% for each of the backends). When "channels last" is applied, the performance actually drops. As a result, we drop torch compilation from our subsequent experiments.&lt;/p&gt;

&lt;h1&gt;
  
  
  Memory and Thread Optimizations
&lt;/h1&gt;

&lt;p&gt;There are a number of opportunities for &lt;a href="https://pytorch.org/tutorials/recipes/recipes/tuning_guide.html#cpu-specific-optimizations" rel="noopener noreferrer"&gt;optimizing the use of the underlying CPU resources&lt;/a&gt;. These include adapting memory management and thread allocation to the &lt;a href="https://intel.github.io/intel-extension-for-pytorch/cpu/latest/tutorials/performance_tuning/tuning_guide.html#intel-cpu-structure" rel="noopener noreferrer"&gt;structure&lt;/a&gt; of the underlying CPU hardware. Memory management can be improved through the use of &lt;a href="https://pytorch.org/tutorials/recipes/recipes/tuning_guide.html#switch-memory-allocator" rel="noopener noreferrer"&gt;advanced memory allocators&lt;/a&gt; (such as &lt;a href="https://github.com/jemalloc/jemalloc" rel="noopener noreferrer"&gt;Jemalloc&lt;/a&gt; and &lt;a href="https://google.github.io/tcmalloc/overview.html" rel="noopener noreferrer"&gt;TCMalloc&lt;/a&gt;) and/or by reducing slower memory accesses (i.e., across &lt;a href="https://en.wikipedia.org/wiki/Non-uniform_memory_access" rel="noopener noreferrer"&gt;NUMA nodes&lt;/a&gt;). Thread allocation can be improved through appropriate &lt;a href="https://pytorch.org/tutorials/recipes/recipes/tuning_guide.html#utilize-openmp" rel="noopener noreferrer"&gt;configuration of the OpenMP threading library&lt;/a&gt; and/or use of &lt;a href="https://pytorch.org/tutorials/recipes/recipes/tuning_guide.html#utilize-openmp" rel="noopener noreferrer"&gt;Intel's OpenMP library&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Generally speaking, these kinds of optimizations require a deep understanding of the CPU architecture and the features of its supporting software stack. To simplify matters, PyTorch offers the &lt;a href="https://pytorch.org/tutorials/recipes/xeon_run_cpu.html" rel="noopener noreferrer"&gt;&lt;em&gt;torch.backends.xeon.run_cpu&lt;/em&gt;&lt;/a&gt; script for automatically configuring the memory and threading libraries so as to optimize runtime performance. The command below will result in the use of the dedicated memory and threading libraries. We will return to the topic of NUMA nodes when we discuss the option of distributed training.&lt;/p&gt;

&lt;p&gt;We verify appropriate installation of &lt;a href="https://google.github.io/tcmalloc/overview.html" rel="noopener noreferrer"&gt;TCMalloc&lt;/a&gt; (&lt;code&gt;conda install conda-forge::gperftools&lt;/code&gt;) and &lt;a href="https://pypi.org/project/intel-openmp/" rel="noopener noreferrer"&gt;Intel's OpenMP library&lt;/a&gt; (&lt;code&gt;pip install intel-openmp&lt;/code&gt;), and run the following command.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;python -m torch.backends.xeon.run_cpu train.py
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The use of the &lt;a href="https://pytorch.org/tutorials/recipes/xeon_run_cpu.html" rel="noopener noreferrer"&gt;&lt;em&gt;run_cpu&lt;/em&gt;&lt;/a&gt; script further boosts our runtime performance to 39.05 samples per second. Note that the &lt;a href="https://pytorch.org/tutorials/recipes/xeon_run_cpu.html" rel="noopener noreferrer"&gt;&lt;em&gt;run_cpu&lt;/em&gt;&lt;/a&gt; script includes many controls for further tuning performance. Be sure to check out the documentation in order to maximize its use.&lt;/p&gt;

&lt;h1&gt;
  
  
  The Intel Extension for PyTorch
&lt;/h1&gt;

&lt;p&gt;The &lt;a href="https://pytorch.org/tutorials/recipes/recipes/intel_extension_for_pytorch.html" rel="noopener noreferrer"&gt;Intel® Extension for PyTorch&lt;/a&gt; includes additional opportunities for &lt;a href="https://intel.github.io/intel-extension-for-pytorch/cpu/latest/tutorials/performance_tuning/tuning_guide.html#non-uniform-memory-access-numa" rel="noopener noreferrer"&gt;training optimization&lt;/a&gt; via its &lt;a href="https://intel.github.io/intel-extension-for-pytorch/latest/tutorials/api_doc.html" rel="noopener noreferrer"&gt;ipex.optimize&lt;/a&gt; function. Here we demonstrate its default use. Please see the documentation to learn of its full capabilities.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;model = torchvision.models.resnet50()

criterion = torch.nn.CrossEntropyLoss()

optimizer = torch.optim.SGD(model.parameters())

model.train()

model, optimizer = ipex.optimize(

   model,

   optimizer=optimizer,

   dtype=torch.bfloat16

)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Combined with the memory and thread optimizations discussed above, the resultant throughput is 40.73 samples per second. (Note that a similar result is reached when disabling the "channels last" configuration.)&lt;/p&gt;

&lt;h1&gt;
  
  
  Distributed Training on CPU
&lt;/h1&gt;

&lt;p&gt;&lt;a href="https://www.intel.com/content/www/us/en/products/details/processors/xeon.html" rel="noopener noreferrer"&gt;Intel® Xeon®&lt;/a&gt; processors are designed with &lt;a href="https://intel.github.io/intel-extension-for-pytorch/cpu/latest/tutorials/performance_tuning/tuning_guide.html#non-uniform-memory-access-numa" rel="noopener noreferrer"&gt;Non-Uniform Memory Access (NUMA)&lt;/a&gt; in which the CPU memory is divided into groups, a.k.a., NUMA nodes, and each of the CPU cores is assigned to one node. Although any CPU core can access the memory of any NUMA node, the access to its own node (i.e., its local memory) is much faster. This gives rise to the notion of &lt;a href="https://intel.github.io/intel-extension-for-pytorch/cpu/latest/tutorials/examples.html#distributed-training" rel="noopener noreferrer"&gt;distributing training across NUMA nodes&lt;/a&gt;, where the CPU cores assigned to each NUMA node act as a single process in a &lt;a href="https://pytorch.org/docs/stable/distributed.html" rel="noopener noreferrer"&gt;distributed process group&lt;/a&gt; and &lt;a href="https://pytorch.org/tutorials/recipes/recipes/tuning_guide.html#train-a-model-on-cpu-with-pytorch-distributeddataparallel-ddp-functionality" rel="noopener noreferrer"&gt;data distribution&lt;/a&gt; across nodes is managed by &lt;a href="https://github.com/oneapi-src/oneCCL" rel="noopener noreferrer"&gt;Intel® oneCCL&lt;/a&gt;, Intel's dedicated collective communications library.&lt;/p&gt;

&lt;p&gt;We can run data distributed training across NUMA nodes easily using the &lt;a href="https://intel.github.io/intel-extension-for-pytorch/latest/tutorials/performance_tuning/launch_script.html" rel="noopener noreferrer"&gt;&lt;em&gt;ipexrun&lt;/em&gt;&lt;/a&gt; utility. In the following code block (loosely based on &lt;a href="https://github.com/intel/intel-extension-for-pytorch/blob/main/examples/cpu/training/python-scripts/distributed_data_parallel_training.py" rel="noopener noreferrer"&gt;this example&lt;/a&gt;) we adapt our script to run data distributed training (according to the usage detailed &lt;a href="https://github.com/intel/torch-ccl?tab=readme-ov-file#usage" rel="noopener noreferrer"&gt;here&lt;/a&gt;):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import os, time

import torch

from torch.utils.data import Dataset, DataLoader

from torch.utils.data.distributed import DistributedSampler

import torch.distributed as dist

import torchvision

import oneccl_bindings_for_pytorch as torch_ccl

import intel_extension_for_pytorch as ipex

os.environ["MASTER_ADDR"] = "127.0.0.1"

os.environ["MASTER_PORT"] = "29500"

os.environ["RANK"] = os.environ.get("PMI_RANK", "0")

os.environ["WORLD_SIZE"] = os.environ.get("PMI_SIZE", "1")

dist.init_process_group(backend="ccl", init_method="env://")

rank = os.environ["RANK"]

world_size = os.environ["WORLD_SIZE"]

batch_size = 128

num_workers = 0

# define dataset and dataloader

class FakeDataset(Dataset):

    def __len__(self):

        return 1000000

    def __getitem__(self, index):

        rand_image = torch.randn([3, 224, 224], dtype=torch.float32)

        label = torch.tensor(data=index % 10, dtype=torch.uint8)

        return rand_image, label

train_dataset = FakeDataset()

dist_sampler = DistributedSampler(train_dataset)

train_loader = DataLoader(

    dataset=train_dataset,

    batch_size=batch_size,

    num_workers=num_workers,

    sampler=dist_sampler

)

# define model artifacts

model = torchvision.models.resnet50()

criterion = torch.nn.CrossEntropyLoss()

optimizer = torch.optim.SGD(model.parameters())

model.train()

model, optimizer = ipex.optimize(

    model,

    optimizer=optimizer,

    dtype=torch.bfloat16

)

# configure DDP

model = torch.nn.parallel.DistributedDataParallel(model)

# run training loop

# destroy the process group

dist.destroy_process_group()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Unfortunately, as of the time of this writing, the &lt;a href="https://aws.amazon.com/ec2/instance-types/c7i/" rel="noopener noreferrer"&gt;Amazon EC2 c7i&lt;/a&gt; instance family does not include a multi-NUMA instance type. To test our distributed training script, we revert to an &lt;a href="https://aws.amazon.com/ec2/instance-types/c6i/" rel="noopener noreferrer"&gt;Amazon EC2 c6i.32xlarge&lt;/a&gt; instance with 64 vCPUs and 2 NUMA nodes. We verify the &lt;a href="https://github.com/intel/torch-ccl?tab=readme-ov-file#installation" rel="noopener noreferrer"&gt;installation&lt;/a&gt; of &lt;a href="https://github.com/intel/torch-ccl" rel="noopener noreferrer"&gt;Intel® oneCCL Bindings for PyTorch&lt;/a&gt; and run the following command (as documented &lt;a href="https://github.com/intel/intel-extension-for-pytorch/tree/main/examples/cpu/training/python-scripts#running-example-scripts" rel="noopener noreferrer"&gt;here&lt;/a&gt;):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;source $(python -c "import oneccl_bindings_for_pytorch as torch_ccl;print(torch_ccl.cwd)")/env/setvars.sh

# This example command would utilize all the numa sockets of the processor, taking each socket as a rank.

ipexrun cpu --nnodes 1 --omp_runtime intel train.py
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The following table compares the performance results on the &lt;a href="https://aws.amazon.com/ec2/instance-types/c6i/" rel="noopener noreferrer"&gt;c6i.32xlarge&lt;/a&gt; instance with and without distributed training:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Frkufxks9e9sdn8lyye31.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Frkufxks9e9sdn8lyye31.png" width="275" height="82"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Distributed Training Across NUMA Nodes (by Author)&lt;/p&gt;

&lt;p&gt;In our experiment, data distribution did &lt;em&gt;not&lt;/em&gt; boost the runtime performance. Please see &lt;a href="https://intel.github.io/intel-extension-for-pytorch/latest/tutorials/performance_tuning/launch_script.html" rel="noopener noreferrer"&gt;&lt;em&gt;ipexrun documentation&lt;/em&gt;&lt;/a&gt; for additional performance tuning options.&lt;/p&gt;

&lt;h1&gt;
  
  
  CPU Training with Torch/XLA
&lt;/h1&gt;

&lt;p&gt;In previous posts (e.g., &lt;a href="https://towardsdatascience.com/how-to-accelerate-your-pytorch-training-with-xla-on-aws-3d599bc8f6a9" rel="noopener noreferrer"&gt;here&lt;/a&gt;) we discussed the &lt;a href="https://github.com/pytorch/xla" rel="noopener noreferrer"&gt;PyTorch/XLA&lt;/a&gt; library and its use of &lt;a href="https://openxla.org/xla" rel="noopener noreferrer"&gt;XLA compilation&lt;/a&gt; to enable PyTorch based training on &lt;a href="https://github.com/pytorch/xla/blob/master/API_GUIDE.md" rel="noopener noreferrer"&gt;&lt;em&gt;XLA devices&lt;/em&gt;&lt;/a&gt; such as TPU, GPU, &lt;em&gt;and&lt;/em&gt; CPU. Similar to torch compilation, XLA uses graph compilation to generate machine code that is optimized for the target device. With the establishment of the &lt;a href="https://cloud.google.com/blog/products/ai-machine-learning/googles-open-source-momentum-openxla-new-partnerships" rel="noopener noreferrer"&gt;OpenXLA Project&lt;/a&gt;, one of the stated goals was to support high performance across all hardware backends, including CPU (see the CPU RFC &lt;a href="https://docs.google.com/document/d/1ZzMcrjxITJeN2IjjgbzUjHh-4W1YgDUus3j25Dvn9ng/edit#heading=h.w9ztr841aqk8" rel="noopener noreferrer"&gt;here&lt;/a&gt;). The code block below demonstrates the adjustments to our original (unoptimized) script required to train using &lt;a href="https://pytorch.org/xla/release/r2.4/index.html#" rel="noopener noreferrer"&gt;PyTorch/XLA&lt;/a&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import torch

import torchvision

import timeimport torch_xla

import torch_xla.core.xla_model as xm

device = xm.xla_device()

model = torchvision.models.resnet50().to(device)

criterion = torch.nn.CrossEntropyLoss()

optimizer = torch.optim.SGD(model.parameters())

model.train()

for idx, (data, target) in enumerate(train_loader):

    data = data.to(device)

    target = target.to(device)

    optimizer.zero_grad()

    output = model(data)

    loss = criterion(output, target)

    loss.backward()

    optimizer.step()

    xm.mark_step()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Unfortunately, as of the time of this writing, the XLA results on our toy model seem far inferior to the (unoptimized) results we saw above, by as much as 7x. We expect this to improve as PyTorch/XLA's CPU support matures.&lt;/p&gt;

&lt;h1&gt;
  
  
  Results
&lt;/h1&gt;

&lt;p&gt;We summarize the results of a subset of our experiments in the table below. For the sake of comparison, we add the throughput of training our model on an &lt;a href="https://aws.amazon.com/ec2/instance-types/g5/" rel="noopener noreferrer"&gt;Amazon EC2 g5.2xlarge&lt;/a&gt; GPU instance following the optimization steps discussed in &lt;a href="https://towardsdatascience.com/pytorch-model-performance-analysis-and-optimization-10c3c5822869" rel="noopener noreferrer"&gt;this post&lt;/a&gt;. The &lt;em&gt;samples per dollar&lt;/em&gt; metric was calculated based on the &lt;a href="https://aws.amazon.com/ec2/pricing/on-demand/" rel="noopener noreferrer"&gt;Amazon EC2 On-Demand pricing&lt;/a&gt; page ($0.357 per hour for a c7i.2xlarge and $1.212 per hour for a g5.2xlarge, as of the time of this writing).&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9hjoe38ixk4io15oieho.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9hjoe38ixk4io15oieho.png" width="700" height="305"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Performance Optimization Results (by Author)&lt;/p&gt;

&lt;p&gt;Although we succeeded in boosting the training performance of our toy model on the CPU instance by a considerable margin (a roughly 4.5x speedup over our baseline), it remains inferior to the (optimized) performance on the GPU instance. Based on our results, training on the GPU would be roughly 6.7 times cheaper. It is likely that with additional performance tuning and/or additional optimization strategies, we could further close the gap. Once again, we emphasize that the comparative performance results we have reached are unique to this model and runtime environment.&lt;/p&gt;

&lt;h2&gt;
  
  
  Amazon EC2 Spot Instances Discounts
&lt;/h2&gt;

&lt;p&gt;The increased availability of cloud-based CPU instance types (compared to GPU instance types) may imply greater opportunity for obtaining compute power at discounted rates, e.g., through Spot Instance utilization. &lt;a href="https://aws.amazon.com/ec2/spot/" rel="noopener noreferrer"&gt;Amazon EC2 Spot Instances&lt;/a&gt; are instances from surplus cloud service capacity that are offered at a discount of as much as 90% off the On-Demand pricing. In exchange for the discounted price, AWS maintains the right to preempt the instance with little to no warning. Given the high demand for GPUs, you may find CPU spot instances easier to get a hold of than their GPU counterparts. At the time of this writing, the c7i.2xlarge &lt;a href="https://aws.amazon.com/ec2/spot/pricing/" rel="noopener noreferrer"&gt;Spot Instance price&lt;/a&gt; is $0.1291 per hour, which would improve our samples-per-dollar result to 1135.76 and further reduce the gap between the optimized GPU and CPU price performance (to 2.43x).&lt;/p&gt;

&lt;p&gt;While the runtime performance results of the optimized CPU training of our toy model (in our chosen environment) were lower than the GPU results, it is likely that the same optimization steps applied to other model architectures (e.g., ones that include components that are not supported on the GPU) may result in the CPU performance matching or beating that of the GPU. And even in cases where the performance gap is not bridged, there may very well be cases where the shortage of GPU compute capacity justifies running some of our ML workloads on CPU.&lt;/p&gt;

&lt;h1&gt;
  
  
  Summary
&lt;/h1&gt;

&lt;p&gt;Given the ubiquity of the CPU, the ability to use it effectively for training and/or running ML workloads could have huge implications for development productivity and end-product deployment strategy. While the nature of the CPU architecture is less amenable to many ML applications than that of the GPU, there are many tools and techniques available for boosting its performance, a select few of which we have discussed and demonstrated in this post.&lt;/p&gt;

&lt;p&gt;In this post we focused on optimizing training on CPU. Please be sure to check out our many &lt;a href="https://chaimrand.medium.com/" rel="noopener noreferrer"&gt;other posts on Medium&lt;/a&gt; covering a wide variety of topics pertaining to performance analysis and optimization of machine learning workloads.&lt;/p&gt;

</description>
      <category>ec2</category>
      <category>ai</category>
      <category>aws</category>
      <category>machinelearning</category>
    </item>
    <item>
      <title>A Priority Based Scheduler for Amazon SageMaker Training Jobs</title>
      <dc:creator>Chaim Rand</dc:creator>
      <pubDate>Mon, 11 Mar 2024 09:06:14 +0000</pubDate>
      <link>https://dev.to/aws-builders/a-priority-based-scheduler-for-amazon-sagemaker-training-jobs-30f0</link>
      <guid>https://dev.to/aws-builders/a-priority-based-scheduler-for-amazon-sagemaker-training-jobs-30f0</guid>
      <description>&lt;h2&gt;
  
  
  Optimizing the use of limited AI training accelerators — Part 2
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--xEw1epqd--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://miro.medium.com/v2/resize:fit:700/0%2APV9sbX7QZa1mVHfI" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--xEw1epqd--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://miro.medium.com/v2/resize:fit:700/0%2APV9sbX7QZa1mVHfI" alt="" width="700" height="525"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Photo by  &lt;a href="https://unsplash.com/@ahda_gallery?utm_source=medium&amp;amp;utm_medium=referral"&gt;Adrien Aletti&lt;/a&gt;  on  &lt;a href="https://unsplash.com/?utm_source=medium&amp;amp;utm_medium=referral"&gt;Unsplash&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This post was created in collaboration with  &lt;a href="https://www.linkedin.com/in/maxrabin/"&gt;Max Rabin&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;This is the second part of a series of posts on the topic of maximizing the utility of scarce AI resources. In the  &lt;a href="https://towardsdatascience.com/maximizing-the-utility-of-scarce-ai-resources-a-kubernetes-approach-0230ba53965b"&gt;first post&lt;/a&gt;  we noted the increasing limitations on the ability to scale up AI resources at will and, as a consequence, the growing trend of AI development teams to guarantee AI compute capacity by means such as building up an in-house AI server farm and/or reserving dedicated instances in the cloud. The scarcity of AI compute resources motivates the design of specialized scheduling solutions to minimize idle time and prioritize critical workloads. Please see our  &lt;a href="https://towardsdatascience.com/maximizing-the-utility-of-scarce-ai-resources-a-kubernetes-approach-0230ba53965b"&gt;previous post&lt;/a&gt;  in which we proposed a detailed list of requirements for such solutions. The approach we took there was to leverage the existing priority-based  &lt;a href="https://kubernetes.io/docs/concepts/scheduling-eviction/kube-scheduler/"&gt;scheduler&lt;/a&gt;  that comes with  &lt;a href="https://kubernetes.io/"&gt;Kubernetes&lt;/a&gt;  and align our training development workflow to its use. In this post we explore the option of maintaining our existing framework for training AI models and enhancing it with our own custom implementation of a priority-based scheduler. Importantly, the need for this type of solution is often motivated not just by the scarcity of AI resources, but also by the desire to increase control over the orchestration and prioritization of training workloads so as to reduce development costs. For example, even in a scenario of abundant capacity, you may choose to limit your use to a fixed number of training instances so as to cap your training expenditure.&lt;/p&gt;

&lt;p&gt;For the purposes of this post, we will assume that our training framework of choice is AWS’s managed service for AI model training,  &lt;a href="https://aws.amazon.com/sagemaker/"&gt;Amazon SageMaker&lt;/a&gt;. The solution we will propose will use additional AWS services such as  &lt;a href="https://aws.amazon.com/pm/dynamodb/"&gt;Amazon DynamoDB&lt;/a&gt;  and  &lt;a href="https://aws.amazon.com/pm/lambda/"&gt;AWS Lambda&lt;/a&gt;. The choice to demonstrate our solution using AWS services should not be viewed as endorsement. There are many cloud-based service offerings available and the best one for you will depend on the particular details of your project. Similar solutions to the one that we will describe can be designed on other cloud-based environments and/or using alternative cloud-based services.&lt;/p&gt;

&lt;h1&gt;
  
  
  The Traditional Method for Starting Up SageMaker Training Jobs
&lt;/h1&gt;

&lt;p&gt;Traditionally, we would start up a SageMaker training job using the  &lt;a href="https://sagemaker.readthedocs.io/en/stable/"&gt;Amazon SageMaker Python SDK&lt;/a&gt;. In the code block below we use the SageMaker SDK (version 2.208) to run a PyTorch training workload on a single instance of type  &lt;a href="https://aws.amazon.com/ec2/instance-types/p5/"&gt;p5.48xlarge&lt;/a&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;from sagemaker.pytorch import PyTorch  

# define job  
estimator = PyTorch(  
    role='&amp;lt;sagemaker role&amp;gt;',  
    entry_point='train.py',  
    instance_type='ml.p5.48xlarge',  
    instance_count=1,  
    framework_version='2.0.1',  
    py_version='py310',  
    tags=[{'Key': 'priority', 'Value': '100'}],  
)  

# start job  
estimator.fit()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;When the  &lt;em&gt;estimator.fit()&lt;/em&gt;  function is called, the SageMaker library uploads our code to Amazon S3 and then transforms the request to a boto3 SageMaker client  &lt;a href="https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/sagemaker/client/create_training_job.html"&gt;create_training_job&lt;/a&gt;  request (see  &lt;a href="https://github.com/aws/sagemaker-python-sdk/blob/v2.208.0/src/sagemaker/session.py#L976"&gt;here&lt;/a&gt;).&lt;/p&gt;

&lt;p&gt;This method for starting up training jobs is dependent on the availability of the requested resources for its success. In our scenario of scarce AI resources, it is likely to fail more often than not. Although this can be partially mitigated by  &lt;a href="https://medium.com/@chaimrand/retaining-amazon-sagemaker-instance-capacity-with-sagemaker-managed-warm-pools-f7cfd78fa34c"&gt;retaining provisioned compute instances for successive workloads&lt;/a&gt;, the API does  &lt;em&gt;not&lt;/em&gt; provide the appropriate tooling for maximizing their utility. Let’s suppose that we wish to utilize precisely two  &lt;a href="https://aws.amazon.com/ec2/instance-types/p5/"&gt;p5.48xlarge&lt;/a&gt;  instances. To simplify our discussion, let’s assume that each training workload runs on a single instance. Typically, during an AI model development cycle there will be periods when there are more than two training workloads that are waiting to be processed. The existing API would try to start up a third  &lt;a href="https://aws.amazon.com/ec2/instance-types/p5/"&gt;p5.48xlarge&lt;/a&gt;  instance and would most likely fail due to its limited availability. Even when there is instance availability, we may wish to limit our training to just our two designated instances to increase our control over the costs of training.&lt;/p&gt;

&lt;p&gt;We require a new API for submitting jobs for training, one that does not immediately start up a new  &lt;a href="https://aws.amazon.com/ec2/instance-types/p5/"&gt;p5.48xlarge&lt;/a&gt;  instance, but rather enters the jobs to a priority queue. And we need an associated job scheduler that manages the use of our two resources while prioritizing critical workloads.&lt;/p&gt;

&lt;p&gt;Importantly, please note that as of the time of this writing, Amazon SageMaker does  &lt;em&gt;not&lt;/em&gt; support the option of training on  &lt;a href="https://aws.amazon.com/ec2/pricing/reserved-instances/"&gt;reserved Amazon EC2 instances&lt;/a&gt;. And although  &lt;a href="https://aws.amazon.com/savingsplans/ml-pricing/"&gt;Amazon SageMaker Savings Plans&lt;/a&gt;  has similar properties to instance reservations, it does  &lt;em&gt;not&lt;/em&gt; guarantee instance capacity. In a  &lt;a href="https://chaimrand.medium.com/f7cfd78fa34c"&gt;previous post&lt;/a&gt;  we addressed this limitation and proposed using  &lt;a href="https://docs.aws.amazon.com/sagemaker/latest/dg/train-warm-pools.html"&gt;SageMaker managed warm pools&lt;/a&gt;  as an alternative method for retaining access to provisioned instances. For the remainder of the post, we will assume that we are able to attain two instances of our choice whether it be through this or some other method.&lt;/p&gt;

&lt;h1&gt;
  
  
  Priority-Based Scheduling for Amazon SageMaker
&lt;/h1&gt;

&lt;p&gt;In this section we will describe the components of our proposed solution. We will use the  &lt;a href="https://docs.aws.amazon.com/serverless-application-model/latest/developerguide/sam-specification.html"&gt;AWS Serverless Application Model (SAM) specification&lt;/a&gt;. More specifically, we will create an  &lt;a href="https://docs.aws.amazon.com/serverless-application-model/latest/developerguide/sam-specification-template-anatomy.html"&gt;AWS SAM template YAML file&lt;/a&gt;  and gradually add the AWS resources that we need. Please see the  &lt;a href="https://docs.aws.amazon.com/serverless-application-model/latest/developerguide/what-is-sam.html"&gt;documentation&lt;/a&gt;  for details on how to define and deploy serverless solutions using AWS SAM.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--Y27unoY6--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://miro.medium.com/v2/resize:fit:560/1%2AuBhGUEeJH9nUcMjLCup-6g.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--Y27unoY6--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://miro.medium.com/v2/resize:fit:560/1%2AuBhGUEeJH9nUcMjLCup-6g.png" alt="" width="560" height="420"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;AWS Architecture Diagram (by Author)&lt;/p&gt;

&lt;h1&gt;
  
  
  A Private API for Submitting Training Jobs
&lt;/h1&gt;

&lt;p&gt;We start by using  &lt;a href="https://aws.amazon.com/api-gateway/"&gt;Amazon API Gateway&lt;/a&gt;  to define a  &lt;a href="https://docs.aws.amazon.com/apigateway/latest/developerguide/apigateway-private-apis.html"&gt;private REST API&lt;/a&gt;  for submitting training job requests. We name the API  &lt;em&gt;training-job-queue&lt;/em&gt;. Later, we will add a POST method called  &lt;em&gt;add-job&lt;/em&gt;  and modify our training-job creation code to use this method instead of the SageMaker client  &lt;a href="https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/sagemaker/client/create_training_job.html"&gt;create_training_job&lt;/a&gt;  API. The code block below contains the definition of the private  &lt;a href="https://docs.aws.amazon.com/serverless-application-model/latest/developerguide/sam-resource-api.html"&gt;API resource&lt;/a&gt;  in SAM. In practice you will likely want to specify access limitations to the API and/or a method of authorization.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;AWSTemplateFormatVersion: '2010-09-09'  
Transform: AWS::Serverless-2016-10-31  

Resources:  
  InternalAPI:  
    Type: AWS::Serverless::Api  
    Properties:  
      # Auth: # Add access control to API  
      EndpointConfiguration:  
        Type: PRIVATE  
        # VPCEndpointIds: # Specify VPC Endpoint(s)  
      Name: training-job-queue  
      StageName: prod
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Define an AWS DynamoDB Table for Storing Training Job Requests
&lt;/h2&gt;

&lt;p&gt;We will use an  &lt;a href="https://aws.amazon.com/pm/dynamodb/"&gt;Amazon DynamoDB&lt;/a&gt;  table named  &lt;em&gt;sagemaker-queue&lt;/em&gt;  to store the submitted training workloads. Each entry will have the following fields:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt; jobName: Stores the unique name of the training job.&lt;/li&gt;
&lt;li&gt; entryTime: Stores the date and time that the job was added.&lt;/li&gt;
&lt;li&gt; jobState: Stores the current state of the training job. The valid values are ‘pending’, ‘running’, and ‘preempted’.&lt;/li&gt;
&lt;li&gt; priority: Stores an integer value representing the relative priority of the job.&lt;/li&gt;
&lt;li&gt; jobDetails: Stores the details of the job request.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;We define our DynamoDB table in our SAM template YAML file using the  &lt;a href="https://docs.aws.amazon.com/serverless-application-model/latest/developerguide/sam-resource-simpletable.html"&gt;AWS::Serverless::SimpleTable&lt;/a&gt;  resource.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt; DynamoSMQueue:  
    Type: AWS::Serverless::SimpleTable  
    Properties:  
      PrimaryKey:  
        Name: jobName  
        Type: String  
      TableName: sagemaker-queue
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;We define a function that creates a table entry from a given training job request. We assume that request contains the same contents as the input to the  &lt;a href="https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/sagemaker/client/create_training_job.html"&gt;create_training_job&lt;/a&gt;  API in JSON format. We further assume that the  &lt;em&gt;priority&lt;/em&gt;  of the workload is entered as a key-value  &lt;a href="https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_Tag.html"&gt;tag&lt;/a&gt;  in the training job definition.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import json, boto3, datetime  

dynamodb = boto3.resource('dynamodb')  
table = dynamodb.Table('sagemaker-queue')  

def add_job_entry(job_json):  
    job_details = json.loads(job_json)  

    # extract job_name  
    job_name = job_details['TrainingJobName']  
    print(f'add entry {job_name}')  

    # get current time  
    entry_time = datetime.now().strftime("%Y-%m-%dT%H:%M:%S")  

    # default priority is 0  
    priority = 0  

    # update priority based on tags  
    tags = job_details['Tags']  
    for tag in tags:  
        if tag['Key'] == 'priority':  
            priority = int(tag['Value'])  
            break  

    # create entry  
    entry = {  
       'jobName': job_name,  
       'entryTime': entry_time,  
       'jobState': 'pending',  
       'priority': priority,  
       'jobDetails': job_json  
    }  
    table.put_item(Item=entry) #TODO handle errors  
    print(f'Added job {job_name} to queue')
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The REST API  &lt;em&gt;add-job&lt;/em&gt; method that we will soon define will be programmed to call the  &lt;em&gt;add_job_entry&lt;/em&gt; function.&lt;/p&gt;

&lt;p&gt;We define a second function that extracts the pending jobs from the database and returns them in order of priority. In the case that multiple jobs have the same priority, they are ordered according to the amount of time they have been waiting in the queue.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;from boto3.dynamodb.conditions import Attr  

# Get a list of all pending jobs sorted by priority  
def get_pending_jobs():  
    response = table.scan(  
        ProjectionExpression='jobName, priority, entryTime',  
        FilterExpression=Attr('jobState').ne('running')  
    )  
    jobs = response.get('Items', [])  

    # sort jobs, first by priority (descending) and then by entryTime  
    sorted_jobs = sorted(jobs,  
                         key=lambda x: (-x['priority'], x['entryTime']))  

    return sorted_jobs
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The following utility functions will come in handy in the next sections.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Get a jobName -&amp;gt; priority mapping of all running jobs  
def get_running_jobs_dict():  
    # Get all running jobs  
    response = table.scan(  
        ProjectionExpression="jobName, priority",  
        FilterExpression=Attr('jobState').eq('running')  
    )  
    jobs = response.get('Items', [])  

    running_jobs = {job['jobName']: job['priority'] for job in jobs}  

    return running_jobs  

# Print the queue state  
def print_queue_state():  
    response = table.scan(  
        ProjectionExpression='jobName, jobState, priority'  
    )  
    jobs = response.get('Items', [])  

    print_table = []  
    for job in jobs:  
        print_table.append([job['jobName'], job['jobState'], job['priority']])  

    # sort by priority  
    sorted_table = sorted(print_table,  
                         key=lambda x: -x[2])  
    # Print the table  
    from tabulate import tabulate  
    print(tabulate(sorted_table, headers=['Job Name', 'State', 'Priority']))  

# get job details  
def get_job_details(job_name):  
    response = table.get_item(  
        Key={'jobName': job_name},  
        ProjectionExpression='jobDetails'  
    )  
    return json.loads(response.get('Item').get('jobDetails'))  

# get job state or None if the job does not exist  
def get_job_state(job_name):  
    response = table.get_item(  
        Key={'jobName': job_name},  
        ProjectionExpression='jobState'  
    )  
    job = response.get('Item')  
    return job.get('jobState') if job else None  

# update the job state  
def update_job_state(job_name, new_state):  
    table.update_item(  
        Key={'jobName': job_name},  
        UpdateExpression="SET jobState = :new_state",  
        ExpressionAttributeValues={":new_state": new_state}  
    )  
    print(f'Update job {job_name} to {new_state}')  

# remove a job entry  
def remove_job(job_name):  
    table.delete_item(  
        Key={'jobName': job_name}  
    )  
    print(f'Removed job {job_name} from queue')
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Both our choice of DynamoDB and its usage (e.g., our use of the  &lt;a href="https://docs.aws.amazon.com/amazondynamodb/latest/APIReference/API_Scan.html"&gt;Scan&lt;/a&gt;  API rather than the  &lt;a href="https://docs.aws.amazon.com/amazondynamodb/latest/APIReference/API_Query.html"&gt;Query&lt;/a&gt;  API) assume that the overall number of jobs in our queue will be in the dozens, at most. For a larger scale solution, you may be better off with a heavier duty database (e.g., one that performs the sorting operation for you) or a more sophisticated use of DynamoDB (e.g., see  &lt;a href="https://aws.amazon.com/blogs/database/implementing-priority-queueing-with-amazon-dynamodb/"&gt;here&lt;/a&gt;).&lt;/p&gt;
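&lt;p&gt;The ordering rule used by  &lt;em&gt;get_pending_jobs&lt;/em&gt;  (priority descending, then entry time ascending) can also be maintained incrementally with a heap. The sketch below is a purely in-memory illustration of that rule; the  &lt;em&gt;JobQueue&lt;/em&gt;  class and its methods are our own invention, not a replacement for the DynamoDB-backed queue:&lt;/p&gt;

```python
import heapq
from datetime import datetime

# In-memory illustration of the queue's ordering rule. heapq is a
# min-heap, so we negate the priority to pop the highest-priority
# job first; the entry-time string breaks ties (earlier wins).
class JobQueue:
    def __init__(self):
        self._heap = []

    def add_job(self, job_name, priority, entry_time=None):
        if entry_time is None:
            entry_time = datetime.now().strftime("%Y-%m-%dT%H:%M:%S")
        heapq.heappush(self._heap, (-priority, entry_time, job_name))

    def pop_next(self):
        # return the pending job that should run next
        _, _, job_name = heapq.heappop(self._heap)
        return job_name

queue = JobQueue()
queue.add_job('job1', priority=1, entry_time='2024-03-01T10:00:00')
queue.add_job('job2', priority=2, entry_time='2024-03-01T10:01:00')
queue.add_job('job3', priority=1, entry_time='2024-03-01T10:02:00')
print(queue.pop_next())  # job2 (highest priority)
print(queue.pop_next())  # job1 (same priority as job3, earlier entry)
```

&lt;p&gt;Because the entry time is an ISO-formatted string, lexicographic comparison matches chronological order, so ties resolve in favor of the job that has waited longest.&lt;/p&gt;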

&lt;h2&gt;
  
  
  Define the Training Job Queue Manager
&lt;/h2&gt;

&lt;p&gt;The main component of our solution is the training job scheduler. Here we implement a rather simple manager that performs the following steps:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt; Extract the list of queued jobs, ordered by priority. If none exist, return.&lt;/li&gt;
&lt;li&gt; Discover unused instance capacity. For each free instance, start one pending job on SageMaker. If no jobs remain after that, return.&lt;/li&gt;
&lt;li&gt; Calculate the number of SageMaker jobs in the  &lt;em&gt;Stopping&lt;/em&gt; state. If this is greater than or equal to the number of remaining pending jobs, return.&lt;/li&gt;
&lt;li&gt; Assess the need for preemption of running SageMaker jobs by comparing their  &lt;em&gt;priorities&lt;/em&gt; to those of our pending jobs.
&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# set the limit on total number of instances/jobs  
MAX_CAPACITY = 2  

sagemaker = boto3.client('sagemaker')  

# apply a queue stamp to identify that the job came from the queue  
def apply_qstamp(job_name):  
    return f'{job_name}-qstamp-{datetime.now().strftime("%d%H%M")}'  

# strip the queue stamp  
def strip_qstamp(job_name):  
    return job_name.split('-qstamp-')[0]  

# start a SageMaker job and update job entry in queue  
def start_job(job_name):  
    print(f'start job {job_name}')  
    job_details = get_job_details(job_name)  
    if job_details:  
        job_details['TrainingJobName'] = apply_qstamp(job_name)  
        # start job with details from queue  
        # (you may optionally overwrite fields such as the iam role)  
        response = sagemaker.create_training_job(**job_details)  
        if response['ResponseMetadata']['HTTPStatusCode'] == 200:  
            print(f'started job {job_name}')  
            update_job_state(job_name, 'running')  

# preempt a SageMaker job and update job entry in queue  
def preempt_job(job_name):  
    print(f'preempt job {job_name}')  
    response = sagemaker.stop_training_job(TrainingJobName=job_name)  
    if response['ResponseMetadata']['HTTPStatusCode'] == 200:  
        print(f'preempted job {job_name}')  
        update_job_state(strip_qstamp(job_name), 'preempted')  

# get SageMaker jobs with the given status  
def get_sagemaker_jobs(status):  
    response = sagemaker.list_training_jobs(StatusEquals=status)  
    return response.get('TrainingJobSummaries', [])  

# queue manager  
def manage_queue():  
    # extract pending jobs to run  
    pending = get_pending_jobs()  

    if not pending:  
        return  

    if len(pending) &amp;gt; MAX_CAPACITY:  
        pending = pending[:MAX_CAPACITY]  

    # get running sagemaker jobs  
    running = get_sagemaker_jobs('InProgress')  
    total_running = len(running)  

    # get stopping sagemaker jobs  
    stopping = get_sagemaker_jobs('Stopping')  
    total_stopping = len(stopping)  

    # calculate the number of free instances   
    free_slots = MAX_CAPACITY - total_running - total_stopping  

    jobs_to_start = min(len(pending), free_slots)  

    # for each free instance, start a job  
    for i in range(jobs_to_start):  
        start_job(pending[i].get('jobName'))  

    still_pending = pending[jobs_to_start:]  

    if not still_pending:  
        return  

    # assume that 'total_stopping' number of jobs will start soon  
    test_for_preemption = len(still_pending) - total_stopping  
    if test_for_preemption &amp;lt;= 0:  
        return  

    # check if preemption is required  
    test_priority = still_pending[total_stopping:]  

    running_jobs = get_running_jobs_dict()  
    priority_dict = {}  
    for job in running:  
        job_name = job['TrainingJobName']  
        priority_dict[job_name] = running_jobs[strip_qstamp(job_name)]  

    # sort running jobs from lowest to highest priority  
    sorted_running = sorted(priority_dict.items(), key=lambda item: item[1])  

    index = 0  
    while index &amp;lt; test_for_preemption and \  
          test_priority[index].get('priority') &amp;gt; sorted_running[index][1]:  
        preempt_job(sorted_running[index][0])  
        index = index + 1
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Important notes:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt; Our implementation is highly optimistic in the sense that we assume that all the jobs that are inserted are valid and that we will be able to start them up on SageMaker without issue. In practice, appropriate error handling should be added (e.g., removing faulty jobs from the queue with appropriate logging).&lt;/li&gt;
&lt;li&gt; In a production environment, we would need to take into consideration the likely occurrence of a  &lt;a href="https://en.wikipedia.org/wiki/Race_condition"&gt;race condition&lt;/a&gt;  when our  &lt;em&gt;queue_manager&lt;/em&gt; is triggered by multiple concurrent events. There are several ways of addressing this problem (e.g., see  &lt;a href="https://medium.com/@moradiyabhavik/race-condition-understanding-and-solution-926b6d2808cf"&gt;here&lt;/a&gt;) including enforcing atomicity (e.g., by setting our  &lt;a href="https://docs.aws.amazon.com/lambda/latest/dg/configuration-concurrency.html"&gt;Lambda function concurrency&lt;/a&gt;  to one), using some form of locking mechanism (e.g., as done  &lt;a href="https://aws.amazon.com/blogs/database/implementing-priority-queueing-with-amazon-dynamodb/"&gt;here&lt;/a&gt;), or making our function  &lt;a href="https://en.wikipedia.org/wiki/Idempotence"&gt;idempotent&lt;/a&gt;. Here we have taken the approach of what we call “optimistic idempotence”, where we rely on appropriate use of the API and on the idempotency of our underlying calls to the SageMaker APIs.&lt;/li&gt;
&lt;li&gt; We emphasize that our implementation is naïve. In practice, we recommend a more sophisticated algorithm that 1) accounts for the use of different types of instances and jobs that require more than one instance, 2) takes all edge cases into consideration, and 3) is tailored towards the specific needs of your project.&lt;/li&gt;
&lt;/ol&gt;
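
&lt;p&gt;To make note 2 concrete, the sketch below mimics a conditional (compare-and-set) state transition in plain Python. In DynamoDB, a similar guard could be expressed with a  &lt;em&gt;ConditionExpression&lt;/em&gt;  on  &lt;em&gt;update_item&lt;/em&gt;; the in-memory  &lt;em&gt;JobTable&lt;/em&gt;  class here is purely illustrative:&lt;/p&gt;

```python
# Illustrative in-memory analog of a conditional (compare-and-set)
# state update: the transition succeeds only if the job is still in
# the state the caller expects it to be in.
class JobTable:
    def __init__(self):
        self._state = {}

    def put(self, job_name, state):
        self._state[job_name] = state

    def transition(self, job_name, expected, new_state):
        # atomic check-then-set: fail if another caller got here first
        if self._state.get(job_name) == expected:
            self._state[job_name] = new_state
            return True
        return False

table = JobTable()
table.put('job1', 'pending')

# two schedulers race to start the same job; only one "wins"
first = table.transition('job1', expected='pending', new_state='running')
second = table.transition('job1', expected='pending', new_state='running')
print(first, second)  # True False
```

&lt;p&gt;With a guard of this kind, two concurrent invocations of the queue manager that race to start the same pending job cannot both mark it as running.&lt;/p&gt;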

&lt;h2&gt;
  
  
  Define the AWS Lambda Function
&lt;/h2&gt;

&lt;p&gt;The next component of the solution is the Lambda function. The following code block includes the  &lt;a href="https://docs.aws.amazon.com/serverless-application-model/latest/developerguide/sam-resource-function.html"&gt;SAM&lt;/a&gt;  definition of our serverless function. We program the function to run on two different types of events: any call to  &lt;a href="https://docs.aws.amazon.com/serverless-application-model/latest/developerguide/sam-property-function-api.html"&gt;&lt;em&gt;add-job&lt;/em&gt;&lt;/a&gt;  on our private API gateway and a  &lt;a href="https://docs.aws.amazon.com/sagemaker/latest/dg/automating-sagemaker-with-eventbridge.html"&gt;change to the state of a SageMaker training job&lt;/a&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt; ManagedTrainingJobQueue:  
    Type: AWS::Serverless::Function  
    Properties:  
      CodeUri: job-queue/ # the directory containing our index.py file  
      Handler: index.lambda_handler  
      Runtime: python3.12  
      Architectures:  
        - arm64 # use graviton  
      Policies: # allow access to SageMaker and DynamoDB  
        - !Sub "arn:${AWS::Partition}:iam::aws:policy/AmazonSageMakerFullAccess"  
        - DynamoDBCrudPolicy:  
            TableName: !Ref DynamoSMQueue  
      Events:  
        CreateTraining:  
          Type: Api  
          Properties:  
            Path: /add-job  
            Method: post  
            RestApiId: !Ref InternalAPI  
        SageMakerEvent:  
          Type: EventBridgeRule  
          Properties:  
            Pattern:  
              source:  
                - aws.sagemaker  
              detail-type:  
                - SageMaker Training Job State Change  
              detail:  
                TrainingJobStatus:  
                  - "Completed"  
                  - "Failed"  
                  - "Stopped"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The  &lt;em&gt;lambda_handler&lt;/em&gt; function is implemented as follows:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;def lambda_handler(event, context):  
    # identify source of event and take appropriate action  
    if 'requestContext' in event and 'apiId' in event['requestContext']:  
        print('Lambda triggered by API Gateway')  
        # add_job_entry expects the raw JSON string from the request body  
        add_job_entry(event.get('body'))  
    elif 'source' in event and event['source'] == 'aws.sagemaker':  
        print('Lambda triggered by SageMaker job state change')  
        job_name = event['detail']['TrainingJobName']  
        job_status = event['detail']['TrainingJobStatus']  
        print(f'{job_name} status changed to {job_status}')  

        # strip qstamp from job_name  
        job_name = strip_qstamp(job_name)  

        if job_status in ['Completed' , 'Failed']:  
            remove_job(job_name)  
        elif job_status == 'Stopped':  
            # check if it was manually stopped or preempted by queue manager  
            if get_job_state(job_name) == 'preempted':  
                print(f'job {job_name} preemption completed')  
            else:  
                print(f'job {job_name} {job_status}, remove from queue')  
                remove_job(job_name)  

    # in all cases invoke queue manager  
    manage_queue()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Intercept the Create Training Job Request
&lt;/h2&gt;

&lt;p&gt;The final modification required to make our solution complete is to intercept the call to the SageMaker  &lt;a href="https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/sagemaker/client/create_training_job.html"&gt;create_training_job&lt;/a&gt;  API and reroute it to our  &lt;em&gt;add-job&lt;/em&gt; method. We do this by overriding the  &lt;a href="https://github.com/aws/sagemaker-python-sdk/blob/v2.208.0/src/sagemaker/session.py#L6212"&gt;_intercept_create_request&lt;/a&gt;  function of the  &lt;a href="https://sagemaker.readthedocs.io/en/stable/api/utility/session.html"&gt;SageMaker Session class&lt;/a&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;from sagemaker.pytorch import PyTorch  
from sagemaker.session import Session  
import requests, json, logging  
logger = logging.getLogger('sagemaker')  

def submit_to_training_queue(job):  
    logger.info(f"Adding training-job {job['TrainingJobName']} to queue")  
    logger.debug(f'train request: {json.dumps(job, indent=4)}')  

    vpce='&amp;lt;vpc endpoint&amp;gt;' # insert id of vpc endpoint  
    region='us-east-1' # specify region  
    url=f'https://{vpce}.execute-api.{region}.vpce.amazonaws.com/prod/add-job'  
    headers = {'x-apigw-api-id': '&amp;lt;api-id&amp;gt;'} # insert api gateway id  

    # submit job  
    response = requests.post(url, headers=headers, json=job)  

class QueueTrainingJobSession(Session):  
    def _intercept_create_request(self, request, create, func_name = None):  
        """This function intercepts the create job request  

        Args:  
          request (dict): the create job request  
          create (functor): a functor calls the sagemaker client create method  
          func_name (str): the name of the function needed intercepting  
        """  
        if func_name == 'train':  
            submit_to_training_queue(request)  
        else:  
            super()._intercept_create_request(request,create,func_name)  

# define job  
estimator = PyTorch(  
    role='&amp;lt;sagemaker role&amp;gt;',  
    entry_point='train.py',  
    instance_type='ml.p5.48xlarge',  
    instance_count=1,  
    framework_version='2.0.1',  
    py_version='py310',  
    tags=[{'Key': 'priority', 'Value': '100'}],  
    keep_alive_period_in_seconds=60, # keep warm for 1 minute  
    # use our custom Session class  
    sagemaker_session=QueueTrainingJobSession()  
)  

estimator.fit(wait=False)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h1&gt;
  
  
  Use Case Example
&lt;/h1&gt;

&lt;p&gt;To test our solution we submit the following sequence of jobs. After each call we print the status of the queue (using the  &lt;em&gt;print_queue_state&lt;/em&gt;  function) and sleep for twenty seconds.&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt; Start job1 with priority 1.&lt;/li&gt;
&lt;li&gt; Start job2 with priority 2.&lt;/li&gt;
&lt;li&gt; Start job3 with priority 1.&lt;/li&gt;
&lt;li&gt; Start job4 with priority 3.&lt;/li&gt;
&lt;/ol&gt;
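
&lt;p&gt;The queue transitions that this sequence produces can be reproduced with a small stand-alone simulation of the manager's decision rule: a capacity of two, with the lowest-priority running job preempted whenever a strictly higher-priority job is waiting. This is a toy model of the logic above, not the actual Lambda code:&lt;/p&gt;

```python
from itertools import count

MAX_CAPACITY = 2
_entry = count()          # monotonically increasing submission stamp

running = {}              # job_name -> (priority, entry stamp)
pending = []              # list of (job_name, priority, entry stamp)

def next_pending():
    # highest priority first, then earliest submission
    return sorted(pending, key=lambda j: (-j[1], j[2]))

def submit(job_name, priority):
    stamp = next(_entry)
    if len(running) < MAX_CAPACITY:
        running[job_name] = (priority, stamp)
        return
    # capacity full: preempt the lowest-priority running job only if
    # it has strictly lower priority than the new job
    victim = min(running, key=lambda j: running[j][0])
    if running[victim][0] < priority:
        vprio, vstamp = running.pop(victim)
        pending.append((victim, vprio, vstamp))
        running[job_name] = (priority, stamp)
    else:
        pending.append((job_name, priority, stamp))

def finish(job_name):
    running.pop(job_name)
    if pending:
        # resume/start the highest-priority, longest-waiting job
        name, prio, stamp = next_pending()[0]
        pending.remove((name, prio, stamp))
        running[name] = (prio, stamp)

submit('job1', 1)
submit('job2', 2)
submit('job3', 1)         # waits: no free slot, nothing lower-priority
submit('job4', 3)         # preempts job1
finish('job2')            # job1 resumes before job3 (earlier entry)
print(sorted(running))    # ['job1', 'job4']
```

&lt;p&gt;Note that a preempted job keeps its original entry stamp, which is why  &lt;em&gt;job1&lt;/em&gt;  resumes ahead of  &lt;em&gt;job3&lt;/em&gt;  even though both have priority 1.&lt;/p&gt;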

&lt;p&gt;The first two jobs are immediately submitted to SageMaker and updated to the  &lt;em&gt;running&lt;/em&gt;  state. Since the third job has low priority and we have precisely two training instances, it remains in the  &lt;em&gt;pending&lt;/em&gt;  state and waits its turn. After submitting the first three jobs, the queue state appears as:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Job Name    State      Priority  
----------  -------  ----------  
job2        running           2  
job1        running           1  
job3        pending           1
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The fourth job we submit has a higher priority than all of the jobs in the queue. Consequently, the running job with the lowest priority,  &lt;em&gt;job1&lt;/em&gt;, is preempted. The corresponding SageMaker job is stopped and once the instance is released, the queue state becomes:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Job Name    State        Priority  
----------  ---------  ----------  
job4        running             3  
job2        running             2  
job1        preempted           1  
job3        pending             1
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The SageMaker job running  &lt;em&gt;job2&lt;/em&gt;  is the first to finish,  &lt;em&gt;job2&lt;/em&gt; is removed from the queue, and our preempted job is resumed:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Job Name    State      Priority  
----------  -------  ----------  
job4        running           3  
job1        running           1  
job3        pending           1
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Once  &lt;em&gt;job4&lt;/em&gt;  is completed, it too is removed from the queue, making room for  &lt;em&gt;job3&lt;/em&gt;. The remaining jobs are also run to completion, ultimately leaving our queue empty.&lt;/p&gt;

&lt;h1&gt;
  
  
  Summary
&lt;/h1&gt;

&lt;p&gt;The increasing difficulty of acquiring AI compute capacity has forced AI development teams to reevaluate the processes they use for training AI models. The approach we have demonstrated in this post is to augment the traditional APIs for training models with a custom-made priority queue and an associated job scheduler. Importantly, the proposal we have put forth should be viewed as a general scheme, not as a production-worthy solution. Appropriate modifications and enhancements would be required to address the specific needs of your project.&lt;/p&gt;

</description>
      <category>sagemaker</category>
      <category>aws</category>
      <category>deeplearning</category>
      <category>mlop</category>
    </item>
    <item>
      <title>Retaining Amazon SageMaker Instance Capacity with SageMaker Managed Warm Pools</title>
      <dc:creator>Chaim Rand</dc:creator>
      <pubDate>Mon, 04 Mar 2024 09:39:21 +0000</pubDate>
      <link>https://dev.to/aws-builders/retaining-amazon-sagemaker-instance-capacity-with-sagemaker-managed-warm-pools-4na4</link>
      <guid>https://dev.to/aws-builders/retaining-amazon-sagemaker-instance-capacity-with-sagemaker-managed-warm-pools-4na4</guid>
      <description>&lt;h2&gt;
  
  
  An Alternative Solution to Cloud Instance Reservation
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--yfj-uVup--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://miro.medium.com/v2/resize:fit:700/0%2A9Pk4gA4I9wvgkCRq" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--yfj-uVup--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://miro.medium.com/v2/resize:fit:700/0%2A9Pk4gA4I9wvgkCRq" alt="" width="700" height="467"&gt;&lt;/a&gt;&lt;br&gt;
Photo by  &lt;a href="https://unsplash.com/@flowinteractive?utm_source=medium&amp;amp;utm_medium=referral"&gt;Ivan Slade&lt;/a&gt;  on  &lt;a href="https://unsplash.com/?utm_source=medium&amp;amp;utm_medium=referral"&gt;Unsplash&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;In previous posts (e.g.,  &lt;a href="https://towardsdatascience.com/6-steps-to-migrating-your-machine-learning-project-to-the-cloud-6d9b6e4f18e0"&gt;here&lt;/a&gt;  and  &lt;a href="https://towardsdatascience.com/a-simple-solution-for-managing-cloud-based-ml-training-c80a69c6939a"&gt;here&lt;/a&gt;), we covered some of the pros and cons of training ML workloads using  &lt;a href="https://aws.amazon.com/sagemaker/"&gt;Amazon SageMaker&lt;/a&gt;. In this post we address one of its more inconvenient limitations: its lack of support (as of the time of this writing) for training on  &lt;a href="https://aws.amazon.com/ec2/pricing/reserved-instances/"&gt;reserved Amazon EC2 instances&lt;/a&gt;. This limitation has become more and more restrictive of late due to the increasing difficulty of acquiring the required instance types in a reliable and timely fashion. Recent advances in the field of generative AI have led to  &lt;a href="https://www.google.com/search?q=nvidia+cant+keep+up+with+the+demand&amp;amp;rlz=1C1GCEA_enIL1069IL1074&amp;amp;oq=nvidia+cant+keep+up+with+the+demand&amp;amp;gs_lcrp=EgZjaHJvbWUyCwgAEEUYChg5GKABMgkIARAhGAoYoAEyCQgCECEYChigATIJCAMQIRgKGKAB0gEINzI4OGowajeoAgCwAgA&amp;amp;sourceid=chrome&amp;amp;ie=UTF-8"&gt;unprecedented demand&lt;/a&gt;  for AI compute while challenges in the  &lt;a href="https://en.wikipedia.org/wiki/2021%E2%80%932023_global_supply_chain_crisis"&gt;global supply chain&lt;/a&gt;  continue to linger. In this post we propose a partial mitigation to this limitation using  &lt;a href="https://docs.aws.amazon.com/sagemaker/latest/dg/train-warm-pools.html"&gt;SageMaker managed warm pools&lt;/a&gt;, which allow you, under certain circumstances, to retain access to provisioned instance capacity for successive training workloads. 
Not only will this hold on to acquired capacity for as long as you need (up to four weeks), but it can also reduce the latency between experiments.&lt;/p&gt;
&lt;h1&gt;
  
  
  Training With Managed Warm Pools - Example
&lt;/h1&gt;

&lt;p&gt;In the example below, we start up a PyTorch training job on an instance of type  &lt;a href="https://aws.amazon.com/ec2/instance-types/p5/"&gt;p5.48xlarge&lt;/a&gt;  using the  &lt;a href="https://sagemaker.readthedocs.io/en/stable/"&gt;Amazon SageMaker Python SDK&lt;/a&gt;  (version 2.208). We use the  &lt;em&gt;keep_alive_period_in_seconds&lt;/em&gt;  setting to configure the instance to remain  &lt;em&gt;warm&lt;/em&gt;  for one minute.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;from sagemaker.pytorch import PyTorch  

# define job  
estimator = PyTorch(  
    role='&amp;lt;sagemaker role&amp;gt;',  
    entry_point='train.py',  
    instance_type='ml.p5.48xlarge',  
    instance_count=1,  
    framework_version='2.0.1',  
    py_version='py310',  
    keep_alive_period_in_seconds=60 # keep warm for 1 minute  
)  

# start job  
estimator.fit()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;As long as we start up another job with  &lt;a href="https://docs.aws.amazon.com/sagemaker/latest/dg/train-warm-pools.html#train-warm-pools-matching-criteria"&gt;matching settings&lt;/a&gt;  within the sixty seconds allotted, the same instance will be retained for the next job. Thus, by configuring the use of SageMaker warm pools we have guaranteed instance capacity for our next workload. As an added bonus, the  &lt;em&gt;start-up time&lt;/em&gt; of the second workload will be noticeably reduced since the instance has already been provisioned.&lt;/p&gt;
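
&lt;p&gt;Because warm pool reuse hinges on the successive jobs having matching settings, one way to reduce the chance of an accidental mismatch is to factor the shared settings into a single helper, as in the sketch below. Note that the helper name and the commented-out role value are our own placeholders, not part of the SageMaker API.&lt;/p&gt;

```python
# Sketch: keep the settings that must match between successive jobs
# in one place, so that each new job qualifies for the warm pool
# instance left behind by the previous one.
def warm_pool_job_settings(keep_alive_seconds=60):
    """Estimator kwargs shared by all jobs targeting the same warm pool."""
    return dict(
        entry_point='train.py',
        instance_type='ml.p5.48xlarge',
        instance_count=1,
        framework_version='2.0.1',
        py_version='py310',
        keep_alive_period_in_seconds=keep_alive_seconds,
    )

# Each successive job is then built from the same settings, e.g.:
#   estimator = PyTorch(role='my-sagemaker-role',  # placeholder role
#                       **warm_pool_job_settings())
#   estimator.fit()
```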

&lt;h2&gt;
  
  
  Limitations of Managed Warm Pools
&lt;/h2&gt;

&lt;p&gt;Although this technique offers an instance capacity guarantee similar to the one provided by  &lt;a href="https://aws.amazon.com/ec2/pricing/reserved-instances/"&gt;Amazon EC2 reservations&lt;/a&gt;  (and without the long-term commitment!!), it is important to note its significant limitations.&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt; The method relies on our ability to secure instance capacity for the first training job. Generally speaking, this is a safe assumption — sooner or later, we will succeed in securing an instance, but it is hard to know how much time and patience will be required.&lt;/li&gt;
&lt;li&gt; The method assumes that our workloads have matching settings, particularly with regards to the number and types of instances. Although AI development teams will frequently have multiple workloads with similar instance requirements, the inability to share resources between jobs with different settings (e.g., with different  &lt;a href="https://docs.aws.amazon.com/sagemaker/latest/dg/sagemaker-roles.html"&gt;SageMaker Roles&lt;/a&gt;) is limiting.&lt;/li&gt;
&lt;li&gt; The method works only if the subsequent workload is started within the specified warm pool duration, with the maximum duration being one hour. Unless we want to constantly monitor our training jobs to detect when they stop, we will need to implement an automated system for submitting new jobs when our provisioned instances become available.&lt;/li&gt;
&lt;li&gt; In cases where a matching training job is not found during the warm pool durations, we still need to pay for the provisioned instance. Thus, there is a certain risk of waste associated with this method and the way it is used (e.g., the most appropriate warm pool duration setting) should be planned accordingly.&lt;/li&gt;
&lt;li&gt; The maximum period of time during which an instance can be retained in this manner is  &lt;a href="https://docs.aws.amazon.com/sagemaker/latest/dg/train-warm-pools.html#train-warm-pools-considerations"&gt;twenty-eight days&lt;/a&gt;.&lt;/li&gt;
&lt;/ol&gt;
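
&lt;p&gt;As a starting point for the automation mentioned in item 3, the warm pool state of a completed job can be polled via the SageMaker API, which reports a &lt;em&gt;WarmPoolStatus&lt;/em&gt; for each training job. The sketch below is illustrative only; the optional client argument is our own addition to make the helper easy to stub in tests.&lt;/p&gt;

```python
def warm_pool_available(job_name, sm_client=None):
    """Return True if the given training job left behind a warm pool
    instance that is ready to be claimed by a matching job."""
    if sm_client is None:
        import boto3  # imported lazily so the helper is easy to stub
        sm_client = boto3.client('sagemaker')
    desc = sm_client.describe_training_job(TrainingJobName=job_name)
    status = desc.get('WarmPoolStatus', {}).get('Status')
    return status == 'Available'

# A simple scheduler could poll this helper for recently completed jobs
# and submit the next queued job (with matching settings) as soon as it
# returns True.
```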

&lt;p&gt;Please see the  &lt;a href="https://docs.aws.amazon.com/sagemaker/latest/dg/train-warm-pools.html#train-warm-pools-considerations"&gt;official documentation&lt;/a&gt;  for more details on how warm pooling works as well as additional  &lt;a href="https://docs.aws.amazon.com/sagemaker/latest/dg/train-warm-pools.html#train-warm-pools-considerations"&gt;considerations&lt;/a&gt;  associated with its use.&lt;/p&gt;

&lt;h1&gt;
  
  
  Reducing Cost with SageMaker Savings Plans
&lt;/h1&gt;

&lt;p&gt;The method we have described for retaining control of instances is relevant for AI teams that have a consistent requirement for AI compute. This is manifested as a continuous backlog of training experiments waiting to be processed. In situations where this requirement is expected to last for an extended period of time,  &lt;a href="https://aws.amazon.com/savingsplans/ml-pricing/"&gt;Amazon SageMaker Savings Plans&lt;/a&gt;  may provide a great opportunity for training-cost savings. SageMaker Savings Plans offer significant discounts in exchange for a commitment to pay for consistent usage. The instance types offered under this plan can vary; please refer to the documentation for the most up-to-date details. Importantly, despite some similarities to  &lt;a href="https://aws.amazon.com/ec2/pricing/reserved-instances/"&gt;Amazon EC2 reservations&lt;/a&gt;, SageMaker Savings Plans do  &lt;em&gt;not&lt;/em&gt;  guarantee instance capacity. However, the method described in this post for retaining control of provisioned instance capacity can help you take full advantage of the instances you have committed to using.&lt;/p&gt;

&lt;p&gt;SageMaker Savings Plans is not for everyone. Make sure to fully understand all the terms of the offering before deciding whether it is the right solution for your team.&lt;/p&gt;

&lt;h1&gt;
  
  
  Summary
&lt;/h1&gt;

&lt;p&gt;A common approach for dealing with the difficulty of acquiring AI/ML compute resources in the cloud is to guarantee capacity by purchasing instance reservations. Unfortunately, as of the time of this writing, Amazon SageMaker does not support instance reservation. In this post, we have demonstrated how SageMaker warm pools can be used to maintain control over instance capacity for successive training workloads. We noted that for this type of solution to be effective, we require some form of mechanism for automating the detection of available warm pool instances and triggering a job with matching settings. In a future post we will propose a solution that addresses this challenge.&lt;/p&gt;

</description>
      <category>sagemaker</category>
      <category>cloudml</category>
      <category>aws</category>
    </item>
    <item>
      <title>Optimizing Instance Type Selection for AI Development in Cloud Spot Markets</title>
      <dc:creator>Chaim Rand</dc:creator>
      <pubDate>Wed, 24 Jan 2024 10:13:37 +0000</pubDate>
      <link>https://dev.to/aws-builders/optimizing-instance-type-selection-for-ai-development-in-cloud-spot-markets-29i1</link>
      <guid>https://dev.to/aws-builders/optimizing-instance-type-selection-for-ai-development-in-cloud-spot-markets-29i1</guid>
      <description>&lt;h2&gt;
  
  
  Instance Selection for Deep Learning — Part 2
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--ogV2x4v4--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://miro.medium.com/v2/resize:fit:700/0%2AVlgMC_E3ruOci979" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--ogV2x4v4--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://miro.medium.com/v2/resize:fit:700/0%2AVlgMC_E3ruOci979" alt="" width="700" height="467"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Photo by  &lt;a href="https://unsplash.com/@mikeenerio?utm_source=medium&amp;amp;utm_medium=referral"&gt;Mike Enerio&lt;/a&gt;  on  &lt;a href="https://unsplash.com/?utm_source=medium&amp;amp;utm_medium=referral"&gt;Unsplash&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This post was written in collaboration with  &lt;a href="https://www.linkedin.com/in/tomerberkovich/"&gt;Tomer Berkovich&lt;/a&gt;,  &lt;a href="https://www.linkedin.com/in/yitzhak-levi-49a217201/"&gt;Yitzhak Levi&lt;/a&gt;, and  &lt;a href="https://www.linkedin.com/in/maxrabin/"&gt;Max Rabin&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Appropriate instance selection for machine learning (ML) workloads is an important decision with potentially significant implications on the speed and cost of development. In a  &lt;a href="https://towardsdatascience.com/instance-selection-for-deep-learning-7463d774cff0"&gt;previous post&lt;/a&gt;  we expanded on this process, proposed a metric for making this important decision, and highlighted some of the many factors you should take into consideration. In this post we will demonstrate the opportunity for reducing AI model training costs by taking  &lt;a href="https://aws.amazon.com/ec2/spot/?cards.sort-by=item.additionalFields.startDateTime&amp;amp;cards.sort-order=asc"&gt;Spot Instance&lt;/a&gt;  availability into account when making your cloud-based instance selection decision.&lt;/p&gt;

&lt;h1&gt;
  
  
  Reducing Costs Using Spot Instances
&lt;/h1&gt;

&lt;p&gt;One of the most significant opportunities for cost savings in the cloud is to take advantage of low-cost  &lt;a href="https://aws.amazon.com/ec2/spot/?cards.sort-by=item.additionalFields.startDateTime&amp;amp;cards.sort-order=asc"&gt;Amazon EC2 Spot Instances&lt;/a&gt;. Spot instances are discounted compute instances drawn from surplus cloud service capacity. In exchange for the discounted price, AWS maintains the right to preempt the instance with little to no warning. Consequently, Spot instance utilization is relevant only to workloads that are fault tolerant. Fortunately, through effective use of  &lt;a href="https://pytorch.org/tutorials/recipes/recipes/saving_and_loading_a_general_checkpoint.html"&gt;model checkpointing&lt;/a&gt;, ML training workloads can be designed to be fault tolerant and to take advantage of the Spot instance offering. In fact, Amazon SageMaker, AWS’s managed service for developing ML, makes it easy to train on Spot instances by  &lt;a href="https://docs.aws.amazon.com/sagemaker/latest/dg/model-managed-spot-training.html"&gt;managing the end-to-end Spot life-cycle&lt;/a&gt;  for you.&lt;/p&gt;
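
&lt;p&gt;Concretely, managed spot training is enabled on a SageMaker estimator with a handful of settings. In the sketch below, the helper function is simply our way of grouping the relevant kwargs, and the S3 checkpoint URI is a placeholder.&lt;/p&gt;

```python
# Sketch: estimator kwargs for fault-tolerant managed spot training.
def spot_training_settings(checkpoint_s3_uri, max_run=24 * 60 * 60):
    """Group the estimator kwargs that enable managed spot training."""
    return dict(
        use_spot_instances=True,
        max_run=max_run,        # cap on actual training time (seconds)
        max_wait=2 * max_run,   # total time allotted, including time spent
                                # waiting for Spot capacity; must be at
                                # least max_run
        checkpoint_s3_uri=checkpoint_s3_uri,  # checkpoints survive preemption
        checkpoint_local_path='/opt/ml/checkpoints',  # synced to S3 by SageMaker
    )

# e.g. (placeholder bucket):
#   estimator = PyTorch(..., **spot_training_settings('s3://my-bucket/ckpts'))
```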

&lt;h1&gt;
  
  
  The Challenge of Anticipating Spot Instance Capacity
&lt;/h1&gt;

&lt;p&gt;Unfortunately,  &lt;em&gt;Spot instance capacity&lt;/em&gt;, which measures the availability of Spot instances for use, is subject to constant fluctuations and can be very difficult to predict. Amazon offers partial assistance in assessing the  &lt;em&gt;Spot instance capacity&lt;/em&gt; of an instance type of choice via its  &lt;a href="https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/spot-placement-score.html"&gt;Spot placement score&lt;/a&gt;  (SPS) feature which indicates the likelihood that a Spot request will succeed in a given  &lt;a href="https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/using-regions-availability-zones.html"&gt;region or availability zone (AZ)&lt;/a&gt;. This is especially helpful when you have the freedom to choose to train your model in one of several different locations. However, the SPS feature offers no guarantees.&lt;/p&gt;
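
&lt;p&gt;Spot placement scores can be retrieved programmatically via the EC2 &lt;em&gt;get_spot_placement_scores&lt;/em&gt; API. The sketch below picks the highest-scoring location; the optional client argument is our own convenience for stubbing the API in tests.&lt;/p&gt;

```python
def best_spot_location(instance_types, target_capacity, ec2_client=None):
    """Return the (Region, AvailabilityZoneId, Score) with the highest
    Spot placement score. Scores range from 1 (low) to 10 (high)."""
    if ec2_client is None:
        import boto3  # imported lazily so the helper is easy to stub
        ec2_client = boto3.client('ec2')
    resp = ec2_client.get_spot_placement_scores(
        InstanceTypes=instance_types,
        TargetCapacity=target_capacity,
        SingleAvailabilityZone=True,  # score individual AZs, not whole regions
    )
    best = max(resp['SpotPlacementScores'], key=lambda s: s['Score'])
    return best['Region'], best.get('AvailabilityZoneId'), best['Score']
```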

&lt;p&gt;When you choose to train a model on one or more Spot instances, you are taking the risk that your instance type of choice does not have any Spot capacity (i.e., your training job will not start), or worse, that you will enter an iterative cycle in which your training repeatedly runs for just a small number of training steps and is stopped before you have made any meaningful progress — which can tally up your training costs without any return.&lt;/p&gt;

&lt;p&gt;Over the past couple of years, the challenges of Spot instance utilization have been particularly acute when it comes to multi-GPU  &lt;a href="https://aws.amazon.com/ec2/"&gt;EC2&lt;/a&gt;  instance types such as  &lt;a href="https://aws.amazon.com/ec2/instance-types/g5/"&gt;g5.12xlarge&lt;/a&gt;  and  &lt;a href="https://aws.amazon.com/ec2/instance-types/p4/"&gt;p4d.24xlarge&lt;/a&gt;. A huge increase in demand for powerful training accelerators (driven in part by advances in the field of generative AI), combined with disruptions in the global supply chain, has made it virtually impossible to reliably depend on multi-GPU Spot instances for ML training. The natural fallback is to use the more costly  &lt;a href="https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/ec2-on-demand-instances.html"&gt;On-Demand&lt;/a&gt;  (OD) or  &lt;a href="https://aws.amazon.com/ec2/pricing/reserved-instances/"&gt;reserved instances&lt;/a&gt;. However, in our  &lt;a href="https://towardsdatascience.com/instance-selection-for-deep-learning-7463d774cff0"&gt;previous post&lt;/a&gt;  we emphasized the value of considering many different alternatives for your choice of instance type. In this post we will demonstrate the potential gains of replacing multi-GPU On-Demand instances with multiple single-GPU Spot instances.&lt;/p&gt;

&lt;p&gt;Although our demonstration will use Amazon Web Services, similar conclusions can be reached on alternative cloud service platforms (CSPs). Please do not interpret our choice of CSP or services as an endorsement. The best option for you will depend on the unique details of your project. Furthermore, please keep in mind that the cost savings we demonstrate may not reproduce for your project, and/or that the solution we propose may not be applicable (e.g., for reasons beyond the scope of this post). Be sure to conduct a detailed evaluation of the relevance and efficacy of the proposal before adapting it to your use case.&lt;/p&gt;

&lt;h1&gt;
  
  
  When Multiple Single-GPU Instances are Better than a Single Multi-GPU Instance
&lt;/h1&gt;

&lt;p&gt;Nowadays, training AI models on multiple GPU devices in parallel, a process called  &lt;em&gt;distributed training&lt;/em&gt;, is commonplace. Setting aside instance pricing, when you have the choice between a single instance with multiple GPUs and multiple single-GPU instances with the same GPU type, you would typically choose the multi-GPU instance. Distributed training typically requires a considerable amount of data communication (e.g., gradient sharing) between the GPUs. The proximity of the GPUs on a single instance is bound to facilitate higher network bandwidth and lower latency. Moreover, some multi-GPU instances include dedicated GPU-to-GPU interconnects that can further accelerate the communication (e.g.,  &lt;a href="https://www.nvidia.com/en-eu/data-center/nvlink/"&gt;NVLink&lt;/a&gt;  on  &lt;a href="https://aws.amazon.com/ec2/instance-types/p4/"&gt;p4d.24xlarge&lt;/a&gt;). However, when Spot capacity is limited to single-GPU instances, the option of training on multiple single-GPU instances at a much lower cost becomes more compelling. At the very least, it warrants evaluation of the opportunity for cost savings.&lt;/p&gt;

&lt;h1&gt;
  
  
  Optimizing Data Communication Between Multiple EC2 Instances
&lt;/h1&gt;

&lt;p&gt;When distributed training runs on multiple instances, the GPUs communicate with one another via the network between the host machines. To optimize the speed of training and reduce the likelihood and/or impact of a network bottleneck, we need to ensure minimal network latency and maximal data throughput. These can be affected by a number of factors.&lt;/p&gt;

&lt;h2&gt;
  
  
  Instance Collocation
&lt;/h2&gt;

&lt;p&gt;Network latency can be greatly impacted by the relative locations of the EC2 instances. Ideally, when we request multiple cloud-based instances we would like them all to be collocated on the same physical rack. In practice, without appropriate configuration, they may not even be in the same city. In our demonstration below we will use a  &lt;a href="https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_VpcConfig.html"&gt;VPC Config&lt;/a&gt;  object to program an Amazon SageMaker training job to use a single subnet of an  &lt;a href="https://docs.aws.amazon.com/sagemaker/latest/dg/train-vpc.html"&gt;Amazon Virtual Private Cloud (VPC)&lt;/a&gt;. This technique will ensure that all the requested training instances will be in the same  &lt;a href="https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/using-regions-availability-zones.html"&gt;availability zone (AZ)&lt;/a&gt;. However, collocation in the same AZ may not suffice. Furthermore, the method we described involves choosing a subnet associated with one specific AZ (e.g., the one with the highest  &lt;a href="https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/spot-placement-score.html"&gt;Spot placement score&lt;/a&gt;). A preferred API would fulfill the request in any AZ that has sufficient capacity.&lt;/p&gt;

&lt;p&gt;A better way to control the placement of our instances is to launch them inside a  &lt;a href="https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/placement-groups.html"&gt;placement group&lt;/a&gt;, specifically a  &lt;a href="https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/placement-groups.html#placement-groups-cluster"&gt;cluster placement group&lt;/a&gt;. Not only will this guarantee that all of the instances will be in the same  &lt;a href="https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/using-regions-availability-zones.html"&gt;AZ&lt;/a&gt;, but it will also place them on “the same high-bisection bandwidth segment of the network” so as to maximize the performance of the network traffic between them. However, as of the time of this writing SageMaker does  &lt;em&gt;not&lt;/em&gt;  provide the option to specify a  &lt;a href="https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/placement-groups.html"&gt;placement group&lt;/a&gt;. To take advantage of placement groups we would need to use an alternative training service solution (as we will demonstrate below).&lt;/p&gt;

&lt;h2&gt;
  
  
  EC2 Network Bandwidth Constraints
&lt;/h2&gt;

&lt;p&gt;Be sure to take into account the maximal  &lt;a href="https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/ec2-instance-network-bandwidth.html"&gt;network bandwidth&lt;/a&gt;  supported by the EC2 instances that you choose. Note, in particular, that the network bandwidths associated with single-GPU machines are often documented as being “up to” a certain number of Gbps. Make sure to understand what that means and how it can impact the speed of training over time.&lt;/p&gt;
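
&lt;p&gt;The documented bandwidth of an instance type can be looked up programmatically via the EC2 &lt;em&gt;describe_instance_types&lt;/em&gt; API, which returns strings such as “Up to 25 Gigabit”. The sketch below is illustrative; the optional client argument is our own convenience for stubbing the API in tests.&lt;/p&gt;

```python
def network_performance(instance_type, ec2_client=None):
    """Return the documented network performance string for an EC2
    instance type, e.g. 'Up to 25 Gigabit'. An 'Up to' prefix means the
    bandwidth is burstable rather than sustained."""
    if ec2_client is None:
        import boto3  # imported lazily so the helper is easy to stub
        ec2_client = boto3.client('ec2')
    resp = ec2_client.describe_instance_types(InstanceTypes=[instance_type])
    return resp['InstanceTypes'][0]['NetworkInfo']['NetworkPerformance']
```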

&lt;p&gt;Keep in mind that the GPU-to-GPU data communication (e.g., gradient sharing) might need to share the limited network bandwidth with other data flowing through the network such as training samples being streamed into the training instances or training artifacts being uploaded to persistent storage. Consider ways of reducing the payload of each of the categories of data to minimize the likelihood of a network bottleneck.&lt;/p&gt;

&lt;h2&gt;
  
  
  Elastic Fabric Adapter (EFA)
&lt;/h2&gt;

&lt;p&gt;A growing number of EC2 instance types support  &lt;a href="https://aws.amazon.com/hpc/efa/"&gt;Elastic Fabric Adapter (EFA)&lt;/a&gt;, a dedicated network interface for optimizing inter-node communication. Using EFA can have a decisive impact on the runtime performance of your training workload. Note that the bandwidth on the EFA network channel is different from the documented bandwidth of the standard network. As of the time of this writing, detailed documentation of the EFA capabilities is hard to come by and it is usually best to evaluate its impact through trial and error. Consider using an  &lt;a href="https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/efa.html#efa-instance-types"&gt;EC2 instance type that supports EFA&lt;/a&gt;  when relevant.&lt;/p&gt;
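
&lt;p&gt;The EC2 &lt;em&gt;describe_instance_types&lt;/em&gt; API also reports whether an instance type supports EFA, which can be used to filter a list of candidate types, as in the sketch below (the optional client argument is our own convenience for stubbing the API in tests).&lt;/p&gt;

```python
def efa_supported_types(instance_types, ec2_client=None):
    """Filter a list of EC2 instance types down to those that support
    Elastic Fabric Adapter (EFA)."""
    if ec2_client is None:
        import boto3  # imported lazily so the helper is easy to stub
        ec2_client = boto3.client('ec2')
    resp = ec2_client.describe_instance_types(InstanceTypes=instance_types)
    return [t['InstanceType'] for t in resp['InstanceTypes']
            if t['NetworkInfo'].get('EfaSupported')]
```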

&lt;h1&gt;
  
  
  Toy Example
&lt;/h1&gt;

&lt;p&gt;We will now demonstrate the comparative price performance of training on four single-GPU  &lt;a href="https://aws.amazon.com/ec2/instance-types/g5/"&gt;EC2 g5&lt;/a&gt;  Spot instances (ml.g5.2xlarge and ml.g5.4xlarge) vs. a single four-GPU On-Demand instance (ml.g5.12xlarge). We will use the training script below containing a Vision Transformer (ViT)-backed classification model (trained on synthetic data).&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import os, torch, time  
import torch.distributed as dist  
from torch.utils.data import Dataset, DataLoader  
from torch.cuda.amp import autocast  
from torch.nn.parallel import DistributedDataParallel as DDP  
from timm.models.vision_transformer import VisionTransformer  

batch_size = 128  
log_interval = 10  

# use random data  
class FakeDataset(Dataset):  
    def __len__(self):  
        return 1000000  

    def __getitem__(self, index):  
        rand_image = torch.randn([3, 224, 224], dtype=torch.float32)  
        label = torch.tensor(data=[index % 1000], dtype=torch.int64)  
        return rand_image, label  

def mp_fn():  
    local_rank = int(os.environ['LOCAL_RANK'])  
    dist.init_process_group("nccl")  
    torch.cuda.set_device(local_rank)  

    # model definition  
    model = VisionTransformer()  
    loss_fn = torch.nn.CrossEntropyLoss()  
    model.to(torch.cuda.current_device())  
    model = DDP(model)  
    optimizer = torch.optim.Adam(params=model.parameters())  

    # dataset definition  
    num_workers = os.cpu_count()//int(os.environ['LOCAL_WORLD_SIZE'])  
    dl = DataLoader(FakeDataset(), batch_size=batch_size, num_workers=num_workers)  

    model.train()  
    t0 = time.perf_counter()  
    for batch_idx, (x, y) in enumerate(dl, start=1):  
        optimizer.zero_grad(set_to_none=True)  
        x = x.to(torch.cuda.current_device())  
        y = torch.squeeze(y.to(torch.cuda.current_device()), -1)  
        with autocast(enabled=True, dtype=torch.bfloat16):  
            outputs = model(x)  
            loss = loss_fn(outputs, y)  
        loss.backward()  
        optimizer.step()  
        if batch_idx % log_interval == 0 and local_rank == 0:  
            time_passed = time.perf_counter() - t0  
            samples_processed = dist.get_world_size() * batch_size * log_interval  
            print(f'{samples_processed / time_passed} samples/second')  
            t0 = time.perf_counter()  

if __name__ == '__main__':  
    mp_fn()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The code block below demonstrates how we used the  &lt;a href="https://sagemaker.readthedocs.io/en/stable/"&gt;SageMaker Python package&lt;/a&gt;  (version 2.203.1) to run our experiments. Note that for the four-instance experiments, we configure the use of a VPC with a single subnet, as explained above.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;from sagemaker.pytorch import PyTorch  
from sagemaker.vpc_utils import VPC_CONFIG_DEFAULT  


# Toggle flag to switch between multiple single-GPU nodes and  
# single multi-GPU node  
multi_inst = False  

inst_count=1  
inst_type='ml.g5.12xlarge'  
use_spot_instances=False  
max_wait=None #max seconds to wait for Spot job to complete  
subnets=None  
security_group_ids=None  

if multi_inst:  
    inst_count=4  
    inst_type='ml.g5.4xlarge' # optionally change to ml.g5.2xlarge  
    use_spot_instances=True  
    max_wait=24*60*60 #24 hours  
    # configure vpc settings  
    subnets=['&amp;lt;VPC subnet&amp;gt;']  
    security_group_ids=['&amp;lt;Security Group&amp;gt;']  


estimator = PyTorch(  
    role='&amp;lt;sagemaker role&amp;gt;',  
    entry_point='train.py',  
    source_dir='&amp;lt;path to source dir&amp;gt;',  
    instance_type=inst_type,  
    instance_count=inst_count,  
    framework_version='2.1.0',  
    py_version='py310',  
    distribution={'torch_distributed': {'enabled': True}},  
    subnets=subnets,  
    security_group_ids=security_group_ids,  
    use_spot_instances=use_spot_instances,  
    max_wait=max_wait  
)  

# start job  
estimator.fit()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Note that our code depends on the third-party  &lt;a href="https://pypi.org/project/timm/"&gt;&lt;em&gt;timm&lt;/em&gt;&lt;/a&gt;  Python package that we point to in a requirements.txt file in the root of the source directory. This assumes that the VPC has been configured to  &lt;a href="https://docs.aws.amazon.com/vpc/latest/userguide/VPC_Internet_Gateway.html"&gt;enable internet access&lt;/a&gt;. Alternatively, you could define a private PyPI server (as described  &lt;a href="https://aws.amazon.com/blogs/machine-learning/hosting-a-private-pypi-server-for-amazon-sagemaker-studio-notebooks-in-a-vpc/"&gt;here&lt;/a&gt;), or create a custom image with your third party dependencies preinstalled (as described  &lt;a href="https://towardsdatascience.com/customizing-your-cloud-based-machine-learning-training-environment-part-2-b65a6cf91812"&gt;here&lt;/a&gt;).&lt;/p&gt;

&lt;h1&gt;
  
  
  Results
&lt;/h1&gt;

&lt;p&gt;We summarize the results of our experiment in the table below. The On-Demand prices were taken from the  &lt;a href="https://aws.amazon.com/sagemaker/pricing/"&gt;SageMaker pricing page&lt;/a&gt;  (as of the time of this writing, January 2024). The Spot saving values were collected from the reported  &lt;em&gt;managed spot training savings&lt;/em&gt; of the completed job. Please see the  &lt;a href="https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/using-spot-instances-history.html"&gt;EC2 Spot pricing documentation&lt;/a&gt;  to get a sense for how the reported Spot savings are calculated.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--jNb7lHw4--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://miro.medium.com/v2/resize:fit:700/1%2A8GTS6c7JylvxiHNjcZxFiQ.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--jNb7lHw4--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://miro.medium.com/v2/resize:fit:700/1%2A8GTS6c7JylvxiHNjcZxFiQ.png" alt="" width="700" height="119"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Experiment Results (by Author)&lt;/p&gt;

&lt;p&gt;Our results clearly demonstrate the potential for considerable savings when using four single-GPU Spot instances rather than a single four-GPU On-Demand instance. They further demonstrate that although the On-Demand price of the g5.4xlarge instance type is higher than that of the g5.2xlarge, its increased CPU power and/or network bandwidth, combined with higher Spot savings, resulted in much greater overall savings.&lt;/p&gt;

&lt;p&gt;Importantly, keep in mind that the relative performance results can vary considerably based on the details of your job as well as the Spot prices at the time that you run your experiments.&lt;/p&gt;

&lt;h1&gt;
  
  
  Enforcing EC2 Instance Co-location Using a Cluster Placement Group
&lt;/h1&gt;

&lt;p&gt;In a  &lt;a href="https://towardsdatascience.com/a-simple-solution-for-managing-cloud-based-ml-training-c80a69c6939a"&gt;previous post&lt;/a&gt;  we described how to create a customized managed environment on top of an unmanaged service, such as  &lt;a href="https://aws.amazon.com/ec2/"&gt;Amazon EC2&lt;/a&gt;. One of the motivating factors listed there was the desire to have greater control over device placement in a multi-instance setup, e.g., by using a  &lt;a href="https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/placement-groups.html#placement-groups-cluster"&gt;cluster placement group&lt;/a&gt;, as discussed above. In this section, we demonstrate the creation of a multi-node setup using a cluster placement group.&lt;/p&gt;

&lt;p&gt;Our code assumes the presence of a  &lt;a href="https://docs.aws.amazon.com/vpc/latest/userguide/default-vpc.html"&gt;default VPC&lt;/a&gt;  as well as the (one-time) creation of a  &lt;a href="https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/placement-groups.html#placement-groups-cluster"&gt;cluster placement group&lt;/a&gt;, demonstrated here using the  &lt;a href="https://boto3.amazonaws.com/v1/documentation/api/latest/index.html"&gt;AWS Python SDK&lt;/a&gt;  (version 1.34.23):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import boto3  

ec2 = boto3.client('ec2')  
ec2.create_placement_group(  
    GroupName='cluster-placement-group',  
    Strategy='cluster'  
) 
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;In the code block below we use the  &lt;a href="https://boto3.amazonaws.com/v1/documentation/api/latest/index.html"&gt;AWS Python SDK&lt;/a&gt;  to launch our Spot instances:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import boto3  

ec2 = boto3.resource('ec2')  
instances = ec2.create_instances(  
    MaxCount=4,  
    MinCount=4,  
    ImageId='ami-0240b7264c1c9e6a9', # replace with image of choice  
    InstanceType='g5.4xlarge',  
    Placement={'GroupName':'cluster-placement-group'},  
    InstanceMarketOptions={  
        'MarketType': 'spot',  
        'SpotOptions': {  
            "SpotInstanceType": "one-time",  
            "InstanceInterruptionBehavior": "terminate"  
        }  
    },  
)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Please see our  &lt;a href="https://towardsdatascience.com/a-simple-solution-for-managing-cloud-based-ml-training-c80a69c6939a"&gt;previous post&lt;/a&gt;  for step-by-step tips on how to extend this to an automated training solution.&lt;/p&gt;

&lt;h1&gt;
  
  
  Summary
&lt;/h1&gt;

&lt;p&gt;In this post, we have illustrated how flexibility in your choice of training instance type can increase your ability to leverage Spot instance capacity and reduce the overall cost of training.&lt;/p&gt;

&lt;p&gt;As the sizes of AI models continue to grow and the costs of AI training accelerators continue to rise, it becomes increasingly important that we explore ways to mitigate training expenses. The technique outlined here is just one among several methods for optimizing cost performance. We encourage you to explore our  &lt;a href="https://chaimrand.medium.com/"&gt;previous posts&lt;/a&gt;  for insights into additional opportunities in this realm.&lt;/p&gt;

</description>
      <category>deeplearning</category>
      <category>ec2</category>
      <category>mlops</category>
      <category>spot</category>
    </item>
    <item>
      <title>Debugging and Tuning Amazon SageMaker Training Jobs with SageMaker SSH Helper</title>
      <dc:creator>Chaim Rand</dc:creator>
      <pubDate>Fri, 29 Dec 2023 08:52:26 +0000</pubDate>
      <link>https://dev.to/aws-builders/debugging-and-tuning-amazon-sagemaker-training-jobs-with-sagemaker-ssh-helper-2ci</link>
      <guid>https://dev.to/aws-builders/debugging-and-tuning-amazon-sagemaker-training-jobs-with-sagemaker-ssh-helper-2ci</guid>
      <description>&lt;h2&gt;
  
  
  A new tool that increases the debuggability of managed training workloads
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--7ijVErVT--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://miro.medium.com/v2/resize:fit:1050/0%2AOxo3P2aLT9HAmSFb" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--7ijVErVT--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://miro.medium.com/v2/resize:fit:1050/0%2AOxo3P2aLT9HAmSFb" alt="" width="800" height="533"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Photo by  &lt;a href="https://unsplash.com/@tumbao1949?utm_source=medium&amp;amp;utm_medium=referral"&gt;James Wainscoat&lt;/a&gt;  on  &lt;a href="https://unsplash.com/?utm_source=medium&amp;amp;utm_medium=referral"&gt;Unsplash&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Considering all the new Amazon SageMaker features announced over the past year (2023), including at the most recent  &lt;a href="https://press.aboutamazon.com/2023/11/aws-announces-five-new-amazon-sagemaker-capabilities-for-scaling-with-models"&gt;AWS re:Invent&lt;/a&gt;, it would have been easy to overlook  &lt;a href="https://github.com/aws-samples/sagemaker-ssh-helper"&gt;SageMaker SSH Helper&lt;/a&gt;, a new utility for connecting to remote SageMaker training environments. But  &lt;strong&gt;sometimes it is the quiet enhancements that have the potential to make the greatest impact on your daily development.&lt;/strong&gt;  In this post we will review  &lt;a href="https://github.com/aws-samples/sagemaker-ssh-helper"&gt;SageMaker SSH Helper&lt;/a&gt;  and demonstrate how it can increase your ability to 1) investigate and solve errors that arise in your training applications and 2) optimize their runtime performance.&lt;/p&gt;

&lt;p&gt;In  &lt;a href="https://towardsdatascience.com/6-steps-to-migrating-your-machine-learning-project-to-the-cloud-6d9b6e4f18e0"&gt;previous posts&lt;/a&gt;, we discussed at length the benefits of training in the cloud. Cloud-based managed training services, such as  &lt;a href="https://aws.amazon.com/sagemaker/"&gt;Amazon SageMaker&lt;/a&gt;, have simplified many of the complexities surrounding AI model development and greatly increased accessibility to both AI-specific machinery and  &lt;a href="https://docs.aws.amazon.com/sagemaker/latest/dg/studio-jumpstart.html"&gt;pretrained AI models&lt;/a&gt;. To train in Amazon SageMaker, all you need to do is define a training environment (including an instance type) and point to the code you wish to run, and the training service will 1) set up the requested environment, 2) deliver your code to the training machine, 3) run your training script, 4) copy the training output to persistent storage, and 5) tear everything down when the training completes (so that you pay only for what you need). Sounds easy… right? However, managed training is not without its flaws, one of which — the limited access it enables to the training environment — will be discussed in this post.&lt;/p&gt;

&lt;h2&gt;
  
  
  Disclaimers
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt; Please do not interpret our use of Amazon SageMaker,  &lt;a href="https://github.com/aws-samples/sagemaker-ssh-helper"&gt;SageMaker SSH Helper&lt;/a&gt;, or any other framework or utility we mention as an endorsement for their use. There are many different methodologies for developing AI models. The best solution for you will depend on the details of your project.&lt;/li&gt;
&lt;li&gt; Please be sure to verify the contents of this post, particularly the code samples, against the most up-to-date software and documentation available at the time that you read this. The landscape of AI development tools is in constant flux and it is likely that some of the APIs we refer to will change over time.&lt;/li&gt;
&lt;/ol&gt;

&lt;h1&gt;
  
  
  Disadvantage of Managed Training — Inaccessibility of the Training Environment
&lt;/h1&gt;

&lt;p&gt;As seasoned developers are well aware, a significant chunk of application development time is actually spent on debugging. Rarely do our programs work “out of the box”; more often than not, they require hours of laborious debugging to get them to run as desired. Of course, to be able to debug effectively, you need to have direct access to your application environment. Trying to debug an application without access to its environment is like trying to fix a faucet without a wrench.&lt;/p&gt;

&lt;p&gt;Another important step in AI model development is to tune the runtime performance of the training application. Training AI models can be expensive and our ability to maximize the utilization of our compute resources can have a decisive impact on training costs. In a  &lt;a href="https://towardsdatascience.com/cloud-ml-performance-checklist-caa51e798002"&gt;previous post&lt;/a&gt;  we described the iterative process of analyzing and optimizing training performance. Similar to debugging, direct access to the runtime environment will greatly accelerate our ability to reach the best results.&lt;/p&gt;

&lt;p&gt;Unfortunately, one of the side effects of the “fire and forget” nature of training in SageMaker is the inability to freely connect to the training environment. Of course, you could always debug and optimize performance using the training job output logs and debug prints (i.e., add prints, study the output logs, modify your code, and repeat until you’ve solved all your bugs and reached the desired performance) but this would be a very primitive and time-consuming solution.&lt;/p&gt;

&lt;p&gt;There are a number of best practices that address the problem of debugging managed training workloads, each with its own advantages and disadvantages. We will review three of these, discuss their limitations, and then demonstrate how the new  &lt;a href="https://github.com/aws-samples/sagemaker-ssh-helper"&gt;SageMaker SSH Helper&lt;/a&gt;  completely alters the playing field.&lt;/p&gt;

&lt;h2&gt;
  
  
  Debug in Your Local Environment
&lt;/h2&gt;

&lt;p&gt;It is recommended that you run a few training steps in your local environment before launching your job to the cloud. Although this may require a few modifications to your code (e.g., to enable training on a CPU device), it is usually worth the effort as it enables you to identify and fix silly coding errors. It is certainly more cost effective than discovering them on an expensive GPU machine in the cloud. Ideally, your local environment would be as similar as possible to the SageMaker training environment (e.g., using the same versions of Python and Python packages), but in most cases there is a limit to the extent that this is achievable.&lt;/p&gt;
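One common modification is choosing the compute device based on where the script is running. Below is a minimal sketch of this pattern. Note that the helper names are our own, and while SageMaker training containers are known to expose a number of SM_* environment variables (such as SM_TRAINING_ENV), you should verify the exact variable names against the current SageMaker documentation:

```python
import os

def running_in_sagemaker() -> bool:
    # SageMaker training containers set a number of SM_* environment
    # variables; in a local environment none of them are defined
    return "SM_TRAINING_ENV" in os.environ

def pick_device() -> str:
    # fall back to CPU for quick local debugging runs; the returned
    # string would be passed to torch.device(...) (or the equivalent
    # in your framework of choice)
    return "cuda" if running_in_sagemaker() else "cpu"
```

With this in place, the same training script can run unmodified both locally and in the cloud.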

&lt;h2&gt;
  
  
  Debug Locally within the SageMaker Docker Container
&lt;/h2&gt;

&lt;p&gt;The second option is to pull the  &lt;a href="https://docs.aws.amazon.com/deep-learning-containers/latest/devguide/deep-learning-containers-images.html"&gt;deep learning container (DLC) image&lt;/a&gt;  that SageMaker uses and run your training script within the container on your local PC. This method allows you to get a good understanding of the SageMaker training environment including the packages (and package versions) that are installed. It is extremely useful in identifying missing dependencies and addressing dependency conflicts. Please see  &lt;a href="https://github.com/aws/deep-learning-containers/blob/master/available_images.md"&gt;the documentation&lt;/a&gt;  for details on how to log in and pull the appropriate image. Note that the SageMaker APIs support pulling and training within a DLC via its  &lt;a href="https://sagemaker.readthedocs.io/en/stable/overview.html#local-mode"&gt;local mode&lt;/a&gt;  feature. However, running the image on your own will enable you to explore and study the image more freely.&lt;/p&gt;

&lt;h2&gt;
  
  
  Debug in the Cloud on an Unmanaged Instance
&lt;/h2&gt;

&lt;p&gt;Another option is to train on an unmanaged  &lt;a href="https://aws.amazon.com/ec2/"&gt;Amazon EC2&lt;/a&gt;  instance in the cloud. The advantage of this option is the ability to run on the same instance type that you use in SageMaker. This will enable you to reproduce issues that you may not be able to reproduce in your local environment, e.g., issues related to your use of the GPU resources. The easiest way to do this would be to run your instance with a  &lt;a href="https://aws.amazon.com/machine-learning/amis/"&gt;machine image&lt;/a&gt;  that is as similar as possible to your SageMaker environment (e.g., the same OS, Python, and Python package versions). Alternatively, you could pull the SageMaker DLC and run it on the remote instance. However, keep in mind that although this also runs in the cloud, the runtime environment may still be significantly different from SageMaker’s environment. SageMaker configures a whole bunch of system settings during initialization, and trying to reproduce the same environment may require quite a bit of effort. Given that debugging in the cloud is more costly than the previous two methods, our goal should be to clean up our code as much as possible before resorting to this option.&lt;/p&gt;

&lt;h2&gt;
  
  
  Debugging Limitations
&lt;/h2&gt;

&lt;p&gt;Although each of the above options is useful for solving certain types of bugs, none of them offer a way to perfectly replicate the SageMaker environment. Consequently, you may run into issues when running in SageMaker that you are not able to reproduce, and thus not able to correct, when using these methods. In particular, there are a number of features that are supported  &lt;em&gt;only&lt;/em&gt; when running in the SageMaker environment (e.g., SageMaker’s  &lt;a href="https://aws.amazon.com/blogs/machine-learning/using-pipe-input-mode-for-amazon-sagemaker-algorithms/"&gt;Pipe input&lt;/a&gt;  and  &lt;a href="https://aws.amazon.com/about-aws/whats-new/2021/10/amazon-sagemaker-fast-file-mode/"&gt;Fast File&lt;/a&gt;  modes for accessing data from Amazon S3). If your issue is related to one of those features, you will  &lt;em&gt;not&lt;/em&gt;  be able to reproduce it outside of SageMaker.&lt;/p&gt;

&lt;h2&gt;
  
  
  Tuning Limitations
&lt;/h2&gt;

&lt;p&gt;In addition, the options above do not provide an effective solution for performance tuning. Runtime performance can be extremely sensitive to even the slightest changes in the environment. While a simulated environment might provide some general optimization hints (e.g., the comparative performance overhead of different data augmentations), an accurate profiling analysis can be performed only in the SageMaker runtime environment.&lt;/p&gt;

&lt;h1&gt;
  
  
  SageMaker SSH Helper
&lt;/h1&gt;

&lt;p&gt;&lt;a href="https://github.com/aws-samples/sagemaker-ssh-helper"&gt;SageMaker SSH Helper&lt;/a&gt;  introduces that ability to connect to the remote SageMaker training environment. This is enabled via an SSH connection over  &lt;a href="https://docs.aws.amazon.com/systems-manager/latest/userguide/ssm-agent.html"&gt;AWS SSM&lt;/a&gt;. As we will demonstrate, the steps required to set this up are quite simple and very well worth the effort. The  &lt;a href="https://github.com/aws-samples/sagemaker-ssh-helper/tree/v2.1.0"&gt;official documentation&lt;/a&gt;  includes comprehensive details on the value of this utility and how it can be used.&lt;/p&gt;

&lt;h2&gt;
  
  
  Example
&lt;/h2&gt;

&lt;p&gt;In the code block below we demonstrate how to enable remote connection to a SageMaker training job using  &lt;a href="https://pypi.org/project/sagemaker-ssh-helper/"&gt;sagemaker-ssh-helper&lt;/a&gt;  (version 2.1.0). We pass in our full source code directory but replace our usual  &lt;em&gt;entry_point&lt;/em&gt; (&lt;em&gt;train.py&lt;/em&gt;)  with a new  &lt;em&gt;run_ssh.py&lt;/em&gt;  script that we place in the root of the  &lt;em&gt;source_dir&lt;/em&gt;. Note that we add the SSH Helper dependency directory to the list of project dependencies since our  &lt;em&gt;run_ssh.py&lt;/em&gt;  script will require it. Alternatively, we could have added  &lt;a href="https://pypi.org/project/sagemaker-ssh-helper/"&gt;sagemaker-ssh-helper&lt;/a&gt;  to our  &lt;em&gt;requirements.txt&lt;/em&gt;  file. Here we have set the  &lt;em&gt;connection_wait_time_seconds&lt;/em&gt; setting to two minutes. As we will see, this will impact the behavior of our training script.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;from sagemaker.pytorch import PyTorch  
from sagemaker_ssh_helper.wrapper import SSHEstimatorWrapper  
MINUTE = 60  

estimator = PyTorch(  
    role='&amp;lt;sagemaker role&amp;gt;',  
    entry_point='run_ssh.py',  
    source_dir='&amp;lt;path to source dir&amp;gt;',  
    instance_type='ml.g5.xlarge',  
    instance_count=1,  
    framework_version='2.0.1',  
    py_version='py310',  
    dependencies=[SSHEstimatorWrapper.dependency_dir()]  
)  


# configure the SSH wrapper. Set the wait time for the connection.  
ssh_wrapper = SSHEstimatorWrapper.create(estimator,  
                                    connection_wait_time_seconds=2*MINUTE)  

# start job (wait=False returns immediately so that we can  
# retrieve the instance id while the job is running)  
estimator.fit(wait=False)  

# wait to receive an instance id for the connection over SSM  
instance_ids = ssh_wrapper.get_instance_ids()  

print(f'To connect run: aws ssm start-session --target {instance_ids[0]}')
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;As usual, the SageMaker service will allocate a machine instance, build the requested environment, download and unpack our source code, and install the requested dependencies. At that point, the runtime environment will be identical to the one in which we usually run our training script. Only now, instead of training, we will run our  &lt;em&gt;run_ssh.py&lt;/em&gt; script:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import sagemaker_ssh_helper  
from time import sleep  

# setup SSH and wait for connection_wait_time_seconds seconds  
# (to give opportunity for the user to connect before script resumes)  
sagemaker_ssh_helper.setup_and_start_ssh()  

# place any code here... e.g. your training code  
# we choose to sleep for two hours to enable connecting in an SSH window  
# and running trials there  
HOUR = 60*60  
sleep(2*HOUR)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The  &lt;a href="https://github.com/aws-samples/sagemaker-ssh-helper/blob/main/sagemaker_ssh_helper/__init__.py#L10"&gt;&lt;em&gt;setup_and_start_ssh&lt;/em&gt;&lt;/a&gt;  function will start the SSH service, then block for the allotted time we defined above (&lt;em&gt;connection_wait_time_seconds&lt;/em&gt;) to allow an SSH client to connect, and then proceed with the rest of the script. In our case it will sleep for two hours and then exit the training job. During that time we can connect to the machine using the  &lt;a href="https://docs.aws.amazon.com/systems-manager/latest/userguide/session-manager-working-with-sessions-start.html"&gt;&lt;em&gt;aws ssm start-session&lt;/em&gt;&lt;/a&gt;  command and the instance-id that was returned by the  &lt;em&gt;ssh_wrapper&lt;/em&gt;  (which typically starts with an “&lt;em&gt;mi-”&lt;/em&gt;  prefix for “&lt;em&gt;managed instance&lt;/em&gt;”) and play to our hearts desire. In particular, we can explicitly run our original training script (which was uploaded as part of the  &lt;em&gt;source_dir&lt;/em&gt;) and monitor the training behavior.&lt;/p&gt;

&lt;p&gt;The method we have described enables us to run our training script iteratively while we identify and fix bugs. It also provides an ideal setting for optimizing performance — one in which we can 1) run a few training steps, 2) identify performance bottlenecks (e.g., using  &lt;a href="https://pytorch.org/tutorials/recipes/recipes/profiler_recipe.html"&gt;PyTorch Profiler&lt;/a&gt;), 3) tune our code to address them, and 4) repeat, until we achieve the desired runtime performance.&lt;/p&gt;

&lt;p&gt;Importantly, keep in mind that the instance will be terminated as soon as the  &lt;em&gt;run_ssh.py&lt;/em&gt; script completes. Make sure to copy all important files (e.g., code modifications, profile traces, etc.) to persistent storage before it is too late.&lt;/p&gt;
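For example, before exiting the session you might sync your working directory to Amazon S3 using the AWS CLI, which is preinstalled on SageMaker training instances. The sketch below wraps the CLI call in a small helper; the bucket name and paths are hypothetical placeholders:

```python
import subprocess
from pathlib import Path

def sync_artifacts(local_dir: str, s3_uri: str, dry_run: bool = False):
    # copy debug artifacts (profiler traces, modified code, etc.) to S3;
    # with dry_run=True the command list is returned without being executed
    cmd = ["aws", "s3", "sync", str(Path(local_dir)), s3_uri]
    if not dry_run:
        subprocess.run(cmd, check=True)
    return cmd

# e.g., sync_artifacts("/opt/ml/code", "s3://my-bucket/debug-session/")
```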

&lt;h2&gt;
  
  
  Port Forwarding Over AWS SSM
&lt;/h2&gt;

&lt;p&gt;We can extend our  &lt;em&gt;aws ssm start-session&lt;/em&gt;  command to enable  &lt;a href="https://aws.amazon.com/blogs/aws/new-port-forwarding-using-aws-system-manager-sessions-manager/"&gt;port forwarding&lt;/a&gt;. This allows you to securely connect to server applications running on your cloud instance. This is particularly exciting for developers who are accustomed to using the  &lt;a href="https://pytorch.org/tutorials/intermediate/tensorboard_profiler_tutorial.html"&gt;TensorBoard Profiler&lt;/a&gt;  plugin for analyzing runtime performance (as  &lt;a href="https://towardsdatascience.com/pytorch-model-performance-analysis-and-optimization-10c3c5822869"&gt;we are&lt;/a&gt;). The command below demonstrates how to set up port forwarding over AWS SSM:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;aws ssm start-session \  
  --target mi-0748ce18cf28fb51b \  
  --document-name AWS-StartPortForwardingSession \  
  --parameters '{"portNumber":["6006"],"localPortNumber":["9999"]}'
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Additional Modes of Use
&lt;/h2&gt;

&lt;p&gt;The  &lt;a href="https://github.com/aws-samples/sagemaker-ssh-helper/tree/v2.1.0"&gt;SageMaker SSH Helper documentation&lt;/a&gt;  describes several different ways of using the SSH functionality. In the  &lt;a href="https://github.com/aws-samples/sagemaker-ssh-helper/tree/v2.1.0#step-3-modify-your-training-script"&gt;basic example&lt;/a&gt;  the  &lt;em&gt;setup_and_start_ssh&lt;/em&gt; command is added to the top of the existing training script (instead of defining a dedicated script). This allows you time (as defined by the  &lt;em&gt;connection_wait_time_seconds&lt;/em&gt; setting) to connect to the machine before the training begins so that you can monitor its behavior (from a separate process) as it runs.&lt;/p&gt;

&lt;p&gt;The more  &lt;a href="https://github.com/aws-samples/sagemaker-ssh-helper/tree/v2.1.0#pycharm-debug-server"&gt;advanced examples&lt;/a&gt;  include different methods for using SageMaker SSH Helper to debug the training script running in the SageMaker environment from an IDE running in our local environment. The setup is more complicated but may very well be worth the reward of being able to perform line-by-line debugging from a local IDE.&lt;/p&gt;

&lt;p&gt;Additional use cases cover  &lt;a href="https://github.com/aws-samples/sagemaker-ssh-helper/blob/v2.1.0/FAQ.md#im-running-sagemaker-in-a-vpc-do-i-need-to-make-extra-configuration"&gt;training in a VPC&lt;/a&gt;, integration with  &lt;a href="https://github.com/aws-samples/sagemaker-ssh-helper/tree/v2.1.0#studio"&gt;SageMaker Studio&lt;/a&gt;,  &lt;a href="https://github.com/aws-samples/sagemaker-ssh-helper/tree/v2.1.0#connecting-to-sagemaker-inference-endpoints-with-ssm"&gt;connecting to SageMaker inference endpoints&lt;/a&gt;, and more. Please be sure to see the documentation for details.&lt;/p&gt;

&lt;h2&gt;
  
  
  When to Use SageMaker SSH Helper
&lt;/h2&gt;

&lt;p&gt;Given the advantages of debugging with SageMaker SSH Helper, you might wonder if there is any reason to use the three debugging methods we described above. We would argue that, despite the fact that you  &lt;em&gt;could&lt;/em&gt; perform all of your debugging in the cloud, it is still highly recommended that you perform your initial development and experimentation phase — to the extent possible — in your local environment (using the first two methods we described). Only once you have exhausted your ability to debug locally, should you move to debugging in the cloud using SageMaker SSH Helper. The last thing you would want would be to spend hours cleaning up silly syntax errors on a super expensive cloud-based GPU machine.&lt;/p&gt;

&lt;p&gt;In contrast to debugging, analyzing and optimizing performance has little value  &lt;em&gt;unless&lt;/em&gt; it is performed directly in the target training environment. Thus, it is advisable to perform your optimization efforts on the SageMaker instance using SageMaker SSH Helper.&lt;/p&gt;

&lt;h1&gt;
  
  
  Summary
&lt;/h1&gt;

&lt;p&gt;Until now, one of the most painful side effects of training on Amazon SageMaker has been the loss of direct access to the training environment. This restricted our ability to debug and tune our training workloads in an effective manner. The recent release of SageMaker SSH Helper and its support for unmediated access to the training environment opens up a wealth of new opportunities for developing, debugging, and tuning. These can have a distinctive impact on the efficiency and speed of your ML development life cycle. It is for this reason that SageMaker SSH Helper is one of our favorite new cloud-ML features of 2023.&lt;/p&gt;

</description>
      <category>aws</category>
      <category>sagemaker</category>
      <category>mlops</category>
      <category>deeplearning</category>
    </item>
    <item>
      <title>Using Server-less Functions to Govern and Monitor Cloud-Based Training Experiments</title>
      <dc:creator>Chaim Rand</dc:creator>
      <pubDate>Sat, 23 Dec 2023 16:46:46 +0000</pubDate>
      <link>https://dev.to/aws-builders/using-server-less-functions-to-govern-and-monitor-cloud-based-training-experiments-4bah</link>
      <guid>https://dev.to/aws-builders/using-server-less-functions-to-govern-and-monitor-cloud-based-training-experiments-4bah</guid>
      <description>&lt;h2&gt;
  
  
  A simple routine that can save you loads of money
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--XfGf2SUG--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://miro.medium.com/v2/resize:fit:1050/0%2ADsBFhvT4LD5Mn8mV" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--XfGf2SUG--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://miro.medium.com/v2/resize:fit:1050/0%2ADsBFhvT4LD5Mn8mV" alt="" width="800" height="533"&gt;&lt;/a&gt;&lt;em&gt;Photo by &lt;a href="https://unsplash.com/@teadrinker42?utm_source=medium&amp;amp;utm_medium=referral"&gt;Ziyou Zhang&lt;/a&gt; on &lt;a href="https://unsplash.com/?utm_source=medium&amp;amp;utm_medium=referral"&gt;Unsplash&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;This blog post was co-authored with my colleague &lt;a href="https://www.linkedin.com/in/shay-margalit-40581668/?originalSubdomain=il"&gt;Shay Margalit&lt;/a&gt;. It summarizes his research into how &lt;a href="https://aws.amazon.com/lambda/"&gt;AWS Lambda&lt;/a&gt; functions can be used to increase the control over the usage and costs of the &lt;a href="https://aws.amazon.com/sagemaker/"&gt;Amazon SageMaker&lt;/a&gt; training service. Interested? Please read on :).&lt;/p&gt;

&lt;p&gt;We are fortunate (or very unfortunate - &lt;a href="https://www.bbc.co.uk/news/uk-65746524"&gt;depending on who you ask&lt;/a&gt;) to be sharing a front row seat to an AI revolution that is expected by many to change the world as we know it. Powered by advances in hardware development and access to enormous amounts of data, this revolution is likely to impact many aspects of our daily lives - although precisely how, no one can say for sure. To support the growing appetite for artificial intelligence, the sizes of the underlying machine learning models are increasing rapidly as are the resources that are required to train them. The bottom line is that staying relevant in the AI development playing field requires a sizable investment into heavy, and expensive, machinery.&lt;/p&gt;

&lt;p&gt;Cloud-based managed training services, such as Amazon SageMaker, Google Vertex AI, and Microsoft Azure ML, have lowered the entry barrier to AI development by enabling developers to train on machines that they could otherwise not afford. Although such services reduce the upfront costs of AI and enable you to pay only for the time you spend training, the potential for the variable costs to add up warrants careful planning of how the training services will be used and how they will contribute to your overall training expense. However, inevitably, things don't always go according to plan. To paraphrase an old Yiddish proverb "developers plan and the programming gods laugh". When the stakes are high, as when training AI models --- where an errant experiment can result in hundreds or thousands of dollars' worth of wasted compute time, it is wise to institute multiple lines of defense.&lt;/p&gt;

&lt;h2&gt;
  
  
  First Line of Defense - Encourage Healthy Development Habits
&lt;/h2&gt;

&lt;p&gt;The first line of defense should address the development practices of the ML algorithm engineers. Here are examples of some &lt;a href="https://towardsdatascience.com/cloud-ml-performance-checklist-caa51e798002"&gt;guiding principles&lt;/a&gt; you might consider:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt; Encourage appropriate and cost-optimal use of the hardware resources used for training (e.g., see &lt;a href="https://towardsdatascience.com/instance-selection-for-deep-learning-7463d774cff0"&gt;here&lt;/a&gt;).&lt;/li&gt;
&lt;li&gt; Identify and terminate failing experiments early.&lt;/li&gt;
&lt;li&gt; Increase price performance by regularly analyzing and optimizing runtime performance (e.g., see &lt;a href="https://medium.com/towards-data-science/pytorch-model-performance-analysis-and-optimization-10c3c5822869"&gt;here&lt;/a&gt;).&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;While formulating and adopting AI development principles such as the ones above is likely to increase your productivity and reduce waste, they do not offer full protection against all possible failures. For example, a dedicated failure detection runtime process may not help address a situation in which a training experiment stalls (e.g., due to a deadlock in the training application's processes) but the training job remains active until it is actively stopped or times out.&lt;/p&gt;

&lt;h2&gt;
  
  
  Second Line of Defense - Deploy Cross-project Guardrails
&lt;/h2&gt;

&lt;p&gt;In this post we propose instituting a second line of defense that monitors all of the training activities in the project (or organization), verifies their compliance with a predetermined set of rules, and takes appropriate action in the case that errant training experiments are identified. One way to do this is to use dedicated server-less functions that are triggered at different stages of a training job and programmed to evaluate the job's state and optionally stop or restart it (possibly with changes to the job settings), accordingly. In the next sections we will demonstrate a few examples of how to use AWS Lambda as a second line of defense against errant Amazon SageMaker training experiments.&lt;/p&gt;

&lt;h2&gt;
  
  
  Disclaimers
&lt;/h2&gt;

&lt;p&gt;Although we have chosen Amazon SageMaker and AWS Lambda for our demonstrations, the contents of the post are just as relevant to other services and similar functionality can be implemented for them. Please do not interpret our choice of these services as an endorsement of their use over their alternatives. There are multiple options available for cloud-based training each with their own advantages and disadvantages. The best choice for you will greatly depend on the details of your project.&lt;/p&gt;

&lt;p&gt;While we will share a few Python examples of server-less code, we will not go into the details of how to create and deploy them as AWS Lambda functions. There are many ways of interacting with AWS Lambda. We refer the reader to the &lt;a href="https://docs.aws.amazon.com/lambda/latest/dg/python-tracing.html"&gt;official AWS documentation&lt;/a&gt; to learn more about them.&lt;/p&gt;

&lt;p&gt;The examples below were created for demonstrative purposes. They will likely require modification to suit the specific needs of your project. Be sure to fully understand all of the details of the code and the associated service costs before adapting the type of solution we propose. Importantly, the code we will share has not undergone rigorous testing. Any solution that includes creation and invocation of multiple Lambda functions and &lt;a href="https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/AlarmThatSendsEmail.html"&gt;Amazon CloudWatch alarms&lt;/a&gt; (as described here) requires appropriate validation to prevent the accumulation of redundant/orphan artifacts.&lt;/p&gt;

&lt;p&gt;We highly advise that you verify the details of this post against the most up-to-date AWS Lambda documentation and most up-to-date versions of the supporting libraries.&lt;/p&gt;

&lt;h1&gt;
  
  
  Enforcing Developer Compliance
&lt;/h1&gt;

&lt;p&gt;While &lt;a href="https://www.redhat.com/en/topics/automation/what-is-cloud-governance"&gt;cloud governance&lt;/a&gt; is often vital for successful and efficient use of cloud services, its enforcement can sometimes be challenging. For example: Amazon SageMaker includes an API for appending &lt;a href="https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_Tag.html"&gt;tags&lt;/a&gt; to training jobs. These can be used to include metadata associated with the SageMaker job such as the name of the training project, the stage of development, the goal of the current trial, the name of the development group or user running the job, etc. This metadata can be used to collect statistics such as the cost of development per project or group. In the code block below, we demonstrate the application of several tags to a SageMaker training job:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;from sagemaker.pytorch import PyTorch

tags = [{'Key': 'username', 'Value': 'johndoe'},\
        {'Key': 'model_name', 'Value': 'mnist'},\
        {'Key': 'training_phase', 'Value': 'finetune'},\
        {'Key': 'description', 'Value': 'fine tune final linear layer'}]

# define the training job with tags\
estimator = PyTorch(\
    entry_point='train.py',\
    framework_version='2.1.0',\
    role='&amp;lt;arn role&amp;gt;',\
    py_version='py310',\
    job_name='demo',\
    instance_type='ml.g5.xlarge',\
    instance_count=1,\
    tags=tags\
)

# deploy the job to the cloud\
estimator.fit()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Naturally, these tags are only helpful if we can enforce their application. This is where &lt;a href="https://aws.amazon.com/lambda/"&gt;AWS Lambda&lt;/a&gt; comes to the rescue. Using &lt;a href="https://docs.aws.amazon.com/sagemaker/latest/dg/automating-sagemaker-with-eventbridge.html"&gt;Amazon EventBridge&lt;/a&gt; we can monitor changes in the status of SageMaker training jobs and register a function that will be triggered on every change. In the code block below, we propose a Python routine that will verify the presence of specific SageMaker tags every time a job is started. If a required tag is missing, the job is automatically terminated. The structure of the event is documented &lt;a href="https://docs.aws.amazon.com/sagemaker/latest/dg/automating-sagemaker-with-eventbridge.html#eventbridge-training"&gt;here&lt;/a&gt;. Note the use of (the more detailed) &lt;a href="https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_DescribeTrainingJob.html"&gt;&lt;em&gt;SecondaryStatus&lt;/em&gt;&lt;/a&gt; field to poll the status of the training job (rather than &lt;em&gt;TrainingJobStatus&lt;/em&gt;).&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import boto3\
def stop_training_job(training_job_name):\
    sm_client = boto3.client("sagemaker")\
    response = sm_client.stop_training_job(TrainingJobName=training_job_name)\
    assert response['ResponseMetadata']['HTTPStatusCode'] == 200\
    # TODO - optionally send an email notification

def enforce_required_tags(training_job_name, event):\
    event_tags = event['detail']['Tags']\
    if 'model_name' not in event_tags:\
        stop_training_job(training_job_name)

# define lambda handler\
def sagemaker_event_handler(event, _):\
    job_name = event['detail']['TrainingJobName']\
    job_secondary_status = event['detail']['SecondaryStatus']\
    if job_secondary_status == 'Starting':\
        enforce_required_tags(job_name, event)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;AWS offers multiple ways for creating a Lambda function. Please see the &lt;a href="https://docs.aws.amazon.com/lambda/latest/dg/API_CreateFunction.html"&gt;AWS Lambda&lt;/a&gt; documentation for details. Once created, make sure to set the function as the target of the &lt;a href="https://docs.aws.amazon.com/sagemaker/latest/dg/automating-sagemaker-with-eventbridge.html#eventbridge-model"&gt;EventBridge rule&lt;/a&gt;.&lt;/p&gt;
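For reference, the rule-plus-target wiring can also be done programmatically. The sketch below only builds the event pattern and the boto3 call arguments without executing them; the rule name and Lambda ARN are hypothetical, and you should verify the detail-type string against the EventBridge documentation:

```python
import json

# event pattern matching SageMaker training-job state changes
SAGEMAKER_TRAINING_PATTERN = {
    "source": ["aws.sagemaker"],
    "detail-type": ["SageMaker Training Job State Change"],
}

def build_rule_calls(rule_name: str, lambda_arn: str):
    # returns (api_name, kwargs) pairs to pass to boto3.client('events'),
    # i.e. events.put_rule(**kwargs) followed by events.put_targets(**kwargs)
    return [
        ("put_rule", {"Name": rule_name,
                      "EventPattern": json.dumps(SAGEMAKER_TRAINING_PATTERN)}),
        ("put_targets", {"Rule": rule_name,
                         "Targets": [{"Id": "sagemaker-handler",
                                      "Arn": lambda_arn}]}),
    ]
```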

&lt;p&gt;The same function can be used to enforce additional development rules that are aimed at controlling cost such as: the types of instances that can be used, the maximum number of instances per job, the maximum runtime of a job, and more.&lt;/p&gt;
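As a sketch of what such rules might look like, the handler could be extended with a policy check such as the one below. The allowed instance types and limits are made-up examples, and the ResourceConfig fields mirror the DescribeTrainingJob schema; verify them against the documented event structure before relying on them:

```python
# example policy values - replace with your project's actual limits
ALLOWED_INSTANCE_TYPES = {'ml.g5.xlarge', 'ml.g5.2xlarge'}
MAX_INSTANCE_COUNT = 2

def policy_violations(event):
    # return the list of rules violated by a training-job event
    config = event['detail'].get('ResourceConfig', {})
    violations = []
    if config.get('InstanceType') not in ALLOWED_INSTANCE_TYPES:
        violations.append('instance_type')
    if config.get('InstanceCount', 1) > MAX_INSTANCE_COUNT:
        violations.append('instance_count')
    return violations
```

A job with a non-empty violations list would then be stopped with the same kind of stop-training routine shown earlier.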

&lt;h1&gt;
  
  
  Stopping Stalled Experiments
&lt;/h1&gt;

&lt;p&gt;Imagine the following scenario: You have planned a large cloud-based training job that will run on eight $30-an-hour ML compute instances for a period of three days. For the purpose of this task, you have secured a budget of $17,280 (8 instances x $30 an hour x 24 hours x 3 days). You start up the training job just before heading out for a three-day holiday weekend. When you return from your holiday weekend, you discover that an hour into the job, the training process stalled causing the expensive machinery to essentially remain completely idle for three long days. Not only have you wasted $17,280 (good luck explaining that to your boss) but your development has now been pushed back by three days!!&lt;/p&gt;
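The budget arithmetic behind this scenario:

```python
instances = 8
dollars_per_instance_hour = 30
hours = 24 * 3  # three days
total_cost = instances * dollars_per_instance_hour * hours
print(total_cost)  # 17280
```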

&lt;p&gt;One way to protect yourself against this type of occurrence is to monitor the utilization of the underlying training job resources. For example, if the GPU utilization of your training instances remains below a certain threshold for an extended period of time, this is likely a sign that something has gone wrong and that the training job should be stopped immediately.&lt;/p&gt;

&lt;p&gt;We will do this by defining an &lt;a href="https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/AlarmThatSendsEmail.html"&gt;Amazon CloudWatch alarm&lt;/a&gt; that monitors the GPU utilization of one of the training instances of each SageMaker job and invokes an AWS Lambda function that terminates the job if the alarm is triggered. Setting this up requires three components: an &lt;a href="https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/AlarmThatSendsEmail.html"&gt;Amazon CloudWatch alarm&lt;/a&gt; (one per training job), an &lt;a href="https://aws.amazon.com/lambda/"&gt;AWS Lambda&lt;/a&gt; function, and an &lt;a href="https://docs.aws.amazon.com/sns/"&gt;Amazon Simple Notification Service (SNS) topic&lt;/a&gt; that is used to link the Lambda function to the CloudWatch alarms.&lt;/p&gt;

&lt;p&gt;First, we create an SNS topic. This can be done via the Amazon SNS Console or in Python, as shown below:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import boto3

sns_client = boto3.client('sns')
# Create an SNS notification topic.
topic = sns_client.create_topic(Name="SageMakerTrainingJobIdleTopic")
topic_arn = topic['TopicArn']
print(f"Created SNS topic arn: {topic_arn}")
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Next, we extend the &lt;em&gt;sagemaker_event_handler&lt;/em&gt; function we defined above to create a unique alarm each time a training job is started. We program the alarm to measure the average GPU utilization over five-minute periods and to alert our SNS topic when there are three consecutive measurements below 1%. The alarm is deleted when the job is completed.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;def create_training_alarm(job_name):\
    topic_arn = '&amp;lt;sns topic arn&amp;gt;'

    SAMPLE_PERIOD_SECONDS = 60 * 5 # 5 minutes\
    SAMPLE_POINTS_LIMIT = 3\
    GPU_UTIL_THRESHOLD_PERCENTAGE = 1

    cloudwatch_client = boto3.client('cloudwatch')

    # A new sample is generated each SAMPLE_PERIOD_SECONDS seconds.\
    # The alarm will set off it there will be more than SAMPLE_POINTS_LIMIT\
    # below the limit.\
    response = cloudwatch_client.put_metric_alarm(\
        AlarmName=job_name + 'GPUUtil',\
        AlarmActions=topic_arn,\
        MetricName='GPUUtilization',\
        Namespace='/aws/sagemaker/TrainingJobs',\
        Statistic='Average',\
        Dimensions=[{\
            "Name": "Host",\
            "Value": job_name+"/algo-1"\
        }],\
        Period=SAMPLE_PERIOD_SECONDS,\
        EvaluationPeriods=SAMPLE_POINTS_LIMIT,\
        DatapointsToAlarm=SAMPLE_POINTS_LIMIT,\
        Threshold=GPU_UTIL_THRESHOLD_PERCENTAGE,\
        ComparisonOperator='LessThanOrEqualToThreshold',\
        TreatMissingData='notBreaching'\
    )\
    assert response['ResponseMetadata']['HTTPStatusCode'] == 200

def delete_training_alarm(job_name):\
    cloudwatch_client = boto3.client('cloudwatch')\
    response = cloudwatch_client.delete_alarms(\
                                   AlarmNames=[job_name+'GPUUtil'])

def sagemaker_event_handler(event, _):
    job_name = event['detail']['TrainingJobName']
    job_secondary_status = event['detail']['SecondaryStatus']
    if job_secondary_status == 'Starting':
        enforce_required_tags(job_name, event)
    elif job_secondary_status == 'Training':
        create_training_alarm(job_name)
    elif job_secondary_status in ['Completed', 'Failed', 'Stopped']:
        delete_training_alarm(job_name)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Last, we define a second Python AWS Lambda function that &lt;a href="https://docs.aws.amazon.com/lambda/latest/dg/with-sns-create-package.html#with-sns-example-deployment-pkg-python"&gt;parses messages received from the SNS topic&lt;/a&gt; and terminates the training job associated with the alarm.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import boto3, json

def lambda_sns_handler(event, context):
    data = json.loads(event['Records'][0]['Sns']['Message'])
    alarm_name = data['AlarmName']
    training_job_name = alarm_name.replace('GPUUtil', '')
    stop_training_job(training_job_name)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;AWS offers multiple mechanisms for subscribing a Lambda function to an SNS topic including the &lt;a href="https://docs.aws.amazon.com/sns/latest/dg/lambda-console.html"&gt;AWS Console&lt;/a&gt;, &lt;a href="https://docs.aws.amazon.com/lambda/latest/dg/with-sns-example.html"&gt;AWS CLI&lt;/a&gt;, and &lt;a href="https://docs.aws.amazon.com/serverless-application-model/latest/developerguide/sam-resource-function.html"&gt;the AWS Serverless Application Model (AWS SAM)&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;The solution we described is summarized in the following diagram:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--cD1GAk4P--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://miro.medium.com/v2/resize:fit:1050/1%2AfH4bgTouVhVfp3L1lIxlbw.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--cD1GAk4P--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://miro.medium.com/v2/resize:fit:1050/1%2AfH4bgTouVhVfp3L1lIxlbw.png" alt="AWS Architecture Diagram (by Author)" width="715" height="250"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Note that the same architecture can be used to enforce a minimum level of GPU utilization of your ML training projects. GPUs are typically the most expensive resource in your training infrastructure and your goal should be to maximize the utilization of all of your training workloads. By dictating a minimum level of utilization (e.g. 80%) you can ensure that all developers &lt;a href="https://towardsdatascience.com/cloud-ml-performance-checklist-caa51e798002"&gt;optimize their workloads appropriately&lt;/a&gt;.&lt;/p&gt;
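
&lt;p&gt;For example, the alarm definition might be adapted along the following lines. This is a sketch only: the 80% floor, the 30-minute sample period, and the &lt;em&gt;MinGPUUtil&lt;/em&gt; alarm name are illustrative assumptions, and the returned dictionary would be passed to &lt;em&gt;put_metric_alarm&lt;/em&gt;:&lt;/p&gt;

```python
# Sketch: parameters for a *minimum* GPU utilization alarm. Same metric and
# dimensions as the idle-job alarm in the article, but with a far higher
# threshold and a longer evaluation window (values are assumptions).
def min_util_alarm_params(job_name, topic_arn,
                          threshold_pct=80,
                          period_seconds=30 * 60,
                          datapoints=4):
    return dict(
        AlarmName=job_name + 'MinGPUUtil',
        AlarmActions=[topic_arn],
        MetricName='GPUUtilization',
        Namespace='/aws/sagemaker/TrainingJobs',
        Statistic='Average',
        Dimensions=[{'Name': 'Host', 'Value': job_name + '/algo-1'}],
        Period=period_seconds,
        EvaluationPeriods=datapoints,
        DatapointsToAlarm=datapoints,
        Threshold=threshold_pct,
        ComparisonOperator='LessThanThreshold',
        TreatMissingData='notBreaching',
    )

# hypothetical job name and topic ARN, for illustration only
params = min_util_alarm_params('my-training-job', 'arn:aws:sns:...')
# boto3.client('cloudwatch').put_metric_alarm(**params)
print(params['AlarmName'], params['Threshold'])
```

&lt;p&gt;The Lambda function receiving the notification could then warn the job owner rather than stop the job outright.&lt;/p&gt;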

&lt;h1&gt;
  
  
  Ensuring Continuity of Development
&lt;/h1&gt;

&lt;p&gt;In our previous example, we demonstrated how to identify and stop a stalled experiment. In the large training job scenario that we described, this helped save a lot of money, but it did not address the three-day delay to development. Obviously, if the source of the stall is in your code, it makes sense to postpone resuming training until the problem is fixed. However, we often encounter training interruptions that are not caused by our code but rather by sporadic failures in the service environment. In such scenarios, your priority may be to ensure training continuity rather than waiting for someone to manually resume the training job (using the most recent &lt;a href="https://pytorch.org/tutorials/recipes/recipes/saving_and_loading_a_general_checkpoint.html"&gt;training checkpoint&lt;/a&gt;). In the code block below, we use the &lt;a href="https://boto3.amazonaws.com/v1/documentation/api/latest/index.html"&gt;boto3&lt;/a&gt; &lt;a href="https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/sagemaker/client/create_training_job.html"&gt;&lt;em&gt;create_training_job&lt;/em&gt;&lt;/a&gt; API to extend our &lt;em&gt;sagemaker_event_handler&lt;/em&gt; function to (naively) resume any training job that has failed after running for at least two hours.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import boto3, datetime

def clone_job(training_name, disable_spot=False):
    # get the description of the original job
    client = boto3.client('sagemaker')
    desc = client.describe_training_job(TrainingJobName=training_name)

    # update the training name
    new_training_name = training_name + 'clone'

    use_spots = (not disable_spot) and desc["EnableManagedSpotTraining"]

    if disable_spot:
        desc["StoppingCondition"].pop("MaxWaitTimeInSeconds", None)

    client.create_training_job(
        TrainingJobName=new_training_name,
        HyperParameters=desc["HyperParameters"],
        AlgorithmSpecification=desc["AlgorithmSpecification"],
        RoleArn=desc["RoleArn"],
        OutputDataConfig=desc["OutputDataConfig"],
        ResourceConfig=desc["ResourceConfig"],
        StoppingCondition=desc["StoppingCondition"],
        EnableNetworkIsolation=desc["EnableNetworkIsolation"],
        EnableInterContainerTrafficEncryption=desc[
            "EnableInterContainerTrafficEncryption"
        ],
        EnableManagedSpotTraining=use_spots,
        Tags=client.list_tags(ResourceArn=desc['TrainingJobArn'])['Tags']
     )

def sagemaker_event_handler(event, _):
    TRAIN_TIME_THRESHOLD = 2 * 60 * 60  # 2 hours
    job_name = event['detail']['TrainingJobName']
    job_secondary_status = event['detail']['SecondaryStatus']
    if job_secondary_status == 'Starting':
        enforce_required_tags(job_name, event)
    elif job_secondary_status == 'Training':
        create_training_alarm(job_name)
    elif job_secondary_status in ['Completed', 'Failed', 'Stopped']:
        delete_training_alarm(job_name)

    if job_secondary_status == 'Failed':
        start_time = datetime.datetime.utcfromtimestamp(
                                     event['detail']['CreationTime']/1000)
        end_time = datetime.datetime.utcfromtimestamp(
                                     event['detail']['TrainingEndTime']/1000)
        training_time_seconds = (end_time - start_time).total_seconds()
        if training_time_seconds &amp;gt;= TRAIN_TIME_THRESHOLD:
            clone_job(job_name)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The function above automatically resumes any job that fails after two hours. A more practical solution might attempt to diagnose the type of error to determine whether resuming the job would be appropriate. One way to do this is to parse the failure description message and/or the CloudWatch logs associated with the failing job.&lt;/p&gt;
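
&lt;p&gt;A minimal sketch of such a diagnosis step appears below. The substrings used to classify failures are assumptions made for illustration; you would calibrate them against the failure descriptions you actually observe in your own jobs:&lt;/p&gt;

```python
# Hypothetical failure triage: pattern-match the job's failure description
# to decide whether a retry is worthwhile. The substrings are assumptions.
RETRYABLE_PATTERNS = ('InternalServerError', 'CapacityError')
FATAL_PATTERNS = ('AlgorithmError', 'ClientError')

def should_resume(failure_reason):
    if any(p in failure_reason for p in FATAL_PATTERNS):
        return False  # likely a bug in our code -- fix it before resuming
    return any(p in failure_reason for p in RETRYABLE_PATTERNS)

print(should_resume('InternalServerError: unexpected service failure'))
print(should_resume('AlgorithmError: ExecuteUserScriptError exit code 1'))
```

&lt;p&gt;The handler would then call &lt;em&gt;clone_job&lt;/em&gt; only when the check passes.&lt;/p&gt;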

&lt;h1&gt;
  
  
  Advanced Spot-instance Utilization
&lt;/h1&gt;

&lt;p&gt;One of the compelling features of Amazon SageMaker is its support for &lt;a href="https://docs.aws.amazon.com/sagemaker/latest/dg/model-managed-spot-training.html"&gt;managed spot training&lt;/a&gt;. &lt;a href="https://aws.amazon.com/ec2/spot/"&gt;Amazon EC2 Spot Instances&lt;/a&gt; allow you to take advantage of unused EC2 capacity at discounted prices. The catch is that these instances can be taken away ("interrupted") in the middle of their use. Thus, Spot instances should be used only for fault-tolerant workloads. SageMaker makes it easy to take advantage of Spot instances by identifying Spot interruptions on your behalf and automatically restarting jobs when new Spot instances become available. While &lt;a href="https://docs.aws.amazon.com/sagemaker/latest/dg/model-managed-spot-training.html"&gt;managed spot&lt;/a&gt; instances can be used to reduce the cost of training, sometimes this strategy can backfire. For example, when there is low Spot capacity, your training jobs might time out before starting. Alternatively, the job might experience frequent interruptions that prevent it from making any meaningful progress. Both occurrences can interfere with development and reduce productivity. These types of situations can be monitored and addressed using AWS Lambda. In the code block below, we extend our &lt;em&gt;sagemaker_event_handler&lt;/em&gt; function to identify a training job that has been interrupted more than three times and replace it with a cloned job in which managed spot training is disabled.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;def sagemaker_event_handler(event, _):\
    TRAIN_TIME_THRESHOLD = 2 * 60 * 60: # 2 hours\
    MIN_ITERRUPTS = 3\
    job_name = event['detail']['TrainingJobName']\
    job_secondary_status = event['detail']['SecondaryStatus']\
    if job_secondary_status == 'Starting':\
        enforce_required_tags(job_name, event)\
    elif job_secondary_status == 'Training':\
        create_training_alarm(job_name)\
    elif job_secondary_status in ['Completed', 'Failed', 'Stopped']:\
        delete_training_alarm(job_name)

    if job_secondary_status == 'Failed':\
        start_time = datetime.datetime.utcfromtimestamp(\
                                     event['detail']['CreationTime']/1000)\
        end_time = datetime.datetime.utcfromtimestamp(\
                                     event['detail']['TrainingEndTime']/1000)\
        training_time_seconds = (end_time - start_time).seconds\
        if training_time_seconds &amp;gt;= TRAIN_TIME_THRESHOLD:\
            clone_job(job_name)

    if job_secondary_status == 'Interrupted':\
        transitions = event['detail']["SecondaryStatusTransitions"]\
        interrupts = [e for e in transitions if e["Status"] == "Interrupted"]\
        num_interrupts = len(interrupts)\
        if num_interrupts &amp;gt; MIN_ITERRUPTS:\
            stop_training_job(job_name)\
            clone_job(job_name, disable_spot=True)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The implementation above determines the Spot usage strategy based solely on the number of interruptions of the training job in question. A more elaborate solution might take into account other jobs (that use the same instance types), the duration of time across which the interruptions occurred, the amount of active training time, and/or the number of recent jobs that timed out due to low Spot instance capacity.&lt;/p&gt;
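
&lt;p&gt;For instance, a sketch of one such refinement appears below: rather than counting interruptions alone, it estimates how densely they occurred. The transition schema follows &lt;em&gt;SecondaryStatusTransitions&lt;/em&gt;, but representing timestamps as epoch seconds is a simplification made for illustration:&lt;/p&gt;

```python
# Hedged sketch: estimate the interruption *rate* of a job from its status
# transitions (timestamps here are plain epoch seconds, an assumption).
def interrupts_per_hour(transitions, now):
    interrupt_starts = [t['StartTime'] for t in transitions
                        if t['Status'] == 'Interrupted']
    if not interrupt_starts:
        return 0.0
    # hours elapsed since the first interruption
    elapsed_hours = (now - min(interrupt_starts)) / 3600
    return len(interrupt_starts) / max(elapsed_hours, 1e-9)

transitions = [
    {'Status': 'Training', 'StartTime': 0},
    {'Status': 'Interrupted', 'StartTime': 3600},
    {'Status': 'Training', 'StartTime': 5400},
    {'Status': 'Interrupted', 'StartTime': 7200},
]
# 2 interruptions in the 2 hours since the first one -> 1.0 per hour
print(interrupts_per_hour(transitions, now=10800))
```

&lt;p&gt;A policy could combine this rate with the amount of active training time before deciding to fall back to on-demand instances.&lt;/p&gt;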

&lt;h1&gt;
  
  
  Summary
&lt;/h1&gt;

&lt;p&gt;Effective AI model development requires a creative and detailed training infrastructure architecture that minimizes cost and maximizes productivity. In this post we have demonstrated how serverless AWS Lambda functions can be used to augment Amazon SageMaker's managed training service in order to address some common issues that can occur during training. Naturally, the precise manner in which you might apply these kinds of techniques will depend greatly on the specifics of your project.&lt;/p&gt;

&lt;p&gt;Please feel free to reach out with questions, comments, and corrections. Be sure to check out our &lt;a href="https://chaimrand.medium.com/"&gt;other posts&lt;/a&gt; on the topic of DL training optimization.&lt;/p&gt;

</description>
    </item>
    <item>
      <title>Accelerating PyTorch Training Workloads with FP8</title>
      <dc:creator>Chaim Rand</dc:creator>
      <pubDate>Thu, 21 Dec 2023 14:35:00 +0000</pubDate>
      <link>https://dev.to/aws-builders/accelerating-pytorch-training-workloads-with-fp8-3ni5</link>
      <guid>https://dev.to/aws-builders/accelerating-pytorch-training-workloads-with-fp8-3ni5</guid>
      <description>&lt;h1&gt;
  
  
  How to make the most of your modern-day GPU
&lt;/h1&gt;

&lt;p&gt;The past few years have seen revolutionary advancements in the field of AI, perhaps best exemplified by the recent popularity and proliferation of LLM-based applications such as &lt;a href="https://en.wikipedia.org/wiki/ChatGPT"&gt;ChatGPT&lt;/a&gt;. These breakthroughs have been powered by equally exciting developments in the machinery used to train AI models. New and innovative architectures, sophisticated tensor processing cores, and dedicated HW accelerators have enabled the convergence of AI models of ever-increasing sizes, at faster and faster rates. In this post, we will focus on one particular advancement in AI-specialized HW — the inclusion of dedicated 8-bit floating-point (FP8) tensor processing cores. Appearing in the most modern AI HW architectures (e.g., &lt;a href="https://www.nvidia.com/en-eu/data-center/technologies/hopper-architecture/"&gt;Nvidia Hopper&lt;/a&gt;, &lt;a href="https://www.nvidia.com/en-eu/geforce/ada-lovelace-architecture/"&gt;Nvidia Ada Lovelace&lt;/a&gt;, and &lt;a href="https://www.intel.com/content/www/us/en/developer/articles/technical/habana-gaudi2-processor-for-deep-learning.html"&gt;Habana Gaudi2&lt;/a&gt;) the FP8 tensor cores enable a significant increase in floating-point operations per second (FLOPS), as well as opportunities for memory optimization and energy savings for both AI training and inference workloads.&lt;/p&gt;

&lt;p&gt;Taking advantage of the HW-level FP8 capabilities requires appropriate support in the SW stack and development framework that we use to build our AI training and inference applications. In this post we will describe how to &lt;strong&gt;modify a PyTorch training script so as to utilize the built-in support for the FP8 datatype of an &lt;a href="https://www.nvidia.com/en-eu/data-center/h100/"&gt;Nvidia H100 GPU&lt;/a&gt;&lt;/strong&gt;. We will start by providing some motivation for the use of the FP8 datatype. We will then review the FP8-specific PyTorch API support exposed by the &lt;a href="https://github.com/NVIDIA/TransformerEngine"&gt;Transformer Engine&lt;/a&gt; library and show how to integrate it into a simple training script. Although we will not go into the theory behind the use of FP8 for AI training, we will note the potential challenges involved in its use. Last, we will demonstrate the significant optimization opportunities of the FP8 datatype.&lt;/p&gt;

&lt;h3&gt;
  
  
  Disclaimers
&lt;/h3&gt;

&lt;p&gt;Please do not interpret our mention of any SW component, methodology, or service as an endorsement for its use. The best design for ML development will vary greatly based on the specific details of your own AI workload. Please also keep in mind that the APIs and behaviors of some of the SW packages and components we will mention may change by the time you read this post. You are highly encouraged to evaluate any potential design decisions based on the most up to date HW and SW available.&lt;/p&gt;

&lt;h1&gt;
  
  
  Motivation
&lt;/h1&gt;

&lt;p&gt;As AI models grow more and more sophisticated, so does the machinery required to train them. The &lt;a href="https://www.nvidia.com/en-eu/data-center/h100/"&gt;Nvidia H100 GPU&lt;/a&gt;, said to support “unprecedented performance and scalability”, is (at the time of this writing) Nvidia’s newest and strongest AI accelerator, purposely designed to enable the next generation of AI development. With the current AI hype in full swing, the demand for these GPUs has been huge (e.g., see &lt;a href="https://www.hpcwire.com/2023/08/17/nvidia-h100-are-550000-gpus-enough-for-this-year/"&gt;here&lt;/a&gt;). Accordingly, and unsurprisingly, the cost of these GPUs has been extremely high — perhaps even prohibitive for many of our readers. Fortunately, cloud service providers such as AWS, GCP, and Microsoft Azure offer “pay as you go” (per hour/per second) access to H100-powered machines, thereby opening up the opportunity for their use to a much greater community of AI developers.&lt;/p&gt;

&lt;p&gt;In AWS, H100 GPUs are offered as a component of the &lt;a href="https://aws.amazon.com/blogs/aws/new-amazon-ec2-p5-instances-powered-by-nvidia-h100-tensor-core-gpus-for-accelerating-generative-ai-and-hpc-applications/"&gt;recently announced&lt;/a&gt; &lt;a href="https://aws.amazon.com/ec2/instance-types/p5/"&gt;AWS EC2 p5 instance family&lt;/a&gt;. These instances are claimed to “accelerate your time to solution by up to 4x compared to previous-generation GPU-based EC2 instances and reduce cost to train ML models by up to 40%”.&lt;/p&gt;

&lt;p&gt;In a &lt;a href="https://towardsdatascience.com/instance-selection-for-deep-learning-7463d774cff0"&gt;recent post&lt;/a&gt; we discussed some of the considerations that should go into the choice of an ML training instance. We highlighted the fact that the optimal instance type will be very much dependent on the project at hand. Specifically, when it comes to ML training instances — &lt;strong&gt;bigger is not always better&lt;/strong&gt;. This is particularly true of the p5 instance family. True — the p5 will likely outperform any other instance type — after all, the H100 is an undisputed performance beast. But once you factor in the cost of the p5 ($98.32 per hour for the 8-GPU p5.48xlarge instance — at the time of this writing), you might find other instance types to be more suitable.&lt;/p&gt;

&lt;p&gt;In the next section we will train a relatively large computer vision model on a p5.48xlarge and compare its performance to a &lt;a href="https://aws.amazon.com/ec2/instance-types/p4/"&gt;p4d.24xlarge&lt;/a&gt; containing 8 &lt;a href="https://www.nvidia.com/en-eu/data-center/a100/"&gt;Nvidia A100 GPUs&lt;/a&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  Toy Model
&lt;/h3&gt;

&lt;p&gt;In the code-block below we define a &lt;a href="https://en.wikipedia.org/wiki/Vision_transformer"&gt;Vision Transformer&lt;/a&gt; (ViT)-backed classification model (using the popular &lt;a href="https://pypi.org/project/timm/"&gt;timm&lt;/a&gt; Python package version 0.9.10) along with a randomly generated dataset. ViT backbones come in many shapes and sizes. Here we have chosen what is often referred to as the ViT-Huge configuration — with &lt;strong&gt;632&lt;/strong&gt; million parameters — in order to take better advantage of the capacity the H100 has for large models.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import torch, time
import torch.optim
import torch.utils.data
import torch.distributed as dist
from torch.nn.parallel.distributed import DistributedDataParallel as DDP
import torch.multiprocessing as mp

# modify batch size according to GPU memory
batch_size = 64

from timm.models.vision_transformer import VisionTransformer

from torch.utils.data import Dataset


# use random data
class FakeDataset(Dataset):
    def __len__(self):
        return 1000000

    def __getitem__(self, index):
        rand_image = torch.randn([3, 224, 224], dtype=torch.float32)
        label = torch.tensor(data=[index % 1000], dtype=torch.int64)
        return rand_image, label


def mp_fn(local_rank, *args):
    # configure process
    dist.init_process_group("nccl",
                            rank=local_rank,
                            world_size=torch.cuda.device_count())
    torch.cuda.set_device(local_rank)
    device = torch.cuda.current_device()

    # create dataset and dataloader
    train_set = FakeDataset()
    train_loader = torch.utils.data.DataLoader(
        train_set, batch_size=batch_size,
        num_workers=12, pin_memory=True)

    # define ViT-Huge model
    model = VisionTransformer(
            embed_dim=1280,
            depth=32,
            num_heads=16,
        ).cuda(device)
    model = DDP(model, device_ids=[local_rank])

    # define loss and optimizer
    criterion = torch.nn.CrossEntropyLoss()
    optimizer = torch.optim.SGD(model.parameters(), lr=0.001, momentum=0.9)

    model.train()

    t0 = time.perf_counter()
    summ = 0
    count = 0

    for step, data in enumerate(train_loader):
        # copy data to GPU
        inputs = data[0].to(device=device, non_blocking=True)
        label = data[1].squeeze(-1).to(device=device, non_blocking=True)

        # use mixed precision to take advantage of bfloat16 support
        with torch.autocast(device_type='cuda', dtype=torch.bfloat16):
            outputs = model(inputs)
            loss = criterion(outputs, label)
        optimizer.zero_grad(set_to_none=True)
        loss.backward()
        optimizer.step()

        # capture step time
        batch_time = time.perf_counter() - t0
        if step &amp;gt; 10:  # skip first steps
            summ += batch_time
            count += 1
        t0 = time.perf_counter()
        if step &amp;gt; 50:
            break
    print(f'average step time: {summ/count}')


if __name__ == '__main__':
    mp.spawn(mp_fn,
             args=(),
             nprocs=torch.cuda.device_count(),
             join=True)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;We trained this model on both the &lt;a href="https://aws.amazon.com/ec2/instance-types/p5/"&gt;p5.48xlarge&lt;/a&gt; and &lt;a href="https://aws.amazon.com/ec2/instance-types/p4/"&gt;p4d.24xlarge&lt;/a&gt; instance types using the dedicated PyTorch 2.1 &lt;a href="https://github.com/aws/deep-learning-containers/blob/master/available_images.md"&gt;AWS deep learning container&lt;/a&gt; (763104351884.dkr.ecr.us-east-1.amazonaws.com/pytorch-training:2.1.0-gpu-py310-cu121-ubuntu20.04-ec2).&lt;/p&gt;

&lt;p&gt;Unsurprisingly, the p5 step-time performance blows away the p4d performance — 0.199 seconds per step compared to 0.41 — more than twice as fast!! That would mean halving the time to train your large ML models. However, when you take into account the difference in cost ($32.77 per-hour for the p4d vs $98.32 per-hour for the p5 — as of the time of this writing) a completely different story unfolds. The &lt;strong&gt;price-performance of the p5 is ~30% worse than the p4d!!&lt;/strong&gt; This is very far from the 40% improvement that appeared in the &lt;a href="https://aws.amazon.com/blogs/aws/new-amazon-ec2-p5-instances-powered-by-nvidia-h100-tensor-core-gpus-for-accelerating-generative-ai-and-hpc-applications/"&gt;p5 announcement&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;At this point you might draw one of two possible conclusions. The first possibility is that, despite all the hype, the p5 is simply not the right machine for you. The second is that the p5 could still be viable, but that adaptations would be required to your model in order to take full advantage of its potential. In the next sections we will adopt the second approach and demonstrate how using the FP8 datatype — unique to the p5 instance type — can completely alter the comparative price-performance results.&lt;/p&gt;
&lt;h1&gt;
  
  
  Integrating FP8 with Transformer Engine
&lt;/h1&gt;

&lt;p&gt;The first thing we should emphasize is that, as of the time of this writing, PyTorch (version 2.1) does not include a native 8-bit floating datatype. To program our script to use FP8 we will use &lt;a href="https://github.com/NVIDIA/TransformerEngine"&gt;Transformer Engine&lt;/a&gt; (TE) a dedicated library for accelerating Transformer models on NVIDIA GPUs. TE (version 0.12) comes preinstalled in the AWS PyTorch 2.1 DL container.&lt;/p&gt;

&lt;p&gt;Although the theory behind the use of FP8 for training is beyond the scope of this post (e.g., see &lt;a href="https://arxiv.org/pdf/2209.05433.pdf"&gt;here&lt;/a&gt;), it is important to be aware that &lt;strong&gt;the mechanics of using FP8 are far more complex than the &lt;a href="https://pytorch.org/docs/stable/amp.html"&gt;16-bit alternatives&lt;/a&gt;&lt;/strong&gt; (float16 and bfloat16). Fortunately, the TE implementation hides all of the messy details from the user. Please see the official &lt;a href="https://docs.nvidia.com/deeplearning/transformer-engine/user-guide/examples/fp8_primer.html"&gt;documentation&lt;/a&gt; as well as this simple &lt;a href="https://github.com/NVIDIA/TransformerEngine/blob/main/examples/pytorch/mnist/main.py"&gt;example&lt;/a&gt; for instructions on how to use the TE APIs. To learn more about what is going on behind the scenes, be sure to see the following two video tutorials.&lt;br&gt;
&lt;/p&gt;
&lt;div class="crayons-card c-embed text-styles text-styles--secondary"&gt;
      &lt;div class="c-embed__cover"&gt;
        &lt;a href="https://www.nvidia.com/en-us/on-demand/session/gtcspring23-s51393/?source=post_page-----5a5123aec7d7--------------------------------" class="c-link s:max-w-50 align-middle" rel="noopener noreferrer"&gt;
          &lt;img alt="" src="https://res.cloudinary.com/practicaldev/image/fetch/s--kJqTlB_S--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdnsecakmi.kaltura.com/p/2935771/thumbnail/entry_id/1_88mj6hcs/width/1200" height="450" class="m-0" width="800"&gt;
        &lt;/a&gt;
      &lt;/div&gt;
    &lt;div class="c-embed__body"&gt;
      &lt;h2 class="fs-xl lh-tight"&gt;
        &lt;a href="https://www.nvidia.com/en-us/on-demand/session/gtcspring23-s51393/?source=post_page-----5a5123aec7d7--------------------------------" rel="noopener noreferrer" class="c-link"&gt;
          FP8 Training with Transformer Engine | NVIDIA On-Demand
        &lt;/a&gt;
      &lt;/h2&gt;
        &lt;p class="truncate-at-3"&gt;
          The session will include an introduction to FP8 and mixed precision, an overview of Transformer Engine features, and a code demo on how to use the library.
        &lt;/p&gt;
      &lt;div class="color-secondary fs-s flex items-center"&gt;
        nvidia.com
      &lt;/div&gt;
    &lt;/div&gt;
&lt;/div&gt;



&lt;div class="crayons-card c-embed text-styles text-styles--secondary"&gt;
      &lt;div class="c-embed__cover"&gt;
        &lt;a href="https://www.nvidia.com/en-us/on-demand/session/gtcspring23-s52166/?source=post_page-----5a5123aec7d7--------------------------------" class="c-link s:max-w-50 align-middle" rel="noopener noreferrer"&gt;
          &lt;img alt="" src="https://res.cloudinary.com/practicaldev/image/fetch/s--6W7Z1Iv1--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdnsecakmi.kaltura.com/p/2935771/thumbnail/entry_id/1_imfejfnz/width/1200" height="450" class="m-0" width="800"&gt;
        &lt;/a&gt;
      &lt;/div&gt;
    &lt;div class="c-embed__body"&gt;
      &lt;h2 class="fs-xl lh-tight"&gt;
        &lt;a href="https://www.nvidia.com/en-us/on-demand/session/gtcspring23-s52166/?source=post_page-----5a5123aec7d7--------------------------------" rel="noopener noreferrer" class="c-link"&gt;
          FP8 for Deep Learning | NVIDIA On-Demand
        &lt;/a&gt;
      &lt;/h2&gt;
        &lt;p class="truncate-at-3"&gt;
          FP8 is a natural progression for accelerating deep learning (DL) training beyond the 16-bit formats common in modern processors
        &lt;/p&gt;
      &lt;div class="color-secondary fs-s flex items-center"&gt;
        nvidia.com
      &lt;/div&gt;
    &lt;/div&gt;
&lt;/div&gt;



&lt;p&gt;To modify our model to use TE, we wrap TE’s specialized Transformer Layer with a custom transformer block class that conforms to timm’s &lt;a href="https://github.com/huggingface/pytorch-image-models/blob/v0.9.10/timm/models/vision_transformer.py#L114"&gt;block layer signature&lt;/a&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import transformer_engine.pytorch as te
from transformer_engine.common import recipe


class TE_Block(te.transformer.TransformerLayer):
    def __init__(
            self,
            dim,
            num_heads,
            mlp_ratio=4.,
            qkv_bias=False,
            qk_norm=False,
            proj_drop=0.,
            attn_drop=0.,
            init_values=None,
            drop_path=0.,
            act_layer=None,
            norm_layer=None,
            mlp_layer=None
    ):
        super().__init__(
            hidden_size=dim,
            ffn_hidden_size=int(dim * mlp_ratio),
            num_attention_heads=num_heads,
            hidden_dropout=proj_drop,
            attention_dropout=attn_drop
            )
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Next, we modify the VisionTransformer initialization to use our custom block layer:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;  model = VisionTransformer(
      embed_dim=1280,
      depth=32,
      num_heads=16,
      block_fn=TE_Block
      ).cuda(device)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Until now we have not made any H100-specific changes — the same code can be run on our A100-powered p4d instance type. The last modification is wrapping the model forward-pass with a te.fp8_autocast context manager. This change requires a GPU that supports FP8:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;with torch.autocast(device_type='cuda', dtype=torch.bfloat16):
    with te.fp8_autocast(enabled=True):
        outputs = model(inputs)
    # the loss is computed outside the FP8 context, in BF16
    loss = criterion(outputs, label)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
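&lt;p&gt;As a sanity check before enabling te.fp8_autocast, you can verify that the underlying GPU actually supports FP8. The helper below is an illustrative sketch (not part of TE or PyTorch), based on the fact that FP8 tensor cores first appeared on devices of compute capability 8.9 (Ada) and 9.0 (Hopper):&lt;/p&gt;

```python
def fp8_supported() -> bool:
    """Best-effort check for FP8-capable hardware (illustrative helper)."""
    try:
        import torch
    except ImportError:
        return False
    if not torch.cuda.is_available():
        return False
    major, minor = torch.cuda.get_device_capability()
    # FP8 tensor cores ship with Ada (sm_89) and Hopper (sm_90) GPUs.
    return (major, minor) >= (8, 9)
```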



&lt;h3&gt;
  
  
  A Few Cautionary Remarks Regarding the Use of FP8
&lt;/h3&gt;

&lt;p&gt;Usage of an 8-bit floating-point representation (as opposed to a 16- or 32-bit representation) implies lower precision and a lower dynamic range. These can have a meaningful impact on the attainability and/or speed of your model convergence. Although the underlying TE FP8 implementation is designed to address this challenge, there is no guarantee that this will work for your model. You may need to fiddle with the underlying FP8 mechanics (e.g., using the TE &lt;a href="https://github.com/NVIDIA/TransformerEngine/blob/release_v0.12/transformer_engine/common/recipe.py"&gt;recipe&lt;/a&gt; APIs), tune some of the hyperparameters, and/or limit the application of FP8 to subsections of the model. You might find that despite all of your attempts, your model is simply not compatible with FP8.&lt;/p&gt;
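&lt;p&gt;To make the recipe mechanics a bit more concrete: TE’s default "delayed scaling" approach derives each tensor’s FP8 scaling factor from the maximum absolute value (amax) observed over a short history of recent steps, rather than from the current tensor alone. The pure-Python sketch below illustrates the idea; the names and the margin convention are illustrative, not TE internals:&lt;/p&gt;

```python
# Largest representable magnitude in the FP8 E4M3 format.
FP8_E4M3_MAX = 448.0

def delayed_scale(amax_history, margin=0):
    """Scaling factor from a window of recent amax values (sketch).

    `margin` leaves extra powers of two of headroom below the format
    maximum, guarding against amax spikes between scale updates.
    """
    amax = max(amax_history)
    return FP8_E4M3_MAX / (amax * 2.0 ** margin)

# amax values recorded over the last three training steps:
history = [1.5, 3.0, 2.2]
scale = delayed_scale(history)  # = 448.0 / 3.0
# values are multiplied by the scale before being cast to FP8
quantized_range = [x * scale for x in (0.5, -3.0, 1.0)]
```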

&lt;h1&gt;
  
  
  Results
&lt;/h1&gt;

&lt;p&gt;In the table below we summarize the results of our experiments on the p4d.24xlarge and p5.48xlarge EC2 instance types, with and without the TE library. For the p5.48xlarge experiments we doubled the batch size in order to increase the utilization of the 80 GB of GPU memory. Using FP8 reduces GPU memory consumption, enabling a further increase in batch size.&lt;br&gt;
&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--cFYXV44V--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/4whovmmk5bwtx7qsbq5y.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--cFYXV44V--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/4whovmmk5bwtx7qsbq5y.png" alt="Image description" width="800" height="234"&gt;&lt;/a&gt;&lt;br&gt;
We can see that the use of the TE transformer block increased the price-performance on both the p4d (~19%) and the p5 (~32%) instance types. Using FP8 boosts the performance on the p5 by an additional ~20%. Following the TE and FP8 optimizations, &lt;strong&gt;the price-performance of the H100-based p5.48xlarge beats that of the A100-based p4d.24xlarge&lt;/strong&gt; — although not by a very wide margin (~2%). Taking into account the 3x increase in training speed, we can safely conclude that the p5 would be the better instance type for training our optimized model.&lt;/p&gt;

&lt;p&gt;Note that the relatively small increase in price-performance (far lower than the 40% mentioned in the p5 announcement) leaves us wishing for additional H100-specific optimizations… but those will have to wait for another post :).&lt;/p&gt;

&lt;h1&gt;
  
  
  Summary
&lt;/h1&gt;

&lt;p&gt;In this post we have demonstrated how to program a PyTorch training script to use 8-bit floating-point types. We further demonstrated how the use of FP8 can be a key factor in getting the best performance out of modern GPUs such as the Nvidia H100. Importantly, the viability of FP8, as well as its impact on training performance, can vary a great deal based on the details of the model.&lt;/p&gt;

&lt;p&gt;This post continues a long series of publications on the topic of optimizing machine learning workloads. Be sure to see some of our &lt;a href="https://chaimrand.medium.com/"&gt;other posts&lt;/a&gt; on this important topic.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>deeplearning</category>
    </item>
  </channel>
</rss>
