<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Chaim Rand</title>
    <description>The latest articles on DEV Community by Chaim Rand (@crand).</description>
    <link>https://dev.to/crand</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F1238496%2F1f9d3215-1d54-492e-b490-90eef168a787.png</url>
      <title>DEV Community: Chaim Rand</title>
      <link>https://dev.to/crand</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/crand"/>
    <language>en</language>
    <item>
      <title>Streaming Data from Cloud Storage with Mountpoint for Amazon S3</title>
      <dc:creator>Chaim Rand</dc:creator>
      <pubDate>Tue, 11 Feb 2025 13:55:44 +0000</pubDate>
      <link>https://dev.to/aws-builders/streaming-data-from-cloud-storage-with-mountpoint-for-amazon-s3-39p9</link>
      <guid>https://dev.to/aws-builders/streaming-data-from-cloud-storage-with-mountpoint-for-amazon-s3-39p9</guid>
      <description>&lt;h2&gt;
  
  
  A First Look at a New Solution for Mounting Cloud Based Data
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fmiro.medium.com%2Fv2%2Fresize%3Afit%3A700%2F0%2A0iCVeMLVE4xE2_vP" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fmiro.medium.com%2Fv2%2Fresize%3Afit%3A700%2F0%2A0iCVeMLVE4xE2_vP" width="700" height="465"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Photo by &lt;a href="https://unsplash.com/@simonfitall?utm_source=medium&amp;amp;utm_medium=referral" rel="noopener noreferrer"&gt;Simon Fitall&lt;/a&gt; on &lt;a href="https://unsplash.com/?utm_source=medium&amp;amp;utm_medium=referral" rel="noopener noreferrer"&gt;Unsplash&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;These days, AI has become pretty much synonymous with collecting and maintaining large amounts of data. This data is typically stored in a central location and accessed at multiple phases of AI application development. An important factor in designing a data storage solution is the speed and efficiency at which the data can be accessed, as this can have a meaningful impact on the speed and cost of development. In our AI development team, we use cloud object storage services such as &lt;a href="https://aws.amazon.com/s3/" rel="noopener noreferrer"&gt;Amazon S3&lt;/a&gt; to store enormous amounts of data. Consequently, we are obsessed with finding the fastest (and cheapest) ways of consuming cloud-based data for a variety of different scenarios.&lt;/p&gt;

&lt;p&gt;In previous posts (e.g., &lt;a href="https://towardsdatascience.com/streaming-big-data-files-from-cloud-storage-634e54818e75" rel="noopener noreferrer"&gt;here&lt;/a&gt;, &lt;a href="https://towardsdatascience.com/training-in-pytorch-from-amazon-s3-6156d5342d1" rel="noopener noreferrer"&gt;here&lt;/a&gt;, and &lt;a href="https://towardsdatascience.com/training-from-cloud-storage-with-s5cmd-5c8fb5c06056" rel="noopener noreferrer"&gt;here&lt;/a&gt;), we described a number of different tools and techniques for pulling data from the cloud and demonstrated their application to various use cases. It is only natural, then, that with &lt;a href="https://aws.amazon.com/about-aws/whats-new/2023/03/mountpoint-amazon-s3/" rel="noopener noreferrer"&gt;the introduction&lt;/a&gt; of a new option for accessing cloud-based data, we would eagerly set out to explore its capabilities.&lt;/p&gt;

&lt;p&gt;In this post, we will describe our first impressions of &lt;a href="https://github.com/awslabs/mountpoint-s3" rel="noopener noreferrer"&gt;Mountpoint for Amazon S3&lt;/a&gt; - a new open-source solution for interfacing with cloud storage - and assess its performance on two use cases that are of particular interest to us: streaming sequential blocks of relatively large data files (as detailed &lt;a href="https://towardsdatascience.com/streaming-big-data-files-from-cloud-storage-634e54818e75" rel="noopener noreferrer"&gt;here&lt;/a&gt;) and consuming a large number of relatively small data files (as detailed &lt;a href="https://towardsdatascience.com/training-from-cloud-storage-with-s5cmd-5c8fb5c06056" rel="noopener noreferrer"&gt;here&lt;/a&gt;). For the sake of brevity, we will refer to Mountpoint for Amazon S3 simply as Mountpoint.&lt;/p&gt;

&lt;p&gt;Importantly, keep in mind that, as of the time of this writing, Mountpoint remains under &lt;a href="https://github.com/orgs/awslabs/projects/84/views/1" rel="noopener noreferrer"&gt;active development&lt;/a&gt;. You are strongly advised to stay up to date with the latest release of this tool (and all alternative tools) in order to make the most informed design decisions for your AI projects.&lt;/p&gt;

&lt;p&gt;Although it can support other endpoints, Mountpoint prioritizes performance against Amazon S3. As such, the examples below will be run using Amazon's cloud services. However, our choice of cloud service provider - or the mention of any other tools, frameworks, or APIs - should not be viewed as an endorsement over their alternatives. Furthermore, please do not view this post as a replacement for the existing official documentation (e.g., &lt;a href="https://github.com/awslabs/mountpoint-s3" rel="noopener noreferrer"&gt;here&lt;/a&gt; and &lt;a href="https://aws.amazon.com/blogs/storage/the-inside-story-on-mountpoint-for-amazon-s3-a-high-performance-open-source-file-client/" rel="noopener noreferrer"&gt;here&lt;/a&gt;).&lt;/p&gt;

&lt;h1&gt;
  
  
  Yet Another FUSE Based Object Storage Access Solution
&lt;/h1&gt;

&lt;p&gt;While there are many solutions for reading from and writing to file objects in the cloud, we can broadly divide them into two categories:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt; Explicit Data Transfer - Solutions that involve explicitly downloading data from the cloud for reading and uploading data for writing.&lt;/li&gt;
&lt;li&gt; File-System Abstraction - Solutions that abstract cloud storage interactions behind a file-system-style interface, allowing seamless access to cloud-hosted files.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The second approach is often implemented using the &lt;a href="https://en.wikipedia.org/wiki/Filesystem_in_Userspace" rel="noopener noreferrer"&gt;Filesystem in Userspace (FUSE)&lt;/a&gt; interface, enabling cloud-based data buckets to be mounted as local directories. This allows existing applications to interact with cloud storage just as they would with a traditional file system - requiring little to no modification.&lt;/p&gt;
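To make the appeal concrete, here is a minimal sketch of our own (not taken from any of the tools' documentation): the reader below uses only standard Python file I/O, so pointing it at a FUSE mount of a bucket instead of a local directory requires no code changes. The mount path mentioned in the comment is a hypothetical example.

```python
import os
import tempfile

def read_block(base_path: str, key: str, offset: int, size: int) -> bytes:
    """Read `size` bytes at `offset` from a file. The code is identical
    whether base_path is a local directory or a FUSE mount of a bucket."""
    with open(os.path.join(base_path, key), 'rb') as f:
        f.seek(offset)
        return f.read(size)

# Demo against a local directory; with a FUSE solution, base_path would
# simply point at the mount directory (e.g., a hypothetical '/mnt/my-bucket').
with tempfile.TemporaryDirectory() as base_path:
    with open(os.path.join(base_path, 'sample.bin'), 'wb') as f:
        f.write(bytes(range(256)))  # 256 bytes: 0, 1, ..., 255
    block = read_block(base_path, 'sample.bin', offset=16, size=4)
    print(block)  # bytes 16..19
```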

&lt;p&gt;Mountpoint is a new FUSE-based solution written in the &lt;a href="https://www.rust-lang.org/" rel="noopener noreferrer"&gt;Rust&lt;/a&gt; programming language and based on a &lt;a href="https://github.com/cberner/fuser" rel="noopener noreferrer"&gt;Rust version&lt;/a&gt; of the &lt;a href="https://github.com/libfuse/libfuse/" rel="noopener noreferrer"&gt;Linux FUSE library&lt;/a&gt;. See &lt;a href="https://aws.amazon.com/blogs/storage/the-inside-story-on-mountpoint-for-amazon-s3-a-high-performance-open-source-file-client/" rel="noopener noreferrer"&gt;here&lt;/a&gt; for an explanation of the choice of Rust. Other popular tools in the FUSE-based family of solutions are &lt;a href="https://github.com/kahing/goofys" rel="noopener noreferrer"&gt;&lt;em&gt;goofys&lt;/em&gt;&lt;/a&gt; and &lt;a href="https://github.com/s3fs-fuse/s3fs-fuse" rel="noopener noreferrer"&gt;&lt;em&gt;s3fs&lt;/em&gt;&lt;/a&gt;.&lt;/p&gt;

&lt;h1&gt;
  
  
  Using Mountpoint
&lt;/h1&gt;

&lt;p&gt;To install Mountpoint, please follow the guidelines in the &lt;a href="https://docs.aws.amazon.com/AmazonS3/latest/userguide/mountpoint-installation.html" rel="noopener noreferrer"&gt;official documentation&lt;/a&gt;. The usage instructions of Mountpoint can be retrieved by running &lt;em&gt;mount-s3&lt;/em&gt; with the &lt;em&gt;help&lt;/em&gt; flag. The text block below includes the first few lines of the output as well as a few of the many options that allow us to tune the behavior of the client.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Mountpoint for Amazon S3

Usage: mount-s3 [OPTIONS] &amp;lt;BUCKET_NAME&amp;gt; &amp;lt;DIRECTORY&amp;gt;

Arguments:\
  &amp;lt;BUCKET_NAME&amp;gt;\
          Name of bucket to mount

  &amp;lt;DIRECTORY&amp;gt;\
          Directory or FUSE file descriptor to mount the bucket at.

Mount options:\
      --read-only\
          Mount file system in read-only mode

Client options:\
      --maximum-throughput-gbps &amp;lt;N&amp;gt;\
          Maximum throughput in Gbps [default: auto-detected on EC2\
          instances, 10 Gbps elsewhere]

      --max-threads &amp;lt;N&amp;gt;\
          Maximum number of FUSE daemon threads

          [default: 16]

      --part-size &amp;lt;SIZE&amp;gt;\
          Part size for multi-part GET and PUT in bytes

          [default: 8388608]

Caching options:\
      --cache &amp;lt;DIRECTORY&amp;gt;\
          Enable caching of object content to the given directory and set\
          metadata TTL to 60 seconds
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Mountpoint assumes appropriate configuration of your &lt;a href="https://docs.aws.amazon.com/sdkref/latest/guide/access.html" rel="noopener noreferrer"&gt;AWS credentials&lt;/a&gt;. Make sure to be aware of the current &lt;a href="https://github.com/awslabs/mountpoint-s3#current-status" rel="noopener noreferrer"&gt;documented limitations&lt;/a&gt; of Mountpoint, as well as any special &lt;a href="http://github.com/awslabs/mountpoint-s3/blob/main/doc/CONFIGURATION.md" rel="noopener noreferrer"&gt;configuration&lt;/a&gt; that might be required.&lt;/p&gt;
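As a quick illustration, one common way to supply credentials is through environment variables. The values below are placeholders shown purely as a sketch; an IAM role or a profile in your AWS credentials file works just as well.

```shell
# One common way to supply AWS credentials is via environment variables.
# The values below are placeholders -- substitute your own credentials,
# or rely on an IAM role or a profile in your AWS credentials file instead.
export AWS_ACCESS_KEY_ID="YOUR_ACCESS_KEY_ID"
export AWS_SECRET_ACCESS_KEY="YOUR_SECRET_ACCESS_KEY"
export AWS_DEFAULT_REGION="us-east-1"
```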

&lt;p&gt;In the next sections we will demonstrate the use of Mountpoint for Amazon S3 for two different use cases and compare its performance with &lt;em&gt;goofys&lt;/em&gt;. The experiments we will describe were conducted on an &lt;a href="https://aws.amazon.com/ec2/instance-types/c5/" rel="noopener noreferrer"&gt;Amazon EC2 c5.4xlarge&lt;/a&gt; instance (with 16 vCPUs). For the sake of simplicity, we chose an Ubuntu (22.04) &lt;a href="https://aws.amazon.com/releasenotes/aws-deep-learning-ami-gpu-pytorch-2-5-ubuntu-22-04/" rel="noopener noreferrer"&gt;AWS Deep Learning AMI&lt;/a&gt;, preinstalled with Python (3.11) and PyTorch (2.5.1). To install &lt;em&gt;mount-s3&lt;/em&gt; and &lt;em&gt;goofys&lt;/em&gt; we ran the following commands:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# install goofys\
sudo curl -Lo \\
  /usr/local/bin/goofys \\
  https://github.com/kahing/goofys/releases/latest/download/goofys\
sudo chmod +x /usr/local/bin/goofys

# install mount-s3\
wget \\
  https://s3.amazonaws.com/mountpoint-s3-release/latest/x86_64/mount-s3.deb\
sudo dpkg -i mount-s3.deb\
sudo apt-get install -f -y
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;We used the following command lines for mounting and un-mounting:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Mountpoint\
mount-s3 --read-only &amp;lt;s3_bucket_name&amp;gt; &amp;lt;local_path&amp;gt;

# goofys\
goofys -o ro &amp;lt;s3_bucket_name&amp;gt; &amp;lt;local_path&amp;gt;

# disable mount\
fusermount -z -u &amp;lt;local_path&amp;gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Keep in mind that the comparative performance results we will share are very much dependent on the details of the environment in which they were run. Furthermore, it is quite likely that with appropriate tuning of the command line controls we could have improved the performance of both the Mountpoint and &lt;em&gt;goofys&lt;/em&gt; trials. We strongly encourage you to conduct your own experiments before drawing conclusions for your own project.&lt;/p&gt;
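For example, a tuned invocation might raise the number of FUSE daemon threads and the part size using the flags from the help text above. The flag values and the bucket/mount placeholders below are purely illustrative, not recommendations; for safety, this sketch only assembles and prints the command line rather than running it.

```shell
# Hypothetical tuned invocation built from flags shown in the help text
# above: more FUSE daemon threads and a 16 MiB part size for multi-part
# GETs. Bucket and mount directory are placeholders.
BUCKET="your-bucket-name"
MOUNT_DIR="/path/to/mount"
TUNED_CMD="mount-s3 --read-only --max-threads 32 --part-size 16777216 $BUCKET $MOUNT_DIR"
echo "$TUNED_CMD"
```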

&lt;h2&gt;
  
  
  An Important Note About the Costs of Pulling Data from Amazon S3
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://aws.amazon.com/s3/pricing/" rel="noopener noreferrer"&gt;Amazon S3 pricing&lt;/a&gt; consists of several components, one of which is based on the number of API calls (e.g., GET, SELECT, PUT, etc.). When using FUSE-based solutions such as Mountpoint or goofys, these API calls are abstracted away from the user, making it more difficult to directly assess the cost of reading data from Amazon S3 compared to explicitly pulling the data. Additionally, the number of API calls - and their associated costs - can be affected by the choice of command-line options.&lt;/p&gt;

&lt;p&gt;A comparative cost analysis of different methods for streaming data from Amazon S3 is beyond the scope of this post, but conducting such an analysis is highly recommended before selecting the best approach for your needs.&lt;/p&gt;
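That said, a rough back-of-envelope sketch can illustrate how the part size translates into GET-request counts and therefore cost. The price constant below is an illustrative placeholder of our own; always check the Amazon S3 pricing page for current numbers in your region.

```python
# Back-of-envelope estimate of S3 GET-request costs when streaming a
# large file in part-sized chunks. The price is an illustrative
# placeholder -- consult the Amazon S3 pricing page for real numbers.
GET_PRICE_PER_1000 = 0.0004   # illustrative $ per 1,000 GET requests
MB = 1024 * 1024
GB = 1024 * MB

def estimated_get_cost(file_size_bytes: int, part_size_bytes: int) -> float:
    """Number of part-sized GETs needed to read the whole file, times price."""
    num_requests = -(-file_size_bytes // part_size_bytes)  # ceiling division
    return num_requests / 1000 * GET_PRICE_PER_1000

# A 2 GB file read in 8 MiB parts (the default part size from the help
# text above) requires 256 GET requests:
print(estimated_get_cost(2 * GB, 8 * MB))
```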

&lt;h1&gt;
  
  
  Streaming Large Data Files
&lt;/h1&gt;

&lt;p&gt;In our first experiment, we evaluated the performance of traversing a 2 GB binary file stored in the cloud. This file was assumed to contain 2,048 blocks of data (e.g., frames or data samples), each 1 MB in size.&lt;/p&gt;
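For readers who wish to reproduce the setup, a file with this layout can be generated with a short helper of our own (not part of the original experiment code); the resulting file would then be uploaded to S3.

```python
import os

MB = 1024 * 1024

def write_dummy_file(path: str, num_blocks: int = 2048, block_size: int = MB):
    """Create a binary file of num_blocks fixed-size blocks, mirroring the
    2 GB file of 2,048 x 1 MB blocks used in the experiment. Each block is
    filled with its (mod 256) index so blocks are distinguishable."""
    with open(path, 'wb') as f:
        for i in range(num_blocks):
            f.write((i % 256).to_bytes(1, 'little') * block_size)

# Small demo (4 blocks of 1 KB) to keep it cheap; the experiment itself
# used the defaults: num_blocks=2048, block_size=MB.
write_dummy_file('/tmp/dummy.bin', num_blocks=4, block_size=1024)
print(os.path.getsize('/tmp/dummy.bin'))  # 4096
```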

&lt;p&gt;The code block below demonstrates routines for sequentially reading through the file and for sampling data at non-sequential file offsets. Please see our &lt;a href="https://towardsdatascience.com/streaming-big-data-files-from-cloud-storage-634e54818e75" rel="noopener noreferrer"&gt;previous posts&lt;/a&gt; for more details on how we designed the experiment and how we chose the metrics for comparison.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import time

KB = 1024\
MB = KB * KB

def read_sequential(f, t0):\
    t1 = time.time()\
    x = f.read(MB)\
    print(f'time of first sample: {time.time() - t1}')\
    print(f'total to first sample: {time.time() - t0}')\
    t1 = time.time()\
    count = 0\
    while True:\
        x = f.read(MB)\
        if len(x) == 0:\
            break\
        count += 1\
    print(f'time of avg read: {(time.time() - t1)/count}')

def fast_forward(f):\
    t1 = time.time()\
    total = 10\
    for i in range(total):\
        f.seek(i * 100 * MB)\
        t1 = time.time()\
        x = f.read(MB)\
    print(f'time of avg random read: {(time.time() - t1)/total}')

key = '&amp;lt;s3 key&amp;gt;'\
mount_dir = '&amp;lt;local mount&amp;gt;'\
sequential = True # toggle flag to run fast_forward

t0 = time.time()\
with open(f'{mount_dir}/{key}', 'rb') as f:\
    if sequential:\
        read_sequential(f, t0)\
        print(f'total time: {time.time()-t0}')\
    else:\
        fast_forward(f)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;In the table below we compare the results we received.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fy92ccki3hy08b9c6v1a3.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fy92ccki3hy08b9c6v1a3.png" width="700" height="125"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Comparative Results of Pulling 2 GB File from S3 (by Author)&lt;/p&gt;

&lt;p&gt;Although Mountpoint is slightly slower than &lt;em&gt;goofys&lt;/em&gt; in loading the first frame, it outperforms &lt;em&gt;goofys&lt;/em&gt; in all other metrics. For a comparison with other methods for streaming large files from the cloud, please refer to our &lt;a href="https://towardsdatascience.com/streaming-big-data-files-from-cloud-storage-634e54818e75" rel="noopener noreferrer"&gt;previous post&lt;/a&gt;.&lt;/p&gt;

&lt;h1&gt;
  
  
  Consuming a Large Number of Small Files
&lt;/h1&gt;

&lt;p&gt;In our second experiment we assessed the speed of feeding hundreds of thousands of individual cloud-based data samples into a deep learning training environment. The code block below demonstrates the creation of a custom &lt;a href="https://pytorch.org/tutorials/beginner/data_loading_tutorial.html#dataset-class" rel="noopener noreferrer"&gt;PyTorch Dataset&lt;/a&gt; for loading training samples from the local mount. We measured the speed of traversing thousands of image-label pairs, where each file was 1 MB in size. Please see this &lt;a href="https://towardsdatascience.com/training-from-cloud-storage-with-s5cmd-5c8fb5c06056" rel="noopener noreferrer"&gt;previous post&lt;/a&gt; for more details on how we designed the experiment.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;from torch.utils.data import Dataset\
import os\
class SingleSampleDataset(Dataset):\
    def __init__(self):\
        super().__init__()\
        self.base_path = '&amp;lt;local_mount&amp;gt;'

    def __len__(self):\
        return 10000

    def get_from_files(self, image_path, label_path):\
        image_file = open(image_path, 'rb')\
        label_file = open(label_path, 'rb')\
        image = image_file.read()\
        label = label_file.read()\
        image_file.close()\
        label_file.close()\
        return {"image": image, "label": label}

    def __getitem__(self, index: int):\
        image_path = os.path.join(self.base_path, f'{index}.image')\
        label_path = os.path.join(self.base_path, f'{index}.label')\
        return self.get_from_files(image_path, label_path)

def get_dataset():\
    return SingleSampleDataset()

import torch, time\
from statistics import mean, variance\
dataset = get_dataset()\
dl = torch.utils.data.DataLoader(dataset, batch_size=4, num_workers=16)\
stats_lst = []\
t0 = time.perf_counter()\
for batch_idx, batch in enumerate(dl, start=1):\
    t = time.perf_counter() - t0\
    print(f'Iteration {batch_idx} Time {t}')\
    stats_lst.append(t)\
    t0 = time.perf_counter()\
mean_calc = mean(stats_lst[1:])\
var_calc = variance(stats_lst[1:])\
print(f'mean {mean_calc} variance {var_calc}')
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;When downloading large numbers of small files, goofys outperformed Mountpoint, averaging 0.08 seconds per sample compared to Mountpoint's 0.11 seconds. We surmise that the overhead observed at the start of a file in the previous experiment has a more significant impact when dealing with many small files. For results of other methods for consuming large numbers of small files from the cloud, refer to our &lt;a href="https://towardsdatascience.com/training-from-cloud-storage-with-s5cmd-5c8fb5c06056" rel="noopener noreferrer"&gt;previous post&lt;/a&gt;.&lt;/p&gt;

&lt;h1&gt;
  
  
  Summary
&lt;/h1&gt;

&lt;p&gt;It's great to see a new actor in the cloud data streaming space, especially one &lt;a href="https://aws.amazon.com/blogs/storage/the-inside-story-on-mountpoint-for-amazon-s3-a-high-performance-open-source-file-client/" rel="noopener noreferrer"&gt;explicitly intent&lt;/a&gt; on addressing the challenges faced by modern data applications. Some key highlights we found:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  Performance Tuning - Mountpoint includes many controls that allow fine-tuning for improved performance.&lt;/li&gt;
&lt;li&gt;  Large File Streaming - When traversing large files, Mountpoint outperformed the solution we compared it to.&lt;/li&gt;
&lt;li&gt;  Ongoing Enhancements - Unlike other FUSE-based solutions, Mountpoint is under active development and expected to introduce further improvements.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;One area for improvement we identified is mass-downloading many small files, where Mountpoint underperformed compared to &lt;em&gt;goofys&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;As Mountpoint for Amazon S3 continues to evolve, we look forward to seeing it extend and enhance its capabilities.&lt;/p&gt;

</description>
      <category>amazon</category>
      <category>s3</category>
      <category>mountpoint</category>
    </item>
    <item>
      <title>On the Programmability of AWS Trainium and Inferentia</title>
      <dc:creator>Chaim Rand</dc:creator>
      <pubDate>Sun, 03 Nov 2024 14:48:37 +0000</pubDate>
      <link>https://dev.to/crand/on-the-programmability-of-aws-trainium-and-inferentia-4ick</link>
      <guid>https://dev.to/crand/on-the-programmability-of-aws-trainium-and-inferentia-4ick</guid>
      <description>&lt;h2&gt;
  
  
  Accelerating AI/ML Model Training with Custom Operators — Part 4
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fmiro.medium.com%2Fv2%2Fresize%3Afit%3A700%2F0%2AReIcCWndeJTnS-0U" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fmiro.medium.com%2Fv2%2Fresize%3Afit%3A700%2F0%2AReIcCWndeJTnS-0U" width="700" height="427"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Photo by &lt;a href="https://unsplash.com/@bresia?utm_source=medium&amp;amp;utm_medium=referral" rel="noopener noreferrer"&gt;Agata Bres&lt;/a&gt; on &lt;a href="https://unsplash.com/?utm_source=medium&amp;amp;utm_medium=referral" rel="noopener noreferrer"&gt;Unsplash&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;In this post we continue our exploration of the opportunities for runtime optimization of machine learning (ML) workloads through custom operator development. This time, we focus on the tools provided by the &lt;a href="https://awsdocs-neuron.readthedocs-hosted.com/en/latest/index.html" rel="noopener noreferrer"&gt;AWS Neuron SDK&lt;/a&gt; for developing and running new kernels on &lt;a href="https://aws.amazon.com/machine-learning/trainium/" rel="noopener noreferrer"&gt;AWS Trainium&lt;/a&gt; and &lt;a href="https://aws.amazon.com/machine-learning/inferentia/" rel="noopener noreferrer"&gt;AWS Inferentia&lt;/a&gt;. With the rapid development of the low-level model components (e.g., &lt;a href="https://en.wikipedia.org/wiki/Attention_(machine_learning)" rel="noopener noreferrer"&gt;attention layers&lt;/a&gt;) driving the AI revolution, the programmability of the accelerators used for training and running ML models is crucial. Dedicated AI chips, in particular, must offer a worthy alternative to the widely used and highly impactful general-purpose GPU (GPGPU) development frameworks, such as &lt;a href="https://developer.nvidia.com/cuda-toolkit" rel="noopener noreferrer"&gt;CUDA&lt;/a&gt; and &lt;a href="https://triton-lang.org/main/index.html" rel="noopener noreferrer"&gt;Triton&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;In previous posts (e.g., &lt;a href="https://towardsdatascience.com/ai-model-optimization-on-aws-inferentia-and-trainium-cfd48e85d5ac" rel="noopener noreferrer"&gt;here&lt;/a&gt; and &lt;a href="https://towardsdatascience.com/a-first-look-at-aws-trainium-1e0605071970" rel="noopener noreferrer"&gt;here&lt;/a&gt;) we explored the opportunity for building and running ML models on AWS's custom-built AI chips using the dedicated &lt;a href="https://awsdocs-neuron.readthedocs-hosted.com/en/latest/" rel="noopener noreferrer"&gt;AWS Neuron SDK&lt;/a&gt;. In its most recent release of the SDK (version &lt;a href="https://awsdocs-neuron.readthedocs-hosted.com/en/latest/release-notes/index.html#id8" rel="noopener noreferrer"&gt;2.20.0&lt;/a&gt;), AWS introduced the &lt;a href="https://awsdocs-neuron.readthedocs-hosted.com/en/latest/general/nki/index.html" rel="noopener noreferrer"&gt;Neuron Kernel Interface (NKI)&lt;/a&gt; for developing custom kernels for &lt;a href="https://awsdocs-neuron.readthedocs-hosted.com/en/latest/general/arch/neuron-hardware/neuron-core-v2.html" rel="noopener noreferrer"&gt;NeuronCore-v2&lt;/a&gt;, the underlying accelerator powering both &lt;a href="https://aws.amazon.com/machine-learning/trainium/" rel="noopener noreferrer"&gt;Trainium&lt;/a&gt; and &lt;a href="https://aws.amazon.com/machine-learning/inferentia/" rel="noopener noreferrer"&gt;Inferentia2&lt;/a&gt;. The NKI interface joins another API that enables &lt;a href="https://awsdocs-neuron.readthedocs-hosted.com/en/latest/general/arch/neuron-hardware/neuron-core-v2.html" rel="noopener noreferrer"&gt;NeuronCore-v2&lt;/a&gt; programmability, &lt;a href="https://awsdocs-neuron.readthedocs-hosted.com/en/latest/neuron-customops/index.html" rel="noopener noreferrer"&gt;Neuron Custom C++ Operators&lt;/a&gt;. In this post we will explore both opportunities and demonstrate them in action.&lt;/p&gt;

&lt;h2&gt;
  
  
  Disclaimers
&lt;/h2&gt;

&lt;p&gt;Importantly, this post should not be viewed as a substitute for the official &lt;a href="https://awsdocs-neuron.readthedocs-hosted.com/en/latest/index.html" rel="noopener noreferrer"&gt;AWS Neuron SDK documentation&lt;/a&gt;. At the time of this writing the Neuron SDK APIs for custom kernel development are in beta, and may change by the time you read this. The examples we share are intended for demonstrative purposes only. We make no claims as to their optimality, robustness, durability, or accuracy. Please do not view our mention of any platforms, tools, APIs, etc., as an endorsement for their use. The best choices for any project depend on the specifics of the use-case at hand and warrant appropriate investigation and analysis.&lt;/p&gt;

&lt;h1&gt;
  
  
  Developing Custom Kernels for Neuron Cores
&lt;/h1&gt;

&lt;p&gt;Although the list of ML models supported by the Neuron SDK is continuously growing, some operations remain either unsupported or implemented suboptimally. By exposing APIs for Neuron kernel customization, the SDK empowers developers to create and/or optimize the low-level operations that they need, greatly increasing the opportunity for running ML workloads on Trainium and Inferentia.&lt;/p&gt;

&lt;p&gt;As discussed in our &lt;a href="https://towardsdatascience.com/accelerating-ai-ml-model-training-with-custom-operators-163ef2a04b12" rel="noopener noreferrer"&gt;previous posts&lt;/a&gt; in this series, fully leveraging the power of these AI chips requires a detailed understanding of their low-level architecture.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Neuron Core Architecture
&lt;/h2&gt;

&lt;p&gt;The NKI documentation includes a &lt;a href="https://awsdocs-neuron.readthedocs-hosted.com/en/latest/general/nki/trainium_inferentia2_arch.html#" rel="noopener noreferrer"&gt;dedicated section&lt;/a&gt; on the architecture design of NeuronCore-v2 and its implications on custom operator development. Importantly, there are many differences between Neuron cores and their AI accelerator counterparts (e.g., GPUs and TPUs). Optimizing for Neuron cores requires a unique set of strategies and skills.&lt;/p&gt;

&lt;p&gt;Similar to other dedicated AI chips, NeuronCore-v2 includes several internal &lt;a href="https://awsdocs-neuron.readthedocs-hosted.com/en/latest/general/nki/trainium_inferentia2_arch.html#neuroncore-v2-compute-engines" rel="noopener noreferrer"&gt;acceleration engines&lt;/a&gt;, each of which specializes in performing certain types of computations. The engines can be run asynchronously and in parallel. The &lt;a href="https://awsdocs-neuron.readthedocs-hosted.com/en/latest/compiler/index.html" rel="noopener noreferrer"&gt;Neuron Compiler&lt;/a&gt; is responsible for transforming ML models into low-level operations and optimizing the choice of compute engine for each one.&lt;/p&gt;

&lt;p&gt;The &lt;a href="https://awsdocs-neuron.readthedocs-hosted.com/en/latest/general/nki/trainium_inferentia2_arch.html#tensor-engine" rel="noopener noreferrer"&gt;Tensor engine&lt;/a&gt; specializes in matrix multiplication. The &lt;a href="https://awsdocs-neuron.readthedocs-hosted.com/en/latest/general/nki/trainium_inferentia2_arch.html#vector-engine" rel="noopener noreferrer"&gt;Vector&lt;/a&gt; and &lt;a href="https://awsdocs-neuron.readthedocs-hosted.com/en/latest/general/nki/trainium_inferentia2_arch.html#scalar-engine" rel="noopener noreferrer"&gt;Scalar&lt;/a&gt; engines both operate on tensors, with the Vector engine specializing in reduction operations and the Scalar engine in non-linear functions. &lt;a href="https://awsdocs-neuron.readthedocs-hosted.com/en/latest/general/nki/trainium_inferentia2_arch.html#gpsimd-engine" rel="noopener noreferrer"&gt;GpSimd&lt;/a&gt; is a general-purpose engine capable of running arbitrary C/C++ programs. Note that while the &lt;a href="https://awsdocs-neuron.readthedocs-hosted.com/en/latest/general/nki/index.html" rel="noopener noreferrer"&gt;NKI&lt;/a&gt; interface exposes access to all four compute engines, &lt;a href="https://awsdocs-neuron.readthedocs-hosted.com/en/latest/neuron-customops/index.html" rel="noopener noreferrer"&gt;custom C++ operators&lt;/a&gt; are designed specifically for the &lt;a href="https://awsdocs-neuron.readthedocs-hosted.com/en/latest/general/nki/trainium_inferentia2_arch.html#gpsimd-engine" rel="noopener noreferrer"&gt;GpSimd&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;More details on the capabilities of each engine can be found in the architecture documentation. Furthermore, the &lt;a href="https://awsdocs-neuron.readthedocs-hosted.com/en/latest/general/nki/api/nki.isa.html" rel="noopener noreferrer"&gt;NKI Instruction Set Architecture (ISA)&lt;/a&gt; documentation provides details on the engines on which different low-level operations are run.&lt;/p&gt;

&lt;p&gt;Another important aspect of the Neuron chip is its &lt;a href="https://awsdocs-neuron.readthedocs-hosted.com/en/latest/general/nki/trainium_inferentia2_arch.html#data-movement" rel="noopener noreferrer"&gt;memory architecture&lt;/a&gt;. A Neuron device includes three types of memory, HBM, SBUF, and PSUM. An intimate understanding of the capacities and capabilities of each one is crucial for optimal kernel development.&lt;/p&gt;

&lt;p&gt;Given the architecture overview, you might conclude that Neuron kernel development requires a high level of expertise. While this may be true for creating fully optimized kernels that leverage all the capabilities of the Neuron core, our aim is to demonstrate the accessibility, value, and potential of the Neuron custom kernel APIs - even for non-expert developers.&lt;/p&gt;

&lt;h1&gt;
  
  
  Custom NKI Kernels
&lt;/h1&gt;

&lt;p&gt;The &lt;a href="https://awsdocs-neuron.readthedocs-hosted.com/en/latest/general/nki/index.html" rel="noopener noreferrer"&gt;NKI&lt;/a&gt; interface is a Python-level API that exposes the use of the Neuron core compute engines and memory resources to ML developers. The &lt;a href="https://awsdocs-neuron.readthedocs-hosted.com/en/latest/general/nki/getting_started.html" rel="noopener noreferrer"&gt;NKI Getting Started&lt;/a&gt; guide details the setup instructions and provides a soft landing with a simple, "hello world", kernel. The &lt;a href="https://awsdocs-neuron.readthedocs-hosted.com/en/latest/general/nki/programming_model.html" rel="noopener noreferrer"&gt;NKI Programming Model&lt;/a&gt; guide details the three stages of a typical NKI kernel (loading inputs, running operations on the computation engines, and storing outputs) and introduces the NKI Tile and &lt;a href="https://awsdocs-neuron.readthedocs-hosted.com/en/latest/general/nki/programming_model.html#nki-pm-tile" rel="noopener noreferrer"&gt;Tile-based operations&lt;/a&gt;. The &lt;a href="https://awsdocs-neuron.readthedocs-hosted.com/en/latest/general/nki/tutorials.html" rel="noopener noreferrer"&gt;NKI tutorials&lt;/a&gt; demonstrate a variety of NKI kernel sample applications, with each one introducing new core NKI APIs and capabilities. Given the presumed optimality of the sample kernels, one possible strategy for developing new kernels could be to 1) identify a sample that is similar to the operation you wish to implement and then 2) use it as a baseline and iteratively refine and adjust it to achieve the specific functionality you require.&lt;/p&gt;

&lt;p&gt;The &lt;a href="https://awsdocs-neuron.readthedocs-hosted.com/en/latest/general/nki/api/index.html#nki-api-reference" rel="noopener noreferrer"&gt;NKI API Reference Manual&lt;/a&gt; details the Python API for kernel development. With a syntax and semantics that are similar to &lt;a href="https://triton-lang.org/main/index.html" rel="noopener noreferrer"&gt;Triton&lt;/a&gt; and &lt;a href="https://numpy.org/doc/stable/" rel="noopener noreferrer"&gt;NumPy&lt;/a&gt;, the &lt;a href="https://awsdocs-neuron.readthedocs-hosted.com/en/latest/general/nki/api/nki.language.html" rel="noopener noreferrer"&gt;NKI language&lt;/a&gt; definition aims to maximize accessibility and ease of use. However, it is important to note that NKI kernel development is limited to the operations defined in the &lt;a href="https://awsdocs-neuron.readthedocs-hosted.com/en/latest/general/nki/api/nki.html" rel="noopener noreferrer"&gt;NKI&lt;/a&gt; library, which (as of the time of this writing) are fewer and more constrained than in libraries such as &lt;a href="https://triton-lang.org/main/index.html" rel="noopener noreferrer"&gt;Triton&lt;/a&gt; and &lt;a href="https://numpy.org/doc/stable/" rel="noopener noreferrer"&gt;NumPy&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  Toy Example - A GIOU Kernel
&lt;/h2&gt;

&lt;p&gt;As in our &lt;a href="https://towardsdatascience.com/accelerating-ai-ml-model-training-with-custom-operators-163ef2a04b12" rel="noopener noreferrer"&gt;previous posts&lt;/a&gt;, we assess the use of NKI by building a custom implementation of the &lt;a href="https://giou.stanford.edu/" rel="noopener noreferrer"&gt;Generalized Intersection Over Union (GIOU)&lt;/a&gt; operation on a pair of batches of input boxes. Since GIOU involves element-wise operations, we used the &lt;a href="https://awsdocs-neuron.readthedocs-hosted.com/en/latest/general/nki/programming_model.html#tile-size-considerations" rel="noopener noreferrer"&gt;&lt;em&gt;exp&lt;/em&gt; kernel&lt;/a&gt; from the &lt;a href="https://awsdocs-neuron.readthedocs-hosted.com/en/latest/general/nki/programming_model.html" rel="noopener noreferrer"&gt;NKI Programming&lt;/a&gt; guide as a reference point and incorporated the use of NKI's &lt;a href="https://awsdocs-neuron.readthedocs-hosted.com/en/latest/general/nki/programming_model.html#advanced-tensor-indexing" rel="noopener noreferrer"&gt;advanced tensor indexing&lt;/a&gt; in our implementation. To facilitate debugging in a CPU environment, we also added options to run the code using the &lt;a href="https://awsdocs-neuron.readthedocs-hosted.com/en/latest/general/nki/api/generated/nki.simulate_kernel.html#nki.simulate_kernel" rel="noopener noreferrer"&gt;nki.simulate_kernel&lt;/a&gt; and &lt;a href="https://awsdocs-neuron.readthedocs-hosted.com/en/latest/general/nki/api/generated/nki.language.device_print.html" rel="noopener noreferrer"&gt;nki.language.device_print&lt;/a&gt; APIs.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import torch
import neuronxcc.nki as nki
import neuronxcc.nki.language as nl
import numpy as np

simulate = False

try:
    # if torch libraries are installed assume that we are running on Neuron
    import torch_xla.core.xla_model as xm
    import torch_neuronx
    from torch_neuronx import nki_jit

    device = xm.xla_device()

    # empty implementation
    def debug_print(*args, **kwargs):
        pass
except ImportError:
    # if torch libraries are not installed assume that we are running on CPU
    # and program script to use nki simulation
    simulate = True
    nki_jit = nki.trace
    debug_print = nl.device_print
    device = 'cpu'

@nki_jit
def giou_kernel(preds_ptr,
                targets_ptr,
                output_ptr):
    epsilon = 1e-5
    TILE_M = nl.tile_size.pmax  # 128
    TILE_N = nl.tile_size.psum_fmax  # 512
    TILE_N_OUT = TILE_N // 4

    p_1, p_2 = preds_ptr.shape
    t_1, t_2 = targets_ptr.shape
    o_1, o_2 = output_ptr.shape

    #  verify input
    # batch size must be multiple of 128
    assert p_1 % TILE_M == 0
    assert p_1 == t_1
    assert p_1 == o_1
    # num boxes * 4 must be a multiple of 512
    assert p_2 % TILE_N == 0
    assert p_2 == t_2
    assert p_2 // 4 == o_2

    num_tiles_m = p_1 // TILE_M
    num_tiles_n = p_2 // TILE_N

    # Generate tensors for advanced indexing
    i_p = nl.arange(TILE_M)[:, None]
    i_f = nl.arange(TILE_N // 4)[None, :]
    i_f_0 = (4 * i_f)
    i_f_1 = (4 * i_f + 1)
    i_f_2 = (4 * i_f + 2)
    i_f_3 = (4 * i_f + 3)

    # Use affine_range to loop over tiles
    for m in nl.affine_range(num_tiles_m):
        for n in nl.affine_range(num_tiles_n):
            # Load input data from HBM
            preds = nl.load(preds_ptr[m * TILE_M:(m + 1) * TILE_M,
                            n * TILE_N:(n + 1) * TILE_N])
            targets = nl.load(targets_ptr[m * TILE_M:(m + 1) * TILE_M,
                              n * TILE_N:(n + 1) * TILE_N])
            debug_print('preds', preds)
            preds_left = preds[i_p, i_f_0]
            preds_top = preds[i_p, i_f_1]
            preds_right = preds[i_p, i_f_2]
            preds_bottom = preds[i_p, i_f_3]

            gt_left = targets[i_p, i_f_0]
            gt_top = targets[i_p, i_f_1]
            gt_right = targets[i_p, i_f_2]
            gt_bottom = targets[i_p, i_f_3]

            # Compute the area of each box
            area1 = (preds_right - preds_left) * (preds_bottom - preds_top)
            area2 = (gt_right - gt_left) * (gt_bottom - gt_top)

            # Compute the intersection
            left = nl.maximum(preds_left, gt_left)
            top = nl.maximum(preds_top, gt_top)
            right = nl.minimum(preds_right, gt_right)
            bottom = nl.minimum(preds_bottom, gt_bottom)

            inter_w = nl.maximum(right - left, 0)
            inter_h = nl.maximum(bottom - top, 0)
            inter_area = inter_w * inter_h

            union_area = area1 + area2 - inter_area

            iou_val = inter_area / nl.maximum(union_area, epsilon)

            # Compute the smallest enclosing box
            enclose_left = nl.minimum(preds_left, gt_left)
            enclose_top = nl.minimum(preds_top, gt_top)
            enclose_right = nl.maximum(preds_right, gt_right)
            enclose_bottom = nl.maximum(preds_bottom, gt_bottom)

            enclose_w = nl.maximum(enclose_right - enclose_left, 0)
            enclose_h = nl.maximum(enclose_bottom - enclose_top, 0)
            enclose_area = enclose_w * enclose_h

            # Compute GIOU
            delta_area = (enclose_area - union_area)
            enclose_area = nl.maximum(enclose_area, epsilon)
            giou = iou_val - delta_area / enclose_area

            # Store results
            nl.store(output_ptr[m * TILE_M:(m + 1) * TILE_M,
                     n * TILE_N_OUT:(n + 1) * TILE_N_OUT],
                     giou)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
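&lt;p&gt;The advanced tensor indexing used above (the &lt;em&gt;i_f_0&lt;/em&gt; through &lt;em&gt;i_f_3&lt;/em&gt; index tensors) deinterleaves the flattened (left, top, right, bottom) box coordinates. The same access pattern can be sketched in plain NumPy - a CPU-side analogy of our own, not NKI code:&lt;/p&gt;

```python
import numpy as np

# Two samples, two boxes each, flattened to (batch, num_boxes * 4),
# mirroring the layout that the NKI kernel receives.
boxes = np.arange(16, dtype=np.float32).reshape(2, 8)

# NumPy analogue of i_f_0 = 4 * i_f, i_f_1 = 4 * i_f + 1, etc.:
# select every 4th column, starting at offsets 0 through 3.
left   = boxes[:, 0::4]  # x1 coordinates of all boxes
top    = boxes[:, 1::4]  # y1 coordinates
right  = boxes[:, 2::4]  # x2 coordinates
bottom = boxes[:, 3::4]  # y2 coordinates

print(left)  # [[ 0.  4.]
             #  [ 8. 12.]]
```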



&lt;p&gt;To run our GIOU kernel, we generate two batches of random boxes and feed them to our function:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# generate random data in np
np.random.seed(0)
batch_size = 1024
n_boxes = 256
img_size = 256
boxes = []

for i in range(2):
    # Randomly generate box sizes and positions
    box_sizes = np.random.randint(1, img_size, size=(batch_size,n_boxes,2))
    top_left = np.random.randint(0, img_size-1, size=(batch_size,n_boxes,2))
    bottom_right = np.clip(top_left + box_sizes, 0, img_size - 1)

    # Concatenate top-left and bottom-right coordinates
    rand_boxes = np.concatenate((top_left, bottom_right), axis=2)

    boxes.append(rand_boxes.astype(np.float32))

out = np.empty((batch_size, n_boxes), np.float32)

# convert tensors to PyTorch
t_boxes_0 = torch.tensor(boxes[0]).to(device)
t_boxes_1 = torch.tensor(boxes[1]).to(device)
t_out = torch.tensor(out).to(device)

if simulate:
    # the simulation API requires numpy input
    nki.simulate_kernel(giou_kernel,
                        boxes[0].reshape((batch_size, -1)),
                        boxes[1].reshape((batch_size, -1)),
                        out)
else:
    giou_kernel(t_boxes_0.view((batch_size, -1)),
                t_boxes_1.view((batch_size, -1)),
                t_out)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
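&lt;p&gt;When running in simulation mode, it can be helpful to sanity-check the kernel output against an independent reference. The following minimal NumPy implementation of GIOU for a single pair of boxes is our own addition (not part of the original script), mirroring the arithmetic of the kernel above:&lt;/p&gt;

```python
import numpy as np

def giou_reference(box1, box2, epsilon=1e-5):
    # each box is an array of (left, top, right, bottom) coordinates
    area1 = (box1[2] - box1[0]) * (box1[3] - box1[1])
    area2 = (box2[2] - box2[0]) * (box2[3] - box2[1])

    # intersection rectangle (clamped to zero if the boxes are disjoint)
    inter_w = max(min(box1[2], box2[2]) - max(box1[0], box2[0]), 0.0)
    inter_h = max(min(box1[3], box2[3]) - max(box1[1], box2[1]), 0.0)
    inter_area = inter_w * inter_h

    union_area = area1 + area2 - inter_area
    iou = inter_area / max(union_area, epsilon)

    # smallest box enclosing both inputs
    enclose_w = max(box1[2], box2[2]) - min(box1[0], box2[0])
    enclose_h = max(box1[3], box2[3]) - min(box1[1], box2[1])
    enclose_area = max(enclose_w * enclose_h, epsilon)

    return iou - (enclose_area - union_area) / enclose_area

# partially overlapping boxes: IoU = 1/7, enclosing-box penalty = 2/9
print(giou_reference(np.array([0., 0., 2., 2.]),
                     np.array([1., 1., 3., 3.])))  # 1/7 - 2/9 ~ -0.0794
```

&lt;p&gt;Comparing a few entries of the simulated kernel output against this reference provides a quick correctness check before moving to Neuron hardware.&lt;/p&gt;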



&lt;p&gt;To assess the performance of our NKI kernel, we will compare it with the following naive implementation of GIOU in PyTorch:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;def torch_giou(boxes1, boxes2):
    # loosely based on torchvision generalized_box_iou_loss code
    epsilon = 1e-5

    # Compute areas of both sets of boxes
    area1 = (boxes1[...,2]-boxes1[...,0])*(boxes1[...,3]-boxes1[...,1])
    area2 = (boxes2[...,2]-boxes2[...,0])*(boxes2[...,3]-boxes2[...,1])

    # Corners of intersection
    lt = torch.max(boxes1[..., :2], boxes2[..., :2])
    rb = torch.min(boxes1[..., 2:], boxes2[..., 2:])

    # Width and height of intersection
    wh = (rb - lt).clamp(min=0)

    # Area of the intersection
    inter = wh[..., 0] * wh[..., 1]

    # Union of the two boxes
    union = area1 + area2 - inter
    iou = inter / union.clamp(epsilon)

    # Corners of enclosing box
    lti = torch.min(boxes1[..., :2], boxes2[..., :2])
    rbi = torch.max(boxes1[..., 2:], boxes2[..., 2:])

    # Width and height of the enclosing box
    whi = (rbi - lti).clamp(min=0)

    # Area of the enclosing box
    areai = (whi[..., 0] * whi[..., 1]).clamp(epsilon)

    return iou - (areai - union) / areai
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;We use the following benchmarking utility to compare the runtime performance of our two functions:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import time
def benchmark(f, warmup_iters: int = 20, ntrials: int = 100):
    def run(*args, **kwargs):
        # warmup
        for _ in range(warmup_iters):
            f(*args, **kwargs)
        start_time = time.time()
        for _ in range(ntrials):
            f(*args, **kwargs)
        end_time = time.time()
        # Calculate average time per iteration
        avg_time = (end_time - start_time) / ntrials
        return avg_time

    return run

avg_time = benchmark(torch_giou)(t_boxes_0, t_boxes_1)
print(f'torch_giou: {avg_time}')

avg_time = benchmark(giou_kernel)(t_boxes_0.view((batch_size, -1)),
                                  t_boxes_1.view((batch_size, -1)),
                                  t_out)
print(f'giou_kernel: {avg_time}')
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Runtime Environment
&lt;/h2&gt;

&lt;p&gt;We ran our script on an &lt;a href="https://aws.amazon.com/ec2/instance-types/inf2/" rel="noopener noreferrer"&gt;Amazon EC2 inf2.xlarge&lt;/a&gt; instance (containing two &lt;a href="https://awsdocs-neuron.readthedocs-hosted.com/en/latest/general/arch/neuron-hardware/neuron-core-v2.html#neuroncores-v2-arch" rel="noopener noreferrer"&gt;Neuron cores&lt;/a&gt; and four vCPUs). We used the most recent version of the &lt;a href="https://aws.amazon.com/releasenotes/aws-deep-learning-ami-neuron-ubuntu-22-04/" rel="noopener noreferrer"&gt;Deep Learning AMI for Neuron&lt;/a&gt; available at the time of this writing, "Deep Learning AMI Neuron (Ubuntu 22.04) 20241027", with &lt;a href="https://awsdocs-neuron.readthedocs-hosted.com/en/latest/release-notes/index.html#neuron-2-20-1-10-25-2024" rel="noopener noreferrer"&gt;AWS Neuron 2.20.1&lt;/a&gt; and &lt;a href="https://awsdocs-neuron.readthedocs-hosted.com/en/latest/general/appnotes/torch-neuronx/introducing-pytorch-2-1.html" rel="noopener noreferrer"&gt;PyTorch 2.1&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  Results
&lt;/h2&gt;

&lt;p&gt;Our custom GIOU kernel demonstrated an average runtime of 0.211 milliseconds, compared to 0.293 milliseconds for the PyTorch baseline, amounting to a 39% performance boost. Keep in mind that these results are unique to our toy example. Other operators, particularly ones that include matrix multiplications (and utilize the Tensor engine), are likely to exhibit different comparative results.&lt;/p&gt;

&lt;h2&gt;
  
  
  Optimizing NKI Kernel Performance
&lt;/h2&gt;

&lt;p&gt;The next step in our kernel development - beyond the scope of this post - would be to analyze the performance of the GIOU kernel using the dedicated &lt;a href="https://awsdocs-neuron.readthedocs-hosted.com/en/latest/general/nki/neuron_profile_for_nki.html" rel="noopener noreferrer"&gt;Neuron Profiler&lt;/a&gt; in order to identify bottlenecks and optimize our implementation. Please see the &lt;a href="https://awsdocs-neuron.readthedocs-hosted.com/en/latest/general/nki/nki_perf_guide.html#nki-perf-guide" rel="noopener noreferrer"&gt;NKI performance guide&lt;/a&gt; for more details.&lt;/p&gt;

&lt;h1&gt;
  
  
  Neuron Custom C++ Operators
&lt;/h1&gt;

&lt;p&gt;The second method for creating a custom Neuron kernel is to build a C++ operator for the &lt;a href="https://awsdocs-neuron.readthedocs-hosted.com/en/latest/general/nki/trainium_inferentia2_arch.html#gpsimd-engine" rel="noopener noreferrer"&gt;GpSimd engine&lt;/a&gt;. This method is described in the &lt;a href="https://awsdocs-neuron.readthedocs-hosted.com/en/latest/neuron-customops/programming-guide/custom-c%2B%2B-operators-devguide.html#feature-custom-operators-devguide" rel="noopener noreferrer"&gt;Neuron Custom C++ Operators Developer Guide&lt;/a&gt; and demonstrated in the &lt;a href="https://awsdocs-neuron.readthedocs-hosted.com/en/latest/neuron-customops/tutorials/customop-mlp-training.html#neuronx-customop-mlp-tutorial" rel="noopener noreferrer"&gt;Neuron Custom C++ Operators in MLP&lt;/a&gt; and &lt;a href="https://awsdocs-neuron.readthedocs-hosted.com/en/latest/neuron-customops/tutorials/customop-mlp-perf-opt.html#neuronx-customop-mlp-perf" rel="noopener noreferrer"&gt;Neuron Custom C++ Operators Performance Optimization&lt;/a&gt; tutorials.&lt;/p&gt;

&lt;p&gt;Neuron Custom C++ Operators presents an opportunity for "kernel fusion" on the GpSimd engine by facilitating the combination of multiple low-level operations into a single kernel execution. This approach can significantly reduce the overhead associated with: 1) loading multiple individual kernels, and 2) transferring data between different memory regions.&lt;/p&gt;
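&lt;p&gt;The effect of kernel fusion can be illustrated with a simple CPU analogy (ours, not Neuron code): applying two elementwise operations as separate kernels requires two full passes over memory plus an intermediate buffer, whereas a fused kernel applies both operations in a single traversal:&lt;/p&gt;

```python
import numpy as np

x = np.arange(8, dtype=np.float32)

# Unfused: each operation is a separate pass over memory
tmp = x * 2.0        # "kernel" 1: read x, write an intermediate buffer
unfused = tmp + 1.0  # "kernel" 2: read the intermediate, write the output

# Fused: a single pass applies both operations per element,
# with no intermediate buffer (a stand-in for one fused kernel)
fused = np.empty_like(x)
for i in range(x.size):
    fused[i] = x[i] * 2.0 + 1.0

assert np.allclose(unfused, fused)
```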

&lt;h2&gt;
  
  
  Toy Example - A GIOU C++ Kernel
&lt;/h2&gt;

&lt;p&gt;In the code block below we implement a C++ GIOU operator for Neuron and save it to a file named &lt;em&gt;giou.cpp&lt;/em&gt;. Our kernel uses the &lt;a href="https://awsdocs-neuron.readthedocs-hosted.com/en/latest/neuron-customops/api-reference-guide/custom-ops-ref-guide.html#tcm-accessor" rel="noopener noreferrer"&gt;TCM accessor&lt;/a&gt; for optimizing memory read and write performance and applies the &lt;em&gt;multicore&lt;/em&gt; setting in order to &lt;a href="https://awsdocs-neuron.readthedocs-hosted.com/en/latest/neuron-customops/api-reference-guide/custom-ops-ref-guide.html#using-multiple-gpsimd-cores" rel="noopener noreferrer"&gt;use all eight of the GpSimd's internal processors&lt;/a&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;#include &amp;lt;stdint.h&amp;gt;
#include &amp;lt;stdlib.h&amp;gt;
#include &amp;lt;torch/torch.h&amp;gt;
#include &amp;lt;neuron/neuron-utils.hpp&amp;gt;
#include &amp;lt;algorithm&amp;gt;

// input boxes of shape 1024x256x4
// output scores of shape 1024x256
torch::Tensor giou(const torch::Tensor&amp;amp; t_pred,
                   const torch::Tensor&amp;amp; t_target) {
  size_t num_samples = t_pred.sizes()[0];
  size_t num_boxes = t_pred.sizes()[1];
  torch::Tensor t_out = get_dst_tensor();

  // get the number of GpSimd processors (8 in NeuronCoreV2)
  uint32_t cpu_count = get_cpu_count();
  // get index of current processor
  uint32_t cpu_id = get_cpu_id();

  // divide the batch size into 8 partitions
  uint32_t partition = num_samples / cpu_count;

  // use tcm buffers to load and write data
  size_t tcm_in_size = num_boxes*4;
  size_t tcm_out_size = num_boxes;
  float *tcm_pred = (float*)torch::neuron::tcm_malloc(
                                             sizeof(float)*tcm_in_size);
  float *tcm_target = (float*)torch::neuron::tcm_malloc(
                                             sizeof(float)*tcm_in_size);
  float *tcm_output = (float*)torch::neuron::tcm_malloc(
                                             sizeof(float)*tcm_out_size);
  auto t_pred_tcm_acc = t_pred.tcm_accessor();
  auto t_target_tcm_acc = t_target.tcm_accessor();
  auto t_out_tcm_acc = t_out.tcm_accessor();

  // iterate over each of the entries in the partition
  for (size_t i = 0; i &amp;lt; partition; i++) {
    // load the pred and target boxes into local memory
    t_pred_tcm_acc.tensor_to_tcm&amp;lt;float&amp;gt;(tcm_pred,
                                        partition*cpu_id + i*tcm_in_size,
                                        tcm_in_size);
    t_target_tcm_acc.tensor_to_tcm&amp;lt;float&amp;gt;(tcm_target,
                                          partition*cpu_id + i*tcm_in_size,
                                          tcm_in_size);

    // iterate over each of the boxes in the entry
    for (size_t j = 0; j &amp;lt; num_boxes; j++) {
      const float epsilon = 1e-5;
      const float* box1 = &amp;amp;tcm_pred[j * 4];
      const float* box2 = &amp;amp;tcm_target[j * 4];
      // Compute area of each box
      float area1 = (box1[2] - box1[0]) * (box1[3] - box1[1]);
      float area2 = (box2[2] - box2[0]) * (box2[3] - box2[1]);

      // Compute the intersection
      float left = std::max(box1[0], box2[0]);
      float top = std::max(box1[1], box2[1]);
      float right = std::min(box1[2], box2[2]);
      float bottom = std::min(box1[3], box2[3]);

      float inter_w = std::max(right - left, 0.f);
      float inter_h = std::max(bottom - top, 0.f);
      float inter_area = inter_w * inter_h;

      // Compute the union area
      float union_area = area1 + area2 - inter_area;

      // IoU
      float iou_val = inter_area / std::max(union_area, epsilon);

      // Compute the smallest enclosing box
      float enclose_left = std::min(box1[0], box2[0]);
      float enclose_top = std::min(box1[1], box2[1]);
      float enclose_right = std::max(box1[2], box2[2]);
      float enclose_bottom = std::max(box1[3], box2[3]);

      float enclose_w = std::max(enclose_right - enclose_left, 0.f);
      float enclose_h = std::max(enclose_bottom - enclose_top, 0.f);
      float enclose_area = std::max(enclose_w * enclose_h, epsilon);

      float result = iou_val - (enclose_area-union_area)/enclose_area;
      tcm_output[j] = result;
    }

    // write the giou scores of all boxes in the current entry
    t_out_tcm_acc.tcm_to_tensor&amp;lt;float&amp;gt;(tcm_output,
                                       partition*cpu_id + i*tcm_out_size,
                                       tcm_out_size);
  }

  torch::neuron::tcm_free(tcm_pred);
  torch::neuron::tcm_free(tcm_target);
  torch::neuron::tcm_free(tcm_output);
  return t_out;
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;We require a separate &lt;em&gt;shape.cpp&lt;/em&gt; file that defines the output shape of our GIOU function and registers our custom operator with the Neuron library:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;#include &amp;lt;stdint.h&amp;gt;
#include &amp;lt;stdlib.h&amp;gt;
#include &amp;lt;torch/torch.h&amp;gt;
#include "torchneuron/register.h"

torch::Tensor giou_shape(torch::Tensor boxes1, torch::Tensor boxes2) {
    torch::Tensor t_out = torch::zeros({boxes1.sizes()[0],
                                        boxes1.sizes()[1]},
                                       torch::kFloat);
    return t_out;
}

NEURON_LIBRARY(my_ops, m) {
  m.def("giou", &amp;amp;giou_shape, "giou");
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The &lt;em&gt;build.py&lt;/em&gt; script compiles the C++ operator and exposes it as a Python API:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import os
import torch_neuronx
from torch_neuronx.xla_impl import custom_op

custom_op.load(
    name='giou',
    compute_srcs=['giou.cpp'],
    shape_srcs=['shape.cpp'],
    build_directory=os.getcwd(),
    multicore=True,
    verbose=True
)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The compilation script generates a &lt;em&gt;libgiou.so&lt;/em&gt; library containing the implementation of our C++ GIOU operator. In the code block below we load the library and measure the performance of our custom kernel using the benchmarking utility defined above:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;from torch_neuronx.xla_impl import custom_op
custom_op.load_library('libgiou.so')

avg_time = benchmark(torch.ops.my_ops.giou)(t_boxes_0, t_boxes_1)
print(f'C++ giou: {avg_time}')
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Runtime Environment
&lt;/h2&gt;

&lt;p&gt;We used the same Neuron environment from our NKI experiments to compile and test our C++ kernel. Please note the &lt;a href="https://awsdocs-neuron.readthedocs-hosted.com/en/latest/neuron-customops/programming-guide/custom-c%2B%2B-operators-devguide.html#setup-installation" rel="noopener noreferrer"&gt;installation steps&lt;/a&gt; that are required for custom C++ operator development.&lt;/p&gt;

&lt;h2&gt;
  
  
  Results
&lt;/h2&gt;

&lt;p&gt;Our C++ GIOU kernel demonstrated an average runtime of 0.061 milliseconds - nearly five times faster than our baseline implementation. This is presumably a result of "kernel fusion", as discussed above.&lt;/p&gt;
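&lt;p&gt;For reference, the reported speedups follow directly from the measured average runtimes (values taken from the Results sections above):&lt;/p&gt;

```python
# average runtimes reported above, in milliseconds
torch_ms, nki_ms, cpp_ms = 0.293, 0.211, 0.061

nki_boost = torch_ms / nki_ms - 1  # ~0.39 -> "39% performance boost"
cpp_speedup = torch_ms / cpp_ms    # ~4.8  -> "nearly five times faster"

print(f'NKI boost: {nki_boost:.0%}, C++ speedup: {cpp_speedup:.1f}x')
# NKI boost: 39%, C++ speedup: 4.8x
```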

&lt;h1&gt;
  
  
  Conclusion
&lt;/h1&gt;

&lt;p&gt;The table below summarizes the runtime results of our experiments.&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7in6igsvbnj3pk715wy8.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7in6igsvbnj3pk715wy8.png" width="700" height="136"&gt;&lt;/a&gt;Avg time of different GIOU implementations (lower is better) - by Author&lt;/p&gt;

&lt;p&gt;Please keep in mind that these results are specific to the toy example and runtime environment used in this study. The comparative results of other kernels might be very different - depending on the degree to which they can leverage the Neuron core's internal compute engines.&lt;/p&gt;

&lt;p&gt;The table below summarizes some of the differences we observed between the two methods of AWS Neuron kernel customization.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F05i4a5oi7u5zby977ydv.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F05i4a5oi7u5zby977ydv.png" width="700" height="175"&gt;&lt;/a&gt;Comparison between kernel customization tools (by Author)&lt;/p&gt;

&lt;p&gt;Through its high-level Python interface, NKI exposes the power of the Neuron acceleration engines to ML developers in an accessible and user-friendly manner. The low-level C++ Custom Operators library enables even greater programmability, but is limited to the &lt;a href="https://awsdocs-neuron.readthedocs-hosted.com/en/latest/general/nki/trainium_inferentia2_arch.html#gpsimd-engine" rel="noopener noreferrer"&gt;GpSimd engine&lt;/a&gt;. By effectively combining both tools, developers can fully leverage the AWS Neuron architecture's capabilities.&lt;/p&gt;

&lt;h1&gt;
  
  
  Summary
&lt;/h1&gt;

&lt;p&gt;With the AI revolution in full swing, many companies are developing advanced new AI chips to meet the growing demand for compute. While public announcements often highlight these chips' runtime performance, cost savings, and energy efficiency, several core capabilities are essential to make these chips and their software stacks truly viable for ML development. These capabilities include robust debugging tools, performance analysis and optimization utilities, programmability, and more.&lt;/p&gt;

&lt;p&gt;In this post, we focused on the utilities available for programming AWS's homegrown AI accelerators, &lt;a href="https://aws.amazon.com/machine-learning/trainium/" rel="noopener noreferrer"&gt;Trainium&lt;/a&gt; and &lt;a href="https://aws.amazon.com/machine-learning/inferentia/" rel="noopener noreferrer"&gt;Inferentia&lt;/a&gt;, and demonstrated their use in building custom ML operations. These tools empower developers to optimize the performance of their ML models on AWS's AI chips and open up new opportunities for innovation and creativity.&lt;/p&gt;

</description>
      <category>aws</category>
      <category>ec2</category>
      <category>ai</category>
      <category>machinelearning</category>
    </item>
    <item>
      <title>AI Model Optimization on AWS Inferentia and Trainium</title>
      <dc:creator>Chaim Rand</dc:creator>
      <pubDate>Sun, 03 Nov 2024 14:34:56 +0000</pubDate>
      <link>https://dev.to/crand/ai-model-optimization-on-aws-inferentia-and-trainium-1hh0</link>
      <guid>https://dev.to/crand/ai-model-optimization-on-aws-inferentia-and-trainium-1hh0</guid>
      <description>&lt;h2&gt;
  
  
  Tips for accelerating ML with AWS Neuron SDK
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fmiro.medium.com%2Fv2%2Fresize%3Afit%3A700%2F0%2A2Rv2SBFg0AgINmyy" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fmiro.medium.com%2Fv2%2Fresize%3Afit%3A700%2F0%2A2Rv2SBFg0AgINmyy" width="700" height="523"&gt;&lt;/a&gt;&lt;br&gt;
Photo by &lt;a href="https://unsplash.com/@julientromeur?utm_source=medium&amp;amp;utm_medium=referral" rel="noopener noreferrer"&gt;julien Tromeur&lt;/a&gt; on &lt;a href="https://unsplash.com/?utm_source=medium&amp;amp;utm_medium=referral" rel="noopener noreferrer"&gt;Unsplash&lt;/a&gt;&lt;br&gt;
We are in a golden age of AI, with cutting-edge models disrupting industries and poised to transform life as we know it. Powering these advancements are increasingly powerful AI accelerators, such as &lt;a href="https://www.nvidia.com/en-eu/data-center/h100/" rel="noopener noreferrer"&gt;NVIDIA H100 GPUs&lt;/a&gt;, &lt;a href="https://cloud.google.com/tpu" rel="noopener noreferrer"&gt;Google Cloud TPUs&lt;/a&gt;, &lt;a href="https://aws.amazon.com/machine-learning/trainium/" rel="noopener noreferrer"&gt;AWS's Trainium&lt;/a&gt; and &lt;a href="https://aws.amazon.com/machine-learning/inferentia/" rel="noopener noreferrer"&gt;Inferentia&lt;/a&gt; chips, and more. With the growing number of options comes the challenge of &lt;a href="https://towardsdatascience.com/instance-selection-for-deep-learning-7463d774cff0" rel="noopener noreferrer"&gt;selecting the most optimal platform&lt;/a&gt; for our machine learning (ML) workloads -  a crucial decision considering the high costs associated with AI computation. Importantly, a comprehensive assessment of each option necessitates ensuring that we are maximizing its utilization to fully leverage its capabilities.&lt;/p&gt;

&lt;p&gt;In this post, we will review several techniques for optimizing an ML workload on AWS's custom-built AI chips using the &lt;a href="https://awsdocs-neuron.readthedocs-hosted.com/en/latest/index.html" rel="noopener noreferrer"&gt;AWS Neuron SDK&lt;/a&gt;. This continues our ongoing series of posts focused on ML model performance analysis and optimization across various platforms and environments (e.g., see &lt;a href="https://towardsdatascience.com/pytorch-model-performance-analysis-and-optimization-10c3c5822869" rel="noopener noreferrer"&gt;here&lt;/a&gt; and &lt;a href="https://towardsdatascience.com/training-ai-models-on-cpu-3903adc9f388" rel="noopener noreferrer"&gt;here&lt;/a&gt;). While our primary focus will be on an ML training workload and AWS Inferentia2, the techniques discussed are also applicable to AWS Trainium. (Recall that although AWS Inferentia is primarily designed as an AI inference chip, we have &lt;a href="https://towardsdatascience.com/dl-training-on-aws-inferentia-53e103597a03" rel="noopener noreferrer"&gt;previously demonstrated&lt;/a&gt; its effectiveness in training tasks as well.)&lt;/p&gt;

&lt;p&gt;Generally speaking, performance optimization is an iterative process that includes a performance analysis step to appropriately identify performance bottlenecks and resource under-utilization (e.g., see &lt;a href="https://towardsdatascience.com/cloud-ml-performance-checklist-caa51e798002" rel="noopener noreferrer"&gt;here&lt;/a&gt;). However, since the techniques we will discuss are general purpose (i.e., they are potentially applicable to any model, regardless of their performance profile), we defer the discussion on &lt;a href="https://awsdocs-neuron.readthedocs-hosted.com/en/latest/frameworks/torch/torch-neuronx/programming-guide/torch-neuronx-profiling-dev-guide.html" rel="noopener noreferrer"&gt;performance analysis with the Neuron SDK&lt;/a&gt; to a future post.&lt;/p&gt;
&lt;h2&gt;
  
  
  Disclaimers
&lt;/h2&gt;

&lt;p&gt;The code we will share is intended for demonstrative purposes only - we make no claims regarding its accuracy, optimality, or robustness. Please do not view this post as a substitute for the official &lt;a href="https://awsdocs-neuron.readthedocs-hosted.com/en/latest/index.html" rel="noopener noreferrer"&gt;Neuron SDK documentation&lt;/a&gt;. Please do not interpret our mention of any platforms, libraries, or optimization techniques as an endorsement for their use. The best options for you will depend greatly on the specifics of your use-case and will require your own in-depth investigation and analysis.&lt;/p&gt;

&lt;p&gt;The experiments described below were run on an &lt;a href="https://aws.amazon.com/ec2/instance-types/inf2/" rel="noopener noreferrer"&gt;Amazon EC2 inf2.xlarge&lt;/a&gt; instance (containing two &lt;a href="https://awsdocs-neuron.readthedocs-hosted.com/en/latest/general/arch/neuron-hardware/neuron-core-v2.html#neuroncores-v2-arch" rel="noopener noreferrer"&gt;Neuron cores&lt;/a&gt; and four vCPUs). We used the most recent version of the &lt;a href="https://aws.amazon.com/releasenotes/aws-deep-learning-ami-neuron-ubuntu-22-04/" rel="noopener noreferrer"&gt;Deep Learning AMI for Neuron&lt;/a&gt; available at the time of this writing, "Deep Learning AMI Neuron (Ubuntu 22.04) 20240927", with &lt;a href="https://awsdocs-neuron.readthedocs-hosted.com/en/latest/release-notes/index.html" rel="noopener noreferrer"&gt;AWS Neuron 2.20&lt;/a&gt; and &lt;a href="https://awsdocs-neuron.readthedocs-hosted.com/en/latest/general/appnotes/torch-neuronx/introducing-pytorch-2-1.html" rel="noopener noreferrer"&gt;PyTorch 2.1&lt;/a&gt;. See the &lt;a href="https://awsdocs-neuron.readthedocs-hosted.com/en/latest/general/setup/neuron-setup/multiframework/multi-framework-ubuntu22-neuron-dlami.html#setup-ubuntu22-multi-framework-dlami" rel="noopener noreferrer"&gt;SDK documentation&lt;/a&gt; for more details on setup and installation. Keep in mind that the Neuron SDK is under active development and that the APIs we refer to, as well as the runtime measurements we report, may become outdated by the time you read this. Please be sure to stay up-to-date with the latest SDK and documentation available.&lt;/p&gt;
&lt;h1&gt;
  
  
  Toy Model
&lt;/h1&gt;

&lt;p&gt;To facilitate our discussion, we introduce the following simple &lt;a href="https://en.wikipedia.org/wiki/Vision_transformer" rel="noopener noreferrer"&gt;Vision Transformer&lt;/a&gt; (ViT)-backed classification model (based on &lt;a href="https://pypi.org/project/timm/" rel="noopener noreferrer"&gt;timm&lt;/a&gt; version 1.0.10):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;from torch.utils.data import Dataset
import time, os
import torch
import torch_xla.core.xla_model as xm
import torch_xla.distributed.parallel_loader as pl
from timm.models.vision_transformer import VisionTransformer

# use random data
class FakeDataset(Dataset):
  def __len__(self):
    return 1000000

  def __getitem__(self, index):
    rand_image = torch.randn([3, 224, 224], dtype=torch.float32)
    label = torch.tensor(data=index % 1000, dtype=torch.int64)
    return rand_image, label

def train(batch_size=16, num_workers=0):
  # Initialize XLA process group for torchrun
  import torch_xla.distributed.xla_backend
  torch.distributed.init_process_group('xla')

  # multi-processing: ensure each worker has same initial weights
  torch.manual_seed(0)
  dataset = FakeDataset()
  model = VisionTransformer()

  # load model to XLA device
  device = xm.xla_device()
  model = model.to(device)
  optimizer = torch.optim.Adam(model.parameters())
  data_loader = torch.utils.data.DataLoader(dataset,
                                            batch_size=batch_size,
                                            num_workers=num_workers)

  data_loader = pl.MpDeviceLoader(data_loader, device)
  loss_function = torch.nn.CrossEntropyLoss()
  summ = 0
  count = 0
  t0 = time.perf_counter()

  for step, (inputs, targets) in enumerate(data_loader, start=1):
    optimizer.zero_grad()
    outputs = model(inputs)
    loss = loss_function(outputs, targets)
    loss.backward()
    xm.optimizer_step(optimizer)
    batch_time = time.perf_counter() - t0
    if step &amp;gt; 10:  # skip first steps
      summ += batch_time
      count += 1
    t0 = time.perf_counter()
    if step &amp;gt; 500:
      break
  print(f'average step time: {summ/count}')

if __name__ == '__main__':
  train()

# Initialization command:
# torchrun --nproc_per_node=2 train.py
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Running our baseline model on the two cores of our AWS Inferentia instance results in a training speed of 251.98 samples per second.&lt;/p&gt;
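The throughput figures quoted in this post can be related to the average step time that the script prints. The conversion below is our own back-of-the-envelope sketch, assuming the baseline configuration of a per-core batch size of 16 on the two Neuron cores:

```python
# Relate the reported throughput to the per-worker step time printed by
# the script: total samples processed per step divided by the step time.
num_cores = 2          # inf2.xlarge exposes two Neuron cores
batch_size = 16        # per-core batch size in the baseline run
throughput = 251.98    # samples/sec, baseline result quoted above

avg_step_time = num_cores * batch_size / throughput
print(f'implied average step time: {avg_step_time:.4f} sec')  # ~0.1270
```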

&lt;p&gt;In the next sections, we will iteratively apply a number of potential optimization techniques and assess their impact on step time performance. While we won't go into the full details of each method, we will provide references for further reading (e.g., &lt;a href="https://towardsdatascience.com/pytorch-model-performance-analysis-and-optimization-10c3c5822869" rel="noopener noreferrer"&gt;here&lt;/a&gt;). Importantly, the list we will present is not all-inclusive --- there are many techniques beyond what we will cover. We will organize the methods into three categories: PyTorch optimizations, OpenXLA optimizations, and Neuron-specific optimizations. However, the order of presentation is not binding. In fact, some of the techniques are interdependent --- for example, applying the mixed precision optimization may free up enough device memory to enable increasing the batch size.&lt;/p&gt;

&lt;h1&gt;
  
  
  PyTorch Performance Optimizations
&lt;/h1&gt;

&lt;p&gt;In previous posts (e.g., &lt;a href="https://towardsdatascience.com/pytorch-model-performance-analysis-and-optimization-10c3c5822869" rel="noopener noreferrer"&gt;here&lt;/a&gt;) we have covered the topic of PyTorch model performance analysis and optimization on GPU extensively. Many of the techniques we discussed are relevant to other AI accelerators as well. In this section we will revisit a few of these techniques and apply them to AWS Inferentia.&lt;/p&gt;

&lt;h2&gt;
  
  
  Multi-process Data Loading
&lt;/h2&gt;

&lt;p&gt;In &lt;a href="https://pytorch.org/docs/stable/data.html#single-and-multi-process-data-loading" rel="noopener noreferrer"&gt;multi-process data loading&lt;/a&gt;, the input data is prepared in one or more dedicated CPU processes rather than in the same process that runs the training step. This allows the data loading and training to overlap, which can increase system utilization and lead to a significant speed-up. The number of processes is controlled by the &lt;em&gt;num_workers&lt;/em&gt; parameter of the &lt;a href="https://pytorch.org/docs/stable/data.html#torch.utils.data.DataLoader" rel="noopener noreferrer"&gt;PyTorch DataLoader&lt;/a&gt;. In the following block we run our script with &lt;em&gt;num_workers&lt;/em&gt; set to one:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;train(num_workers=1)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This change results in a training speed of 253.56 samples per second for a boost of less than 1%.&lt;/p&gt;

&lt;h2&gt;
  
  
  Batch Size Optimization
&lt;/h2&gt;

&lt;p&gt;Another important hyperparameter that can influence training speed is the training batch size. Often, we have found that increasing the batch size improves system utilization and results in better performance. However, the effects can vary based on the model and platform. In the case of our toy model on AWS Inferentia, we find that running with a batch size of 8 samples per neuron core results in a speed of 265.68 samples per second - roughly 5% faster than a batch size of 16 samples per core.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;train(batch_size=8, num_workers=1)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  PyTorch Automatic Mixed Precision
&lt;/h2&gt;

&lt;p&gt;Another common method for boosting performance is to use lower precision floats such as the 16-bit BFloat16. Importantly, some model components might not be compatible with reduced precision floats. PyTorch's &lt;a href="https://pytorch.org/tutorials/recipes/recipes/amp_recipe.html" rel="noopener noreferrer"&gt;Automatic Mixed Precision (AMP)&lt;/a&gt; mode attempts to match the most appropriate floating point type to each model operation automatically. Although the Neuron compiler offers its own options for employing &lt;a href="https://awsdocs-neuron.readthedocs-hosted.com/en/latest/general/appnotes/neuronx-cc/neuronx-cc-training-mixed-precision.html#mixed-precision-and-performance-accuracy-tuning-neuronx-cc" rel="noopener noreferrer"&gt;mixed precision&lt;/a&gt;, it also &lt;a href="https://awsdocs-neuron.readthedocs-hosted.com/en/latest/frameworks/torch/torch-neuronx/programming-guide/training/pytorch-neuron-programming-guide.html#automatic-mixed-precision" rel="noopener noreferrer"&gt;supports the option of using PyTorch AMP&lt;/a&gt;. In the code block below we include the modifications required to use PyTorch AMP.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;def train(batch_size=16, num_workers=0):
  # Initialize XLA process group for torchrun
  import torch_xla.distributed.xla_backend
  torch.distributed.init_process_group('xla')

  # multi-processing: ensure each worker has same initial weights
  torch.manual_seed(0)
  dataset = FakeDataset()
  model = VisionTransformer()

  # load model to XLA device
  device = xm.xla_device()
  model = model.to(device)
  optimizer = torch.optim.Adam(model.parameters())
  data_loader = torch.utils.data.DataLoader(dataset,
                                            batch_size=batch_size,
                                            num_workers=num_workers)

  data_loader = pl.MpDeviceLoader(data_loader, device)
  loss_function = torch.nn.CrossEntropyLoss()
  summ = 0
  count = 0
  t0 = time.perf_counter()

  for step, (inputs, targets) in enumerate(data_loader, start=1):
    optimizer.zero_grad()

    # use PyTorch AMP (device_type 'cuda' works together with the
    # is_bf16_supported override in the main block below)
    with torch.autocast(dtype=torch.bfloat16, device_type='cuda'):
      outputs = model(inputs)
      loss = loss_function(outputs, targets)
    loss.backward()
    xm.optimizer_step(optimizer)
    batch_time = time.perf_counter() - t0
    if step &amp;gt; 10:  # skip first steps
      summ += batch_time
      count += 1
    t0 = time.perf_counter()
    if step &amp;gt; 500:
      break
  print(f'average step time: {summ/count}')

if __name__ == '__main__':
  # disable Neuron compiler casting so that AMP controls precision
  os.environ["NEURON_CC_FLAGS"] = "--auto-cast=none"
  # make autocast treat BFloat16 as supported on the 'cuda' device type
  torch.cuda.is_bf16_supported = lambda: True
  train(batch_size=8, num_workers=1)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The resultant training speed is 196.64 samples per second, about 26% lower than the &lt;a href="https://awsdocs-neuron.readthedocs-hosted.com/en/latest/general/appnotes/neuronx-cc/neuronx-cc-training-mixed-precision.html#neuronx-cc-training-mixed-precision" rel="noopener noreferrer"&gt;default mixed precision&lt;/a&gt; setting of the Neuron compiler. It's important to note that while this post focuses on performance, in real-world scenarios, we would also need to evaluate the effect of the mixed precision policy we choose on &lt;a href="https://awsdocs-neuron.readthedocs-hosted.com/en/latest/general/appnotes/neuronx-cc/neuronx-cc-training-mixed-precision.html#performance-accuracy-tradeoffs" rel="noopener noreferrer"&gt;model accuracy&lt;/a&gt;.&lt;/p&gt;

&lt;h1&gt;
  
  
  OpenXLA Optimizations
&lt;/h1&gt;

&lt;p&gt;As discussed in a &lt;a href="https://towardsdatascience.com/a-first-look-at-aws-trainium-1e0605071970" rel="noopener noreferrer"&gt;previous post&lt;/a&gt;, Neuron Cores are treated as &lt;a href="https://awsdocs-neuron.readthedocs-hosted.com/en/latest/frameworks/torch/torch-neuronx/programming-guide/training/pytorch-neuron-programming-guide.html#neuron-xla-device" rel="noopener noreferrer"&gt;XLA devices&lt;/a&gt; and the &lt;a href="https://awsdocs-neuron.readthedocs-hosted.com/en/latest/general/quick-start/torch-neuron.html" rel="noopener noreferrer"&gt;torch-neuronx&lt;/a&gt; Python package implements the &lt;a href="https://github.com/pytorch/xla/" rel="noopener noreferrer"&gt;PyTorch/XLA&lt;/a&gt; API. Consequently, any optimization opportunities provided by the OpenXLA framework, and specifically those offered by the PyTorch/XLA API, can be leveraged on AWS Inferentia and Trainium. In this section we consider a few of these opportunities.&lt;/p&gt;

&lt;h2&gt;
  
  
  BFloat16 Precision
&lt;/h2&gt;

&lt;p&gt;OpenXLA supports the option of &lt;a href="https://awsdocs-neuron.readthedocs-hosted.com/en/latest/frameworks/torch/torch-neuronx/programming-guide/training/pytorch-neuron-programming-guide.html#automatic-casting-of-float-tensors-to-bfloat16" rel="noopener noreferrer"&gt;casting all floats to BFloat16&lt;/a&gt; via the XLA_USE_BF16 environment variable, as shown in the code block below:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;if __name__ == '__main__':
  os.environ['XLA_USE_BF16'] = '1'
  train(batch_size=8, num_workers=1)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The resultant training speed is 394.51 samples per second, nearly 50% faster than the speed of the &lt;a href="https://awsdocs-neuron.readthedocs-hosted.com/en/latest/general/appnotes/neuronx-cc/neuronx-cc-training-mixed-precision.html#neuronx-cc-training-mixed-precision" rel="noopener noreferrer"&gt;default mixed precision&lt;/a&gt; option.&lt;/p&gt;

&lt;h2&gt;
  
  
  Multi-process Device Loading
&lt;/h2&gt;

&lt;p&gt;The PyTorch/XLA &lt;a href="https://pytorch.org/xla/master/_modules/torch_xla/distributed/parallel_loader.html#MpDeviceLoader" rel="noopener noreferrer"&gt;MpDeviceLoader&lt;/a&gt; and its internal &lt;a href="https://pytorch.org/xla/master/_modules/torch_xla/distributed/parallel_loader.html" rel="noopener noreferrer"&gt;ParallelLoader&lt;/a&gt;, which are responsible for loading input data onto the accelerator, include a number of parameters for controlling the transfer of data from the host to the device. In the code block below we tune the &lt;a href="https://github.com/pytorch/xla/blob/v2.1.0/torch_xla/distributed/parallel_loader.py#L86" rel="noopener noreferrer"&gt;&lt;em&gt;batches_per_execution&lt;/em&gt;&lt;/a&gt; setting, which determines the number of batches copied to the device for each execution cycle of the &lt;a href="https://pytorch.org/xla/master/_modules/torch_xla/distributed/parallel_loader.html" rel="noopener noreferrer"&gt;ParallelLoader&lt;/a&gt;. By increasing this setting, we aim to reduce the overhead of host-to-device communication:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;data_loader = torch.utils.data.DataLoader(dataset,
                                          batch_size=batch_size,
                                          num_workers=num_workers
                                          )
data_loader = pl.MpDeviceLoader(data_loader,
                                device, batches_per_execution=10)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;As a result of this optimization, the training speed increased to 1,027.39 samples per second, representing an additional 260% speed-up.&lt;/p&gt;

&lt;h2&gt;
  
  
  Torch Compilation with OpenXLA Backend
&lt;/h2&gt;

&lt;p&gt;In previous posts (e.g., &lt;a href="https://towardsdatascience.com/tips-and-tricks-for-upgrading-to-pytorch-2-3127db1d1f3d" rel="noopener noreferrer"&gt;here&lt;/a&gt;), we have demonstrated the potential performance gains from using &lt;a href="https://pytorch.org/tutorials/intermediate/torch_compile_tutorial.html" rel="noopener noreferrer"&gt;PyTorch's graph compilation&lt;/a&gt; offering. Although &lt;a href="https://openxla.org/xla" rel="noopener noreferrer"&gt;OpenXLA&lt;/a&gt; includes its own graph creation and Just-In-Time (JIT) compilation mechanisms, &lt;a href="https://pytorch.org/xla/master/torch_compile.html" rel="noopener noreferrer"&gt;torch.compile&lt;/a&gt; can provide additional acceleration by eliminating the need for tracing the model operations at every step. The following code snippet demonstrates the use of the dedicated &lt;em&gt;openxla&lt;/em&gt; backend for compiling the model:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;model = model.to(device)
model = torch.compile(backend='openxla')
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Although torch.compile is currently &lt;a href="https://awsdocs-neuron.readthedocs-hosted.com/en/latest/release-notes/torch/torch-neuronx/index.html#known-limitations" rel="noopener noreferrer"&gt;not yet supported&lt;/a&gt; by the Neuron SDK, we include its mention in anticipation of its future release.&lt;/p&gt;

&lt;h1&gt;
  
  
  Neuron SDK Optimizations
&lt;/h1&gt;

&lt;p&gt;In this section we consider some of the optimization opportunities offered by the AWS Neuron SDK and, more specifically, by the Neuron compiler.&lt;/p&gt;

&lt;h2&gt;
  
  
  Mixed Precision
&lt;/h2&gt;

&lt;p&gt;The Neuron SDK supports a variety of &lt;a href="https://awsdocs-neuron.readthedocs-hosted.com/en/latest/general/appnotes/neuronx-cc/neuronx-cc-training-mixed-precision.html#mixed-precision-and-performance-accuracy-tuning-neuronx-cc" rel="noopener noreferrer"&gt;mixed precision&lt;/a&gt; settings. In the code block below we program the compiler to cast all floats to BFloat16 via the &lt;em&gt;NEURON_CC_FLAGS&lt;/em&gt; environment variable.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;if __name__ == '__main__':
  os.environ["NEURON_CC_FLAGS"] = "--auto-cast all --auto-cast-type bf16"
  train(batch_size=8, num_workers=1)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This results (unsurprisingly) in a similar training speed to the OpenXLA BFloat16 experiment described above.&lt;/p&gt;

&lt;h2&gt;
  
  
  FP8
&lt;/h2&gt;

&lt;p&gt;One of the unique features of NeuronCoreV2 is its support of the eight-bit floating point type, fp8_e4m3. The code block below demonstrates how to configure the Neuron compiler to automatically cast all floating-point operations to FP8:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;if __name__ == '__main__':
 os.environ["NEURON_CC_FLAGS"] = "--auto-cast all --auto-cast-type fp8_e4m3"
 train(batch_size=8, num_workers=1)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;While FP8 can accelerate training in some cases, maintaining stable convergence can be more challenging than when using BFloat16 due to its reduced precision and dynamic range. Please see our &lt;a href="https://towardsdatascience.com/accelerating-pytorch-training-workloads-with-fp8-5a5123aec7d7" rel="noopener noreferrer"&gt;previous post&lt;/a&gt; for more on the potential benefits and challenges of FP8 training.&lt;/p&gt;

&lt;p&gt;In the case of our model, using FP8 actually harms runtime performance compared to BFloat16, reducing the training speed to 940.36 samples per second.&lt;/p&gt;

&lt;h2&gt;
  
  
  Compiler Optimizations
&lt;/h2&gt;

&lt;p&gt;The &lt;a href="https://awsdocs-neuron.readthedocs-hosted.com/en/latest/compiler/neuronx-cc/api-reference-guide/neuron-compiler-cli-reference-guide.html" rel="noopener noreferrer"&gt;Neuron compiler&lt;/a&gt; includes a number of controls for optimizing the runtime performance of the compiled graph. Two key settings are &lt;em&gt;model-type&lt;/em&gt; and &lt;em&gt;opt-level&lt;/em&gt;. The &lt;em&gt;model-type *setting applies optimizations tailored to specific model architectures, such as transformers, while the *opt-level *setting allows for balancing compilation time against runtime performance. In the code block below, we program the *model-type&lt;/em&gt; setting to &lt;em&gt;tranformer&lt;/em&gt; and the &lt;em&gt;opt-level&lt;/em&gt; setting to the highest performance option. We further specify the &lt;em&gt;target&lt;/em&gt; runtime device, &lt;em&gt;inf2&lt;/em&gt;, to ensure that the model is optimized for the target device.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;if __name__ == '__main__':
  os.environ['XLA_USE_BF16'] = '1'
  os.environ["NEURON_CC_FLAGS"] = "--model-type transformer " \
                                  "--optlevel 3 " \
                                  "--target inf2"
  train(batch_size=8, num_workers=1)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The above configuration resulted in a training speed of 1093.25 samples per second, amounting to a modest 6% improvement.&lt;/p&gt;

&lt;h1&gt;
  
  
  Results
&lt;/h1&gt;

&lt;p&gt;We summarize the results of our experiments in the table below. Keep in mind that the effect of each of the optimization methods we discussed will depend greatly on the model and the runtime environment.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fk8jmnhs2gxoz4agf8z0h.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fk8jmnhs2gxoz4agf8z0h.png" width="700" height="255"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Experiment Results (by Author)&lt;/p&gt;

&lt;p&gt;The techniques we employed resulted in a 435% performance boost compared to our baseline experiment. It is likely that additional acceleration could be achieved by revisiting and fine-tuning some of the methods we discussed, or by applying other optimization techniques not covered in this post.&lt;/p&gt;

&lt;p&gt;Our goal has been to introduce some of the available optimization strategies and demonstrate their potential impact on runtime performance. However, in a real-world scenario, we would need to assess how each of these optimizations impacts our model convergence. In some cases, adjustments to the model configuration may be necessary to ensure optimal performance without sacrificing accuracy. Additionally, using a performance profiler to identify bottlenecks and measure system resource utilization is essential for guiding and informing our optimization activities.&lt;/p&gt;

&lt;h1&gt;
  
  
  Summary
&lt;/h1&gt;

&lt;p&gt;Nowadays, we are fortunate to have a wide variety of systems on which to run our ML workloads. No matter which platform we choose, our goal is to maximize its capabilities. In this post, we focused on AWS Inferentia and reviewed several techniques for accelerating ML workloads running on it. Be sure to check out our &lt;a href="https://chaimrand.medium.com/" rel="noopener noreferrer"&gt;other posts&lt;/a&gt; for more optimization strategies across various AI accelerators.&lt;/p&gt;

</description>
      <category>aws</category>
      <category>ec2</category>
      <category>ai</category>
      <category>machinelearning</category>
    </item>
    <item>
      <title>Training AI Models on CPU on AWS EC2</title>
      <dc:creator>Chaim Rand</dc:creator>
      <pubDate>Wed, 04 Sep 2024 07:09:43 +0000</pubDate>
      <link>https://dev.to/crand/training-ai-models-on-cpu-on-aws-ec2-4o4p</link>
      <guid>https://dev.to/crand/training-ai-models-on-cpu-on-aws-ec2-4o4p</guid>
      <description>&lt;h2&gt;
  
  
  Revisiting CPU for ML in an Era of GPU Scarcity
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--PcF8MYqj--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://miro.medium.com/v2/resize:fit:700/0%2ATf0e2-5_s5L2MZGZ" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--PcF8MYqj--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://miro.medium.com/v2/resize:fit:700/0%2ATf0e2-5_s5L2MZGZ" width="700" height="467"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Photo by &lt;a href="https://unsplash.com/@quinoal?utm_source=medium&amp;amp;utm_medium=referral" rel="noopener noreferrer"&gt;Quino Al&lt;/a&gt; on &lt;a href="https://unsplash.com/?utm_source=medium&amp;amp;utm_medium=referral" rel="noopener noreferrer"&gt;Unsplash&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The recent successes in AI are often attributed to the emergence and evolution of the GPU. The GPU's architecture, which typically includes thousands of multi-processors, high-speed memory, dedicated tensor cores, and more, is particularly well-suited to meet the intensive demands of AI/ML workloads. Unfortunately, the rapid growth in AI development has led to a surge in the demand for GPUs, making them difficult to obtain. As a result, ML developers are increasingly exploring alternative hardware options for training and running their models. In previous posts, we discussed the possibility of training on dedicated AI ASICs such as &lt;a href="https://towardsdatascience.com/tpu-training-6eb84100d138" rel="noopener noreferrer"&gt;Google Cloud TPU&lt;/a&gt;, &lt;a href="https://towardsdatascience.com/training-on-aws-with-habana-gaudi-3126e183048" rel="noopener noreferrer"&gt;Habana Gaudi&lt;/a&gt;, and &lt;a href="https://towardsdatascience.com/a-first-look-at-aws-trainium-1e0605071970" rel="noopener noreferrer"&gt;AWS Trainium&lt;/a&gt;. While these options offer significant cost-saving opportunities, they do not suit all ML models and can, like the GPU, also suffer from limited availability. In this post we return to the good old-fashioned CPU and revisit its relevance to ML applications. Although CPUs are generally less suited to ML workloads compared to GPUs, they are much easier to acquire. The ability to run (at least some of) our workloads on CPU could have significant implications on development productivity.&lt;/p&gt;

&lt;p&gt;In previous posts (e.g., &lt;a href="https://towardsdatascience.com/overcoming-data-preprocessing-bottlenecks-with-tensorflow-data-service-nvidia-dali-and-other-d6321917f851" rel="noopener noreferrer"&gt;here&lt;/a&gt;) we emphasized the importance of analyzing and optimizing the runtime performance of AI/ML workloads as a means of accelerating development and minimizing costs. While this is crucial regardless of the compute engine used, the profiling tools and optimization techniques can vary greatly between platforms. In this post, we will discuss some of the performance optimization options that pertain to CPU. Our focus will be on &lt;a href="https://www.intel.com/content/www/us/en/products/details/processors/xeon.html" rel="noopener noreferrer"&gt;Intel® Xeon® CPU&lt;/a&gt; processors (with &lt;a href="https://www.intel.com/content/www/us/en/architecture-and-technology/avx-512-overview.html" rel="noopener noreferrer"&gt;Intel® AVX-512&lt;/a&gt;) and on the PyTorch (version 2.4) framework (although similar techniques can be applied to other CPUs and frameworks, as well). More specifically, we will run our experiments on an &lt;a href="https://aws.amazon.com/ec2/instance-types/c7i/" rel="noopener noreferrer"&gt;Amazon EC2 c7i&lt;/a&gt; instance with an &lt;a href="https://docs.aws.amazon.com/dlami/" rel="noopener noreferrer"&gt;AWS Deep Learning AMI&lt;/a&gt;. Please do not view our choice of cloud platform, CPU version, ML framework, or any other tool or library that we mention as an endorsement of it over its alternatives.&lt;/p&gt;

&lt;p&gt;Our goal will be to demonstrate that although ML development on CPU may not be our first choice, there are ways to "soften the blow" and - in some cases - perhaps even make it a viable alternative.&lt;/p&gt;

&lt;h2&gt;
  
  
  Disclaimers
&lt;/h2&gt;

&lt;p&gt;Our intention in this post is to demonstrate just a few of the ML optimization opportunities available on CPU. Contrary to most of the online tutorials on the topic of ML optimization on CPU, we will focus on a training workload rather than an inference workload. There are a number of optimization tools focused specifically on inference that we will not cover (e.g., see &lt;a href="https://pytorch.org/tutorials/intermediate/torchserve_with_ipex.html" rel="noopener noreferrer"&gt;here&lt;/a&gt; and &lt;a href="https://pytorch.org/blog/accelerated-cpu-inference/" rel="noopener noreferrer"&gt;here&lt;/a&gt;).&lt;/p&gt;

&lt;p&gt;Please do not view this post as a replacement of the official documentation on any of the tools or techniques that we mention. Keep in mind that given the rapid pace of AI/ML development, some of the content, libraries, and/or instructions that we mention may become outdated by the time you read this. Please be sure to refer to the most up-to-date documentation available.&lt;/p&gt;

&lt;p&gt;Importantly, the impact of the optimizations that we discuss on runtime performance is likely to vary greatly based on the model and the details of the environment (e.g., see the high degree of variance between models on the official PyTorch &lt;a href="http://github.com/pytorch/pytorch/issues/93531#issuecomment-1457373890" rel="noopener noreferrer"&gt;TorchInductor CPU Inference Performance Dashboard&lt;/a&gt;). The comparative performance numbers we will share are specific to the toy model and runtime environment that we will use. Be sure to reevaluate all of the proposed optimizations on your own model and runtime environment.&lt;/p&gt;

&lt;p&gt;Lastly, our focus will be solely on throughput performance (as measured in samples per second) - not on training convergence. However, it should be noted that some optimization techniques (e.g., batch size tuning, mixed precision, and more) could have a negative effect on the convergence of certain models. In some cases, this can be overcome through appropriate hyperparameter tuning.&lt;/p&gt;

&lt;h1&gt;
  
  
  Toy Example - ResNet-50
&lt;/h1&gt;

&lt;p&gt;We will run our experiments on a simple image classification model with a &lt;a href="https://pytorch.org/vision/main/models/generated/torchvision.models.resnet50" rel="noopener noreferrer"&gt;ResNet-50&lt;/a&gt; backbone (from &lt;a href="https://arxiv.org/abs/1512.03385" rel="noopener noreferrer"&gt;Deep Residual Learning for Image Recognition&lt;/a&gt;). We will train the model on a fake dataset. The full training script appears in the code block below (loosely based on &lt;a href="https://github.com/intel/intel-extension-for-pytorch/blob/main/examples/cpu/training/python-scripts/distributed_data_parallel_training.py" rel="noopener noreferrer"&gt;this example&lt;/a&gt;):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import torch

import torchvision

from torch.utils.data import Dataset, DataLoader

import time

# A dataset with random images and labels

class FakeDataset(Dataset):

    def __len__(self):

        return 1000000

    def __getitem__(self, index):

        rand_image = torch.randn([3, 224, 224], dtype=torch.float32)

        label = torch.tensor(data=index % 10, dtype=torch.uint8)

        return rand_image, label

train_set = FakeDataset()

batch_size=128

num_workers=0

train_loader = DataLoader(

    dataset=train_set,

    batch_size=batch_size,

    num_workers=num_workers

)

model = torchvision.models.resnet50()

criterion = torch.nn.CrossEntropyLoss()

optimizer = torch.optim.SGD(model.parameters())

model.train()

t0 = time.perf_counter()

summ = 0

count = 0

for idx, (data, target) in enumerate(train_loader):

    optimizer.zero_grad()

    output = model(data)

    loss = criterion(output, target)

    loss.backward()

    optimizer.step()

    batch_time = time.perf_counter() - t0

    if idx &amp;gt; 10:  # skip first steps

        summ += batch_time

        count += 1

    t0 = time.perf_counter()

    if idx &amp;gt; 100:

        break

print(f'average step time: {summ/count}')

print(f'throughput: {count*batch_size/summ}')
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Running this script on a c7i.2xlarge (with 8 vCPUs) and the &lt;a href="https://download.pytorch.org/whl/cpu" rel="noopener noreferrer"&gt;CPU&lt;/a&gt; version of PyTorch 2.4, results in a throughput of 9.12 samples per second. For the sake of comparison, we note that the throughput of the same (unoptimized) script on an &lt;a href="https://aws.amazon.com/ec2/instance-types/g5/" rel="noopener noreferrer"&gt;Amazon EC2 g5.2xlarge&lt;/a&gt; instance (with 1 GPU and 8 vCPUs) is 340 samples per second. Taking into account the &lt;a href="https://aws.amazon.com/ec2/pricing/on-demand/" rel="noopener noreferrer"&gt;comparative costs&lt;/a&gt; of these two instance types ($0.357 per hour for a c7i.2xlarge and $1.212 for a g5.2xlarge, as of the time of this writing), we find that training on the GPU instance gives roughly eleven(!!) times better price/performance. Based on these results, the preference for using GPUs to train ML models is very well founded. Let's assess some of the possibilities for reducing this gap.&lt;/p&gt;
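The eleven-fold figure can be reproduced from the quoted numbers (instance prices and throughputs as reported above):

```python
# Cost per sample on each instance type, using the prices and
# throughputs quoted above.
c7i_price, c7i_throughput = 0.357, 9.12   # $/hr, samples/sec
g5_price, g5_throughput = 1.212, 340.0    # $/hr, samples/sec

cost_c7i = c7i_price / (c7i_throughput * 3600)   # $ per sample
cost_g5 = g5_price / (g5_throughput * 3600)      # $ per sample
print(f'GPU price/performance advantage: {cost_c7i / cost_g5:.1f}x')  # ~11.0x
```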

&lt;h1&gt;
  
  
  PyTorch Performance Optimizations
&lt;/h1&gt;

&lt;p&gt;In this section we will explore some basic methods for increasing the runtime performance of our training workload. Although you may recognize some of these from our &lt;a href="https://towardsdatascience.com/pytorch-model-performance-analysis-and-optimization-10c3c5822869" rel="noopener noreferrer"&gt;post&lt;/a&gt; on GPU optimization, it is important to highlight a significant difference between training optimization on CPU and GPU platforms. On GPU platforms much of our effort was dedicated to maximizing the parallelization between (the training data preprocessing on) the CPU and (the model training on) the GPU. On CPU platforms all of the processing occurs on the CPU and our goal will be to allocate its resources most effectively.&lt;/p&gt;
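One basic knob for such resource allocation (our own illustration, not part of the training script above) is PyTorch's intra-op thread pool, which bounds how many cores each CPU op may use:

```python
import torch

# Inspect and set the intra-op thread pool that PyTorch uses for CPU
# ops; by default it typically matches the number of physical cores.
torch.set_num_threads(4)     # hypothetical value for an 8-vCPU host
print(torch.get_num_threads())
```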

&lt;h2&gt;
  
  
  Batch Size
&lt;/h2&gt;

&lt;p&gt;Increasing the training batch size can potentially increase performance by reducing the frequency of the model parameter updates. (On GPUs it has the added benefit of reducing the overhead of CPU-GPU transactions such as kernel loading.) However, while on GPU we aimed for a batch size that would maximize the utilization of the GPU memory, the same strategy might hurt performance on CPU. For reasons beyond the scope of this post, CPU memory behavior is more complicated, and the best approach for discovering the optimal batch size may be trial and error. Keep in mind that changing the batch size could affect training convergence.&lt;/p&gt;

&lt;p&gt;The table below summarizes the throughput of our training workload for a few (arbitrary) choices of batch size:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F744z2ab70dhoo3nkj948.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F744z2ab70dhoo3nkj948.png" width="263" height="119"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Training Throughput as Function of Batch Size (by Author)&lt;/p&gt;

&lt;p&gt;Contrary to our findings on GPU, on the c7i.2xlarge instance type our model appears to prefer lower batch sizes.&lt;/p&gt;
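&lt;p&gt;A trial-and-error sweep of this kind is easy to automate. The helper below is a hypothetical sketch (not part of the original script): it times a user-supplied training-step function and reports samples per second, discarding a few warm-up iterations:&lt;/p&gt;

```python
import time

def measure_throughput(step_fn, batch_size, warmup=5, steps=20):
    """Run `steps` timed training steps and return throughput in samples/sec.

    `step_fn(batch_size)` is assumed to execute one training step
    (forward, backward, and optimizer update) on a single batch.
    """
    for _ in range(warmup):          # discard warm-up iterations
        step_fn(batch_size)
    start = time.perf_counter()
    for _ in range(steps):
        step_fn(batch_size)
    elapsed = time.perf_counter() - start
    return steps * batch_size / elapsed

# Sweep a few candidate batch sizes and keep the fastest, e.g.:
#   best = max([32, 64, 128, 256],
#              key=lambda b: measure_throughput(train_step, b))
```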

&lt;h2&gt;
  
  
  Multi-process Data Loading
&lt;/h2&gt;

&lt;p&gt;A common technique on GPUs is to &lt;a href="https://pytorch.org/docs/stable/data.html#single-and-multi-process-data-loading" rel="noopener noreferrer"&gt;assign multiple processes&lt;/a&gt; to the data loader so as to reduce the likelihood of starvation of the GPU. On GPU platforms, a general rule of thumb is to set the number of workers according to the number of CPU cores. However, on CPU platforms, where the model training uses the same resources as the data loader, this approach could backfire. Once again, the best approach for choosing the optimal number of workers may be trial and error. The table below shows the average throughput for different choices of &lt;em&gt;num_workers&lt;/em&gt;:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fuun4zzp5b4p1ms4fxwph.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fuun4zzp5b4p1ms4fxwph.png" width="397" height="217"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Training Throughput as Function of the Number of Data Loading Workers (by Author)&lt;/p&gt;
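&lt;p&gt;For reference, the number of workers is controlled through the &lt;em&gt;num_workers&lt;/em&gt; argument of the PyTorch DataLoader. A minimal sketch (the synthetic dataset is a stand-in for a real one):&lt;/p&gt;

```python
import torch
from torch.utils.data import DataLoader, Dataset

class FakeDataset(Dataset):
    """Synthetic stand-in for a real image dataset."""
    def __len__(self):
        return 10000

    def __getitem__(self, index):
        return torch.randn([3, 224, 224]), index % 10

# On a CPU-only platform the worker processes compete with the training
# computation for the same cores, so sweep a few small values of
# num_workers rather than defaulting to the core count as on GPU.
train_loader = DataLoader(FakeDataset(), batch_size=32, num_workers=2)
```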

&lt;h2&gt;
  
  
  Mixed Precision
&lt;/h2&gt;

&lt;p&gt;Another popular technique is to use lower precision floating point datatypes such as &lt;code&gt;torch.float16&lt;/code&gt; or &lt;code&gt;torch.bfloat16&lt;/code&gt;, with the dynamic range of &lt;code&gt;torch.bfloat16&lt;/code&gt; generally considered to be more amenable to ML training. Naturally, reducing the datatype precision can have adverse effects on convergence and should be done carefully. PyTorch comes with &lt;a href="https://pytorch.org/docs/stable/amp.html" rel="noopener noreferrer"&gt;torch.amp&lt;/a&gt;, an automatic mixed precision package for optimizing the use of these datatypes. Intel® AVX-512 includes &lt;a href="https://pytorch.org/blog/empowering-pytorch-on-intel-xeon-scalable-processors-with-bfloat16/" rel="noopener noreferrer"&gt;support for the bfloat16&lt;/a&gt; datatype. The modified training step appears below:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;for idx, (data, target) in enumerate(train_loader):

    optimizer.zero_grad()

    with torch.amp.autocast('cpu',dtype=torch.bfloat16):

        output = model(data)

        loss = criterion(output, target)

    loss.backward()

    optimizer.step()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The throughput following this optimization is 24.34 samples per second, an increase of 86%!!&lt;/p&gt;
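&lt;p&gt;To see the mechanism at work: under a CPU autocast context, operations on PyTorch's bfloat16 cast list (such as matrix multiplication) run in the lower precision automatically, while other operations keep their original dtype:&lt;/p&gt;

```python
import torch

a = torch.randn(4, 4)  # float32 inputs
b = torch.randn(4, 4)

with torch.amp.autocast('cpu', dtype=torch.bfloat16):
    c = a @ b          # matmul is on the autocast list, runs in bfloat16
    d = a + b          # pointwise add is not, stays in float32

print(c.dtype, d.dtype)
```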

&lt;h2&gt;
  
  
  Channels Last Memory Format
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://pytorch.org/tutorials/intermediate/memory_format_tutorial.html" rel="noopener noreferrer"&gt;Channels last memory format&lt;/a&gt; is a beta-level optimization (at the time of this writing), pertaining primarily to vision models, that supports storing four dimensional (NCHW) tensors in memory such that the channels are the last dimension. This results in all of the data of each pixel being stored together. This optimization pertains primarily to vision models. &lt;a href="https://intel.github.io/intel-extension-for-pytorch/cpu/latest/tutorials/performance_tuning/tuning_guide.html#channels-last" rel="noopener noreferrer"&gt;Considered to be more "friendly to Intel platforms"&lt;/a&gt;, this memory format is &lt;a href="https://pytorch.org/blog/accelerating-pytorch-vision-models-with-channels-last-on-cpu/" rel="noopener noreferrer"&gt;reported&lt;/a&gt; boost the performance of a ResNet-50 on an &lt;a href="https://www.intel.com/content/www/us/en/products/details/processors/xeon.html" rel="noopener noreferrer"&gt;Intel® Xeon® CPU&lt;/a&gt;. The adjusted training step appears below:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;for idx, (data, target) in enumerate(train_loader):

    data = data.to(memory_format=torch.channels_last)

    optimizer.zero_grad()

    with torch.amp.autocast('cpu',dtype=torch.bfloat16):

        output = model(data)

        loss = criterion(output, target)

    loss.backward()

    optimizer.step()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The resulting throughput is 37.93 samples per second, an additional 56% improvement and a cumulative speedup of roughly 4.16x over our baseline experiment. We are on a roll!!&lt;/p&gt;
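&lt;p&gt;Note that the training step above converts only the input tensors. The PyTorch memory-format tutorial also converts the model itself, so that the convolution weights are stored in channels-last order; the layout then propagates through convolution outputs. A minimal sketch (with a single conv layer standing in for the full model):&lt;/p&gt;

```python
import torch

# Converting a module rewrites its 4D weight tensors to channels-last
conv = torch.nn.Conv2d(3, 8, kernel_size=3).to(memory_format=torch.channels_last)

x = torch.randn(4, 3, 32, 32).to(memory_format=torch.channels_last)
y = conv(x)
# The channels-last layout is preserved through the convolution
assert y.is_contiguous(memory_format=torch.channels_last)
```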

&lt;h2&gt;
  
  
  Torch Compilation
&lt;/h2&gt;

&lt;p&gt;In a &lt;a href="https://towardsdatascience.com/tips-and-tricks-for-upgrading-to-pytorch-2-3127db1d1f3d" rel="noopener noreferrer"&gt;previous post&lt;/a&gt; we covered the virtues of PyTorch's support for &lt;a href="https://pytorch.org/docs/stable/generated/torch.compile.html" rel="noopener noreferrer"&gt;graph compilation&lt;/a&gt; and its potential impact on runtime performance. Contrary to the default eager execution mode in which each operation is run independently (a.k.a., "eagerly"), the &lt;a href="https://pytorch.org/docs/stable/generated/torch.compile.html" rel="noopener noreferrer"&gt;compile&lt;/a&gt; API converts the model into an intermediate computation graph which is then JIT-compiled into low-level machine code in a manner that is optimal for the underlying training engine. The API supports compilation via different backend libraries and with multiple configuration options. Here we will limit our evaluation to the &lt;em&gt;default&lt;/em&gt; (TorchInductor) backend and the &lt;a href="https://github.com/intel/intel-extension-for-pytorch" rel="noopener noreferrer"&gt;&lt;em&gt;ipex&lt;/em&gt;&lt;/a&gt; backend from the &lt;a href="https://pytorch.org/tutorials/recipes/recipes/intel_extension_for_pytorch.html" rel="noopener noreferrer"&gt;Intel® Extension for PyTorch&lt;/a&gt;, a library with dedicated optimizations for Intel hardware. Please see the &lt;a href="https://intel.github.io/intel-extension-for-pytorch/index.html#installation?platform=cpu&amp;amp;version=v2.4.0%2bcpu&amp;amp;os=linux%2fwsl2&amp;amp;package=pip" rel="noopener noreferrer"&gt;documentation&lt;/a&gt; for appropriate installation and usage instructions. The updated model definition appears below:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import intel_extension_for_pytorch as ipex

model = torchvision.models.resnet50()

backend='inductor' # optionally change to 'ipex'

model = torch.compile(model, backend=backend)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;In the case of our toy model, the impact of torch compilation is only apparent when the "channels last" optimization is disabled (an increase of ~27% for each of the backends). When "channels last" is applied, the performance actually drops. As a result, we drop torch compilation from our subsequent experiments.&lt;/p&gt;

&lt;h1&gt;
  
  
  Memory and Thread Optimizations
&lt;/h1&gt;

&lt;p&gt;There are a number of opportunities for &lt;a href="https://pytorch.org/tutorials/recipes/recipes/tuning_guide.html#cpu-specific-optimizations" rel="noopener noreferrer"&gt;optimizing the use of the underlying CPU resources&lt;/a&gt;. These include adapting memory management and thread allocation to the &lt;a href="https://intel.github.io/intel-extension-for-pytorch/cpu/latest/tutorials/performance_tuning/tuning_guide.html#intel-cpu-structure" rel="noopener noreferrer"&gt;structure&lt;/a&gt; of the underlying CPU hardware. Memory management can be improved through the use of &lt;a href="https://pytorch.org/tutorials/recipes/recipes/tuning_guide.html#switch-memory-allocator" rel="noopener noreferrer"&gt;advanced memory allocators&lt;/a&gt; (such as &lt;a href="https://github.com/jemalloc/jemalloc" rel="noopener noreferrer"&gt;Jemalloc&lt;/a&gt; and &lt;a href="https://google.github.io/tcmalloc/overview.html" rel="noopener noreferrer"&gt;TCMalloc&lt;/a&gt;) and/or by reducing slower memory accesses (i.e., across &lt;a href="https://en.wikipedia.org/wiki/Non-uniform_memory_access" rel="noopener noreferrer"&gt;NUMA nodes&lt;/a&gt;). Thread allocation can be improved through appropriate &lt;a href="https://pytorch.org/tutorials/recipes/recipes/tuning_guide.html#utilize-openmp" rel="noopener noreferrer"&gt;configuration of the OpenMP threading library&lt;/a&gt; and/or use of &lt;a href="https://pytorch.org/tutorials/recipes/recipes/tuning_guide.html#utilize-openmp" rel="noopener noreferrer"&gt;Intel's OpenMP library&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Generally speaking, these kinds of optimizations require a deep understanding of the CPU architecture and the features of its supporting software stack. To simplify matters, PyTorch offers the &lt;a href="https://pytorch.org/tutorials/recipes/xeon_run_cpu.html" rel="noopener noreferrer"&gt;&lt;em&gt;torch.backends.xeon.run_cpu&lt;/em&gt;&lt;/a&gt; script for automatically configuring the memory and threading libraries so as to optimize runtime performance. The command below will result in the use of the dedicated memory and threading libraries. We will return to the topic of NUMA nodes when we discuss the option of distributed training.&lt;/p&gt;

&lt;p&gt;We verify appropriate installation of &lt;a href="https://google.github.io/tcmalloc/overview.html" rel="noopener noreferrer"&gt;TCMalloc&lt;/a&gt; (&lt;code&gt;conda install conda-forge::gperftools&lt;/code&gt;) and &lt;a href="https://pypi.org/project/intel-openmp/" rel="noopener noreferrer"&gt;Intel's OpenMP library&lt;/a&gt; (&lt;code&gt;pip install intel-openmp&lt;/code&gt;), and run the following command.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;python -m torch.backends.xeon.run_cpu train.py
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The use of the &lt;a href="https://pytorch.org/tutorials/recipes/xeon_run_cpu.html" rel="noopener noreferrer"&gt;&lt;em&gt;run_cpu&lt;/em&gt;&lt;/a&gt; script further boosts our runtime performance to 39.05 samples per second. Note that the &lt;a href="https://pytorch.org/tutorials/recipes/xeon_run_cpu.html" rel="noopener noreferrer"&gt;&lt;em&gt;run_cpu&lt;/em&gt;&lt;/a&gt; script includes many controls for further tuning performance. Be sure to check out the documentation in order to maximize its use.&lt;/p&gt;

&lt;h1&gt;
  
  
  The Intel Extension for PyTorch
&lt;/h1&gt;

&lt;p&gt;The &lt;a href="https://pytorch.org/tutorials/recipes/recipes/intel_extension_for_pytorch.html" rel="noopener noreferrer"&gt;Intel® Extension for PyTorch&lt;/a&gt; includes additional opportunities for &lt;a href="https://intel.github.io/intel-extension-for-pytorch/cpu/latest/tutorials/performance_tuning/tuning_guide.html#non-uniform-memory-access-numa" rel="noopener noreferrer"&gt;training optimization&lt;/a&gt; via its &lt;a href="https://intel.github.io/intel-extension-for-pytorch/latest/tutorials/api_doc.html" rel="noopener noreferrer"&gt;ipex.optimize&lt;/a&gt; function. Here we demonstrate its default use. Please see the documentation to learn of its full capabilities.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;model = torchvision.models.resnet50()

criterion = torch.nn.CrossEntropyLoss()

optimizer = torch.optim.SGD(model.parameters())

model.train()

model, optimizer = ipex.optimize(

   model,

   optimizer=optimizer,

   dtype=torch.bfloat16

)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Combined with the memory and thread optimizations discussed above, the resultant throughput is 40.73 samples per second. (Note that a similar result is reached when disabling the "channels last" configuration.)&lt;/p&gt;

&lt;h1&gt;
  
  
  Distributed Training on CPU
&lt;/h1&gt;

&lt;p&gt;&lt;a href="https://www.intel.com/content/www/us/en/products/details/processors/xeon.html" rel="noopener noreferrer"&gt;Intel® Xeon®&lt;/a&gt; processors are designed with &lt;a href="https://intel.github.io/intel-extension-for-pytorch/cpu/latest/tutorials/performance_tuning/tuning_guide.html#non-uniform-memory-access-numa" rel="noopener noreferrer"&gt;Non-Uniform Memory Access (NUMA)&lt;/a&gt; in which the CPU memory is divided into groups, a.k.a., NUMA nodes, and each of the CPU cores is assigned to one node. Although any CPU core can access the memory of any NUMA node, the access to its own node (i.e., its local memory) is much faster. This gives rise to the notion of &lt;a href="https://intel.github.io/intel-extension-for-pytorch/cpu/latest/tutorials/examples.html#distributed-training" rel="noopener noreferrer"&gt;distributing training across NUMA nodes&lt;/a&gt;, where the CPU cores assigned to each NUMA node act as a single process in a &lt;a href="https://pytorch.org/docs/stable/distributed.html" rel="noopener noreferrer"&gt;distributed process group&lt;/a&gt; and &lt;a href="https://pytorch.org/tutorials/recipes/recipes/tuning_guide.html#train-a-model-on-cpu-with-pytorch-distributeddataparallel-ddp-functionality" rel="noopener noreferrer"&gt;data distribution&lt;/a&gt; across nodes is managed by &lt;a href="https://github.com/oneapi-src/oneCCL" rel="noopener noreferrer"&gt;Intel® oneCCL&lt;/a&gt;, Intel's dedicated collective communications library.&lt;/p&gt;

&lt;p&gt;We can run data distributed training across NUMA nodes easily using the &lt;a href="https://intel.github.io/intel-extension-for-pytorch/latest/tutorials/performance_tuning/launch_script.html" rel="noopener noreferrer"&gt;&lt;em&gt;ipexrun&lt;/em&gt;&lt;/a&gt; utility. In the following code block (loosely based on &lt;a href="https://github.com/intel/intel-extension-for-pytorch/blob/main/examples/cpu/training/python-scripts/distributed_data_parallel_training.py" rel="noopener noreferrer"&gt;this example&lt;/a&gt;) we adapt our script to run data distributed training (according to the usage detailed &lt;a href="https://github.com/intel/torch-ccl?tab=readme-ov-file#usage" rel="noopener noreferrer"&gt;here&lt;/a&gt;):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import os, time

import torch

from torch.utils.data import Dataset, DataLoader

from torch.utils.data.distributed import DistributedSampler

import torch.distributed as dist

import torchvision

import oneccl_bindings_for_pytorch as torch_ccl

import intel_extension_for_pytorch as ipex

os.environ["MASTER_ADDR"] = "127.0.0.1"

os.environ["MASTER_PORT"] = "29500"

os.environ["RANK"] = os.environ.get("PMI_RANK", "0")

os.environ["WORLD_SIZE"] = os.environ.get("PMI_SIZE", "1")

dist.init_process_group(backend="ccl", init_method="env://")

rank = os.environ["RANK"]

world_size = os.environ["WORLD_SIZE"]

batch_size = 128

num_workers = 0

# define dataset and dataloader

class FakeDataset(Dataset):

    def __len__(self):

        return 1000000

    def __getitem__(self, index):

        rand_image = torch.randn([3, 224, 224], dtype=torch.float32)

        label = torch.tensor(data=index % 10, dtype=torch.uint8)

        return rand_image, label

train_dataset = FakeDataset()

dist_sampler = DistributedSampler(train_dataset)

train_loader = DataLoader(

    dataset=train_dataset,

    batch_size=batch_size,

    num_workers=num_workers,

    sampler=dist_sampler

)

# define model artifacts

model = torchvision.models.resnet50()

criterion = torch.nn.CrossEntropyLoss()

optimizer = torch.optim.SGD(model.parameters())

model.train()

model, optimizer = ipex.optimize(

    model,

    optimizer=optimizer,

    dtype=torch.bfloat16

)

# configure DDP

model = torch.nn.parallel.DistributedDataParallel(model)

# run training loop

# destroy the process group

dist.destroy_process_group()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Unfortunately, as of the time of this writing, the &lt;a href="https://aws.amazon.com/ec2/instance-types/c7i/" rel="noopener noreferrer"&gt;Amazon EC2 c7i&lt;/a&gt; instance family does not include a multi-NUMA instance type. To test our distributed training script, we revert to an &lt;a href="https://aws.amazon.com/ec2/instance-types/c6i/" rel="noopener noreferrer"&gt;Amazon EC2 c6i.32xlarge&lt;/a&gt; instance with 64 vCPUs and 2 NUMA nodes. We verify the &lt;a href="https://github.com/intel/torch-ccl?tab=readme-ov-file#installation" rel="noopener noreferrer"&gt;installation&lt;/a&gt; of &lt;a href="https://github.com/intel/torch-ccl" rel="noopener noreferrer"&gt;Intel® oneCCL Bindings for PyTorch&lt;/a&gt; and run the following command (as documented &lt;a href="https://github.com/intel/intel-extension-for-pytorch/tree/main/examples/cpu/training/python-scripts#running-example-scripts" rel="noopener noreferrer"&gt;here&lt;/a&gt;):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;source $(python -c "import oneccl_bindings_for_pytorch as torch_ccl;print(torch_ccl.cwd)")/env/setvars.sh

# This example command would utilize all the numa sockets of the processor, taking each socket as a rank.

ipexrun cpu --nnodes 1 --omp_runtime intel train.py
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The following table compares the performance results on the &lt;a href="https://aws.amazon.com/ec2/instance-types/c6i/" rel="noopener noreferrer"&gt;c6i.32xlarge&lt;/a&gt; instance with and without distributed training:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Frkufxks9e9sdn8lyye31.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Frkufxks9e9sdn8lyye31.png" width="275" height="82"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Distributed Training Across NUMA Nodes (by Author)&lt;/p&gt;

&lt;p&gt;In our experiment, data distribution did &lt;em&gt;not&lt;/em&gt; boost the runtime performance. Please see &lt;a href="https://intel.github.io/intel-extension-for-pytorch/latest/tutorials/performance_tuning/launch_script.html" rel="noopener noreferrer"&gt;&lt;em&gt;ipexrun documentation&lt;/em&gt;&lt;/a&gt; for additional performance tuning options.&lt;/p&gt;

&lt;h1&gt;
  
  
  CPU Training with Torch/XLA
&lt;/h1&gt;

&lt;p&gt;In previous posts (e.g., &lt;a href="https://towardsdatascience.com/how-to-accelerate-your-pytorch-training-with-xla-on-aws-3d599bc8f6a9" rel="noopener noreferrer"&gt;here&lt;/a&gt;) we discussed the &lt;a href="https://github.com/pytorch/xla" rel="noopener noreferrer"&gt;PyTorch/XLA&lt;/a&gt; library and its use of &lt;a href="https://openxla.org/xla" rel="noopener noreferrer"&gt;XLA compilation&lt;/a&gt; to enable PyTorch based training on &lt;a href="https://github.com/pytorch/xla/blob/master/API_GUIDE.md" rel="noopener noreferrer"&gt;&lt;em&gt;XLA devices&lt;/em&gt;&lt;/a&gt; such as TPU, GPU, &lt;em&gt;and&lt;/em&gt; CPU. Similar to torch compilation, XLA uses graph compilation to generate machine code that is optimized for the target device. With the establishment of the &lt;a href="https://cloud.google.com/blog/products/ai-machine-learning/googles-open-source-momentum-openxla-new-partnerships" rel="noopener noreferrer"&gt;OpenXLA Project&lt;/a&gt;, one of the stated goals was to support high performance across all hardware backends, including CPU (see the CPU RFC &lt;a href="https://docs.google.com/document/d/1ZzMcrjxITJeN2IjjgbzUjHh-4W1YgDUus3j25Dvn9ng/edit#heading=h.w9ztr841aqk8" rel="noopener noreferrer"&gt;here&lt;/a&gt;). The code block below demonstrates the adjustments to our original (unoptimized) script required to train using &lt;a href="https://pytorch.org/xla/release/r2.4/index.html#" rel="noopener noreferrer"&gt;PyTorch/XLA&lt;/a&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import torch

import torchvision

import timeimport torch_xla

import torch_xla.core.xla_model as xm

device = xm.xla_device()

model = torchvision.models.resnet50().to(device)

criterion = torch.nn.CrossEntropyLoss()

optimizer = torch.optim.SGD(model.parameters())

model.train()

for idx, (data, target) in enumerate(train_loader):

    data = data.to(device)

    target = target.to(device)

    optimizer.zero_grad()

    output = model(data)

    loss = criterion(output, target)

    loss.backward()

    optimizer.step()

    xm.mark_step()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Unfortunately, as of the time of this writing, the XLA results on our toy model seem far inferior to the (unoptimized) results we saw above, by as much as 7x. We expect this to improve as PyTorch/XLA's CPU support matures.&lt;/p&gt;

&lt;h1&gt;
  
  
  Results
&lt;/h1&gt;

&lt;p&gt;We summarize the results of a subset of our experiments in the table below. For the sake of comparison, we add the throughput of training our model on an &lt;a href="https://aws.amazon.com/ec2/instance-types/g5/" rel="noopener noreferrer"&gt;Amazon EC2 g5.2xlarge&lt;/a&gt; GPU instance following the optimization steps discussed in &lt;a href="https://towardsdatascience.com/pytorch-model-performance-analysis-and-optimization-10c3c5822869" rel="noopener noreferrer"&gt;this post&lt;/a&gt;. The &lt;em&gt;samples per dollar&lt;/em&gt; metric was calculated based on the &lt;a href="https://aws.amazon.com/ec2/pricing/on-demand/" rel="noopener noreferrer"&gt;Amazon EC2 On-Demand pricing&lt;/a&gt; page ($0.357 per hour for a c7i.2xlarge and $1.212 per hour for a g5.2xlarge, as of the time of this writing).&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9hjoe38ixk4io15oieho.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9hjoe38ixk4io15oieho.png" width="700" height="305"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Performance Optimization Results (by Author)&lt;/p&gt;

&lt;p&gt;Although we succeeded in boosting the training performance of our toy model on the CPU instance by a considerable margin (a roughly 4.5x speedup over our baseline), it remains inferior to the (optimized) performance on the GPU instance. Based on our results, training on the GPU would be roughly 6.7 times cheaper. It is likely that with additional performance tuning and/or additional optimization strategies, we could further close the gap. Once again, we emphasize that the comparative performance results we have reached are unique to this model and runtime environment.&lt;/p&gt;

&lt;h2&gt;
  
  
  Amazon EC2 Spot Instances Discounts
&lt;/h2&gt;

&lt;p&gt;The increased availability of cloud-based CPU instance types (compared to GPU instance types) may imply greater opportunity for obtaining compute power at discounted rates, e.g., through Spot Instance utilization. &lt;a href="https://aws.amazon.com/ec2/spot/" rel="noopener noreferrer"&gt;Amazon EC2 Spot Instances&lt;/a&gt; are instances from surplus cloud service capacity that are offered at a discount of as much as 90% off the On-Demand pricing. In exchange for the discounted price, AWS maintains the right to preempt the instance with little to no warning. Given the high demand for GPUs, you may find CPU spot instances easier to get a hold of than their GPU counterparts. At the time of this writing, the c7i.2xlarge &lt;a href="https://aws.amazon.com/ec2/spot/pricing/" rel="noopener noreferrer"&gt;Spot Instance price&lt;/a&gt; is $0.1291 per hour, which would improve our samples-per-dollar result to 1135.76 and further reduce the gap between the optimized GPU and CPU price performance (to 2.43x).&lt;/p&gt;

&lt;p&gt;While the runtime performance results of the optimized CPU training of our toy model (in our chosen environment) were lower than the GPU results, it is likely that the same optimization steps applied to other model architectures (e.g., ones that include components that are not supported on the GPU) may result in the CPU performance matching or beating that of the GPU. And even in cases where the performance gap is not bridged, there may very well be cases where the shortage of GPU compute capacity justifies running some of our ML workloads on CPU.&lt;/p&gt;

&lt;h1&gt;
  
  
  Summary
&lt;/h1&gt;

&lt;p&gt;Given the ubiquity of the CPU, the ability to use it effectively for training and/or running ML workloads could have huge implications for development productivity and end-product deployment strategy. While the nature of the CPU architecture is less amenable to many ML applications than that of the GPU, there are many tools and techniques available for boosting its performance, a select few of which we have discussed and demonstrated in this post.&lt;/p&gt;

&lt;p&gt;In this post we focused on optimizing training on CPU. Please be sure to check out our many &lt;a href="https://chaimrand.medium.com/" rel="noopener noreferrer"&gt;other posts on Medium&lt;/a&gt; covering a wide variety of topics pertaining to performance analysis and optimization of machine learning workloads.&lt;/p&gt;

</description>
      <category>ec2</category>
      <category>ai</category>
      <category>aws</category>
      <category>machinelearning</category>
    </item>
    <item>
      <title>A Priority Based Scheduler for Amazon SageMaker Training Jobs</title>
      <dc:creator>Chaim Rand</dc:creator>
      <pubDate>Mon, 11 Mar 2024 09:06:14 +0000</pubDate>
      <link>https://dev.to/aws-builders/a-priority-based-scheduler-for-amazon-sagemaker-training-jobs-30f0</link>
      <guid>https://dev.to/aws-builders/a-priority-based-scheduler-for-amazon-sagemaker-training-jobs-30f0</guid>
      <description>&lt;h2&gt;
  
  
  Optimizing the use of limited AI training accelerators — Part 2
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--xEw1epqd--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://miro.medium.com/v2/resize:fit:700/0%2APV9sbX7QZa1mVHfI" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--xEw1epqd--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://miro.medium.com/v2/resize:fit:700/0%2APV9sbX7QZa1mVHfI" alt="" width="700" height="525"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Photo by  &lt;a href="https://unsplash.com/@ahda_gallery?utm_source=medium&amp;amp;utm_medium=referral"&gt;Adrien Aletti&lt;/a&gt;  on  &lt;a href="https://unsplash.com/?utm_source=medium&amp;amp;utm_medium=referral"&gt;Unsplash&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This post was created in collaboration with  &lt;a href="https://www.linkedin.com/in/maxrabin/"&gt;Max Rabin&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;This is the second part of a series of posts on the topic of maximizing the utility of scarce AI resources. In the  &lt;a href="https://towardsdatascience.com/maximizing-the-utility-of-scarce-ai-resources-a-kubernetes-approach-0230ba53965b"&gt;first post&lt;/a&gt;  we noted the increasing limitations on the ability to scale up AI resources at will and, as a consequence, the growing trend of AI development teams to guarantee AI compute capacity by means such as building up an in-house AI server farm and/or reserving dedicated instances in the cloud. The scarcity of AI compute resources motivates the design of specialized scheduling solutions to minimize idle time and prioritize critical workloads. Please see our  &lt;a href="https://towardsdatascience.com/maximizing-the-utility-of-scarce-ai-resources-a-kubernetes-approach-0230ba53965b"&gt;previous post&lt;/a&gt;  in which we proposed a detailed list of requirements for such solutions. The approach we took there was to leverage the existing priority-based  &lt;a href="https://kubernetes.io/docs/concepts/scheduling-eviction/kube-scheduler/"&gt;scheduler&lt;/a&gt;  that comes with  &lt;a href="https://kubernetes.io/"&gt;Kubernetes&lt;/a&gt;  and align our training development workflow to its use. In this post we explore the option of maintaining our existing framework for training AI models and enhancing it with our own custom implementation of a priority-based scheduler. Importantly, the need for this type of solution is often motivated not just by the scarcity of AI resources, but also by the desire to increase control over the orchestration and prioritization of training workloads so as to reduce development costs. For example, even in a scenario of abundant capacity, you may choose to limit your use to a fixed number of training instances so as to cap your training expenditure.&lt;/p&gt;

&lt;p&gt;For the purposes of this post, we will assume that our training framework of choice is AWS’s managed service for AI model training,  &lt;a href="https://aws.amazon.com/sagemaker/"&gt;Amazon SageMaker&lt;/a&gt;. The solution we will propose will use additional AWS services such as  &lt;a href="https://aws.amazon.com/pm/dynamodb/"&gt;Amazon DynamoDB&lt;/a&gt;  and  &lt;a href="https://aws.amazon.com/pm/lambda/"&gt;AWS Lambda&lt;/a&gt;. The choice to demonstrate our solution using AWS services should not be viewed as endorsement. There are many cloud-based service offerings available and the best one for you will depend on the particular details of your project. Similar solutions to the one that we will describe can be designed on other cloud-based environments and/or using alternative cloud-based services.&lt;/p&gt;

&lt;h1&gt;
  
  
  The Traditional Method for Starting Up SageMaker Training Jobs
&lt;/h1&gt;

&lt;p&gt;Traditionally, we would start up a SageMaker training job using the  &lt;a href="https://sagemaker.readthedocs.io/en/stable/"&gt;Amazon SageMaker Python SDK&lt;/a&gt;. In the code block below we use the SageMaker SDK (version 2.208) to run a PyTorch training workload on a single instance of type  &lt;a href="https://aws.amazon.com/ec2/instance-types/p5/"&gt;p5.48xlarge&lt;/a&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;from sagemaker.pytorch import PyTorch  

# define job  
estimator = PyTorch(  
    role='&amp;lt;sagemaker role&amp;gt;',  
    entry_point='train.py',  
    instance_type='ml.p5.48xlarge',  
    instance_count=1,  
    framework_version='2.0.1',  
    py_version='py310',  
    tags=[{'Key': 'priority', 'Value': '100'}],  
)  

# start job  
estimator.fit()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;When the  &lt;em&gt;estimator.fit()&lt;/em&gt;  function is called, the SageMaker library uploads our code to Amazon S3 and then transforms the request to a boto3 SageMaker client  &lt;a href="https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/sagemaker/client/create_training_job.html"&gt;create_training_job&lt;/a&gt;  request (see  &lt;a href="https://github.com/aws/sagemaker-python-sdk/blob/v2.208.0/src/sagemaker/session.py#L976"&gt;here&lt;/a&gt;).&lt;/p&gt;

&lt;p&gt;This method for starting up training jobs is dependent on the availability of the requested resources for its success. In our scenario of scarce AI resources, it is likely to fail more often than not. Although this can be partially mitigated by  &lt;a href="https://medium.com/@chaimrand/retaining-amazon-sagemaker-instance-capacity-with-sagemaker-managed-warm-pools-f7cfd78fa34c"&gt;retaining provisioned compute instances for successive workloads&lt;/a&gt;, the API does  &lt;em&gt;not&lt;/em&gt; provide the appropriate tooling for maximizing their utility. Let’s suppose that we wish to utilize precisely two  &lt;a href="https://aws.amazon.com/ec2/instance-types/p5/"&gt;p5.48xlarge&lt;/a&gt;  instances. To simplify our discussion, let’s assume that each training workload runs on a single instance. Typically, during an AI model development cycle there will be periods when there are more than two training workloads that are waiting to be processed. The existing API would try to start up a third  &lt;a href="https://aws.amazon.com/ec2/instance-types/p5/"&gt;p5.48xlarge&lt;/a&gt;  instance and would most likely fail due to its limited availability. Even when there is instance availability, we may wish to limit our training to just our two designated instances to increase our control over the costs of training.&lt;/p&gt;

&lt;p&gt;We require a new API for submitting jobs for training, one that does not immediately start up a new  &lt;a href="https://aws.amazon.com/ec2/instance-types/p5/"&gt;p5.48xlarge&lt;/a&gt;  instance, but rather enters the jobs to a priority queue. And we need an associated job scheduler that manages the use of our two resources while prioritizing critical workloads.&lt;/p&gt;

&lt;p&gt;Importantly, please note that as of the time of this writing, Amazon SageMaker does  &lt;em&gt;not&lt;/em&gt; support the option of training on  &lt;a href="https://aws.amazon.com/ec2/pricing/reserved-instances/"&gt;reserved Amazon EC2 instances&lt;/a&gt;. And although  &lt;a href="https://aws.amazon.com/savingsplans/ml-pricing/"&gt;Amazon SageMaker Savings Plans&lt;/a&gt;  has similar properties to instance reservations, it does  &lt;em&gt;not&lt;/em&gt; guarantee instance capacity. In a  &lt;a href="https://chaimrand.medium.com/f7cfd78fa34c"&gt;previous post&lt;/a&gt;  we addressed this limitation and proposed using  &lt;a href="https://docs.aws.amazon.com/sagemaker/latest/dg/train-warm-pools.html"&gt;SageMaker managed warm pools&lt;/a&gt;  as an alternative method for retaining access to provisioned instances. For the remainder of the post, we will assume that we are able to attain two instances of our choice whether it be through this or some other method.&lt;/p&gt;

&lt;h1&gt;
  
  
  Priority-Based Scheduling for Amazon SageMaker
&lt;/h1&gt;

&lt;p&gt;In this section we will describe the components of our proposed solution. We will use the  &lt;a href="https://docs.aws.amazon.com/serverless-application-model/latest/developerguide/sam-specification.html"&gt;AWS Serverless Application Model (SAM) specification&lt;/a&gt;. More specifically, we will create an  &lt;a href="https://docs.aws.amazon.com/serverless-application-model/latest/developerguide/sam-specification-template-anatomy.html"&gt;AWS SAM template YAML file&lt;/a&gt;  and gradually add the AWS resources that we need. Please see the  &lt;a href="https://docs.aws.amazon.com/serverless-application-model/latest/developerguide/what-is-sam.html"&gt;documentation&lt;/a&gt;  for details on how to define and deploy serverless solutions using AWS SAM.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--Y27unoY6--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://miro.medium.com/v2/resize:fit:560/1%2AuBhGUEeJH9nUcMjLCup-6g.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--Y27unoY6--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://miro.medium.com/v2/resize:fit:560/1%2AuBhGUEeJH9nUcMjLCup-6g.png" alt="" width="560" height="420"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;AWS Architecture Diagram (by Author)&lt;/p&gt;

&lt;h1&gt;
  
  
  A Private API for Submitting Training Jobs
&lt;/h1&gt;

&lt;p&gt;We start by using  &lt;a href="https://aws.amazon.com/api-gateway/"&gt;Amazon API Gateway&lt;/a&gt;  to define a  &lt;a href="https://docs.aws.amazon.com/apigateway/latest/developerguide/apigateway-private-apis.html"&gt;private REST API&lt;/a&gt;  for submitting training job requests. We name the API  &lt;em&gt;training-job-queue&lt;/em&gt;. Later, we will add a POST method called  &lt;em&gt;add-job&lt;/em&gt;  and modify our training-job creation code to use this method instead of the SageMaker client  &lt;a href="https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/sagemaker/client/create_training_job.html"&gt;create_training_job&lt;/a&gt;  API. The code block below contains the definition of the private  &lt;a href="https://docs.aws.amazon.com/serverless-application-model/latest/developerguide/sam-resource-api.html"&gt;API resource&lt;/a&gt;  in SAM. In practice you will likely want to specify access limitations to the API and/or a method of authorization.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;AWSTemplateFormatVersion: '2010-09-09'  
Transform: AWS::Serverless-2016-10-31  

Resources:  
  InternalAPI:  
    Type: AWS::Serverless::Api  
    Properties:  
      # Auth: # Add access control to API  
      EndpointConfiguration:  
        Type: PRIVATE  
        # VPCEndpointIds: # Specify VPC Endpoint(s)  
      Name: training-job-queue  
      StageName: prod
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Define an AWS DynamoDB Table for Storing Training Job Requests
&lt;/h2&gt;

&lt;p&gt;We will use an  &lt;a href="https://aws.amazon.com/pm/dynamodb/"&gt;Amazon DynamoDB&lt;/a&gt;  table named  &lt;em&gt;sagemaker-queue&lt;/em&gt;  to store the submitted training workloads. Each entry will have the following fields:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt; jobName: Stores the unique name of the training job.&lt;/li&gt;
&lt;li&gt; entryTime: Stores the date and time that the job was added.&lt;/li&gt;
&lt;li&gt; jobState: Stores the current state of the training job. The valid values are ‘pending’, ‘running’, and ‘preempted’.&lt;/li&gt;
&lt;li&gt; priority: Stores an integer value representing the relative priority of the job.&lt;/li&gt;
&lt;li&gt; jobDetails: Stores the details of the job request.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;We define our DynamoDB table in our SAM template YAML file using the  &lt;a href="https://docs.aws.amazon.com/serverless-application-model/latest/developerguide/sam-resource-simpletable.html"&gt;AWS::Serverless::SimpleTable&lt;/a&gt;  resource.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt; DynamoSMQueue:  
    Type: AWS::Serverless::SimpleTable  
    Properties:  
      PrimaryKey:  
        Name: jobName  
        Type: String  
      TableName: sagemaker-queue
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;We define a function that creates a table entry from a given training job request. We assume that request contains the same contents as the input to the  &lt;a href="https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/sagemaker/client/create_training_job.html"&gt;create_training_job&lt;/a&gt;  API in JSON format. We further assume that the  &lt;em&gt;priority&lt;/em&gt;  of the workload is entered as a key-value  &lt;a href="https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_Tag.html"&gt;tag&lt;/a&gt;  in the training job definition.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import json, boto3, datetime  

dynamodb = boto3.resource('dynamodb')  
table = dynamodb.Table('sagemaker-queue')  

def add_job_entry(job_json):  
    job_details = json.loads(job_json)  

    # extract job_name  
    job_name = job_details['TrainingJobName']  
    print(f'add entry {job_name}')  

    # get current time  
    entry_time = datetime.now().strftime("%Y-%m-%dT%H:%M:%S")  

    # default priority is 0  
    priority = 0  

    # update priority based on tags  
    tags = job_details['Tags']  
    for tag in tags:  
        if tag['Key'] == 'priority':  
            priority = int(tag['Value'])  
            break  

    # create entry  
    entry = {  
       'jobName': job_name,  
       'entryTime': entry_time,  
       'jobState': 'pending',  
       'priority': priority,  
       'jobDetails': job_json  
    }  
    table.put_item(Item=entry) #TODO handle errors  
    print(f'Added job {job_name} to queue')
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The REST API  &lt;em&gt;add-job&lt;/em&gt; method that we will soon define will be programmed to call the  &lt;em&gt;add_job_entry&lt;/em&gt; function.&lt;/p&gt;

&lt;p&gt;We define a second function that extracts the pending jobs from the database and returns them in order of priority. In the case that multiple jobs have the same priority, they are ordered according to the amount of time they have been waiting in the queue.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;from boto3.dynamodb.conditions import Attr  

# Get a list of all pending jobs sorted by priority  
def get_pending_jobs():  
    response = table.scan(  
        ProjectionExpression='jobName, priority, entryTime',  
        FilterExpression=Attr('jobState').ne('running')  
    )  
    jobs = response.get('Items', [])  

    # sort jobs, first by priority (descending) and then by entryTime  
    sorted_jobs = sorted(jobs,  
                         key=lambda x: (-x['priority'], x['entryTime']))  

    return sorted_jobs
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The following utility functions will come in handy in the next sections.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Get a jobName -&amp;gt; priority mapping of all running jobs  
def get_running_jobs_dict():  
    # Get all running jobs  
    response = table.scan(  
        ProjectionExpression="jobName, priority",  
        FilterExpression=Attr('jobState').eq('running')  
    )  
    jobs = response.get('Items', [])  

    running_jobs = {job['jobName']: job['priority'] for job in jobs}  

    return running_jobs  

# Print the queue state  
def print_queue_state():  
    response = table.scan(  
        ProjectionExpression='jobName, jobState, priority'  
    )  
    jobs = response.get('Items', [])  

    print_table = []  
    for job in jobs:  
        print_table.append([job['jobName'], job['jobState'], job['priority']])  

    # sort by priority  
    sorted_table = sorted(print_table,  
                         key=lambda x: -x[2])  
    # Print the table  
    from tabulate import tabulate  
    print(tabulate(sorted_table, headers=['Job Name', 'State', 'Priority']))  

# get job details  
def get_job_details(job_name):  
    response = table.get_item(  
        Key={'jobName': job_name},  
        ProjectionExpression='jobDetails'  
    )  
    return json.loads(response.get('Item').get('jobDetails'))  

# get job state or None if the job does not exist  
def get_job_state(job_name):  
    response = table.get_item(  
        Key={'jobName': job_name},  
        ProjectionExpression='jobState'  
    )  
    job = response.get('Item')  
    return job.get('jobState') if job else None  

# update the job state  
def update_job_state(job_name, new_state):  
    table.update_item(  
        Key={'jobName': job_name},  
        UpdateExpression="SET jobState = :new_state",  
        ExpressionAttributeValues={":new_state": new_state}  
    )  
    print(f'Update job {job_name} to {new_state}')  

# remove a job entry  
def remove_job(job_name):  
    table.delete_item(  
        Key={'jobName': job_name}  
    )  
    print(f'Removed job {job_name} from queue')
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Both our choice of DynamoDB and its usage (e.g., our use of the  &lt;a href="https://docs.aws.amazon.com/amazondynamodb/latest/APIReference/API_Scan.html"&gt;Scan&lt;/a&gt;  API rather than the  &lt;a href="https://docs.aws.amazon.com/amazondynamodb/latest/APIReference/API_Query.html"&gt;Query&lt;/a&gt;  API) assume that the overall number of jobs in our queue will be in the dozens, at most. For a larger scale solution, you may be better off with a heavier duty database (e.g., one that performs the sorting operation for you) or a more sophisticated use of DynamoDB (e.g., see  &lt;a href="https://aws.amazon.com/blogs/database/implementing-priority-queueing-with-amazon-dynamodb/"&gt;here&lt;/a&gt;).&lt;/p&gt;
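&lt;p&gt;The ordering rule used by  &lt;em&gt;get_pending_jobs&lt;/em&gt;  (priority descending, then entry time ascending) can also be maintained incrementally with a heap. The sketch below is a purely in-memory illustration of that rule; the  &lt;em&gt;JobQueue&lt;/em&gt;  class and its methods are our own invention, not a replacement for the DynamoDB-backed queue:&lt;/p&gt;

```python
import heapq
from datetime import datetime

# In-memory illustration of the queue's ordering rule. heapq is a
# min-heap, so we negate the priority to pop the highest-priority
# job first; the entry-time string breaks ties (earlier wins).
class JobQueue:
    def __init__(self):
        self._heap = []

    def add_job(self, job_name, priority, entry_time=None):
        if entry_time is None:
            entry_time = datetime.now().strftime("%Y-%m-%dT%H:%M:%S")
        heapq.heappush(self._heap, (-priority, entry_time, job_name))

    def pop_next(self):
        # return the pending job that should run next
        _, _, job_name = heapq.heappop(self._heap)
        return job_name

queue = JobQueue()
queue.add_job('job1', priority=1, entry_time='2024-03-01T10:00:00')
queue.add_job('job2', priority=2, entry_time='2024-03-01T10:01:00')
queue.add_job('job3', priority=1, entry_time='2024-03-01T10:02:00')
print(queue.pop_next())  # job2 (highest priority)
print(queue.pop_next())  # job1 (same priority as job3, earlier entry)
```

&lt;p&gt;Because the entry time is an ISO-formatted string, lexicographic comparison matches chronological order, so ties resolve in favor of the job that has waited longest.&lt;/p&gt;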

&lt;h2&gt;
  
  
  Define the Training Job Queue Manager
&lt;/h2&gt;

&lt;p&gt;The main component of our solution is the training job scheduler. Here we implement a rather simple manager that performs the following steps:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt; Extract the list of queued jobs, ordered by priority. If none exist, return.&lt;/li&gt;
&lt;li&gt; Discover unused instance capacity. For each free instance, start one pending job on SageMaker. If no jobs remain after that, return.&lt;/li&gt;
&lt;li&gt; Calculate the number of SageMaker jobs in the  &lt;em&gt;Stopping&lt;/em&gt; state. If this is greater than or equal to the number of remaining pending jobs, return.&lt;/li&gt;
&lt;li&gt; Assess the need for preemption of running SageMaker jobs by comparing their  &lt;em&gt;priorities&lt;/em&gt; to those of our pending jobs.
&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# set the limit on total number of instances/jobs  
MAX_CAPACITY = 2  

sagemaker = boto3.client('sagemaker')  

# apply a queue stamp to identify that the job came from the queue  
def apply_qstamp(job_name):  
    return f'{job_name}-qstamp-{datetime.now().strftime("%d%H%M")}'  

# strip the queue stamp  
def strip_qstamp(job_name):  
    return job_name.split('-qstamp-')[0]  

# start a SageMaker job and update job entry in queue  
def start_job(job_name):  
    print(f'start job {job_name}')  
    job_details = get_job_details(job_name)  
    if job_details:  
        job_details['TrainingJobName'] = apply_qstamp(job_name)  
        # start job with details from queue  
        # (you may optionally overwrite fields such as the iam role)  
        response = sagemaker.create_training_job(**job_details)  
        if response['ResponseMetadata']['HTTPStatusCode'] == 200:  
            print(f'started job {job_name}')  
            update_job_state(job_name, 'running')  

# preempt a SageMaker job and update job entry in queue  
def preempt_job(job_name):  
    print(f'preempt job {job_name}')  
    response = sagemaker.stop_training_job(TrainingJobName=job_name)  
    if response['ResponseMetadata']['HTTPStatusCode'] == 200:  
        print(f'preempted job {job_name}')  
        update_job_state(strip_qstamp(job_name), 'preempted')  

# get SageMaker jobs with the given status  
def get_sagemaker_jobs(status):  
    response = sagemaker.list_training_jobs(StatusEquals=status)  
    return response.get('TrainingJobSummaries', [])  

# queue manager  
def manage_queue():  
    # extract pending jobs to run  
    pending = get_pending_jobs()  

    if not pending:  
        return  

    if len(pending) &amp;gt; MAX_CAPACITY:  
        pending = pending[:MAX_CAPACITY]  

    # get running sagemaker jobs  
    running = get_sagemaker_jobs('InProgress')  
    total_running = len(running)  

    # get stopping sagemaker jobs  
    stopping = get_sagemaker_jobs('Stopping')  
    total_stopping = len(stopping)  

    # calculate the number of free instances   
    free_slots = MAX_CAPACITY - total_running - total_stopping  

    jobs_to_start = min(len(pending), free_slots)  

    # for each free instance, start a job  
    for i in range(jobs_to_start):  
        start_job(pending[i].get('jobName'))  

    still_pending = pending[jobs_to_start:]  

    if not still_pending:  
        return  

    # assume that 'total_stopping' number of jobs will start soon  
    test_for_preemption = len(still_pending) - total_stopping  
    if test_for_preemption &amp;lt;= 0:  
        return  

    # check if preemption is required  
    test_priority = still_pending[total_stopping:]  

    running_jobs = get_running_jobs_dict()  
    priority_dict = {}  
    for job in running:  
        job_name = job['TrainingJobName']  
        priority_dict[job_name] = running_jobs[strip_qstamp(job_name)]  

    # sort running jobs from lowest to highest priority  
    sorted_running = sorted(priority_dict.items(), key=lambda item: item[1])  

    index = 0  
    while index &amp;lt; test_for_preemption and \  
          test_priority[index].get('priority') &amp;gt; sorted_running[index][1]:  
        preempt_job(sorted_running[index][0])  
        index = index + 1
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Important notes:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt; Our implementation is highly optimistic in the sense that we assume that all the jobs that are inserted are valid and that we will be able to start them up on SageMaker without issue. In practice, appropriate error handling should be added (e.g., removing faulty jobs from the queue with appropriate logging).&lt;/li&gt;
&lt;li&gt; In a production environment, we would need to take into consideration the likely occurrence of a  &lt;a href="https://en.wikipedia.org/wiki/Race_condition"&gt;race condition&lt;/a&gt;  when our  &lt;em&gt;queue_manager&lt;/em&gt; is triggered by multiple concurrent events. There are several ways of addressing this problem (e.g., see  &lt;a href="https://medium.com/@moradiyabhavik/race-condition-understanding-and-solution-926b6d2808cf"&gt;here&lt;/a&gt;) including enforcing atomicity (e.g., by setting our  &lt;a href="https://docs.aws.amazon.com/lambda/latest/dg/configuration-concurrency.html"&gt;Lambda function concurrency&lt;/a&gt;  to one), using some form of locking mechanism (e.g., as done  &lt;a href="https://aws.amazon.com/blogs/database/implementing-priority-queueing-with-amazon-dynamodb/"&gt;here&lt;/a&gt;), or making our function  &lt;a href="https://en.wikipedia.org/wiki/Idempotence"&gt;idempotent&lt;/a&gt;. Here we have taken the approach of what we call “optimistic idempotence”, where we rely on appropriate use of the API and on the idempotency of our underlying calls to the SageMaker APIs.&lt;/li&gt;
&lt;li&gt; We emphasize that our implementation is naïve. In practice, we recommend a more sophisticated algorithm that 1) accounts for the use of different types of instances and jobs that require more than one instance, 2) takes all edge cases into consideration, and 3) is tailored towards the specific needs of your project.&lt;/li&gt;
&lt;/ol&gt;
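
&lt;p&gt;To make note 2 concrete, the sketch below mimics a conditional (compare-and-set) state transition in plain Python. In DynamoDB, a similar guard could be expressed with a  &lt;em&gt;ConditionExpression&lt;/em&gt;  on  &lt;em&gt;update_item&lt;/em&gt;; the in-memory  &lt;em&gt;JobTable&lt;/em&gt;  class here is purely illustrative:&lt;/p&gt;

```python
# Illustrative in-memory analog of a conditional (compare-and-set)
# state update: the transition succeeds only if the job is still in
# the state the caller expects it to be in.
class JobTable:
    def __init__(self):
        self._state = {}

    def put(self, job_name, state):
        self._state[job_name] = state

    def transition(self, job_name, expected, new_state):
        # atomic check-then-set: fail if another caller got here first
        if self._state.get(job_name) == expected:
            self._state[job_name] = new_state
            return True
        return False

table = JobTable()
table.put('job1', 'pending')

# two schedulers race to start the same job; only one "wins"
first = table.transition('job1', expected='pending', new_state='running')
second = table.transition('job1', expected='pending', new_state='running')
print(first, second)  # True False
```

&lt;p&gt;With a guard of this kind, two concurrent invocations of the queue manager that race to start the same pending job cannot both mark it as running.&lt;/p&gt;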

&lt;h2&gt;
  
  
  Define the AWS Lambda Function
&lt;/h2&gt;

&lt;p&gt;The next component of the solution is the Lambda function. The following code block includes the  &lt;a href="https://docs.aws.amazon.com/serverless-application-model/latest/developerguide/sam-resource-function.html"&gt;SAM&lt;/a&gt;  definition of our serverless function. We program the function to run on two different types of events: any call to  &lt;a href="https://docs.aws.amazon.com/serverless-application-model/latest/developerguide/sam-property-function-api.html"&gt;&lt;em&gt;add-job&lt;/em&gt;&lt;/a&gt;  on our private API gateway and a  &lt;a href="https://docs.aws.amazon.com/sagemaker/latest/dg/automating-sagemaker-with-eventbridge.html"&gt;change to the state of a SageMaker training job&lt;/a&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt; ManagedTrainingJobQueue:  
    Type: AWS::Serverless::Function  
    Properties:  
      CodeUri: job-queue/ # the directory containing our index.py file  
      Handler: index.lambda_handler  
      Runtime: python3.12  
      Architectures:  
        - arm64 # use graviton  
      Policies: # allow access to SageMaker and DynamoDB  
        - !Sub "arn:${AWS::Partition}:iam::aws:policy/AmazonSageMakerFullAccess"  
        - DynamoDBCrudPolicy:  
            TableName: !Ref DynamoSMQueue  
      Events:  
        CreateTraining:  
          Type: Api  
          Properties:  
            Path: /add-job  
            Method: post  
            RestApiId: !Ref InternalAPI  
        SageMakerEvent:  
          Type: EventBridgeRule  
          Properties:  
            Pattern:  
              source:  
                - aws.sagemaker  
              detail-type:  
                - SageMaker Training Job State Change  
              detail:  
                TrainingJobStatus:  
                  - "Completed"  
                  - "Failed"  
                  - "Stopped"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The  &lt;em&gt;lambda_handler&lt;/em&gt; function is implemented as follows:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;def lambda_handler(event, context):  
    # identify source of event and take appropriate action  
    if 'requestContext' in event and 'apiId' in event['requestContext']:  
        print('Lambda triggered by API Gateway')  
        # add_job_entry expects the raw JSON string from the request body  
        add_job_entry(event.get('body'))  
    elif 'source' in event and event['source'] == 'aws.sagemaker':  
        print('Lambda triggered by SageMaker job state change')  
        job_name = event['detail']['TrainingJobName']  
        job_status = event['detail']['TrainingJobStatus']  
        print(f'{job_name} status changed to {job_status}')  

        # strip qstamp from job_name  
        job_name = strip_qstamp(job_name)  

        if job_status in ['Completed' , 'Failed']:  
            remove_job(job_name)  
        elif job_status == 'Stopped':  
            # check if it was manually stopped or preempted by queue manager  
            if get_job_state(job_name) == 'preempted':  
                print(f'job {job_name} preemption completed')  
            else:  
                print(f'job {job_name} {job_status}, remove from queue')  
                remove_job(job_name)  

    # in all cases invoke queue manager  
    manage_queue()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Intercept the Create Training Job Request
&lt;/h2&gt;

&lt;p&gt;The final modification required to make our solution complete is to intercept the call to the SageMaker  &lt;a href="https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/sagemaker/client/create_training_job.html"&gt;create_training_job&lt;/a&gt;  API and reroute it to our  &lt;em&gt;add-job&lt;/em&gt; method. We do this by overriding the  &lt;a href="https://github.com/aws/sagemaker-python-sdk/blob/v2.208.0/src/sagemaker/session.py#L6212"&gt;_intercept_create_request&lt;/a&gt;  function of the  &lt;a href="https://sagemaker.readthedocs.io/en/stable/api/utility/session.html"&gt;SageMaker Session class&lt;/a&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;from sagemaker.pytorch import PyTorch  
from sagemaker.session import Session  
import requests, json, logging  
logger = logging.getLogger('sagemaker')  

def submit_to_training_queue(job):  
    logger.info(f"Adding training-job {job['TrainingJobName']} to queue")  
    logger.debug(f'train request: {json.dumps(job, indent=4)}')  

    vpce='&amp;lt;vpc endpoint&amp;gt;' # insert id of vpc endpoint  
    region='us-east-1' # specify region  
    url=f'https://{vpce}.execute-api.{region}.vpce.amazonaws.com/prod/add-job'  
    headers = {'x-apigw-api-id': '&amp;lt;api-id&amp;gt;'} # insert api gateway id  

    # submit job  
    response = requests.post(url, headers=headers, json=job)  

class QueueTrainingJobSession(Session):  
    def _intercept_create_request(self, request, create, func_name = None):  
        """This function intercepts the create job request  

        Args:  
          request (dict): the create job request  
          create (functor): a functor calls the sagemaker client create method  
          func_name (str): the name of the function needed intercepting  
        """  
        if func_name == 'train':  
            submit_to_training_queue(request)  
        else:  
            super()._intercept_create_request(request,create,func_name)  

# define job  
estimator = PyTorch(  
    role='&amp;lt;sagemaker role&amp;gt;',  
    entry_point='train.py',  
    instance_type='ml.p5.48xlarge',  
    instance_count=1,  
    framework_version='2.0.1',  
    py_version='py310',  
    tags=[{'Key': 'priority', 'Value': '100'}],  
    keep_alive_period_in_seconds=60, # keep warm for 1 minute  
    # use our custom Session class  
    sagemaker_session=QueueTrainingJobSession()  
)  

estimator.fit(wait=False)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h1&gt;
  
  
  Use Case Example
&lt;/h1&gt;

&lt;p&gt;To test our solution we submit the following sequence of jobs. After each call we print the status of the queue (using the  &lt;em&gt;print_queue_state&lt;/em&gt;  function) and sleep for twenty seconds.&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt; Start job1 with priority 1.&lt;/li&gt;
&lt;li&gt; Start job2 with priority 2.&lt;/li&gt;
&lt;li&gt; Start job3 with priority 1.&lt;/li&gt;
&lt;li&gt; Start job4 with priority 3.&lt;/li&gt;
&lt;/ol&gt;
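
&lt;p&gt;The queue transitions that this sequence produces can be reproduced with a small stand-alone simulation of the manager's decision rule: a capacity of two, with the lowest-priority running job preempted whenever a strictly higher-priority job is waiting. This is a toy model of the logic above, not the actual Lambda code:&lt;/p&gt;

```python
from itertools import count

MAX_CAPACITY = 2
_entry = count()          # monotonically increasing submission stamp

running = {}              # job_name -> (priority, entry stamp)
pending = []              # list of (job_name, priority, entry stamp)

def next_pending():
    # highest priority first, then earliest submission
    return sorted(pending, key=lambda j: (-j[1], j[2]))

def submit(job_name, priority):
    stamp = next(_entry)
    if len(running) < MAX_CAPACITY:
        running[job_name] = (priority, stamp)
        return
    # capacity full: preempt the lowest-priority running job only if
    # it has strictly lower priority than the new job
    victim = min(running, key=lambda j: running[j][0])
    if running[victim][0] < priority:
        vprio, vstamp = running.pop(victim)
        pending.append((victim, vprio, vstamp))
        running[job_name] = (priority, stamp)
    else:
        pending.append((job_name, priority, stamp))

def finish(job_name):
    running.pop(job_name)
    if pending:
        # resume/start the highest-priority, longest-waiting job
        name, prio, stamp = next_pending()[0]
        pending.remove((name, prio, stamp))
        running[name] = (prio, stamp)

submit('job1', 1)
submit('job2', 2)
submit('job3', 1)         # waits: no free slot, nothing lower-priority
submit('job4', 3)         # preempts job1
finish('job2')            # job1 resumes before job3 (earlier entry)
print(sorted(running))    # ['job1', 'job4']
```

&lt;p&gt;Note that a preempted job keeps its original entry stamp, which is why  &lt;em&gt;job1&lt;/em&gt;  resumes ahead of  &lt;em&gt;job3&lt;/em&gt;  even though both have priority 1.&lt;/p&gt;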

&lt;p&gt;The first two jobs are immediately submitted to SageMaker and updated to the  &lt;em&gt;running&lt;/em&gt;  state. Since the third job has low priority and we have precisely two training instances, it remains in the  &lt;em&gt;pending&lt;/em&gt;  state and waits its turn. After submitting the first three jobs, the queue state appears as:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Job Name    State      Priority  
----------  -------  ----------  
job2        running           2  
job1        running           1  
job3        pending           1
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The fourth job we submit has a higher priority than all of the jobs in the queue. Consequently, the running job with the lowest priority,  &lt;em&gt;job1&lt;/em&gt;, is preempted. The corresponding SageMaker job is stopped and once the instance is released, the queue state becomes:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Job Name    State        Priority  
----------  ---------  ----------  
job4        running             3  
job2        running             2  
job1        preempted           1  
job3        pending             1
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The SageMaker job running  &lt;em&gt;job2&lt;/em&gt;  is the first to finish,  &lt;em&gt;job2&lt;/em&gt; is removed from the queue, and our preempted job is resumed:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Job Name    State      Priority  
----------  -------  ----------  
job4        running           3  
job1        running           1  
job3        pending           1
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Once  &lt;em&gt;job4&lt;/em&gt;  is completed, it too is removed from the queue, making room for  &lt;em&gt;job3&lt;/em&gt;. The remaining jobs are also run to completion, ultimately leaving our queue empty.&lt;/p&gt;

&lt;h1&gt;
  
  
  Summary
&lt;/h1&gt;

&lt;p&gt;The increasing difficulty of acquiring AI compute capacity has forced AI development teams to reevaluate the processes they use for training AI models. The approach we have demonstrated in this post is to augment the traditional APIs for training models with a custom-made priority queue and an associated job scheduler. Importantly, the proposal we have put forth should be viewed as a general scheme, not as a production-worthy solution. Appropriate modifications and enhancements would be required to address the specific needs of your project.&lt;/p&gt;

</description>
      <category>sagemaker</category>
      <category>aws</category>
      <category>deeplearning</category>
      <category>mlop</category>
    </item>
    <item>
      <title>Retaining Amazon SageMaker Instance Capacity with SageMaker Managed Warm Pools</title>
      <dc:creator>Chaim Rand</dc:creator>
      <pubDate>Mon, 04 Mar 2024 09:39:21 +0000</pubDate>
      <link>https://dev.to/aws-builders/retaining-amazon-sagemaker-instance-capacity-with-sagemaker-managed-warm-pools-4na4</link>
      <guid>https://dev.to/aws-builders/retaining-amazon-sagemaker-instance-capacity-with-sagemaker-managed-warm-pools-4na4</guid>
      <description>&lt;h2&gt;
  
  
  An Alternative Solution to Cloud Instance Reservation
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--yfj-uVup--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://miro.medium.com/v2/resize:fit:700/0%2A9Pk4gA4I9wvgkCRq" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--yfj-uVup--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://miro.medium.com/v2/resize:fit:700/0%2A9Pk4gA4I9wvgkCRq" alt="" width="700" height="467"&gt;&lt;/a&gt;&lt;br&gt;
Photo by  &lt;a href="https://unsplash.com/@flowinteractive?utm_source=medium&amp;amp;utm_medium=referral"&gt;Ivan Slade&lt;/a&gt;  on  &lt;a href="https://unsplash.com/?utm_source=medium&amp;amp;utm_medium=referral"&gt;Unsplash&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;In previous posts (e.g.,  &lt;a href="https://towardsdatascience.com/6-steps-to-migrating-your-machine-learning-project-to-the-cloud-6d9b6e4f18e0"&gt;here&lt;/a&gt;  and  &lt;a href="https://towardsdatascience.com/a-simple-solution-for-managing-cloud-based-ml-training-c80a69c6939a"&gt;here&lt;/a&gt;), we covered some of the pros and cons of training ML workloads using  &lt;a href="https://aws.amazon.com/sagemaker/"&gt;Amazon SageMaker&lt;/a&gt;. In this post we address one of its more inconvenient limitations: its lack of support (as of the time of this writing) for training on  &lt;a href="https://aws.amazon.com/ec2/pricing/reserved-instances/"&gt;reserved Amazon EC2 instances&lt;/a&gt;. This limitation has become more and more restrictive of late due to the increasing difficulty of acquiring the required instance types in a reliable and timely fashion. Recent advances in the field of generative AI have led to  &lt;a href="https://www.google.com/search?q=nvidia+cant+keep+up+with+the+demand&amp;amp;rlz=1C1GCEA_enIL1069IL1074&amp;amp;oq=nvidia+cant+keep+up+with+the+demand&amp;amp;gs_lcrp=EgZjaHJvbWUyCwgAEEUYChg5GKABMgkIARAhGAoYoAEyCQgCECEYChigATIJCAMQIRgKGKAB0gEINzI4OGowajeoAgCwAgA&amp;amp;sourceid=chrome&amp;amp;ie=UTF-8"&gt;unprecedented demand&lt;/a&gt;  for AI compute while challenges in the  &lt;a href="https://en.wikipedia.org/wiki/2021%E2%80%932023_global_supply_chain_crisis"&gt;global supply chain&lt;/a&gt;  continue to linger. In this post we propose a partial mitigation to this limitation using  &lt;a href="https://docs.aws.amazon.com/sagemaker/latest/dg/train-warm-pools.html"&gt;SageMaker managed warm pools&lt;/a&gt;, which allow you, under certain circumstances, to retain access to provisioned instance capacity for successive training workloads. 
Not only will this hold on to acquired capacity for as long as you need (up to four weeks), but it can also reduce the latency between experiments.&lt;/p&gt;
&lt;h1&gt;
  
  
  Training With Managed Warm Pools - Example
&lt;/h1&gt;

&lt;p&gt;In the example below, we start up a PyTorch training job on an instance of type  &lt;a href="https://aws.amazon.com/ec2/instance-types/p5/"&gt;p5.48xlarge&lt;/a&gt;  using the  &lt;a href="https://sagemaker.readthedocs.io/en/stable/"&gt;Amazon SageMaker Python SDK&lt;/a&gt;  (version 2.208). We use the  &lt;em&gt;keep_alive_period_in_seconds&lt;/em&gt;  setting to configure the instance to remain  &lt;em&gt;warm&lt;/em&gt;  for one minute.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;from sagemaker.pytorch import PyTorch  

# define job  
estimator = PyTorch(  
    role='&amp;lt;sagemaker role&amp;gt;',  
    entry_point='train.py',  
    instance_type='ml.p5.48xlarge',  
    instance_count=1,  
    framework_version='2.0.1',  
    py_version='py310',  
    keep_alive_period_in_seconds=60 # keep warm for 1 minute  
)  

# start job  
estimator.fit()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;As long as we start up another job with  &lt;a href="https://docs.aws.amazon.com/sagemaker/latest/dg/train-warm-pools.html#train-warm-pools-matching-criteria"&gt;matching settings&lt;/a&gt;  within the sixty seconds allotted, the same instance will be retained for the next job. Thus, by configuring the use of SageMaker warm pools we have guaranteed instance capacity for our next workload. As an added bonus, the  &lt;em&gt;start-up time&lt;/em&gt; of the second workload will be noticeably reduced since the instance has already been provisioned.&lt;/p&gt;
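
&lt;p&gt;Because warm pool reuse hinges on the successive jobs having matching settings, one way to reduce the chance of an accidental mismatch is to factor the shared settings into a single helper, as in the sketch below. Note that the helper name and the commented-out role value are our own placeholders, not part of the SageMaker API.&lt;/p&gt;

```python
# Sketch: keep the settings that must match between successive jobs
# in one place, so that each new job qualifies for the warm pool
# instance left behind by the previous one.
def warm_pool_job_settings(keep_alive_seconds=60):
    """Estimator kwargs shared by all jobs targeting the same warm pool."""
    return dict(
        entry_point='train.py',
        instance_type='ml.p5.48xlarge',
        instance_count=1,
        framework_version='2.0.1',
        py_version='py310',
        keep_alive_period_in_seconds=keep_alive_seconds,
    )

# Each successive job is then built from the same settings, e.g.:
#   estimator = PyTorch(role='my-sagemaker-role',  # placeholder role
#                       **warm_pool_job_settings())
#   estimator.fit()
```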

&lt;h2&gt;
  
  
  Limitations of Managed Warm Pools
&lt;/h2&gt;

&lt;p&gt;Although this technique offers an instance capacity guarantee similar to the one provided by  &lt;a href="https://aws.amazon.com/ec2/pricing/reserved-instances/"&gt;Amazon EC2 reservations&lt;/a&gt;  (and without the long-term commitment!!), it is important to note its significant limitations.&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt; The method relies on our ability to secure instance capacity for the first training job. Generally speaking, this is a safe assumption — sooner or later, we will succeed in securing an instance, but it is hard to know how much time and patience will be required.&lt;/li&gt;
&lt;li&gt; The method assumes that our workloads have matching settings, particularly with regards to the number and types of instances. Although AI development teams will frequently have multiple workloads with similar instance requirements, the inability to share resources between jobs with different settings (e.g., with different  &lt;a href="https://docs.aws.amazon.com/sagemaker/latest/dg/sagemaker-roles.html"&gt;SageMaker Roles&lt;/a&gt;) is limiting.&lt;/li&gt;
&lt;li&gt; The method works only if the subsequent workload is started within the specified warm pool duration, with the maximum duration being one hour. Unless we want to constantly monitor our training jobs to detect when they stop, we will need to implement an automated system for submitting new jobs when our provisioned instances become available.&lt;/li&gt;
&lt;li&gt; In cases where a matching training job is not found during the warm pool durations, we still need to pay for the provisioned instance. Thus, there is a certain risk of waste associated with this method and the way it is used (e.g., the most appropriate warm pool duration setting) should be planned accordingly.&lt;/li&gt;
&lt;li&gt; The maximum period of time during which an instance can be retained in this manner is  &lt;a href="https://docs.aws.amazon.com/sagemaker/latest/dg/train-warm-pools.html#train-warm-pools-considerations"&gt;twenty-eight days&lt;/a&gt;.&lt;/li&gt;
&lt;/ol&gt;
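
&lt;p&gt;As a starting point for the automation mentioned in item 3, the warm pool state of a completed job can be polled via the SageMaker API, which reports a &lt;em&gt;WarmPoolStatus&lt;/em&gt; for each training job. The sketch below is illustrative only; the optional client argument is our own addition to make the helper easy to stub in tests.&lt;/p&gt;

```python
def warm_pool_available(job_name, sm_client=None):
    """Return True if the given training job left behind a warm pool
    instance that is ready to be claimed by a matching job."""
    if sm_client is None:
        import boto3  # imported lazily so the helper is easy to stub
        sm_client = boto3.client('sagemaker')
    desc = sm_client.describe_training_job(TrainingJobName=job_name)
    status = desc.get('WarmPoolStatus', {}).get('Status')
    return status == 'Available'

# A simple scheduler could poll this helper for recently completed jobs
# and submit the next queued job (with matching settings) as soon as it
# returns True.
```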

&lt;p&gt;Please see the  &lt;a href="https://docs.aws.amazon.com/sagemaker/latest/dg/train-warm-pools.html#train-warm-pools-considerations"&gt;official documentation&lt;/a&gt;  for more details on how warm pooling works as well as additional  &lt;a href="https://docs.aws.amazon.com/sagemaker/latest/dg/train-warm-pools.html#train-warm-pools-considerations"&gt;considerations&lt;/a&gt;  associated with its use.&lt;/p&gt;

&lt;h1&gt;
  
  
  Reducing Cost with SageMaker Savings Plans
&lt;/h1&gt;

&lt;p&gt;The method we have described for retaining control of instances is relevant for AI teams that have a consistent requirement for AI compute. This is manifested as a continuous backlog of training experiments waiting to be processed. In situations where this requirement is expected to last for an extended period of time,  &lt;a href="https://aws.amazon.com/savingsplans/ml-pricing/"&gt;Amazon SageMaker Savings Plans&lt;/a&gt;  may provide a great opportunity for training-cost savings. SageMaker Savings Plans offer significant discounts in exchange for a commitment to pay for consistent usage. The instance types offered under this plan can vary; please refer to the documentation for the most up-to-date details. Importantly, despite some similarities to  &lt;a href="https://aws.amazon.com/ec2/pricing/reserved-instances/"&gt;Amazon EC2 reservations&lt;/a&gt;, SageMaker Savings Plans do  &lt;em&gt;not&lt;/em&gt;  guarantee instance capacity. However, the method described in this post for retaining control of provisioned instance capacity can help you take full advantage of the instances you have committed to using.&lt;/p&gt;

&lt;p&gt;SageMaker Savings Plans is not for everyone. Make sure to fully understand all the terms of the offering before deciding whether it is the right solution for your team.&lt;/p&gt;

&lt;h1&gt;
  
  
  Summary
&lt;/h1&gt;

&lt;p&gt;A common approach for dealing with the difficulty of acquiring AI/ML compute resources in the cloud is to guarantee capacity by purchasing instance reservations. Unfortunately, as of the time of this writing, Amazon SageMaker does not support instance reservation. In this post, we have demonstrated how SageMaker warm pools can be used to maintain control over instance capacity for successive training workloads. We noted that for this type of solution to be effective, we require some form of mechanism for automating the detection of available warm pool instances and triggering a job with matching settings. In a future post we will propose a solution that addresses this challenge.&lt;/p&gt;

</description>
      <category>sagemaker</category>
      <category>cloudml</category>
      <category>aws</category>
    </item>
    <item>
      <title>Optimizing Instance Type Selection for AI Development in Cloud Spot Markets</title>
      <dc:creator>Chaim Rand</dc:creator>
      <pubDate>Wed, 24 Jan 2024 10:13:37 +0000</pubDate>
      <link>https://dev.to/aws-builders/optimizing-instance-type-selection-for-ai-development-in-cloud-spot-markets-29i1</link>
      <guid>https://dev.to/aws-builders/optimizing-instance-type-selection-for-ai-development-in-cloud-spot-markets-29i1</guid>
      <description>&lt;h2&gt;
  
  
  Instance Selection for Deep Learning — Part 2
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--ogV2x4v4--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://miro.medium.com/v2/resize:fit:700/0%2AVlgMC_E3ruOci979" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--ogV2x4v4--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://miro.medium.com/v2/resize:fit:700/0%2AVlgMC_E3ruOci979" alt="" width="700" height="467"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Photo by  &lt;a href="https://unsplash.com/@mikeenerio?utm_source=medium&amp;amp;utm_medium=referral"&gt;Mike Enerio&lt;/a&gt;  on  &lt;a href="https://unsplash.com/?utm_source=medium&amp;amp;utm_medium=referral"&gt;Unsplash&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This post was written in collaboration with  &lt;a href="https://www.linkedin.com/in/tomerberkovich/"&gt;Tomer Berkovich&lt;/a&gt;,  &lt;a href="https://www.linkedin.com/in/yitzhak-levi-49a217201/"&gt;Yitzhak Levi&lt;/a&gt;, and  &lt;a href="https://www.linkedin.com/in/maxrabin/"&gt;Max Rabin&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Appropriate instance selection for machine learning (ML) workloads is an important decision with potentially significant implications on the speed and cost of development. In a  &lt;a href="https://towardsdatascience.com/instance-selection-for-deep-learning-7463d774cff0"&gt;previous post&lt;/a&gt;  we expanded on this process, proposed a metric for making this important decision, and highlighted some of the many factors you should take into consideration. In this post we will demonstrate the opportunity for reducing AI model training costs by taking  &lt;a href="https://aws.amazon.com/ec2/spot/?cards.sort-by=item.additionalFields.startDateTime&amp;amp;cards.sort-order=asc"&gt;Spot Instance&lt;/a&gt;  availability into account when making your cloud-based instance selection decision.&lt;/p&gt;

&lt;h1&gt;
  
  
  Reducing Costs Using Spot Instances
&lt;/h1&gt;

&lt;p&gt;One of the most significant opportunities for cost savings in the cloud is to take advantage of low-cost  &lt;a href="https://aws.amazon.com/ec2/spot/?cards.sort-by=item.additionalFields.startDateTime&amp;amp;cards.sort-order=asc"&gt;Amazon EC2 Spot Instances&lt;/a&gt;. Spot instances are discounted compute instances drawn from surplus cloud service capacity. In exchange for the discounted price, AWS maintains the right to preempt the instance with little to no warning. Consequently, Spot instance utilization is relevant only to workloads that are fault tolerant. Fortunately, through effective use of  &lt;a href="https://pytorch.org/tutorials/recipes/recipes/saving_and_loading_a_general_checkpoint.html"&gt;model checkpointing&lt;/a&gt;, ML training workloads can be designed to be fault tolerant and to take advantage of the Spot instance offering. In fact, Amazon SageMaker, AWS’s managed service for developing ML, makes it easy to train on Spot instances by  &lt;a href="https://docs.aws.amazon.com/sagemaker/latest/dg/model-managed-spot-training.html"&gt;managing the end-to-end Spot life-cycle&lt;/a&gt;  for you.&lt;/p&gt;
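
&lt;p&gt;Concretely, managed spot training is enabled on a SageMaker estimator with a handful of settings. In the sketch below, the helper function is simply our way of grouping the relevant kwargs, and the S3 checkpoint URI is a placeholder.&lt;/p&gt;

```python
# Sketch: estimator kwargs for fault-tolerant managed spot training.
def spot_training_settings(checkpoint_s3_uri, max_run=24 * 60 * 60):
    """Group the estimator kwargs that enable managed spot training."""
    return dict(
        use_spot_instances=True,
        max_run=max_run,        # cap on actual training time (seconds)
        max_wait=2 * max_run,   # total time allotted, including time spent
                                # waiting for Spot capacity; must be at
                                # least max_run
        checkpoint_s3_uri=checkpoint_s3_uri,  # checkpoints survive preemption
        checkpoint_local_path='/opt/ml/checkpoints',  # synced to S3 by SageMaker
    )

# e.g. (placeholder bucket):
#   estimator = PyTorch(..., **spot_training_settings('s3://my-bucket/ckpts'))
```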

&lt;h1&gt;
  
  
  The Challenge of Anticipating Spot Instance Capacity
&lt;/h1&gt;

&lt;p&gt;Unfortunately,  &lt;em&gt;Spot instance capacity&lt;/em&gt;, which measures the availability of Spot instances for use, is subject to constant fluctuations and can be very difficult to predict. Amazon offers partial assistance in assessing the  &lt;em&gt;Spot instance capacity&lt;/em&gt; of an instance type of choice via its  &lt;a href="https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/spot-placement-score.html"&gt;Spot placement score&lt;/a&gt;  (SPS) feature which indicates the likelihood that a Spot request will succeed in a given  &lt;a href="https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/using-regions-availability-zones.html"&gt;region or availability zone (AZ)&lt;/a&gt;. This is especially helpful when you have the freedom to choose to train your model in one of several different locations. However, the SPS feature offers no guarantees.&lt;/p&gt;
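
&lt;p&gt;Spot placement scores can be retrieved programmatically via the EC2 &lt;em&gt;get_spot_placement_scores&lt;/em&gt; API. The sketch below picks the highest-scoring location; the optional client argument is our own convenience for stubbing the API in tests.&lt;/p&gt;

```python
def best_spot_location(instance_types, target_capacity, ec2_client=None):
    """Return the (Region, AvailabilityZoneId, Score) with the highest
    Spot placement score. Scores range from 1 (low) to 10 (high)."""
    if ec2_client is None:
        import boto3  # imported lazily so the helper is easy to stub
        ec2_client = boto3.client('ec2')
    resp = ec2_client.get_spot_placement_scores(
        InstanceTypes=instance_types,
        TargetCapacity=target_capacity,
        SingleAvailabilityZone=True,  # score individual AZs, not whole regions
    )
    best = max(resp['SpotPlacementScores'], key=lambda s: s['Score'])
    return best['Region'], best.get('AvailabilityZoneId'), best['Score']
```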

&lt;p&gt;When you choose to train a model on one or more Spot instances, you are taking the risk that your instance type of choice does not have any Spot capacity (i.e., your training job will not start), or worse, that you will enter an iterative cycle in which your training repeatedly runs for just a small number of training steps and is stopped before you have made any meaningful progress — which can tally up your training costs without any return.&lt;/p&gt;

&lt;p&gt;Over the past couple of years, the challenges of Spot instance utilization have been particularly acute when it comes to multi-GPU  &lt;a href="https://aws.amazon.com/ec2/"&gt;EC2&lt;/a&gt;  instance types such as  &lt;a href="https://aws.amazon.com/ec2/instance-types/g5/"&gt;g5.12xlarge&lt;/a&gt;  and  &lt;a href="https://aws.amazon.com/ec2/instance-types/p4/"&gt;p4d.24xlarge&lt;/a&gt;. A huge increase in demand for powerful training accelerators (driven in part by advances in the field of generative AI), combined with disruptions in the global supply chain, has made it virtually impossible to reliably depend on multi-GPU Spot instances for ML training. The natural fallback is to use the more costly  &lt;a href="https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/ec2-on-demand-instances.html"&gt;On-Demand&lt;/a&gt;  (OD) or  &lt;a href="https://aws.amazon.com/ec2/pricing/reserved-instances/"&gt;reserved instances&lt;/a&gt;. However, in our  &lt;a href="https://towardsdatascience.com/instance-selection-for-deep-learning-7463d774cff0"&gt;previous post&lt;/a&gt;  we emphasized the value of considering many different alternatives for your choice of instance type. In this post we will demonstrate the potential gains of replacing multi-GPU On-Demand instances with multiple single-GPU Spot instances.&lt;/p&gt;

&lt;p&gt;Although our demonstration will use Amazon Web Services, similar conclusions can be reached on alternative cloud service platforms (CSPs). Please do not interpret our choice of CSP or services as an endorsement. The best option for you will depend on the unique details of your project. Furthermore, please keep in mind that the cost savings we demonstrate may not reproduce for your project, and/or that the solution we propose may not be applicable (e.g., for reasons beyond the scope of this post). Be sure to conduct a detailed evaluation of the relevance and efficacy of the proposal before adapting it to your use case.&lt;/p&gt;

&lt;h1&gt;
  
  
  When Multiple Single-GPU Instances are Better than a Single Multi-GPU Instance
&lt;/h1&gt;

&lt;p&gt;Nowadays, training AI models on multiple GPU devices in parallel, a process called  &lt;em&gt;distributed training&lt;/em&gt;, is commonplace. Setting aside instance pricing, when you have the choice between a single instance with multiple GPUs and multiple single-GPU instances with the same GPU type, you would typically choose the multi-GPU instance. Distributed training typically requires a considerable amount of data communication (e.g., gradient sharing) between the GPUs. The proximity of the GPUs on a single instance is bound to facilitate higher network bandwidth and lower latency. Moreover, some multi-GPU instances include dedicated GPU-to-GPU interconnects that can further accelerate the communication (e.g.,  &lt;a href="https://www.nvidia.com/en-eu/data-center/nvlink/"&gt;NVLink&lt;/a&gt;  on  &lt;a href="https://aws.amazon.com/ec2/instance-types/p4/"&gt;p4d.24xlarge&lt;/a&gt;). However, when Spot capacity is limited to single-GPU instances, the option of training on multiple single-GPU instances at a much lower cost becomes more compelling. At the very least, it warrants evaluation of the opportunity for cost savings.&lt;/p&gt;

&lt;h1&gt;
  
  
  Optimizing Data Communication Between Multiple EC2 Instances
&lt;/h1&gt;

&lt;p&gt;When distributed training runs on multiple instances, the GPUs communicate with one another via the network between the host machines. To optimize the speed of training and reduce the likelihood and/or impact of a network bottleneck, we need to ensure minimal network latency and maximal data throughput. These can be affected by a number of factors.&lt;/p&gt;

&lt;h2&gt;
  
  
  Instance Collocation
&lt;/h2&gt;

&lt;p&gt;Network latency can be greatly impacted by the relative locations of the EC2 instances. Ideally, when we request multiple cloud-based instances we would like them all to be collocated on the same physical rack. In practice, without appropriate configuration, they may not even be in the same city. In our demonstration below we will use a  &lt;a href="https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_VpcConfig.html"&gt;VPC Config&lt;/a&gt;  object to program an Amazon SageMaker training job to use a single subnet of an  &lt;a href="https://docs.aws.amazon.com/sagemaker/latest/dg/train-vpc.html"&gt;Amazon Virtual Private Cloud (VPC)&lt;/a&gt;. This technique will ensure that all the requested training instances will be in the same  &lt;a href="https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/using-regions-availability-zones.html"&gt;availability zone (AZ)&lt;/a&gt;. However, collocation in the same AZ may not suffice. Furthermore, the method we described involves choosing a subnet associated with one specific AZ (e.g., the one with the highest  &lt;a href="https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/spot-placement-score.html"&gt;Spot placement score&lt;/a&gt;). A preferred API would fulfill the request in any AZ that has sufficient capacity.&lt;/p&gt;

&lt;p&gt;A better way to control the placement of our instances is to launch them inside a  &lt;a href="https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/placement-groups.html"&gt;placement group&lt;/a&gt;, specifically a  &lt;a href="https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/placement-groups.html#placement-groups-cluster"&gt;cluster placement group&lt;/a&gt;. Not only will this guarantee that all of the instances will be in the same  &lt;a href="https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/using-regions-availability-zones.html"&gt;AZ&lt;/a&gt;, but it will also place them on “the same high-bisection bandwidth segment of the network” so as to maximize the performance of the network traffic between them. However, as of the time of this writing SageMaker does  &lt;em&gt;not&lt;/em&gt;  provide the option to specify a  &lt;a href="https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/placement-groups.html"&gt;placement group&lt;/a&gt;. To take advantage of placement groups we would need to use an alternative training service solution (as we will demonstrate below).&lt;/p&gt;

&lt;h2&gt;
  
  
  EC2 Network Bandwidth Constraints
&lt;/h2&gt;

&lt;p&gt;Be sure to take into account the maximal  &lt;a href="https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/ec2-instance-network-bandwidth.html"&gt;network bandwidth&lt;/a&gt;  supported by the EC2 instances that you choose. Note, in particular, that the network bandwidths associated with single-GPU machines are often documented as being “up to” a certain number of Gbps. Make sure to understand what that means and how it can impact the speed of training over time.&lt;/p&gt;
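
&lt;p&gt;The documented bandwidth of an instance type can be looked up programmatically via the EC2 &lt;em&gt;describe_instance_types&lt;/em&gt; API, which returns strings such as “Up to 25 Gigabit”. The sketch below is illustrative; the optional client argument is our own convenience for stubbing the API in tests.&lt;/p&gt;

```python
def network_performance(instance_type, ec2_client=None):
    """Return the documented network performance string for an EC2
    instance type, e.g. 'Up to 25 Gigabit'. An 'Up to' prefix means the
    bandwidth is burstable rather than sustained."""
    if ec2_client is None:
        import boto3  # imported lazily so the helper is easy to stub
        ec2_client = boto3.client('ec2')
    resp = ec2_client.describe_instance_types(InstanceTypes=[instance_type])
    return resp['InstanceTypes'][0]['NetworkInfo']['NetworkPerformance']
```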

&lt;p&gt;Keep in mind that the GPU-to-GPU data communication (e.g., gradient sharing) might need to share the limited network bandwidth with other data flowing through the network such as training samples being streamed into the training instances or training artifacts being uploaded to persistent storage. Consider ways of reducing the payload of each of the categories of data to minimize the likelihood of a network bottleneck.&lt;/p&gt;

&lt;h2&gt;
  
  
  Elastic Fabric Adapter (EFA)
&lt;/h2&gt;

&lt;p&gt;A growing number of EC2 instance types support  &lt;a href="https://aws.amazon.com/hpc/efa/"&gt;Elastic Fabric Adapter (EFA)&lt;/a&gt;, a dedicated network interface for optimizing inter-node communication. Using EFA can have a decisive impact on the runtime performance of your training workload. Note that the bandwidth on the EFA network channel is different from the documented bandwidth of the standard network. As of the time of this writing, detailed documentation of the EFA capabilities is hard to come by and it is usually best to evaluate its impact through trial and error. Consider using an  &lt;a href="https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/efa.html#efa-instance-types"&gt;EC2 instance type that supports EFA&lt;/a&gt;  when relevant.&lt;/p&gt;
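
&lt;p&gt;The EC2 &lt;em&gt;describe_instance_types&lt;/em&gt; API also reports whether an instance type supports EFA, which can be used to filter a list of candidate types, as in the sketch below (the optional client argument is our own convenience for stubbing the API in tests).&lt;/p&gt;

```python
def efa_supported_types(instance_types, ec2_client=None):
    """Filter a list of EC2 instance types down to those that support
    Elastic Fabric Adapter (EFA)."""
    if ec2_client is None:
        import boto3  # imported lazily so the helper is easy to stub
        ec2_client = boto3.client('ec2')
    resp = ec2_client.describe_instance_types(InstanceTypes=instance_types)
    return [t['InstanceType'] for t in resp['InstanceTypes']
            if t['NetworkInfo'].get('EfaSupported')]
```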

&lt;h1&gt;
  
  
  Toy Example
&lt;/h1&gt;

&lt;p&gt;We will now demonstrate the comparative price performance of training on four single-GPU  &lt;a href="https://aws.amazon.com/ec2/instance-types/g5/"&gt;EC2 g5&lt;/a&gt;  Spot instances (ml.g5.2xlarge and ml.g5.4xlarge) vs. a single four-GPU On-Demand instance (ml.g5.12xlarge). We will use the training script below containing a Vision Transformer (ViT)-backed classification model (trained on synthetic data).&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import os, torch, time  
import torch.distributed as dist  
from torch.utils.data import Dataset, DataLoader  
from torch.cuda.amp import autocast  
from torch.nn.parallel import DistributedDataParallel as DDP  
from timm.models.vision_transformer import VisionTransformer  

batch_size = 128  
log_interval = 10  

# use random data  
class FakeDataset(Dataset):  
    def __len__(self):  
        return 1000000  

    def __getitem__(self, index):  
        rand_image = torch.randn([3, 224, 224], dtype=torch.float32)  
        label = torch.tensor(data=[index % 1000], dtype=torch.int64)  
        return rand_image, label  

def mp_fn():  
    local_rank = int(os.environ['LOCAL_RANK'])  
    dist.init_process_group("nccl")  
    torch.cuda.set_device(local_rank)  

    # model definition  
    model = VisionTransformer()  
    loss_fn = torch.nn.CrossEntropyLoss()  
    model.to(torch.cuda.current_device())  
    model = DDP(model)  
    optimizer = torch.optim.Adam(params=model.parameters())  

    # dataset definition  
    num_workers = os.cpu_count()//int(os.environ['LOCAL_WORLD_SIZE'])  
    dl = DataLoader(FakeDataset(), batch_size=batch_size, num_workers=num_workers)  

    model.train()  
    t0 = time.perf_counter()  
    for batch_idx, (x, y) in enumerate(dl, start=1):  
        optimizer.zero_grad(set_to_none=True)  
        x = x.to(torch.cuda.current_device())  
        y = torch.squeeze(y.to(torch.cuda.current_device()), -1)  
        with autocast(enabled=True, dtype=torch.bfloat16):  
            outputs = model(x)  
            loss = loss_fn(outputs, y)  
        loss.backward()  
        optimizer.step()  
        if batch_idx % log_interval == 0 and local_rank == 0:  
            time_passed = time.perf_counter() - t0  
            samples_processed = dist.get_world_size() * batch_size * log_interval  
            print(f'{samples_processed / time_passed} samples/second')  
            t0 = time.perf_counter()  

if __name__ == '__main__':  
    mp_fn()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The code block below demonstrates how we used the  &lt;a href="https://sagemaker.readthedocs.io/en/stable/"&gt;SageMaker Python package&lt;/a&gt;  (version 2.203.1) to run our experiments. Note that for the four-instance experiments, we configure the use of a VPC with a single subnet, as explained above.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;from sagemaker.pytorch import PyTorch  
from sagemaker.vpc_utils import VPC_CONFIG_DEFAULT  


# Toggle flag to switch between multiple single-GPU nodes and  
# single multi-GPU node  
multi_inst = False  

inst_count=1  
inst_type='ml.g5.12xlarge'  
use_spot_instances=False  
max_wait=None #max seconds to wait for Spot job to complete  
subnets=None  
security_group_ids=None  

if multi_inst:  
    inst_count=4  
    inst_type='ml.g5.4xlarge' # optionally change to ml.g5.2xlarge  
    use_spot_instances=True  
    max_wait=24*60*60 #24 hours  
    # configure vpc settings  
    subnets=['&amp;lt;VPC subnet&amp;gt;']  
    security_group_ids=['&amp;lt;Security Group&amp;gt;']  


estimator = PyTorch(  
    role='&amp;lt;sagemaker role&amp;gt;',  
    entry_point='train.py',  
    source_dir='&amp;lt;path to source dir&amp;gt;',  
    instance_type=inst_type,  
    instance_count=inst_count,  
    framework_version='2.1.0',  
    py_version='py310',  
    distribution={'torch_distributed': {'enabled': True}},  
    subnets=subnets,  
    security_group_ids=security_group_ids,  
    use_spot_instances=use_spot_instances,  
    max_wait=max_wait  
)  

# start job  
estimator.fit()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Note that our code depends on the third-party  &lt;a href="https://pypi.org/project/timm/"&gt;&lt;em&gt;timm&lt;/em&gt;&lt;/a&gt;  Python package that we point to in a requirements.txt file in the root of the source directory. This assumes that the VPC has been configured to  &lt;a href="https://docs.aws.amazon.com/vpc/latest/userguide/VPC_Internet_Gateway.html"&gt;enable internet access&lt;/a&gt;. Alternatively, you could define a private PyPI server (as described  &lt;a href="https://aws.amazon.com/blogs/machine-learning/hosting-a-private-pypi-server-for-amazon-sagemaker-studio-notebooks-in-a-vpc/"&gt;here&lt;/a&gt;), or create a custom image with your third party dependencies preinstalled (as described  &lt;a href="https://towardsdatascience.com/customizing-your-cloud-based-machine-learning-training-environment-part-2-b65a6cf91812"&gt;here&lt;/a&gt;).&lt;/p&gt;

&lt;h1&gt;
  
  
  Results
&lt;/h1&gt;

&lt;p&gt;We summarize the results of our experiment in the table below. The On-Demand prices were taken from the  &lt;a href="https://aws.amazon.com/sagemaker/pricing/"&gt;SageMaker pricing page&lt;/a&gt;  (as of the time of this writing, January 2024). The Spot saving values were collected from the reported  &lt;em&gt;managed spot training savings&lt;/em&gt; of the completed job. Please see the  &lt;a href="https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/using-spot-instances-history.html"&gt;EC2 Spot pricing documentation&lt;/a&gt;  to get a sense for how the reported Spot savings are calculated.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--jNb7lHw4--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://miro.medium.com/v2/resize:fit:700/1%2A8GTS6c7JylvxiHNjcZxFiQ.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--jNb7lHw4--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://miro.medium.com/v2/resize:fit:700/1%2A8GTS6c7JylvxiHNjcZxFiQ.png" alt="" width="700" height="119"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Experiment Results (by Author)&lt;/p&gt;

&lt;p&gt;Our results clearly demonstrate the potential for considerable savings when using four single-GPU Spot instances rather than a single four-GPU On-Demand instance. They further demonstrate that although the On-Demand price of the g5.4xlarge instance type is higher than that of the g5.2xlarge, its increased CPU power and/or network bandwidth, combined with higher Spot savings, resulted in much greater overall savings.&lt;/p&gt;

&lt;p&gt;Importantly, keep in mind that the relative performance results can vary considerably based on the details of your job as well as the Spot prices at the time that you run your experiments.&lt;/p&gt;

&lt;h1&gt;
  
  
  Enforcing EC2 Instance Co-location Using a Cluster Placement Group
&lt;/h1&gt;

&lt;p&gt;In a  &lt;a href="https://towardsdatascience.com/a-simple-solution-for-managing-cloud-based-ml-training-c80a69c6939a"&gt;previous post&lt;/a&gt;  we described how to create a customized managed environment on top of an unmanaged service, such as  &lt;a href="https://aws.amazon.com/ec2/"&gt;Amazon EC2&lt;/a&gt;. One of the motivating factors listed there was the desire to have greater control over device placement in a multi-instance setup, e.g., by using a  &lt;a href="https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/placement-groups.html#placement-groups-cluster"&gt;cluster placement group&lt;/a&gt;, as discussed above. In this section, we demonstrate the creation of a multi-node setup using a cluster placement group.&lt;/p&gt;

&lt;p&gt;Our code assumes the presence of a  &lt;a href="https://docs.aws.amazon.com/vpc/latest/userguide/default-vpc.html"&gt;default VPC&lt;/a&gt;  as well as the (one-time) creation of a  &lt;a href="https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/placement-groups.html#placement-groups-cluster"&gt;cluster placement group&lt;/a&gt;, demonstrated here using the  &lt;a href="https://boto3.amazonaws.com/v1/documentation/api/latest/index.html"&gt;AWS Python SDK&lt;/a&gt;  (version 1.34.23):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import boto3  

ec2 = boto3.client('ec2')  
ec2.create_placement_group(  
    GroupName='cluster-placement-group',  
    Strategy='cluster'  
) 
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;In the code block below we use the  &lt;a href="https://boto3.amazonaws.com/v1/documentation/api/latest/index.html"&gt;AWS Python SDK&lt;/a&gt;  to launch our Spot instances:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import boto3  

ec2 = boto3.resource('ec2')  
instances = ec2.create_instances(  
    MaxCount=4,  
    MinCount=4,  
    ImageId='ami-0240b7264c1c9e6a9', # replace with image of choice  
    InstanceType='g5.4xlarge',  
    Placement={'GroupName':'cluster-placement-group'},  
    InstanceMarketOptions={  
        'MarketType': 'spot',  
        'SpotOptions': {  
            "SpotInstanceType": "one-time",  
            "InstanceInterruptionBehavior": "terminate"  
        }  
    },  
)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Please see our  &lt;a href="https://towardsdatascience.com/a-simple-solution-for-managing-cloud-based-ml-training-c80a69c6939a"&gt;previous post&lt;/a&gt;  for step-by-step tips on how to extend this to an automated training solution.&lt;/p&gt;

&lt;h1&gt;
  
  
  Summary
&lt;/h1&gt;

&lt;p&gt;In this post, we have illustrated how flexibility in your choice of training instance type can increase your ability to leverage Spot instance capacity and reduce the overall cost of training.&lt;/p&gt;

&lt;p&gt;As the sizes of AI models continue to grow and the costs of AI training accelerators continue to rise, it becomes increasingly important that we explore ways to mitigate training expenses. The technique outlined here is just one among several methods for optimizing cost performance. We encourage you to explore our  &lt;a href="https://chaimrand.medium.com/"&gt;previous posts&lt;/a&gt;  for insights into additional opportunities in this realm.&lt;/p&gt;

</description>
      <category>deeplearning</category>
      <category>ec2</category>
      <category>mlops</category>
      <category>spot</category>
    </item>
    <item>
      <title>Debugging and Tuning Amazon SageMaker Training Jobs with SageMaker SSH Helper</title>
      <dc:creator>Chaim Rand</dc:creator>
      <pubDate>Fri, 29 Dec 2023 08:52:26 +0000</pubDate>
      <link>https://dev.to/aws-builders/debugging-and-tuning-amazon-sagemaker-training-jobs-with-sagemaker-ssh-helper-2ci</link>
      <guid>https://dev.to/aws-builders/debugging-and-tuning-amazon-sagemaker-training-jobs-with-sagemaker-ssh-helper-2ci</guid>
      <description>&lt;h2&gt;
  
  
  A new tool that increases the debuggability of managed training workloads
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--7ijVErVT--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://miro.medium.com/v2/resize:fit:1050/0%2AOxo3P2aLT9HAmSFb" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--7ijVErVT--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://miro.medium.com/v2/resize:fit:1050/0%2AOxo3P2aLT9HAmSFb" alt="" width="800" height="533"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Photo by  &lt;a href="https://unsplash.com/@tumbao1949?utm_source=medium&amp;amp;utm_medium=referral"&gt;James Wainscoat&lt;/a&gt;  on  &lt;a href="https://unsplash.com/?utm_source=medium&amp;amp;utm_medium=referral"&gt;Unsplash&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Considering all the new Amazon SageMaker features announced over the past year (2023), including at the most recent  &lt;a href="https://press.aboutamazon.com/2023/11/aws-announces-five-new-amazon-sagemaker-capabilities-for-scaling-with-models"&gt;AWS re:Invent&lt;/a&gt;, it would have been easy to overlook  &lt;a href="https://github.com/aws-samples/sagemaker-ssh-helper"&gt;SageMaker SSH Helper&lt;/a&gt;, a new utility for connecting to remote SageMaker training environments. But  &lt;strong&gt;sometimes it is the quiet enhancements that have the potential to make the greatest impact on your daily development.&lt;/strong&gt;  In this post we will review  &lt;a href="https://github.com/aws-samples/sagemaker-ssh-helper"&gt;SageMaker SSH Helper&lt;/a&gt;  and demonstrate how it can increase your ability to 1) investigate and solve errors that arise in your training applications and 2) optimize their runtime performance.&lt;/p&gt;

&lt;p&gt;In  &lt;a href="https://towardsdatascience.com/6-steps-to-migrating-your-machine-learning-project-to-the-cloud-6d9b6e4f18e0"&gt;previous posts&lt;/a&gt;, we discussed at length the benefits of training in the cloud. Cloud-based managed training services, such as  &lt;a href="https://aws.amazon.com/sagemaker/"&gt;Amazon SageMaker&lt;/a&gt;, have simplified many of the complexities surrounding AI model development and greatly increased accessibility to both AI-specific machinery and  &lt;a href="https://docs.aws.amazon.com/sagemaker/latest/dg/studio-jumpstart.html"&gt;pretrained AI models&lt;/a&gt;. To train in Amazon SageMaker, all you need to do is define a training environment (including an instance type) and point to the code you wish to run, and the training service will 1) set up the requested environment, 2) deliver your code to the training machine, 3) run your training script, 4) copy the training output to persistent storage, and 5) tear everything down when the training completes (so that you pay only for what you need). Sounds easy… right? However, managed training is not without its flaws, one of which — the limited access it enables to the training environment — will be discussed in this post.&lt;/p&gt;

&lt;h2&gt;
  
  
  Disclaimers
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt; Please do not interpret our use of Amazon SageMaker,  &lt;a href="https://github.com/aws-samples/sagemaker-ssh-helper"&gt;SageMaker SSH Helper&lt;/a&gt;, or any other framework or utility we mention as an endorsement for their use. There are many different methodologies for developing AI models. The best solution for you will depend on the details of your project.&lt;/li&gt;
&lt;li&gt; Please be sure to verify the contents of this post, particularly the code samples, against the most up-to-date software and documentation available at the time that you read this. The landscape of AI development tools is in constant flux and it is likely that some of the APIs we refer to will change over time.&lt;/li&gt;
&lt;/ol&gt;

&lt;h1&gt;
  
  
  Disadvantage of Managed Training — Inaccessibility of the Training Environment
&lt;/h1&gt;

&lt;p&gt;As seasoned developers are well aware, a significant chunk of application development time is actually spent on debugging. Rarely do our programs work “out of the box”; more often than not, they require hours of laborious debugging to get them to run as desired. Of course, to be able to debug effectively, you need to have direct access to your application environment. Trying to debug an application without access to its environment is like trying to fix a faucet without a wrench.&lt;/p&gt;

&lt;p&gt;Another important step in AI model development is to tune the runtime performance of the training application. Training AI models can be expensive and our ability to maximize the utilization of our compute resources can have a decisive impact on training costs. In a  &lt;a href="https://towardsdatascience.com/cloud-ml-performance-checklist-caa51e798002"&gt;previous post&lt;/a&gt;  we described the iterative process of analyzing and optimizing training performance. Similar to debugging, direct access to the runtime environment will greatly accelerate our ability to reach the best results.&lt;/p&gt;

&lt;p&gt;Unfortunately, one of the side effects of the “fire and forget” nature of training in SageMaker is the inability to freely connect to the training environment. Of course, you could always debug and optimize performance using the training job output logs and debug prints (i.e., add prints, study the output logs, modify your code, and repeat until you’ve solved all your bugs and reached the desired performance) but this would be a very primitive and time-consuming solution.&lt;/p&gt;

&lt;p&gt;There are a number of best practices that address the problem of debugging managed training workloads, each with its own advantages and disadvantages. We will review three of these, discuss their limitations, and then demonstrate how the new  &lt;a href="https://github.com/aws-samples/sagemaker-ssh-helper"&gt;SageMaker SSH Helper&lt;/a&gt;  completely alters the playing field.&lt;/p&gt;

&lt;h2&gt;
  
  
  Debug in Your Local Environment
&lt;/h2&gt;

&lt;p&gt;It is recommended that you run a few training steps in your local environment before launching your job to the cloud. Although this may require a few modifications to your code (e.g., to enable training on a CPU device), it is usually worth the effort as it enables you to identify and fix silly coding errors. It is certainly more cost effective than discovering them on an expensive GPU machine in the cloud. Ideally, your local environment would be as similar as possible to the SageMaker training environment (e.g., using the same versions of Python and Python packages), but in most cases there is a limit to the extent that this is achievable.&lt;/p&gt;
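One common modification is choosing the compute device based on where the script is running. Below is a minimal sketch of this pattern. Note that the helper names are our own, and while SageMaker training containers are known to expose a number of SM_* environment variables (such as SM_TRAINING_ENV), you should verify the exact variable names against the current SageMaker documentation:

```python
import os

def running_in_sagemaker() -> bool:
    # SageMaker training containers set a number of SM_* environment
    # variables; in a local environment none of them are defined
    return "SM_TRAINING_ENV" in os.environ

def pick_device() -> str:
    # fall back to CPU for quick local debugging runs; the returned
    # string would be passed to torch.device(...) (or the equivalent
    # in your framework of choice)
    return "cuda" if running_in_sagemaker() else "cpu"
```

With this in place, the same training script can run unmodified both locally and in the cloud.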

&lt;h2&gt;
  
  
  Debug Locally within the SageMaker Docker Container
&lt;/h2&gt;

&lt;p&gt;The second option is to pull the  &lt;a href="https://docs.aws.amazon.com/deep-learning-containers/latest/devguide/deep-learning-containers-images.html"&gt;deep learning container (DLC) image&lt;/a&gt;  that SageMaker uses and run your training script within the container on your local PC. This method allows you to get a good understanding of the SageMaker training environment including the packages (and package versions) that are installed. It is extremely useful in identifying missing dependencies and addressing dependency conflicts. Please see  &lt;a href="https://github.com/aws/deep-learning-containers/blob/master/available_images.md"&gt;the documentation&lt;/a&gt;  for details on how to log in and pull the appropriate image. Note that the SageMaker APIs support pulling and training within a DLC via its  &lt;a href="https://sagemaker.readthedocs.io/en/stable/overview.html#local-mode"&gt;local mode&lt;/a&gt;  feature. However, running the image on your own will enable you to explore and study the image more freely.&lt;/p&gt;

&lt;h2&gt;
  
  
  Debug in the Cloud on an Unmanaged Instance
&lt;/h2&gt;

&lt;p&gt;Another option is to train on an unmanaged  &lt;a href="https://aws.amazon.com/ec2/"&gt;Amazon EC2&lt;/a&gt;  instance in the cloud. The advantage of this option is the ability to run on the same instance type that you use in SageMaker. This will enable you to reproduce issues that you may not be able to reproduce in your local environment, e.g., issues related to your use of the GPU resources. The easiest way to do this would be to run your instance with a  &lt;a href="https://aws.amazon.com/machine-learning/amis/"&gt;machine image&lt;/a&gt;  that is as similar as possible to your SageMaker environment (e.g., the same OS, Python, and Python package versions). Alternatively, you could pull the SageMaker DLC and run it on the remote instance. However, keep in mind that although this also runs in the cloud, the runtime environment may still be significantly different from SageMaker’s environment. SageMaker configures a whole bunch of system settings during initialization, and trying to reproduce the same environment may require quite a bit of effort. Given that debugging in the cloud is more costly than the previous two methods, our goal should be to clean up our code as much as possible before resorting to this option.&lt;/p&gt;

&lt;h2&gt;
  
  
  Debugging Limitations
&lt;/h2&gt;

&lt;p&gt;Although each of the above options is useful for solving certain types of bugs, none of them offer a way to perfectly replicate the SageMaker environment. Consequently, you may run into issues when running in SageMaker that you are not able to reproduce, and thus not able to correct, when using these methods. In particular, there are a number of features that are supported  &lt;em&gt;only&lt;/em&gt; when running in the SageMaker environment (e.g., SageMaker’s  &lt;a href="https://aws.amazon.com/blogs/machine-learning/using-pipe-input-mode-for-amazon-sagemaker-algorithms/"&gt;Pipe input&lt;/a&gt;  and  &lt;a href="https://aws.amazon.com/about-aws/whats-new/2021/10/amazon-sagemaker-fast-file-mode/"&gt;Fast File&lt;/a&gt;  modes for accessing data from Amazon S3). If your issue is related to one of those features, you will  &lt;em&gt;not&lt;/em&gt;  be able to reproduce it outside of SageMaker.&lt;/p&gt;

&lt;h2&gt;
  
  
  Tuning Limitations
&lt;/h2&gt;

&lt;p&gt;In addition, the options above do not provide an effective solution for performance tuning. Runtime performance can be extremely sensitive to even the slightest changes in the environment. While a simulated environment might provide some general optimization hints (e.g., the comparative performance overhead of different data augmentations), an accurate profiling analysis can be performed only in the SageMaker runtime environment.&lt;/p&gt;

&lt;h1&gt;
  
  
  SageMaker SSH Helper
&lt;/h1&gt;

&lt;p&gt;&lt;a href="https://github.com/aws-samples/sagemaker-ssh-helper"&gt;SageMaker SSH Helper&lt;/a&gt;  introduces that ability to connect to the remote SageMaker training environment. This is enabled via an SSH connection over  &lt;a href="https://docs.aws.amazon.com/systems-manager/latest/userguide/ssm-agent.html"&gt;AWS SSM&lt;/a&gt;. As we will demonstrate, the steps required to set this up are quite simple and very well worth the effort. The  &lt;a href="https://github.com/aws-samples/sagemaker-ssh-helper/tree/v2.1.0"&gt;official documentation&lt;/a&gt;  includes comprehensive details on the value of this utility and how it can be used.&lt;/p&gt;

&lt;h2&gt;
  
  
  Example
&lt;/h2&gt;

&lt;p&gt;In the code block below we demonstrate how to enable remote connection to a SageMaker training job using  &lt;a href="https://pypi.org/project/sagemaker-ssh-helper/"&gt;sagemaker-ssh-helper&lt;/a&gt;  (version 2.1.0). We pass in our full source code directory but replace our usual  &lt;em&gt;entry_point&lt;/em&gt; (&lt;em&gt;train.py&lt;/em&gt;)  with a new  &lt;em&gt;run_ssh.py&lt;/em&gt;  script that we place in the root of the  &lt;em&gt;source_dir&lt;/em&gt;. Note that we add the SSH Helper dependency directory to the list of project dependencies since our  &lt;em&gt;run_ssh.py&lt;/em&gt;  script will require it. Alternatively, we could have added  &lt;a href="https://pypi.org/project/sagemaker-ssh-helper/"&gt;sagemaker-ssh-helper&lt;/a&gt;  to our  &lt;em&gt;requirements.txt&lt;/em&gt;  file. Here we have set the  &lt;em&gt;connection_wait_time_seconds&lt;/em&gt; setting to two minutes. As we will see, this will impact the behavior of our training script.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;from sagemaker.pytorch import PyTorch  
from sagemaker_ssh_helper.wrapper import SSHEstimatorWrapper  
MINUTE = 60  

estimator = PyTorch(  
    role='&amp;lt;sagemaker role&amp;gt;',  
    entry_point='run_ssh.py',  
    source_dir='&amp;lt;path to source dir&amp;gt;',  
    instance_type='ml.g5.xlarge',  
    instance_count=1,  
    framework_version='2.0.1',  
    py_version='py310',  
    dependencies=[SSHEstimatorWrapper.dependency_dir()]  
)  


# configure the SSH wrapper. Set the wait time for the connection.  
ssh_wrapper = SSHEstimatorWrapper.create(estimator,  
                                    connection_wait_time_seconds=2*MINUTE)  

# start job (wait=False returns immediately so that we can  
# retrieve the instance id while the job is running)  
estimator.fit(wait=False)  

# wait to receive an instance id for the connection over SSM  
instance_ids = ssh_wrapper.get_instance_ids()  

print(f'To connect run: aws ssm start-session --target {instance_ids[0]}')
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;As usual, the SageMaker service will allocate a machine instance, build the requested environment, download and unpack our source code, and install the requested dependencies. At that point, the runtime environment will be identical to the one in which we usually run our training script. Only now, instead of training, we will run our  &lt;em&gt;run_ssh.py&lt;/em&gt; script:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import sagemaker_ssh_helper  
from time import sleep  

# setup SSH and wait for connection_wait_time_seconds seconds  
# (to give opportunity for the user to connect before script resumes)  
sagemaker_ssh_helper.setup_and_start_ssh()  

# place any code here... e.g. your training code  
# we choose to sleep for two hours to enable connecting in an SSH window  
# and running trials there  
HOUR = 60*60  
sleep(2*HOUR)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The  &lt;a href="https://github.com/aws-samples/sagemaker-ssh-helper/blob/main/sagemaker_ssh_helper/__init__.py#L10"&gt;&lt;em&gt;setup_and_start_ssh&lt;/em&gt;&lt;/a&gt;  function will start the SSH service, then block for the allotted time we defined above (&lt;em&gt;connection_wait_time_seconds&lt;/em&gt;) to allow an SSH client to connect, and then proceed with the rest of the script. In our case it will sleep for two hours and then exit the training job. During that time we can connect to the machine using the  &lt;a href="https://docs.aws.amazon.com/systems-manager/latest/userguide/session-manager-working-with-sessions-start.html"&gt;&lt;em&gt;aws ssm start-session&lt;/em&gt;&lt;/a&gt;  command and the instance-id that was returned by the  &lt;em&gt;ssh_wrapper&lt;/em&gt;  (which typically starts with an “&lt;em&gt;mi-”&lt;/em&gt;  prefix for “&lt;em&gt;managed instance&lt;/em&gt;”) and play to our hearts desire. In particular, we can explicitly run our original training script (which was uploaded as part of the  &lt;em&gt;source_dir&lt;/em&gt;) and monitor the training behavior.&lt;/p&gt;

&lt;p&gt;The method we have described enables us to run our training script iteratively while we identify and fix bugs. It also provides an ideal setting for optimizing performance — one in which we can 1) run a few training steps, 2) identify performance bottlenecks (e.g., using  &lt;a href="https://pytorch.org/tutorials/recipes/recipes/profiler_recipe.html"&gt;PyTorch Profiler&lt;/a&gt;), 3) tune our code to address them, and 4) repeat, until we achieve the desired runtime performance.&lt;/p&gt;

&lt;p&gt;Importantly, keep in mind that the instance will be terminated as soon as the  &lt;em&gt;run_ssh.py&lt;/em&gt; script completes. Make sure to copy all important files (e.g., code modifications, profile traces, etc.) to persistent storage before it is too late.&lt;/p&gt;
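For example, before exiting the session you might sync your working directory to Amazon S3 using the AWS CLI, which is preinstalled on SageMaker training instances. The sketch below wraps the CLI call in a small helper; the bucket name and paths are hypothetical placeholders:

```python
import subprocess
from pathlib import Path

def sync_artifacts(local_dir: str, s3_uri: str, dry_run: bool = False):
    # copy debug artifacts (profiler traces, modified code, etc.) to S3;
    # with dry_run=True the command list is returned without being executed
    cmd = ["aws", "s3", "sync", str(Path(local_dir)), s3_uri]
    if not dry_run:
        subprocess.run(cmd, check=True)
    return cmd

# e.g., sync_artifacts("/opt/ml/code", "s3://my-bucket/debug-session/")
```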

&lt;h2&gt;
  
  
  Port Forwarding Over AWS SSM
&lt;/h2&gt;

&lt;p&gt;We can extend our  &lt;em&gt;aws ssm start-session&lt;/em&gt;  command to enable  &lt;a href="https://aws.amazon.com/blogs/aws/new-port-forwarding-using-aws-system-manager-sessions-manager/"&gt;port forwarding&lt;/a&gt;. This allows you to securely connect to server applications running on your cloud instance. This is particularly exciting for developers who are accustomed to using the  &lt;a href="https://pytorch.org/tutorials/intermediate/tensorboard_profiler_tutorial.html"&gt;TensorBoard Profiler&lt;/a&gt;  plugin for analyzing runtime performance (as  &lt;a href="https://towardsdatascience.com/pytorch-model-performance-analysis-and-optimization-10c3c5822869"&gt;we are&lt;/a&gt;). The command below demonstrates how to set up port forwarding over AWS SSM:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;aws ssm start-session \  
  --target mi-0748ce18cf28fb51b \  
  --document-name AWS-StartPortForwardingSession \  
  --parameters '{"portNumber":["6006"],"localPortNumber":["9999"]}'
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Additional Modes of Use
&lt;/h2&gt;

&lt;p&gt;The  &lt;a href="https://github.com/aws-samples/sagemaker-ssh-helper/tree/v2.1.0"&gt;SageMaker SSH Helper documentation&lt;/a&gt;  describes several different ways of using the SSH functionality. In the  &lt;a href="https://github.com/aws-samples/sagemaker-ssh-helper/tree/v2.1.0#step-3-modify-your-training-script"&gt;basic example&lt;/a&gt;  the  &lt;em&gt;setup_and_start_ssh&lt;/em&gt; command is added to the top of the existing training script (instead of defining a dedicated script). This allows you time (as defined by the  &lt;em&gt;connection_wait_time_seconds&lt;/em&gt; setting) to connect to the machine before the training begins so that you can monitor its behavior (from a separate process) as it runs.&lt;/p&gt;

&lt;p&gt;The more  &lt;a href="https://github.com/aws-samples/sagemaker-ssh-helper/tree/v2.1.0#pycharm-debug-server"&gt;advanced examples&lt;/a&gt;  include different methods for using SageMaker SSH Helper to debug the training script running in the SageMaker environment from an IDE running in our local environment. The setup is more complicated but may very well be worth the reward of being able to perform line-by-line debugging from a local IDE.&lt;/p&gt;

&lt;p&gt;Additional use cases cover  &lt;a href="https://github.com/aws-samples/sagemaker-ssh-helper/blob/v2.1.0/FAQ.md#im-running-sagemaker-in-a-vpc-do-i-need-to-make-extra-configuration"&gt;training in a VPC&lt;/a&gt;, integration with  &lt;a href="https://github.com/aws-samples/sagemaker-ssh-helper/tree/v2.1.0#studio"&gt;SageMaker Studio&lt;/a&gt;,  &lt;a href="https://github.com/aws-samples/sagemaker-ssh-helper/tree/v2.1.0#connecting-to-sagemaker-inference-endpoints-with-ssm"&gt;connecting to SageMaker inference endpoints&lt;/a&gt;, and more. Please be sure to see the documentation for details.&lt;/p&gt;

&lt;h2&gt;
  
  
  When to Use SageMaker SSH Helper
&lt;/h2&gt;

&lt;p&gt;Given the advantages of debugging with SageMaker SSH Helper, you might wonder if there is any reason to use the three debugging methods we described above. We would argue that, despite the fact that you  &lt;em&gt;could&lt;/em&gt; perform all of your debugging in the cloud, it is still highly recommended that you perform your initial development and experimentation phase — to the extent possible — in your local environment (using the first two methods we described). Only once you have exhausted your ability to debug locally, should you move to debugging in the cloud using SageMaker SSH Helper. The last thing you would want would be to spend hours cleaning up silly syntax errors on a super expensive cloud-based GPU machine.&lt;/p&gt;

&lt;p&gt;In contrast to debugging, analyzing and optimizing performance has little value  &lt;em&gt;unless&lt;/em&gt; it is performed directly in the target training environment. Thus, it is advisable to perform your optimization efforts on the SageMaker instance using SageMaker SSH Helper.&lt;/p&gt;

&lt;h1&gt;
  
  
  Summary
&lt;/h1&gt;

&lt;p&gt;Until now, one of the most painful side effects of training on Amazon SageMaker has been the loss of direct access to the training environment. This restricted our ability to debug and tune our training workloads in an effective manner. The recent release of SageMaker SSH Helper and its support for unmediated access to the training environment opens up a wealth of new opportunities for developing, debugging, and tuning. These can have a distinctive impact on the efficiency and speed of your ML development life cycle. It is for this reason that SageMaker SSH Helper is one of our favorite new cloud-ML features of 2023.&lt;/p&gt;

</description>
      <category>aws</category>
      <category>sagemaker</category>
      <category>mlops</category>
      <category>deeplearning</category>
    </item>
    <item>
      <title>Using Server-less Functions to Govern and Monitor Cloud-Based Training Experiments</title>
      <dc:creator>Chaim Rand</dc:creator>
      <pubDate>Sat, 23 Dec 2023 16:46:46 +0000</pubDate>
      <link>https://dev.to/aws-builders/using-server-less-functions-to-govern-and-monitor-cloud-based-training-experiments-4bah</link>
      <guid>https://dev.to/aws-builders/using-server-less-functions-to-govern-and-monitor-cloud-based-training-experiments-4bah</guid>
      <description>&lt;h2&gt;
  
  
  A simple routine that can save you loads of money
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--XfGf2SUG--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://miro.medium.com/v2/resize:fit:1050/0%2ADsBFhvT4LD5Mn8mV" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--XfGf2SUG--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://miro.medium.com/v2/resize:fit:1050/0%2ADsBFhvT4LD5Mn8mV" alt="" width="800" height="533"&gt;&lt;/a&gt;&lt;em&gt;Photo by &lt;a href="https://unsplash.com/@teadrinker42?utm_source=medium&amp;amp;utm_medium=referral"&gt;Ziyou Zhang&lt;/a&gt; on &lt;a href="https://unsplash.com/?utm_source=medium&amp;amp;utm_medium=referral"&gt;Unsplash&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;This blog post was co-authored with my colleague &lt;a href="https://www.linkedin.com/in/shay-margalit-40581668/?originalSubdomain=il"&gt;Shay Margalit&lt;/a&gt;. It summarizes his research into how &lt;a href="https://aws.amazon.com/lambda/"&gt;AWS Lambda&lt;/a&gt; functions can be used to increase the control over the usage and costs of the &lt;a href="https://aws.amazon.com/sagemaker/"&gt;Amazon SageMaker&lt;/a&gt; training service. Interested? Please read on :).&lt;/p&gt;

&lt;p&gt;We are fortunate (or very unfortunate - &lt;a href="https://www.bbc.co.uk/news/uk-65746524"&gt;depending on who you ask&lt;/a&gt;) to be sharing a front row seat to an AI revolution that is expected by many to change the world as we know it. Powered by advances in hardware development and access to enormous amounts of data, this revolution is likely to impact many aspects of our daily lives - although precisely how, no one can say for sure. To support the growing appetite for artificial intelligence, the sizes of the underlying machine learning models are increasing rapidly as are the resources that are required to train them. The bottom line is that staying relevant in the AI development playing field requires a sizable investment into heavy, and expensive, machinery.&lt;/p&gt;

&lt;p&gt;Cloud-based managed training services, such as Amazon SageMaker, Google Vertex AI, and Microsoft Azure ML, have lowered the entry barrier to AI development by enabling developers to train on machines that they could otherwise not afford. Although such services reduce the upfront costs of AI and enable you to pay only for the time you spend training, the potential for the variable costs to add up warrants careful planning of how the training services will be used and how they will contribute to your overall training expense. However, inevitably, things don't always go according to plan. To paraphrase an old Yiddish proverb "developers plan and the programming gods laugh". When the stakes are high, as when training AI models --- where an errant experiment can result in hundreds or thousands of dollars' worth of wasted compute time, it is wise to institute multiple lines of defense.&lt;/p&gt;

&lt;h2&gt;
  
  
  First Line of Defense - Encourage Healthy Development Habits
&lt;/h2&gt;

&lt;p&gt;The first line of defense should address the development practices of the ML algorithm engineers. Here are examples of some &lt;a href="https://towardsdatascience.com/cloud-ml-performance-checklist-caa51e798002"&gt;guiding principles&lt;/a&gt; you might consider:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt; Encourage appropriate and cost-optimal use of the hardware resources used for training (e.g., see &lt;a href="https://towardsdatascience.com/instance-selection-for-deep-learning-7463d774cff0"&gt;here&lt;/a&gt;).&lt;/li&gt;
&lt;li&gt; Identify and terminate failing experiments early.&lt;/li&gt;
&lt;li&gt; Increase price performance by regularly analyzing and optimizing runtime performance (e.g., see &lt;a href="https://medium.com/towards-data-science/pytorch-model-performance-analysis-and-optimization-10c3c5822869"&gt;here&lt;/a&gt;).&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;While formulating and adopting AI development principles such as the ones above is likely to increase your productivity and reduce waste, they do not offer full protection against all possible failures. For example, a dedicated failure detection runtime process may not help address a situation in which a training experiment stalls (e.g., due to a deadlock in the training application's processes) but the training job remains active until it is actively stopped or times out.&lt;/p&gt;

&lt;h2&gt;
  
  
  Second Line of Defense - Deploy Cross-project Guardrails
&lt;/h2&gt;

&lt;p&gt;In this post we propose instituting a second line of defense that monitors all of the training activities in the project (or organization), verifies their compliance with a predetermined set of rules, and takes appropriate action in the case that errant training experiments are identified. One way to do this is to use dedicated server-less functions that are triggered at different stages of a training job and programmed to evaluate the job's state and optionally stop or restart it (possibly with changes to the job settings), accordingly. In the next sections we will demonstrate a few examples of how to use AWS Lambda as a second line of defense against errant Amazon SageMaker training experiments.&lt;/p&gt;

&lt;h2&gt;
  
  
  Disclaimers
&lt;/h2&gt;

&lt;p&gt;Although we have chosen Amazon SageMaker and AWS Lambda for our demonstrations, the contents of the post are just as relevant to other services and similar functionality can be implemented for them. Please do not interpret our choice of these services as an endorsement of their use over their alternatives. There are multiple options available for cloud-based training each with their own advantages and disadvantages. The best choice for you will greatly depend on the details of your project.&lt;/p&gt;

&lt;p&gt;While we will share a few Python examples of server-less code, we will not go into the details of how to create and deploy them as AWS Lambda functions. There are many ways of interacting with AWS Lambda. We refer the reader to the &lt;a href="https://docs.aws.amazon.com/lambda/latest/dg/python-tracing.html"&gt;official AWS documentation&lt;/a&gt; to learn more about them.&lt;/p&gt;

&lt;p&gt;The examples below were created for demonstrative purposes. They will likely require modification to suit the specific needs of your project. Be sure to fully understand all of the details of the code and the associated service costs before adapting the type of solution we propose. Importantly, the code we will share has not undergone rigorous testing. Any solution that includes creation and invocation of multiple Lambda functions and &lt;a href="https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/AlarmThatSendsEmail.html"&gt;Amazon CloudWatch alarms&lt;/a&gt; (as described here) requires appropriate validation to prevent the accumulation of redundant/orphan artifacts.&lt;/p&gt;

&lt;p&gt;We highly advise that you verify the details of this post against the most up-to-date AWS Lambda documentation and most up-to-date versions of the supporting libraries.&lt;/p&gt;

&lt;h1&gt;
  
  
  Enforcing Developer Compliance
&lt;/h1&gt;

&lt;p&gt;While &lt;a href="https://www.redhat.com/en/topics/automation/what-is-cloud-governance"&gt;cloud governance&lt;/a&gt; is often vital for successful and efficient use of cloud services, its enforcement can sometimes be challenging. For example: Amazon SageMaker includes an API for appending &lt;a href="https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_Tag.html"&gt;tags&lt;/a&gt; to training jobs. These can be used to include metadata associated with the SageMaker job such as the name of the training project, the stage of development, the goal of the current trial, the name of the development group or user running the job, etc. This metadata can be used to collect statistics such as the cost of development per project or group. In the code block below, we demonstrate the application of several tags to a SageMaker training job:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;from sagemaker.pytorch import PyTorch

tags = [{'Key': 'username', 'Value': 'johndoe'},\
        {'Key': 'model_name', 'Value': 'mnist'},\
        {'Key': 'training_phase', 'Value': 'finetune'},\
        {'Key': 'description', 'Value': 'fine tune final linear layer'}]

# define the training job with tags\
estimator = PyTorch(\
    entry_point='train.py',\
    framework_version='2.1.0',\
    role='&amp;lt;arn role&amp;gt;',\
    py_version='py310',\
    job_name='demo',\
    instance_type='ml.g5.xlarge',\
    instance_count=1,\
    tags=tags\
)

# deploy the job to the cloud\
estimator.fit()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Naturally, these tags are only helpful if we can enforce their application. This is where &lt;a href="https://aws.amazon.com/lambda/"&gt;AWS Lambda&lt;/a&gt; comes to the rescue. Using &lt;a href="https://docs.aws.amazon.com/sagemaker/latest/dg/automating-sagemaker-with-eventbridge.html"&gt;Amazon EventBridge&lt;/a&gt; we can monitor changes in the status of SageMaker training jobs and register a function that will be triggered on every change. In the code block below, we propose a Python routine that will verify the presence of specific SageMaker tags every time a job is started. If a required tag is missing, the job is automatically terminated. The structure of the event is documented &lt;a href="https://docs.aws.amazon.com/sagemaker/latest/dg/automating-sagemaker-with-eventbridge.html#eventbridge-training"&gt;here&lt;/a&gt;. Note the use of (the more detailed) &lt;a href="https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_DescribeTrainingJob.html"&gt;&lt;em&gt;SecondaryStatus&lt;/em&gt;&lt;/a&gt; field to poll the status of the training job (rather than &lt;em&gt;TrainingJobStatus&lt;/em&gt;).&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import boto3\
def stop_training_job(training_job_name):\
    sm_client = boto3.client("sagemaker")\
    response = sm_client.stop_training_job(TrainingJobName=training_job_name)\
    assert response['ResponseMetadata']['HTTPStatusCode'] == 200\
    # TODO - optionally send an email notification

def enforce_required_tags(training_job_name, event):\
    event_tags = event['detail']['Tags']\
    if 'model_name' not in event_tags:\
        stop_training_job(training_job_name)

# define lambda handler\
def sagemaker_event_handler(event, _):\
    job_name = event['detail']['TrainingJobName']\
    job_secondary_status = event['detail']['SecondaryStatus']\
    if job_secondary_status == 'Starting':\
        enforce_required_tags(job_name, event)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;AWS offers multiple ways for creating a Lambda function. Please see the &lt;a href="https://docs.aws.amazon.com/lambda/latest/dg/API_CreateFunction.html"&gt;AWS Lambda&lt;/a&gt; documentation for details. Once created, make sure to set the function as the target of the &lt;a href="https://docs.aws.amazon.com/sagemaker/latest/dg/automating-sagemaker-with-eventbridge.html#eventbridge-model"&gt;EventBridge rule&lt;/a&gt;.&lt;/p&gt;
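For reference, the rule-plus-target wiring can also be done programmatically. The sketch below only builds the event pattern and the boto3 call arguments without executing them; the rule name and Lambda ARN are hypothetical, and you should verify the detail-type string against the EventBridge documentation:

```python
import json

# event pattern matching SageMaker training-job state changes
SAGEMAKER_TRAINING_PATTERN = {
    "source": ["aws.sagemaker"],
    "detail-type": ["SageMaker Training Job State Change"],
}

def build_rule_calls(rule_name: str, lambda_arn: str):
    # returns (api_name, kwargs) pairs to pass to boto3.client('events'),
    # i.e. events.put_rule(**kwargs) followed by events.put_targets(**kwargs)
    return [
        ("put_rule", {"Name": rule_name,
                      "EventPattern": json.dumps(SAGEMAKER_TRAINING_PATTERN)}),
        ("put_targets", {"Rule": rule_name,
                         "Targets": [{"Id": "sagemaker-handler",
                                      "Arn": lambda_arn}]}),
    ]
```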

&lt;p&gt;The same function can be used to enforce additional development rules that are aimed at controlling cost such as: the types of instances that can be used, the maximum number of instances per job, the maximum runtime of a job, and more.&lt;/p&gt;
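As a sketch of what such rules might look like, the handler could be extended with a policy check such as the one below. The allowed instance types and limits are made-up examples, and the ResourceConfig fields mirror the DescribeTrainingJob schema; verify them against the documented event structure before relying on them:

```python
# example policy values - replace with your project's actual limits
ALLOWED_INSTANCE_TYPES = {'ml.g5.xlarge', 'ml.g5.2xlarge'}
MAX_INSTANCE_COUNT = 2

def policy_violations(event):
    # return the list of rules violated by a training-job event
    config = event['detail'].get('ResourceConfig', {})
    violations = []
    if config.get('InstanceType') not in ALLOWED_INSTANCE_TYPES:
        violations.append('instance_type')
    if config.get('InstanceCount', 1) > MAX_INSTANCE_COUNT:
        violations.append('instance_count')
    return violations
```

A job with a non-empty violations list would then be stopped with the same kind of stop-training routine shown earlier.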

&lt;h1&gt;
  
  
  Stopping Stalled Experiments
&lt;/h1&gt;

&lt;p&gt;Imagine the following scenario: You have planned a large cloud-based training job that will run on eight $30-an-hour ML compute instances for a period of three days. For the purpose of this task, you have secured a budget of $17,280 (8 instances x $30 an hour x 24 hours x 3 days). You start up the training job just before heading out for a three-day holiday weekend. When you return from your holiday weekend, you discover that an hour into the job, the training process stalled causing the expensive machinery to essentially remain completely idle for three long days. Not only have you wasted $17,280 (good luck explaining that to your boss) but your development has now been pushed back by three days!!&lt;/p&gt;
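The budget arithmetic behind this scenario:

```python
instances = 8
dollars_per_instance_hour = 30
hours = 24 * 3  # three days
total_cost = instances * dollars_per_instance_hour * hours
print(total_cost)  # 17280
```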

&lt;p&gt;One way to protect yourself against this type of occurrence is to monitor the utilization of the underlying training job resources. For example, if the GPU utilization of your training instances remains below a certain threshold for an extended period of time, this is likely a sign that something has gone wrong and that the training job should be stopped immediately.&lt;/p&gt;

&lt;p&gt;We will do this by defining an &lt;a href="https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/AlarmThatSendsEmail.html"&gt;Amazon CloudWatch alarm&lt;/a&gt; that monitors the GPU utilization of one of the training instances of each SageMaker job and invokes an AWS Lambda function that terminates the job if the alarm is triggered. Setting this up requires three components: an &lt;a href="https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/AlarmThatSendsEmail.html"&gt;Amazon CloudWatch alarm&lt;/a&gt; (one per training job), an &lt;a href="https://aws.amazon.com/lambda/"&gt;AWS Lambda&lt;/a&gt; function, and an &lt;a href="https://docs.aws.amazon.com/sns/"&gt;Amazon Simple Notification Service (SNS) topic&lt;/a&gt; that is used to link the Lambda function to the CloudWatch alarms.&lt;/p&gt;

&lt;p&gt;First, we create an SNS topic. This can be done via the Amazon SNS Console or in Python, as shown below:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import boto3

sns_client = boto3.client('sns')
# Create an SNS notification topic.
topic = sns_client.create_topic(Name="SageMakerTrainingJobIdleTopic")
topic_arn = topic['TopicArn']
print(f"Created SNS topic arn: {topic_arn}")
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Next, we extend the &lt;em&gt;sagemaker_event_handler&lt;/em&gt; function we defined above to create a unique alarm each time a training job is started. We program the alarm to measure the average GPU utilization over five-minute periods and to alert our SNS topic when there are three consecutive measurements below 1%. The alarm is deleted when the job is completed.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;def create_training_alarm(job_name):\
    topic_arn = '&amp;lt;sns topic arn&amp;gt;'

    SAMPLE_PERIOD_SECONDS = 60 * 5 # 5 minutes\
    SAMPLE_POINTS_LIMIT = 3\
    GPU_UTIL_THRESHOLD_PERCENTAGE = 1

    cloudwatch_client = boto3.client('cloudwatch')

    # A new sample is generated each SAMPLE_PERIOD_SECONDS seconds.\
    # The alarm will set off it there will be more than SAMPLE_POINTS_LIMIT\
    # below the limit.\
    response = cloudwatch_client.put_metric_alarm(\
        AlarmName=job_name + 'GPUUtil',\
        AlarmActions=topic_arn,\
        MetricName='GPUUtilization',\
        Namespace='/aws/sagemaker/TrainingJobs',\
        Statistic='Average',\
        Dimensions=[{\
            "Name": "Host",\
            "Value": job_name+"/algo-1"\
        }],\
        Period=SAMPLE_PERIOD_SECONDS,\
        EvaluationPeriods=SAMPLE_POINTS_LIMIT,\
        DatapointsToAlarm=SAMPLE_POINTS_LIMIT,\
        Threshold=GPU_UTIL_THRESHOLD_PERCENTAGE,\
        ComparisonOperator='LessThanOrEqualToThreshold',\
        TreatMissingData='notBreaching'\
    )\
    assert response['ResponseMetadata']['HTTPStatusCode'] == 200

def delete_training_alarm(job_name):\
    cloudwatch_client = boto3.client('cloudwatch')\
    response = cloudwatch_client.delete_alarms(\
                                   AlarmNames=[job_name+'GPUUtil'])

def sagemaker_event_handler(event, _):
    job_name = event['detail']['TrainingJobName']
    job_secondary_status = event['detail']['SecondaryStatus']
    if job_secondary_status == 'Starting':
        enforce_required_tags(job_name, event)
    elif job_secondary_status == 'Training':
        create_training_alarm(job_name)
    elif job_secondary_status in ['Completed', 'Failed', 'Stopped']:
        delete_training_alarm(job_name)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Last, we define a second Python AWS Lambda function that &lt;a href="https://docs.aws.amazon.com/lambda/latest/dg/with-sns-create-package.html#with-sns-example-deployment-pkg-python"&gt;parses messages received from the SNS topic&lt;/a&gt; and terminates the training job associated with the alarm.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import boto3, json

def lambda_sns_handler(event, context):
    data = json.loads(event['Records'][0]['Sns']['Message'])
    alarm_name = data['AlarmName']
    training_job_name = alarm_name.replace('GPUUtil', '')
    stop_training_job(training_job_name)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;AWS offers multiple mechanisms for subscribing a Lambda function to an SNS topic including the &lt;a href="https://docs.aws.amazon.com/sns/latest/dg/lambda-console.html"&gt;AWS Console&lt;/a&gt;, &lt;a href="https://docs.aws.amazon.com/lambda/latest/dg/with-sns-example.html"&gt;AWS CLI&lt;/a&gt;, and &lt;a href="https://docs.aws.amazon.com/serverless-application-model/latest/developerguide/sam-resource-function.html"&gt;the AWS Serverless Application Model (AWS SAM)&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;The solution we described is summarized in the following diagram:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--cD1GAk4P--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://miro.medium.com/v2/resize:fit:1050/1%2AfH4bgTouVhVfp3L1lIxlbw.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--cD1GAk4P--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://miro.medium.com/v2/resize:fit:1050/1%2AfH4bgTouVhVfp3L1lIxlbw.png" alt="AWS Architecture Diagram (by Author)" width="715" height="250"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Note that the same architecture can be used to enforce a minimum level of GPU utilization of your ML training projects. GPUs are typically the most expensive resource in your training infrastructure and your goal should be to maximize the utilization of all of your training workloads. By dictating a minimum level of utilization (e.g. 80%) you can ensure that all developers &lt;a href="https://towardsdatascience.com/cloud-ml-performance-checklist-caa51e798002"&gt;optimize their workloads appropriately&lt;/a&gt;.&lt;/p&gt;
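
&lt;p&gt;For example, the alarm definition might be adapted along the following lines. This is a sketch only: the 80% floor, the 30-minute sample period, and the &lt;em&gt;MinGPUUtil&lt;/em&gt; alarm name are illustrative assumptions, and the returned dictionary would be passed to &lt;em&gt;put_metric_alarm&lt;/em&gt;:&lt;/p&gt;

```python
# Sketch: parameters for a *minimum* GPU utilization alarm. Same metric and
# dimensions as the idle-job alarm in the article, but with a far higher
# threshold and a longer evaluation window (values are assumptions).
def min_util_alarm_params(job_name, topic_arn,
                          threshold_pct=80,
                          period_seconds=30 * 60,
                          datapoints=4):
    return dict(
        AlarmName=job_name + 'MinGPUUtil',
        AlarmActions=[topic_arn],
        MetricName='GPUUtilization',
        Namespace='/aws/sagemaker/TrainingJobs',
        Statistic='Average',
        Dimensions=[{'Name': 'Host', 'Value': job_name + '/algo-1'}],
        Period=period_seconds,
        EvaluationPeriods=datapoints,
        DatapointsToAlarm=datapoints,
        Threshold=threshold_pct,
        ComparisonOperator='LessThanThreshold',
        TreatMissingData='notBreaching',
    )

# hypothetical job name and topic ARN, for illustration only
params = min_util_alarm_params('my-training-job', 'arn:aws:sns:...')
# boto3.client('cloudwatch').put_metric_alarm(**params)
print(params['AlarmName'], params['Threshold'])
```

&lt;p&gt;The Lambda function receiving the notification could then warn the job owner rather than stop the job outright.&lt;/p&gt;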

&lt;h1&gt;
  
  
  Ensuring Continuity of Development
&lt;/h1&gt;

&lt;p&gt;In our previous example, we demonstrated how to identify and stop a stalled experiment. In the large training job scenario that we described, this helped save a lot of money, but it did not address the three-day delay to development. Obviously, if the source of the stall is in your code, it makes sense to postpone resuming training until the problem is fixed. However, we often encounter training interruptions that are not caused by our code but rather by sporadic failures in the service environment. In such scenarios, your priority may be to ensure training continuity rather than waiting for someone to manually resume the training job (using the most recent &lt;a href="https://pytorch.org/tutorials/recipes/recipes/saving_and_loading_a_general_checkpoint.html"&gt;training checkpoint&lt;/a&gt;). In the code block below, we use the &lt;a href="https://boto3.amazonaws.com/v1/documentation/api/latest/index.html"&gt;boto3&lt;/a&gt; &lt;a href="https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/sagemaker/client/create_training_job.html"&gt;&lt;em&gt;create_training_job&lt;/em&gt;&lt;/a&gt; API to extend our &lt;em&gt;sagemaker_event_handler&lt;/em&gt; function to (naively) resume any training job that has failed after running for at least two hours.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import boto3, datetime

def clone_job(training_name, disable_spot=False):
    # get the description of the original job
    client = boto3.client('sagemaker')
    desc = client.describe_training_job(TrainingJobName=training_name)

    # update the training name
    new_training_name = training_name + 'clone'

    use_spots = (not disable_spot) and desc["EnableManagedSpotTraining"]

    if disable_spot:
        desc["StoppingCondition"].pop("MaxWaitTimeInSeconds", None)

    client.create_training_job(
        TrainingJobName=new_training_name,
        HyperParameters=desc["HyperParameters"],
        AlgorithmSpecification=desc["AlgorithmSpecification"],
        RoleArn=desc["RoleArn"],
        OutputDataConfig=desc["OutputDataConfig"],
        ResourceConfig=desc["ResourceConfig"],
        StoppingCondition=desc["StoppingCondition"],
        EnableNetworkIsolation=desc["EnableNetworkIsolation"],
        EnableInterContainerTrafficEncryption=desc[
            "EnableInterContainerTrafficEncryption"
        ],
        EnableManagedSpotTraining=use_spots,
        Tags=client.list_tags(ResourceArn=desc['TrainingJobArn'])['Tags']
     )

def sagemaker_event_handler(event, _):
    TRAIN_TIME_THRESHOLD = 2 * 60 * 60  # 2 hours
    job_name = event['detail']['TrainingJobName']
    job_secondary_status = event['detail']['SecondaryStatus']
    if job_secondary_status == 'Starting':
        enforce_required_tags(job_name, event)
    elif job_secondary_status == 'Training':
        create_training_alarm(job_name)
    elif job_secondary_status in ['Completed', 'Failed', 'Stopped']:
        delete_training_alarm(job_name)

    if job_secondary_status == 'Failed':
        start_time = datetime.datetime.utcfromtimestamp(
                                     event['detail']['CreationTime']/1000)
        end_time = datetime.datetime.utcfromtimestamp(
                                     event['detail']['TrainingEndTime']/1000)
        training_time_seconds = (end_time - start_time).total_seconds()
        if training_time_seconds &amp;gt;= TRAIN_TIME_THRESHOLD:
            clone_job(job_name)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The function above automatically resumes any job that fails after two hours. A more practical solution might attempt to diagnose the type of error to determine whether resuming the job would be appropriate. One way to do this is to parse the failure description message and/or the CloudWatch logs associated with the failing job.&lt;/p&gt;
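
&lt;p&gt;A minimal sketch of such a diagnosis step appears below. The substrings used to classify failures are assumptions made for illustration; you would calibrate them against the failure descriptions you actually observe in your own jobs:&lt;/p&gt;

```python
# Hypothetical failure triage: pattern-match the job's failure description
# to decide whether a retry is worthwhile. The substrings are assumptions.
RETRYABLE_PATTERNS = ('InternalServerError', 'CapacityError')
FATAL_PATTERNS = ('AlgorithmError', 'ClientError')

def should_resume(failure_reason):
    if any(p in failure_reason for p in FATAL_PATTERNS):
        return False  # likely a bug in our code -- fix it before resuming
    return any(p in failure_reason for p in RETRYABLE_PATTERNS)

print(should_resume('InternalServerError: unexpected service failure'))
print(should_resume('AlgorithmError: ExecuteUserScriptError exit code 1'))
```

&lt;p&gt;The handler would then call &lt;em&gt;clone_job&lt;/em&gt; only when the check passes.&lt;/p&gt;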

&lt;h1&gt;
  
  
  Advanced Spot-instance Utilization
&lt;/h1&gt;

&lt;p&gt;One of the compelling features of Amazon SageMaker is its support for &lt;a href="https://docs.aws.amazon.com/sagemaker/latest/dg/model-managed-spot-training.html"&gt;managed spot training&lt;/a&gt;. &lt;a href="https://aws.amazon.com/ec2/spot/"&gt;Amazon EC2 Spot Instances&lt;/a&gt; allow you to take advantage of unused EC2 capacity at discounted prices. The catch is that these instances can be taken away ("interrupted") in the middle of their use. Thus, Spot instances should be used only for fault-tolerant workloads. SageMaker makes it easy to take advantage of Spot instances by identifying Spot interruptions on your behalf and automatically restarting jobs when new Spot instances become available. While &lt;a href="https://docs.aws.amazon.com/sagemaker/latest/dg/model-managed-spot-training.html"&gt;managed spot&lt;/a&gt; instances can be used to reduce the cost of training, sometimes this strategy can backfire. For example, when there is low Spot capacity, your training jobs might time out before starting. Alternatively, the job might experience frequent interruptions that prevent it from making any meaningful progress. Both occurrences can interfere with development and reduce productivity. These types of situations can be monitored and addressed using AWS Lambda. In the code block below, we extend our &lt;em&gt;sagemaker_event_handler&lt;/em&gt; function to identify a training job that has been interrupted more than three times and replace it with a cloned job in which managed spot training is disabled.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;def sagemaker_event_handler(event, _):\
    TRAIN_TIME_THRESHOLD = 2 * 60 * 60: # 2 hours\
    MIN_ITERRUPTS = 3\
    job_name = event['detail']['TrainingJobName']\
    job_secondary_status = event['detail']['SecondaryStatus']\
    if job_secondary_status == 'Starting':\
        enforce_required_tags(job_name, event)\
    elif job_secondary_status == 'Training':\
        create_training_alarm(job_name)\
    elif job_secondary_status in ['Completed', 'Failed', 'Stopped']:\
        delete_training_alarm(job_name)

    if job_secondary_status == 'Failed':\
        start_time = datetime.datetime.utcfromtimestamp(\
                                     event['detail']['CreationTime']/1000)\
        end_time = datetime.datetime.utcfromtimestamp(\
                                     event['detail']['TrainingEndTime']/1000)\
        training_time_seconds = (end_time - start_time).seconds\
        if training_time_seconds &amp;gt;= TRAIN_TIME_THRESHOLD:\
            clone_job(job_name)

    if job_secondary_status == 'Interrupted':\
        transitions = event['detail']["SecondaryStatusTransitions"]\
        interrupts = [e for e in transitions if e["Status"] == "Interrupted"]\
        num_interrupts = len(interrupts)\
        if num_interrupts &amp;gt; MIN_ITERRUPTS:\
            stop_training_job(job_name)\
            clone_job(job_name, disable_spot=True)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The implementation above determines the Spot usage strategy based solely on the number of interruptions of the training job in question. A more elaborate solution might take into account other jobs (that use the same instance types), the duration of time across which the interruptions occurred, the amount of active training time, and/or the number of recent jobs that timed out due to low Spot instance capacity.&lt;/p&gt;
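
&lt;p&gt;For instance, a sketch of one such refinement appears below: rather than counting interruptions alone, it estimates how densely they occurred. The transition schema follows &lt;em&gt;SecondaryStatusTransitions&lt;/em&gt;, but representing timestamps as epoch seconds is a simplification made for illustration:&lt;/p&gt;

```python
# Hedged sketch: estimate the interruption *rate* of a job from its status
# transitions (timestamps here are plain epoch seconds, an assumption).
def interrupts_per_hour(transitions, now):
    interrupt_starts = [t['StartTime'] for t in transitions
                        if t['Status'] == 'Interrupted']
    if not interrupt_starts:
        return 0.0
    # hours elapsed since the first interruption
    elapsed_hours = (now - min(interrupt_starts)) / 3600
    return len(interrupt_starts) / max(elapsed_hours, 1e-9)

transitions = [
    {'Status': 'Training', 'StartTime': 0},
    {'Status': 'Interrupted', 'StartTime': 3600},
    {'Status': 'Training', 'StartTime': 5400},
    {'Status': 'Interrupted', 'StartTime': 7200},
]
# 2 interruptions in the 2 hours since the first one -> 1.0 per hour
print(interrupts_per_hour(transitions, now=10800))
```

&lt;p&gt;A policy could combine this rate with the amount of active training time before deciding to fall back to on-demand instances.&lt;/p&gt;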

&lt;h1&gt;
  
  
  Summary
&lt;/h1&gt;

&lt;p&gt;Effective AI model development requires a creative and detailed training infrastructure architecture that minimizes cost and maximizes productivity. In this post we have demonstrated how serverless AWS Lambda functions can be used to augment Amazon SageMaker's managed training service in order to address some common issues that can occur during training. Naturally, the precise manner in which you might apply these kinds of techniques will depend greatly on the specifics of your project.&lt;/p&gt;

&lt;p&gt;Please feel free to reach out with questions, comments, and corrections. Be sure to check out our &lt;a href="https://chaimrand.medium.com/"&gt;other posts&lt;/a&gt; on the topic of DL training optimization.&lt;/p&gt;

</description>
    </item>
    <item>
      <title>Accelerating PyTorch Training Workloads with FP8</title>
      <dc:creator>Chaim Rand</dc:creator>
      <pubDate>Thu, 21 Dec 2023 14:35:00 +0000</pubDate>
      <link>https://dev.to/aws-builders/accelerating-pytorch-training-workloads-with-fp8-3ni5</link>
      <guid>https://dev.to/aws-builders/accelerating-pytorch-training-workloads-with-fp8-3ni5</guid>
      <description>&lt;h1&gt;
  
  
  How to make the most of your modern-day GPU
&lt;/h1&gt;

&lt;p&gt;The past few years have seen revolutionary advancements in the field of AI, perhaps best exemplified by the recent popularity and proliferation of LLM-based applications such as &lt;a href="https://en.wikipedia.org/wiki/ChatGPT"&gt;ChatGPT&lt;/a&gt;. These breakthroughs have been powered by equally exciting developments in the machinery used to train AI models. New and innovative architectures, sophisticated tensor processing cores, and dedicated HW accelerators have enabled the convergence of AI models of ever-increasing sizes, at faster and faster rates. In this post, we will focus on one particular advancement in AI-specialized HW — the inclusion of dedicated 8-bit floating-point (FP8) tensor processing cores. Appearing in the most modern AI HW architectures (e.g., &lt;a href="https://www.nvidia.com/en-eu/data-center/technologies/hopper-architecture/"&gt;Nvidia Hopper&lt;/a&gt;, &lt;a href="https://www.nvidia.com/en-eu/geforce/ada-lovelace-architecture/"&gt;Nvidia Ada Lovelace&lt;/a&gt;, and &lt;a href="https://www.intel.com/content/www/us/en/developer/articles/technical/habana-gaudi2-processor-for-deep-learning.html"&gt;Habana Gaudi2&lt;/a&gt;) the FP8 tensor cores enable a significant increase in floating-point operations per second (FLOPS), as well as opportunities for memory optimization and energy savings for both AI training and inference workloads.&lt;/p&gt;

&lt;p&gt;Taking advantage of the HW-level FP8 capabilities requires appropriate support in the SW stack and development framework that we use to build our AI training and inference applications. In this post we will describe how to &lt;strong&gt;modify a PyTorch training script so as to utilize the built-in support for the FP8 datatype of an &lt;a href="https://www.nvidia.com/en-eu/data-center/h100/"&gt;Nvidia H100 GPU&lt;/a&gt;&lt;/strong&gt;. We will start by providing some motivation for the use of the FP8 datatype. We will then review the FP8-specific PyTorch API support exposed by the &lt;a href="https://github.com/NVIDIA/TransformerEngine"&gt;Transformer Engine&lt;/a&gt; library and show how to integrate it into a simple training script. Although we will not go into the theory behind the use of FP8 for AI training, we will note the potential challenges involved in its use. Last, we will demonstrate the significant optimization opportunities of the FP8 datatype.&lt;/p&gt;

&lt;h3&gt;
  
  
  Disclaimers
&lt;/h3&gt;

&lt;p&gt;Please do not interpret our mention of any SW component, methodology, or service as an endorsement for its use. The best design for ML development will vary greatly based on the specific details of your own AI workload. Please also keep in mind that the APIs and behaviors of some of the SW packages and components we will mention may change by the time you read this post. You are highly encouraged to evaluate any potential design decisions based on the most up to date HW and SW available.&lt;/p&gt;

&lt;h1&gt;
  
  
  Motivation
&lt;/h1&gt;

&lt;p&gt;As AI models grow more and more sophisticated, so does the machinery required to train them. The &lt;a href="https://www.nvidia.com/en-eu/data-center/h100/"&gt;Nvidia H100 GPU&lt;/a&gt;, said to support “unprecedented performance and scalability”, is (at the time of this writing) Nvidia’s newest and strongest AI accelerator, purposely designed to enable the next generation of AI development. With the current AI hype in full swing, the demand for these GPUs has been huge (e.g., see &lt;a href="https://www.hpcwire.com/2023/08/17/nvidia-h100-are-550000-gpus-enough-for-this-year/"&gt;here&lt;/a&gt;). Accordingly, and unsurprisingly, the cost of these GPUs has been extremely high — perhaps even prohibitive for many of our readers. Fortunately, cloud service providers such as AWS, GCP, and Microsoft Azure offer “pay as you go” (per hour/per second) access to H100-powered machines, thereby opening up the opportunity for their use to a much greater community of AI developers.&lt;/p&gt;

&lt;p&gt;In AWS, H100 GPUs are offered as a component of the &lt;a href="https://aws.amazon.com/blogs/aws/new-amazon-ec2-p5-instances-powered-by-nvidia-h100-tensor-core-gpus-for-accelerating-generative-ai-and-hpc-applications/"&gt;recently announced&lt;/a&gt; &lt;a href="https://aws.amazon.com/ec2/instance-types/p5/"&gt;AWS EC2 p5 instance family&lt;/a&gt;. These instances are claimed to “accelerate your time to solution by up to 4x compared to previous-generation GPU-based EC2 instances and reduce cost to train ML models by up to 40%”.&lt;/p&gt;

&lt;p&gt;In a &lt;a href="https://towardsdatascience.com/instance-selection-for-deep-learning-7463d774cff0"&gt;recent post&lt;/a&gt; we discussed some of the considerations that should go into the choice of an ML training instance. We highlighted the fact that the optimal instance type will be very much dependent on the project at hand. Specifically, when it comes to ML training instances — &lt;strong&gt;bigger is not always better&lt;/strong&gt;. This is particularly true of the p5 instance family. True — the p5 will likely outperform any other instance type — after all, the H100 is an undisputed performance beast. But once you factor in the cost of the p5 ($98.32 per hour for the 8-GPU p5.48xlarge instance — at the time of this writing), you might find other instance types to be more suitable.&lt;/p&gt;

&lt;p&gt;In the next section we will train a relatively large computer vision model on a p5.48xlarge and compare its performance to a &lt;a href="https://aws.amazon.com/ec2/instance-types/p4/"&gt;p4d.24xlarge&lt;/a&gt; containing 8 &lt;a href="https://www.nvidia.com/en-eu/data-center/a100/"&gt;Nvidia A100 GPUs&lt;/a&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  Toy Model
&lt;/h3&gt;

&lt;p&gt;In the code-block below we define a &lt;a href="https://en.wikipedia.org/wiki/Vision_transformer"&gt;Vision Transformer&lt;/a&gt; (ViT)-backed classification model (using the popular &lt;a href="https://pypi.org/project/timm/"&gt;timm&lt;/a&gt; Python package version 0.9.10) along with a randomly generated dataset. ViT backbones come in many shapes and sizes. Here we have chosen what is often referred to as the ViT-Huge configuration — with &lt;strong&gt;632&lt;/strong&gt; million parameters — in order to take better advantage of the capacity the H100 has for large models.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import torch, time
import torch.optim
import torch.utils.data
import torch.distributed as dist
from torch.nn.parallel.distributed import DistributedDataParallel as DDP
import torch.multiprocessing as mp

# modify batch size according to GPU memory
batch_size = 64

from timm.models.vision_transformer import VisionTransformer

from torch.utils.data import Dataset


# use random data
class FakeDataset(Dataset):
    def __len__(self):
        return 1000000

    def __getitem__(self, index):
        rand_image = torch.randn([3, 224, 224], dtype=torch.float32)
        label = torch.tensor(data=[index % 1000], dtype=torch.int64)
        return rand_image, label


def mp_fn(local_rank, *args):
    # configure process
    dist.init_process_group("nccl",
                            rank=local_rank,
                            world_size=torch.cuda.device_count())
    torch.cuda.set_device(local_rank)
    device = torch.cuda.current_device()

    # create dataset and dataloader
    train_set = FakeDataset()
    train_loader = torch.utils.data.DataLoader(
        train_set, batch_size=batch_size,
        num_workers=12, pin_memory=True)

    # define ViT-Huge model
    model = VisionTransformer(
            embed_dim=1280,
            depth=32,
            num_heads=16,
        ).cuda(device)
    model = DDP(model, device_ids=[local_rank])

    # define loss and optimizer
    criterion = torch.nn.CrossEntropyLoss()
    optimizer = torch.optim.SGD(model.parameters(), lr=0.001, momentum=0.9)

    model.train()

    t0 = time.perf_counter()
    summ = 0
    count = 0

    for step, data in enumerate(train_loader):
        # copy data to GPU
        inputs = data[0].to(device=device, non_blocking=True)
        label = data[1].squeeze(-1).to(device=device, non_blocking=True)

        # use mixed precision to take advantage of bfloat16 support
        with torch.autocast(device_type='cuda', dtype=torch.bfloat16):
            outputs = model(inputs)
            loss = criterion(outputs, label)
        optimizer.zero_grad(set_to_none=True)
        loss.backward()
        optimizer.step()

        # capture step time
        batch_time = time.perf_counter() - t0
        if step &amp;gt; 10:  # skip first steps
            summ += batch_time
            count += 1
        t0 = time.perf_counter()
        if step &amp;gt; 50:
            break
    print(f'average step time: {summ/count}')


if __name__ == '__main__':
    mp.spawn(mp_fn,
             args=(),
             nprocs=torch.cuda.device_count(),
             join=True)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;We trained this model on both the &lt;a href="https://aws.amazon.com/ec2/instance-types/p5/"&gt;p5.48xlarge&lt;/a&gt; and &lt;a href="https://aws.amazon.com/ec2/instance-types/p4/"&gt;p4d.24xlarge&lt;/a&gt; instance types using the dedicated PyTorch 2.1 &lt;a href="https://github.com/aws/deep-learning-containers/blob/master/available_images.md"&gt;AWS deep learning container&lt;/a&gt; (763104351884.dkr.ecr.us-east-1.amazonaws.com/pytorch-training:2.1.0-gpu-py310-cu121-ubuntu20.04-ec2).&lt;/p&gt;

&lt;p&gt;Unsurprisingly, the p5 step-time performance blows away the p4d performance — 0.199 seconds per step compared to 0.41 — more than twice as fast!! That would mean halving the time to train your large ML models. However, when you take into account the difference in cost ($32.77 per-hour for the p4d vs $98.32 per-hour for the p5 — as of the time of this writing) a completely different story unfolds. The &lt;strong&gt;price-performance of the p5 is ~30% worse than the p4d!!&lt;/strong&gt; This is very far from the 40% improvement that appeared in the &lt;a href="https://aws.amazon.com/blogs/aws/new-amazon-ec2-p5-instances-powered-by-nvidia-h100-tensor-core-gpus-for-accelerating-generative-ai-and-hpc-applications/"&gt;p5 announcement&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;At this point you might draw one of two possible conclusions. The first possibility is that, despite all the hype, the p5 is simply not the right machine for you. The second is that the p5 could still be viable, but that adaptations would be required to your model in order to take full advantage of its potential. In the next sections we will adopt the second approach and demonstrate how using the FP8 datatype — unique to the p5 instance type — can completely alter the comparative price-performance results.&lt;/p&gt;
&lt;h1&gt;
  
  
  Integrating FP8 with Transformer Engine
&lt;/h1&gt;

&lt;p&gt;The first thing we should emphasize is that, as of the time of this writing, PyTorch (version 2.1) does not include a native 8-bit floating datatype. To program our script to use FP8 we will use &lt;a href="https://github.com/NVIDIA/TransformerEngine"&gt;Transformer Engine&lt;/a&gt; (TE) a dedicated library for accelerating Transformer models on NVIDIA GPUs. TE (version 0.12) comes preinstalled in the AWS PyTorch 2.1 DL container.&lt;/p&gt;

&lt;p&gt;Although the theory behind the use of FP8 for training is beyond the scope of this post (e.g., see &lt;a href="https://arxiv.org/pdf/2209.05433.pdf"&gt;here&lt;/a&gt;), it is important to be aware that &lt;strong&gt;the mechanics of using FP8 are far more complex than the &lt;a href="https://pytorch.org/docs/stable/amp.html"&gt;16-bit alternatives&lt;/a&gt;&lt;/strong&gt; (float16 and bfloat16). Fortunately, the TE implementation hides all of the messy details from the user. Please see the official &lt;a href="https://docs.nvidia.com/deeplearning/transformer-engine/user-guide/examples/fp8_primer.html"&gt;documentation&lt;/a&gt; as well as this simple &lt;a href="https://github.com/NVIDIA/TransformerEngine/blob/main/examples/pytorch/mnist/main.py"&gt;example&lt;/a&gt; for instructions on how to use the TE APIs. To learn more about what is going on behind the scenes, be sure to see the following two video tutorials.&lt;br&gt;
&lt;/p&gt;
&lt;div class="crayons-card c-embed text-styles text-styles--secondary"&gt;
      &lt;div class="c-embed__cover"&gt;
        &lt;a href="https://www.nvidia.com/en-us/on-demand/session/gtcspring23-s51393/?source=post_page-----5a5123aec7d7--------------------------------" class="c-link s:max-w-50 align-middle" rel="noopener noreferrer"&gt;
          &lt;img alt="" src="https://res.cloudinary.com/practicaldev/image/fetch/s--kJqTlB_S--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdnsecakmi.kaltura.com/p/2935771/thumbnail/entry_id/1_88mj6hcs/width/1200" height="450" class="m-0" width="800"&gt;
        &lt;/a&gt;
      &lt;/div&gt;
    &lt;div class="c-embed__body"&gt;
      &lt;h2 class="fs-xl lh-tight"&gt;
        &lt;a href="https://www.nvidia.com/en-us/on-demand/session/gtcspring23-s51393/?source=post_page-----5a5123aec7d7--------------------------------" rel="noopener noreferrer" class="c-link"&gt;
          FP8 Training with Transformer Engine | NVIDIA On-Demand
        &lt;/a&gt;
      &lt;/h2&gt;
        &lt;p class="truncate-at-3"&gt;
          The session will include an introduction to FP8 and mixed precision, an overview of Transformer Engine features, and a code demo on how to use the library.
        &lt;/p&gt;
      &lt;div class="color-secondary fs-s flex items-center"&gt;
        nvidia.com
      &lt;/div&gt;
    &lt;/div&gt;
&lt;/div&gt;



&lt;div class="crayons-card c-embed text-styles text-styles--secondary"&gt;
      &lt;div class="c-embed__cover"&gt;
        &lt;a href="https://www.nvidia.com/en-us/on-demand/session/gtcspring23-s52166/?source=post_page-----5a5123aec7d7--------------------------------" class="c-link s:max-w-50 align-middle" rel="noopener noreferrer"&gt;
          &lt;img alt="" src="https://res.cloudinary.com/practicaldev/image/fetch/s--6W7Z1Iv1--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdnsecakmi.kaltura.com/p/2935771/thumbnail/entry_id/1_imfejfnz/width/1200" height="450" class="m-0" width="800"&gt;
        &lt;/a&gt;
      &lt;/div&gt;
    &lt;div class="c-embed__body"&gt;
      &lt;h2 class="fs-xl lh-tight"&gt;
        &lt;a href="https://www.nvidia.com/en-us/on-demand/session/gtcspring23-s52166/?source=post_page-----5a5123aec7d7--------------------------------" rel="noopener noreferrer" class="c-link"&gt;
          FP8 for Deep Learning | NVIDIA On-Demand
        &lt;/a&gt;
      &lt;/h2&gt;
        &lt;p class="truncate-at-3"&gt;
          FP8 is a natural progression for accelerating deep learning (DL) training beyond the 16-bit formats common in modern processors
        &lt;/p&gt;
      &lt;div class="color-secondary fs-s flex items-center"&gt;
        nvidia.com
      &lt;/div&gt;
    &lt;/div&gt;
&lt;/div&gt;



&lt;p&gt;To modify our model to use TE, we wrap TE’s specialized Transformer Layer with a custom transformer block class that conforms to timm’s &lt;a href="https://github.com/huggingface/pytorch-image-models/blob/v0.9.10/timm/models/vision_transformer.py#L114"&gt;block layer signature&lt;/a&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import transformer_engine.pytorch as te
from transformer_engine.common import recipe


class TE_Block(te.transformer.TransformerLayer):
    def __init__(
            self,
            dim,
            num_heads,
            mlp_ratio=4.,
            qkv_bias=False,
            qk_norm=False,
            proj_drop=0.,
            attn_drop=0.,
            init_values=None,
            drop_path=0.,
            act_layer=None,
            norm_layer=None,
            mlp_layer=None
    ):
        super().__init__(
            hidden_size=dim,
            ffn_hidden_size=int(dim * mlp_ratio),
            num_attention_heads=num_heads,
            hidden_dropout=proj_drop,
            attention_dropout=attn_drop
            )
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Next, we modify the VisionTransformer initialization to use our custom block layer:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;  model = VisionTransformer(
      embed_dim=1280,
      depth=32,
      num_heads=16,
      block_fn=TE_Block
      ).cuda(device)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Until now we have not made any H100-specific changes — the same code can be run on our A100-powered p4d instance type. The last modification is wrapping the model forward-pass with a te.fp8_autocast context manager. This change requires a GPU that supports FP8:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;with torch.autocast(device_type='cuda', dtype=torch.bfloat16):
    with te.fp8_autocast(enabled=True):
        outputs = model(inputs)
    # the loss is computed outside the FP8 context, in BF16
    loss = criterion(outputs, label)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
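&lt;p&gt;As a sanity check before enabling te.fp8_autocast, you can verify that the underlying GPU actually supports FP8. The helper below is an illustrative sketch (not part of TE or PyTorch), based on the fact that FP8 tensor cores first appeared on devices of compute capability 8.9 (Ada) and 9.0 (Hopper):&lt;/p&gt;

```python
def fp8_supported() -> bool:
    """Best-effort check for FP8-capable hardware (illustrative helper)."""
    try:
        import torch
    except ImportError:
        return False
    if not torch.cuda.is_available():
        return False
    major, minor = torch.cuda.get_device_capability()
    # FP8 tensor cores ship with Ada (sm_89) and Hopper (sm_90) GPUs.
    return (major, minor) >= (8, 9)
```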



&lt;h3&gt;
  
  
  A Few Cautionary Remarks Regarding the Use of FP8
&lt;/h3&gt;

&lt;p&gt;Usage of an 8-bit floating-point representation (as opposed to a 16- or 32-bit representation) implies lower precision and a lower dynamic range. These can have a meaningful impact on the attainability and/or speed of your model convergence. Although the underlying TE FP8 implementation is designed to address this challenge, there is no guarantee that this will work for your model. You may need to fiddle with the underlying FP8 mechanics (e.g., using the TE &lt;a href="https://github.com/NVIDIA/TransformerEngine/blob/release_v0.12/transformer_engine/common/recipe.py"&gt;recipe&lt;/a&gt; APIs), tune some of the hyperparameters, and/or limit the application of FP8 to subsections of the model. You might find that despite all of your attempts, your model is simply not compatible with FP8.&lt;/p&gt;
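&lt;p&gt;To make the recipe mechanics a bit more concrete: TE’s default "delayed scaling" approach derives each tensor’s FP8 scaling factor from the maximum absolute value (amax) observed over a short history of recent steps, rather than from the current tensor alone. The pure-Python sketch below illustrates the idea; the names and the margin convention are illustrative, not TE internals:&lt;/p&gt;

```python
# Largest representable magnitude in the FP8 E4M3 format.
FP8_E4M3_MAX = 448.0

def delayed_scale(amax_history, margin=0):
    """Scaling factor from a window of recent amax values (sketch).

    `margin` leaves extra powers of two of headroom below the format
    maximum, guarding against amax spikes between scale updates.
    """
    amax = max(amax_history)
    return FP8_E4M3_MAX / (amax * 2.0 ** margin)

# amax values recorded over the last three training steps:
history = [1.5, 3.0, 2.2]
scale = delayed_scale(history)  # = 448.0 / 3.0
# values are multiplied by the scale before being cast to FP8
quantized_range = [x * scale for x in (0.5, -3.0, 1.0)]
```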

&lt;h1&gt;
  
  
  Results
&lt;/h1&gt;

&lt;p&gt;In the table below we summarize the results of our experiments on the p4d.24xlarge and p5.48xlarge EC2 instance types, with and without the TE library. For the p5.48xlarge experiments we doubled the batch size in order to increase the utilization of the 80 GB of GPU memory. Using FP8 reduces GPU memory consumption, enabling a further increase in batch size.&lt;br&gt;
&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--cFYXV44V--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/4whovmmk5bwtx7qsbq5y.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--cFYXV44V--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/4whovmmk5bwtx7qsbq5y.png" alt="Image description" width="800" height="234"&gt;&lt;/a&gt;&lt;br&gt;
We can see that the use of the TE transformer block increased the price-performance on both the p4d (~19%) and the p5 (~32%) instance types. Using FP8 boosts the performance on the p5 by an additional ~20%. Following the TE and FP8 optimizations, &lt;strong&gt;the price-performance of the H100-based p5.48xlarge beats that of the A100-based p4d.24xlarge&lt;/strong&gt; — although not by a very wide margin (~2%). Taking into account the 3x increase in training speed, we can safely conclude that the p5 would be the better instance type for training our optimized model.&lt;/p&gt;

&lt;p&gt;Note that the relatively small increase in price-performance (far lower than the 40% mentioned in the p5 announcement) leaves us wishing for additional H100-specific optimizations… but those will have to wait for another post :).&lt;/p&gt;

&lt;h1&gt;
  
  
  Summary
&lt;/h1&gt;

&lt;p&gt;In this post we have demonstrated how to program a PyTorch training script to use 8-bit floating-point types. We further demonstrated how the use of FP8 can be a key factor in getting the best performance out of modern GPUs such as the Nvidia H100. Importantly, the viability of FP8, as well as its impact on training performance, can vary a great deal based on the details of the model.&lt;/p&gt;

&lt;p&gt;This post continues a long series of publications on the topic of optimizing machine learning workloads. Be sure to see some of our &lt;a href="https://chaimrand.medium.com/"&gt;other posts&lt;/a&gt; on this important topic.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>deeplearning</category>
    </item>
  </channel>
</rss>
