<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Itamar Perez</title>
    <description>The latest articles on DEV Community by Itamar Perez (@cosmicrocks).</description>
    <link>https://dev.to/cosmicrocks</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F16116%2Fcd7c1b46-71e2-4611-898f-dd9be39fef75.jpeg</url>
      <title>DEV Community: Itamar Perez</title>
      <link>https://dev.to/cosmicrocks</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/cosmicrocks"/>
    <language>en</language>
    <item>
      <title>Enabling GPU Nodes for PyTorch Workloads on EKS with Autoscaling</title>
      <dc:creator>Itamar Perez</dc:creator>
      <pubDate>Tue, 12 Sep 2023 18:24:40 +0000</pubDate>
      <link>https://dev.to/cosmicrocks/enabling-gpu-nodes-for-pytorch-workloads-on-eks-with-autoscaling-1d9f</link>
      <guid>https://dev.to/cosmicrocks/enabling-gpu-nodes-for-pytorch-workloads-on-eks-with-autoscaling-1d9f</guid>
      <description>&lt;h2&gt;
  
  
  Enabling GPU Nodes for PyTorch Workloads on EKS with Autoscaling
&lt;/h2&gt;

&lt;p&gt;Amazon Elastic Kubernetes Service (EKS) provides a managed Kubernetes service that makes it easier for users to run Kubernetes on AWS without needing to install, operate, and maintain their own Kubernetes control plane or nodes. When running machine learning workloads, especially those that require GPU acceleration like PyTorch, it's essential to set up GPU nodes. This article will guide you through the process of setting up GPU nodes for PyTorch workloads on EKS with autoscaling.&lt;/p&gt;

&lt;h3&gt;
  
  
  Prerequisites
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Basic knowledge of Kubernetes, EKS, and Terraform.&lt;/li&gt;
&lt;li&gt;AWS CLI and kubectl installed and configured.&lt;/li&gt;
&lt;li&gt;Docker installed.&lt;/li&gt;
&lt;li&gt;Helm installed.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  1. Setting up the EKS Cluster with GPU Nodes using Terraform
&lt;/h3&gt;

&lt;p&gt;Before applying the Terraform configuration:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Ensure you have the AWS provider configured in your Terraform setup.&lt;/li&gt;
&lt;li&gt;Initialize the Terraform directory using &lt;code&gt;terraform init&lt;/code&gt;.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Here's a snippet from a Terraform configuration that sets up an EKS cluster and a self-managed GPU node group:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight hcl"&gt;&lt;code&gt;&lt;span class="c1"&gt;## (https://github.com/aws-ia/terraform-aws-eks-blueprints)&lt;/span&gt;
&lt;span class="c1"&gt;## ... [other Terraform code]&lt;/span&gt;

&lt;span class="c1"&gt;## Cluster Configuration&lt;/span&gt;
&lt;span class="nx"&gt;module&lt;/span&gt; &lt;span class="s2"&gt;"eks"&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="c1"&gt;# ... [other configuration]&lt;/span&gt;

  &lt;span class="nx"&gt;self_managed_node_groups&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="nx"&gt;gpu_node_group&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
      &lt;span class="nx"&gt;node_group_name&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"gpu-node-group"&lt;/span&gt;
      &lt;span class="nx"&gt;ami_type&lt;/span&gt;        &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"AL2_x86_64_GPU"&lt;/span&gt;
      &lt;span class="nx"&gt;capacity_type&lt;/span&gt;   &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"ON_DEMAND"&lt;/span&gt;
      &lt;span class="nx"&gt;instance_types&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
        &lt;span class="s2"&gt;"g4dn.xlarge"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="s2"&gt;"g4dn.2xlarge"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
      &lt;span class="p"&gt;]&lt;/span&gt;
      &lt;span class="c1"&gt;# ... [other configuration]&lt;/span&gt;

      &lt;span class="nx"&gt;taints&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="nx"&gt;dedicated&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
          &lt;span class="nx"&gt;key&lt;/span&gt;    &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"nvidia.com/gpu"&lt;/span&gt;
          &lt;span class="nx"&gt;value&lt;/span&gt;  &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"true"&lt;/span&gt;
          &lt;span class="nx"&gt;effect&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"NO_SCHEDULE"&lt;/span&gt;
        &lt;span class="p"&gt;}&lt;/span&gt;
      &lt;span class="p"&gt;}&lt;/span&gt;
      &lt;span class="c1"&gt;# ... [other configuration]&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  2. Building the PyTorch Container with GPU Support
&lt;/h3&gt;

&lt;p&gt;Your &lt;code&gt;requirements.txt&lt;/code&gt; should contain:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;torch
torchvision
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;To run PyTorch workloads on the GPU nodes, you need a container image with the necessary dependencies. Here's a Dockerfile that sets up a PyTorch environment with GPU support:&lt;/p&gt;

&lt;p&gt;Your Dockerfile should contain:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight docker"&gt;&lt;code&gt;&lt;span class="k"&gt;FROM&lt;/span&gt;&lt;span class="s"&gt; nvidia/cuda:12.2.0-devel-ubuntu20.04&lt;/span&gt;
&lt;span class="k"&gt;WORKDIR&lt;/span&gt;&lt;span class="s"&gt; /app&lt;/span&gt;

&lt;span class="c"&gt;# Install Python and pip&lt;/span&gt;
&lt;span class="k"&gt;RUN &lt;/span&gt;apt-get update &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; &lt;span class="se"&gt;\
&lt;/span&gt;    apt-get &lt;span class="nb"&gt;install&lt;/span&gt; &lt;span class="nt"&gt;-y&lt;/span&gt; python3 python3-pip &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; &lt;span class="se"&gt;\
&lt;/span&gt;    &lt;span class="nb"&gt;rm&lt;/span&gt; &lt;span class="nt"&gt;-rf&lt;/span&gt; /var/lib/apt/lists/&lt;span class="k"&gt;*&lt;/span&gt;

&lt;span class="c"&gt;# Install requirements&lt;/span&gt;
&lt;span class="k"&gt;COPY&lt;/span&gt;&lt;span class="s"&gt; requirements.txt ./&lt;/span&gt;
&lt;span class="k"&gt;RUN &lt;/span&gt;pip &lt;span class="nb"&gt;install&lt;/span&gt; &lt;span class="nt"&gt;--no-cache-dir&lt;/span&gt; &lt;span class="nt"&gt;-r&lt;/span&gt; requirements.txt
&lt;span class="k"&gt;COPY&lt;/span&gt;&lt;span class="s"&gt; app.py ./&lt;/span&gt;
&lt;span class="c"&gt;# Run app&lt;/span&gt;
&lt;span class="k"&gt;CMD&lt;/span&gt;&lt;span class="s"&gt; ["python3", "./app.py"]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Your &lt;code&gt;app.py&lt;/code&gt; can be as follows:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;torch&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;time&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;main&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
    &lt;span class="c1"&gt;# Check if CUDA (GPU support) is available
&lt;/span&gt;    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="n"&gt;torch&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;cuda&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;is_available&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
        &lt;span class="k"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"GPU not available. Exiting."&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt;

    &lt;span class="c1"&gt;# Set the device to GPU
&lt;/span&gt;    &lt;span class="n"&gt;device&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;torch&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;device&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"cuda:0"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="n"&gt;iteration&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;
    &lt;span class="k"&gt;while&lt;/span&gt; &lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;iteration&lt;/span&gt; &lt;span class="o"&gt;+=&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;

        &lt;span class="c1"&gt;# Create two random tensors
&lt;/span&gt;        &lt;span class="n"&gt;a&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;torch&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;randn&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1000&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;1000&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;device&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;device&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;b&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;torch&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;randn&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1000&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;1000&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;device&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;device&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

        &lt;span class="c1"&gt;# Perform a matrix multiplication on GPU
&lt;/span&gt;        &lt;span class="n"&gt;c&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;torch&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;matmul&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;a&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;b&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

        &lt;span class="k"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="s"&gt;"Iteration &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;iteration&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;: Matrix multiplication completed on GPU!"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="s"&gt;"Result tensor shape: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;c&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;shape&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

        &lt;span class="c1"&gt;# Pause for a short duration before the next iteration
&lt;/span&gt;        &lt;span class="n"&gt;time&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;sleep&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;__name__&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="s"&gt;"__main__"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;main&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  3. Pushing PyTorch GPU App to AWS ECR
&lt;/h3&gt;

&lt;h4&gt;
  
  
  Prerequisites:
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;Docker installed and running.&lt;/li&gt;
&lt;li&gt;AWS CLI installed and configured with the necessary permissions.&lt;/li&gt;
&lt;/ul&gt;

&lt;h4&gt;
  
  
  Steps:
&lt;/h4&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Set Environment Variables&lt;/strong&gt;:&lt;/p&gt;

&lt;p&gt;Set your AWS account ID and region as environment variables:&lt;br&gt;
&lt;/p&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;export &lt;/span&gt;&lt;span class="nv"&gt;AWS_ACCOUNT&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&amp;lt;your-account-id&amp;gt;
&lt;span class="nb"&gt;export &lt;/span&gt;&lt;span class="nv"&gt;AWS_REGION&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&amp;lt;your-region&amp;gt;
&lt;/code&gt;&lt;/pre&gt;


&lt;p&gt;Replace &lt;code&gt;&amp;lt;your-account-id&amp;gt;&lt;/code&gt; with your AWS account ID and &lt;code&gt;&amp;lt;your-region&amp;gt;&lt;/code&gt; with your AWS region (e.g., &lt;code&gt;us-west-1&lt;/code&gt;).&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Docker Build&lt;/strong&gt;:&lt;/p&gt;

&lt;p&gt;Navigate to the directory containing your Dockerfile and app.js. Build the Docker image:&lt;br&gt;
&lt;/p&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;docker buildx build &lt;span class="nt"&gt;--platform&lt;/span&gt; linux/amd64 &lt;span class="nt"&gt;-t&lt;/span&gt; pytorch-gpu-app &lt;span class="nb"&gt;.&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Create an ECR Repository&lt;/strong&gt; (if you haven't already):&lt;br&gt;
&lt;/p&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;aws ecr create-repository &lt;span class="nt"&gt;--repository-name&lt;/span&gt; pytorch-gpu-app &lt;span class="nt"&gt;--region&lt;/span&gt; &lt;span class="nv"&gt;$AWS_REGION&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Authenticate Docker to the ECR Registry&lt;/strong&gt;:&lt;br&gt;
&lt;/p&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;aws ecr get-login-password &lt;span class="nt"&gt;--region&lt;/span&gt; &lt;span class="nv"&gt;$AWS_REGION&lt;/span&gt; | docker login &lt;span class="nt"&gt;--username&lt;/span&gt; AWS &lt;span class="nt"&gt;--password-stdin&lt;/span&gt; &lt;span class="nv"&gt;$AWS_ACCOUNT&lt;/span&gt;.dkr.ecr.&lt;span class="nv"&gt;$AWS_REGION&lt;/span&gt;.amazonaws.com
&lt;/code&gt;&lt;/pre&gt;

&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Tag the Docker Image&lt;/strong&gt;:&lt;br&gt;
&lt;/p&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;docker tag pytorch-gpu-app:latest &lt;span class="nv"&gt;$AWS_ACCOUNT&lt;/span&gt;.dkr.ecr.&lt;span class="nv"&gt;$AWS_REGION&lt;/span&gt;.amazonaws.com/pytorch-gpu-app:latest
&lt;/code&gt;&lt;/pre&gt;

&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Push the Docker Image to ECR&lt;/strong&gt;:&lt;br&gt;
&lt;/p&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;docker push &lt;span class="nv"&gt;$AWS_ACCOUNT&lt;/span&gt;.dkr.ecr.&lt;span class="nv"&gt;$AWS_REGION&lt;/span&gt;.amazonaws.com/pytorch-gpu-app:latest
&lt;/code&gt;&lt;/pre&gt;

&lt;/li&gt;
&lt;/ol&gt;

&lt;h3&gt;
  
  
  3.5 Creating an imagePullSecret for AWS ECR
&lt;/h3&gt;

&lt;p&gt;Before deploying your PyTorch workload on EKS, if your Docker image is stored in a private AWS ECR repository, you'll need to create a Kubernetes secret (&lt;code&gt;imagePullSecret&lt;/code&gt;) to allow your EKS nodes to pull the image.&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Retrieve an authentication token&lt;/strong&gt; to use to authenticate your Docker client to your registry:&lt;br&gt;
&lt;/p&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nv"&gt;TOKEN&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="si"&gt;$(&lt;/span&gt;aws ecr get-login-password &lt;span class="nt"&gt;--region&lt;/span&gt; &lt;span class="nv"&gt;$AWS_REGION&lt;/span&gt;&lt;span class="si"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Create the &lt;code&gt;imagePullSecret&lt;/code&gt;&lt;/strong&gt;:&lt;br&gt;
&lt;/p&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;kubectl create secret docker-registry ecr-secret &lt;span class="se"&gt;\&lt;/span&gt;
&lt;span class="nt"&gt;--docker-server&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;AWS_ACCOUNT.dkr.ecr.&lt;span class="nv"&gt;$AWS_REDION&lt;/span&gt;.amazonaws.com &lt;span class="se"&gt;\&lt;/span&gt;
&lt;span class="nt"&gt;--docker-username&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;AWS &lt;span class="se"&gt;\&lt;/span&gt;
&lt;span class="nt"&gt;--docker-password&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;TOKEN&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;
secret/ecr-secret created
&lt;/code&gt;&lt;/pre&gt;


&lt;p&gt;Replace &lt;code&gt;&amp;lt;your-account-id&amp;gt;&lt;/code&gt; with your AWS account ID and &lt;code&gt;&amp;lt;your-region&amp;gt;&lt;/code&gt; with your AWS region.&lt;/p&gt;
&lt;/li&gt;
&lt;/ol&gt;

&lt;h3&gt;
  
  
  4. Deploying the PyTorch Workload on EKS
&lt;/h3&gt;

&lt;p&gt;Before deploying our workloads we need to enable the nvidia k8s-device-plugin:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;kubectl create &lt;span class="nt"&gt;-f&lt;/span&gt; https://raw.githubusercontent.com/NVIDIA/k8s-device-plugin/v0.14.1/nvidia-device-plugin.yml
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;To deploy the PyTorch workload on the GPU nodes in EKS, use the following &lt;strong&gt;deployment.yaml&lt;/strong&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;apiVersion&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;apps/v1&lt;/span&gt;
&lt;span class="na"&gt;kind&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Deployment&lt;/span&gt;
&lt;span class="na"&gt;metadata&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;pytorch&lt;/span&gt;
&lt;span class="na"&gt;spec&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;replicas&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;1&lt;/span&gt;
  &lt;span class="na"&gt;selector&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;matchLabels&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;app&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;pytorch&lt;/span&gt;
  &lt;span class="na"&gt;template&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;metadata&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;labels&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;app&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;pytorch&lt;/span&gt;
    &lt;span class="na"&gt;spec&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;tolerations&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;key&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;nvidia.com/gpu"&lt;/span&gt;
          &lt;span class="na"&gt;operator&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Equal"&lt;/span&gt;
          &lt;span class="na"&gt;value&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;true"&lt;/span&gt;
          &lt;span class="na"&gt;effect&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;NoSchedule"&lt;/span&gt;
      &lt;span class="na"&gt;nodeSelector&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;NodeGroup&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;gpu-node-group&lt;/span&gt;
      &lt;span class="na"&gt;imagePullSecrets&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;ecr-secret&lt;/span&gt;
      &lt;span class="na"&gt;containers&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;pytorch&lt;/span&gt;
          &lt;span class="na"&gt;image&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;&amp;lt;your-account-id&amp;gt;.dkr.ecr.&amp;lt;your-region&amp;gt;.amazonaws.com/pytorch-gpu-app:latest&lt;/span&gt;
          &lt;span class="na"&gt;imagePullPolicy&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Always&lt;/span&gt;
          &lt;span class="na"&gt;resources&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
            &lt;span class="na"&gt;requests&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
              &lt;span class="na"&gt;cpu&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;2&lt;/span&gt;
              &lt;span class="na"&gt;memory&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;2Gi&lt;/span&gt;
              &lt;span class="na"&gt;nvidia.com/gpu&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;1&lt;/span&gt;
            &lt;span class="na"&gt;limits&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
              &lt;span class="na"&gt;cpu&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;4&lt;/span&gt;
              &lt;span class="na"&gt;memory&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;4Gi&lt;/span&gt;
              &lt;span class="na"&gt;nvidia.com/gpu&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;1&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h3&gt;
  
  
  4. Installing the Cluster Autoscaler
&lt;/h3&gt;

&lt;p&gt;The Cluster Autoscaler automatically adjusts the size of the cluster, adding or removing nodes based on resource requirements and constraints. To install the Cluster Autoscaler on your EKS cluster, you can use the Helm package manager.&lt;/p&gt;

&lt;p&gt;Ensure that the necessary RBAC roles and permissions are set up for the Cluster Autoscaler.&lt;/p&gt;

&lt;p&gt;First, add the Cluster Autoscaler Helm repository:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nv"&gt;$ &lt;/span&gt;helm repo add autoscaler https://kubernetes.github.io/autoscaler
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h4&gt;
  
  
  Using Autodiscovery
&lt;/h4&gt;

&lt;p&gt;This method allows the Cluster Autoscaler to discover node groups automatically:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nv"&gt;$ &lt;/span&gt;helm &lt;span class="nb"&gt;install &lt;/span&gt;my-release autoscaler/cluster-autoscaler &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nt"&gt;--set&lt;/span&gt; &lt;span class="s1"&gt;'autoDiscovery.clusterName'&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&amp;lt;CLUSTER NAME&amp;gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Replace &lt;code&gt;&amp;lt;CLUSTER NAME&amp;gt;&lt;/code&gt; with the name of your EKS cluster.&lt;/p&gt;

&lt;h3&gt;
  
  
  5. Scaling Up GPU Nodes in EKS
&lt;/h3&gt;

&lt;p&gt;Scaling up the number of GPU nodes in your EKS cluster ensures that you have sufficient resources to handle increased workloads. Here's how you can scale up the GPU nodes:&lt;/p&gt;

&lt;h4&gt;
  
  
  Update the Terraform Configuration:
&lt;/h4&gt;

&lt;p&gt;Modify your existing Terraform configuration to increase the desired number of nodes in the gpu_node_group. For instance, if you initially set up 2 nodes and want to scale up to 4, you'd update the desired_capacity attribute.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight hcl"&gt;&lt;code&gt;&lt;span class="nx"&gt;self_managed_node_groups&lt;/span&gt; &lt;span class="err"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nx"&gt;gpu_node_group&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="c1"&gt;# ... [other configuration]&lt;/span&gt;

    &lt;span class="nx"&gt;desired_capacity&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;4&lt;/span&gt;
    &lt;span class="nx"&gt;min_size&lt;/span&gt;         &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;
    &lt;span class="nx"&gt;max_size&lt;/span&gt;         &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;6&lt;/span&gt;

    &lt;span class="c1"&gt;# ... [other configuration]&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The min_size and max_size attributes define the minimum and maximum number of nodes in the node group, respectively. Adjust these values based on your requirements.&lt;/p&gt;

&lt;p&gt;Apply the Updated Configuration:&lt;/p&gt;

&lt;p&gt;Run the following command to apply the updated configuration:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;terraform apply
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Terraform will show you a plan of the changes it will make. Review the plan to ensure it's making the desired changes, then confirm to apply.&lt;/p&gt;

&lt;h4&gt;
  
  
  Monitor the Scaling Process:
&lt;/h4&gt;

&lt;p&gt;You can monitor the scaling process using the AWS Management Console or the kubectl command. To check the nodes being added to your cluster using kubectl, run:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;kubectl get nodes
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This will display a list of all nodes in your cluster. You should see the new nodes being added.&lt;/p&gt;

&lt;h4&gt;
  
  
  Optimizing Costs:
&lt;/h4&gt;

&lt;p&gt;Remember, adding more GPU nodes will increase your AWS bill. To optimize costs, consider using a mix of On-Demand and Spot Instances, or setting up Auto Scaling policies that scale down the number of nodes during off-peak hours.&lt;/p&gt;




&lt;p&gt;In conclusion, by following these steps, you can efficiently set up GPU nodes for PyTorch workloads on EKS with autoscaling. This setup ensures that your machine learning workloads can leverage the power of GPUs for faster processing and better performance, while also benefiting from the elasticity provided by the Cluster Autoscaler. Remember to tear down resources after you're done to avoid incurring unnecessary costs using commands like terraform destroy and deleting the EKS cluster.&lt;/p&gt;

</description>
    </item>
  </channel>
</rss>
