Karthik Sakthivel

Amazon EKS support in Amazon SageMaker HyperPod to scale foundation model development

What's new at AWS 📢

☑ #Amazon EKS support in Amazon SageMaker HyperPod to scale foundation model development

☑ This launch lets customers run and manage their Kubernetes workloads on SageMaker HyperPod, purpose-built infrastructure for foundation model (FM) development that reduces model training time by up to 40%.

☑ Many customers use Kubernetes to orchestrate their ML workflows because of its portability, scalability, and rich ecosystem of tools. However, handling hardware failures is not automated.

☑ With this launch, customers can run deep health checks during cluster creation and get automated handling of hardware failures during ML training and fine-tuning.

☑ In addition, HyperPod automatically replaces faulty nodes (self-healing, performant clusters) and resumes training from the last checkpoint on both AWS Trainium and NVIDIA GPUs, at a scale of more than a thousand accelerators.

☑ EKS-orchestrated HyperPod clusters also integrate with CloudWatch Container Insights to provide out-of-the-box observability through health status checks and visual dashboards.

☑ Customers can use the HyperPod CLI, or their preferred tools, to submit, manage, and monitor workloads.

☑ What is Amazon EKS:

 ➰ A managed Kubernetes service from AWS for running Kubernetes in the AWS cloud and in on-premises data centers.

 ➰ It automatically manages the availability and scalability of the Kubernetes control plane nodes, which handle key tasks such as scheduling containers and storing cluster data.

 ➰ Amazon EKS integrates with AWS services such as Elastic Load Balancing, IAM, VPC, and CloudTrail, which is an added advantage.
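
To make the "resumes training from the last checkpoint" point above more concrete, here is a minimal Python sketch of what checkpoint-and-resume means for a training loop. It is not HyperPod's implementation or API. The function names (`save_checkpoint`, `load_checkpoint`, `train`), the JSON checkpoint file, the fake loss value, and the simulated failure are all assumptions made purely for illustration.

```python
import json
import os
import tempfile


def save_checkpoint(path, state):
    """Write the state to a temp file, then rename it over the checkpoint.

    The rename is atomic on the same filesystem, so a crash mid-write
    never leaves a half-written checkpoint behind.
    """
    directory = os.path.dirname(path) or "."
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "w") as f:
        json.dump(state, f)
    os.replace(tmp_path, path)


def load_checkpoint(path):
    """Return the last saved state, or a fresh state if none exists."""
    if os.path.exists(path):
        with open(path) as f:
            return json.load(f)
    return {"step": 0, "loss": None}


def train(path, total_steps, checkpoint_every=10, fail_at=None):
    """Toy training loop that continues from the last saved step.

    `fail_at` is a hypothetical knob that simulates a node failure.
    """
    state = load_checkpoint(path)
    for step in range(state["step"] + 1, total_steps + 1):
        if fail_at is not None and step == fail_at:
            raise RuntimeError(f"simulated node failure at step {step}")
        state = {"step": step, "loss": 1.0 / step}  # stand-in for real work
        if step % checkpoint_every == 0:
            save_checkpoint(path, state)
    return state


if __name__ == "__main__":
    ckpt = os.path.join(tempfile.mkdtemp(), "ckpt.json")
    try:
        train(ckpt, total_steps=50, fail_at=37)
    except RuntimeError as err:
        print(err)                              # simulated node failure at step 37
    print(load_checkpoint(ckpt)["step"])        # 30, the last saved step
    print(train(ckpt, total_steps=50)["step"])  # 50, resumed from step 31
```

In this sketch the steps between the last checkpoint and the failure (31 to 36) are simply redone after the restart, which is why the checkpoint interval is a trade-off between write overhead and lost work.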

📌 Explore more about EKS: https://aws.amazon.com/eks/

📌 Explore more about SageMaker HyperPod: https://aws.amazon.com/blogs/aws/amazon-sagemaker-hyperpod-introduces-amazon-eks-support/
