This is what modern K8s feels like on AWS.
Here’s a collection of my learnings from AWS re:Invent 2023, an incredible event for innovators and technologists from all over the world, held in the reality-bending place that is Las Vegas.
By 2028, more than 95% of global organisations will be running containerised applications in production, which is a significant increase from fewer than 50% in 2023.
My focus was Amazon EKS (Amazon Elastic Kubernetes Service). I attended a variety of sessions and workshops on the latest advancements, emerging tools and solutions for containerised workloads based on Kubernetes:
- Generative AI
- Karpenter
- Security
- Multi-tenancy
I hope you find it helpful!
Generative AI on EKS
Workshop from re:Invent 2023 available for free (thx AWS).
GenAI companies like OpenAI use Kubernetes to train foundation models such as GPT-3, GPT-4, and DALL·E by distributing model training across thousands of powerful GPU-enabled VMs. With frameworks like Horovod and Kubernetes operators like the MPI Operator, you can perform distributed deep learning on EKS at scale. To share a GPU across multiple training jobs, you will need to configure GPU time-slicing in the nvidia-device-plugin.
The newly released Mountpoint for Amazon S3 CSI driver adds file system interface access to large training datasets stored on S3 (previously this was only possible with the FSx for Lustre CSI driver).
Increasingly, AWS customers are benefiting from their investments in EKS by baking complete end-to-end MLOps capabilities into their clusters. JARK on EKS allows you to use Jupyter notebooks, Argo Workflows, and KubeRay to build, train, test, serve and re-train your AI/ML models on EKS.
During AWS re:Invent 2023, using the EKS Terraform blueprints, we deployed an entire EKS cluster with the JARK stack installed in just two hours, then deployed and re-trained Stable Diffusion (images from prompts) using custom data.
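To make the time-slicing part a bit more concrete, here is a minimal sketch (not the workshop's code) that builds a sharing config in the shape the nvidia-device-plugin reads, wrapped in a ConfigMap. The ConfigMap name, the namespace and the `any` data key are my own placeholders, and how the config gets wired into the plugin depends on how you installed it. I emit JSON because JSON is also valid YAML. Keep in mind that time-slicing does not give jobs memory isolation from each other.

```python
import json


def time_slicing_config(replicas: int) -> dict:
    """Build a sharing config that advertises each physical GPU `replicas` times."""
    if replicas < 1:
        # Sanity check of my own, not something the plugin mandates in this form.
        raise ValueError("replicas must be a positive integer")
    return {
        "version": "v1",
        "sharing": {
            "timeSlicing": {
                "resources": [{"name": "nvidia.com/gpu", "replicas": replicas}]
            }
        },
    }


def time_slicing_config_map(name: str, namespace: str, replicas: int) -> dict:
    """Wrap the sharing config in a ConfigMap manifest (as a plain dict)."""
    return {
        "apiVersion": "v1",
        "kind": "ConfigMap",
        "metadata": {"name": name, "namespace": namespace},
        # "any" is a hypothetical key name; use whatever your install expects.
        "data": {"any": json.dumps(time_slicing_config(replicas), indent=2)},
    }


if __name__ == "__main__":
    manifest = time_slicing_config_map("time-slicing-config", "kube-system", 4)
    print(json.dumps(manifest, indent=2))
```

With `replicas` set to 4, a node with one physical GPU would advertise four `nvidia.com/gpu` resources, so four pods requesting one GPU each could land on it.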
Karpenter
Workshop from AWS re:Invent 2023 available for free (thx AWS).
Karpenter, which has recently graduated to beta, is becoming a de facto standard for node auto-scaling on EKS.
It’s faster than Cluster Autoscaler because it calls the synchronous EC2 Fleet API directly instead of going through Auto Scaling Groups. It’s smarter because it consolidates your nodes down to the most cost-effective mixture of node types for your workloads (including Spot), rather than simply scaling up the ASG with more nodes of the same type. This drives down your compute cost while maximising performance and reliability.
It’s important to follow best practices for resource requests and limits on all EKS workloads to avoid OOM kills caused by pod bursts during consolidation (TL;DR: memory request = limit).
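To illustrate why picking the node type matters more than adding nodes of the same type, here is a toy comparison. It is not Karpenter's actual algorithm: it only looks at summed requests, skips types that cannot hold the largest pod, and the catalogue with its prices is entirely made up.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class InstanceType:
    name: str
    vcpu: int
    mem_gib: int
    hourly_cents: int  # made-up price, for illustration only


# Hypothetical catalogue; real instance types and prices differ.
CATALOGUE = [
    InstanceType("small", 2, 4, 5),
    InstanceType("medium", 4, 8, 10),
    InstanceType("large", 8, 16, 20),
    InstanceType("memory-heavy", 4, 32, 16),
]


def nodes_needed(pods, itype):
    """Nodes of a single type needed to cover the summed (cpu, mem_gib) requests."""
    cpu = sum(p[0] for p in pods)
    mem = sum(p[1] for p in pods)
    return max(1, math.ceil(cpu / itype.vcpu), math.ceil(mem / itype.mem_gib))


def hourly_cost(pods, itype):
    """Hourly cost in cents of covering the pods with one node type."""
    return nodes_needed(pods, itype) * itype.hourly_cents


def cheapest_type(pods, catalogue):
    """Cheapest type among those where every individual pod fits on one node."""
    candidates = [
        t for t in catalogue
        if all(cpu <= t.vcpu and mem <= t.mem_gib for cpu, mem in pods)
    ]
    return min(candidates, key=lambda t: hourly_cost(pods, t))


if __name__ == "__main__":
    pods = [(1, 8), (1, 8), (1, 8)]  # three memory-hungry pods: 1 vCPU, 8 GiB each
    print("medium nodes needed:", nodes_needed(pods, CATALOGUE[1]))
    print("cheapest type:", cheapest_type(pods, CATALOGUE).name)
```

In this made-up case, scaling a group of `medium` nodes needs three of them, while a single `memory-heavy` node covers the same pods for roughly half the price.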
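The "memory request = limit" rule is easy to encode and check. Below is a small sketch: `container_resources` and `memory_is_pinned` are helper names I made up, and the check compares the quantity strings literally, so `512Mi` and `0.5Gi` would not count as equal.

```python
def container_resources(cpu_request: str, memory: str) -> dict:
    """Build a container `resources` block with memory request equal to its limit.

    No CPU limit is set here; that is my choice for the sketch, not a rule.
    """
    return {
        "requests": {"cpu": cpu_request, "memory": memory},
        "limits": {"memory": memory},
    }


def memory_is_pinned(resources: dict) -> bool:
    """True when a memory request exists and equals the memory limit (string match)."""
    requests = resources.get("requests", {})
    limits = resources.get("limits", {})
    return "memory" in requests and requests["memory"] == limits.get("memory")


if __name__ == "__main__":
    good = container_resources("250m", "512Mi")
    bursty = {"requests": {"memory": "256Mi"}, "limits": {"memory": "1Gi"}}
    print(memory_is_pinned(good), memory_is_pinned(bursty))
```

A check like this could run in CI against rendered manifests, so bursty pods are caught before consolidation packs them tightly onto a node.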
During the workshop, we simulated load on the cluster and observed Karpenter consolidation using eks-node-viewer, which shows you a real-time breakdown of the EKS nodes, including resource utilisation, number of pods, node type, cost per node, and even an estimated total cost of the cluster per month.
Security
Workshop from AWS re:Invent 2023 available for free (thx AWS).
AWS Security Hub can ingest findings about your Kubernetes resources and Helm chart misconfigurations generated by kubescape and by AWS Config with the Conformance Pack for EKS. Code to translate kubescape controls in S3 to AWS Security Hub findings is available here.
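To give a feel for what such a translation involves, here is a rough sketch that maps one failed control to the AWS Security Finding Format (ASFF) that Security Hub imports. This is not the linked code. The shape of the `control` record is my own assumption (the real kubescape output is richer), and `to_asff_finding` is a made-up helper name.

```python
ALLOWED_SEVERITIES = {"INFORMATIONAL", "LOW", "MEDIUM", "HIGH", "CRITICAL"}


def to_asff_finding(control: dict, account_id: str, region: str,
                    cluster_arn: str, now_iso: str) -> dict:
    """Map a simplified, hypothetical control record to an ASFF finding dict."""
    severity = control["severity"].upper()
    if severity not in ALLOWED_SEVERITIES:
        raise ValueError(f"unsupported severity label: {severity}")
    return {
        "SchemaVersion": "2018-10-08",
        "Id": f"kubescape/{control['id']}/{cluster_arn}",
        # Default product ARN for findings you generate yourself.
        "ProductArn": f"arn:aws:securityhub:{region}:{account_id}:product/{account_id}/default",
        "GeneratorId": f"kubescape/{control['id']}",
        "AwsAccountId": account_id,
        "Types": ["Software and Configuration Checks"],
        "CreatedAt": now_iso,
        "UpdatedAt": now_iso,
        "Severity": {"Label": severity},
        "Title": control["name"],
        "Description": control["description"],
        "Resources": [{"Type": "Other", "Id": cluster_arn}],
    }


if __name__ == "__main__":
    # Illustrative input only; every value below is a placeholder.
    control = {
        "id": "C-0057",
        "name": "Privileged container",
        "description": "A container in the cluster runs in privileged mode.",
        "severity": "high",
    }
    finding = to_asff_finding(
        control, "111122223333", "eu-west-1",
        "arn:aws:eks:eu-west-1:111122223333:cluster/demo", "2023-12-01T12:00:00Z",
    )
    print(finding["Id"], finding["Severity"]["Label"])
```

A list of such dicts is what you would hand to Security Hub's BatchImportFindings API.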
GuardDuty EKS Runtime Monitoring looks for system-level threats like illegal file access, privilege escalation, miner binaries, and suspicious network connections. GuardDuty EKS Audit Log Monitoring generates findings based on the Kubernetes audit logs from the EKS control plane (GuardDuty consumes these through its own stream, so you do not need to enable control plane logging for it). Amazon Inspector scans your container images in ECR for vulnerabilities, and can now also scan images directly in CI, via the new inspector-scan API and the inspector-sbomgen command-line tool.
The kicker here is Amazon Detective, which looks for patterns across all these findings, plus CloudTrail and VPC flow logs, giving you real-time analysis of threats in your EKS environments, down to individual IPs, IAM roles and even Kubernetes service accounts and role bindings.
Check out Pod Security Standards, and remember that you can enforce them as Kyverno policies.
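To show the kind of rules the Pod Security Standards describe, here is a small sketch that checks a pod spec against a handful of controls from the baseline and restricted profiles. It covers only a subset (no seccomp, volume or host namespace checks, and no init containers), and it is plain Python for illustration rather than a Kyverno policy. The function name is my own.

```python
def restricted_violations(pod_spec: dict) -> list:
    """Return human-readable violations for a subset of the restricted profile."""
    violations = []
    pod_sc = pod_spec.get("securityContext", {})
    for container in pod_spec.get("containers", []):
        name = container.get("name", "<unnamed>")
        sc = container.get("securityContext", {})
        if sc.get("privileged", False):
            violations.append(f"{name}: privileged must not be true")
        if sc.get("allowPrivilegeEscalation") is not False:
            violations.append(f"{name}: allowPrivilegeEscalation must be false")
        # runAsNonRoot may be set on the container or inherited from the pod.
        if sc.get("runAsNonRoot", pod_sc.get("runAsNonRoot")) is not True:
            violations.append(f"{name}: runAsNonRoot must be true")
        if "ALL" not in sc.get("capabilities", {}).get("drop", []):
            violations.append(f"{name}: capabilities.drop must include ALL")
    return violations


if __name__ == "__main__":
    hardened = {
        "securityContext": {"runAsNonRoot": True},
        "containers": [{
            "name": "app",
            "securityContext": {
                "allowPrivilegeEscalation": False,
                "capabilities": {"drop": ["ALL"]},
            },
        }],
    }
    bare = {"containers": [{"name": "app"}]}
    print(restricted_violations(hardened))
    print(restricted_violations(bare))
```

A container with no securityContext at all fails three of these checks, which is a good reminder that the Kubernetes defaults are not the restricted profile.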
Multi-tenancy
Workshop from AWS re:Invent 2023 available for free (thx AWS).
A SaaS control plane sits on top of your SaaS application and manages your tenants, subscriptions, tiering, customer onboarding, third-party integrations, and product observability. It’s a separate layer from the application itself, requiring expertise in access control protocols, infrastructure, telemetry, as well as API and product development. EKS delivers efficiencies and new monetisation strategies by becoming a central piece in architecting the SaaS control plane.
Using taints and tolerations together with node affinity, we can allocate different sets of compute to different tiers. For example, we can run premium tenants on On-Demand EC2 instances and free-tier users on Spot to save money. Istio and the Istio Gateway allow you to route traffic to different application pools based on JWT claims. Using Karpenter in combination with HPA, we can eliminate noisy-neighbour problems caused by spiky free-tier usage while ensuring the desired compute is always available just in time, without over-provisioning. We can expose tenant metrics from our SaaS applications to Prometheus and plot them in Grafana for unified observability.
I hope my collection of notes was helpful to you. I was fortunate to be a part of AWS re:Invent 2023 thanks to ZeroNorth, where we build modern platforms and capabilities on top of EKS to make global trade green. 💚
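The tier-to-compute mapping can be sketched as a small helper that returns the scheduling part of a pod spec. The `tenant-tier` taint key and the tier names are my own placeholders. `karpenter.sh/capacity-type` is the label Karpenter sets on the nodes it launches, with the values `spot` and `on-demand`.

```python
# Which capacity type each (hypothetical) tenant tier should run on.
TIER_CAPACITY = {"premium": "on-demand", "free": "spot"}


def tier_scheduling(tier: str) -> dict:
    """Return the tolerations and node affinity for a tenant tier's pods."""
    capacity = TIER_CAPACITY[tier]  # raises KeyError for an unknown tier
    return {
        # Lets the pod onto nodes tainted for this tier (taint key is an assumption).
        "tolerations": [{
            "key": "tenant-tier",
            "operator": "Equal",
            "value": tier,
            "effect": "NoSchedule",
        }],
        # Restricts the pod to nodes of the matching capacity type.
        "affinity": {
            "nodeAffinity": {
                "requiredDuringSchedulingIgnoredDuringExecution": {
                    "nodeSelectorTerms": [{
                        "matchExpressions": [{
                            "key": "karpenter.sh/capacity-type",
                            "operator": "In",
                            "values": [capacity],
                        }]
                    }]
                }
            }
        },
    }


if __name__ == "__main__":
    import json
    print(json.dumps(tier_scheduling("free"), indent=2))
```

The taint keeps other workloads off a tier's nodes, while the affinity keeps the tier's pods off everyone else's nodes. You need both for the separation to hold in both directions.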
Bloopers: Google Cloud advertising on the Sphere, right next to AWS re:Invent 2023 :D