Mateen Anjum

How I Built a Production-Grade GPU-Ready Kubernetes Platform from Scratch

TL;DR: I rebuilt an entire Kubernetes platform with GPU support, GitOps automation, and full observability. Deployment time dropped from 2 weeks to 30 minutes, and we went from 80% to 99.9% uptime.


The Problem

The existing infrastructure was a mess. If you've ever inherited a Kubernetes cluster that "just grew organically," you know the pain:

  • Manual kubectl deployments - Someone would SSH into a bastion and run kubectl apply directly in production. No audit trail, no rollback plan.
  • No Infrastructure as Code - The cluster was clicked together in the AWS console. Nobody knew the exact configuration.
  • Zero GPU scheduling - Data scientists were spinning up GPU instances manually and SSHing into them for ML training.
  • Monitoring was an afterthought - We had CloudWatch... and hope.

When a senior engineer left, half the tribal knowledge walked out the door.

What I Tried First (And Why It Failed)

Attempt 1: "Just Document Everything"

I spent two weeks reverse-engineering the existing setup. Created a 50-page runbook.

Result: Nobody read it. The docs were outdated within a month.

Attempt 2: "Gradual Migration"

Tried to incrementally improve the existing cluster - add monitoring here, automate one deployment there.

Result: The cluster was too fragile. Every change broke something else. We needed a clean slate.

Attempt 3: The Right Approach

I proposed a full rebuild with a parallel cluster, allowing us to:

  • Build the new platform without disrupting production
  • Migrate workloads gradually with blue-green deployments
  • Validate everything before cutting over

Management approved. Here's how I built it.

The Solution

Architecture Overview

(Diagram: EKS architecture)

The Tech Stack

Layer         | Technology                   | Why
--------------|------------------------------|----------------------------------------------
IaC           | Terraform + Terragrunt       | Reusable modules, DRY configs
Cluster       | AWS EKS                      | Managed control plane, less ops overhead
GitOps        | Flux CD                      | Pull-based, secure, lighter than ArgoCD
GPU           | NVIDIA Device Plugin + DCGM  | Native GPU scheduling and monitoring
Observability | Prometheus + Grafana         | Industry standard, great GPU metrics support
CI/CD         | GitHub Actions               | Where our code already lives

Phase 1: Foundation (Weeks 1-2)

First, I built reusable Terraform modules:

# terragrunt/dev/eks/terragrunt.hcl
terraform {
  source = "../../../terraform/modules/eks"
}

inputs = {
  cluster_name    = "eks-dev-cluster"
  cluster_version = "1.28"

  node_groups = {
    general = {
      instance_types = ["m5.large"]
      capacity_type  = "SPOT"  # Cost savings for non-GPU
      min_size       = 2
      max_size       = 10
    }

    gpu = {
      instance_types = ["g4dn.xlarge"]
      capacity_type  = "ON_DEMAND"  # Stability for GPU workloads
      min_size       = 0
      max_size       = 5

      labels = {
        "nvidia.com/gpu.present" = "true"
      }

      taints = [{
        key    = "nvidia.com/gpu"
        value  = "true"
        effect = "NO_SCHEDULE"
      }]
    }
  }
}

Key decisions:

  • Spot instances for general workloads - 60% cost reduction
  • On-demand for GPU nodes - Training jobs can't handle interruptions
  • Taints on GPU nodes - Prevents non-GPU pods from wasting expensive resources

Phase 2: GitOps Pipeline (Weeks 3-4)

(Diagram: GitOps flow)

I bootstrapped Flux CD to manage all Kubernetes resources:

flux bootstrap github \
  --owner=mateenali66 \
  --repository=devops-experiment \
  --branch=main \
  --path=kubernetes/clusters/dev
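
The bootstrap commits Flux's own controllers and sync configuration under kubernetes/clusters/dev. From there, each area of the repo is wired in with a Flux Kustomization object. A minimal sketch of one such object (the file name, path, and interval are illustrative, not copied from the repo):

# kubernetes/clusters/dev/infrastructure.yaml (illustrative)
apiVersion: kustomize.toolkit.fluxcd.io/v1
kind: Kustomization
metadata:
  name: infrastructure
  namespace: flux-system
spec:
  interval: 10m
  prune: true          # anything deleted from Git gets deleted from the cluster
  sourceRef:
    kind: GitRepository
    name: flux-system  # the repo object created by flux bootstrap
  path: ./kubernetes/infrastructure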

Now every change goes through Git:

  1. Developer opens PR with Kubernetes manifest changes
  2. CI validates the YAML and runs flux diff
  3. PR gets reviewed and merged
  4. Flux automatically applies changes to cluster
  5. Slack notification confirms deployment

No more kubectl apply in production. Ever.
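
Step 2 of that flow is a small GitHub Actions job. The real pipeline also runs flux diff against the cluster; the sketch below only covers schema validation with kubeconform, and the workflow file name and paths are illustrative:

# .github/workflows/validate-manifests.yaml (illustrative)
name: validate-manifests
on:
  pull_request:
    paths:
      - "kubernetes/**"

jobs:
  validate:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-go@v5
        with:
          go-version: "1.22"
      - name: Validate Kubernetes manifests against their schemas
        run: |
          # -ignore-missing-schemas skips CRDs (HelmRelease, Kustomization, ...)
          go run github.com/yannh/kubeconform/cmd/kubeconform@latest \
            -strict -summary -ignore-missing-schemas kubernetes/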

Phase 3: GPU Scheduling (Week 5)

This was the tricky part. Getting Kubernetes to properly schedule GPU workloads requires:

  1. NVIDIA Device Plugin - Exposes GPUs as schedulable resources
  2. DCGM Exporter - GPU metrics for Prometheus
  3. Proper tolerations - Pods must tolerate GPU node taints

Here's the sample workload manifest that ties the three together:
# kubernetes/apps/sample-gpu-app/deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: ml-inference
spec:
  template:
    spec:
      tolerations:
        - key: nvidia.com/gpu
          operator: Exists
          effect: NoSchedule
      containers:
        - name: inference
          image: my-ml-model:v1.2
          resources:
            limits:
              nvidia.com/gpu: 1
              memory: "16Gi"
            requests:
              nvidia.com/gpu: 1
              memory: "8Gi"
          env:
            - name: CUDA_VISIBLE_DEVICES
              value: "0"
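
The device plugin and DCGM exporter (items 1 and 2 above) are delivered through Flux as well. Roughly, assuming the upstream NVIDIA charts and their standard values — chart versions, namespaces, and HelmRepository names here are placeholders, and you should pin versions that match your AMI's driver (see lesson 2 below):

# kubernetes/infrastructure/gpu/nvidia-device-plugin.yaml (illustrative)
apiVersion: helm.toolkit.fluxcd.io/v2beta1
kind: HelmRelease
metadata:
  name: nvidia-device-plugin
  namespace: kube-system
spec:
  interval: 30m
  chart:
    spec:
      chart: nvidia-device-plugin
      version: "0.14.x"            # pin explicitly; must match the AMI's driver
      sourceRef:
        kind: HelmRepository
        name: nvidia-device-plugin
  values:
    tolerations:                   # let the DaemonSet land on the tainted GPU nodes
      - key: nvidia.com/gpu
        operator: Exists
        effect: NoSchedule
---
# kubernetes/infrastructure/gpu/dcgm-exporter.yaml (illustrative)
apiVersion: helm.toolkit.fluxcd.io/v2beta1
kind: HelmRelease
metadata:
  name: dcgm-exporter
  namespace: monitoring
spec:
  interval: 30m
  chart:
    spec:
      chart: dcgm-exporter
      version: "3.x"               # placeholder; pin a real version
      sourceRef:
        kind: HelmRepository
        name: gpu-helm-charts
  values:
    serviceMonitor:
      enabled: true                # scraped by the Prometheus stack from phase 4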

Phase 4: Observability (Week 6)

(Diagram: CI/CD pipeline)

Deployed the full monitoring stack via Flux:

# kubernetes/infrastructure/monitoring/kube-prometheus-stack.yaml
apiVersion: helm.toolkit.fluxcd.io/v2beta1
kind: HelmRelease
metadata:
  name: kube-prometheus-stack
  namespace: monitoring
spec:
  interval: 30m   # required by the HelmRelease API; value is up to you
  chart:
    spec:
      chart: kube-prometheus-stack
      version: "51.x"
      sourceRef:
        kind: HelmRepository
        name: prometheus-community
  values:
    grafana:
      dashboardProviders:
        - name: 'gpu-dashboards'
          folder: 'GPU'
          options:
            path: /var/lib/grafana/dashboards/gpu

Added GPU-specific alerts:

- alert: GPUMemoryExhaustion
  expr: DCGM_FI_DEV_FB_USED / DCGM_FI_DEV_FB_TOTAL > 0.95
  for: 5m
  labels:
    severity: critical
  annotations:
    summary: "GPU {{ $labels.gpu }} memory above 95%"
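
One way to ship that rule declaratively is as a PrometheusRule object, so Flux manages alerts like everything else. A sketch of the wrapper (the file name is illustrative, and the release label must match whatever rule selector your kube-prometheus-stack install uses):

# kubernetes/infrastructure/monitoring/gpu-alerts.yaml (illustrative)
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: gpu-alerts
  namespace: monitoring
  labels:
    release: kube-prometheus-stack   # picked up by the stack's default rule selector
spec:
  groups:
    - name: gpu
      rules:
        - alert: GPUMemoryExhaustion
          expr: DCGM_FI_DEV_FB_USED / DCGM_FI_DEV_FB_TOTAL > 0.95
          for: 5m
          labels:
            severity: critical
          annotations:
            summary: "GPU {{ $labels.gpu }} memory above 95%"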

The Results

Metric                 | Before  | After   | Change
-----------------------|---------|---------|------------------
Deployment frequency   | 1/week  | 10+/day | 70x faster
Lead time for changes  | 2 weeks | 30 min  | 96% reduction
MTTR                   | 4 hours | 15 min  | 94% reduction
Failed deployments     | 20%     | <2%     | 90% reduction
GPU utilization        | 30%     | 75%     | 2.5x improvement
Infrastructure uptime  | 80%     | 99.9%   | Production-grade

Lessons Learned

1. GitOps requires discipline

The biggest challenge wasn't technical - it was cultural. Developers were used to quick kubectl fixes.

Solution: I made the bastion read-only for production. The only way to deploy is through Git.
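
One way to enforce that with plain Kubernetes RBAC is to bind humans to the built-in view ClusterRole and leave writes to Flux's controllers. A sketch of the binding (the group name is hypothetical and depends on how your IAM-to-RBAC mapping is set up):

# kubernetes/infrastructure/rbac/engineers-read-only.yaml (illustrative)
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: engineers-read-only
subjects:
  - kind: Group
    name: platform:engineers       # hypothetical group, mapped from IAM
    apiGroup: rbac.authorization.k8s.io
roleRef:
  kind: ClusterRole
  name: view                       # built-in read-only role
  apiGroup: rbac.authorization.k8s.io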

2. GPU scheduling has hidden gotchas

We hit an issue where GPU pods would get scheduled, but CUDA wouldn't initialize.

Root cause: The AMI's NVIDIA driver version didn't match the device plugin version.

Fix: Pin both versions explicitly and test in staging first.

3. Start with observability

I'm glad I deployed monitoring early. When we hit issues during migration, we could actually debug them.

4. Cost visibility matters

Adding Kubecost early showed us that one team's "small experiment" was burning $3K/month in idle GPU instances.
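
If you want the same visibility, Kubecost's cost-analyzer chart drops into the same GitOps flow. A minimal sketch, assuming the upstream chart with default values (namespace and HelmRepository name are placeholders):

# kubernetes/infrastructure/cost/kubecost.yaml (illustrative)
apiVersion: helm.toolkit.fluxcd.io/v2beta1
kind: HelmRelease
metadata:
  name: kubecost
  namespace: kubecost
spec:
  interval: 30m
  chart:
    spec:
      chart: cost-analyzer
      sourceRef:
        kind: HelmRepository
        name: kubecost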

Try It Yourself

The full implementation is open source:

GitHub: github.com/mateenali66/devops-experiment

Includes:

  • Terraform modules for VPC and EKS
  • Terragrunt configs for multi-environment
  • Flux GitOps bootstrap
  • GPU node configuration
  • Monitoring stack with GPU dashboards

What's Next?

I'm working on adding:

  • Karpenter for smarter autoscaling
  • Spot instance interruption handling for GPU nodes
  • ML model versioning with Flux image automation

Follow me for updates on GPU infrastructure and Kubernetes at scale.


Questions? Drop a comment or reach out through my website.
