Mateen Anjum

How I Built a Production-Grade GPU-Ready Kubernetes Platform from Scratch

TL;DR: I rebuilt an entire Kubernetes platform with GPU support, GitOps automation, and full observability. Deployment time dropped from 2 weeks to 30 minutes, and we went from 80% to 99.9% uptime.


The Problem

The existing infrastructure was a mess. If you've ever inherited a Kubernetes cluster that "just grew organically," you know the pain:

  • Manual kubectl deployments - Someone would SSH into a bastion and run kubectl apply directly in production. No audit trail, no rollback plan.
  • No Infrastructure as Code - The cluster was clicked together in the AWS console. Nobody knew the exact configuration.
  • Zero GPU scheduling - Data scientists were spinning up GPU instances manually and SSHing into them for ML training.
  • Monitoring was an afterthought - We had CloudWatch... and hope.

When a senior engineer left, half the tribal knowledge walked out the door.

What I Tried First (And Why It Failed)

Attempt 1: "Just Document Everything"

I spent two weeks reverse-engineering the existing setup. Created a 50-page runbook.

Result: Nobody read it. The docs were outdated within a month.

Attempt 2: "Gradual Migration"

Tried to incrementally improve the existing cluster - add monitoring here, automate one deployment there.

Result: The cluster was too fragile. Every change broke something else. We needed a clean slate.

Attempt 3: The Right Approach

I proposed a full rebuild with a parallel cluster, allowing us to:

  • Build the new platform without disrupting production
  • Migrate workloads gradually with blue-green deployments
  • Validate everything before cutting over

Management approved. Here's how I built it.

The Solution

Architecture Overview

(Diagram: EKS architecture)

The Tech Stack

Layer         | Technology                   | Why
--------------|------------------------------|----------------------------------------------
IaC           | Terraform + Terragrunt       | Reusable modules, DRY configs
Cluster       | AWS EKS                      | Managed control plane, less ops overhead
GitOps        | Flux CD                      | Pull-based, secure, lighter than ArgoCD
GPU           | NVIDIA Device Plugin + DCGM  | Native GPU scheduling and monitoring
Observability | Prometheus + Grafana         | Industry standard, great GPU metrics support
CI/CD         | GitHub Actions               | Where our code already lives

Phase 1: Foundation (Weeks 1-2)

First, I built reusable Terraform modules:

# terragrunt/dev/eks/terragrunt.hcl
terraform {
  source = "../../../terraform/modules/eks"
}

inputs = {
  cluster_name    = "eks-dev-cluster"
  cluster_version = "1.28"

  node_groups = {
    general = {
      instance_types = ["m5.large"]
      capacity_type  = "SPOT"  # Cost savings for non-GPU
      min_size       = 2
      max_size       = 10
    }

    gpu = {
      instance_types = ["g4dn.xlarge"]
      capacity_type  = "ON_DEMAND"  # Stability for GPU workloads
      min_size       = 0
      max_size       = 5

      labels = {
        "nvidia.com/gpu.present" = "true"
      }

      taints = [{
        key    = "nvidia.com/gpu"
        value  = "true"
        effect = "NO_SCHEDULE"
      }]
    }
  }
}

Key decisions:

  • Spot instances for general workloads - 60% cost reduction
  • On-demand for GPU nodes - Training jobs can't handle interruptions
  • Taints on GPU nodes - Prevents non-GPU pods from wasting expensive resources

Phase 2: GitOps Pipeline (Weeks 3-4)

(Diagram: GitOps flow)

I bootstrapped Flux CD to manage all Kubernetes resources:

flux bootstrap github \
  --owner=mateenali66 \
  --repository=devops-experiment \
  --branch=main \
  --path=kubernetes/clusters/dev
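
The bootstrap commits Flux's own controllers and sync configuration under kubernetes/clusters/dev. From there, each area of the repo is wired in with a Flux Kustomization object. A minimal sketch of one such object (the file name, path, and interval are illustrative, not copied from the repo):

# kubernetes/clusters/dev/infrastructure.yaml (illustrative)
apiVersion: kustomize.toolkit.fluxcd.io/v1
kind: Kustomization
metadata:
  name: infrastructure
  namespace: flux-system
spec:
  interval: 10m
  prune: true          # anything deleted from Git gets deleted from the cluster
  sourceRef:
    kind: GitRepository
    name: flux-system  # the repo object created by flux bootstrap
  path: ./kubernetes/infrastructure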

Now every change goes through Git:

  1. Developer opens PR with Kubernetes manifest changes
  2. CI validates the YAML and runs flux diff
  3. PR gets reviewed and merged
  4. Flux automatically applies changes to cluster
  5. Slack notification confirms deployment

No more kubectl apply in production. Ever.
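
Step 2 of that flow is a small GitHub Actions job. The real pipeline also runs flux diff against the cluster; the sketch below only covers schema validation with kubeconform, and the workflow file name and paths are illustrative:

# .github/workflows/validate-manifests.yaml (illustrative)
name: validate-manifests
on:
  pull_request:
    paths:
      - "kubernetes/**"

jobs:
  validate:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-go@v5
        with:
          go-version: "1.22"
      - name: Validate Kubernetes manifests against their schemas
        run: |
          # -ignore-missing-schemas skips CRDs (HelmRelease, Kustomization, ...)
          go run github.com/yannh/kubeconform/cmd/kubeconform@latest \
            -strict -summary -ignore-missing-schemas kubernetes/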

Phase 3: GPU Scheduling (Week 5)

This was the tricky part. Getting Kubernetes to properly schedule GPU workloads requires:

  1. NVIDIA Device Plugin - Exposes GPUs as schedulable resources
  2. DCGM Exporter - GPU metrics for Prometheus
  3. Proper tolerations - Pods must tolerate GPU node taints

Here's the sample workload manifest that ties the three together:
# kubernetes/apps/sample-gpu-app/deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: ml-inference
spec:
  template:
    spec:
      tolerations:
        - key: nvidia.com/gpu
          operator: Exists
          effect: NoSchedule
      containers:
        - name: inference
          image: my-ml-model:v1.2
          resources:
            limits:
              nvidia.com/gpu: 1
              memory: "16Gi"
            requests:
              nvidia.com/gpu: 1
              memory: "8Gi"
          env:
            - name: CUDA_VISIBLE_DEVICES
              value: "0"
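
The device plugin and DCGM exporter (items 1 and 2 above) are delivered through Flux as well. Roughly, assuming the upstream NVIDIA charts and their standard values — chart versions, namespaces, and HelmRepository names here are placeholders, and you should pin versions that match your AMI's driver (see lesson 2 below):

# kubernetes/infrastructure/gpu/nvidia-device-plugin.yaml (illustrative)
apiVersion: helm.toolkit.fluxcd.io/v2beta1
kind: HelmRelease
metadata:
  name: nvidia-device-plugin
  namespace: kube-system
spec:
  interval: 30m
  chart:
    spec:
      chart: nvidia-device-plugin
      version: "0.14.x"            # pin explicitly; must match the AMI's driver
      sourceRef:
        kind: HelmRepository
        name: nvidia-device-plugin
  values:
    tolerations:                   # let the DaemonSet land on the tainted GPU nodes
      - key: nvidia.com/gpu
        operator: Exists
        effect: NoSchedule
---
# kubernetes/infrastructure/gpu/dcgm-exporter.yaml (illustrative)
apiVersion: helm.toolkit.fluxcd.io/v2beta1
kind: HelmRelease
metadata:
  name: dcgm-exporter
  namespace: monitoring
spec:
  interval: 30m
  chart:
    spec:
      chart: dcgm-exporter
      version: "3.x"               # placeholder; pin a real version
      sourceRef:
        kind: HelmRepository
        name: gpu-helm-charts
  values:
    serviceMonitor:
      enabled: true                # scraped by the Prometheus stack from phase 4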

Phase 4: Observability (Week 6)

(Diagram: CI/CD pipeline)

Deployed the full monitoring stack via Flux:

# kubernetes/infrastructure/monitoring/kube-prometheus-stack.yaml
apiVersion: helm.toolkit.fluxcd.io/v2beta1
kind: HelmRelease
metadata:
  name: kube-prometheus-stack
  namespace: monitoring
spec:
  interval: 30m   # required by the HelmRelease API; value is up to you
  chart:
    spec:
      chart: kube-prometheus-stack
      version: "51.x"
      sourceRef:
        kind: HelmRepository
        name: prometheus-community
  values:
    grafana:
      dashboardProviders:
        - name: 'gpu-dashboards'
          folder: 'GPU'
          options:
            path: /var/lib/grafana/dashboards/gpu

Added GPU-specific alerts:

- alert: GPUMemoryExhaustion
  expr: DCGM_FI_DEV_FB_USED / DCGM_FI_DEV_FB_TOTAL > 0.95
  for: 5m
  labels:
    severity: critical
  annotations:
    summary: "GPU {{ $labels.gpu }} memory above 95%"
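
One way to ship that rule declaratively is as a PrometheusRule object, so Flux manages alerts like everything else. A sketch of the wrapper (the file name is illustrative, and the release label must match whatever rule selector your kube-prometheus-stack install uses):

# kubernetes/infrastructure/monitoring/gpu-alerts.yaml (illustrative)
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: gpu-alerts
  namespace: monitoring
  labels:
    release: kube-prometheus-stack   # picked up by the stack's default rule selector
spec:
  groups:
    - name: gpu
      rules:
        - alert: GPUMemoryExhaustion
          expr: DCGM_FI_DEV_FB_USED / DCGM_FI_DEV_FB_TOTAL > 0.95
          for: 5m
          labels:
            severity: critical
          annotations:
            summary: "GPU {{ $labels.gpu }} memory above 95%"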

The Results

Metric                 | Before  | After   | Change
-----------------------|---------|---------|------------------
Deployment frequency   | 1/week  | 10+/day | 70x faster
Lead time for changes  | 2 weeks | 30 min  | 96% reduction
MTTR                   | 4 hours | 15 min  | 94% reduction
Failed deployments     | 20%     | <2%     | 90% reduction
GPU utilization        | 30%     | 75%     | 2.5x improvement
Infrastructure uptime  | 80%     | 99.9%   | Production-grade

Lessons Learned

1. GitOps requires discipline

The biggest challenge wasn't technical - it was cultural. Developers were used to quick kubectl fixes.

Solution: I made the bastion read-only for production. The only way to deploy is through Git.
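
One way to enforce that with plain Kubernetes RBAC is to bind humans to the built-in view ClusterRole and leave writes to Flux's controllers. A sketch of the binding (the group name is hypothetical and depends on how your IAM-to-RBAC mapping is set up):

# kubernetes/infrastructure/rbac/engineers-read-only.yaml (illustrative)
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: engineers-read-only
subjects:
  - kind: Group
    name: platform:engineers       # hypothetical group, mapped from IAM
    apiGroup: rbac.authorization.k8s.io
roleRef:
  kind: ClusterRole
  name: view                       # built-in read-only role
  apiGroup: rbac.authorization.k8s.io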

2. GPU scheduling has hidden gotchas

We hit an issue where GPU pods would get scheduled, but CUDA wouldn't initialize.

Root cause: The AMI's NVIDIA driver version didn't match the device plugin version.

Fix: Pin both versions explicitly and test in staging first.

3. Start with observability

I'm glad I deployed monitoring early. When we hit issues during migration, we could actually debug them.

4. Cost visibility matters

Adding Kubecost early showed us that one team's "small experiment" was burning $3K/month in idle GPU instances.
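
If you want the same visibility, Kubecost's cost-analyzer chart drops into the same GitOps flow. A minimal sketch, assuming the upstream chart with default values (namespace and HelmRepository name are placeholders):

# kubernetes/infrastructure/cost/kubecost.yaml (illustrative)
apiVersion: helm.toolkit.fluxcd.io/v2beta1
kind: HelmRelease
metadata:
  name: kubecost
  namespace: kubecost
spec:
  interval: 30m
  chart:
    spec:
      chart: cost-analyzer
      sourceRef:
        kind: HelmRepository
        name: kubecost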

Try It Yourself

The full implementation is open source:

GitHub: github.com/mateenali66/devops-experiment

Includes:

  • Terraform modules for VPC and EKS
  • Terragrunt configs for multi-environment
  • Flux GitOps bootstrap
  • GPU node configuration
  • Monitoring stack with GPU dashboards

What's Next?

I'm working on adding:

  • Karpenter for smarter autoscaling
  • Spot instance interruption handling for GPU nodes
  • ML model versioning with Flux image automation

Follow me for updates on GPU infrastructure and Kubernetes at scale.


Questions? Drop a comment or reach out through my website.
