TL;DR: I rebuilt an entire Kubernetes platform with GPU support, GitOps automation, and full observability. Deployment time dropped from 2 weeks to 30 minutes, and we went from 80% to 99.9% uptime.
The Problem
The existing infrastructure was a mess. If you've ever inherited a Kubernetes cluster that "just grew organically," you know the pain:
- Manual kubectl deployments - Someone would SSH into a bastion and run kubectl apply directly in production. No audit trail, no rollback plan.
- No Infrastructure as Code - The cluster was clicked together in the AWS console. Nobody knew the exact configuration.
- Zero GPU scheduling - Data scientists were spinning up GPU instances manually and SSHing into them for ML training.
- Monitoring was an afterthought - We had CloudWatch... and hope.
When a senior engineer left, half the tribal knowledge walked out the door.
What I Tried First (And Why It Failed)
Attempt 1: "Just Document Everything"
I spent two weeks reverse-engineering the existing setup. Created a 50-page runbook.
Result: Nobody read it. The docs were outdated within a month.
Attempt 2: "Gradual Migration"
Tried to incrementally improve the existing cluster - add monitoring here, automate one deployment there.
Result: The cluster was too fragile. Every change broke something else. We needed a clean slate.
Attempt 3: The Right Approach
I proposed a full rebuild with a parallel cluster, allowing us to:
- Build the new platform without disrupting production
- Migrate workloads gradually with blue-green deployments
- Validate everything before cutting over
Management approved. Here's how I built it.
The Solution
Architecture Overview
The Tech Stack
| Layer | Technology | Why |
|---|---|---|
| IaC | Terraform + Terragrunt | Reusable modules, DRY configs |
| Cluster | AWS EKS | Managed control plane, less ops overhead |
| GitOps | Flux CD | Pull-based, secure, lighter than ArgoCD |
| GPU | NVIDIA Device Plugin + DCGM | Native GPU scheduling and monitoring |
| Observability | Prometheus + Grafana | Industry standard, great GPU metrics support |
| CI/CD | GitHub Actions | Where our code already lives |
Phase 1: Foundation (Week 1-2)
First, I built reusable Terraform modules and wired them into per-environment Terragrunt configs:
# terragrunt/dev/eks/terragrunt.hcl
terraform {
  source = "../../../terraform/modules/eks"
}

inputs = {
  cluster_name    = "eks-dev-cluster"
  cluster_version = "1.28"

  node_groups = {
    general = {
      instance_types = ["m5.large"]
      capacity_type  = "SPOT" # Cost savings for non-GPU
      min_size       = 2
      max_size       = 10
    }
    gpu = {
      instance_types = ["g4dn.xlarge"]
      capacity_type  = "ON_DEMAND" # Stability for GPU workloads
      min_size       = 0
      max_size       = 5
      labels = {
        "nvidia.com/gpu.present" = "true"
      }
      taints = [{
        key    = "nvidia.com/gpu"
        value  = "true"
        effect = "NO_SCHEDULE"
      }]
    }
  }
}
Key decisions:
- Spot instances for general workloads - 60% cost reduction
- On-demand for GPU nodes - Training jobs can't handle interruptions
- Taints on GPU nodes - Prevents non-GPU pods from wasting expensive resources
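The Terragrunt file above only passes inputs; the heavy lifting lives in the shared module. The repo's actual interface may differ, but a minimal sketch of the variables such an eks module could expose (the variable names and optional() defaults here are assumptions, not the module's confirmed API) looks like this:

```hcl
# terraform/modules/eks/variables.tf (illustrative sketch; requires Terraform >= 1.3 for optional())
variable "cluster_name" {
  type        = string
  description = "Name of the EKS cluster"
}

variable "cluster_version" {
  type        = string
  description = "Kubernetes version for the control plane"
  default     = "1.28"
}

variable "node_groups" {
  description = "Map of managed node group definitions"
  type = map(object({
    instance_types = list(string)
    capacity_type  = string
    min_size       = number
    max_size       = number
    labels         = optional(map(string), {})
    taints = optional(list(object({
      key    = string
      value  = string
      effect = string
    })), [])
  }))
}
```

Keeping the node group shape in one typed variable is what lets dev, staging, and prod share a single module with different Terragrunt inputs.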
Phase 2: GitOps Pipeline (Week 3-4)
I bootstrapped Flux CD to manage all Kubernetes resources:
flux bootstrap github \
  --owner=mateenali66 \
  --repository=devops-experiment \
  --branch=main \
  --path=kubernetes/clusters/dev
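Under the hood, bootstrap commits Flux's own manifests into that path and points a GitRepository plus a Kustomization at it, so the cluster continuously reconciles the repo. The generated sync objects look roughly like this (API versions and intervals depend on your Flux release, and the SSH URL form is an assumption based on bootstrap defaults):

```yaml
# kubernetes/clusters/dev/flux-system/gotk-sync.yaml (generated by bootstrap; shown as an illustrative sketch)
apiVersion: source.toolkit.fluxcd.io/v1
kind: GitRepository
metadata:
  name: flux-system
  namespace: flux-system
spec:
  interval: 1m0s
  url: ssh://git@github.com/mateenali66/devops-experiment
  ref:
    branch: main
  secretRef:
    name: flux-system
---
apiVersion: kustomize.toolkit.fluxcd.io/v1
kind: Kustomization
metadata:
  name: flux-system
  namespace: flux-system
spec:
  interval: 10m0s
  path: ./kubernetes/clusters/dev
  prune: true
  sourceRef:
    kind: GitRepository
    name: flux-system
```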
Now every change goes through Git:
- Developer opens a PR with Kubernetes manifest changes
- CI validates the YAML and runs flux diff (see the workflow sketch below)
- The PR gets reviewed and merged
- Flux automatically applies the changes to the cluster
- A Slack notification confirms the deployment
No more kubectl apply in production. Ever.
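The CI job itself isn't shown in the post, so here's a minimal sketch of what that validation step could look like as a GitHub Actions workflow. The workflow name, kubeconform as the schema checker, and the kubeconfig secret are all assumptions, not necessarily what the repo uses:

```yaml
# .github/workflows/validate.yaml (illustrative sketch; file name, tools, and secret name are assumptions)
name: validate-manifests
on:
  pull_request:
    paths:
      - "kubernetes/**"

jobs:
  validate:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4

      # Install the Flux CLI
      - uses: fluxcd/flux2/action@main

      # Schema-check every manifest in the repo (CRDs without published schemas are skipped)
      - name: Validate Kubernetes manifests
        run: |
          curl -sL https://github.com/yannh/kubeconform/releases/latest/download/kubeconform-linux-amd64.tar.gz \
            | tar xz kubeconform
          find kubernetes -name '*.yaml' -print0 \
            | xargs -0 ./kubeconform -strict -ignore-missing-schemas -summary

      # Show what would change in the cluster; needs read access via a kubeconfig secret
      - name: flux diff
        env:
          KUBECONFIG_DATA: ${{ secrets.DEV_KUBECONFIG }}
        run: |
          echo "$KUBECONFIG_DATA" > kubeconfig
          export KUBECONFIG=$PWD/kubeconfig
          flux diff kustomization flux-system --path ./kubernetes/clusters/dev
```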
Phase 3: GPU Scheduling (Week 5)
This was the tricky part. Getting Kubernetes to properly schedule GPU workloads requires:
- NVIDIA Device Plugin - Exposes GPUs as schedulable resources
- DCGM Exporter - GPU metrics for Prometheus
- Proper tolerations - Pods must tolerate GPU node taints
# kubernetes/apps/sample-gpu-app/deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: ml-inference
spec:
  selector:
    matchLabels:
      app: ml-inference
  template:
    metadata:
      labels:
        app: ml-inference
    spec:
      # Tolerate the GPU node taint so the pod is allowed onto GPU nodes
      tolerations:
        - key: nvidia.com/gpu
          operator: Exists
          effect: NoSchedule
      containers:
        - name: inference
          image: my-ml-model:v1.2
          resources:
            limits:
              nvidia.com/gpu: 1
              memory: "16Gi"
            requests:
              nvidia.com/gpu: 1
              memory: "8Gi"
          env:
            - name: CUDA_VISIBLE_DEVICES
              value: "0"
Phase 4: Observability (Week 6)
Deployed the full monitoring stack via Flux:
# kubernetes/infrastructure/monitoring/kube-prometheus-stack.yaml
apiVersion: helm.toolkit.fluxcd.io/v2beta1
kind: HelmRelease
metadata:
  name: kube-prometheus-stack
  namespace: monitoring
spec:
  interval: 30m
  chart:
    spec:
      chart: kube-prometheus-stack
      version: "51.x"
      sourceRef:
        kind: HelmRepository
        name: prometheus-community
  values:
    grafana:
      dashboardProviders:
        dashboardproviders.yaml:
          apiVersion: 1
          providers:
            - name: 'gpu-dashboards'
              folder: 'GPU'
              options:
                path: /var/lib/grafana/dashboards/gpu
Added GPU-specific alerts:
- alert: GPUMemoryExhaustion
  expr: DCGM_FI_DEV_FB_USED / DCGM_FI_DEV_FB_TOTAL > 0.95
  for: 5m
  labels:
    severity: critical
  annotations:
    summary: "GPU {{ $labels.gpu }} memory above 95%"
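With kube-prometheus-stack, a rule like this typically ships as a PrometheusRule object that the operator loads automatically. A minimal sketch, assuming the rule lives next to the monitoring stack (the file path and the release label value are assumptions and must match your Prometheus ruleSelector):

```yaml
# kubernetes/infrastructure/monitoring/gpu-alerts.yaml (illustrative sketch; names and label values are assumptions)
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: gpu-alerts
  namespace: monitoring
  labels:
    # Must match the ruleSelector of the kube-prometheus-stack Prometheus
    release: kube-prometheus-stack
spec:
  groups:
    - name: gpu.rules
      rules:
        - alert: GPUMemoryExhaustion
          expr: DCGM_FI_DEV_FB_USED / DCGM_FI_DEV_FB_TOTAL > 0.95
          for: 5m
          labels:
            severity: critical
          annotations:
            summary: "GPU {{ $labels.gpu }} memory above 95%"
```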
The Results
| Metric | Before | After | Change |
|---|---|---|---|
| Deployment frequency | 1/week | 10+/day | 70x faster |
| Lead time for changes | 2 weeks | 30 min | >99% reduction |
| MTTR | 4 hours | 15 min | 94% reduction |
| Failed deployments | 20% | <2% | 90% reduction |
| GPU utilization | 30% | 75% | 2.5x improvement |
| Infrastructure uptime | 80% | 99.9% | Production-grade |
Lessons Learned
1. GitOps requires discipline
The biggest challenge wasn't technical - it was cultural. Developers were used to quick kubectl fixes.
Solution: I made the bastion read-only for production. The only way to deploy is through Git.
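One concrete way to enforce that, assuming engineers reach the cluster through a mapped ops-readonly group (the group name is an assumption), is to bind the group to Kubernetes' built-in view ClusterRole and hand out nothing else:

```yaml
# Illustrative sketch; the group name "ops-readonly" is an assumption
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: ops-readonly
subjects:
  - kind: Group
    name: ops-readonly
    apiGroup: rbac.authorization.k8s.io
roleRef:
  kind: ClusterRole
  name: view   # built-in read-only role: no create/update/delete
  apiGroup: rbac.authorization.k8s.io
```

Flux's controllers keep their own write permissions, so deployments still happen, just only through Git.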
2. GPU scheduling has hidden gotchas
We hit an issue where GPU pods would get scheduled, but CUDA wouldn't initialize.
Root cause: The AMI's NVIDIA driver version didn't match the device plugin version.
Fix: Pin both versions explicitly and test in staging first.
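On EKS managed node groups, one way to pin the AMI (and therefore driver) side, assuming the module forwards these attributes to aws_eks_node_group (the attribute names are the provider's; the release value shown is a placeholder, not the one we run):

```hcl
# Illustrative sketch extending the gpu node group from Phase 1
gpu = {
  instance_types = ["g4dn.xlarge"]
  capacity_type  = "ON_DEMAND"
  min_size       = 0
  max_size       = 5
  # Pin the EKS-optimized GPU AMI so the bundled NVIDIA driver can't drift
  # underneath the pinned device-plugin chart version
  ami_type        = "AL2_x86_64_GPU"
  release_version = "1.28.x-YYYYMMDD" # placeholder; use the release you validated in staging
}
```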
3. Start with observability
I'm glad I deployed monitoring early. When we hit issues during migration, we could actually debug them.
4. Cost visibility matters
Adding Kubecost early showed us that one team's "small experiment" was burning $3K/month in idle GPU instances.
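Kubecost fits into the same Flux pattern as everything else. A sketch of the HelmRelease, assuming the upstream cost-analyzer chart and reuse of the existing Prometheus (the namespace and the Prometheus service address are assumptions):

```yaml
# kubernetes/infrastructure/monitoring/kubecost.yaml (illustrative sketch; names and values are assumptions)
apiVersion: source.toolkit.fluxcd.io/v1beta2
kind: HelmRepository
metadata:
  name: kubecost
  namespace: kubecost
spec:
  interval: 1h
  url: https://kubecost.github.io/cost-analyzer/
---
apiVersion: helm.toolkit.fluxcd.io/v2beta1
kind: HelmRelease
metadata:
  name: kubecost
  namespace: kubecost
spec:
  interval: 30m
  chart:
    spec:
      chart: cost-analyzer
      sourceRef:
        kind: HelmRepository
        name: kubecost
  values:
    # Point at the existing kube-prometheus-stack instead of Kubecost's bundled Prometheus
    global:
      prometheus:
        enabled: false
        fqdn: http://kube-prometheus-stack-prometheus.monitoring.svc:9090 # service name is an assumption
```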
Try It Yourself
The full implementation is open source:
GitHub: github.com/mateenali66/devops-experiment
Includes:
- Terraform modules for VPC and EKS
- Terragrunt configs for multi-environment
- Flux GitOps bootstrap
- GPU node configuration
- Monitoring stack with GPU dashboards
What's Next?
I'm working on adding:
- Karpenter for smarter autoscaling
- Spot instance interruption handling for GPU nodes
- ML model versioning with Flux image automation
Follow me for updates on GPU infrastructure and Kubernetes at scale.
Questions? Drop a comment or reach out through my website.