Ramchandra Reddy
I Ripped Out Docker Compose from Our ML Platform and Put Everything on EKS. Here's What Actually Happened.



I'll be honest - I resisted this for longer than I should have.
Our ML pipeline on Docker Compose was working. Not perfectly, but it was working. I knew where everything lived. I could debug it. The data science team understood it. And every time someone suggested moving to Kubernetes, I'd think "that's a lot of complexity for a problem we don't have yet."
Then we had the problem.
Three data scientists started running concurrent training jobs. One job consumed all GPU memory and the other two silently failed with zero useful error messages. Our serving container kept getting OOMKilled under load and nobody knew why because there was no proper metrics collection. We had a Friday afternoon incident where a model that had been in production for 4 months started returning garbage predictions - turned out the feature distribution had shifted weeks earlier and we had no monitoring to catch it. We only found out when the product team noticed the fraud catch rate had dropped.

That was the moment I stopped defending Compose.
Eight months later we're running the full ML platform on EKS with a proper tool stack and the difference is embarrassing. I'm not going to pretend the migration was painless or that we got everything right the first time. But the fundamentals are solid now and I want to write down what's actually running and why, because most of what I read on this topic is either too shallow or written by people who've never had to keep this stuff alive at 2am.

The honest reason Compose breaks for ML
The thing nobody tells you is that ML workloads are almost uniquely bad for a single-machine orchestrator like Compose.
Training needs a lot of compute for a short time and then needs nothing. Serving needs to be always on with predictable latency. Feature engineering is I/O and memory heavy. Monitoring should run on almost nothing. These workloads have nothing in common resource-wise, and running them on the same machine managed by Compose means they're constantly fighting each other.

When a training job spins up and starts pulling 40GB of data into memory, your serving latency spikes. When you need to scale serving up to handle traffic, you can't because the training job already ate the node's CPU. The whole thing is just shared-resource chaos with a YAML file on top.
Kubernetes solves this by letting you bin-pack workloads onto appropriately sized nodes and isolate them from each other. It's not magic, it's just the right tool for the job.
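To make the bin-packing point concrete, here's a toy first-fit packer - a deliberately naive model of what the scheduler does with resource requests, not how kube-scheduler is actually implemented, and the pod sizes are illustrative:

```python
def first_fit(pods, node_cpu, node_mem):
    """Toy first-fit packing: place each pod on the first node with room
    for its (cpu, mem) request, opening a new node when none fits."""
    nodes = []       # remaining (cpu, mem) per node
    placement = {}   # pod name -> node index
    for name, cpu, mem in pods:
        for i, (c, m) in enumerate(nodes):
            if cpu <= c and mem <= m:
                nodes[i] = (c - cpu, m - mem)
                placement[name] = i
                break
        else:
            nodes.append((node_cpu - cpu, node_mem - mem))
            placement[name] = len(nodes) - 1
    return placement

# The training job's request fills a whole node, so the serving pods land
# on a different node instead of fighting it for CPU and memory
pods = [("trainer", 16, 64), ("serving-a", 2, 4),
        ("serving-b", 2, 4), ("feature-eng", 4, 16)]
placement = first_fit(pods, node_cpu=16, node_mem=64)
```

Once workloads declare requests, isolation falls out of the packing: the trainer can't starve serving because it never shares a node with it.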
What we actually run

I'm going to walk through the components that matter and why we picked them. I'm not going to pretend there wasn't a lot of trial and error.
Argo Workflows for pipeline orchestration. I looked at Prefect, Airflow, and Metaflow before landing here. Argo Workflows is just Kubernetes-native - your pipeline is a YAML spec that runs actual containers, not Python functions that spin up containers. The debugging experience is better because the logs and artifacts are first-class. The DAG with conditional steps is how we gate model registration on evaluation metrics:

apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: fraud-pipeline-
  namespace: mlops
  labels:
    created-by: Ram
spec:
  serviceAccountName: argo-workflow-sa
  entrypoint: pipeline
  arguments:
    parameters:
      - name: data-version
        value: "v3.2.1"
      - name: model-name
        value: "fraud-detector"
  templates:
    - name: pipeline
      dag:
        tasks:
          - name: validate-data
            template: data-validator
            arguments:
              parameters:
                - name: data-version
                  value: "{{workflow.parameters.data-version}}"
          - name: engineer-features
            template: feature-engineer
            dependencies: [validate-data]
          - name: train
            template: distributed-trainer
            dependencies: [engineer-features]
            arguments:
              parameters:
                - name: feature-path
                  value: "{{tasks.engineer-features.outputs.parameters.feature-path}}"
          - name: evaluate
            template: evaluator
            dependencies: [train]
            arguments:
              parameters:
                - name: run-id
                  value: "{{tasks.train.outputs.parameters.run-id}}"
          # This step only runs if AUC clears the bar
          # Previously this logic lived in a bash script nobody understood
          - name: register
            template: registrar
            dependencies: [evaluate]
            when: "{{tasks.evaluate.outputs.parameters.auc}} > 0.95"
            arguments:
              parameters:
                - name: run-id
                  value: "{{tasks.train.outputs.parameters.run-id}}"

    - name: distributed-trainer
      inputs:
        parameters:
          - name: feature-path
      outputs:
        parameters:
          - name: run-id
            valueFrom:
              path: /tmp/mlflow-run-id.txt
      # This creates a PyTorchJob and waits for it to finish
      resource:
        action: create
        successCondition: status.replicaStatuses.Worker.succeeded == 3
        failureCondition: status.replicaStatuses.Worker.failed > 0
        manifest: |
          apiVersion: kubeflow.org/v1
          kind: PyTorchJob
          metadata:
            name: fraud-train-{{workflow.uid}}
            namespace: mlops
          spec:
            pytorchReplicaSpecs:
              Master:
                replicas: 1
                template:
                  spec:
                    containers:
                      - name: pytorch
                        image: 123456789.dkr.ecr.us-east-1.amazonaws.com/mlops/trainer:3.0.0
                        args:
                          - "--features={{inputs.parameters.feature-path}}"
                          - "--run-id-out=/tmp/mlflow-run-id.txt"
                        resources:
                          limits:
                            nvidia.com/gpu: "1"
                            memory: "32Gi"
                            cpu: "8"
                        env:
                          - name: MLFLOW_TRACKING_URI
                            value: "http://mlflow-service.mlflow.svc.cluster.local:5000"
                    nodeSelector:
                      role: gpu-training
                    tolerations:
                      - key: nvidia.com/gpu
                        operator: Exists
              Worker:
                replicas: 3
                template:
                  spec:
                    containers:
                      - name: pytorch
                        image: 123456789.dkr.ecr.us-east-1.amazonaws.com/mlops/trainer:3.0.0
                        resources:
                          limits:
                            nvidia.com/gpu: "1"
                            memory: "32Gi"
                    nodeSelector:
                      role: gpu-training

That when condition on the register step saved us from accidentally promoting a bad model at least three times. Before this we had a Python script that was supposed to check metrics before pushing to the registry but it had a bug that had been silently ignored for months. Declarative pipelines are harder to misread.
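On the evaluator side, that gate only needs the metric written to a file that Argo reads back through `outputs.parameters.valueFrom`. A minimal sketch - the path and formatting are assumptions, not our actual evaluator:

```python
from pathlib import Path

def emit_auc(auc: float, out_path: str = "/tmp/auc.txt") -> str:
    """Write the evaluation metric where the template's valueFrom.path
    points; Argo turns the file contents into the output parameter that
    the `when:` clause on the register step compares against 0.95."""
    text = f"{auc:.4f}"
    Path(out_path).write_text(text)
    return text
```

The evaluator stays dumb on purpose: it reports a number, and the promotion decision lives in the declarative pipeline where everyone can see it.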
The feature store was the thing I least expected to matter this much
Training-serving skew is one of those problems that sounds academic until you actually have it. We had it. The fraud model was computing velocity_score slightly differently in the feature engineering pipeline versus the serving code. Both implementations looked correct. The bug was subtle - an off-by-one in a rolling window calculation. The model had essentially been trained on slightly wrong data for two months.
Feast fixed this by making the feature definition the single source of truth. You define the feature once. Training pulls it from the offline store. Serving pulls it from the online store. Same definition, same logic, no drift.

# feature_repo/features.py
# This file is owned by the ML team, not the platform team
# Changes go through PR review before anything touches prod

from datetime import timedelta

from feast import Entity, Feature, FeatureService, FeatureView, ValueType
from feast.infra.offline_stores.contrib.athena_offline_store.athena import AthenaSource

user = Entity(name="user_id", value_type=ValueType.STRING)

transaction_source = AthenaSource(
    table="transactions_features",
    database="mlops_feature_db",
    s3_staging_location="s3://ram-mlops-prod/athena-staging/",
    timestamp_field="event_timestamp",
    created_timestamp_column="created_at",
)

user_transaction_features = FeatureView(
    name="user_transaction_features",
    entities=["user_id"],
    ttl=timedelta(days=7),
    features=[
        Feature(name="tx_count_1h", dtype=ValueType.INT64),
        Feature(name="tx_count_24h", dtype=ValueType.INT64),
        Feature(name="avg_amount_7d", dtype=ValueType.DOUBLE),
        Feature(name="std_amount_7d", dtype=ValueType.DOUBLE),
        Feature(name="unique_merchants_7d", dtype=ValueType.INT64),
        Feature(name="high_risk_merchant", dtype=ValueType.BOOL),
        Feature(name="velocity_score", dtype=ValueType.DOUBLE),
    ],
    online=True,
    source=transaction_source,
    tags={"team": "fraud-ml", "owner": "Ram"},
)

fraud_v3 = FeatureService(
    name="fraud_detection_v3",
    features=[user_transaction_features],
)

Training fetches from Athena with point-in-time joins handled by Feast so there's no label leakage. Serving fetches from Redis with sub-millisecond latency. Same feature service name in both paths:

# training
training_df = store.get_historical_features(
    entity_df=entity_df,
    features=store.get_feature_service("fraud_detection_v3"),
).to_df()

# serving
feature_vector = store.get_online_features(
    features=store.get_feature_service("fraud_detection_v3"),
    entity_rows=[{"user_id": user_id}],
).to_dict()

If the velocity_score logic ever changes, you change it in one place. The PR diff is obvious. No hunting through two separate codebases trying to figure out why train and serve disagree.
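The point-in-time part is worth spelling out, because it's what prevents label leakage: for every labeled event, the join only picks up feature values that existed at or before that event's timestamp, never after. A toy version of that join with pandas `merge_asof`, on made-up data - this illustrates the semantics, not Feast's implementation:

```python
import pandas as pd

# Feature rows as they were materialized over time
features = pd.DataFrame({
    "user_id": ["u1", "u1", "u2"],
    "event_timestamp": pd.to_datetime(["2024-01-01", "2024-01-03", "2024-01-02"]),
    "tx_count_24h": [2, 9, 4],
})

# Labeled events we want to train on
labels = pd.DataFrame({
    "user_id": ["u1", "u2"],
    "event_timestamp": pd.to_datetime(["2024-01-02", "2024-01-05"]),
    "is_fraud": [0, 1],
})

# For each label row, take the most recent feature row at or before the
# label's timestamp - u1's 2024-01-03 value never leaks into its
# 2024-01-02 label, so the training set matches what serving saw
training_df = pd.merge_asof(
    labels.sort_values("event_timestamp"),
    features.sort_values("event_timestamp"),
    on="event_timestamp",
    by="user_id",
    direction="backward",
)
```

If you joined naively on `user_id` and took the latest value instead, u1's label would be paired with a feature computed a day after the event - exactly the kind of quiet leakage the offline store exists to rule out.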

KServe instead of custom FastAPI - this was a controversial call
Half the team wanted to keep the custom FastAPI inference server we'd built. I understood the argument - we knew it, we'd debugged it, it did what we needed. The counterargument was that we were spending meaningful engineering time maintaining serving infrastructure that isn't our actual job.

KServe won. And a year in I'd make the same call again.

apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: fraud-detector
  namespace: model-serving
  labels:
    model-version: "3.2.1"
    created-by: Ram
spec:
  predictor:
    minReplicas: 3
    maxReplicas: 100
    scaleTarget: 50
    scaleMetric: concurrency
    model:
      modelFormat:
        name: mlflow
      runtime: kserve-mlserver
      # Model is loaded from S3 at startup, not baked into the image
      # Change this URI and restart to deploy a new model version
      storageUri: "s3://ram-mlops-prod/mlflow/artifacts/2/abc123/artifacts/model"
      resources:
        requests:
          cpu: "1"
          memory: "2Gi"
        limits:
          cpu: "2"
          memory: "4Gi"
      # Give models time to load before taking traffic
      # Without this you get CrashLoopBackOff that isn't actually a crash
      readinessProbe:
        httpGet:
          path: /v2/health/ready
          port: 8080
        initialDelaySeconds: 60
        periodSeconds: 10
      livenessProbe:
        httpGet:
          path: /v2/health/live
          port: 8080
        initialDelaySeconds: 120
        periodSeconds: 30
        failureThreshold: 3
    nodeSelector:
      role: serving

Canary rollouts went from a manual, nerve-wracking process to a one-line config change:

Route 10% of traffic to the new version

Watch Grafana for 30 minutes

If metrics look good, change to 50, then 100

If something looks wrong, change to 0

No redeployment. No downtime. No drama.

spec:
  predictor:
    canaryTrafficPercent: 10
    model:
      storageUri: "s3://ram-mlops-prod/mlflow/artifacts/fraud-detector/3.3.0"

We've rolled back two model versions this way. The entire rollback took under 2 minutes including the time it took me to notice the metrics were wrong.
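Under the hood a traffic split is just deterministic bucketing of requests. A toy model of what `canaryTrafficPercent` means - KServe's actual routing happens in the serving layer, this is only the idea:

```python
import zlib

def route(request_id: str, canary_percent: int) -> str:
    """Hash each request into one of 100 buckets; the first
    canary_percent buckets go to the new model version. crc32 keeps
    the split deterministic across processes, unlike Python's hash()."""
    bucket = zlib.crc32(request_id.encode()) % 100
    return "canary" if bucket < canary_percent else "stable"

# Over many requests the split converges on the configured percentage
canary_share = sum(
    route(f"req-{i}", 10) == "canary" for i in range(10_000)
) / 10_000
```

The useful property is that changing the percentage moves traffic instantly without touching either deployment, which is why rollback is just setting it back to 0.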

GPU cost was way higher than it needed to be
Before KEDA, we had GPU nodes running 24/7 because training jobs were submitted manually and the machines needed to be warm. We were paying for 8 GPU hours per day and actually using maybe 3 of them.
KEDA lets training nodes scale to zero. A data scientist submits a job by pushing a message to SQS. KEDA sees the message, Karpenter provisions a spot GPU node, the training job runs, the node terminates. You pay for what you use.

apiVersion: keda.sh/v1alpha1
kind: ScaledJob
metadata:
  name: model-training-scaledjob
  namespace: mlops
spec:
  jobTargetRef:
    template:
      spec:
        containers:
          - name: trainer
            image: 123456789.dkr.ecr.us-east-1.amazonaws.com/mlops/trainer:3.0.0
            env:
              - name: MLFLOW_TRACKING_URI
                value: "http://mlflow-service.mlflow.svc.cluster.local:5000"
              - name: SQS_QUEUE_URL
                valueFrom:
                  secretKeyRef:
                    name: mlops-secrets
                    key: training-queue-url
            resources:
              limits:
                nvidia.com/gpu: "1"
                memory: "32Gi"
        nodeSelector:
          role: gpu-training
        tolerations:
          - key: nvidia.com/gpu
            operator: Exists
        restartPolicy: Never
  triggers:
    - type: aws-sqs-queue
      metadata:
        queueURL: https://sqs.us-east-1.amazonaws.com/123456789/mlops-training-jobs
        queueLength: "1"
        awsRegion: us-east-1
        identityOwner: operator  # IRSA - no AWS keys anywhere in this file
  minReplicaCount: 0
  maxReplicaCount: 10
  pollingInterval: 15
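The submit side that feeds this queue is deliberately boring: serialize a request and send it. The payload fields are our own convention - KEDA only counts visible messages; the queue URL and field names here are assumptions for the sketch:

```python
import json

def build_training_message(model_name: str, data_version: str, image_tag: str) -> str:
    """Serialize a training request. The job template reads these fields;
    KEDA itself only cares that a message exists on the queue."""
    return json.dumps({
        "model_name": model_name,
        "data_version": data_version,
        "image_tag": image_tag,
    })

def submit(sqs_client, queue_url: str, model_name: str, data_version: str, image_tag: str):
    """sqs_client is a boto3 SQS client; inside the cluster its credentials
    come from IRSA, so there are no access keys anywhere in this path."""
    return sqs_client.send_message(
        QueueUrl=queue_url,
        MessageBody=build_training_message(model_name, data_version, image_tag),
    )
```

A data scientist runs one command that calls `submit`, KEDA sees queue depth go above zero, and the rest of the chain - Karpenter node, spot GPU, job, teardown - happens without anyone touching the cluster.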

GPU costs dropped by roughly two thirds after this. That wasn't the goal - reproducibility was - but it was a welcome outcome.
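The arithmetic behind "roughly two thirds" is simple. The rates below are normalized and illustrative, not our actual bill, and the spot discount is an assumption:

```python
# Before: GPU nodes warm 24/7 meant paying for ~8 GPU-hours/day
# while only ~3 of them did useful work
rate = 1.0                     # normalized $/GPU-hour, on demand
before = 8 * rate

# Scale-to-zero alone: pay only for the hours actually used
after_scale_to_zero = 3 * rate
hours_savings = 1 - after_scale_to_zero / before   # 0.625 - roughly two thirds

# Spot capacity (assumed ~60% cheaper than on-demand) compounds on top
after_spot = 3 * (rate * 0.4)
total_savings = 1 - after_spot / before
```

Scale-to-zero does most of the work on its own; spot is a multiplier, not the headline.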
The monitoring piece I should have built first
The Friday incident I mentioned at the start? A proper drift monitor would have caught it three weeks earlier. That's the thing about model monitoring - you always build it after the first incident that would have been prevented by it.
We use Evidently AI now, running as a Kubernetes CronJob that compares the last 24 hours of production feature distributions against the training baseline:

# src/monitoring/drift_monitor.py

from datetime import datetime

import boto3
from evidently.report import Report
from evidently.metric_preset import DataDriftPreset, DataQualityPreset
from evidently.metrics import ColumnDriftMetric


def run_drift_analysis(reference_data, current_data, model_name):
    report = Report(metrics=[
        DataDriftPreset(
            drift_share=0.3,  # fire if more than 30% of features drifted
            stattest="psi",
            stattest_threshold=0.2,
        ),
        DataQualityPreset(),
        ColumnDriftMetric(column_name="velocity_score"),
        ColumnDriftMetric(column_name="avg_amount_7d"),
    ])
    report.run(reference_data=reference_data, current_data=current_data)

    result = report.as_dict()
    drift_detected = result["metrics"][0]["result"]["dataset_drift"]
    drift_share = result["metrics"][0]["result"]["share_of_drifted_columns"]

    # Push to CloudWatch so our alarms can fire
    cw = boto3.client("cloudwatch", region_name="us-east-1")
    cw.put_metric_data(
        Namespace=f"MLOps/{model_name}",
        MetricData=[{
            "MetricName": "DriftShareOfColumns",
            "Value": drift_share,
            "Unit": "None",
            "Dimensions": [{"Name": "Model", "Value": model_name}],
        }],
    )

    # Save the HTML report to S3 - linked from the Slack alert
    report_key = f"monitoring/{model_name}/{datetime.utcnow().strftime('%Y/%m/%d/%H')}/report.html"
    report.save_html("/tmp/report.html")
    boto3.client("s3").upload_file("/tmp/report.html", "ram-mlops-prod", report_key)

    return {"drift_detected": drift_detected, "drift_share": drift_share}
When drift crosses the threshold, CloudWatch fires an alarm to SNS and Slack gets a message linking directly to the Evidently HTML report. The on-call person can see exactly which features moved, by how much, and when it started. The Argo pipeline auto-triggers a retraining run if drift has been consistently high for 48 hours.
We haven't had a silent model degradation incident since we turned this on. That's the whole point.
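For intuition on what that `stattest="psi"` setting computes: PSI bins the reference distribution, compares each bin's share against production, and sums the divergence. A from-scratch sketch - Evidently's implementation differs in detail, this is just the statistic:

```python
import numpy as np

def psi(reference, current, bins=10, eps=1e-4):
    """Population Stability Index: 0 means identical bin shares;
    > 0.2 is the conventional 'significant shift' threshold."""
    # Quantile edges so each reference bin holds ~equal mass
    edges = np.quantile(reference, np.linspace(0, 1, bins + 1))
    # Clip production values into the reference range so none fall outside the bins
    current = np.clip(current, edges[0], edges[-1])

    ref_share = np.histogram(reference, bins=edges)[0] / len(reference)
    cur_share = np.histogram(current, bins=edges)[0] / len(current)

    # Floor the shares so empty bins don't blow up the log
    ref_share = np.clip(ref_share, eps, None)
    cur_share = np.clip(cur_share, eps, None)
    return float(np.sum((cur_share - ref_share) * np.log(cur_share / ref_share)))

rng = np.random.default_rng(42)
baseline = rng.normal(0, 1, 10_000)
stable = psi(baseline, rng.normal(0, 1, 10_000))   # near zero
shifted = psi(baseline, rng.normal(1, 1, 10_000))  # well above 0.2
```

A one-standard-deviation shift in a feature lights this up immediately, which is exactly the failure mode the Friday incident hid for weeks.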

GitOps because "I applied it manually and forgot to commit" is not a production workflow
Every config change in the cluster happens through a Git PR now. Argo CD watches the repo and reconciles continuously:

apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: mlops-platform
  namespace: argocd
  labels:
    created-by: Ram
spec:
  project: default
  source:
    repoURL: https://github.com/ramcreddy-ch/eks-production-terraform
    targetRevision: main
    path: k8s/
  destination:
    server: https://kubernetes.default.svc
    namespace: mlops
  syncPolicy:
    automated:
      prune: true     # if you delete it from git, it gets deleted from the cluster
      selfHeal: true  # if someone does kubectl apply manually, Argo reverts it
    retry:
      limit: 5
      backoff:
        duration: 5s
        factor: 2
        maxDuration: 3m

The selfHeal: true is the one that matters most culturally. It enforces the discipline. If someone gets frustrated and does a kubectl edit directly on a production resource at midnight, Argo reverts it within minutes. You either fix it properly in Git or it doesn't stick.
That sounds harsh. It's actually clarifying. There's no ambiguity about where the source of truth lives.
No AWS credentials anywhere in the cluster
This is a hill I will die on. No AWS access keys in environment variables. No shared credentials mounted as files. Every service account that needs AWS access gets an IAM role via IRSA and assumes it automatically.

# terraform/irsa-mlops.tf
# authored by Ram

module "irsa_kserve" {
  source    = "terraform-aws-modules/iam/aws//modules/iam-role-for-service-accounts-eks"
  role_name = "ram-mlops-prod-kserve"

  oidc_providers = {
    main = {
      provider_arn               = module.eks.oidc_provider_arn
      namespace_service_accounts = ["model-serving:kserve-sa"]
    }
  }

  role_policy_arns = {
    s3_read = aws_iam_policy.kserve_s3_read.arn
  }
}

module "irsa_drift_monitor" {
  source    = "terraform-aws-modules/iam/aws//modules/iam-role-for-service-accounts-eks"
  role_name = "ram-mlops-prod-drift-monitor"

  oidc_providers = {
    main = {
      provider_arn               = module.eks.oidc_provider_arn
      namespace_service_accounts = ["mlops:drift-monitor-sa"]
    }
  }

  role_policy_arns = {
    policy = aws_iam_policy.drift_monitor_policy.arn
  }
}

module "irsa_feast" {
  source    = "terraform-aws-modules/iam/aws//modules/iam-role-for-service-accounts-eks"
  role_name = "ram-mlops-prod-feast"

  oidc_providers = {
    main = {
      provider_arn               = module.eks.oidc_provider_arn
      namespace_service_accounts = ["mlops:feast-sa"]
    }
  }

  role_policy_arns = {
    feast = aws_iam_policy.feast_policy.arn
  }
}
KServe reads from S3. Drift monitor writes to CloudWatch. Feast reads DynamoDB and S3. Every role is scoped to exactly what it needs. If a container gets compromised, the blast radius is limited to what that specific service account was allowed to do.
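Mechanically, IRSA works because the EKS pod identity webhook injects two environment variables into the pod - AWS_ROLE_ARN and AWS_WEB_IDENTITY_TOKEN_FILE - and the AWS SDK's default credential chain uses them to call AssumeRoleWithWebIdentity on its own. A small preflight check along these lines can fail fast at container start; the helper itself is a sketch, not part of our stack:

```python
import os

def irsa_available(env=os.environ) -> bool:
    """True when the pod identity webhook has injected the web-identity
    variables boto3's default credential chain looks for - meaning this
    container can reach AWS with no access keys configured anywhere."""
    return bool(env.get("AWS_ROLE_ARN")) and bool(env.get("AWS_WEB_IDENTITY_TOKEN_FILE"))
```

If this returns False inside the cluster, the service account annotation or the IRSA module wiring is wrong, and it's better to crash at startup than fail on the first S3 call.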
What I'd do differently

A few things I got wrong and had to fix:

The node group setup took two iterations. I originally had serving and feature engineering on the same node pool and they fought for memory constantly. Separate pools with taints and node selectors are non-negotiable for ML workloads with different resource profiles.
I underestimated how long model loading takes in KServe. Some of our larger models take 90 seconds to load from S3. I set initialDelaySeconds: 30 on the readiness probe initially and spent two hours debugging what looked like CrashLoopBackOff but was just slow model loading. Set initialDelaySeconds to something generous - you can always tighten it later.

The Feast migration was harder than expected because the team had feature logic scattered across notebooks, training scripts, and serving code in three different forms. Centralising it into Feast required a proper audit and some uncomfortable conversations about whose version was actually correct. Plan for that work. It's worth it but it's not just a technical migration.

Actual numbers after 8 months
GPU training costs went from roughly $1,000 per model to around $200. The difference is spot instances plus zero-scale-when-idle.
Time from "model is trained and evaluated" to "model is in production" is now measured in hours not weeks. Mostly because the deployment path is just updating a storageUri in a YAML file and the canary rollout handles the rest.

We've caught four cases of significant feature drift before they caused user-facing problems. The Evidently monitor paid for itself in the first month.

Onboarding a new data scientist used to mean a week of environment setup, credential sharing, and "just ask whoever set this up before." Now it's cloning the repo and running terraform output configure_kubectl.
None of this is magic. It's just the right tools in the right places, with enough discipline around GitOps and IRSA that the system is actually manageable. The EKS cluster Terraform is on my GitHub at github.com/ramcreddy-ch/eks-production-terraform if you want to see how the node groups and IRSA roles are wired up.
What's the ML infrastructure problem you're dealing with right now? Genuinely asking - the more specific the problem the more useful the conversation tends to be.

Ram - Platform Engineering | EKS • MLOps • Terraform • Kubernetes github.com/ramcreddy-ch


#MLOps #Kubernetes #EKS #MachineLearning #AWS #Terraform #KServe #ArgoWorkflows #PlatformEngineering #DataScience #Feast #KEDA #Karpenter
