What do AI engineers hate most? Not hyperparameter tuning. Not waiting for GPUs. It's setting up a distributed training job on Kubernetes.
PyTorchJob, TFJob, MPIJob, XGBoostJob, PaddleJob, JAXJob. Six CRDs, six YAML formats, six knowledge domains. Switch your framework, relearn the API. Even more absurd: each one reimplements Gang Scheduling and failure restart — features that already have battle-tested solutions in the K8s ecosystem.
In July 2025, Kubeflow Trainer v2.0 shipped and ended this chaos.
Introduction: What Was Wrong with v1
Training Operator v1 (Kubeflow Trainer's predecessor) was essentially a brute-force concatenation of per-framework Operators, which created fundamental problems:
| Problem | Symptom |
|---|---|
| Framework explosion | 6 independent CRDs, 6 different APIs |
| High barrier to entry | AI practitioners had to understand Pod/container specs |
| Hard to extend | New framework support required modifying the Operator core |
| Reinventing the wheel | Reimplementing Gang Scheduling, etc., from scratch |
| No closed-source support | v1 architecture assumed community-owned framework code |
v2's goal was clear: abstract Kubernetes complexity away from AI practitioners, with a single TrainJob API serving all frameworks.
Part 1: The Core Architecture — Separation of Concerns
v2 draws clear role boundaries:
```
┌─────────────────────────────────────────────────────────────┐
│ Platform Administrator                                      │
│   → Manages ClusterTrainingRuntime / TrainingRuntime        │
│   → Defines training infrastructure templates               │
│     (images / resources / distributed config / scheduling)  │
└──────────────────────────┬──────────────────────────────────┘
                           │ runtimeRef (reference)
┌──────────────────────────▼──────────────────────────────────┐
│ AI Practitioner                                             │
│   → Creates TrainJob (references Runtime, no K8s details)   │
│   → Or via Python SDK: TrainerClient.train()                │
└─────────────────────────────────────────────────────────────┘
```
AI researchers focus on algorithms. Infrastructure engineers manage the platform. Clean separation, no cross-contamination.
Part 2: Three CRDs, One System
TrainJob: The AI Practitioner's Interface
A complete LLM fine-tuning job looks like this:
```yaml
apiVersion: trainer.kubeflow.org/v1alpha1
kind: TrainJob
metadata:
  name: llm-finetune-job
  namespace: team-ml
spec:
  runtimeRef:
    kind: ClusterTrainingRuntime
    name: torch-distributed
  trainer:
    numNodes: 4  # 4 training nodes (GPU nodes)
    image: pytorch/pytorch:2.7.1-cuda12.8-cudnn9-runtime
    command: ["python", "/workspace/train.py"]
    args: ["--epochs=10"]
    resourcesPerNode:
      requests:
        nvidia.com/gpu: "4"
      limits:
        nvidia.com/gpu: "4"
  # Auto-download dataset and model from HuggingFace
  initializer:
    dataset:
      storageUri: "hf://tatsu-lab/alpaca"
      env:
        - name: HF_TOKEN
          valueFrom:
            secretKeyRef:
              name: hf-secret
              key: token
    model:
      storageUri: "hf://meta-llama/Llama-3.2-7B-Instruct"
```
The AI engineer only writes training-logic-related parameters. How Pods are created, how torchrun launches, how rendezvous is configured — all hidden in the Runtime.
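That hidden Runtime is the artifact the platform administrator maintains. A minimal sketch of what such a ClusterTrainingRuntime might look like, based on the v1alpha1 API shape (treat the exact field layout as illustrative, not a drop-in manifest):

```yaml
apiVersion: trainer.kubeflow.org/v1alpha1
kind: ClusterTrainingRuntime
metadata:
  name: torch-distributed
spec:
  mlPolicy:
    numNodes: 1             # default, overridden by the TrainJob's numNodes
    torch:
      numProcPerNode: auto  # one worker process per visible GPU
  template:
    spec:                   # JobSet template: gang-scheduled, restart-aware Pods
      replicatedJobs:
        - name: node
          template:
            spec:
              template:
                spec:
                  containers:
                    - name: node
                      image: pytorch/pytorch:2.7.1-cuda12.8-cudnn9-runtime
```

The TrainJob's `trainer` fields (image, command, resources) are merged over this template at reconcile time, which is how one Runtime serves many jobs.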
Built-in ClusterTrainingRuntimes
| Runtime | Framework | Notes |
|---|---|---|
| `torch-distributed` | PyTorch DDP/FSDP | Standard distributed training |
| `deepspeed-distributed` | DeepSpeed | ZeRO-series large-model training |
| `torchtune-llama3.2-7b` | TorchTune | Built-in LLM fine-tuning (zero code!) |
| `torch-distributed-with-cache` | PyTorch + Cache | v2.1 data-cache feature |
| `mlx-distributed` | MLX | Apple Silicon clusters |
Part 3: Python SDK — Low-Code Training
```python
from kubeflow.trainer import TrainerClient

client = TrainerClient()

def train_func():
    import torch
    import torch.distributed as dist

    dist.init_process_group(backend="nccl")
    model = MyModel()  # your model class, defined inside or imported by train_func
    optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
    for epoch in range(10):
        # training loop
        pass

# One-line submission to the K8s cluster
job = client.train(
    func=train_func,
    num_nodes=4,
    resources_per_node={"gpu": "4", "memory": "32Gi"},
    runtime_ref="torch-distributed",
)

# Real-time log streaming
client.get_job_logs(name=job.name)
```
The SDK auto-handles: serializing code to a ConfigMap → injecting into Pods → configuring distributed env vars → launching torchrun. AI researchers never see K8s.
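Inside `train_func`, the code can therefore rely on the rendezvous environment variables that torchrun injects into every worker. A small self-contained sketch of reading them (these are the standard PyTorch elastic variables, not anything Trainer-specific):

```python
import os

def dist_info():
    """Read the rendezvous variables torchrun sets for each worker process."""
    return {
        "rank": int(os.environ.get("RANK", "0")),
        "world_size": int(os.environ.get("WORLD_SIZE", "1")),
        "local_rank": int(os.environ.get("LOCAL_RANK", "0")),
        "master_addr": os.environ.get("MASTER_ADDR", "localhost"),
        "master_port": int(os.environ.get("MASTER_PORT", "29500")),
    }

info = dist_info()
print(f"worker {info['rank']} of {info['world_size']}")
```

With `num_nodes=4` and 4 GPUs per node, each of the 16 processes sees a distinct `RANK` while sharing the same `MASTER_ADDR`; `dist.init_process_group()` reads these same variables under the default `env://` init method.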
Part 4: Built-in LLM Fine-Tuning — BuiltinTrainer + TorchTune
The killer feature of v2.1 — fine-tune an LLM without writing a single line of training code:
```yaml
apiVersion: trainer.kubeflow.org/v1alpha1
kind: TrainJob
metadata:
  name: llama-sft
spec:
  runtimeRef:
    kind: ClusterTrainingRuntime
    name: torchtune-llama3.2-7b  # built-in LLM fine-tuning Runtime
  initializer:
    dataset:
      storageUri: "hf://tatsu-lab/alpaca"
    model:
      storageUri: "hf://meta-llama/Llama-3.2-7B-Instruct"
  trainer:
    numNodes: 2
    resourcesPerNode:
      requests:
        nvidia.com/gpu: "4"
```
TorchTune BuiltinTrainer handles: QLoRA/LoRA fine-tuning, mixed-precision training, FSDP sharding, checkpoint saving. Platform admins pre-configure the Recipe; users only provide dataset and model source.
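Since TrainJob already exposes `trainer.args`, recipe defaults can plausibly be tweaked without building a custom image, assuming the pre-configured Recipe accepts torchtune-style `key=value` overrides (this is an assumption about the runtime, not a documented contract; check your admin's Recipe for the actual keys):

```yaml
trainer:
  numNodes: 2
  args:
    - "epochs=1"      # hypothetical override keys, shown for illustration
    - "batch_size=4"
  resourcesPerNode:
    requests:
      nvidia.com/gpu: "4"
```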
Part 5: Migrating from v1 — Before and After
```yaml
# v1 (PyTorchJob) — the old way
apiVersion: kubeflow.org/v1
kind: PyTorchJob
spec:
  pytorchReplicaSpecs:
    Master:
      replicas: 1
      template:
        spec:
          containers:
            - name: pytorch
              image: pytorch/pytorch:2.4.0
              resources:
                limits:
                  nvidia.com/gpu: 4
    Worker:
      replicas: 3
      # ...
```

```yaml
# v2 (TrainJob) — equivalent, much simpler
apiVersion: trainer.kubeflow.org/v1alpha1
kind: TrainJob
spec:
  runtimeRef:
    kind: ClusterTrainingRuntime
    name: torch-distributed
  trainer:
    numNodes: 4  # Master + 3 Workers = 4 nodes
    resourcesPerNode:
      limits:
        nvidia.com/gpu: "4"
```
Conclusion: The Age of Mature AI Infrastructure
Kubeflow Trainer v2 signals that the AI infrastructure community has moved from "everyone builds their own wheels" to "let's co-build a standard layer together."
The core value isn't just API unification — it's separation of concerns: AI researchers focus on algorithms, infrastructure engineers manage the platform, and the K8s ecosystem (JobSet, Kueue, LeaderWorkerSet) handles the repetitive heavy lifting.
In 2026, with LLM fine-tuning demand exploding, Kubeflow Trainer v2 is becoming the de facto standard for AI training on Kubernetes. If your team is still hand-writing PyTorchJob YAML, it's time to migrate.
Key numbers: v2.1 supports K8s 1.29+, Kueue 0.9+, PyTorch 2.7. Python SDK is PyTorch official ecosystem certified (July 2025). TAS (Topology-Aware Scheduling) maximizes NVLink bandwidth utilization.
References: Kubeflow Trainer v2 official documentation, JobSet project, Kueue documentation, PyTorch distributed training guide