I’ve been running GPU inference workloads for about two years now and for most of that time I had the same problem: every time I wanted to move a workload to a different provider, I was essentially starting from scratch on the deployment config.
Not because the actual workload changed. The code was the same, the container was the same. But all the infrastructure glue — the scheduling constraints, the node selectors, the provider-specific API calls, the health check logic — was baked into the config in ways that assumed a specific provider’s environment. Moving meant unpicking all of that and rebuilding it for wherever we were going.
I tried a few things to fix this.
Terraform helped with provisioning but didn’t solve the actual problem. I could terraform my way to nodes on a different provider, but I still had to tell each workload where to run, and update that every time things changed.
I tried writing an abstraction layer that sat between our deployment scripts and the provider APIs. That worked for a while. Then a provider updated their endpoint and broke it on a Friday afternoon and I spent the weekend fixing something that had nothing to do with our actual product.
The thing that actually fixed it was separating what a workload needs from where it runs.
I’ve been using Yotta Labs for a few months now and the specific thing that changed my workflow is their Launch Templates. The idea is pretty simple: instead of specifying “run this on an H100 at provider X in this region,” you specify what the workload needs — container image, resource requirements, environment variables, ports, storage mounts — and a scheduler figures out where to put it across whatever providers are in the network.
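To make that concrete, here’s the shape of a workload-level declaration. This is an illustrative sketch with invented field names, not Yotta’s actual template schema — the point is what the manifest contains and, more importantly, what it doesn’t:

```yaml
# Hypothetical workload manifest — field names are illustrative,
# not the real Launch Template schema.
name: llm-inference
container:
  image: registry.example.com/llm-server:v1.4
  ports:
    - 8000
env:
  MODEL_NAME: llama-3-70b
  MAX_BATCH_SIZE: "32"
resources:
  gpu:
    count: 4
    min_vram_gb: 80   # "any GPU with at least 80 GB", not "H100 at provider X"
  memory_gb: 256
storage:
  - mount: /models
    size_gb: 500
```

Notice what never appears: a provider name, a region, a node pool. Placement is the scheduler’s problem, not the manifest’s.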
In practice this means when H200s are sold out at one provider it routes to available capacity elsewhere. When I want to try a different provider I add it at the infrastructure level and existing templates just work. When a provider changes something I don’t care because my workload definition doesn’t reference that provider.
One thing worth mentioning because it confused me initially: these are not the same as AWS Launch Templates. AWS Launch Templates are EC2 instance configuration — they define how to launch a specific instance type with specific AMIs and security groups. Yotta’s Launch Templates are workload-level deployment manifests. Completely different thing, unfortunate naming overlap.
The migration from my previous setup was less work than I expected. Mostly it was removing things — stripping out the provider-specific scheduling config that was never necessary, replacing it with a requirements declaration. The container images didn’t change. The application code didn’t change. I just stopped hardcoding where things run.
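As a rough before/after, the change looked something like this (again, invented field names — the shape of the diff is what matters, not the exact keys):

```yaml
# Before: provider-specific glue baked into every deployment config
nodeSelector:
  cloud.provider/instance-type: gpu-h100-8x
  topology.provider/region: us-east-4
tolerations:
  - key: provider.example.com/gpu-reserved
    operator: Exists

# After: a declaration of requirements; placement is left to the scheduler
resources:
  gpu:
    count: 8
    min_vram_gb: 80
```

Everything in the “before” block exists only because the config assumed one specific provider’s environment. The “after” block says the same thing about the workload without naming anyone.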
Six months in and I haven’t touched a deployment config because of a provider change. Which sounds like a small thing until you remember how many weekends I spent doing exactly that.