DEV Community: Yash Sharma

When is Serverless Inference Cheaper than Your Self Hosted GPU? I Benchmarked gpt-oss-120b on Both

Yash Sharma — Tue, 23 Jun 2026 14:57:47 +0000

If you run LLM inference in production, you eventually will ask yourself, should you rent a GPU and run the model yourself, or do you use a serverless API and pay per token? Everyone has an opinion. Far fewer people show you the actual numbers that decide it.

So I ran both. I put the same model, gpt-oss-120b, on two setups, self-hosted with vLLM on a single AMD MI300X GPU Droplet, and on DigitalOcean's Serverless Inference. Then I measured the cold start, the warm latency, and the cost, and worked out exactly where one becomes the better choice than the other.

In short, self-hosted GPU is faster and more consistent once it's warm, but it carries a real cold start and you pay for it around the clock. Serverless hides the cold start and costs almost nothing at low volume, but you pay per token. Which one wins comes down to your traffic shape and how much your model actually outputs. Here are the numbers.

How long is the cold start on a self-hosted GPU?

When you run a model yourself, the GPU doesn't hold the model permanently. The weights have to be loaded from disk into the GPU's memory and the inference engine has to initialize before it can answer a single request. That startup delay, the gap between "process launched" and "first token out," is the cold start. You pay it every time you start the server fresh, a new deploy, a restart after a crash, or a new node coming up to handle load.

My setup here was a single AMD MI300X GPU Droplet running gpt-oss-120b with vLLM, with the weights already cached on disk. I started vLLM from cold and timed how long it took before the first token came back.

It took about 61 seconds. That wasn't a one-off, either, across three restarts it landed between 60.8 and 61.4 seconds every time.

One thing worth being precise about, because it's the most common misread, that 61 seconds is not the time to download the model. The weights were already saved on disk, so this is the cost you pay on a restart or redeploy, not a one-time setup. The startup logs show where the time actually goes:

Phase	Time
Load weights from disk into VRAM (~68 GB)	~24 s
`torch.compile`	~4 s
CUDA graph capture	~11 s
Engine init, KV cache, warmup	~21 s

So the 61 seconds includes compilation and warmup, the entire engine bring-up, right up to serving a request. What it excludes is the one-time download of the weights from Hugging Face, which you pay once and never again. (These phases come from one representative run and there's some overhead between stages, so they don't sum to exactly 61, but that's where the time lives.)

This also answers the obvious follow-up, what happens when you scale up? If a new replica boots from an image with the weights baked in, or mounts a shared volume that already has them, it pays this ~61-second load, not a download. A brand-new node with nothing staged would also have to pull the weights first, but well-run setups specifically avoid that, because you don't want every scale event re-downloading 68 GB. So 61 seconds is the realistic number for a properly configured restart or scale-up

Warm latency and throughput

Once the model is loaded, it's a different machine. Warm, the self-hosted MI300X returned the first token in about 322 ms and sustained roughly 154 tokens per second, and it was remarkably stable, across twenty requests, the spread was about two milliseconds.

Memory is worth a note, because the raw number looks alarming. The card showed about 173 GB of VRAM in use. But the weights themselves are only about 68 GB. vLLM reserves most of the rest up front as KV-cache headroom (roughly 100 GB of it) so it can serve many requests at once. A 120B model doesn't "need" 173 GB; the engine just claims the room ahead of time.

So the self-hosted trade-off is clean: once it's warm, it's fast, consistent, and entirely yours, but every cold start costs a full minute for our model, and you pay for the GPU whether or not anyone is using it.

Does serverless inference have a cold start?

Next I ran the identical test against Serverless Inference. Calling it is straightforward, you create a model access key and hit the OpenAI-compatible endpoint, so the client code is the same and only the base URL changes. I left the endpoint idle first, then measured first-token latency the same way.

The first token came back in about 546 ms, with a wider spread, anywhere from 446 ms to roughly 1.3 seconds across twenty runs. But there was no spin-up. I ran it twenty times after sitting idle and never caught a cold-start hit.

Two honest caveats on those numbers. First, the serverless requests travel over the network to DigitalOcean's endpoint, while the GPU test ran locally on the droplet, so some of that extra latency is network distance, not the model being slower. Here's the warm comparison side by side:

Metric	Self-hosted MI300X	Serverless Inference
Median time to first token	~322 ms	~546 ms
Spread (20 runs)	~2 ms	446 ms – 1.3 s
Throughput	~154 tok/s	(per-token billed)
Cold start	~61 s	none observed

So why no cold start on serverless? It didn't delete the cold start, it absorbed it. DigitalOcean pools GPU capacity across customers, so the model stayed warm without any effort from me, and I never paid the 61-second hit I took on my own box. The difference is the billing model: serverless isn't charged by the hour, it's charged per token.

To be fair, serverless isn't immune to cold starts. If you hit it during a genuinely quiet stretch, you can still catch one. The standard mitigations are sending periodic warm-up requests to keep a worker hot, or designing async-first so a slow first response doesn't matter. In this test I didn't need any of that, it just stayed warm.

When is serverless inference cheaper than your own GPU?

This is where the decision actually lives, and it comes down to arithmetic. The GPU is a flat cost, about $1.88 an hour for a single on-demand MI300X, the same whether it serves one request or a million. Serverless is usage-based, gpt-oss-120b is priced at $0.10 per million input tokens and $0.70 per million output tokens, so it costs almost nothing when you're quiet and climbs as you get busier.

The break-even point is your hourly GPU cost divided by your per-request cost:

break-even requests/hour = GPU $/hr ÷ [(input_tokens × $0.10/1M) + (output_tokens × $0.70/1M)]

The catch is that the per-request cost depends entirely on how much your model outputs, and that moves the crossover more than you'd expect. I measured it at three response lengths, on the same GPU at the same prices, changing only the output length:

Response type	Output tokens	Crossover
Short (classification / extraction)	~30	~18 requests/sec
Medium (paragraph answer)	~220	~3 requests/sec
Long (code / detailed explanation)	~1,200	<1 request/sec

That's roughly a 25x swing from identical hardware, with nothing changing but response length. Output tokens are the expensive side of the bill, so the chattier your app, the sooner owning a GPU pays for itself.

In plain terms, if your app sends short, snappy responses, serverless stays cheaper until you're well over a dozen requests per second, nonstop. If it writes long answers, the GPU starts winning below one request per second. (For comparison, Dedicated Inference, DigitalOcean's managed always-on endpoint, is billed by the GPU-hour like the Droplet but without you managing the it, so its economics sit closer to the self-hosted side of this table than the serverless side.) Drop your own response lengths and GPU rate into the formula and you'll find your exact line.

When to use serverless inference, and when not to

No "it depends." Here's the actual call.

For most teams, serverless is the right default. Bursty or spiky traffic, real idle stretches, a dev tool, an internal feature, a side project, anything async where nobody is staring at a spinner on the first request. In all of those, the cold start runs on someone else's pooled capacity, not yours, and you pay nothing while you're quiet. For that kind of traffic, it's almost perfect.

Run the GPU yourself when traffic is steady and high-volume, or when you have a latency SLA you can't miss. At that point your traffic rarely stops, so you're not benefiting from serverless's idle savings anyway, and you're using the GPU enough that flat hourly beats per-token. You keep it warm, so the cold start stops mattering. That's not a knock on serverless, it's just the wrong tool for that job.

If you're in between, start on serverless. Watch your token spend, and move to a dedicated GPU the day you cross the line for your response lengths. Don't buy a GPU to solve a problem you don't have yet.

Run it yourself

Everything here is reproducible. You can spin up an AMD GPU Droplet and run gpt-oss-120b on vLLM, hit the same model on Serverless Inference with a model access key, and check the serverless metrics and pricing pages against your own workload.

Don't take my crossover, run your own. And if you measure a cold start on your own setup, I'd genuinely like to see how the spread looks across different models and hardware.

We Got 2x LLM Inference Speed With Three Kubernetes Settings

Yash Sharma — Tue, 19 May 2026 09:48:44 +0000

Serving LLMs is not easy, especially when it comes to scalability, we have to optimise the infra to make sure we're actually able to serve inference to our customers.

And when you start scaling LLM inference on Kubernetes, two problems quietly show up and cost you real money. The first one is where you put the model weights, because these files are huge, and every pod needs them.

The second one is how fast your nodes can actually read those weights off shared storage, because if the network isn't tuned right, you leave a lot of throughput on the table.

In today's video, I'll walk you through how we solved both at DigitalOcean using a reference architecture we built, vLLM on DOKS, with Managed NFS for shared model storage. The whole thing is open source, Terraform, Kubernetes manifests, everything. Link in the description.

Let's go.

SECTION 1: ARCHITECTURE OVERVIEW

Let me start with the full picture, so the rest of the video makes sense.

Everything lives inside one VPC. Inside that VPC, there's a DOKS cluster with two node pools. The first is a management pool of regular droplets, these run system services and the model download job. The second is a GPU pool of H100 droplets, this is where vLLM actually runs.

Next to the cluster, there's a Managed NFS share. The download job writes the model weights to it once. Every vLLM pod mounts it and reads from it. That's the whole storage story in one sentence.

The whole thing is deployed as two Terraform stacks. Stack one builds the infrastructure VPC, cluster, NFS. Stack two deploys everything inside Kubernetes, the namespace, the persistent volume, the download job, vLLM, the gateway. They're split on purpose, so you can redeploy vLLM, swap models, or change replica counts without touching the underlying infrastructure.

Now let me explain the two decisions that matter most, why NFS, and the network tuning piece.

SECTION 2: WHY NFS FOR MODEL STORAGE

When you scale LLM inference, every pod needs the model weights. And these files are huge a 70B model is around 140 gigs.

There are usually three approaches people try.

The first is to download the weights from object storage on pod startup. So every pod, every restart, pulls the model from something like S3 or Spaces. For a 140-gig model, that's 15-20 mins on a cold node. Every autoscale event, you pay that cost again.

The second is to bake the weights into the container image. Now your image is 140 gigs. Slow to pull, painful to update, and switching models means rebuilding the image every time.

The third is block storage. This works fine for one pod, but block volumes are ReadWriteOnce — the moment you want multiple replicas, you're stuck.

What we actually want is "download once, use many" one copy of the weights, and every pod reads from it. That's exactly what NFS gives us. Specifically, DigitalOcean Managed NFS, which supports ReadWriteMany and is available in the same regions as our H100 droplets NYC2 and ATL1 right now.

The flow is simple. A Kubernetes Job runs once, pulls the model from HuggingFace onto the NFS share. Every vLLM pod mounts that share and reads from it. Scaling from one replica to three takes 20 to 30 seconds, because the weights are already there no re-downloading, no waiting.

A quick note on vLLM itself, since it's the serving layer. vLLM is an open-source inference engine, and it exposes an OpenAI-compatible API same

/v1/chat/completions

endpoint you're used to. For our architecture, vLLM just needs a folder with model files in it. It doesn't care that the folder is actually an NFS mount. That's why this setup works so cleanly.

SECTION 3: THE NETWORK TUNER

Okay, now the part that quietly matters the most

First let’s understand what MTU mean, it is called Maximum Transmission Unit (MTU) which defines the maximum size of a network packet that can be transmitted over a network interface without fragmentation.

By default, DOKS nodes use a 1500-byte MTU and pretty small TCP buffer sizes. For most workloads, that's totally fine. But for NFS reads of multi-gigabyte weight files, it leaves a lot of throughput on the table and we want to ensure to utilize it properly.

We benchmarked it. With default settings, we got around 420 MB/s loading model weights from NFS. With tuning applied, around 880 MB/s. That's roughly 2x faster — same hardware, same NFS share, same model.

Now let’s see Three changes get us there.

The first is jumbo frames. We bump the MTU on the private network from 1500 bytes up to 9000. When vLLM pods read file from NFS packet by packet for 140gigabyte model, its 93 million packets however when we make MTU to 9000 it’s 15 million.

Bigger packets means less overhead per packet during large transfers. One thing to note here — this only works on GPU droplets. Standard droplets don't support jumbo frames on the private network.

The second is bigger TCP buffers.

rmem_max, wmem_max, tcp_rmem, tcp_wmem

We raise all of these to 16 megabytes. This lets the kernel actually use the bandwidth that's available during high-throughput NFS reads.

The third is on the NFS mount itself — nconnect=8. Instead of opening one TCP connection per mount, each pod opens eight. More connections, more aggregate throughput.

And all of these tuning is done by a demonset on when a node freshly joins the cluster

Now here's the tricky part, and this is what I was hinting at in the intro.

When a fresh GPU node joins the cluster, two things want to run on it right away, the network tuner, and the vLLM pod. If the vLLM pod wins that race and mounts NFS first, TCP negotiates the packet size based on the old 1500-byte MTU. And here's the thing, that number never changes after the handshake. So even if the tuner runs five seconds later and raises the MTU to 9000, that specific NFS mount is locked at degraded speed for its entire lifetime. The only way to fix it is to remount.

The fix is a node taint.

When a GPU node joins the cluster, it comes up with a taint called network-not-tuned.

node.digitalocean.com/network-not-tuned:NoSchedule

That taint blocks every workload pod from being scheduled on the node. But the network tuner DaemonSet is built to tolerate it, so it schedules right away. It tunes the network sets jumbo frames, raises the TCP buffers and then removes the taint. Only after that do vLLM pods get to schedule.

So the guarantee is simple if a vLLM pod is running on a node, the network on that node was already tuned before the pod ever touched NFS.

This matters most with autoscaling. Every time the cluster brings up a new GPU node, it goes through this exact sequence. This helps us avoid the race conditions

SECTION 4: HA CHOICES — AND WHY

Before the demo, a few production choices worth explaining quickly, because the reasoning matters more than the config itself.

For rolling updates, we use maxSurge: 0, maxUnavailable: 1. The default behaviour is to spin up an extra pod during a rollout but on GPU-constrained clusters, you often don't have a spare H100 sitting around. So we'd rather accept one moment of reduced capacity than block the rollout waiting for a GPU that isn't coming.

The startup probe is set to 120 seconds, because loading a 70B model from NFS into VRAM takes real time 220sec in our case. Default health probes would kill the pod before it ever finished loading.

There's also a PreStop hook that drains in-flight requests before the pod terminates. vLLM batches requests on the GPU, so if you just send SIGTERM, those requests get dropped. The hook polls vLLM's metrics endpoint, waits until the queue is empty, then shuts down cleanly.

SECTION 5: DEMO

For quick demo checkout the video at

SECTION 6: WRAP

That's the architecture. Managed NFS for shared weights, a taint-plus-DaemonSet pattern to make sure the network is tuned before any NFS mount happens, and a few HA choices that make sense specifically for GPU-constrained clusters.

The full reference architecture is open source in the scale-with-simplicity repo. See you in the next one.