If you run LLM inference in production, you eventually will ask yourself, should you rent a GPU and run the model yourself, or do you use a serverless API and pay per token? Everyone has an opinion. Far fewer people show you the actual numbers that decide it.
So I ran both. I put the same model, gpt-oss-120b, on two setups, self-hosted with vLLM on a single AMD MI300X GPU Droplet, and on DigitalOcean's Serverless Inference. Then I measured the cold start, the warm latency, and the cost, and worked out exactly where one becomes the better choice than the other.
In short, self-hosted GPU is faster and more consistent once it's warm, but it carries a real cold start and you pay for it around the clock. Serverless hides the cold start and costs almost nothing at low volume, but you pay per token. Which one wins comes down to your traffic shape and how much your model actually outputs. Here are the numbers.
How long is the cold start on a self-hosted GPU?
When you run a model yourself, the GPU doesn't hold the model permanently. The weights have to be loaded from disk into the GPU's memory and the inference engine has to initialize before it can answer a single request. That startup delay, the gap between "process launched" and "first token out," is the cold start. You pay it every time you start the server fresh, a new deploy, a restart after a crash, or a new node coming up to handle load.
My setup here was a single AMD MI300X GPU Droplet running gpt-oss-120b with vLLM, with the weights already cached on disk. I started vLLM from cold and timed how long it took before the first token came back.
It took about 61 seconds. That wasn't a one-off, either, across three restarts it landed between 60.8 and 61.4 seconds every time.
One thing worth being precise about, because it's the most common misread, that 61 seconds is not the time to download the model. The weights were already saved on disk, so this is the cost you pay on a restart or redeploy, not a one-time setup. The startup logs show where the time actually goes:
| Phase | Time |
|---|---|
| Load weights from disk into VRAM (~68 GB) | ~24 s |
torch.compile |
~4 s |
| CUDA graph capture | ~11 s |
| Engine init, KV cache, warmup | ~21 s |
So the 61 seconds includes compilation and warmup, the entire engine bring-up, right up to serving a request. What it excludes is the one-time download of the weights from Hugging Face, which you pay once and never again. (These phases come from one representative run and there's some overhead between stages, so they don't sum to exactly 61, but that's where the time lives.)
This also answers the obvious follow-up, what happens when you scale up? If a new replica boots from an image with the weights baked in, or mounts a shared volume that already has them, it pays this ~61-second load, not a download. A brand-new node with nothing staged would also have to pull the weights first, but well-run setups specifically avoid that, because you don't want every scale event re-downloading 68 GB. So 61 seconds is the realistic number for a properly configured restart or scale-up
Warm latency and throughput
Once the model is loaded, it's a different machine. Warm, the self-hosted MI300X returned the first token in about 322 ms and sustained roughly 154 tokens per second, and it was remarkably stable, across twenty requests, the spread was about two milliseconds.
Memory is worth a note, because the raw number looks alarming. The card showed about 173 GB of VRAM in use. But the weights themselves are only about 68 GB. vLLM reserves most of the rest up front as KV-cache headroom (roughly 100 GB of it) so it can serve many requests at once. A 120B model doesn't "need" 173 GB; the engine just claims the room ahead of time.
So the self-hosted trade-off is clean: once it's warm, it's fast, consistent, and entirely yours, but every cold start costs a full minute for our model, and you pay for the GPU whether or not anyone is using it.
Does serverless inference have a cold start?
Next I ran the identical test against Serverless Inference. Calling it is straightforward, you create a model access key and hit the OpenAI-compatible endpoint, so the client code is the same and only the base URL changes. I left the endpoint idle first, then measured first-token latency the same way.
The first token came back in about 546 ms, with a wider spread, anywhere from 446 ms to roughly 1.3 seconds across twenty runs. But there was no spin-up. I ran it twenty times after sitting idle and never caught a cold-start hit.
Two honest caveats on those numbers. First, the serverless requests travel over the network to DigitalOcean's endpoint, while the GPU test ran locally on the droplet, so some of that extra latency is network distance, not the model being slower. Here's the warm comparison side by side:
| Metric | Self-hosted MI300X | Serverless Inference |
|---|---|---|
| Median time to first token | ~322 ms | ~546 ms |
| Spread (20 runs) | ~2 ms | 446 ms – 1.3 s |
| Throughput | ~154 tok/s | (per-token billed) |
| Cold start | ~61 s | none observed |
So why no cold start on serverless? It didn't delete the cold start, it absorbed it. DigitalOcean pools GPU capacity across customers, so the model stayed warm without any effort from me, and I never paid the 61-second hit I took on my own box. The difference is the billing model: serverless isn't charged by the hour, it's charged per token.
To be fair, serverless isn't immune to cold starts. If you hit it during a genuinely quiet stretch, you can still catch one. The standard mitigations are sending periodic warm-up requests to keep a worker hot, or designing async-first so a slow first response doesn't matter. In this test I didn't need any of that, it just stayed warm.
When is serverless inference cheaper than your own GPU?
This is where the decision actually lives, and it comes down to arithmetic. The GPU is a flat cost, about $1.88 an hour for a single on-demand MI300X, the same whether it serves one request or a million. Serverless is usage-based, gpt-oss-120b is priced at $0.10 per million input tokens and $0.70 per million output tokens, so it costs almost nothing when you're quiet and climbs as you get busier.
The break-even point is your hourly GPU cost divided by your per-request cost:
break-even requests/hour = GPU $/hr ÷ [(input_tokens × $0.10/1M) + (output_tokens × $0.70/1M)]
The catch is that the per-request cost depends entirely on how much your model outputs, and that moves the crossover more than you'd expect. I measured it at three response lengths, on the same GPU at the same prices, changing only the output length:
| Response type | Output tokens | Crossover |
|---|---|---|
| Short (classification / extraction) | ~30 | ~18 requests/sec |
| Medium (paragraph answer) | ~220 | ~3 requests/sec |
| Long (code / detailed explanation) | ~1,200 | <1 request/sec |
That's roughly a 25x swing from identical hardware, with nothing changing but response length. Output tokens are the expensive side of the bill, so the chattier your app, the sooner owning a GPU pays for itself.
In plain terms, if your app sends short, snappy responses, serverless stays cheaper until you're well over a dozen requests per second, nonstop. If it writes long answers, the GPU starts winning below one request per second. (For comparison, Dedicated Inference, DigitalOcean's managed always-on endpoint, is billed by the GPU-hour like the Droplet but without you managing the it, so its economics sit closer to the self-hosted side of this table than the serverless side.) Drop your own response lengths and GPU rate into the formula and you'll find your exact line.
When to use serverless inference, and when not to
No "it depends." Here's the actual call.
For most teams, serverless is the right default. Bursty or spiky traffic, real idle stretches, a dev tool, an internal feature, a side project, anything async where nobody is staring at a spinner on the first request. In all of those, the cold start runs on someone else's pooled capacity, not yours, and you pay nothing while you're quiet. For that kind of traffic, it's almost perfect.
Run the GPU yourself when traffic is steady and high-volume, or when you have a latency SLA you can't miss. At that point your traffic rarely stops, so you're not benefiting from serverless's idle savings anyway, and you're using the GPU enough that flat hourly beats per-token. You keep it warm, so the cold start stops mattering. That's not a knock on serverless, it's just the wrong tool for that job.
If you're in between, start on serverless. Watch your token spend, and move to a dedicated GPU the day you cross the line for your response lengths. Don't buy a GPU to solve a problem you don't have yet.
Run it yourself
Everything here is reproducible. You can spin up an AMD GPU Droplet and run gpt-oss-120b on vLLM, hit the same model on Serverless Inference with a model access key, and check the serverless metrics and pricing pages against your own workload.
Don't take my crossover, run your own. And if you measure a cold start on your own setup, I'd genuinely like to see how the spread looks across different models and hardware.





Top comments (0)