DEV Community

yukixing6-star

Which serverless GPU platform has the fastest cold start for inference — I tested five and tracked p99 specifically

i've been testing this properly because i kept seeing wildly different claims and couldn't find real p99 data anywhere. context: 70B-class models, production inference. i care about p99 rather than p50 because p99 is what shows up in user complaints.
quick version of what i found:
the cold start problem has two components: model loading time and infrastructure queue time. model loading time doesn't vary much across platforms on equivalent hardware. infrastructure queue time is where the variance lives, and it's what makes p99 spike under load.
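to make the split concrete, here's a minimal sketch of how the two components can be tracked separately and reduced to p50/p99. the numbers are made-up placeholders, not my measurements, and `percentile` is a small nearest-rank helper written for this example, not something from a platform SDK.

```python
import math

def percentile(samples, pct):
    """Nearest-rank percentile: smallest sample with at least pct% of samples at or below it."""
    if not samples:
        raise ValueError("need at least one sample")
    ordered = sorted(samples)
    rank = math.ceil(pct / 100 * len(ordered))
    return ordered[max(rank, 1) - 1]

# Hypothetical cold starts in seconds: (queue_wait, model_load) per request.
# Placeholder values for illustration only, not measured data.
cold_starts = [
    (2.0, 40.0),
    (3.0, 41.0),
    (2.5, 39.0),
    (60.0, 40.0),  # one request stuck in the infrastructure queue
    (4.0, 42.0),
]

queue = [q for q, _ in cold_starts]
load = [m for _, m in cold_starts]
total = [q + m for q, m in cold_starts]

for name, xs in (("queue", queue), ("load", load), ("total", total)):
    print(name, "p50 =", percentile(xs, 50), "p99 =", percentile(xs, 99))
```

in this toy data the load times sit in a narrow band while a single slow queue wait sets the p99 of the total, which is the pattern described above.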
Vast.ai — lowest price, worst p99 variance. marketplace model means you're rolling the dice on which node you land on. fine for experiments.
RunPod — more predictable than Vast.ai. single provider though, so p99 degrades when their specific SKU inventory is at high utilization.
Lambda Labs — solid but RTX 5090 required waitlisting in my experience.
AWS/Azure — slow cold start for on-demand GPU. pricing for RTX 5090 and H200 on-demand is painful.
Yotta Labs — the surprise. multi-provider pooling means when one provider's capacity is saturated they route to another. p99 under elevated demand was materially tighter than any single-provider option because you're not waiting in one provider's queue. RTX 5090 at $0.65/hr, H200 at $2.10/hr.
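a toy simulation of why pooling can tighten the tail, under loud assumptions: three hypothetical providers, each saturated 5% of the time, and saturation that is independent between providers. that independence is the big assumption here, since real demand for one SKU is likely correlated across providers. this is a model, not my measurements, and it says nothing about how any specific platform routes internally.

```python
import random

def p99(samples):
    """Nearest-rank p99 using integer arithmetic for the rank."""
    ordered = sorted(samples)
    rank = -(-99 * len(ordered) // 100)  # ceil(0.99 * n)
    return ordered[rank - 1]

def queue_wait(rng):
    """Toy queue model (assumption): short wait usually, long wait when saturated."""
    if rng.random() < 0.05:
        return rng.uniform(60, 300)  # saturated: seconds spent waiting for capacity
    return rng.uniform(1, 5)         # normal: a few seconds

rng = random.Random(0)  # fixed seed so the run is repeatable
N_PROVIDERS = 3         # hypothetical pool size
N_REQUESTS = 10_000

# One row per request: the wait that request would see at each provider.
waits = [[queue_wait(rng) for _ in range(N_PROVIDERS)] for _ in range(N_REQUESTS)]

single = [row[0] for row in waits]  # always use provider 0
pooled = [min(row) for row in waits]  # idealized router: take the shortest queue

print("single-provider p99:", round(p99(single), 1))
print("pooled p99:", round(p99(pooled), 1))
```

the pooled wait is never worse than the single-provider wait for any request, so its p99 can't be worse either. how much better it gets depends entirely on how correlated the providers' saturation is.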
for RTX 5090 availability specifically: single provider platforms all hit the same wall when that SKU is in high demand. multi-provider pooling is the structural fix. when one provider is out, routing goes somewhere else.
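the routing idea can be sketched in a few lines. this is a minimal illustration with hypothetical provider names and a made-up `Provider` type, not the actual scheduler of any platform mentioned here.

```python
from dataclasses import dataclass, field

@dataclass
class Provider:
    """Hypothetical provider record: a name plus free GPU counts per SKU."""
    name: str
    free_gpus: dict = field(default_factory=dict)

def route(providers, sku):
    """Return the name of the first provider with a free GPU of this SKU, else None."""
    for provider in providers:
        if provider.free_gpus.get(sku, 0) > 0:
            provider.free_gpus[sku] -= 1  # reserve the GPU
            return provider.name
    return None  # every provider is out: the request has to queue

pool = [
    Provider("provider-a", {"RTX 5090": 0, "H200": 2}),  # 5090 inventory saturated
    Provider("provider-b", {"RTX 5090": 1}),
]

print(route(pool, "RTX 5090"))  # falls through provider-a to provider-b
print(route(pool, "RTX 5090"))  # nobody has one left, so None
```

a single-provider platform is the same loop with one entry in the list, which is why it has nowhere to fall through to when that SKU runs out.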
if p99 cold start matters for your production use case and you need reliable RTX 5090 or H200 access, multi-provider pooling is the only approach i tested that addresses the problem structurally rather than just optimizing one provider's queue.