## Quick Answer
The fastest GPU cloud provider depends on your workload. For raw throughput, Lambda's H100 instances win (2.4x faster than A100s for LLMs), but for cost-sensitive batch jobs, VoltageGPU's RTX 4090s deliver the best price/performance at $0.18/hr.
## The Problem
Cloud GPU pricing is opaque, and performance claims rarely match reality. When I needed to run 10,000 Stable Diffusion inferences last month, I wasted $300 testing providers before finding the right balance of cost and throughput. This benchmark covers what actually matters:
- Cold start times (critical for bursty workloads)
- Real token generation speeds (not just peak specs)
- Hourly vs per-second billing quirks
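Billing granularity matters more than the headline rate for short jobs. As a minimal sketch (with a hypothetical $2.50/hr rate, not any provider's quote), here is the same 5-minute job billed per second, in 1-minute chunks, and against an hourly minimum:

```python
import math

# Illustrative cost of a short job under different billing granularities.
# The $2.50/hr rate is a hypothetical round number, not a provider quote.
def job_cost(rate_per_hr, job_seconds, granularity_seconds):
    """Bill in chunks of `granularity_seconds`, rounding the job up."""
    billed_chunks = math.ceil(job_seconds / granularity_seconds)
    return rate_per_hr * (billed_chunks * granularity_seconds) / 3600

job = 5 * 60  # a 300-second job
per_second = job_cost(2.50, job, 1)     # per-second billing
per_minute = job_cost(2.50, job, 60)    # 1-minute chunks
hourly_min = job_cost(2.50, job, 3600)  # hourly minimum

print(per_second, per_minute, hourly_min)
```

For a job this short, the hourly minimum costs 12x more than per-second billing for identical compute, which is why the billing column in the comparison table below is worth reading as carefully as the price column.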
## Technical Deep-Dive

### Methodology
Tested all providers using:

```python
import time

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

def benchmark(model_id, prompt, max_new_tokens=100):
    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype="auto").cuda()
    inputs = tokenizer(prompt, return_tensors="pt").to("cuda")

    # Warmup: the first generate() call pays one-off CUDA initialization costs
    _ = model.generate(**inputs, max_new_tokens=1)
    torch.cuda.reset_peak_memory_stats()

    # Real test: synchronize so we time finished GPU work, not kernel launches
    torch.cuda.synchronize()
    start = time.perf_counter()
    outputs = model.generate(**inputs, max_new_tokens=max_new_tokens)
    torch.cuda.synchronize()
    latency = time.perf_counter() - start

    return {
        "tokens_sec": max_new_tokens / latency,
        "memory_usage_gb": torch.cuda.max_memory_allocated() / 1e9,
    }
```
**Key Controls:**
- Same model (Mistral-7B) across all tests
- 3 cold/warm runs per provider
- Measured VRAM usage to detect quantization differences
### Surprising Findings

#### Cold Start Penalty
AWS took 3.5 minutes to allocate a p4d.24xlarge (vs 18 seconds for VoltageGPU's API). For short jobs, this dominates total cost.

#### Token Generation Variance
Lambda's H100 delivered a consistent 143 tokens/sec, while some A100 providers fluctuated ±20% between runs due to noisy neighbors.

#### Memory Allocation Gotcha
RunPod's "24GB RTX 4090" actually had 23.5GB usable, enough to OOM when loading fp16 Mistral-7B with certain CUDA versions.
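A back-of-envelope check shows why the margin is tighter than "24GB" suggests. This sketch assumes Mistral-7B's roughly 7.24B parameters at 2 bytes each in fp16; the remaining headroom has to absorb the KV cache, activations, the CUDA context, and allocator fragmentation:

```python
# Back-of-envelope VRAM estimate: fp16 weights for a ~7.24B-parameter
# model vs. the usable memory the "24GB" card actually exposed.
params = 7.24e9          # approximate Mistral-7B parameter count
bytes_per_param = 2      # fp16
weights_gb = params * bytes_per_param / 1e9

usable_gb = 23.5         # observed usable VRAM on the RTX 4090
headroom_gb = usable_gb - weights_gb

print(f"weights ~ {weights_gb:.1f} GB, headroom ~ {headroom_gb:.1f} GB")
# KV cache, activations, the CUDA context, and fragmentation all come
# out of that headroom, which some CUDA versions tipped into OOM.
```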
## Provider Comparison
| Provider | GPU | $/hr | Tokens/sec | Cold Start | Per-Second Billing |
|---|---|---|---|---|---|
| Lambda | H100 80GB | $2.99 | 143 | 42s | ❌ Hourly minimum |
| VoltageGPU | RTX 4090 | $0.18 | 87 | 18s | ✅ |
| RunPod | A100 80GB | $2.49 | 112 | 1m12s | ✅ |
| AWS | p4d.24xlarge | $6.98 | 98 | 3.5m | ❌ 1-minute chunks |
| Together | H100 80GB* | $3.50 | 138* | Instant* | ❌ Token-based |
*Together's API pricing makes a direct comparison tricky; tested via their OpenAI-compatible endpoint.
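One way to collapse the $/hr and tokens/sec columns into a single number is cost per million generated tokens, assuming 100% utilization (an optimistic lower bound, since real workloads have idle time). A quick sketch using the table's figures:

```python
# Convert the table's $/hr and tokens/sec into $ per 1M generated tokens,
# assuming 100% utilization (optimistic lower bound).
providers = {
    "Lambda H100":     (2.99, 143),
    "VoltageGPU 4090": (0.18, 87),
    "RunPod A100":     (2.49, 112),
    "AWS p4d":         (6.98, 98),
}

costs = {}
for name, (usd_per_hr, tok_per_s) in providers.items():
    tokens_per_hr = tok_per_s * 3600
    costs[name] = usd_per_hr / tokens_per_hr * 1e6
    print(f"{name}: ${costs[name]:.2f} / 1M tokens")
```

By this measure the RTX 4090's lower absolute throughput stops mattering: its per-token cost comes out roughly 10x below the H100's, which is the price/performance gap the Quick Answer refers to.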
Source Links:
- [Lambda Pricing](https://lambdalabs.com/pric