Your team just hit VRAM OOM during a demo prep run. The A100 40GB you provisioned for a Llama-3-70B deployment looked fine on paper until the KV cache ballooned at 8K context. You could throw two H100s at it and move on, or you could run the 30 seconds of arithmetic you skipped before provisioning.
Four decisions separate teams that run GPUs above 70% utilization from those idling at 35% while paying full price: workload classification, VRAM calculation, instance selection, and pricing model alignment. Get any of them wrong, and you’ll either hit a production ceiling or burn budget on capacity you can’t fill. Once all four are locked in, deployment is the execution step that wires them together.
Start with your workload class, not the GPU spec sheet
Workload classification comes first because training, fine-tuning, and inference each leave a different compute signature on the hardware, and that signature is what tells you which GPU to rent. The same Llama-3-70B model behaves like three different problems depending on what you’re doing with it, and the cheapest viable instance changes accordingly.
Full training is the heaviest of the three because every parameter is in motion at once. Your GPU spends most of its time executing Allreduce across data-parallel replicas and shuttling optimizer states between High-Bandwidth Memory (HBM) and compute units, sustained over hours or days. The memory cost compounds quickly: a model trained with AdamW in mixed precision stores weights, gradients, first moments, and second moments, totaling 16-18 bytes per parameter depending on whether gradients are kept in FP16 or FP32. That’s why memory capacity caps your maximum batch size per device and memory bandwidth caps how fast weight updates land, and it’s also why most teams running on cloud GPUs avoid full training whenever a cheaper path exists.
That cheaper path is usually fine-tuning with LoRA, which keeps most of the base model out of the optimizer entirely. By freezing the base weights and training only low-rank decomposition matrices, LoRA collapses the parameter count that AdamW has to track: with rank=16 on Llama 3 8B, you’re training roughly 42 million parameters instead of 8 billion. The base model stays in BF16 (or FP16) on-device, the adapters themselves are negligible in size, and optimizer states only cover the trainable slice, which drops total VRAM to around 20GB for an 8B model. That’s a footprint a single A100 80GB can hold with room left for forward-pass activations, turning a multi-GPU job into a single-card one. Runpod’s LLM fine-tuning GPU guide covers this workload class in depth.
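You can sanity-check that 42M figure from public config values alone. The sketch below uses Llama 3 8B's published dimensions (32 layers, hidden size 4096, 8 KV heads, intermediate size 14336) and assumes LoRA targets all seven linear projections per decoder block, which is one common setup rather than the only one; swap in your own model's config.json values to re-run it.

```python
# Rough LoRA trainable-parameter count for a decoder-only transformer.
# Dimensions are Llama 3 8B's published config values; substitute your model's
# config.json numbers (hidden_size, num_hidden_layers, num_key_value_heads, ...).
hidden_size = 4096
intermediate_size = 14336
num_layers = 32
num_q_heads = 32
num_kv_heads = 8
head_dim = hidden_size // num_q_heads   # 128
rank = 16

# (in_features, out_features) for each linear layer LoRA commonly targets
projections = [
    (hidden_size, num_q_heads * head_dim),    # q_proj
    (hidden_size, num_kv_heads * head_dim),   # k_proj
    (hidden_size, num_kv_heads * head_dim),   # v_proj
    (num_q_heads * head_dim, hidden_size),    # o_proj
    (hidden_size, intermediate_size),         # gate_proj
    (hidden_size, intermediate_size),         # up_proj
    (intermediate_size, hidden_size),         # down_proj
]

# Each LoRA adapter adds an (in x rank) A matrix and a (rank x out) B matrix.
per_layer = sum(rank * (d_in + d_out) for d_in, d_out in projections)
trainable = per_layer * num_layers
print(f"Trainable LoRA params: {trainable / 1e6:.1f}M")   # ~41.9M at rank 16
```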
Inference flips the constraint again, because once training is done, the optimizer disappears and the bottleneck moves from capacity to bandwidth. The shape of that bottleneck depends on how you serve: batch inference maximizes throughput per dollar by packing more sequences into each forward pass and tolerating the latency needed to fill the batch, while real-time inference targets TTFT (time-to-first-token), which is FLOPS-limited during the prefill phase. Once prefill finishes, though, the workload changes character: the model enters the decode phase, where it generates one token at a time and inter-token latency scales with how fast the GPU can stream the KV cache off HBM. That’s the regime where memory bandwidth, not raw compute, sets the ceiling, and it’s why an H100 SXM’s 3.35 TB/s HBM3 bandwidth serves tokens faster than an A100’s 2.0 TB/s, with the gap widening as the KV cache grows with sequence length and batch size.
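A back-of-the-envelope roofline makes that bandwidth ceiling concrete: at batch size 1, every decoded token streams the full weight set plus the KV cache out of HBM, so the best case is bandwidth divided by bytes read per token. The numbers below are illustrative rather than measured throughput; the 5GB KV cache figure is an assumption, and real serving stacks batch requests to amortize the weight reads.

```python
# Decode-phase ceiling estimate: tokens/s <= HBM bandwidth / bytes read per token.
# At batch size 1, each decoded token reads all weights plus the KV cache once.
def decode_ceiling_tok_s(params_b, bytes_per_param, kv_cache_gb, hbm_tb_s):
    bytes_per_token = params_b * 1e9 * bytes_per_param + kv_cache_gb * 1e9
    return hbm_tb_s * 1e12 / bytes_per_token

# Llama-3-70B at BF16 with ~5GB of KV cache in flight (illustrative assumption):
print(f"A100 80GB: {decode_ceiling_tok_s(70, 2, 5, 2.0):.0f} tok/s ceiling")
print(f"H100 SXM:  {decode_ceiling_tok_s(70, 2, 5, 3.35):.0f} tok/s ceiling")
# Batching raises aggregate throughput because one weight read is amortized
# across every sequence in the batch; per-sequence speed stays bandwidth-bound.
```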
Data modality then layers a second axis on top of those three signatures, because the workload class tells you what’s happening on the GPU but not what’s filling its memory, and modalities fill memory very differently. LLMs concentrate the pressure on context length: they’re KV-cache-bound, with VRAM scaling against the number of tokens in flight, so an 8B model serving 32K-token sessions can need more memory than the same 8B model serving 2K-token chats. Diffusion models like SDXL push on the opposite lever, staying modest in parameter count (the SDXL base model sits at approximately 3.5B parameters across UNet, text encoders, and VAE, with the refiner bringing the full two-stage pipeline to roughly 6.6B) but ballooning with image resolution and batch size as the latent activations grow. Multimodal models like LLaVA sit at the intersection of those two pressures and pay both costs: the vision encoder produces image embeddings that inflate the effective sequence length before the language model ever sees the input, so the KV cache starts larger than a text prompt of the same nominal length would suggest, and you’ll hit VRAM limits at batch sizes that would serve a same-size pure-LLM without complaint.
Calculate your VRAM before you provision
Once you know your workload class and modality, the next question is how much memory the job actually needs, and that turns into a short arithmetic exercise before any instance gets provisioned. The inference VRAM formula is:
VRAM = (N_params x bytes_per_param) + KV_cache_size + framework overhead (10-15%)
The KV cache size formula is:
KV_cache_size = 2 x num_layers x num_heads x head_dim x seq_len x batch_size x bytes_per_element
Note that num_heads for GQA models refers to the KV head count, not the query head count (e.g., 8 for Llama-3-70B, not 64). You can find num_layers, num_heads (as num_key_value_heads), and head_dim in the model’s config.json on HuggingFace Hub.
Example for Llama-3-70B at 4K context, batch size 8 (the sketch after this list reproduces the arithmetic):
- Weights at BF16: 70B x 2 bytes = 140GB
- Weights at INT4 via bitsandbytes: 70B x 0.5 bytes = 35GB
- KV cache at BF16: 2 x 80 layers x 8 KV heads x 128 head_dim x 4096 tokens x 8 batch x 2 bytes = approximately 10.7GB
- Framework overhead at BF16: 140GB x 0.12 = approximately 17GB
- Total at BF16: approximately 168GB (exceeds the 160GB of 2x H100 80GB at this batch size, so plan on 4-way tensor parallelism or trim batch and context to squeeze onto two cards)
- Total at INT4: approximately 35GB + 10.7GB KV cache + 5GB overhead = approximately 51GB (fits one A100 80GB)
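The same arithmetic packaged as a small helper, if you want to re-run it for other models: plug in the config.json values (num_hidden_layers, num_key_value_heads, head_dim) and the precision, and it reproduces the example above to within a gigabyte of rounding. Treat it as an estimate; serving frameworks add their own allocation behavior on top.

```python
# Inference VRAM estimate = weights + KV cache + framework overhead (applied to the loaded model).
def inference_vram_gb(n_params_b, bytes_per_param, num_layers, num_kv_heads,
                      head_dim, seq_len, batch_size, kv_bytes=2, overhead=0.12):
    weights = n_params_b * 1e9 * bytes_per_param
    kv_cache = 2 * num_layers * num_kv_heads * head_dim * seq_len * batch_size * kv_bytes
    total = weights * (1 + overhead) + kv_cache
    return weights / 1e9, kv_cache / 1e9, total / 1e9

# Llama-3-70B values from its config.json: 80 layers, 8 KV heads, head_dim 128.
for label, bytes_per_param in [("BF16", 2), ("INT4", 0.5)]:
    w, kv, total = inference_vram_gb(70, bytes_per_param, num_layers=80,
                                     num_kv_heads=8, head_dim=128,
                                     seq_len=4096, batch_size=8)
    print(f"{label}: weights {w:.0f}GB + KV cache {kv:.1f}GB + overhead -> ~{total:.0f}GB")
```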
The table below gives you the minimum per-precision VRAM numbers for LLM inference. All values include approximately 12% framework overhead. KV cache is excluded because it varies with sequence length and batch size, so add 2-10GB for typical serving configurations, or significantly more for long-context (8K+) or high-concurrency deployments.
| Model Size | FP16/BF16 | INT8 | INT4 | Min Instance (FP16) | Min Instance (INT4) |
|---|---|---|---|---|---|
| 8B | ~18GB | ~9GB | ~5GB | A100 40GB | RTX 4090 24GB |
| 13B | ~29GB | ~14GB | ~8GB | A100 40GB | RTX 4090 24GB |
| 34B | ~76GB | ~38GB | ~19GB | A100 80GB | A100 40GB |
| 70B | ~157GB | ~78GB | ~40GB | 2x A100 80GB | A100 80GB |
These values cover inference weight loading only. If you’re fine-tuning instead, the numbers shift: full AdamW mixed-precision training multiplies FP16 weight VRAM by 8x, while LoRA at rank=16 adds only about 4GB of combined overhead (activations, intermediate gradients, and optimizer states) on top of the frozen base model. Adjusting rank scales that overhead roughly linearly: rank=8 halves it with some quality cost, rank=32 doubles it for more expressivity.
Here’s where that 8x multiplier comes from. AdamW in mixed precision stores five components per parameter:
- 2 bytes (FP16 weights)
- 2 bytes (FP16 gradients)
- 4 bytes (FP32 master weights)
- 4 bytes (FP32 first moment)
- 4 bytes (FP32 second moment)
That totals 16 bytes per parameter (18 bytes if your implementation keeps FP32 gradients separately). For an 8B model: 8B x 16 = 128GB minimum, which exceeds a single A100 80GB. This is exactly why LoRA’s reduction to approximately 42M trainable parameters at rank=16 on the same 8B model makes single-GPU fine-tuning viable.
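In code, the comparison looks like the sketch below; the ~42M trainable-parameter count and the ~4GB LoRA overhead figure come from the earlier sections, and activation memory is workload-dependent, so treat both outputs as rough floors rather than exact footprints.

```python
# Full AdamW mixed-precision training vs. LoRA fine-tuning for an 8B model.
N_FULL = 8e9        # all parameters trained
N_LORA = 42e6       # rank=16 trainable slice (see the earlier calculation)

BYTES_PER_PARAM = 2 + 2 + 4 + 4 + 4   # fp16 weights + fp16 grads + fp32 master + two moments
full_training_gb = N_FULL * BYTES_PER_PARAM / 1e9

# LoRA: frozen BF16 base model + optimizer state only for the adapter slice,
# plus the rough ~4GB for activations/intermediates at rank 16 quoted above.
lora_gb = N_FULL * 2 / 1e9 + N_LORA * BYTES_PER_PARAM / 1e9 + 4

print(f"Full AdamW fine-tune: ~{full_training_gb:.0f}GB")   # ~128GB -> multi-GPU territory
print(f"LoRA rank=16:         ~{lora_gb:.0f}GB")            # fits a single A100 80GB with headroom
```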
With your VRAM requirements calculated, the next step is matching them to actual hardware.
Match the GPU architecture to your workload class
A VRAM number on its own only tells you what fits, not what serves well, and two GPUs with the same 80GB sticker can give you very different throughput on the same model. Hardware specs differ enough across current GPU options that a poor choice creates production constraints you can’t optimize away later, so the next move is matching the workload signature from the first section to the architecture that actually runs it efficiently.
| GPU | VRAM | Memory BW | BF16 TFLOPS | Multi-GPU Link | Ideal Workload |
|---|---|---|---|---|---|
| H100 SXM 80GB | 80GB HBM3 | 3.35 TB/s | 989 | NVLink 4.0 (900 GB/s) | Large model training, high-concurrency inference |
| A100 80GB SXM | 80GB HBM2e | 2.04 TB/s | ~312 | NVLink 3.0 (600 GB/s) | Multi-GPU training, 34B+ inference |
| A100 80GB PCIe | 80GB HBM2e | 1.94 TB/s | ~312 | PCIe 4.0 (64 GB/s) | Single-card inference, LoRA fine-tuning |
| L40S 48GB | 48GB GDDR6 | 864 GB/s | ~362 | PCIe 4.0 (64 GB/s) | Diffusion + LLM combo inference |
| RTX 4090 24GB | 24GB GDDR6X | 1.0 TB/s | ~82.6 | PCIe 4.0 (64 GB/s) | Prototyping, quantized 7B-13B |
| AMD MI300X | 192GB HBM3 | 5.3 TB/s | ~1307 | Infinity Fabric (XGMI) | 70B+ BF16 single-card serving |
Start at the top of the table. The H100 SXM 80GB earns its price premium on any workload where inter-GPU communication, not raw compute, is what would otherwise constrain you: NVLink 4.0 delivers 900 GB/s bidirectional bandwidth within a node, roughly 14x PCIe 4.0, which translates to substantially faster Allreduce across eight GPUs. The math becomes concrete on a 70B tensor-parallel deployment across four H100s, where every forward pass exchanges activation tensors at layer boundaries across cards via all-reduce. NVLink absorbs that traffic; PCIe 4.0 at 64 GB/s turns it into the bottleneck.
If your job doesn’t need that interconnect, the A100 80GB is usually the right step down, and the choice between its two variants follows directly from the same bandwidth question. The PCIe variant delivers 1.94 TB/s of memory bandwidth versus the SXM’s 2.04 TB/s, close enough on a single card that memory-bound serving sees only marginal differences, so the PCIe variant runs 20-30% cheaper and fits single-card inference up to 34B at INT8 and LoRA fine-tuning of 8B-13B models. The SXM premium only pays off once you scale across cards, where NVLink 3.0 (600 GB/s) provides a 9.4x bandwidth advantage over PCIe 4.0 for tensor-parallel and Allreduce traffic.
The L40S sits one tier below the A100 on memory bandwidth and one tier above on rendering silicon, which gives it a narrower but real niche. Its GDDR6 memory tops out at 864 GB/s, putting raw LLM inference throughput below an A100 80GB on memory-bound workloads, but the Ada Lovelace rasterization silicon makes it the right pick for mixed pipelines that combine image generation (ComfyUI, SDXL) with LLM text generation. It fits SDXL at full resolution alongside a 34B LLM in INT4 at a cost-per-hour that’s competitive for that specific combination.
Below the L40S, the RTX 4090 24GB belongs in a different category entirely: prototyping, not production. At INT4 via bitsandbytes, it serves a quantized 13B model with meaningful throughput, but the 24GB VRAM ceiling and NVIDIA EULA restrictions on datacenter use of GeForce GPUs keep it in the development and quantization-testing tier. Graduate to an A100 80GB once the workload moves to production.
The AMD MI300X is the outlier in this lineup, and its case is narrow but compelling: a single card running Llama-3-70B in BF16. The 192GB HBM3 pool fits the full model with room for a usable KV cache, removing the complexity of a 4-GPU tensor-parallel setup, and Runpod’s MI300X vs H100 benchmark on Mixtral shows where that memory advantage translates into real throughput gains. The catch is the software side: ROCm 6+ has made PyTorch workable for standard training and inference, and ROCm became a first-class platform in vLLM as of early 2026 with prebuilt wheels, but custom CUDA extensions, Flash Attention variants, and Triton kernels still need to be checked against the ROCm HIP compatibility table and the vLLM ROCm compatibility matrix before you commit, and tested on an actual MI300X instance before production.
Networking: when interconnect becomes the bottleneck
The NVLink 4.0 vs PCIe 4.0 gap covered above is the within-node story; it’s only half of the interconnect picture once you scale beyond one chassis. The other half is what happens between nodes, and the two scales fail in different ways.
Within a single node, the parallelism strategy decides how much that NVLink-vs-PCIe gap actually costs you. Tensor-parallel inference exchanges activations across all GPUs on every forward pass and is exquisitely sensitive to the gap, which is why H100 SXM nodes exist. Pipeline-parallel inference, by contrast, hands a single activation tensor from one stage to the next in one direction, so PCIe 4.0 is often adequate, and the SXM premium stops paying for itself.
Across nodes, the relevant comparison is InfiniBand NDR at 400 Gb/s vs 100GbE Ethernet, and the cost shows up in synchronous data-parallel training where Allreduce gradient sync scales with model size and node count. A 70B run with 2-byte gradients moves 140GB per Allreduce step: roughly 11 seconds over 100GbE, under 3 seconds over InfiniBand NDR, and the Ethernet penalty grows with each node added. The practical heuristic: if your model fits on a single node for inference or LoRA fine-tuning (4x A100 80GB = 320GB covers 70B inference at BF16 with room for KV cache, or LoRA fine-tuning of the same model), stay there. Cross-node setup adds operational complexity that only memory constraints can justify.
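Those timing figures come from straight division, which you can redo for your own model and fabric. The sketch below ignores ring-Allreduce scheduling, gradient bucketing, and protocol overhead, so its outputs are optimistic floors rather than measured sync times.

```python
# Lower-bound time to move one full gradient sync across the cluster fabric.
# Ignores ring-Allreduce scheduling and protocol overhead; real numbers are worse.
def allreduce_floor_seconds(n_params_b, grad_bytes, link_gbit_s):
    payload_gbit = n_params_b * grad_bytes * 8    # params (billions) * bytes * 8 bits
    return payload_gbit / link_gbit_s

for fabric, gbit_s in [("100GbE", 100), ("InfiniBand NDR", 400)]:
    t = allreduce_floor_seconds(70, 2, gbit_s)
    print(f"70B BF16 gradients over {fabric}: ~{t:.1f}s per sync step")
```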
One footgun lives below both of those layers. NCCL silently falls back to CPU-mediated transfers when direct GPU P2P isn’t available, cutting Allreduce throughput 30-40% versus correctly configured PCIe P2P (and far more versus NVLink). nvidia-smi topo -m flags this with PHB paths between GPUs; on some PCIe-only nodes, the fallback is unavoidable and needs to be priced into your projections. Verify topology and set NCCL P2P behavior explicitly before launching distributed training; the deployment section below covers the exact commands.
Align your pricing model to your usage pattern
Picking the right instance only solves half the cost problem; the other half is how you pay for it, because demand fluctuates while capacity doesn’t, and most GPU deployments idle for long stretches at full per-hour rates. The fix is matching the pricing tier to the usage pattern, and Runpod’s three tiers correspond to three patterns most teams actually run.
The first pattern is light or intermittent use, which is where pay-as-you-go with per-second billing pays off. A 30-minute fine-tuning experiment billed per second costs materially less than the same run billed by the hour, and at ten experiments a day the delta compounds, so PAYG is the right default for experimentation and any workload running under four hours per day. Check Runpod’s pricing page for current rates, since spot prices shift with capacity.
Once usage crosses into sustained load above roughly eight hours per day, that calculus inverts: per-second billing now charges premium rates on time the instance was going to be busy anyway. Reserved capacity is the answer for continuous training jobs or persistent inference endpoints, trading flexibility for meaningful per-hour savings and removing interruption risk from your critical path.
The third pattern, bursty API traffic, doesn’t fit either tier well: continuous reservation wastes budget at 3 am, and PAYG-per-second still pays for idle time between requests. Serverless endpoints bill per request and scale to zero between them, so cost stays proportional to actual usage when traffic swings from 10,000 requests at launch to 200 overnight. The tradeoff is cold-start latency (60-180 seconds for a 70B model load), which is fine for batch APIs but requires a minimum worker count of one for user-facing endpoints; Runpod’s serverless vLLM guide covers the full deployment pattern.
One lever cuts across all three tiers: quantization can change which instance class you’re paying for in the first place. INT4 via bitsandbytes shrinks weight VRAM roughly 4x versus BF16, which is often enough to drop down a class, and the per-hour saving compounds across whichever pricing tier you’re on. Llama-3-70B in BF16 needs approximately 168GB and a multi-GPU tensor-parallel setup; at INT4, it fits a single A100 80GB at approximately 45-51GB. The catch is task sensitivity: generation and summarization typically see minimal accuracy loss from INT4, while reasoning, long-context retrieval, and code generation show measurable degradation, so verify by running 50-100 representative prompts side-by-side on BF16 and INT4 builds (or a task suite from EleutherAI’s lm-evaluation-harness) before you commit. Runpod’s quantization guide covers the full quality tradeoff analysis.
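A minimal version of that side-by-side spot check might look like the sketch below, which loads BF16 and NF4 builds of the same model through transformers and bitsandbytes and prints paired outputs for manual review. It uses the 8B Instruct model as a stand-in so both builds fit one card; a 70B BF16 comparison needs the multi-GPU setup from the deployment section, and gated models need the HF_TOKEN setup covered there too.

```python
# Side-by-side BF16 vs INT4 (NF4) spot check on a handful of representative prompts.
# Assumes transformers + bitsandbytes installed; 8B model used so both builds fit one GPU.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "meta-llama/Meta-Llama-3.1-8B-Instruct"
prompts = ["Summarize the tradeoffs of INT4 quantization in two sentences."]  # use 50-100 real prompts

tok = AutoTokenizer.from_pretrained(model_id)
builds = {
    "bf16": dict(torch_dtype=torch.bfloat16, device_map="auto"),
    "int4": dict(quantization_config=BitsAndBytesConfig(
        load_in_4bit=True, bnb_4bit_quant_type="nf4",
        bnb_4bit_compute_dtype=torch.bfloat16), device_map="auto"),
}

for name, kwargs in builds.items():
    model = AutoModelForCausalLM.from_pretrained(model_id, **kwargs)
    for prompt in prompts:
        ids = tok(prompt, return_tensors="pt").to(model.device)
        out = model.generate(**ids, max_new_tokens=128, do_sample=False)
        print(f"[{name}] {tok.decode(out[0], skip_special_tokens=True)}")
    del model
    torch.cuda.empty_cache()   # free VRAM before loading the next build
```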
With a pricing model aligned to your usage pattern, the final step is deploying the container that translates your instance selection into a running endpoint.
Deploy from container to serving endpoint
Start with the base image, because a mismatched CUDA stack is the most common silent failure when a container moves between instance types. NVIDIA’s NGC containers (e.g., nvcr.io/nvidia/pytorch:25.x-py3 at the latest stable tag) pin CUDA and cuDNN versions tested against specific GPU architectures, so pin the full image tag in your Dockerfile and test on the target instance class before pushing to production.
With the base image fixed, the next choice is the serving framework. vLLM handles multi-GPU tensor-parallel inference, with PagedAttention allocating KV cache dynamically instead of reserving a worst-case slab up front. The --gpu-memory-utilization 0.90 flag caps the model executor at 90% of GPU memory (weights, activations, and KV cache blocks combined), leaving 10% free for framework overhead and preventing OOM at peak load.
Here’s a minimal vLLM deployment for Llama-3.1-70B across four GPUs. Gated models require license acceptance on HuggingFace Hub and HF_TOKEN set in your environment (covered below).
```bash
python -m vllm.entrypoints.openai.api_server \
  --model meta-llama/Meta-Llama-3.1-70B-Instruct \
  --tensor-parallel-size 4 \
  --gpu-memory-utilization 0.90 \
  --max-model-len 4096
```
That starts a 4-GPU tensor-parallel server with an OpenAI-compatible API endpoint. Verify the HuggingFace model ID before deploying, since Meta updates names across Llama versions; Runpod’s vLLM optimization guide covers workload-specific --gpu-memory-utilization tuning and GuideLLM throughput benchmarking.
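Once the server is up, any OpenAI-compatible client can hit it. A quick smoke test, assuming the default port 8000 and no API key configured on the server (the client still requires a placeholder key):

```python
# Smoke-test the vLLM endpoint with the OpenAI client (default port 8000, no auth configured).
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
resp = client.chat.completions.create(
    model="meta-llama/Meta-Llama-3.1-70B-Instruct",   # must match the --model flag
    messages=[{"role": "user", "content": "Say hello in five words."}],
    max_tokens=32,
)
print(resp.choices[0].message.content)
```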
For distributed training instead of serving, Ray Train with a TorchTrainer handles worker discovery and process group initialization on Runpod’s elastic training clusters. ray.init(address="auto") connects to an existing cluster (head node + workers), which must already be running; provision one via Runpod’s cluster console and grab the head node address from the dashboard.
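A minimal skeleton of that pattern, assuming a Ray cluster is already up and reachable from the node you run this on; the worker function body is a placeholder for your actual training loop:

```python
# Minimal Ray Train skeleton: connect to an existing cluster and fan a training
# loop out across 4 GPU workers. The loop body is a placeholder for real training code.
import ray
from ray.train import ScalingConfig
from ray.train.torch import TorchTrainer

def train_loop_per_worker(config):
    import torch.distributed as dist
    # Ray Train has already initialized the process group and pinned this worker to a GPU.
    print(f"worker {dist.get_rank()} of {dist.get_world_size()} ready")
    # ... build model and dataloader, run epochs, report metrics via ray.train.report(...)

ray.init(address="auto")   # attach to the head node provisioned via the cluster console
trainer = TorchTrainer(
    train_loop_per_worker,
    scaling_config=ScalingConfig(num_workers=4, use_gpu=True),
)
result = trainer.fit()
```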
On PCIe-only nodes, training also needs explicit NCCL P2P configuration before launch:
```bash
# Check GPU topology -- NV2/NV3/NV4 indicates NVLink; PHB or SYS indicates PCIe paths
nvidia-smi topo -m

# Launch with P2P enabled, and NCCL debug output active
NCCL_P2P_DISABLE=0 \
NCCL_DEBUG=INFO \
torchrun --nproc_per_node=4 \
  --nnodes=1 \
  train.py
```
In NCCL_DEBUG output, channels reported via P2P confirm direct GPU-to-GPU transfers (over NVLink or PCIe peer-to-peer; the nvidia-smi topo -m output above tells you which), while via SHM means traffic is staged through host shared memory, the CPU-mediated worst case for throughput.
Credentials management is the same either way: inject HF_TOKEN, model registry credentials, and API keys as runtime environment variables, never baked into Docker layers (where they persist in image history across rebuilds and survive updates). Runpod’s console and SDK both support runtime env injection, which also makes rotation straightforward.
Finally, verify the instance is actually earning its cost. Track VRAM with nvidia-smi dmon -s u for per-second metrics, or DCGM for fleet-level monitoring with Prometheus. If a serving instance sits below 60% VRAM utilization at peak traffic, you’re over-provisioned: drop a class or raise the batch size to improve throughput per dollar.
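For a programmatic version of that check (for example, inside a capacity report or an alerting job), NVML's Python bindings read the same counters nvidia-smi does. A sketch assuming the nvidia-ml-py package is installed:

```python
# One-shot VRAM and utilization snapshot via NVML (same counters nvidia-smi reads).
# Assumes the nvidia-ml-py package (imported as pynvml) is installed.
import pynvml

pynvml.nvmlInit()
for i in range(pynvml.nvmlDeviceGetCount()):
    handle = pynvml.nvmlDeviceGetHandleByIndex(i)
    mem = pynvml.nvmlDeviceGetMemoryInfo(handle)
    util = pynvml.nvmlDeviceGetUtilizationRates(handle)
    pct = 100 * mem.used / mem.total
    print(f"GPU{i}: {mem.used / 1e9:.1f}/{mem.total / 1e9:.1f} GB VRAM ({pct:.0f}%), "
          f"SM util {util.gpu}%")
pynvml.nvmlShutdown()
```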
Put it all together in four steps
Each of the four decisions above maps to one step, run in order: classify the workload, calculate the VRAM, match the GPU architecture, then align the pricing model.
To walk this path with your own model, start with the VRAM number. Open a Python shell with your model config loaded and run sum(p.numel() for p in model.parameters()) * 2 / 1e9 to get the BF16 weight size in gigabytes. Add 20% for framework overhead and KV cache at moderate sequence lengths, then cross-reference the VRAM table above to find the smallest Runpod instance that clears it.
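If you'd rather not download or materialize the weights just to count them, the same number is available from the config alone. A sketch assuming transformers and accelerate are installed (gated models still need HF_TOKEN set, as covered earlier):

```python
# Count parameters (and estimate BF16 weight size) without allocating any real memory.
# Assumes transformers + accelerate are installed; swap in your own model ID.
from accelerate import init_empty_weights
from transformers import AutoConfig, AutoModelForCausalLM

config = AutoConfig.from_pretrained("meta-llama/Meta-Llama-3.1-8B-Instruct")
with init_empty_weights():   # tensors live on the meta device: no RAM or VRAM is used
    model = AutoModelForCausalLM.from_config(config)

n_params = sum(p.numel() for p in model.parameters())
bf16_gb = n_params * 2 / 1e9
print(f"{n_params / 1e9:.1f}B params -> ~{bf16_gb:.0f}GB at BF16, "
      f"~{bf16_gb * 1.2:.0f}GB with overhead and a modest KV cache")
```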
If you want to skip the base image setup entirely, Runpod Hub carries pre-built templates for vLLM, Axolotl (fine-tuning), and ComfyUI (diffusion) with CUDA, cuDNN, and library versions pre-configured for the target workload. A template gets you from VRAM calculation to a live inference endpoint in under 15 minutes. Validate your instance choice against real traffic before committing to reserved capacity.
Pick your model, run the calculation, and start building on Runpod with no waitlist and no sales call required.
