When 8 GPUs Is All You Need

#ai #llm #infrastructure #devops

TL;DR: A 4-GPU server covers most production inference needs for 70B-200B models. An 8-GPU setup handles larger models and redundancy. You only need a multi-node cluster if you're pre-training from scratch or serving at hyperscale.

Most AI teams I talk to start the same way: they see what hyperscalers are selling, assume they need a cluster, and either overspend on compute they don't fully use, or underspec their first server and hit a wall three months in.

The wall is always the same. The model grows. Latency climbs. The team realizes the single GPU they started on was a proof of concept, not a production spec. Mid-project, mid-budget, rethinking everything.

For most inference workloads, 4 to 8 dedicated GPUs is where the math works.

The workloads that fit here

AI-based search platforms are the clearest case. If you're embedding an LLM into a search product, you're serving queries continuously, at low latency, with a model in the 70B to 200B parameter range. That workload needs memory bandwidth and consistency. A 4x or 8x H200 NVLink server holds the full model in VRAM, keeps GPU-to-GPU communication off the PCIe bus, and gives you predictable latency because no other tenant shares the hardware.

AI media analytics has the same profile: processing video metadata, running multimodal inference pipelines, classifying content at scale. These are continuous-throughput workloads that run around the clock, and dedicated hardware economics beat cloud once the pipelines stop being intermittent.
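To make the "stops being intermittent" point concrete, here is a minimal break-even sketch. The prices in the usage line are placeholders I made up for illustration, not quotes from any provider, and the function names are hypothetical. Plug in your own numbers.

```python
HOURS_PER_MONTH = 730.0  # average hours in a month (8760 / 12)


def break_even_hours(dedicated_monthly_cost: float, cloud_hourly_cost: float) -> float:
    """Hours of on-demand cloud use per month that cost as much as a flat-rate server."""
    return dedicated_monthly_cost / cloud_hourly_cost


def break_even_utilization(dedicated_monthly_cost: float, cloud_hourly_cost: float) -> float:
    """Fraction of the month above which the flat-rate server is cheaper."""
    return break_even_hours(dedicated_monthly_cost, cloud_hourly_cost) / HOURS_PER_MONTH


if __name__ == "__main__":
    # Placeholder prices, NOT real quotes: 10,000/month dedicated vs 40/hour on demand.
    hours = break_even_hours(10_000, 40)
    share = break_even_utilization(10_000, 40)
    print(f"Break-even at {hours:.0f} h/month ({share:.0%} utilization)")
```

A pipeline that runs around the clock sits at close to 100% utilization, so it clears almost any realistic break-even point. A pipeline that runs a few hours a week may not.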

Redundant dual DC setups belong in the conversation earlier than most teams think. Two 4x GPU servers across two EU datacenters give you active-active inference with geographic redundancy. For teams with uptime requirements or data residency obligations, this architecture is simpler to operate than a single large cluster, and your data stays in the EU locations you specify.
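A rough sketch of what active-active routing can look like at the client or gateway layer. The class name, the endpoint URLs, and the health-check callback are all hypothetical; in practice this logic usually lives in a load balancer or service mesh, not in hand-written code.

```python
import itertools
from typing import Callable, Iterable, Optional


class ActiveActiveRouter:
    """Round-robin across sites, skipping any site that fails its health check."""

    def __init__(self, endpoints: Iterable[str], is_healthy: Callable[[str], bool]):
        self._endpoints = list(endpoints)
        self._is_healthy = is_healthy
        self._cycle = itertools.cycle(self._endpoints)

    def pick(self) -> Optional[str]:
        """Return the next healthy endpoint, or None if every site is down."""
        # Try each endpoint at most once per call so a full outage cannot loop forever.
        for _ in range(len(self._endpoints)):
            candidate = next(self._cycle)
            if self._is_healthy(candidate):
                return candidate
        return None


if __name__ == "__main__":
    # Hypothetical endpoints; the health check here is a stand-in set lookup.
    up = {"https://inference.dc-a.example", "https://inference.dc-b.example"}
    router = ActiveActiveRouter(sorted(up), lambda url: url in up)
    print(router.pick())  # both sites share traffic while healthy
    up.discard("https://inference.dc-a.example")
    print(router.pick())  # traffic shifts to the surviving site
```

Both sites take traffic in normal operation, so each 4x server has to be sized to carry the full load alone if the other datacenter drops out.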

Why dedicated changes the calculation

On shared cloud infrastructure, the performance you actually get can degrade under load, because your workload competes with whatever else runs on that physical node. For inference, where time-to-first-token and tokens-per-second determine whether your product feels fast or broken, that unpredictability compounds.

On dedicated bare metal:

Spec	Detail
Memory bandwidth	H200 provides 4.8 TB/s of HBM3e memory bandwidth
GPU interconnect	NVLink keeps GPU-to-GPU traffic off the PCIe bus
Hardware sizing	CPU, RAM, and NVMe matched to the GPU config from day one
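Why memory bandwidth gets the first row: for a dense model, generating each token means reading roughly all the weights once, so bandwidth sets a ceiling on single-stream decode speed. Here is a back-of-the-envelope sketch using the 4.8 TB/s figure from the table. It assumes weights are sharded evenly across GPUs and ignores KV-cache reads, interconnect, and kernel overhead, so treat the result as an upper bound, not a benchmark. The function name is my own.

```python
H200_BANDWIDTH_TBPS = 4.8  # per-GPU HBM3e bandwidth, from the table above


def decode_tokens_per_sec_ceiling(params_billion: float,
                                  bytes_per_param: float,
                                  num_gpus: int) -> float:
    """Upper bound on single-stream decode speed for a dense model.

    Assumption: every generated token reads all weights once, and the weights
    are split evenly across `num_gpus` (tensor parallelism). Real throughput
    is lower; batching changes the picture by amortizing each weight read.
    """
    weight_bytes = params_billion * 1e9 * bytes_per_param
    aggregate_bandwidth = num_gpus * H200_BANDWIDTH_TBPS * 1e12  # bytes/s
    return aggregate_bandwidth / weight_bytes


if __name__ == "__main__":
    # 70B at FP16 (2 bytes/param) and 405B at FP8 (1 byte/param) on 4 GPUs.
    print(f"{decode_tokens_per_sec_ceiling(70, 2.0, 4):.0f} tokens/s ceiling")
    print(f"{decode_tokens_per_sec_ceiling(405, 1.0, 4):.0f} tokens/s ceiling")
```

Under these assumptions, a 70B FP16 model on 4 GPUs tops out around 137 tokens/s per stream, and a 405B FP8 model around 47. Anything that cuts into effective bandwidth cuts directly into those numbers.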

For teams with EU data residency requirements, dedicated infrastructure in EU datacenters means your training data and inference logs stay where your compliance team needs them.

Starting at 4, growing to 8

You don't have to start at 8. For 70B to 200B models, a 4x H200 NVLink server covers most production inference needs. With FP8 quantization and careful sharding, the same configuration can handle 405B-class workloads at moderate concurrency. That gives you room to validate your serving stack before expanding.
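The arithmetic behind that claim, as a quick sketch. It counts weights only and reserves a flat share of VRAM for KV cache and activations. The 141 GB per-GPU figure is the published H200 capacity; the 20% headroom is my assumption, and real headroom depends on context length and concurrency.

```python
H200_VRAM_GB = 141  # HBM3e capacity per H200 (public NVIDIA spec)

BYTES_PER_PARAM = {"fp16": 2.0, "fp8": 1.0}


def weight_gb(params_billion: float, precision: str) -> float:
    """Approximate weight footprint in GB (1 GB = 1e9 bytes)."""
    return params_billion * BYTES_PER_PARAM[precision]


def fits(params_billion: float, precision: str, num_gpus: int,
         headroom: float = 0.2) -> bool:
    """True if the weights fit while leaving `headroom` of total VRAM free.

    The 20% default is an assumption for KV cache and activations,
    not vendor sizing guidance.
    """
    usable_gb = num_gpus * H200_VRAM_GB * (1.0 - headroom)
    return weight_gb(params_billion, precision) <= usable_gb


if __name__ == "__main__":
    for params in (70, 200, 405):
        for precision in ("fp16", "fp8"):
            for gpus in (4, 8):
                verdict = "fits" if fits(params, precision, gpus) else "does not fit"
                print(f"{params}B {precision} on {gpus}x H200: "
                      f"{weight_gb(params, precision):.0f} GB weights, {verdict}")
```

With these assumptions, 200B at FP16 (400 GB) and 405B at FP8 (405 GB) both fit in the 564 GB of a 4x H200 server, while 405B at FP16 (810 GB) needs the 8-GPU configuration.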

The HPE ProLiant DL385 Gen11 supports configurations with up to 8 GPUs, so teams that plan slot and power headroom from day one can grow from 4 to 8 on the same server without a chassis change.

GPU	Right for
H200 NVLink	70B to 405B models, production inference, memory-heavy workloads
H100	Teams where ecosystem stability matters: vLLM and TensorRT-LLM have years of H100 optimization
RTX Pro 6000	Parallel inference on smaller models, visual computing, VDI, rendering alongside AI workloads

When you need more

Pre-training a frontier model from scratch requires more than 8 GPUs. That is where a multi-node cluster is genuinely needed, and where node-to-node interconnect becomes as much of a design constraint as the GPUs themselves.

True hyperscale inference, serving hundreds of millions of daily requests across many model variants, outgrows a single server.

Most teams building new AI products are in a different phase: proving latency targets, validating the model in production, getting the inference stack right. That work fits on 4 to 8 dedicated GPUs.

The right configuration depends on your model, your precision target, and your concurrency requirements. If you're speccing out an EU-based deployment, start here: Leaseweb GPU Servers

Disclaimer: I'm on the infrastructure team at Leaseweb. EU-native, Netherlands-owned.