DEV Community

Shannon Dias

Posted on

Reaching 3,140 Tok/s: Benchmarking the 96GB RTX Pro 6000 Blackwell

Fellow devs and sysadmins, the days of sharding massive models across multiple consumer GPUs just to get enough memory are coming to an end.

At Fit Servers, we've been analyzing the NVIDIA RTX Pro 6000 Blackwell, and the specs are squarely aimed at local AI developers and server engineers.

The Hardware
GPU Die: GB202 (Blackwell)

CUDA Cores: 24,064

VRAM: 96 GB GDDR7 ECC (1.8 TB/s bandwidth)

TDP: 600W Max (Server Edition is passive)

Why Developers Should Care: Native FP4
The biggest architectural shift is the 5th-Gen Tensor Cores with native FP4 support. A full 70B parameter model in FP4 quantization requires ~35–40 GB of VRAM. With 96GB, you have massive headroom for long context windows.
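The "~35–40 GB for a 70B model" figure follows directly from the bit width. Here's a back-of-the-envelope sketch; the function name and the assumption that overheads (quantization scales, activations, KV cache) come out of the remaining headroom are mine, not from any vendor tool.

```python
# Rough VRAM estimate for quantized model weights -- illustrative only.
# Real usage varies with the runtime, KV-cache layout, and quantization
# overhead (per-group scales, activation buffers, etc.).

def weight_vram_gb(params_billion: float, bits_per_param: float) -> float:
    """Memory needed for the model weights alone, in GB."""
    return params_billion * 1e9 * (bits_per_param / 8) / 1e9

fp4_70b = weight_vram_gb(70, 4)  # 70B parameters at 4 bits each
print(f"70B @ FP4: ~{fp4_70b:.0f} GB of weights")        # ~35 GB
print(f"Headroom on a 96 GB card: ~{96 - fp4_70b:.0f} GB")  # ~61 GB for KV cache etc.
```

That leftover ~61 GB is what makes the long-context use case viable: the KV cache for a large context window lives entirely in the remaining VRAM instead of spilling to a second GPU.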

In single-GPU inference, this card hits 3,140 tokens/sec, actually edging out the H100 SXM (2,987 tok/s) because the H100 lacks native FP4 hardware paths.
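A number like 3,140 tok/s is almost certainly aggregate throughput across concurrent requests, not a single chat stream: single-stream decoding is memory-bandwidth-bound, since each generated token reads every weight once. A hedged sketch of that ceiling, using the spec-sheet bandwidth and the FP4 weight size from above (the batch interpretation is my assumption, not stated in the benchmark):

```python
# Why a headline tok/s figure implies batching: one decode stream is
# limited by memory bandwidth, because every token touches all weights.
# Illustrative numbers: 1.8 TB/s GDDR7 bandwidth, ~35 GB of FP4 weights.

bandwidth_gb_s = 1800   # spec-sheet memory bandwidth
weights_gb = 35         # 70B model quantized to FP4

single_stream_tok_s = bandwidth_gb_s / weights_gb
print(f"Single-stream ceiling: ~{single_stream_tok_s:.0f} tok/s")  # ~51 tok/s

# Batching amortizes one weight read over many requests' tokens, which is
# how aggregate throughput can sit far above the single-stream ceiling.
implied_batch = 3140 / single_stream_tok_s
print(f"Implied effective batch size: ~{implied_batch:.0f}")
```

So for a lone interactive session expect tens of tokens per second; the 3,140 figure is what the card can serve in aggregate under load.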

The Catch
It draws 600W. You need a dedicated, full-size tower or a proper rack setup with heavy-duty power supplies to run this safely.

We’ve published the full technical spec sheet, thermal requirements, and detailed throughput benchmarks over at Fit Servers.

🔗 To read more, visit the blog and find your next server configuration: https://www.fitservers.com/blogs/nvidia-rtx-pro-6000-blackwell/
