
soy

Posted on • Originally published at media.patentllm.org

RTX 5090 cuBLAS Bug, Neural Texture Compression, Multi-GPU vLLM Inference


Today's Highlights

Today's highlights include a significant performance bug found in cuBLAS on the unreleased RTX 5090, alongside a deep dive into NVIDIA's new neural texture compression, which drastically reduces VRAM usage. We also cover a practical guide to deploying massive LLMs efficiently across multiple GPUs using vLLM and FP4 quantization.

Surfacing a 60% SGEMM performance bug in cuBLAS on RTX 5090 (r/CUDA)

Source: https://reddit.com/r/CUDA/comments/1shrh0s/surfacing_a_60_sgemm_performance_bug_in_cublas_on/

This post reports a critical performance bug in NVIDIA's cuBLAS library, specifically affecting FP32 SGEMM operations on the unreleased RTX 5090 GPU. The bug causes roughly a 60% performance degradation: for certain batched FP32 problem shapes, cuBLAS dispatches an inefficient simt_128x32_8x5 kernel rather than an optimized one. The issue surfaced while the author was developing and benchmarking a TMA-based FP32 SGEMM implementation, highlighting inefficiencies in current or upcoming cuBLAS versions.

This discovery is vital for developers and researchers who rely heavily on cuBLAS for high-performance computing and AI workloads. A 60% performance hit in such a fundamental operation could severely impact the efficiency of neural network training, scientific simulations, and other compute-intensive tasks on next-generation NVIDIA hardware. It underscores the importance of rigorous benchmarking and in-depth kernel analysis, even for highly optimized libraries, to ensure hardware capabilities are fully leveraged.
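As a rough illustration of how a regression like this gets surfaced during benchmarking, the sketch below converts SGEMM timings into achieved TFLOP/s and compares the two kernels. The matrix dimensions and timings are hypothetical, chosen only to reproduce a 60% gap; they are not measurements from the post.

```python
# Hypothetical sanity check for SGEMM throughput: convert timings to
# achieved TFLOP/s and quantify a kernel-dispatch regression.
# All numbers below are illustrative, not measured on an RTX 5090.

def sgemm_tflops(m: int, n: int, k: int, seconds: float) -> float:
    """Achieved TFLOP/s for one M x K @ K x N SGEMM (2*M*N*K FLOPs)."""
    return (2.0 * m * n * k) / seconds / 1e12

def efficiency(achieved_tflops: float, peak_tflops: float) -> float:
    """Fraction of an (assumed) FP32 peak actually reached."""
    return achieved_tflops / peak_tflops

# Suppose an optimized kernel finishes an 8192^3 SGEMM in 11 ms,
# while a mis-dispatched SIMT kernel takes 27.5 ms for the same problem:
fast = sgemm_tflops(8192, 8192, 8192, 0.011)
slow = sgemm_tflops(8192, 8192, 8192, 0.0275)
print(f"regression: {1 - slow / fast:.0%}")  # prints "regression: 60%"
```

Comparing achieved throughput against the hardware's theoretical peak (the `efficiency` helper) is what typically exposes a bad dispatch: the math is identical, so a large, shape-dependent efficiency drop points at kernel selection rather than the algorithm.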

Comment: Finding a 60% performance bug in cuBLAS for a future GPU is a major alert for anyone planning on next-gen deployments. This kind of low-level inefficiency can completely derail optimization efforts downstream if not addressed by NVIDIA or custom kernel development.

Benchmarking Nvidia's RTX Neural Texture Compression tech that can reduce VRAM usage by over 80% (r/nvidia)

Source: https://reddit.com/r/nvidia/comments/1siioe2/benchmarking_nvidias_rtx_neural_texture/

NVIDIA is showcasing its RTX Neural Texture Compression technology, a groundbreaking technique designed to drastically reduce GPU VRAM consumption. Initial benchmarks suggest this technology can cut VRAM usage by over 80%, a significant advancement for games, professional applications, and AI models that are increasingly bottlenecked by memory capacity. This innovation leverages neural networks to compress texture data more efficiently than traditional methods, reconstructing high-quality textures during rendering with minimal artifacts.

The ability to achieve such substantial VRAM savings has profound implications. For gamers, it means more complex textures, higher resolutions, faster loading times, and potentially better performance on GPUs with limited VRAM. For developers, it opens doors to creating more graphically rich environments and deploying larger AI models that rely on extensive texture or volumetric data without exceeding memory budgets. This technology could extend the lifespan of current-generation GPUs and enable new levels of visual fidelity on future hardware.
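To make the savings concrete, here is a back-of-the-envelope sketch of texture VRAM math. The texture sizes, the ~1 byte-per-texel baseline (roughly BC7 block compression), and the 85% neural compression ratio are all illustrative assumptions, not NVIDIA's published figures.

```python
# Back-of-the-envelope VRAM math for an 80%+ texture-memory reduction.
# Sizes and the compression ratio are illustrative assumptions only.

def texture_bytes(width: int, height: int, bytes_per_texel: float,
                  mip_chain: bool = True) -> float:
    """Approximate footprint of one texture; a full mip chain adds ~1/3."""
    base = width * height * bytes_per_texel
    return base * 4 / 3 if mip_chain else base

# A 4K PBR material set: albedo, normal, roughness/metal, AO maps,
# each block-compressed to roughly 1 byte per texel.
textures = [texture_bytes(4096, 4096, 1.0) for _ in range(4)]
uncompressed_mb = sum(textures) / 2**20

neural_ratio = 0.85  # assumed: neural compression removes 85% of the bytes
compressed_mb = uncompressed_mb * (1 - neural_ratio)

print(f"{uncompressed_mb:.0f} MiB -> {compressed_mb:.1f} MiB per material")
```

Scaled across the hundreds of materials in a modern game scene, savings of this magnitude are what turn an 8 GB card's texture budget into something closer to a 24 GB card's.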

Comment: Reducing VRAM usage by 80% is a game-changer, especially for large language models and high-fidelity graphics. This tech directly addresses a major bottleneck and could dramatically expand the capabilities of existing and future GPUs.

Run Qwen3.5-397B-A13B with vLLM and 8xR9700 (r/LocalLLaMA)

Source: https://reddit.com/r/LocalLLaMA/comments/1simsqp/run_qwen35397ba13b_with_vllm_and_8xr9700/

This post details a practical guide to deploying the massive Qwen3.5-397B-A13B language model with vLLM across a multi-GPU setup, specifically its successful execution on eight R9700 GPUs. The guide emphasizes microscaling FP4 (mxfp4) quantization, a critical technique for running models at this scale: it shrinks the memory footprint while preserving quality. vLLM, a high-throughput inference engine, serves the model and makes efficient use of the multi-GPU hardware.

This news is highly significant for the AI community, particularly for those running large language models on local or prosumer GPU setups. It offers actionable insight into managing the memory and compute requirements of state-of-the-art models, showing that even extremely large LLMs can be run with careful optimization and multi-GPU orchestration. The author's reference to an earlier "first guide to run 122B models" underscores the pioneering nature of making such formidable models accessible and performant.
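A quick memory budget shows why 4-bit quantization is the enabling trick here. The sketch below assumes ~4 bits per parameter for mxfp4 plus a 10% overhead for quantization scales and metadata, and 32 GiB of VRAM per card (as on an R9700); both figures are assumptions for illustration, not from the post.

```python
# Rough weight-memory budget for a ~397B-parameter model under 4-bit
# quantization, split across 8 GPUs via tensor parallelism.
# Bits-per-param overhead and the 32 GiB per-card figure are assumptions.

GIB = 2**30

def weights_gib(n_params: float, bits_per_param: float,
                overhead: float = 0.10) -> float:
    """Weight footprint in GiB; `overhead` covers scales/metadata (assumed)."""
    return n_params * bits_per_param / 8 * (1 + overhead) / GIB

total = weights_gib(397e9, 4.0)   # mxfp4 stores ~4 bits per parameter
per_gpu = total / 8               # tensor parallelism over 8 cards
headroom = 32.0 - per_gpu         # left for KV cache and activations

print(f"total ~{total:.0f} GiB, ~{per_gpu:.1f} GiB/GPU, "
      f"~{headroom:.1f} GiB headroom per 32 GiB card")
```

At FP16 the same weights would need roughly 800 GiB, which no eight-card prosumer box can hold; at 4 bits they fit with room to spare for the KV cache, which is the whole point of the guide's quantization choice.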

Comment: Deploying a model this large across multiple GPUs with vLLM and mxfp4 is cutting-edge. This guide is invaluable for anyone trying to push the boundaries of local LLM inference on high-end consumer or professional hardware.
