soy

Posted on • Originally published at media.patentllm.org

CUDA-Oxide 0.1 Lands; RTX 5090 Launches with 32GB & Hits 600 Tok/s

Today's Highlights

NVIDIA introduces CUDA-Oxide 0.1, an experimental Rust-to-CUDA compiler. Concurrently, the AORUS RTX 5090 INFINITY 32G officially launches, with benchmarks showing it can achieve 600 tokens/s on Gemma 4 26B using DFlash.

NVIDIA releases CUDA-Oxide 0.1 for experimental Rust-to-CUDA compiler (r/CUDA)

Source: https://reddit.com/r/CUDA/comments/1t7a6n9/nvidia_releases_cudaoxide_01_for_experimental/

This release introduces CUDA-Oxide 0.1, an experimental Rust-to-CUDA compiler developed by NVIDIA. It allows developers to write GPU kernels using the Rust programming language, offering a memory-safe alternative to C++ for CUDA development. The project aims to integrate Rust's modern language features, such as strong type safety and zero-cost abstractions, directly into the CUDA ecosystem. This compiler translates Rust code into PTX (Parallel Thread Execution), NVIDIA's assembly-like virtual instruction set architecture, enabling execution on NVIDIA GPUs.
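The post doesn't show CUDA-Oxide's actual kernel attributes or launch API, so the sketch below is a hedged illustration: the body of a SAXPY kernel (the classic first CUDA example) written as plain, safe Rust. This is the kind of indexed element-wise update a Rust-to-CUDA compiler would lower to PTX, with the per-thread index replaced here by a CPU-side loop so the snippet runs anywhere.

```rust
// SAXPY: y[i] = a * x[i] + y[i] for each element.
// On a GPU, each thread would compute one `i` from its block and
// thread indices; this CPU sketch iterates instead. The kernel-body
// logic is what a tool like CUDA-Oxide would translate to PTX.
fn saxpy(a: f32, x: &[f32], y: &mut [f32]) {
    for i in 0..x.len().min(y.len()) {
        y[i] = a * x[i] + y[i];
    }
}

fn main() {
    let x = vec![1.0_f32, 2.0, 3.0];
    let mut y = vec![10.0_f32, 20.0, 30.0];
    saxpy(2.0, &x, &mut y);
    println!("{:?}", y); // [12.0, 24.0, 36.0]
}
```

Rust's borrow checker catches aliasing and out-of-bounds hazards at compile time that a C++ CUDA kernel would only surface at runtime, which is the safety argument the release leans on.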

This development is significant for the CUDA community as it opens the door for Rust developers to directly target NVIDIA hardware for high-performance computing and AI workloads. By leveraging Rust's safety guarantees, developers can potentially reduce common programming errors associated with manual memory management in C++, leading to more robust and reliable GPU applications. The experimental nature of this release suggests ongoing development, with a focus on gathering community feedback to refine the compiler and expand its feature set.

Comment: A Rust-to-CUDA compiler is a game-changer for writing safer, more robust GPU code without sacrificing performance. I'm eager to try porting some of my C++ kernels to Rust with this.

AORUS RTX 5090 INFINITY 32G launches with 2730 MHz boost clock (r/nvidia)

Source: https://reddit.com/r/nvidia/comments/1t7935d/aorus_rtx_5090_infinity_32g_launches_with_2730/

Gigabyte's AORUS brand has officially launched its RTX 5090 INFINITY 32G graphics card. The NVIDIA-based GPU ships with 32GB of VRAM, catering to demanding graphical workloads, high-resolution gaming, and professional AI/ML applications. The headline specification is its 2730 MHz factory-overclocked boost clock, a notable uplift over NVIDIA's reference design.

The RTX 5090 is expected to be based on NVIDIA's latest architecture, offering advancements in ray tracing, AI processing (Tensor Cores), and rasterization performance. The 32GB of VRAM is crucial for handling large textures, complex scenes, and large AI models, preventing memory bottlenecks that hinder performance in cutting-edge applications. The AORUS INFINITY series is known for its premium cooling solutions and robust power delivery, suggesting this card is built to sustain its high clock speeds under heavy load, giving enthusiasts and professionals top-tier hardware for their computational needs.
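To make the "32GB is crucial" claim concrete, here is a back-of-envelope weight-memory estimate. The figures are illustrative assumptions (weights only, ignoring KV cache and activation overhead), not measurements from the post:

```rust
// Rough VRAM footprint of model weights at a given quantization width.
// params_billion: parameter count in billions; bits_per_weight: e.g. 4
// for AWQ 4-bit. Ignores KV cache, activations, and runtime overhead.
fn weight_gb(params_billion: f64, bits_per_weight: f64) -> f64 {
    params_billion * 1e9 * bits_per_weight / 8.0 / 1e9
}

fn main() {
    // A 26B model quantized to 4 bits needs ~13 GB for weights alone,
    // leaving ample headroom on a 32 GB card for KV cache and batching.
    let w = weight_gb(26.0, 4.0);
    println!("4-bit 26B weights: ~{w:.1} GB");
}
```

The same model at 16-bit weights would need roughly 52 GB, which is why a 26B model only fits on a single 32GB card once quantized.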

Comment: Another 5090 variant emerges, and 32GB VRAM is the sweet spot for many LLMs. That 2730MHz boost clock indicates serious thermal engineering to keep it stable.

Gemma 4 26B Hits 600 Tok/s on One RTX 5090 (r/LocalLLaMA)

Source: https://reddit.com/r/LocalLLaMA/comments/1t796qe/gemma_4_26b_hits_600_toks_on_one_rtx_5090/

A recent benchmark showcases the impressive inference capabilities of the Gemma 4 26B model, achieving a throughput of 600 tokens per second on a single NVIDIA RTX 5090 GPU equipped with 32GB of VRAM. The testing setup used vLLM version 0.19.2rc1 and leveraged DFlash speculative decoding for optimized performance. The main model was cyankiwi/gemma-4-26B-A4B-it-AWQ-4bit, a 4-bit AWQ quantized build, with a separate draft model supplying candidate tokens for the speculative decoding step.
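The throughput gain from speculative decoding can be sketched with a standard simplified model: a cheap draft model proposes k tokens per step, each accepted with probability alpha, so one expensive 26B forward pass verifies several tokens at once. The alpha and k values below are assumed for illustration; the post doesn't report DFlash's actual acceptance rate or draft length.

```rust
// Expected tokens produced per target-model forward pass under a
// simplified speculative-decoding model: the draft proposes k tokens,
// each accepted independently with probability `alpha`.
// E[tokens] = (1 - alpha^(k+1)) / (1 - alpha)  (a truncated geometric sum).
fn expected_tokens(alpha: f64, k: u32) -> f64 {
    (1.0 - alpha.powi(k as i32 + 1)) / (1.0 - alpha)
}

fn main() {
    // Assumed: 80% acceptance rate, 4 drafted tokens per step.
    // Each 26B forward pass then yields ~3.36 tokens on average,
    // multiplying throughput without changing model outputs.
    let e = expected_tokens(0.8, 4);
    println!("~{e:.2} tokens per target pass");
}
```

This is why speculative decoding can push a 26B model to numbers like 600 tok/s: the big model's per-pass cost is amortized over several accepted draft tokens, while verification guarantees the output distribution matches the target model.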

This benchmark provides concrete evidence of the RTX 5090's power in AI inference and highlights the effectiveness of speculative decoding techniques like DFlash when combined with advanced inference engines such as vLLM. Achieving 600 tok/s on a 26B model is a significant feat for local and single-card deployments, demonstrating that the latest consumer-grade GPUs, coupled with software optimizations, can handle substantial language models efficiently. This performance data is valuable for developers and researchers planning hardware for large language model deployments, underscoring the interplay between GPU hardware, VRAM capacity, and advanced decoding algorithms.

Comment: 600 tok/s for Gemma 4 26B on a single 5090 is fantastic, especially with DFlash. This demonstrates how much mileage we can get from hardware when coupled with smart speculative decoding.
