GPU-Accelerated LLMs: Serving at 1M Tok/s, Voxtral TTS, & 4-bit Weight Quantization
Today's Highlights
This week: bleeding-edge LLM serving reaching 1M tokens/second on B200 GPUs, Mistral AI's upcoming open-weight Voxtral TTS model, and 3.2× memory savings for your local LLMs via TurboQuant's 4-bit weight quantization.
1M Tokens/Second Serving Qwen 3.5 27B on B200 GPUs (r/MachineLearning)
Source: https://reddit.com/r/MachineLearning/comments/1s4hxgu/d_1m_tokenssecond_serving_qwen_35_27b_on_b200/
This post reports serving Qwen 3.5 27B at over 1 million tokens per second across 96 B200 GPUs with vLLM v0.18.0. The core finding: Data Parallelism (DP=8) delivered nearly four times the throughput of Tensor Parallelism (TP=8) for this model size. For moderately sized models like Qwen 3.5 27B, the per-layer communication that tensor parallelism requires can outweigh its compute benefits, making data parallelism the more efficient strategy for maximizing inference throughput across multiple GPUs.
Developers pushing the limits of local or self-hosted inference will find these parallelism insights valuable. Although a cluster of 96 B200s is out of reach for most, the underlying principles apply directly to smaller multi-GPU setups: tune vLLM for token throughput, and know when to favor data parallelism over tensor parallelism. The detailed write-up offers a concrete blueprint for anyone tuning a vLLM deployment, whether on a single multi-GPU workstation or a small local cluster.
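To see why DP can beat TP when a model fits on a single GPU, here is a toy throughput model in Python. The per-GPU baseline and the communication-overhead fraction are invented for illustration; the post does not publish a cost breakdown.

```python
def tp_throughput(single_gpu_tok_s: float, n_gpus: int, comm_frac: float) -> float:
    """Tensor parallelism: per-token compute shrinks ~n_gpus-fold, but every
    transformer layer adds cross-GPU communication, modeled here as a fixed
    fractional overhead per extra GPU (an illustrative assumption)."""
    speedup = n_gpus / (1 + comm_frac * (n_gpus - 1))
    return single_gpu_tok_s * speedup

def dp_throughput(single_gpu_tok_s: float, n_gpus: int) -> float:
    """Data parallelism: each GPU serves a full model replica, so aggregate
    throughput scales linearly, provided the weights fit in one GPU's memory
    (true for a 27B model on a B200)."""
    return single_gpu_tok_s * n_gpus

base = 10_000.0  # hypothetical tok/s for one replica on one GPU
tp = tp_throughput(base, n_gpus=8, comm_frac=0.4)
dp = dp_throughput(base, n_gpus=8)
print(f"TP=8: {tp:,.0f} tok/s, DP=8: {dp:,.0f} tok/s, ratio {dp / tp:.1f}x")
```

With a 40% per-GPU communication penalty, DP comes out roughly 3.8× ahead in this toy model, the same ballpark as the near-4× gap the benchmark reports; the real gap depends on interconnect bandwidth, batch size, and sequence length.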
Comment: 1M tok/s is insane, even on 96 B200s. For my single RTX 5090, understanding the DP vs TP trade-offs in vLLM is still critical for squeezing out every last token, especially when batching. I'll be looking for how these insights might translate to consumer hardware.
Mistral AI to Release Voxtral TTS with Open Weights, Outperforms ElevenLabs Flash v2.5 (r/LocalLLaMA)
Source: https://reddit.com/r/LocalLLaMA/comments/1s46ylj/mistral_ai_to_release_voxtral_tts_a/
Mistral AI is set to release Voxtral TTS, a new 3-billion-parameter text-to-speech model with eagerly anticipated open weights. This model reportedly outperforms ElevenLabs Flash v2.5 in human preference tests, setting a new benchmark for quality in locally runnable TTS. Crucially for our self-hosted developers, Voxtral TTS boasts a minimal memory footprint, requiring only about 3 GB of RAM, making it highly suitable for deployment on consumer-grade hardware like RTX GPUs.
The model also achieves an impressive 90-millisecond time-to-first-audio (TTA) and supports nine languages, offering a versatile and low-latency solution for integrating high-quality speech synthesis into applications. This release is a game-changer for developers looking to incorporate advanced TTS capabilities into local AI projects, edge devices, or self-hosted agents without relying on proprietary APIs or massive cloud infrastructure. The availability of open weights means full control and the ability to fine-tune or adapt the model for specific use cases.
Comment: Open-weight TTS that beats ElevenLabs and runs on 3GB VRAM? That's a must-try. I'm imagining using this locally for agent voice feedback or even integrating it into a custom chat UI running on my RTX 5090. Goodbye, API costs.
TurboQuant for Weights: Near-Optimal 4-bit LLM Quantization with 3.2× Memory Savings (r/LocalLLaMA)
Source: https://reddit.com/r/LocalLLaMA/comments/1s51b5h/turboquant_for_weights_nearoptimal_4bit_llm/
This item introduces an adaptation of the TurboQuant algorithm to model-weight compression, promising near-optimal 4-bit LLM quantization with 3.2× memory savings. Unlike many quantization schemes that trade away significant accuracy, TurboQuant targets a 'lossless 8-bit residual,' suggesting a high-fidelity approach. For developers battling VRAM constraints on local RTX GPUs, this is a significant step forward, potentially enabling much larger models than were previously within reach.
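As a back-of-envelope check on the memory savings, here is the arithmetic for plain 4-bit weight storage with per-group fp16 scales. The group size of 128 is an assumption; the post does not specify TurboQuant's storage layout, and its 3.2× figure presumably reflects additional metadata or residual overhead.

```python
def fp16_bytes(n_params: int) -> int:
    # 2 bytes per weight in half precision
    return 2 * n_params

def int4_bytes(n_params: int, group_size: int = 128) -> int:
    # Two 4-bit weights packed per byte, plus one fp16 scale per group
    # (layout assumed for illustration).
    return n_params // 2 + 2 * (n_params // group_size)

n = 27_000_000_000  # e.g. a 27B-parameter model
savings = fp16_bytes(n) / int4_bytes(n)
print(f"fp16: {fp16_bytes(n) / 1e9:.1f} GB, "
      f"int4: {int4_bytes(n) / 1e9:.1f} GB, {savings:.2f}x smaller")
```

In this naive layout a 27B model drops from roughly 54 GB of fp16 weights to about 14 GB, turning a multi-GPU model into something a single high-VRAM card can hold.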
The implementation is described as a 'drop-in replacement for nn.Linear,' so it should slot into existing PyTorch-based LLM workflows without extensive refactoring, letting developers experiment with TurboQuant on their own models quickly. By drastically shrinking the memory footprint of LLM weights, this technique attacks one of the biggest bottlenecks in local LLM development and inference, opening up more powerful and complex applications on self-hosted infrastructure.
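The post shares no code, but the 'drop-in replacement' idea can be sketched as a minimal quantized linear layer. This sketch uses plain round-to-nearest per-channel 4-bit quantization in NumPy as a stand-in; TurboQuant's actual algorithm and its PyTorch nn.Linear module are not reproduced here.

```python
import numpy as np

class Int4Linear:
    """Minimal sketch of a 4-bit quantized linear layer with an
    nn.Linear-like call interface. Round-to-nearest per-channel
    quantization only -- a generic baseline, not TurboQuant itself."""

    def __init__(self, weight: np.ndarray):
        # weight: (out_features, in_features), float32.
        # One scale per output channel, mapping max |w| to the int4 range.
        self.scales = np.abs(weight).max(axis=1, keepdims=True) / 7.0
        self.scales[self.scales == 0] = 1.0
        self.q = np.clip(np.round(weight / self.scales), -8, 7).astype(np.int8)

    def __call__(self, x: np.ndarray) -> np.ndarray:
        # Dequantize on the fly and apply y = x @ W^T, like nn.Linear.
        w = self.q.astype(np.float32) * self.scales
        return x @ w.T

rng = np.random.default_rng(0)
W = rng.standard_normal((16, 32)).astype(np.float32)
layer = Int4Linear(W)
x = rng.standard_normal((4, 32)).astype(np.float32)
y = layer(x)        # quantized output
y_ref = x @ W.T     # full-precision reference
```

A real implementation would also pack two int4 values per byte and fuse dequantization into the matmul kernel; this sketch keeps the int8 container for clarity.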
Comment: 3.2x memory savings for LLM weights with a 'drop-in replacement for nn.Linear'? This sounds like black magic I need on my RTX 5090 yesterday. Running bigger models is always the goal, and this could be the key to fitting that next 100B parameter beast into my VRAM.