DEV Community

soy

Posted on • Originally published at media.patentllm.org

Qwen3.6-27B vLLM 0.19 Benchmarks, GLM 5.1 Local Performance, & Multimodal WaTale

Today's Highlights

This week's top stories feature impressive local inference benchmarks for Qwen3.6-27B and GLM 5.1 using vLLM, sglang, and NVFP4 quantization, demonstrating high throughput on consumer and workstation GPUs. We also spotlight WaTale, a new fully local AI visual novel engine that integrates Ollama, Stable Diffusion, and Kokoro TTS for multimodal creative applications.

Qwen3.6-27B Achieves 80 tps, 218k Context on RTX 5090 with vLLM 0.19 (r/LocalLLaMA)

Source: https://reddit.com/r/LocalLLaMA/comments/1sv8eua/qwen3627b_at_80_tps_with_218k_context_window_on/

The open-weight Qwen3.6-27B model recently demonstrated remarkable local inference capabilities, achieving approximately 80 tokens per second (tps) generation with an expansive 218,000 token context window. This impressive performance was recorded on a single NVIDIA RTX 5090 GPU, showcasing the growing power of consumer-grade hardware for advanced AI workloads. The setup utilized vLLM version 0.19, a highly optimized inference engine renowned for its ability to maximize throughput and minimize latency in serving large language models.
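As a rough sketch of how such a deployment might be launched (the model ID and exact flags here are assumptions, not taken from the original post), a vLLM server start for a long-context, pre-quantized checkpoint could look like:

```shell
# Hypothetical launch command -- the model name is illustrative.
# NVFP4 quantization is typically detected from the checkpoint config,
# so only the context length, memory budget, and port are set explicitly.
vllm serve Qwen/Qwen3.6-27B-NVFP4 \
  --max-model-len 218000 \
  --gpu-memory-utilization 0.95 \
  --port 8000
```

The server then exposes an OpenAI-compatible API on the chosen port, which is the usual way self-hosters plug vLLM into existing tooling.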

A key factor in this high-efficiency deployment is the adoption of NVFP4 quantization combined with MTP (in this context most likely multi-token prediction, a speculative-decoding technique in which the model drafts several tokens per forward pass and verifies them in one step). NVFP4 significantly reduces the memory footprint and compute cost of the 27-billion-parameter model without substantial quality loss, making it feasible to run a model of this size with an extensive context on a single GPU. This benchmark highlights the continuous advances in both open-weight model development and local-inference tooling, pushing the boundaries of what self-hosted hardware can achieve. Users who need long context windows for complex tasks can now explore such configurations for robust local deployment.
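To make the memory argument concrete, a back-of-the-envelope calculation (the parameter count comes from the model name; everything else is plain arithmetic) shows why 4-bit weights fit on a 32 GB RTX 5090 where 16-bit weights would not:

```python
def weight_footprint_gb(num_params: float, bits_per_param: int) -> float:
    """Approximate weight memory in GB (ignores activations and KV cache)."""
    return num_params * bits_per_param / 8 / 1e9

params = 27e9  # Qwen3.6-27B

fp16_gb = weight_footprint_gb(params, 16)  # 54.0 GB -- exceeds a 32 GB card
nvfp4_gb = weight_footprint_gb(params, 4)  # 13.5 GB -- leaves headroom for KV cache

print(f"FP16: {fp16_gb:.1f} GB, NVFP4: {nvfp4_gb:.1f} GB")
```

The roughly 18 GB freed by quantization is what makes room for the KV cache that a 218k-token context requires.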

Comment: Hitting 80 tps on Qwen3.6-27B with a 218k context window on a single 5090 is a massive leap for local capabilities. vLLM and NVFP4 are clearly the workhorses here, making long-context, high-throughput inference truly viable for self-hosters.

GLM 5.1 Reaches 40 tps, 2000+ pp/s Locally via sglang Patching and NVFP4 (r/LocalLLaMA)

Source: https://reddit.com/r/LocalLLaMA/comments/1svgtlh/glm_51_locally_40tps_2000_pps/

Local AI enthusiasts are seeing significant gains in self-hosted model performance, exemplified by recent GLM 5.1 benchmarks. Through a carefully optimized setup involving 'sglang patching' and a 'REAP-ed' NVFP4 build (REAP most likely referring to an expert-pruning technique for mixture-of-experts models), GLM 5.1 achieved an impressive 40 tokens per second (tps) for generation and over 2,000 prefill tokens per second (pp/s). This demonstrates how specialized acceleration techniques can unlock high throughput for large open-weight models on private infrastructure.

The 'sglang patching' likely refers to modifications or integrations with the sglang library, which focuses on efficient LLM inference, particularly for structured output and complex prompt flows. Coupled with NVFP4 quantization, which compresses the model weights to 4-bit floating-point precision, this allows for substantial memory savings and faster computation while maintaining model fidelity. The inference was successfully run on a cluster of four RTX 6000 Pro GPUs, with each card operating under a strict 350W power limit. This showcases effective resource management and advanced software optimization techniques to run state-of-the-art models with production-level performance on self-hosted systems, even when facing power constraints.
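Using the reported figures, a quick estimate shows what these rates mean for end-to-end request latency (the prompt and output sizes below are hypothetical; only the throughput numbers and the 4 × 350 W power cap come from the post):

```python
def request_latency_s(prompt_tokens: int, output_tokens: int,
                      prefill_tps: float, decode_tps: float) -> float:
    """Rough end-to-end latency: prefill time plus token-by-token decode time."""
    return prompt_tokens / prefill_tps + output_tokens / decode_tps

# Reported GLM 5.1 numbers: 2000+ pp/s prefill, 40 tps generation.
latency = request_latency_s(prompt_tokens=50_000, output_tokens=1_000,
                            prefill_tps=2000, decode_tps=40)
print(f"~{latency:.0f} s")  # 50000/2000 + 1000/40 = 25 + 25 = 50 s

# Total GPU power budget for the cluster: four cards capped at 350 W each.
print(f"GPU power budget: {4 * 350} W")  # 1400 W
```

In other words, even a 50k-token prompt clears prefill in under half a minute at these rates, which is what makes the setup feel interactive despite the power limits.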

Comment: The GLM 5.1 benchmarks show that combining sglang patching with NVFP4 on robust workstation GPUs can deliver serious local performance. Achieving 40 tps and 2000+ pp/s with power limits is an engineering feat for self-hosted LLMs.

WaTale: A Fully Local AI Visual Novel Engine Powered by Ollama, SD, and Kokoro TTS (r/Ollama)

Source: https://reddit.com/r/ollama/comments/1svfchs/watale_a_free_fully_local_ai_visual_novel_engine/

Introducing WaTale, a free, fully local, AI-powered visual novel engine that lets users create and interact with branching narratives entirely on their own hardware. The application integrates several open-source AI components to deliver a rich, multimodal experience without any reliance on cloud services. At its core, WaTale uses Ollama to orchestrate large language models (LLMs) that generate dynamic text and dialogue, enabling a deeply interactive storytelling experience.

For visual elements, WaTale incorporates Stable Diffusion (SD), an open-source generative AI model capable of producing high-quality images from text prompts, allowing for custom scenes and character designs within the narrative. The immersive experience is further enhanced by Kokoro TTS (Text-to-Speech), which provides natural-sounding voiceovers for characters, bringing the story to life. By combining these powerful, locally runnable models, WaTale demonstrates the incredible potential of self-hosted AI for creative applications, offering users complete control over their content, privacy, and infrastructure. It’s a compelling example of how diverse open-weight models can be combined to build sophisticated, consumer-ready applications.
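WaTale's internals aren't detailed in the post, but the branching-narrative core of any visual novel engine can be sketched as a small scene graph. The scene names and structure below are purely illustrative; in WaTale, the text of each node would presumably come from Ollama, with Stable Diffusion and Kokoro TTS attaching an image and a voiceover per scene:

```python
from dataclasses import dataclass, field

@dataclass
class Scene:
    """One node of a branching narrative: generated text plus player choices."""
    text: str                                    # dialogue (LLM-generated in WaTale)
    choices: dict = field(default_factory=dict)  # choice label -> next scene id

SCENES = {
    "start": Scene("You wake in a quiet village.",
                   {"Explore the forest": "forest", "Stay home": "home"}),
    "forest": Scene("The trees whisper your name.", {}),
    "home": Scene("A peaceful day passes.", {}),
}

def play(scene_id: str, picks: list) -> list:
    """Follow a list of choice labels through the scene graph; return the path."""
    path = [scene_id]
    for pick in picks:
        scene_id = SCENES[scene_id].choices[pick]
        path.append(scene_id)
    return path

print(play("start", ["Explore the forest"]))  # ['start', 'forest']
```

Generating `Scene.text` on demand rather than authoring it up front is what turns this static graph into the dynamic, LLM-driven storytelling the post describes.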

Comment: WaTale is a fantastic showcase of local multimodal AI, proving how Ollama can integrate with Stable Diffusion and TTS for compelling applications. This makes complex creative tools accessible without cloud dependencies, directly runnable on your machine.
