
soy

Posted on • Originally published at media.patentllm.org

Qwen3.6 Performance Boost with vLLM, New Ollama Management Tool & 35B Model

Today's Highlights

This week's top stories highlight significant strides in local LLM performance and usability. A Qwen3.6-27B INT4 variant achieved 100 tokens per second (tps) with vLLM on an RTX 5090, while a new Cockpit extension streamlines Ollama model management, making local AI more accessible. Additionally, the Qwen3.6 35B A3B Heretic model stands out for its quality and efficiency with IQ4XS quantization and a Q8 KV cache.

Qwen3.6-27B-INT4 Hits 100 TPS, 256K Context with vLLM 0.19 on RTX 5090 (r/LocalLLaMA)

Source: https://reddit.com/r/LocalLLaMA/comments/1sw21op/qwen3627bint4_clocking_100_tps_with_256k_context/

This report details a significant performance milestone for local inference: 100 tps from the Qwen3.6-27B model quantized to INT4. The setup uses a single NVIDIA RTX 5090, whose 32 GB of VRAM leaves room for both the quantized weights and a large KV cache, and runs vLLM version 0.19. The combination supports a 256K context length, pushing the boundaries of long-context generation on consumer-grade hardware.

The result highlights the effectiveness of vLLM's serving architecture, whose PagedAttention-based KV cache management and optimized attention kernels are built to maximize throughput. INT4 quantization is what makes a model this size fit in VRAM alongside a 256K KV cache while also speeding up inference, putting this Qwen3.6 variant within reach for local deployment. That combination of throughput and context length is especially valuable for complex reasoning, code generation, and whole-document analysis without relying on cloud services.
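For a sense of what reproducing this looks like, here is a minimal sketch using vLLM's offline Python API. The checkpoint name, the AWQ quantization method, and the memory settings are illustrative assumptions, not details confirmed by the post.

```python
# Hypothetical sketch: serving a long-context INT4 checkpoint with vLLM.
# Model name, quantization method, and KV cache dtype are assumptions.
from vllm import LLM, SamplingParams

llm = LLM(
    model="Qwen/Qwen3.6-27B-AWQ",  # hypothetical INT4 (AWQ) checkpoint
    quantization="awq",            # assumes the weights ship in AWQ format
    max_model_len=262144,          # the 256K context reported in the post
    gpu_memory_utilization=0.95,   # leave a little headroom on the card
    kv_cache_dtype="fp8",          # illustrative; a compact KV cache helps at long context
)

params = SamplingParams(temperature=0.7, max_tokens=512)
outputs = llm.generate(["Summarize this 200-page report: ..."], params)
print(outputs[0].outputs[0].text)
```

In practice vLLM auto-detects most quantization formats from the model config, so the exact arguments depend on how the checkpoint is published.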

Comment: Hitting 100 tps on a 27B INT4 model with 256k context on a single consumer GPU like the RTX 5090 is wild. vLLM is clearly doing heavy lifting, making massive models and contexts practical for self-hosted, real-time use cases.

Ollama Model Management Simplified with New Cockpit Extension (r/Ollama)

Source: https://reddit.com/r/ollama/comments/1swe2cf/i_tried_to_create_a_simple_cockpit_extension_to/

A new Cockpit extension has been developed to streamline the management of local Ollama models. This tool addresses a common need for users running Ollama on self-hosted servers, providing a graphical interface within the Cockpit web console to easily pull, list, and delete LLM models. For system administrators and homelab enthusiasts, Cockpit offers a centralized platform for server management, and this extension integrates Ollama directly into that workflow.

The extension lets users interact with their local Ollama instance without dropping to the command line, simplifying the process of pulling new models, monitoring installed ones, and freeing up storage. That lowers the barrier to managing a diverse collection of local AI models, making it easier to experiment with open-weight families like Llama, Gemma, or Qwen and to keep them up to date. It's a practical enhancement for anyone running Ollama for local inference.
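Under the hood, a tool like this presumably wraps Ollama's local REST API. The endpoints below are Ollama's documented routes; the client itself is a sketch of the kind of calls a management UI would make, not the extension's actual code.

```python
# Sketch of the Ollama REST calls a management UI would wrap.
# Endpoints are from Ollama's API docs; the client code is illustrative.
import json
import requests

OLLAMA = "http://localhost:11434"

def list_models() -> list[str]:
    """GET /api/tags lists locally installed models."""
    resp = requests.get(f"{OLLAMA}/api/tags", timeout=10)
    resp.raise_for_status()
    return [m["name"] for m in resp.json().get("models", [])]

def pull_model(name: str) -> None:
    """POST /api/pull streams download progress as JSON lines."""
    with requests.post(f"{OLLAMA}/api/pull", json={"name": name}, stream=True) as resp:
        resp.raise_for_status()
        for line in resp.iter_lines():
            if line:
                print(json.loads(line).get("status", ""))

def delete_model(name: str) -> None:
    """DELETE /api/delete removes a model and frees its disk space."""
    requests.delete(f"{OLLAMA}/api/delete", json={"name": name}).raise_for_status()

if __name__ == "__main__":
    print(list_models())
```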

Comment: This Cockpit extension is exactly what some self-hosters need. Managing Ollama via CLI is fine, but a GUI for pulling/listing/deleting models makes it so much more approachable for a homelab setup. Definitely worth a try for easier local AI ops.

Qwen3.6 35B A3B Heretic: Top-Tier Open Model with IQ4XS/Q8 KV Cache (r/LocalLLaMA)

Source: https://reddit.com/r/LocalLLaMA/comments/1sw5fb7/qwen36_35b_a3b_heretic_kld_00015_incredible_model/

A new variant of the Qwen3.6-35B model, dubbed "A3B Heretic," is garnering significant attention within the local LLM community, particularly for its Kullback-Leibler divergence (KLD) of just 0.0015 from the base model. A KLD this low means the decensoring process barely shifted the model's output distribution, so the original quality and coherence are essentially preserved in the uncensored variant. Users praise it as "BY FAR the best uncensored model" in the Qwen3.6 35B category.

The model is highlighted for its practicality in local inference setups: IQ4XS quantization plus a Q8 KV cache lets it fit comfortably within 24GB of VRAM, putting a 35B-parameter model within reach of high-end consumer GPUs. It also reportedly handles a substantial 262K context length without failure across multi-turn interactions, underscoring its robustness for complex, extended conversations and document processing.
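As a back-of-the-envelope check on the 24GB claim: IQ4_XS works out to roughly 4.25 bits per weight, and the arithmetic below shows how much room that leaves for a Q8 KV cache. The per-token figure at the end is derived, not reported; the post doesn't state the model's attention configuration, so treat this as a rough bound.

```python
# Rough VRAM arithmetic for a 35B model at IQ4_XS with a Q8 KV cache.
# IQ4_XS is ~4.25 bits per weight; everything else is a derived bound.
PARAMS = 35e9        # total parameters (MoE total, not the 3B active)
BPW = 4.25           # approximate bits per weight for IQ4_XS
VRAM_GB = 24.0
CONTEXT = 262_144    # the 262K context mentioned in the post

weights_gb = PARAMS * BPW / 8 / 1e9
print(f"weights: ~{weights_gb:.1f} GB")            # ~18.6 GB

budget_gb = VRAM_GB - weights_gb                   # left for KV cache and overhead
print(f"KV/overhead budget: ~{budget_gb:.1f} GB")  # ~5.4 GB

# At Q8, each cached K or V element is 1 byte; this is the per-token
# cache size the budget can afford across the full 262K window.
print(f"affordable KV: ~{budget_gb * 1e9 / CONTEXT / 1024:.0f} KiB per token")
```

The budget works out to about 20 KiB of cache per token, which is tight; it suggests the model family uses a KV-efficient attention scheme, and it is exactly why a Q8 rather than FP16 KV cache matters at this context length.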

Comment: The Qwen3.6 35B A3B Heretic, especially with IQ4XS and a Q8 KV cache, sounds like a game-changer for those with 24GB of VRAM. A KLD of 0.0015 is impressive: it suggests the decensoring cost almost nothing in output quality.
