DEV Community

Xavier Rey-Robert

Qwen3.6-27B + vLLM + Hermes on 24GB VRAM: May 2026 Recipe

If you want to reproduce my current local Hermes Agent + Qwen3.6-27B setup, this is the shape I would start from.

Target

One local coding agent.
One 24GB GPU.
Long context.
Tools enabled.
Thinking enabled.

No child agents fighting the main request.

The goal is not peak tok/s on a short prompt. The goal is: can the same agent session keep working after hours of tool calls without losing prefix locality, timing out during prefill, or getting wrecked by auxiliary requests?

Model

This setup is intentionally text-only.

I am not serving the multimodal GGUF variant here. The working configuration uses groxaxo/Qwen3.6-27B-GPTQ-Pro-4bit through vLLM with --language-model-only.

That choice matters. On a 24GB RTX 3090, the text-only GPTQ-Marlin path gave the best balance I found between long context, prefix caching, stable agent behavior and usable decode speed. Vision should be handled by a separate service/model if needed.

vLLM

The useful shape:

CUDA_VISIBLE_DEVICES=0 vllm serve groxaxo/Qwen3.6-27B-GPTQ-Pro-4Bit \
   --served-model-name qwen3.6-27b-gptq-pro-4bit \
   --dtype float16 \
   --quantization gptq_marlin \
   --tensor-parallel-size 1 \
   --max-model-len 131072 \
   --max-num-seqs 1 \
   --kv-cache-dtype fp8_e5m2 \
   --enable-prefix-caching \
   --reasoning-parser qwen3 \
   --enable-auto-tool-choice \
   --tool-call-parser qwen3_coder \
   --gpu-memory-utilization 0.95 \
   --max-cudagraph-capture-size 32 \
   --language-model-only

I used a recent vLLM nightly (0.20.1rc1.dev16+g7a1eb8ac2), not an old stable image.

The two flags people will want to argue about:

--max-num-seqs 1
--max-model-len 131072

I use max_num_seqs=1 deliberately. With an agent, parallelism is not free. Title generation, context compression, retries, browser checks, tool calls and side jobs can all steal KV/cache locality from the main request. On one 24GB GPU I prefer one useful request over two requests sabotaging each other.
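The same "one request at a time" idea can also be enforced on the client side. The following is a minimal sketch, not how Hermes does it: `RequestGate` and `demo` are hypothetical names, and the fake request stands in for a real HTTP call to the endpoint.

```python
import threading
import time


class RequestGate:
    """Hypothetical client-side gate that allows one in-flight request."""

    def __init__(self):
        self._lock = threading.Lock()    # held for the duration of a request
        self._stats = threading.Lock()   # protects the counters below
        self.in_flight = 0
        self.peak_in_flight = 0

    def run(self, request_fn):
        """Call request_fn() exclusively; other callers block until it returns."""
        with self._lock:
            with self._stats:
                self.in_flight += 1
                self.peak_in_flight = max(self.peak_in_flight, self.in_flight)
            try:
                return request_fn()
            finally:
                with self._stats:
                    self.in_flight -= 1


def demo(n_callers=4):
    """Fire n_callers concurrent fake requests through one gate."""
    gate = RequestGate()
    results = []

    def fake_request():
        time.sleep(0.01)  # stand-in for a real call to the vLLM endpoint
        return "done"

    threads = [
        threading.Thread(target=lambda: results.append(gate.run(fake_request)))
        for _ in range(n_callers)
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return gate.peak_in_flight, results
```

Routing title generation, compression and other side jobs through the same gate as the main request would keep them from overlapping, at the cost of making them wait.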

131k context is tight, but workable here. If your service OOMs, reduce context before adding MTP or enforce-eager. I would test 110k, then 100k, then 80k.
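To step down the context without touching anything else, it can help to generate the command from one place. This is a small sketch that rebuilds the command above with only `--max-model-len` varied; the exact token counts chosen for "110k", "100k" and "80k" are my assumption, and `serve_command` is a hypothetical helper.

```python
import shlex

# Context lengths to try in order. The first is the value used above;
# the round numbers for 110k/100k/80k are an assumption.
CONTEXT_LADDER = [131072, 110000, 100000, 80000]


def serve_command(max_model_len):
    """Build the vllm serve argv from the article, varying only the context."""
    return [
        "vllm", "serve", "groxaxo/Qwen3.6-27B-GPTQ-Pro-4Bit",
        "--served-model-name", "qwen3.6-27b-gptq-pro-4bit",
        "--dtype", "float16",
        "--quantization", "gptq_marlin",
        "--tensor-parallel-size", "1",
        "--max-model-len", str(max_model_len),
        "--max-num-seqs", "1",
        "--kv-cache-dtype", "fp8_e5m2",
        "--enable-prefix-caching",
        "--reasoning-parser", "qwen3",
        "--enable-auto-tool-choice",
        "--tool-call-parser", "qwen3_coder",
        "--gpu-memory-utilization", "0.95",
        "--max-cudagraph-capture-size", "32",
        "--language-model-only",
    ]


def ladder_commands():
    """Return one shell-quoted command line per context length to try."""
    return [shlex.join(serve_command(n)) for n in CONTEXT_LADDER]


if __name__ == "__main__":
    for line in ladder_commands():
        print("CUDA_VISIBLE_DEVICES=0 " + line)
```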

What I Did Not Keep

No Qwen3 Next MTP/speculative decoding in the stable config. It caused crashes, OOMs and 404s at the context sizes I actually use.

No --enforce-eager. It saves memory but degrades performance.

No explicit Hermes max_tokens: 16384 cap. I removed it because it made debugging truncation and long reasoning/final-answer behavior harder.
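When debugging truncation, the response itself usually says whether a token limit was hit. In the OpenAI-compatible chat completions format, a choice that stopped on a limit carries `finish_reason: "length"`. A minimal sketch, with `hit_output_cap` as a hypothetical helper operating on the parsed JSON response:

```python
def hit_output_cap(response):
    """Return True if any choice stopped because a token limit was reached.

    `response` is the parsed JSON body of a chat completions reply.
    """
    choices = response.get("choices", [])
    return any(choice.get("finish_reason") == "length" for choice in choices)


# Hand-written examples of the two shapes worth telling apart.
truncated = {"choices": [{"finish_reason": "length", "message": {"content": "..."}}]}
finished = {"choices": [{"finish_reason": "stop", "message": {"content": "ok"}}]}
```

A reply with `"length"` and no final answer points at a cap or context limit rather than at the model giving up.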

Hermes

Point Hermes at the OpenAI-compatible vLLM endpoint and use the same served model name: qwen3.6-27b-gptq-pro-4bit.

The settings that mattered for me:

  • context around 131072
  • thinking enabled
  • preserve_thinking enabled
  • long provider/client timeout
  • child agents disabled
  • no hard max_tokens cap
  • tool calls allowed, but not parallelized into chaos

The timeout matters. At large context, a real prefill can look like a dead provider if the client gives up after 180s. Use long timeouts before blaming the model.
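For testing the endpoint outside Hermes, the same long timeout has to be set on whatever client is used. Below is a sketch using only the Python standard library. The base URL is an assumption (a local vLLM on its default port 8000), and `build_chat_request` and `send` are hypothetical helpers; the `chat_template_kwargs` body field is how the thinking flags are passed to the chat template.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8000/v1"   # assumption: local vLLM, default port
MODEL = "qwen3.6-27b-gptq-pro-4bit"     # must match --served-model-name
TIMEOUT_S = 1800                        # well above a 180s default


def build_chat_request(messages, api_key="<redacted>"):
    """Build a POST request for the OpenAI-compatible chat endpoint."""
    payload = {
        "model": MODEL,
        "messages": messages,
        "chat_template_kwargs": {
            "enable_thinking": True,
            "preserve_thinking": True,
        },
    }
    return urllib.request.Request(
        f"{BASE_URL}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {api_key}",
        },
        method="POST",
    )


def send(request, timeout_s=TIMEOUT_S):
    """Send the request; timeout_s is a socket-level timeout in seconds."""
    with urllib.request.urlopen(request, timeout=timeout_s) as resp:
        return json.load(resp)
```

With a non-streaming request the server stays silent for the whole prefill and decode, so the socket timeout is effectively the budget for the entire reply.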

Here is the Hermes-specific config excerpt, redacted and trimmed to the parts that matter for the Qwen3.6/vLLM setup:

models:
  default: qwen3.6-27b-gptq-pro-4bit
  provider: vllm-qwen36.mylabdomain.com
  context_length: 131072
  extra_body:
    chat_template_kwargs:
      enable_thinking: true
      preserve_thinking: true

providers:
  vllm-qwen36.mylabdomain.com:
    name: vLLM Qwen3.6 27B
    api: https://vllm-qwen36.mylabdomain.com/v1
    api_key: <redacted>
    default_model: qwen3.6-27b-gptq-pro-4bit
    request_timeout_seconds: 1800
    stale_timeout_seconds: 1800

agent:
  max_turns: 240
  gateway_timeout: 1800
  gateway_timeout_warning: 900
  gateway_notify_interval: 600
  gateway_auto_continue_freshness: 3600
  api_max_retries: 3
  reasoning_effort: none # does not disable Qwen thinking
  verbose: true
  image_input_mode: text
  disabled_toolsets:
    - delegation

compression:
  enabled: true
  threshold: 0.85
  target_ratio: 0.2
  protect_last_n: 20
  hygiene_hard_message_limit: 400

context:
  engine: compressor

delegation:
  max_concurrent_children: 0
  child_timeout_seconds: 1800
  max_spawn_depth: 1
  orchestrator_enabled: true
  inherit_mcp_toolsets: true
  default_toolsets:
    - terminal
    - file
    - web

auxiliary:
  compression:
    provider: custom
    model: qwen3.6-27b-gptq-pro-4bit
    base_url: https://vllm-qwen36.mylabdomain.com/v1
    api_key: <redacted>
    timeout: 1800
    extra_body:
      chat_template_kwargs:
        enable_thinking: true

  title_generation:
    provider: custom
    model: qwen3.6-27b-gptq-pro-4bit
    base_url: https://vllm-qwen36.mylabdomain.com/v1
    api_key: <redacted>
    timeout: 1800
    extra_body:
      chat_template_kwargs:
        enable_thinking: true

display:
  streaming: true
  show_reasoning: true
  interim_assistant_messages: true
  tool_progress: all

cron:
  max_parallel_jobs: 1
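A config like this has a few invariants that are easy to break during later edits. The sketch below checks some of them on a plain dict that mirrors a trimmed version of the excerpt. It is not part of Hermes: `check_config` is a hypothetical helper, the 600s minimum is an arbitrary threshold I picked, and loading the YAML into a dict is left out.

```python
MIN_TIMEOUT_S = 600  # assumption: anything shorter is risky for long prefills

# Trimmed mirror of the excerpt above, as a Python dict.
EXAMPLE = {
    "models": {
        "default": "qwen3.6-27b-gptq-pro-4bit",
        "provider": "vllm-qwen36.mylabdomain.com",
    },
    "providers": {
        "vllm-qwen36.mylabdomain.com": {
            "default_model": "qwen3.6-27b-gptq-pro-4bit",
            "request_timeout_seconds": 1800,
        },
    },
    "delegation": {"max_concurrent_children": 0},
    "cron": {"max_parallel_jobs": 1},
    "auxiliary": {
        "compression": {"model": "qwen3.6-27b-gptq-pro-4bit"},
        "title_generation": {"model": "qwen3.6-27b-gptq-pro-4bit"},
    },
}


def check_config(cfg):
    """Return a list of problems; an empty list means all checks passed."""
    problems = []
    main_model = cfg["models"]["default"]
    provider = cfg["providers"][cfg["models"]["provider"]]
    if provider["default_model"] != main_model:
        problems.append("provider default_model differs from models.default")
    if provider["request_timeout_seconds"] < MIN_TIMEOUT_S:
        problems.append("request_timeout_seconds is short for long prefills")
    if cfg["delegation"]["max_concurrent_children"] != 0:
        problems.append("child agents are enabled")
    if cfg["cron"]["max_parallel_jobs"] != 1:
        problems.append("cron jobs may run in parallel")
    for name, aux in cfg.get("auxiliary", {}).items():
        if aux["model"] != main_model:
            problems.append(f"auxiliary.{name} uses a different model name")
    return problems
```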

Cache Discipline

This part is mostly orchestration, not a vLLM flag.

What I can control: stable cron prompts, stable skills, no child-agent swarm, no parallel debug/dev jobs, long timeouts, and one main request at a time.

What I cannot fully control: the exact way Hermes serializes every internal prompt. If volatile session state or auxiliary material lands before stable instructions, prefix reuse will suffer.

So the practical rule is simpler: make the inputs you control boring and repeatable, and avoid side requests competing with the main session.

vLLM prefix caching helps, but it is not a magic persistent cache database. Treat it as an in-memory serving optimization and shape your traffic accordingly.
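The ordering effect is easy to see on a toy example. This sketch compares how much of two consecutive prompts is shared from the start, depending on whether the volatile part comes first or last. Whitespace splitting is only a stand-in for the real tokenizer, and all names and strings here are hypothetical.

```python
def tokens(text):
    """Stand-in tokenizer: whitespace split, NOT the model's real tokenizer."""
    return text.split()


def shared_prefix_len(a, b):
    """Number of leading items that two token sequences have in common."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


STABLE = "You are a coding agent. Follow the repository conventions. Use tools carefully."


def prompt_stable_first(session_state):
    """Stable instructions first, volatile session state last."""
    return tokens(STABLE + " " + session_state)


def prompt_volatile_first(session_state):
    """Volatile session state first, stable instructions last."""
    return tokens(session_state + " " + STABLE)


turn_a = "turn=41 time=10:02"
turn_b = "turn=42 time=10:03"

reuse_good = shared_prefix_len(prompt_stable_first(turn_a), prompt_stable_first(turn_b))
reuse_bad = shared_prefix_len(prompt_volatile_first(turn_a), prompt_volatile_first(turn_b))
```

With the stable text first, the whole instruction block is a common prefix. With the volatile text first, the two prompts diverge at the first token and nothing after that point can be reused.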

Expected Behavior

Healthy run:

  • low TTFT when prefix reuse hits
  • 60-90s TTFT can still happen on large context transitions
  • decode in the high 30s tok/s on my setup
  • stable tool use
  • no recurring full-prefill on every continuation
  • no child-agent swarm

Bad run:

  • repeated full-prefill on every request
  • auxiliary requests firing while the main agent waits
  • model reaches output cap before final answer
  • MTP instability/OOM
  • tool loops caused by sampling/config/model mismatch
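To tell these two cases apart, TTFT and a rough decode rate can be measured on any streamed response. The sketch below works on a generic iterable of chunks, so it is independent of the HTTP client. `measure_stream` is a hypothetical helper, and a streamed chunk is not guaranteed to be exactly one token, so the rate is in chunks per second.

```python
import time


def measure_stream(chunks, clock=time.perf_counter):
    """Consume a stream and return (ttft_seconds, chunks_per_second).

    TTFT is the delay before the first chunk arrives. The rate covers the
    chunks after the first one. Both are None for an empty stream, and the
    rate is None when fewer than two chunks arrive.
    """
    start = clock()
    first = None
    last = None
    count = 0
    for _ in chunks:
        now = clock()
        if first is None:
            first = now
        last = now
        count += 1
    if first is None:
        return None, None
    ttft = first - start
    window = last - first
    rate = (count - 1) / window if window > 0 else None
    return ttft, rate
```

Repeated full prefill shows up as a TTFT that stays high on every continuation of the same session, instead of dropping after the first request.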

Quick Performance Snapshot

This is not a formal benchmark, just a sanity check against the real OpenAI-compatible endpoint.

Small prompt:

  • prompt: 41 tokens
  • output: 384 tokens
  • TTFT: 0.29s
  • decode: 45.3 tok/s

Large prompt, cold-ish prefix:

  • prompt: 41,985 tokens
  • output: 384 tokens
  • TTFT: 38.6s
  • decode: 41.8 tok/s
  • effective prefill: ~1,087 prompt tok/s

Same large prompt immediately repeated:

  • prompt: 41,985 tokens
  • output: 384 tokens
  • TTFT: 1.59s
  • decode: 42.1 tok/s

That last measurement is the important one for agent work: TTFT drops from 38.6s to 1.59s on the same 41,985-token prompt. Prefix caching does not make the model "faster" in the abstract; it makes repeated long-context continuations stop paying the full prefill cost when the prefix remains stable.
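The arithmetic behind these numbers is short enough to write down. The sketch below derives the effective prefill rate and the warm-versus-cold TTFT ratio from the measurements above, then extrapolates a cold prefill at the full 131072-token context. The extrapolation assumes prefill time scales linearly with prompt length, which is optimistic because attention cost grows with context, so treat it as a floor.

```python
PROMPT_TOKENS = 41_985   # large prompt from the snapshot
COLD_TTFT_S = 38.6       # first request, cold-ish prefix
WARM_TTFT_S = 1.59       # same prompt immediately repeated
FULL_CONTEXT = 131_072   # --max-model-len


def prefill_rate(prompt_tokens, ttft_s):
    """Effective prompt tokens per second, treating TTFT as prefill time."""
    return prompt_tokens / ttft_s


def ttft_ratio(cold_s, warm_s):
    """How many times lower TTFT is when the prefix is reused."""
    return cold_s / warm_s


def linear_prefill_estimate(prompt_tokens, rate):
    """Optimistic cold prefill time in seconds under a linear-scaling assumption."""
    return prompt_tokens / rate


rate = prefill_rate(PROMPT_TOKENS, COLD_TTFT_S)
ratio = ttft_ratio(COLD_TTFT_S, WARM_TTFT_S)
full_context_floor_s = linear_prefill_estimate(FULL_CONTEXT, rate)

if __name__ == "__main__":
    print(f"prefill rate: {rate:.0f} tok/s")
    print(f"TTFT ratio cold/warm: {ratio:.1f}x")
    print(f"full-context cold prefill, linear floor: {full_context_floor_s:.0f}s")
```

Even the optimistic floor lands around two minutes for a cold full-context prefill, which leaves little headroom under a 180s client timeout.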

My Practical Takeaway

For this workload, the model was not the blocker. Qwen3.6-27B is strong enough to be useful locally as a coding agent.

The hard part is serving discipline: context size, request sequencing, prefix reuse, timeout policy and avoiding self-inflicted concurrency.

If you only test "does the model answer one prompt?", you are testing the wrong thing.

Test the loop.
