If you want to reproduce my current local Hermes Agent + Qwen3.6-27B setup, this is the shape I would start from.
Target
One local coding agent.
One 24GB GPU.
Long context.
Tools enabled.
Thinking enabled.
No child agents fighting the main request.
The goal is not peak tok/s on a short prompt. The goal is: can the same agent session keep working after hours of tool calls without losing prefix locality, timing out during prefill, or getting wrecked by auxiliary requests?
Model
This setup is intentionally text-only.
I am not serving the multimodal GGUF variant here. The working configuration uses groxaxo/Qwen3.6-27B-GPTQ-Pro-4bit through vLLM with --language-model-only.
That choice matters. On a 24GB RTX 3090, the text-only GPTQ-Marlin path gave the best balance I found between long context, prefix caching, stable agent behavior and usable decode speed. Vision should be handled by a separate service/model if needed.
vLLM
The useful shape:
CUDA_VISIBLE_DEVICES=0 vllm serve groxaxo/Qwen3.6-27B-GPTQ-Pro-4Bit \
--served-model-name qwen3.6-27b-gptq-pro-4bit \
--dtype float16 \
--quantization gptq_marlin \
--tensor-parallel-size 1 \
--max-model-len 131072 \
--max-num-seqs 1 \
--kv-cache-dtype fp8_e5m2 \
--enable-prefix-caching \
--reasoning-parser qwen3 \
--enable-auto-tool-choice \
--tool-call-parser qwen3_coder \
--gpu-memory-utilization 0.95 \
--max-cudagraph-capture-size 32 \
--language-model-only
I used a recent vLLM nightly, not an old stable image (0.20.1rc1.dev16+g7a1eb8ac2).
The two flags people will want to argue about:
--max-num-seqs 1
--max-model-len 131072
I use max_num_seqs=1 deliberately. With an agent, parallelism is not free. Title generation, context compression, retries, browser checks, tool calls and side jobs can all steal KV/cache locality from the main request. On one 24GB GPU I prefer one useful request over two requests sabotaging each other.
131k context is tight, but workable here. If your service OOMs, reduce context before adding MTP or enforce-eager. I would test 110k, then 100k, then 80k.
What I Did Not Keep
No Qwen3 Next MTP/speculative decoding in the stable config. It caused crashes/OOMs/404s for my useful context sizes.
No enforce-eager. (saves memory but degrades performances)
No explicit Hermes max_tokens: 16384 cap. I removed it because it made debugging truncation and long reasoning/final-answer behavior harder.
Hermes
Point Hermes at the OpenAI-compatible vLLM endpoint and use the same served model name: qwen3.6-27b-gptq-pro-4bit.
The settings that mattered for me:
- context around 131072
- thinking enabled
- preserve_thinking enabled
- long provider/client timeout
- child agents disabled
- no hard max_tokens cap
- tool calls allowed, but not parallelized into chaos
The timeout matters. At large context, a real prefill can look like a dead provider if the client gives up after 180s. Use long timeouts before blaming the model.
Here is the Hermes-specific config excerpt, redacted and trimmed to the parts that matter for the Qwen3.6/vLLM setup:
models:
default: qwen3.6-27b-gptq-pro-4bit
provider: vllm-qwen36.mylabdomain.com
context_length: 131072
extra_body:
chat_template_kwargs:
enable_thinking: true
preserve_thinking: true
providers:
vllm-qwen36.mylabdomain.com:
name: vLLM Qwen3.6 27B
api: https://vllm-qwen36.mylabdomain.com/v1
api_key: <redacted>
default_model: qwen3.6-27b-gptq-pro-4bit
request_timeout_seconds: 1800
stale_timeout_seconds: 1800
agent:
max_turns: 240
gateway_timeout: 1800
gateway_timeout_warning: 900
gateway_notify_interval: 600
gateway_auto_continue_freshness: 3600
api_max_retries: 3
reasoning_effort: none # does not disable Qwen thinking
verbose: true
image_input_mode: text
disabled_toolsets:
- delegation
compression:
enabled: true
threshold: 0.85
target_ratio: 0.2
protect_last_n: 20
hygiene_hard_message_limit: 400
context:
engine: compressor
delegation:
max_concurrent_children: 0
child_timeout_seconds: 1800
max_spawn_depth: 1
orchestrator_enabled: true
inherit_mcp_toolsets: true
default_toolsets:
- terminal
- file
- web
auxiliary:
compression:
provider: custom
model: qwen3.6-27b-gptq-pro-4bit
base_url: https://vllm-qwen36.mylabdomain.com/v1
api_key: <redacted>
timeout: 1800
extra_body:
chat_template_kwargs:
enable_thinking: true
title_generation:
provider: custom
model: qwen3.6-27b-gptq-pro-4bit
base_url: https://vllm-qwen36.mylabdomain.com/v1
api_key: <redacted>
timeout: 1800
extra_body:
chat_template_kwargs:
enable_thinking: true
display:
streaming: true
show_reasoning: true
interim_assistant_messages: true
tool_progress: all
cron:
max_parallel_jobs: 1
Cache Discipline
This part is mostly orchestration, not a vLLM flag.
What I can control: stable cron prompts, stable skills, no child-agent swarm, no parallel debug/dev jobs, long timeouts, and one main request at a time.
What I cannot fully control: the exact way Hermes serializes every internal prompt. If volatile session state or auxiliary material lands before stable instructions, prefix reuse will suffer.
So the practical rule is simpler: make the inputs you control boring and repeatable, and avoid side requests competing with the main session.
vLLM prefix caching helps, but it is not a magic persistent cache database. Treat it as an in-memory serving optimization and shape your traffic accordingly.
Expected Behavior
Healthy run:
- low TTFT when prefix reuse hits
- 60-90s TTFT can still happen on large context transitions
- decode around high-30s tok/s on my setup
- stable tool use
- no recurring full-prefill on every continuation
- no child-agent swarm
Bad run:
- repeated full-prefill on every request
- auxiliary requests firing while the main agent waits
- model reaches output cap before final answer
- MTP instability/OOM
- tool loops caused by sampling/config/model mismatch
Quick Performance Snapshot
This is not a formal benchmark, just a sanity check against the real OpenAI-compatible endpoint.
Small prompt:
- prompt: 41 tokens
- output: 384 tokens
- TTFT: 0.29s
- decode: 45.3 tok/s
Large prompt, cold-ish prefix:
- prompt: 41,985 tokens
- output: 384 tokens
- TTFT: 38.6s
- decode: 41.8 tok/s
- effective prefill: ~1,087 prompt tok/s
Same large prompt immediately repeated:
- prompt: 41,985 tokens
- output: 384 tokens
- TTFT: 1.59s
- decode: 42.1 tok/s
That last line is the important one for agent work. Prefix caching does not make the model "faster" in the abstract; it makes repeated long-context continuations stop paying the full prefill cost when the prefix remains stable.
My Practical Takeaway
For this workload, the model was not the blocker. Qwen3.6-27B is strong enough to be useful locally as a coding agent.
The hard part is serving discipline: context size, request sequencing, prefix reuse, timeout policy and avoiding self-inflicted concurrency.
If you only test "does the model answer one prompt?", you are testing the wrong thing.
Test the loop.
Top comments (0)