Xavier Rey-Robert

Posted on Jun 19

I Stopped Chasing MTP TPS and Got a Local 27B Agent That Actually Stayed Usable on 24GB VRAM

#ai #llm #vllm #agents

I was already happy with my groxaxo/Qwen3.6-27B-GPTQ-Pro-4bit + vLLM + Hermes recipe: one local agent, one 24GB GPU, long context, tools, thinking enabled, and enough serving discipline that the session could keep working after hours of edits, terminal calls, retries, compression, and context growth.

So when Jackrong released Qwopus3.6-27B-v2, I wanted to see if the same recipe would hold.

I rented an A100, burned a few dollars on quantization, and published the result: XReyRobert/Qwopus3.6-27B-v2-GPTQ-Pro-v1

The artifact

It is a GPTQ-Pro 4-bit quantized derivative of Jackrong/Qwopus3.6-27B-v2. The GPTQ-Pro recipe and much of the practical quantization know-how come from groxaxo, so credit where it is due: https://github.com/groxaxo

The goal was usability: make this model practical in the local coding-agent setup I actually run.

That means the same constraints as before: one local agent, one 24GB GPU, long context, tools, thinking enabled, and a session that keeps working after hours of edits, terminal calls, retries, compression, and context growth.

The target was not a short-prompt benchmark.

It was not:

"Does it answer one prompt?"

The real question is:

"Does the loop hold?"

For reference, the quantization shape is:

- GPTQ-Pro / GPTQModel
- 4-bit
- group size 128
- 256 calibration samples
- 2048 calibration sequence length
- FOEM alpha 0.25 / beta 0.2
- vLLM + GPTQ-Marlin
- target: RTX 3090 / 24GB VRAM

The serving shape

I serve it text-only.

No multimodal path in this setup. No speculative decoding. No parallel request pile-up.

CUDA_VISIBLE_DEVICES=0 vllm serve XReyRobert/Qwopus3.6-27B-v2-GPTQ-Pro-v1 \
  --served-model-name qwopus3.6-27b-gptq-pro-v1 \
  --language-model-only \
  --dtype float16 \
  --quantization gptq_marlin \
  --tensor-parallel-size 1 \
  --max-model-len 131072 \
  --max-num-seqs 1 \
  --kv-cache-dtype fp8_e5m2 \
  --reasoning-parser qwen3 \
  --enable-auto-tool-choice \
  --tool-call-parser qwen3_coder \
  --enable-prefix-caching \
  --gpu-memory-utilization 0.95 \
  --trust-remote-code

max_num_seqs=1 is deliberate.

On one 24GB card, parallelism is not free. Title generation, compression, retries, summaries, and side requests can all compete with the main agent request.

I would rather have one useful request finish cleanly than two requests sabotaging each other.
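
The same discipline can be mirrored on the client side. This is a minimal sketch, assuming an asyncio-based agent client; `SingleSlotGate` and `fake_request` are hypothetical names used for illustration, and the sleep stands in for a real HTTP call to the local endpoint. The server already queues extra requests when `--max-num-seqs 1` is set; the gate only makes the ordering explicit in the client.

```python
import asyncio


class SingleSlotGate:
    """Hypothetical client-side gate: at most one request in flight."""

    def __init__(self) -> None:
        self._slot = asyncio.Semaphore(1)
        self.in_flight = 0
        self.max_in_flight = 0  # highest concurrency actually observed

    async def run(self, make_request):
        # Side jobs (titles, summaries, compression) wait here
        # until the current request has finished.
        async with self._slot:
            self.in_flight += 1
            self.max_in_flight = max(self.max_in_flight, self.in_flight)
            try:
                return await make_request()
            finally:
                self.in_flight -= 1


async def fake_request(name: str, seconds: float) -> str:
    # Stand-in for a real call to the local endpoint.
    await asyncio.sleep(seconds)
    return name


async def demo():
    gate = SingleSlotGate()
    results = await asyncio.gather(
        gate.run(lambda: fake_request("main-agent-turn", 0.02)),
        gate.run(lambda: fake_request("title-generation", 0.01)),
        gate.run(lambda: fake_request("context-compression", 0.01)),
    )
    return results, gate.max_in_flight
```

Running `asyncio.run(demo())` returns the three results in submission order, with the observed concurrency never exceeding one.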

Why no speculative decoding?

Because on this setup it did not improve the thing I care about: end-to-end long-context agent throughput.

This artifact should be treated as non-MTP for vLLM speculative decoding. It keeps some MTP-related config metadata, but the published weight index does not contain actual mtp.* tensors.

I also tested a follow-up artifact with real MTP tensors restored and the large MTP linears quantized. Draft acceptance was real, but on a single RTX 3090 it was still slower than the non-MTP baseline for the useful 100k-131k context range.

For this workload, MTP means:

- more memory pressure
- more serving complexity
- no end-to-end speedup on 1x3090

On larger GPUs or short-prompt workloads, speculative decoding may be worth revisiting.

For a 24GB long-context coding agent, I leave it off until proven otherwise.
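
One way to see how real draft acceptance can still lose end to end is the usual back-of-the-envelope model for speculative decoding. This is a sketch under simplifying assumptions, not a measurement from my runs: each draft token is accepted independently with the same probability, and `step_cost_ratio` (a hypothetical parameter) is the cost of one draft-plus-verify step relative to one plain decode step.

```python
def expected_tokens_per_step(accept_prob: float, num_draft: int) -> float:
    """Expected tokens emitted per verify step.

    Assumes each of the num_draft draft tokens is accepted independently
    with probability accept_prob, and that the verify step always yields
    one token of its own.
    """
    if not 0.0 <= accept_prob < 1.0:
        raise ValueError("accept_prob must be in [0, 1)")
    return (1.0 - accept_prob ** (num_draft + 1)) / (1.0 - accept_prob)


def estimated_speedup(accept_prob: float, num_draft: int,
                      step_cost_ratio: float) -> float:
    """Speedup over plain decoding; values below 1.0 mean a slowdown.

    step_cost_ratio is an assumed input: how much more expensive one
    draft+verify step is than one ordinary decode step.
    """
    return expected_tokens_per_step(accept_prob, num_draft) / step_cost_ratio
```

With one draft token accepted half the time, each step yields 1.5 tokens on average. If the step costs twice as much as a plain decode step, `estimated_speedup(0.5, 1, 2.0)` is 0.75, so the run gets slower even though drafts are being accepted.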

Token/s is not the whole story

The useful question is whether the agent keeps the prefix hot and avoids paying full prefill again and again.

Healthy behavior looks like this:

- large context still works
- prefix-cache hits are common
- TTFT drops when the prefix is reused
- tool calls stay stable
- the main request is not fighting side jobs

Observed on my 3090-class setup:

- average prompt: ~33k tokens
- average TTFT: ~5.7s
- prefill throughput: ~1917 tok/s
- decode estimate: ~43 tok/s
- prefix cache hit ratio: ~83%

That is the metric cluster I care about.

Not "hello world" speed.

Repeated long-context continuations.
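
To show how these numbers relate, here is a rough sketch of prefill time as a function of prompt size, cache hit ratio, and prefill throughput. `estimate_prefill_s` is a hypothetical helper, and the model is deliberately crude: it ignores queueing and scheduling overhead, and it treats averages as if they applied to every request, so it will not reproduce the observed average TTFT exactly.

```python
def estimate_prefill_s(prompt_tokens: int, cache_hit_ratio: float,
                       prefill_tok_per_s: float) -> float:
    """Rough prefill time: only uncached prompt tokens are recomputed."""
    if not 0.0 <= cache_hit_ratio <= 1.0:
        raise ValueError("cache_hit_ratio must be in [0, 1]")
    uncached_tokens = prompt_tokens * (1.0 - cache_hit_ratio)
    return uncached_tokens / prefill_tok_per_s


# Figures from the list above, used as plain averages.
warm = estimate_prefill_s(33_000, 0.83, 1917.0)  # mostly cached prefix
cold = estimate_prefill_s(33_000, 0.0, 1917.0)   # nothing reusable
print(f"warm: {warm:.1f}s, cold: {cold:.1f}s")
```

The warm case comes out near 3 seconds and the cold case near 17 seconds, which is the gap that prefix reuse is buying on every continuation.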

What the vLLM run showed

The more interesting evidence is not a single decode number.

It is the shape of the vLLM metrics over a long agent-style run.

In my 12-hour terminal-bench-style run, the endpoint was not just answering tiny prompts. It was handling repeated tasks, retained context, tool calls, longer generations, fresh starts, retries, and compression.

That is much closer to how a coding agent actually behaves.

The useful signals were:

- queue time stayed low
- prefix-cache reuse recovered after task changes
- finish reasons were mostly normal stops
- length caps and errors were not dominating
- tool-call behavior stayed stable under long-context pressure

The decode number is only half the story. For long-context agents, prefill throughput matters just as much, because every cold or partially cold prompt has to pay that cost before useful generation starts. A decent decode rate with terrible prefill still feels bad. A setup with good prefix reuse and healthy prefill throughput is what makes repeated long-context continuations tolerable.

The panels I would show are the ones that explain the loop, not just speed.

Figure: prompt and generation throughput during the long run. The ~43 tok/s generation rate is one side of it; prompt throughput shows how costly long-context prefill gets when the cache is cold or only partially reusable.

Figure: finish reasons during the long run. Most requests end with normal stop reasons, not length caps or errors. For an agent loop, this matters as much as token throughput.

The prefix-cache graph is the one I care about most.

Figure: prefix-cache hit rate over the long run. Drops are expected when the task shape changes; the useful signal is that cache reuse returns once locality stabilizes.

A new task naturally breaks locality. That is fine. The important part is that when the prompt shape stabilizes again, prefix reuse comes back.

That is the difference between a local agent that keeps working and one that keeps paying full prefill until the session becomes painful.
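
A windowed hit rate can be computed from two scrapes of the server's Prometheus-style `/metrics` text. This is a minimal sketch: the counter names below are assumptions that differ between vLLM versions, so check the actual `/metrics` output before relying on them, and the sample text is made up for illustration.

```python
def read_counter(metrics_text: str, name: str) -> float:
    """Sum every sample of one counter, across label sets."""
    total = 0.0
    for line in metrics_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        metric, _, value = line.rpartition(" ")
        # Match both `name 1.0` and `name{label="x"} 1.0`.
        if metric == name or metric.startswith(name + "{"):
            total += float(value)
    return total


def windowed_hit_rate(before: str, after: str,
                      hits_name: str = "vllm:prefix_cache_hits_total",
                      queries_name: str = "vllm:prefix_cache_queries_total"):
    """Hit rate between two scrapes; None if nothing was queried.

    The default counter names are assumed, not guaranteed.
    """
    queries = read_counter(after, queries_name) - read_counter(before, queries_name)
    hits = read_counter(after, hits_name) - read_counter(before, hits_name)
    return hits / queries if queries > 0 else None


# Made-up scrapes for illustration only.
SCRAPE_A = '''# TYPE vllm:prefix_cache_queries_total counter
vllm:prefix_cache_queries_total{model_name="demo"} 1000.0
vllm:prefix_cache_hits_total{model_name="demo"} 500.0
'''
SCRAPE_B = '''vllm:prefix_cache_queries_total{model_name="demo"} 2000.0
vllm:prefix_cache_hits_total{model_name="demo"} 1330.0
'''
```

Using deltas between scrapes rather than lifetime totals is what makes the recovery after a task change visible: a lifetime ratio smooths the dip away.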

TTFT is part of the serving contract

At 100k+ context, TTFT is not always tiny.

That does not automatically mean the model is slow or broken. Sometimes the server is doing real prefill work. If the prefix is cached, TTFT drops. If the task shape changes, the cache is colder and the server has to pay the cost again.

Figure: TTFT during long-context agent traffic. Spikes are not automatically failures; they often mean the server is doing real prefill work after a transition to a colder prompt.

This is why short client timeouts are toxic for local long-context agents.

For this setup, long provider and gateway timeouts are not cosmetic. They are part of making the agent loop reliable.
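
A rough way to size those timeouts is to budget for a fully cold prefill plus the longest generation you allow. This is a sketch with a hypothetical helper and a hypothetical 4096-token generation cap. It also assumes throughput at full context matches the averages quoted earlier, which is probably optimistic, hence the safety factor.

```python
def min_client_timeout_s(prompt_tokens: int, max_new_tokens: int,
                         prefill_tok_per_s: float, decode_tok_per_s: float,
                         safety_factor: float = 2.0) -> float:
    """Worst-case request time for a non-streaming client, with margin."""
    cold_prefill_s = prompt_tokens / prefill_tok_per_s  # no cache reuse at all
    decode_s = max_new_tokens / decode_tok_per_s
    return safety_factor * (cold_prefill_s + decode_s)


# Full 131072-token context, assumed 4096-token generation cap,
# average throughput figures from the metrics list earlier.
timeout_s = min_client_timeout_s(131_072, 4096, 1917.0, 43.0)
print(f"suggested minimum timeout: {timeout_s:.0f}s")
```

Under these assumptions the budget lands above five minutes, which is well past the default timeout of many HTTP clients and gateways. A streaming client only needs the read timeout to cover the cold prefill part, but the same margin applies.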

Why FP8 KV matters

At this context length, the weights are only part of the memory story.

The KV cache becomes the constraint.

That is why I use:

--kv-cache-dtype fp8_e5m2

I do not treat FP8 KV as magic. It is a practical tradeoff that helps make the long-context setup fit on a 24GB card.
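
The tradeoff is easy to put in numbers with the standard KV-cache sizing formula. This is a sketch only: the layer count, KV head count, and head dimension below are placeholders chosen for round arithmetic, not this model's real configuration. Read the actual values from the model's `config.json`, and count only the layers that keep a KV cache.

```python
def kv_cache_bytes(num_tokens: int, num_kv_layers: int, num_kv_heads: int,
                   head_dim: int, bytes_per_element: int) -> int:
    """Bytes needed to cache keys and values for num_tokens tokens."""
    # Factor 2: one tensor for keys, one for values.
    return (2 * num_tokens * num_kv_layers * num_kv_heads
            * head_dim * bytes_per_element)


# PLACEHOLDER architecture values, NOT the real model config.
PLACEHOLDER = dict(num_kv_layers=16, num_kv_heads=4, head_dim=256)

fp16_bytes = kv_cache_bytes(131_072, bytes_per_element=2, **PLACEHOLDER)
fp8_bytes = kv_cache_bytes(131_072, bytes_per_element=1, **PLACEHOLDER)
print(f"fp16: {fp16_bytes / 2**30:.1f} GiB, fp8: {fp8_bytes / 2**30:.1f} GiB")
```

Whatever the real architecture numbers are, the ratio is fixed: an 8-bit KV cache needs half the bytes per token of a 16-bit one, so the same memory budget holds twice the context.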

Practical takeaway

The working setup is the combination:

- Qwopus3.6-27B GPTQ-Pro
- vLLM GPTQ-Marlin
- text-only serving
- 131k context
- FP8 KV cache
- prefix caching
- max_num_seqs=1
- thinking enabled
- long timeouts
- no speculative decoding
- no child-agent swarm

If you only test one prompt, you are testing the wrong thing.

For coding agents, the real test is the loop.

Anyone else running similar single-GPU 24GB agent loops? Curious what tricks worked for you on prefix caching or KV cache.

@vllm_project @NousResearch
