DEV Community: byeongsoo kang

The Open-Model Cost Chart Everyone's Sharing Is API Prices. Here's What Self-Hosting Actually Gets You (Measured)

byeongsoo kang — Tue, 23 Jun 2026 04:17:41 +0000

There's a chart going around: intelligence on the y-axis, cost to run on the x-axis, and a green "most attractive" quadrant in the upper left where high intelligence meets low cost. The takeaway everyone's posting is that the green quadrant is almost entirely open source. DeepSeek, GLM, MiniMax, Kimi, Qwen all show up smart-enough and cheap, while the closed frontier models sit expensive on the right.

It's a real trend and the chart isn't wrong. But read the x-axis label: cost to run is a blended API price. That number answers "what does it cost to call this model through somebody's API," which is a different question from "what does it cost to run this yourself." For those of us who self-host, the second question is the whole point, and the chart quietly hides the answer.

So here's what it skips — measured on the two cards I own.

The catch: you can't self-host the green quadrant

The open models winning that value quadrant aren't small. Take GLM-5.2, the one everyone points to when they say the open frontier finally caught up. It's coding-first and currently the strongest open-weight model on the coding benchmarks: a ~744B-parameter MoE with about 40B active per token. Unlike the closed frontier models, its weights are MIT-licensed. The cheap API price (around $1.40 in and $4.40 out per million tokens, roughly a sixth of GPT-5.5) is the headline. But the thing that sets it apart is the other half of the pitch: no per-token fee, weights on your own box, you can run it yourself.

Then you try to. 744B at Q4 is roughly 372GB of weights. The other value-quadrant models are the same class, DeepSeek and Kimi running from the high hundreds of billions up toward a trillion. None of that fits a desktop GPU, or two, or four. "Self-hostable" here means a server with several 80GB datacenter cards, the exact infra headache the "open is cheap" story was supposed to spare you. So the option is real, it's just not real on the hardware most people own.
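The 372GB figure is parameters times bits per weight. Here is a minimal sketch of that arithmetic, assuming a flat 4 bits per weight (real Q4-class quants store scales too and often land higher) and ignoring KV cache and runtime overhead. A MoE has to keep every expert resident, so the full 744B counts, not the ~40B active per token. The name `weight_gb` is made up for this example.

```python
def weight_gb(params_billions: float, bits_per_weight: float) -> float:
    """Approximate weight size in GB (1 GB = 1e9 bytes), weights only."""
    return params_billions * bits_per_weight / 8


total = weight_gb(744, 4)        # 744B parameters at a flat 4 bits each
print(total)                     # weights alone, in GB
print(total / 80)                # how many 80GB cards' worth of VRAM that is
print(weight_gb(27, 4))          # a dense 27B at the same flat 4 bits
```

Even before context and overhead, the weights alone need more than four 80GB cards, which is why the post calls this a server problem and not a desktop one.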

So when you self-host, you don't get the green quadrant. You get whatever fits on the card in front of you, which is a tier below. The useful question is: how far below, and is it good enough? That part I can answer with numbers instead of a chart.

What actually runs on a consumer card

Two tiers, both single consumer GPUs, models running fully on the GPU through Ollama. These are my own measured runs from earlier write-ups, pulled into one place:

GPU (used price)	best model that fits well	gen tok/s	prefill tok/s	context headroom
11GB — GTX 1080 Ti (~$200)	Gemma 4 12B QAT	~32	~315	12B at 16k with q8 KV
	Qwen3 8B	~46	~1390	comfortable
24GB — RTX 3090 (~$800)	Qwen3.6 27B Q4 + MTP	~75	—¹	dense 27B fits in VRAM

¹ Prefill doesn't reduce to one number on this card; it scales hard with context. At 64k the first token took about 59s. See "Long context is the real tax" below.

The 11GB card tops out comfortably at a 12B. A dense 27B doesn't fit on it at all. The 24GB card moves you up to a dense 27B at a fast ~75 tok/s once speculative decoding is on, and that's the sweet spot: a 27B is a real step up in capability from a 12B, and it still lives entirely in VRAM.

On the intelligence chart, those are the mid-tier models, well below the green-quadrant frontier-open ones. So that's the real answer to "what does self-hosting get you": solid, useful, a tier under the cheap-API winners.

What the API number hides

Three costs that never show up as a dollar figure on that chart, and all three bit me at some point.

The VRAM ceiling is a wall, not a slope. A model either fits or it doesn't. The 27B that flies on a 3090 simply won't load on an 11GB card — no "a bit slower" middle ground at the boundary, it just fails, and your only move is a smaller model or a bigger card.

Spilling a MoE to system RAM looks like the obvious escape hatch when a model is too big. It isn't. I tried it with a 35B-A3B across two 1080 Tis and got about 17 tok/s — once the experts get mmapped to system RAM the whole thing goes memory-bandwidth-bound, and a CPU nearly tied it. A 12B living entirely in VRAM often feels snappier than a 35B that spills, which isn't what the parameter count would tell you.

The 3090's catch shows up at long context. It generates fast, but prompt processing scales hard: at 64k tokens the first token took about 59 seconds before generation even started. That latency never appears in a tokens-per-dollar number, and for anything retrieval-heavy it's the thing you feel.

So is it worth self-hosting?

If you're chasing the cheapest intelligence-per-token, the chart is right and the answer is often no. A cheap API to something like GLM-5.2 will beat your 3090 on raw capability per dollar, because you're not paying to keep a card idle between prompts, and you're getting a 744B model instead of a 27B.

Self-hosting is a bad way to win the cost game. What it buys you is the stuff that axis never measures: your data stays on the box, it runs offline, you can fine-tune and pin versions, and nobody deprecates a model out from under you. That last one is less abstract than it sounds. A weight already sitting on your disk under MIT is the one version nobody can reprice, retire, or region-lock on you later, which is part of why the open releases are starting to get talked about as insurance and not just a cheaper API. I run a local research assistant over my own papers for exactly that reason, and "a tier below the frontier" is completely fine for it. That's what you're paying for — privacy, control, a version nobody can pull out from under you. The per-token math is a side issue.

So that's the bit the chart leaves out. On API the open models do win on price — no argument there. But once the weights are on your own card you drop a tier, you hit the VRAM wall, and long prompts crawl. Nobody self-hosting at home is doing it to shave a few dollars a month. They do it because the weights are theirs, sitting on a disk nobody can reprice or retire.

Caveats

These are two cards I actually own, an 11GB Pascal and a 24GB Ampere, single-GPU, Ollama, the specific quants from my earlier posts. I don't have a 4090, a 5090, or a multi-card rig, so I can't speak to those tiers and I'm not going to guess at them. The model sizes for the big MoEs are approximate; if you're quoting them, check the current model cards. Numbers are from my own runs and are stable, not claimed to the decimal.

I Added a Verify Layer to My Local RAG to Catch Hallucinations. It Caught Me Being Wrong Twice About My Own Corpus

byeongsoo kang — Fri, 19 Jun 2026 04:18:14 +0000

I'm building a small, fully-local research assistant: a RAG over my own papers, running on Ollama, nothing leaving the box. The risk that actually worries me isn't speed or cost. A research tool that cites a wrong number while sounding sure of itself is worse than no tool, because you'll believe it.

Andrej Karpathy's llm-wiki note had a piece I kept thinking about. Instead of re-retrieving from scratch each query, you have the model build a persistent wiki, and during ingest a lint pass checks the pages against each other for contradictions. I wanted something adjacent at answer time: after the RAG drafts an answer, break it into claims and check each against the sources, then flag whatever a source doesn't actually support.

I should be precise about what that is, since the post is partly about citation accuracy. Karpathy's lint compares wiki pages to each other during ingest. What I built compares each answer-claim to its retrieved passage at answer time. That's groundedness (or faithfulness) checking, the same family as the RAGAS faithfulness metric and various self-check methods. The idea to bolt it on came from llm-wiki; the mechanism is standard groundedness checking, run locally on a small model.

I built it, measured it, and the honest version of the result is better than a clean win would have been. It includes the part where I was wrong twice about what was in my own corpus.

The verify layer

About 80 lines on top of my existing RAG. After the normal retrieve-and-answer step:

Decompose the draft into atomic claims (one local LLM call).
For each claim, an LLM-as-judge call returns {supported, cite, why}. Supported only when a specific excerpt states it.
Flag the unsupported ones.

Here's one verdict verbatim, so it's concrete. The claim was a deliberately corrupted "AUROC is 0.92," checked against the passage that reports 0.804:

{ "supported": false,
  "cite": null,
  "why": "passage states AUROC 0.804; 0.92 does not appear" }

Cost, since this runs on a local 8B and it matters: verify turns one answer call into roughly N+2 calls (the original answer, one to decompose, and one per claim to check). For a five-claim answer on my 1080 Ti that's about 15-20 extra seconds. Not free, not painful.
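The decompose, judge, and flag steps above can be sketched roughly like this. It is not the code from the post's repo: `verify`, `flagged`, the prompt wording, and the `llm` callable (anything that maps a prompt string to the model's raw text, such as a thin wrapper around a local model call) are assumptions for illustration, and the JSON parsing is deliberately naive.

```python
import json
from typing import Callable


def verify(draft: str, context: str, llm: Callable[[str], str]) -> list[dict]:
    """Return one verdict per claim. `llm` is a hypothetical prompt -> text callable."""
    # One call to decompose the draft into atomic claims (expects a JSON list).
    claims = json.loads(llm(
        "Split this answer into atomic claims. "
        "Reply with a JSON list of strings.\n\n" + draft
    ))
    verdicts = []
    for claim in claims:
        # One judge call per claim, checked only against the retrieved context.
        raw = llm(
            "Judge the claim against the context. Mark it supported only if a "
            "specific excerpt states it. Reply with JSON keys supported (bool), "
            "cite (excerpt or null), why (string).\n\n"
            f"CLAIM: {claim}\n\nCONTEXT:\n{context}"
        )
        verdicts.append({"claim": claim, **json.loads(raw)})
    return verdicts


def flagged(verdicts: list[dict]) -> list[dict]:
    """Claims the judge did not mark as supported."""
    return [v for v in verdicts if not v.get("supported", False)]
```

For N claims this adds N+1 model calls on top of the original answer call, which matches the N+2 total above. A real model will sometimes wrap its JSON in extra text, so a production version would need more forgiving parsing.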

First eval, and almost shipping a false finding

To see if it catches fabrication I wrote some questions I assumed weren't answerable from my three-paper corpus. One asked for the AUROC of the synergy model "on the held-out test set." The baseline answered "0.804 [1]," and my verify layer passed it. I wrote it up as a miss: verify let a fabricated statistic through.

Then I grepped my own corpus for 0.804. It was there, seven times. So I rewrote it the other way: the number was real, the model was right, verify passed it correctly. A tidier story, and I almost shipped that one too.

It's also wrong. Look at what the passages actually say. Every 0.804 is reported as a GroupKFold cross-validation result, and one line states it outright: "No separate held-out test set was used due to the limited sample size." My question asked for a held-out test AUROC. There is no held-out test set. The model took a real cross-validation number and pinned it to an evaluation that doesn't exist, and verify passed it because the digits 0.804 were sitting right there in the context.

So I was wrong twice about my own corpus, in opposite directions, before landing on what happened: a right-number-wrong-context hallucination that claim-checking sailed straight past. The first lesson is awkward but worth stating plainly: you can't measure hallucination without ground truth, and "the number is real" is not the same as "the answer is right."

Doing it properly

I threw out the vibe-based questions and built a controlled benchmark: eight pairs of claims. Each pair has one true claim (a fact I confirmed by grep) and one false claim, the same statement with a number or entity corrupted to something I confirmed was absent from the corpus. AUROC 0.804 against AUROC 0.92. "pathway membership and gene essentiality scores" against "patient age, BMI, and smoking status." "implicates the hippocampus, amygdala, prefrontal cortex" against "leaves the hippocampus and amygdala unaffected."

That labeling step earned its keep immediately. Three of my first-draft "false" claims used terms (cerebellum, CRISPR, Loewe) that grep found were actually in the corpus, so they weren't false at all. The same mistake as the AUROC, caught before it counted.

Then I feed the verifier the context that supports the true claim and ask it to judge both claims against that same context.
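The grep-based labeling step can be approximated in a few lines. This is a sketch under assumptions, not the author's script: `label_pair` is a hypothetical helper, the corpus is passed in as one string, and matching is a plain case-insensitive substring test.

```python
def label_pair(corpus: str, true_term: str, false_term: str) -> str:
    """Sanity-check one benchmark pair against the corpus text.

    The true claim's key term must be present, and the corrupted term
    must be absent, otherwise the pair is not a valid true/false pair.
    """
    text = corpus.lower()
    if true_term.lower() not in text:
        return "reject: true term not found in corpus"
    if false_term.lower() in text:
        return "reject: 'false' term is actually in the corpus"
    return "ok"


corpus = "GroupKFold AUROC was 0.804. Synergy was scored with the Loewe model."
print(label_pair(corpus, "0.804", "0.92"))
print(label_pair(corpus, "0.804", "Loewe"))
```

A substring match only proves a term is present or absent. As the held-out AUROC episode shows, presence of the digits says nothing about whether they are attached to the right evaluation, so this guards the labels, not the claims.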

The benchmark result

Both the lenient prompt and a stricter "numbers must match verbatim" prompt scored the same: 8 of 8 fabrications caught, 0 of 8 true claims wrongly flagged.

Two caveats keep that honest. With n=8 it's a point estimate, not a guarantee; the Wilson 95% interval on 8/8 runs from about 67% to 100%, so read it as "no failures in eight trials." And note the setup is the easiest possible one: the supporting passage is guaranteed present and the corrupted value is guaranteed absent. This measures the judge when retrieval is already perfect, not the pipeline. The strict prompt changing nothing isn't evidence it's useless either, because no pair in the set would separate the two prompts. Every corruption is far from its real value. 0.804 vs 0.81 would be the test, and I don't have one.
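For anyone who wants to check the interval, here is a small sketch of the standard Wilson score interval. The function name is arbitrary and z is fixed at the usual 95% value.

```python
import math


def wilson_interval(successes: int, n: int, z: float = 1.959964) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (default: 95%)."""
    p = successes / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    # Clamp to [0, 1] to absorb floating-point noise at the edges.
    return max(0.0, center - half), min(1.0, center + half)


low, high = wilson_interval(8, 8)
print(f"{low:.3f} to {high:.3f}")  # 0.676 to 1.000
```

With 8 of 8 the lower bound simplifies to n / (n + z²), about 0.676, which is the "about 67%" quoted above.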

So: given perfect context and a blatant corruption, a local 8B model judges reliably. Good to know, and not the hard case.

The hard case the benchmark can't reach

The benchmark used false claims I wrote by hand. That sidesteps the scenario I actually care about for a co-scientist: a hallucination the model itself produces, judged by a verifier that is the same model and may share its blind spots.

So I ran a second eval. Six false-premise or absent-specific questions, ground-truth checked. The model's own draft is the unit under test, and I judge each draft twice: once with the same model (qwen3:8b) and once with a different one (gemma4 12B QAT). One caveat I owe you: that judge is larger than the 8B answerer, so any cross-model gain here is "different and bigger," not cleanly "different." A same-size cross-model check would isolate that, and I didn't run one.

On four of six the model abstained correctly. Asked the dose of meropenem (never in the corpus), the sample size of a meta-analysis (the stress paper is a scoping review with none), it said the context doesn't contain it. Grounding works when the gap is obvious.

On the other two it fabricated, and both were right-number-wrong-context:

"AUROC on the held-out test set" → "0.804" again. The held-out AUROC the question presumes does not exist.
"AUROC after removing the pathway-membership features" → "0.627." That 0.627 is real, but it's the number for removing ribosome-targeting combinations, a different ablation. The model grabbed a nearby real value and reattributed it.

The same-model judge caught zero of those two. It rubber-stamped both of its own fabrications. The cross-model judge caught one: gemma flagged the 0.627, with the reason "that AUROC resulted from removing ribosome-targeting combinations." A different model checked what the number was attached to; the original model, having produced the misattribution, did not re-examine it. Neither judge caught the held-out one. The digits 0.804 are real, so there's nothing in the number to contradict, and catching it means knowing that no held-out set exists. When I checked, the sentence that says so, "no separate held-out test set was used," wasn't even in the retrieved context. The model never saw the thing that made the premise false.

That's the measured core of this whole exercise, and the failures split three ways, not two. A value simply absent from the context (0.92) gets caught reliably, even when the same model judges. A value that's real but pinned to the wrong thing (0.627, from a different ablation) slips past the same-model judge and is only sometimes recovered by a second one. A false premise, like a held-out AUROC when no held-out set exists, gets through both, and that one bottomed out in retrieval since the sentence refuting it never reached the context. The layer I built reliably catches only the first kind. A model refereeing its own output inherits its own blind spots; a second model recovers some of the misattributions, not the premise-level error.

What the flags were actually pointing at

One more thread from the messy first eval. Back then the verifier had flagged a couple of true claims ("prefrontal cortex," "gene essentiality scores") as unsupported. After the clean benchmark, where it never did that, those looked inconsistent.

The resolution: verify checks claims against the retrieved context, not the whole corpus. Those claims were true and in the corpus, but the passage proving them wasn't among the chunks retrieved for that question. The verifier said "not in what you gave me," correctly.

A flag, then, can mean three things, and at answer time you can't tell them apart without ground truth: a retrieval miss, a real fabrication, or a true fact that lives outside the corpus. In the cases from my first eval that I could actually check, the flags were retrieval misses, which is why "re-retrieve and re-check" is a better default reaction than "delete the claim." I didn't measure the proportion, so I won't put a number on how often each happens.

Takeaways

What I actually measured, on one corpus with qwen3:8b:

Given good context and a blatantly corrupted value, the verifier is reliable (8/8, with the n=8 caveat).
The model abstains on clearly-absent information (4/6 here).
Both fabrications were real-number-wrong-context: one a misattributed value (0.627, the wrong ablation), one a false premise (a held-out AUROC where no held-out set exists). The same-model judge caught neither (0/2); it rubber-stamped its own output.
The cross-model judge caught the misattributed value but not the false premise (1/2). The premise error was the harder kind, and the sentence that would refute it wasn't even retrieved.

What I suspect but did not measure, so take it as an impression: in a grounded RAG, most of what looks like the model inventing facts is really retrieval not surfacing the passage, or the prompt not grounding hard enough. If that holds, the leverage is in retrieval and grounding, not a bigger model. I'd want a labeled run before saying it harder.

And the practical version: claim-checking reliably catches only values that are absent from the context. It misses a real number attached to the wrong question, and misses a false premise outright. Use a different model to judge than to answer, not on the strength of one recovered case out of two, but because a model that produced a misattribution won't re-examine it while an unrelated one cross-checks what the number is attached to. Expect it to recover some misattributed values, not false premises. Treat a flag as "go re-retrieve," not "delete." And you cannot evaluate any of this without ground-truth labels. I almost shipped two false findings about my own corpus, and grep fixed both.

Limitations are most of the honesty here: one small corpus, an 8B answerer, eight hand-built pairs plus six probes, no held-out scoring, no subtle-but-true corruptions like "significant" vs "trending." This is a first measurement, not a verdict.

Credit: the inspiration is Karpathy's llm-wiki pattern and the full desktop implementation at nashsu/llm_wiki. What I built is plain groundedness checking moved to answer time, run on a local box. The code and eval scripts are in my paper-rag repo if you want to poke holes in the setup. Please do.

What Actually Runs Well on a GTX 1080 Ti in 2026 (Measured)

byeongsoo kang — Fri, 12 Jun 2026 07:27:00 +0000

The "GPU poor" narrative has flipped this year: 24GB-and-below cards are suddenly fine, thanks to quantization-aware training (near-bf16 quality at Q4 size) and MTP (free decode speed). But most of those posts are running 3090s and 4080s. I wanted the floor: what actually runs well on a GTX 1080 Ti — an 8-year-old card with 11GB — in 2026? So I measured it.

The numbers

Single GTX 1080 Ti (11GB), Ollama with flash-attention, num_ctx 8192, all running 100% on the GPU (no CPU offload):

model	size	gen tok/s	prefill tok/s	VRAM
Qwen3 8B	5.2 GB	~46	~1390	6.6 GB
Gemma 4 12B QAT (unsloth UD-Q4_K_XL)	6.7 GB	~32	~316	8.0 GB
Gemma 4 12B QAT (Google)	7.0 GB	~31	~314	8.2 GB
Gemma 4 12B (regular Q4)	7.1 GB	~29	—	8.4 GB

The headline: a 12B model runs at ~30 tok/s on an 8-year-old card, comfortably inside 11GB, fully on the GPU. That's faster than most people read, and well into "usable for daily work" territory. An 8B sits around 46 tok/s with much faster prefill.

A few things worth noting

QAT buys a small but real speed edge. The Gemma 4 12B QAT builds (~31–32 tok/s) come in a bit faster than the regular Q4 (~29) and slightly smaller — about a 9% gen-speed gain, consistent with what I measured earlier on the same card. Not magic, but free.

Prefill scales with size, hard. Qwen3 8B processes the prompt at ~1390 tok/s; the Gemma 12Bs at ~315 — roughly 4× slower. On a Pascal card the prompt-processing stage is where you feel the model size, so for long prompts the smaller model's lead widens beyond the gen-speed gap. (This is the same prefill wall story, scaled to old hardware.)
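To see how the prefill gap compounds, here is a rough request-time model using the table's rates. The assumptions are loud ones: prefill throughput is treated as constant (the rates were measured at num_ctx 8192), the 300-token answer length is a placeholder, and `request_seconds` is a name invented for this sketch.

```python
def request_seconds(prompt_tokens: int, output_tokens: int,
                    prefill_tps: float, gen_tps: float) -> float:
    """Prompt processing time plus generation time, in seconds."""
    return prompt_tokens / prefill_tps + output_tokens / gen_tps


OUTPUT_TOKENS = 300  # hypothetical answer length
for prompt in (512, 8192):
    qwen = request_seconds(prompt, OUTPUT_TOKENS, prefill_tps=1390, gen_tps=46)
    gemma = request_seconds(prompt, OUTPUT_TOKENS, prefill_tps=315, gen_tps=32)
    print(f"{prompt:>5}-token prompt: 8B {qwen:5.1f}s, 12B {gemma:5.1f}s, "
          f"ratio {gemma / qwen:.2f}x")
```

Under these assumptions the 12B is about 1.6× slower end to end on a short prompt and close to 2.9× slower on an 8k prompt, even though its generation speed is only about 1.4× slower.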

12B is the comfortable ceiling for one card. A 12B Q4 lands around 8GB and leaves room for a real context. The QAT 12B even fits 16k context on an 8GB card with KV-cache quantization, so an 11GB 1080 Ti has comfortable headroom.

Where the ceiling actually is

A dense 27B (Q4 ≈ 17GB) does not fit one 11GB card — you either split across two cards or it spills to system RAM and crawls. And spilling is worse than it sounds on this class of hardware: I benchmarked the 35B-A3B MoE on 2× 1080 Ti and it ran at only ~17 tok/s, because the experts get mmapped to system RAM and the whole thing goes memory-bandwidth-bound — a CPU nearly tied it. So "more VRAM via a second old card" helps you fit bigger models, but the bandwidth ceiling means a dense 12B that lives entirely on one card often feels snappier than a 35B that spills.
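A back-of-envelope way to see why spilling hurts: during decode, every token has to read the active weights once, so memory bandwidth sets a ceiling on tok/s. The sketch below is a toy upper bound, not a measurement. It assumes a flat 4 bits per weight, that all active weights are read from a single memory pool, and a placeholder 40 GB/s for system RAM that was not measured on the author's machine. The 484 GB/s figure is the GTX 1080 Ti's rated VRAM bandwidth.

```python
def decode_ceiling_tps(active_params_billions: float, bits_per_weight: float,
                       bandwidth_gb_s: float) -> float:
    """Upper bound on decode tok/s if each token reads all active weights once."""
    gb_per_token = active_params_billions * bits_per_weight / 8
    return bandwidth_gb_s / gb_per_token


# 35B-A3B: roughly 3B parameters active per token.
print(decode_ceiling_tps(3, 4, 484))  # experts served from 1080 Ti VRAM
print(decode_ceiling_tps(3, 4, 40))   # experts served from system RAM (assumed 40 GB/s)
```

With those placeholder numbers the RAM-bound ceiling is in the mid-20s of tok/s, the same order as the ~17 tok/s measured, while the VRAM-bound ceiling is more than ten times higher.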

The takeaway

If you've got a 1080 Ti gathering dust: in 2026 it runs a 12B at ~30 tok/s and an 8B at ~46, fully on the GPU, no cloud, no rate limits. QAT made the quality competitive and the size friendly; the card was always fast enough for this once the models got small enough. The "GPU poor are eating well" story reaches all the way back to 2017 silicon — you just stay at or below 12B and let it sit entirely in VRAM.

Caveats

Single GTX 1080 Ti, single request, Ollama + flash-attention, num_ctx 8192, the specific quants above. Gen tok/s from the server's own token timings; numbers are stable but not claimed to ±0.1.
The 35B-A3B 2× 1080 Ti figure is from my earlier write-up, not this run.
27B+ "doesn't fit" assumes a single 11GB card and Q4-class quant; a second card or heavier KV-quant changes the math (at a speed cost).

MTP Isn't Always a Win: 1.95x on My 3090, but Speculative Decoding Is Hardware-Dependent

byeongsoo kang — Thu, 11 Jun 2026 07:26:13 +0000

In my MTP post, speculative decoding roughly doubled Qwen3.6-27B generation on a 3090. It's tempting to read that as "turn on MTP, go faster." So I measured it on a different model — Gemma 4 12B QAT — and it's a big win on my 3090. But the same model with the same MTP draft runs slower on an M1 Max. MTP isn't a free switch; it's a hardware-dependent lever.

My 3090 numbers

Gemma 4 12B QAT (UD-Q4_K_XL) + an MTP draft head (Q8_0-MTP, a 0.47 GB nextn head, not a full second model), single RTX 3090, decode tok/s, 3 runs each:

config	mean tok/s	speedup	draft acceptance
baseline (no MTP)	85.9	1.00×	—
MTP `n-max 2`	159.4	1.86×	0.77
MTP `n-max 3`	167.4	1.95×	0.69

A clean ~1.9× on the 3090, and unusually stable — run-to-run CV was under 0.5% (my earlier Qwen3.6-27B MTP runs were far noisier at 5–7%, needing a dozen runs; Gemma here settled in three). Same n-max 3 sweet spot as Qwen, same counterintuitive shape: deeper speculation has lower per-token acceptance (0.69 vs 0.77) but higher throughput, because more tokens land per verify step. Per-category, the win ranged 1.8×–2.2× (RAG and coding best at ~2.2×). The whole thing fit in ~8 GB of VRAM.
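The "lower acceptance, higher throughput" shape falls out of a simple model. The sketch assumes each drafted token is accepted independently at the reported rate and that the chain stops at the first rejection. That is a simplification: the server's acceptance figure may be defined differently, and `tokens_per_verify` is a name made up here.

```python
def tokens_per_verify(acceptance: float, n_max: int) -> float:
    """Expected tokens emitted per verify pass.

    Counts the one token the main model always produces plus the expected
    number of accepted drafts: 1 + a + a^2 + ... + a^n_max.
    """
    return sum(acceptance ** k for k in range(n_max + 1))


print(tokens_per_verify(0.77, 2))  # n-max 2, acceptance 0.77
print(tokens_per_verify(0.69, 3))  # n-max 3, acceptance 0.69
```

Under this model n-max 3 lands about 2.49 tokens per verify pass against about 2.36 for n-max 2, despite the lower per-token acceptance. The measured speedups (1.86× and 1.95×) sit below those figures because drafting and verifying are not free.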

The same model, slower on an M1 Max

Here's the part worth the post. A cross-hardware benchmark (another tester's runs, same speed_bench.py harness and --jinja settings, so it's comparable) put Gemma 4 12B QAT+MTP at:

hardware	MTP speedup (n-max 2)
RTX 3090 (mine)	1.86×
RTX 5070 Ti laptop	1.74×
M1 Max (16", 64 GB)	0.87× — slower

Same model, same MTP draft, and the M1 Max actually loses ~13% by turning MTP on.

Why MTP can make you slower

Speculative decoding wins when verifying a batch of drafted tokens is cheap relative to generating them one at a time. The draft head proposes several tokens, the main model checks them in one parallel pass, and accepted drafts give you multiple tokens for about the cost of one verify. That only pays off when you have spare compute and the verify pass is cheap:

Capable CUDA GPU (3090, 5070 Ti): lots of compute headroom, the parallel verify is cheap, the drafts land → 1.7–1.9×.
Apple Silicon (M1 Max), unified memory: running the draft adds compute the architecture doesn't have to spare relative to its memory bandwidth, and that overhead outweighs the parallel-verify gain. Net result: slower than just decoding normally.

So MTP's speedup is a function of the draft-cost-to-verify-cost ratio on your specific hardware, not a property of the technique. Capable CUDA GPU: yes. Apple Silicon: measure first — it can backfire.
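One way to make the cost-ratio argument concrete is a toy model where every cost is expressed in units of one normal decode step. Everything numeric below is hypothetical and chosen only to show both regimes. None of it is measured, and `mtp_speedup` is not part of llama.cpp or any benchmark harness.

```python
def mtp_speedup(tokens_per_step: float, n_max: int,
                draft_cost: float, verify_cost: float) -> float:
    """Speedup over plain decoding.

    tokens_per_step: expected tokens emitted per speculative step.
    draft_cost:      cost of drafting one token, relative to one decode step.
    verify_cost:     cost of the batched verify pass, relative to one decode step.
    """
    return tokens_per_step / (n_max * draft_cost + verify_cost)


# Hypothetical: spare compute, so verifying a small batch costs about one step.
print(mtp_speedup(2.36, n_max=2, draft_cost=0.05, verify_cost=1.15))
# Hypothetical: no spare compute, so the batched verify costs well over two steps.
print(mtp_speedup(2.36, n_max=2, draft_cost=0.05, verify_cost=2.6))
```

The same acceptance behaviour yields a speedup above 1 in the first case and below 1 in the second. The technique has not changed, only the price of the verify pass on that hardware.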

Honest caveats

Only the 3090 numbers are mine; the M1 Max / 5070 Ti figures are another tester's runs. Same harness and settings, so the comparison is fair, but it isn't a single controlled rig — read the cross-machine table as directional.
Gemma 4 12B-it runs as a thinking model under --jinja (output routes to a reasoning channel); this doesn't affect the decode-tok/s measurement, which comes from the server's own token timings, and all machines used the same --jinja.
MTP throughput can vary run to run; the 3090 numbers were stable (CV < 0.5%), but the M1 Max's 0.87× is close enough to 1.0 that it's worth re-running before treating it as exact — the direction (a net loss) is the point.

Reproduce it (3090)

llama.cpp mainline, commit e3471b3 (already accepts the Gemma MTP draft — no special build), CUDA sm86.
Models from unsloth/gemma-4-12B-it-qat-GGUF: main gemma-4-12B-it-qat-UD-Q4_K_XL.gguf, draft gemma-4-12B-it-Q8_0-MTP.gguf.
Server: llama-server -m <main> --model-draft <draft> --spec-type draft-mtp --spec-draft-n-max 3 -np 1 -ub 512 -c 16384 -ngl 99 -fa on --jinja
Client: speed_bench.py --bench qualitative --category all --osl 1024 --concurrency 1, compared with speed_bench_compare.py.

Takeaway

MTP is one of the best generation-speed levers on a capable CUDA GPU — ~1.9× here on a 3090, for free quality-wise (the verify keeps the output exact). But "speculative decoding makes you faster" is a hardware claim, not a universal one. On Apple Silicon it can be pure overhead. Measure it on your own box before you commit to it.

Gemma 4 QAT on a 1080 Ti: What 'Quantization-Aware' Actually Buys — and Fitting the 12B on 8 GB at 16k

byeongsoo kang — Thu, 11 Jun 2026 02:39:05 +0000

Quantization-Aware Training (QAT) is the headline feature of the Gemma 4 release: models trained to survive 4-bit quantization, so the Q4 version stays close to full quality instead of degrading the way a naive post-training quant does. The pitch is great. I wanted to know what it actually buys on hardware most people would call obsolete — a GTX 1080 Ti — and whether it makes the 12B usable on an 8 GB card. So I measured three things: quality, speed, and footprint.

Short version: the quality claim is real (against naive Q4), the speed win is modest (~9% over a regular Q4), and the 12B fits an 8 GB GPU at 16k context if you quantize the KV cache. Details below.

1. Quality: the part QAT is actually about

QAT's whole point is quality retention at Q4. Unsloth publishes a clean way to see it — top-1 token agreement with the full model, their dynamic UD-Q4_K_XL vs a naive Q4_0:

model	UD-Q4_K_XL	naive Q4_0
Gemma 4 E2B	98.16%	89.29%
Gemma 4 12B	88.76%	74.08%
Gemma 4 31B	96.67%	87.91%

For the 12B that's a ~15-point gap — naive 4-bit drops a lot of the model's token choices, QAT + dynamic quant keeps most of them, at ~72% less memory than BF16 (6.72 GB vs 23.8 GB). That's the real optimization.
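For readers unfamiliar with the metric, top-1 agreement is the share of positions where the quantized model's most likely next token matches the full model's. The sketch below illustrates the idea on made-up logits. It is not Unsloth's evaluation harness, and the function name is arbitrary.

```python
def top1_agreement(ref_logits: list[list[float]],
                   quant_logits: list[list[float]]) -> float:
    """Fraction of positions where both models pick the same argmax token."""
    def argmax(row: list[float]) -> int:
        return max(range(len(row)), key=row.__getitem__)

    matches = sum(argmax(r) == argmax(q) for r, q in zip(ref_logits, quant_logits))
    return matches / len(ref_logits)


# Four positions over a two-token vocabulary (toy numbers).
ref = [[0.1, 0.9], [0.8, 0.2], [0.3, 0.7], [0.6, 0.4]]
quant = [[0.2, 0.8], [0.4, 0.6], [0.1, 0.9], [0.9, 0.1]]
print(top1_agreement(ref, quant))  # 0.75
```

The metric only looks at the top choice, so a quant can shift probabilities noticeably and still score well as long as the winner stays the same.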

Honest caveat on this: that big gap is against naive Q4_0. Against a good modern quant like Q4_K_M, the difference is much smaller — which matched my own experience: on my coarse hands-on probes (token-glitch counts, a niche domain question) I couldn't reliably separate the QAT build from a solid Q4_K_M. So I trust the benchmark numbers for the quality story, not my eyeballs — and the practical read is "QAT is the best quality-per-byte option, but if you're already on a good Q4_K_M the day-to-day difference is subtle."

One useful, slightly counterintuitive tip from Unsloth: stick to UD-Q4_K_XL — going to higher precision (Q5/Q6/Q8) of these QAT models actually degrades accuracy, because the QAT was tuned for the 4-bit target.

2. Speed and size on a 1080 Ti

I ran Gemma 4 12B three ways on a single 8-year-old GTX 1080 Ti (num_ctx 8192, 100% GPU):

build	gen tok/s	VRAM
regular Q4	28.3	7.6 GB
Google QAT	31.0	7.5 GB
Unsloth QAT (UD-Q4_K_XL)	30.8	7.2 GB

So the QAT builds are ~9% faster and slightly smaller than a regular Q4 — a real but modest win, and all three run fully on one old card. Don't expect the quality numbers above to also make it dramatically faster; the speed/size gain is incremental. The headline is "a 12B runs comfortably at ~30 tok/s on a 1080 Ti," which is itself a nice statement about QAT-sized models on old hardware.

3. The useful part: fitting the 12B on 8 GB at 16k context

This is the question I actually get asked: can you run the 12B on an 8 GB card with a 16k context and keep it fast? The model weights are ~7 GB, so on an 8 GB card you have ~1 GB for the KV cache — and 16k of KV is the squeeze. I measured the footprint at 16k with each KV-cache type (single GPU, flash-attention on):

KV cache	VRAM @ 16k	fits 8 GB?
f16 (default)	7.7 GB	❌ no (driver reserve pushes it over)
q8_0	7.4 GB	✅ yes (tight, ~0.5 GB headroom)
q4_0	7.2 GB	✅ yes (more margin)

All three stayed 100% GPU at 16k. The default f16 KV (7.7 GB) won't reliably fit an 8 GB card once you count the driver/display reserve — which is why a naive attempt spills to CPU and crawls. Quantize the KV to q8_0 and you're at 7.4 GB with negligible quality cost; that's the sweet spot. Drop to q4_0 if you've got a display attached and want margin.

A neat detail: the KV cache at 16k is small here (q8 and q4 differ by only ~0.2 GB) because Gemma interleaves sliding-window (local) and global attention, so most layers cap their KV at the window size regardless of context length. The footprint is dominated by the ~7 GB weights, and KV quantization just buys the last ~0.3–0.5 GB you need to slip under 8 GB. (This is the flip side of the prefill wall: cheap KV doesn't make the prompt process faster, it just makes it fit.)
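The effect of interleaved local and global layers on KV size can be sketched with the usual per-layer formula. The layer counts, window size, head count, and head dimension below are placeholders, not Gemma 4's real configuration, so the absolute numbers will not match the measured table. The bytes-per-element values are f16 and the standard llama.cpp q8_0 and q4_0 block layouts (34 and 18 bytes per 32 elements).

```python
F16, Q8_0, Q4_0 = 2.0, 34 / 32, 18 / 32  # bytes per cached element


def kv_cache_gb(context: int, n_global: int, n_local: int, window: int,
                n_kv_heads: int, head_dim: int, bytes_per_elem: float) -> float:
    """KV cache size in GB when local layers only keep `window` tokens."""
    per_token_per_layer = 2 * n_kv_heads * head_dim * bytes_per_elem  # K and V
    cached_tokens = n_global * context + n_local * min(context, window)
    return per_token_per_layer * cached_tokens / 1e9


# Hypothetical config: 8 global + 40 local layers, 1024-token window.
cfg = dict(window=1024, n_kv_heads=8, head_dim=256)
for name, b in (("f16", F16), ("q8_0", Q8_0), ("q4_0", Q4_0)):
    mixed = kv_cache_gb(16384, n_global=8, n_local=40, bytes_per_elem=b, **cfg)
    all_global = kv_cache_gb(16384, n_global=48, n_local=0, bytes_per_elem=b, **cfg)
    print(f"{name:>5}: interleaved {mixed:.2f} GB vs all-global {all_global:.2f} GB")
```

Whatever the real configuration is, the shape holds: local layers stop growing once the context passes the window, so only the global layers scale with context and the gap between KV quant types stays small.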

The recipe that works:

# ollama
OLLAMA_FLASH_ATTENTION=1 OLLAMA_KV_CACHE_TYPE=q8_0 ollama serve
# then run gemma-4-12b-qat with num_ctx 16384, all layers on GPU

# llama.cpp equivalent
llama-server -m gemma-4-12b-qat-UD-Q4_K_XL.gguf -c 16384 -fa on -ngl 99 \
  --cache-type-k q8_0 --cache-type-v q8_0

Check that ollama ps / nvidia-smi shows 100% GPU. If any layer offloads to CPU, throughput tanks, and that's your signal to drop the KV to q4_0 or trim the context.

Honest caveats

Speed numbers are at num_ctx 8192; the 8 GB/16k footprint numbers are at 16k — different tests, both on the same model/quant.
The 8 GB fit was measured on a 1080 Ti (11 GB) constrained to one GPU. I'm reporting the VRAM actually used, and every figure in the table is under 8 GB, but a real 8 GB card with a display attached has slightly less usable VRAM. So q8_0 is "fits headless," and q4_0 is the safer bet with a monitor plugged in.
The quality numbers are Unsloth's published top-1 agreement, not my own benchmark run; my hands-on probes were too coarse to add to them.

Wrap-up

Is Gemma 4 QAT a good model? Yes — it's the best quality-per-byte way to run Gemma 4 locally, the quality retention vs naive Q4 is real and measured, and on practical hardware it's genuinely useful: a 12B at ~30 tok/s on a 1080 Ti, and a 12B at 16k on an 8 GB card if you quantize the KV. Just don't expect the "near-BF16 quality" story to also mean a big speedup — the speed/size win over a good Q4_K_M is modest. The real story is accessibility: QAT is what lets a 12B feel comfortable on a card most people wrote off years ago.

The Prefill Wall: Why MTP's 2× Barely Moves Long-Context Latency (Qwen3.6-27B, RTX 3090)

byeongsoo kang — Wed, 10 Jun 2026 02:23:43 +0000

My MTP post showed multi-token prediction roughly doubling Qwen3.6-27B's generation on a 3090. A reader asked the question I'd skipped: what about prompt processing at long context? So I measured it — and that turns out to be the real wall, the one MTP can't climb.

TL;DR

On a single RTX 3090, prefill (prompt processing) for Qwen3.6-27B drops from ~1,575 tok/s at 1k context to ~852 at 128k — so a 64k-token prompt takes ~59 seconds before the first token appears, and 128k takes ~2.5 minutes. MTP speeds the decode phase, not prefill, so on a long-context / short-answer request (the typical RAG shape) its 2× generation win shrinks to ~3% of total latency. MTP is real; it just stops mattering exactly where long-context RAG lives.

Prefill vs context size

llama-bench, Qwen3.6-27B IQ4_XS, prefill only (-n 0), flash-attention on, single RTX 3090:

context	prefill tok/s	time to first token	peak VRAM	fits 24 GB?
1,024	1,575	0.65 s	16.0 GB	yes
16,384	1,432	11.4 s	16.5 GB	yes
65,536	1,111	59.0 s	19.6 GB	yes
131,072	852	153.8 s	23.6 GB	barely (98.5%)

Prefill throughput falls ~46% from 1k to 128k as the attention cost grows with sequence length, so time-to-first-token climbs somewhat faster than linearly with prompt size (128× the tokens costs ~237× the wait). These numbers are rock-stable (CV < 0.5%): prefill is compute-bound, unlike the noisier MTP generation numbers from last time.

MTP speeds decode, not prefill

Speculative decoding (MTP) works during generation: a draft proposes several tokens ahead and the main model verifies them in one pass. Prefill is a different phase — a single forward pass over the whole prompt to build the KV cache, before any token is generated. MTP doesn't touch that pass, so it can't reduce time-to-first-token. What it reduces is the per-token cost of everything after the first token.

That's not the same as "MTP doesn't help long context." If you generate a lot of tokens, MTP still cuts the generation portion. The honest question is: how big is the generation portion relative to prefill?

When MTP actually matters: the latency math

Using the measured prefill above and generation from the last post (~75 tok/s with MTP vs ~45 without), total latency = time-to-first-token + generation time:

request shape	prefill (TTFT)	generation (MTP / off)	total (MTP / off)	MTP saves
1k context, 200-token answer	0.65 s	2.7 s / 4.4 s	3.3 s / 5.1 s	~35%
64k context, 200-token answer	59 s	2.7 s / 4.4 s	61.7 s / 63.4 s	~3%
64k context, 2,000-token answer	59 s	27 s / 44 s	86 s / 103 s	~17%

So MTP's value is entirely a function of the generation-to-prefill ratio. Short prompt, long answer → MTP shines (a third off total latency). Long prompt, short answer → prefill swallows everything and the 2× barely registers. Same speedup on the decoder; completely different impact on the wall clock.
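The latency math in the table can be reproduced with a few lines. This is a minimal sketch that assumes the rounded generation rates quoted above (~75 tok/s with MTP, ~45 without) and the measured prefill rates; the function names are illustrative, not part of any tool used in the post.

```python
def total_latency(prompt_tokens, prefill_tok_s, answer_tokens, gen_tok_s):
    """Time-to-first-token (pure prefill) plus generation time, in seconds."""
    ttft = prompt_tokens / prefill_tok_s
    return ttft + answer_tokens / gen_tok_s


def mtp_saving(prompt_tokens, prefill_tok_s, answer_tokens,
               gen_mtp=75.0, gen_off=45.0):
    """Fraction of total latency removed by turning MTP on."""
    off = total_latency(prompt_tokens, prefill_tok_s, answer_tokens, gen_off)
    on = total_latency(prompt_tokens, prefill_tok_s, answer_tokens, gen_mtp)
    return (off - on) / off


# (label, prompt tokens, measured prefill tok/s, answer tokens)
shapes = [
    ("1k context, 200-token answer", 1024, 1575, 200),
    ("64k context, 200-token answer", 65536, 1111, 200),
    ("64k context, 2,000-token answer", 65536, 1111, 2000),
]
for label, ctx, prefill, answer in shapes:
    print(f"{label}: MTP saves {mtp_saving(ctx, prefill, answer):.0%}")
```

Plugging in your own prompt and answer sizes shows quickly which side of the generation-to-prefill ratio a workload sits on.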

What this means for RAG

The middle row (long prompt, short answer) is exactly the shape of retrieval-augmented generation: you stuff thousands of tokens of retrieved context in and ask for a short, grounded answer. That's the case where MTP helps least, and it's why a fat-context RAG can feel sluggish even on a setup that benchmarks fast on generation. The thing you actually wait for is the prefill of the context, paid in full on every query.

This is directly relevant to my own local paper-RAG: the lever that improves its latency isn't a faster decoder — it's keeping the retrieved context tight (good chunking and reranking so you pass fewer, better tokens), which keeps prefill cheap. A reranker that lets you send 4k of relevant context instead of 40k of marginal context buys more real-world latency than MTP does.

The 24 GB wall

128k context fit — barely. At 23.6 GB it used 98.5% of the card, leaving ~380 MiB of headroom and nothing for anything else. The model's native context goes higher (~256k), but on a 24 GB 3090 this quant tops out around 128k before the KV cache spills or OOMs. So if you're planning long-context work on a single 3090: ~128k is the practical ceiling, and the prefill at that point is a 2.5-minute wait before the model says a word.

Honest caveats

Single RTX 3090, single request, Qwen3.6-27B IQ4_XS. Batching / concurrency is a different story and changes the prefill economics (chunked prefill, prefix caching, etc.).
The generation figures (~75 / ~45 tok/s) carry the run-to-run variance from the last post (MTP CV ~5–7%), so the latency-math rows are illustrative round numbers, not claimed to ±0.1 s. The pattern — MTP's share collapsing as context grows — is the robust part.
Prefill numbers themselves are tight (CV < 0.5%).
"Time to first token" here is pure prompt processing; real TTFT also includes a little sampling and setup overhead.

Reproduce it

RTX 3090 24 GB (sm86), llama.cpp commit e3471b3, Qwen3.6-27B IQ4_XS (bartowski).
Prefill sweep: llama-bench -m <iq4xs.gguf> -p 1024,16384,65536,131072 -n 0 -ngl 99 -fa 1 -r 2
Time-to-first-token = context_size ÷ prefill_tok_s.
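Applied to the measured sweep, that formula gives back the time-to-first-token column. A small sketch, with the dictionary holding the prefill rates from the table above (the variable and function names are mine):

```python
# context size (tokens) -> measured prefill throughput (tok/s), from the table
MEASURED = {1024: 1575, 16384: 1432, 65536: 1111, 131072: 852}


def ttft_seconds(context_tokens, prefill_tok_s):
    """Pure prompt-processing time; real TTFT adds a little setup overhead."""
    return context_tokens / prefill_tok_s


for ctx, tok_s in MEASURED.items():
    print(f"{ctx:>7} tokens -> {ttft_seconds(ctx, tok_s):6.1f} s")

# Relative throughput drop between the shortest and longest context
drop = 1 - MEASURED[131072] / MEASURED[1024]
print(f"prefill throughput drop 1k -> 128k: {drop:.0%}")
```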

Wrap-up

MTP is still the best single lever for generation speed on this card — but "generation speed" and "how long until I see an answer" are different questions, and at long context they diverge hard. If your workload is long-context RAG, the number that owns your latency is prefill, and no amount of speculative decoding will move it. The cheapest win there isn't a faster decoder; it's sending fewer, better tokens. Thanks to the reader who asked the question that made me measure it.

Doubling Qwen3.6-27B on One RTX 3090: ollama → llama.cpp + MTP, Lever by Lever (35.7 → ~75 tok/s)

byeongsoo kang — Tue, 09 Jun 2026 09:21:43 +0000

A reader on my last post said Ollama was leaving a lot on the table — that a tuned backend with multi-token prediction (MTP) could roughly double my 3090's throughput. So I went and measured it, one lever at a time. The short version: they were right that MTP roughly doubles it, and below is the exact path that got me there on my box.

Update (2026-06-10) — corrected after community feedback. Two things in the first version were off, and r/LocalLLaMA was right to flag them. (1) ik_llama does support MTP — I'd used the deprecated -mtp flag; the canonical form is --spec-type mtp:n_max=3,p_min=0.0. (2) My headline 80.2 was a lucky 3-run draw — re-running both engines at n=12 gives ik_llama 75.2 and mainline llama.cpp 74.6: a tie at ~75 tok/s (≈2.1× over Ollama). So the honest headline is ~75 tok/s, both engines support MTP, and they're statistically identical. I've updated the numbers below and kept the story. Thanks to the folks who caught it.

TL;DR

On a single RTX 3090, Qwen3.6-27B generation went from 35.7 tok/s (Ollama) to ~75 tok/s (llama.cpp + MTP) — a measured ≈2.1× — by stacking three independent levers: a leaner engine, a smaller quant, and speculative decoding. The interesting part isn't the headline; it's which lever bought how much, and a couple of things that tripped me up on the way. (To be precise up front: MTP on its own is ~1.6× at the same quant — the ≈2.1× is what you get when all three levers stack. ik_llama and mainline llama.cpp both do MTP and land within noise of each other at ~75.)

The lever table

All on one RTX 3090, Qwen3.6-27B, 200 tokens generated, flash-attention on:

step	what changed	backend	quant	MTP	gen tok/s	vs Ollama	VRAM
baseline	—	Ollama	Q4_K_M	—	35.7	1.00×	23.2 GB
1	engine	ik_llama.cpp	Q4_K_M	—	41.9	1.17×	17.3 GB
2	+ quant	ik_llama.cpp	IQ4_XS	—	47.5	1.33×	15.1 GB
3	+ MTP	llama.cpp / ik_llama	IQ4_XS	on	~75	≈2.1×	~15 GB

A note on fairness (and sample size): the baseline and rows 1–2 use each engine's own native bench path, and row 3 is llama-server. For a clean apples-to-apples read of MTP alone, I re-ran both engines at n=12: mainline llama.cpp 45.1 (off) → 74.6 (on) = 1.65×, and ik_llama 47.2 (off) → 75.2 (on) = 1.59×, statistically a tie at ~75 tok/s (MTP-on has a CV of ~5–7%; that variance is inherent to speculative decoding, since draft acceptance fluctuates run to run). My very first run reported 80.2, but that was a lucky high draw from a 3-run sample; the 12-run mean is ~75, so that's the honest number. (Both the Ollama baseline and the llama.cpp runs fit fully in VRAM; the baseline ran at num_ctx 8192 and the llama.cpp runs at -c 4096. Generation throughput is largely insensitive to that as long as nothing spills to CPU, though it accounts for part of the VRAM difference in the table.)
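To see which lever bought how much, it helps to separate each step's own multiplier from the cumulative one. A short sketch using the table's figures; the MTP row uses the rounded ~75 tok/s, so the last two multipliers are approximate.

```python
# (lever, generation tok/s) in the order the levers were stacked
steps = [
    ("baseline (Ollama, Q4_K_M)", 35.7),
    ("engine (ik_llama.cpp, Q4_K_M)", 41.9),
    ("quant (IQ4_XS)", 47.5),
    ("MTP (n-max 3)", 75.0),  # rounded 12-run mean
]


def lever_gains(steps):
    """Return (name, multiplier vs previous step, multiplier vs baseline)."""
    base = steps[0][1]
    gains = []
    for (_, prev), (name, cur) in zip(steps, steps[1:]):
        gains.append((name, cur / prev, cur / base))
    return gains


for name, step, total in lever_gains(steps):
    print(f"{name}: x{step:.2f} for this step, x{total:.2f} vs Ollama")
```

The per-step multipliers compound rather than add, which is why three modest-looking levers land at roughly 2.1× overall.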

Levers 1 and 2: engine and quant

Moving the same Q4_K_M model from Ollama to a bare-metal ik_llama.cpp build (CUDA, flash-attention, compiled for the 3090's sm86) took me from 35.7 → 41.9 tok/s, and dropped VRAM from 23.2 → 17.3 GB. Ollama is convenience-first — it sizes things generously and doesn't expose the lower-level knobs — so a hand-built engine is faster out of the gate. Swapping the quant from Q4_K_M to IQ4_XS added a bit more and shrank VRAM further: 47.5 tok/s, 15.1 GB. Roughly a third faster, and nothing exotic yet. (Does IQ4_XS cost quality? I checked perplexity on wikitext-2 after a reader asked: Q4_K_M = 6.996, IQ4_XS = 6.997 — a +0.01% difference, comfortably inside the error bars (±0.046). IQ4_XS can regress more on other architectures, but for Qwen3.6-27B the quant swap was effectively free.)

Lever 3: MTP (where the real jump is)

Multi-token prediction / speculative decoding is the big one. The idea: a small, fast draft predicts several tokens ahead, and the main model verifies them in one pass — when the drafts are accepted, you get multiple tokens for roughly the cost of one. Because the main model verifies every drafted token before it's emitted, the output is preserved — this is a throughput win, not a quality tradeoff.
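The draft-then-verify loop can be illustrated with a toy. This is a sketch under heavy assumptions: both "models" are made-up deterministic functions, decoding is greedy, and verification is written as a loop where a real engine would do one batched forward pass. It is not how llama.cpp implements MTP; it only shows why the output matches plain decoding while needing fewer main-model passes.

```python
def target_next(seq):
    """Toy 'main model': the next token is a deterministic function of the sequence."""
    return (seq[-1] * 3 + len(seq)) % 11


def draft_next(seq):
    """Toy 'draft': agrees with the main model unless the last token is 7."""
    guess = target_next(seq)
    return (guess + 1) % 11 if seq[-1] == 7 else guess


def plain_decode(prompt, n_tokens):
    seq = list(prompt)
    for _ in range(n_tokens):
        seq.append(target_next(seq))
    return seq[len(prompt):]


def speculative_decode(prompt, n_tokens, n_max=3):
    seq = list(prompt)
    passes = 0  # number of main-model verification passes
    while len(seq) - len(prompt) < n_tokens:
        # 1) the draft proposes n_max tokens ahead
        draft = []
        for _ in range(n_max):
            draft.append(draft_next(seq + draft))
        # 2) the main model checks every drafted position
        passes += 1
        accepted = []
        for tok in draft:
            expected = target_next(seq + accepted)
            if tok != expected:
                accepted.append(expected)  # keep the main model's token, drop the rest
                break
            accepted.append(tok)
        else:
            # every draft was accepted: the verify pass also yields one extra token
            accepted.append(target_next(seq + accepted))
        seq.extend(accepted)
    return seq[len(prompt):len(prompt) + n_tokens], passes


tokens, passes = speculative_decode([1], 20)
print(tokens == plain_decode([1], 20), passes)
```

Every emitted token is the one the main model would have picked at that position, so the sequence is the same as plain decoding; the saving comes from needing fewer verification passes than tokens.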

Two things were worth knowing for my setup:

Both ik_llama and mainline llama.cpp do MTP — but the flag matters. I first tried ik_llama's -mtp, which it rejected as legacy, and wrongly concluded ik_llama couldn't do MTP. A reader set me straight: the canonical form is --spec-type mtp:n_max=3,p_min=0.0, and with it ik_llama runs MTP fine (~75 tok/s, matching mainline). Mainline llama.cpp added MTP recently (PR #22673, merged 2026-05-16) and uses --spec-type draft-mtp. Either engine gets you there.
Ollama's GGUF couldn't be reused. Qwen3.6 changed rope.dimension_sections from 3 to 4 elements; Ollama's stored blob still has the older 3-element layout, so llama.cpp refused it (expected 4, got 3). I grabbed a properly-converted GGUF instead (bartowski / a nextn-equipped MTP build) — a small heads-up if you're tempted to point llama.cpp at your existing Ollama blob.

With an MTP-equipped IQ4_XS GGUF and n-max 3, generation lands around ~75 tok/s — whether via mainline llama.cpp's --spec-type draft-mtp or ik_llama's --spec-type mtp:n_max=3,p_min=0.0.

Tuning MTP: more accepted drafts isn't more speed

The one knob that mattered for me was the draft depth (n-max, how many tokens to draft ahead):

config	gen tok/s	draft acceptance
n-max 2	77.5	78.1%
n-max 3	80.2	70.3%
n-max 4	70.7	53.4%
n-max 3 + p-min 0.6	54.1	80.0%
n-max 3 + KV q8_0	74.6	64.5%

The counterintuitive bit: higher acceptance ≠ faster. Pushing p-min to 0.6 raised acceptance to 80% but dropped throughput to 54. The likely reason is that a stricter p-min makes the drafter stop earlier, so each verification pass carries fewer drafted tokens, and the higher hit rate doesn't make up for the shorter drafts. Plain f16 KV beat q8 KV too. n-max 3 with f16 KV was the sweet spot. (These sweep rows are single runs, so read the pattern, not the absolute decimals; the stable 12-run figure for n-max 3 is ~75. I also went looking for a "prefill-off" trick I'd heard about and couldn't find it as a flag in current llama.cpp. Draft depth was the lever that actually moved the number for me.)

Honest caveats

Keeping these front and center, because they're the difference between a benchmark and a benchmark you can trust:

~75 tok/s is this box's number (RTX 3090, WSL2), as a 12-run mean. My first writeup said 80.2 from a 3-run sample — that was a lucky high draw, and re-running at n=12 corrected it to ~75. Generation under MTP has real run-to-run variance (CV ~5–7%) because draft acceptance fluctuates.
Prefill numbers are noisy — my test prompt was short (~56 tokens), so I'm not headlining prefill. (A reader rightly asked about prompt processing at >64k context, where prefill can dominate latency; MTP only speeds generation, not prefill — that's a measurement for a follow-up.)
The bartowski Q4_K_M and Ollama's Q4_K_M are the same quantization family but different conversions (the rope change above), so they're not bit-identical weights. The model and quant family are matched; the conversion isn't.
Single GPU, single request. No batching or concurrency tested — that's a different question.
One benchmarking trap that cost me time: llama-cli -n <N> is ignored under -no-cnv, so the model just generates until timeout (mine produced a 2 GB output file and looked like a 39-minute hang — it was runaway generation). Use llama-bench for token-exact non-MTP runs, and llama-server with n_predict for MTP.

Reproduce it

Hardware: RTX 3090 24 GB (Ampere, sm86), WSL2 Ubuntu 24.04, driver 591.74, nvcc 12.0.
ik_llama.cpp (commit bbe1a51): cmake -B build -DGGML_CUDA=ON -DCMAKE_CUDA_ARCHITECTURES=86 -DGGML_NATIVE=ON
llama.cpp / mainline (commit e3471b3): cmake -B build -DGGML_CUDA=ON -DCMAKE_CUDA_ARCHITECTURES=86 -DBUILD_SHARED_LIBS=OFF
Models: bartowski/Qwen_Qwen3.6-27B-GGUF (Q4_K_M, IQ4_XS); a nextn/MTP-equipped Qwen3.6-27B-MTP-IQ4_XS GGUF for the speculative step.
Non-MTP bench: llama-bench -m <gguf> -p 56 -n 200 -ngl 99 -fa 1 -r 3
MTP run, mainline: llama-server -m Qwen3.6-27B-MTP-IQ4_XS.gguf -ngl 99 -fa on -c 4096 --spec-type draft-mtp --spec-draft-n-max 3
MTP run, ik_llama: same model/flags, but --spec-type mtp:n_max=3,p_min=0.0. Then POST /completion with n_predict: 200; draft acceptance ≈ 70%.
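Since the 3-run versus 12-run difference mattered so much here, a repeat-and-summarize helper is worth having. A hedged sketch: it assumes a llama-server already running at a placeholder URL, and it assumes the /completion reply carries a timings.predicted_per_second field, which is what recent builds return as far as I know; check your build's reply before relying on it.

```python
import json
import statistics
import urllib.request


def one_run(base_url, prompt, n_predict=200):
    """POST /completion to a running llama-server; return generation tok/s."""
    body = json.dumps({"prompt": prompt, "n_predict": n_predict}).encode()
    req = urllib.request.Request(
        f"{base_url}/completion",
        data=body,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        reply = json.load(resp)
    # Assumed response shape: {"timings": {"predicted_per_second": ...}, ...}
    return reply["timings"]["predicted_per_second"]


def summarize(tok_s):
    """Mean and coefficient of variation (sample stdev / mean) over repeated runs."""
    mean = statistics.mean(tok_s)
    return mean, statistics.stdev(tok_s) / mean


# Usage (needs a live server; the URL and prompt are placeholders):
# runs = [one_run("http://127.0.0.1:8080", "Explain KV caching.") for _ in range(12)]
# mean, cv = summarize(runs)
# print(f"{mean:.1f} tok/s, CV {cv:.1%}")
```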

Wrap-up

So the reader's nudge was a good one — Ollama really was leaving a clean ~2× on the table for this model on this card, and most of it is the MTP step. Ollama stays my default for everyday use (it's simple and it's what my tooling talks to); this build is the "I want every token/sec" setup. And honestly the best part of posting it was the correction: the thread caught both my legacy-flag mistake (ik_llama does MTP) and my lucky 80.2 draw (the honest 12-run mean is ~75) — so the version you're reading is the one the community helped get right.

Building a Fully-Local Research RAG on 2× GTX 1080 Ti + an RTX 3090: 3 Gotchas

byeongsoo kang — Fri, 05 Jun 2026 23:56:32 +0000

I wanted to ask questions about my own papers without shipping them to a cloud API. This is the real story of building that — a private, fully-offline RAG with hybrid retrieval and reranking — across a pile of old GPUs and one newer one. Three things each cost me the better part of a day, and none of them were what I expected.

The goal: a private RAG over my own papers

I'm a researcher with a folder of PDFs I can't (and won't) upload to a hosted API. I wanted natural-language, cited answers over that corpus, running entirely on my own hardware. So I built a small tool — paper-rag, about 200 lines of Python — with the whole stack local:

PDFs → chunk → BGE-M3 dense (Ollama) ┐
               BM25 sparse (fastembed)┴→ Qdrant (embedded, on disk)
                                            │
question → dense + sparse → RRF fusion → cross-encoder rerank → top passages
                                                                     │
                                                                     ▼
                                             local LLM (Ollama) → cited answer

Dense embeddings catch meaning; BM25 sparse catches exact terms (gene names, identifiers); fusing the two and then reranking with a cross-encoder gives the LLM far better context than cosine-alone. No server, no API key, nothing leaves the box.
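For readers who haven't met it, reciprocal rank fusion (the RRF step in the diagram) is short enough to show whole. This is a generic sketch, not necessarily how paper-rag does it (the fusion may well be delegated to Qdrant); k=60 is the commonly used constant and an assumption here, and the chunk IDs are invented.

```python
def rrf_fuse(rankings, k=60):
    """Reciprocal Rank Fusion: score(d) = sum over lists of 1 / (k + rank of d)."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first; ties broken by ID for a stable order
    return sorted(scores, key=lambda d: (-scores[d], d))


dense = ["c3", "c1", "c7"]   # hypothetical chunk IDs, best first (dense search)
sparse = ["c1", "c9", "c3"]  # hypothetical chunk IDs, best first (BM25)
print(rrf_fuse([dense, sparse]))
```

Because RRF only looks at ranks, it needs no score normalization between the dense and sparse retrievers, which is why it is a popular default before a cross-encoder rerank.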

The hardware was just what I had lying around: 2× GTX 1080 Ti (Pascal, 2017, 11 GB each) on one machine, and later a single RTX 3090 (24 GB, Ampere) on another. That old-plus-new mix is where the lessons came from.

Gotcha 1: the embedding model kept freezing the entire GPU

On the 1080 Ti box (running under WSL2), a long ingest would make the BGE-M3 embedder hang — and not gracefully. The llama-server process dropped into uninterruptible D-state, nvidia-smi itself stopped responding, and the whole GPU was wedged. No signal could kill it; only a full wsl --shutdown brought it back.

I chased the wrong causes first:

"It's the batch size." It wasn't. Once the model degraded, even a single 600-character chunk timed out at 90 seconds.
"A newer Ollama will fix it." I checked the changelogs — the next few patch releases only fixed an unrelated model crash.

The actual fix was to stop using the GPU for that job at all. BGE-M3 is tiny (~1 GB), so I pinned it to CPU and kept the LLM on the GPU:

printf 'FROM bge-m3\nPARAMETER num_gpu 0\n' > bge-m3-cpu.Modelfile
ollama create bge-m3-cpu -f bge-m3-cpu.Modelfile
RAG_EMBED=bge-m3-cpu python rag.py ingest ./papers   # embeddings on CPU, answers still on the GPU

A ~100-chunk corpus embeds in well under a minute on CPU, and the GPU never wedges again. The lesson: on old Pascal cards under WSL the embedding path is where things break — and the embedder doesn't need the GPU in the first place.

Gotcha 2: my 27B ran at half speed until I capped the context

On the 3090 I loaded qwen3.6:27b (a 27.8B model, Q4, ~17.4 GB of weights) and saw ~17 tok/s. For a 27B that fits in 24 GB, that felt wrong.

ollama ps told the story: the model was loaded at 24.6 GB, with ~4 GB spilled to CPU. But the weights are only 17.4 GB — so what ate the other 7 GB? The KV cache. This model ships a 256K-token native context, and Ollama sized the cache to match it, overflowing VRAM and forcing an offload that throttled everything.

Cap the context to what you actually use and it all fits on the card:

RAG_NUM_CTX=8192 python rag.py ask "..."   # or options.num_ctx in your API call

Result: 100% on-GPU, ~36 tok/s — 2× faster — for a setting that loses nothing on RAG-sized prompts. The lesson: a huge native context silently taxes you even when you never use it. Set num_ctx to your real working size.
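If you call Ollama directly rather than through a wrapper, the same cap goes in the request's options. A minimal sketch against Ollama's /api/generate endpoint; the base URL and prompt are placeholders, and the tok/s helper relies on Ollama reporting eval_duration in nanoseconds.

```python
import json
import urllib.request


def build_request(model, prompt, num_ctx=8192):
    """Request body with the context window capped to the working size."""
    return {
        "model": model,
        "prompt": prompt,
        "stream": False,
        "options": {"num_ctx": num_ctx},
    }


def tokens_per_second(reply):
    """Generation speed from a reply; eval_duration is in nanoseconds."""
    return reply["eval_count"] / (reply["eval_duration"] / 1e9)


def generate(base_url, payload):
    """POST the payload to a running Ollama instance and return the parsed reply."""
    req = urllib.request.Request(
        f"{base_url}/api/generate",
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)


# Usage (needs a live Ollama; the URL is the default local address):
# reply = generate("http://127.0.0.1:11434", build_request("qwen3.6:27b", "Hello"))
# print(f"{tokens_per_second(reply):.1f} tok/s")
```

Running ollama ps afterwards is still the quickest way to confirm the model is fully on the GPU.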

Gotcha 3: I have an old pair AND a new card — don't merge them

The tempting idea: pool the 2× 1080 Ti (22 GB) and the 3090 (24 GB) into one big inference cluster. You can (llama.cpp's RPC backend, or exo). You usually shouldn't.

The two machines are joined by a 1 GbE LAN — orders of magnitude slower than NVLink or PCIe. Cross-machine tensor parallelism is bottlenecked by that link, and the slow Pascal cards drag the fast Ampere one down to their level. Merging only pays off if you must run a model too big for any single box — and even then it's slow.

What actually works is role specialization:

3090 → the latency-critical work: LLM generation + reranking + query embedding.
1080 Ti box → throughput/batch work: bulk corpus embedding (on CPU, per Gotcha 1) and ingestion.

A small env knob points embeddings at one box and the LLM at another:

OLLAMA_URL=http://gpu-box:11434      RAG_LLM=qwen3.6:27b  \
RAG_EMBED_URL=http://127.0.0.1:11434 RAG_EMBED=bge-m3-cpu \
python mcp_server.py

Same bge-m3 on both boxes, so the dense vectors are compatible — you can even ingest on one machine and serve from another. ollama ps on each box confirms the split: bge-m3-cpu on the old box's CPU, the 27B on the 3090. The two machines do different jobs in parallel instead of fighting over one model.

The payoff: it's actually useful now

Hybrid + rerank produces noticeably better context, which shows up as cleaner cited answers, with each claim tagged to a source page — and when the answer genuinely isn't in the retrieved context, it says so instead of inventing one.

And because the tool speaks MCP, I can call it from an agent (Claude, Cursor, …): search_papers and ask_papers show up as tools, so the agent searches and cites my corpus — still fully local.

There's a quieter reason I want this grounded and local: small quantized models are confidently wrong on niche facts. A RAG anchored to the actual papers is how I catch that, instead of trusting the model's memory.

Honest limitations

Old Pascal cards are slow, and the embedder needs the CPU workaround above.
Reranking runs on CPU (fastembed/ONNX) — fine for a personal library, not a giant corpus.
It's a personal tool, not a production search system: naive fixed-size chunking, no semantic chunking yet.
Answer quality is your local model's quality — verify domain-specific claims.

Try it / let's compare notes

Code is MIT-licensed: github.com/shoo99/paper-rag — ~200 lines plus a small MCP server.

A couple of questions for anyone running similar setups:

For offload-heavy or multi-box rigs, has anyone gotten cross-machine pooling (llama.cpp RPC / exo) to actually beat just running a smaller model on the fastest single card?
What's your go-to reranker for local RAG — sticking with a small cross-encoder, or has something better shown up?

Running Brand-New Gemma 4 12B on an 8-Year-Old GTX 1080 Ti: Speed, 3 Gotchas, and Why Q8 Beat Q4 on My Own Field

byeongsoo kang — Fri, 05 Jun 2026 04:41:43 +0000

TL;DR (Quick Answer)

Gemma 4 12B just dropped, so I ran it on a GTX 1080 Ti (Pascal, 2017) to see what an 8-year-old card does with a 2026 model. Real numbers, and a few honest surprises:

Speed: ~28 tok/s at Q4_K_M on a single 1080 Ti (~8 GB VRAM). The 12B fits one card, so the second GPU sits idle.
Three things broke before it worked: the GGUF is multimodal and its vision projector crashes Ollama; it's a reasoning model that hides its answer in a thinking channel; and Q4 produces visible token glitches.
The interesting part — Q4 vs Q8. I asked it real bioinformatics questions. At Q4 it answered concepts and code well but got a niche method (the HEIDI test) confidently backwards, with garbled characters sprinkled in. Going to Q8_0 (12.7 GB, split across both 1080 Tis, ~30% slower at ~19.5 tok/s) removed the glitches and fixed the wrong answer.

Bottom line: for chat and drafting, Q4 on one old card is genuinely usable. For work where details matter, the higher quant across two cards is worth the speed hit — and it's the one case where the second 1080 Ti finally earns its keep.

Setup

Hardware: 2× NVIDIA GTX 1080 Ti (11 GB each), Pascal cc 6.1, driver 581.57, via WSL2.
Runtime: Ollama 0.30.2. Gemma 4 isn't in Ollama's library yet, so I pulled the unsloth GGUF: ollama pull hf.co/unsloth/gemma-4-12b-it-GGUF:Q4_K_M.

The 3 things that broke first

1. It's multimodal — and the vision projector crashes Ollama.
First generation returned nothing. The logs:

error: Failed to load CLIP model from .../blobs/sha256-7d10888...
llama-server process has terminated: exit status 1

Gemma 4 12B-it ships with a vision (CLIP) projector, and Ollama 0.30.2 fails to load it — taking down the whole model server. If you only want text, you have to strip the projector. Pull the model, then rebuild it text-only from the same blobs (no re-download):

ollama show --modelfile hf.co/unsloth/gemma-4-12b-it-GGUF:Q4_K_M > Gemma4.Modelfile
# delete the second `FROM ...` line — the mmproj/CLIP blob — keep only the text GGUF
ollama create gemma4-12b-text -f Gemma4.Modelfile
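The manual edit can be scripted if you rebuild more than once. A small sketch, assuming (as above) that the first FROM line is the text GGUF and any later FROM line is the projector blob; verify that against your own Modelfile before trusting it. The blob names in the example are placeholders.

```python
def strip_extra_from_lines(modelfile_text):
    """Keep the first FROM line; drop every later FROM line (assumed mmproj/CLIP)."""
    kept = []
    seen_from = False
    for line in modelfile_text.splitlines():
        # Commented lines such as "# FROM ..." start with "#", so they are left alone
        if line.strip().upper().startswith("FROM "):
            if seen_from:
                continue
            seen_from = True
        kept.append(line)
    return "\n".join(kept) + "\n"


example = (
    "# Modelfile generated by ollama show\n"
    "FROM /models/blobs/sha256-text-placeholder\n"
    "FROM /models/blobs/sha256-mmproj-placeholder\n"
    "TEMPLATE {{ .Prompt }}\n"
)
print(strip_extra_from_lines(example))
```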

2. It's a reasoning model — your answer hides in thinking.
With the text model, generation worked but content came back empty while eval_count was 200. The output was all going into the reasoning channel and getting cut off mid-thought at the token cap. Fix: disable thinking.

{ "model": "gemma4-12b-text", "think": false, "messages": [ ... ] }

With think: false, clean answers in ~10 seconds.

3. Q4 has visible token glitches.
At Q4_K_M, prose came out with occasional garbled characters — literally self-さattention, ściindicates, stray Korean/Japanese codepoints injected mid-word. Code blocks were clean; only prose was affected. (Spoiler: Q8 fixes this.)

Speed (Q4, single 1080 Ti)

think: false, num_predict=256, measured via Ollama's API:

Generation: ~27.6 tok/s (27.5 / 27.6 / 27.7 — rock stable)
VRAM: ~8 GB on GPU0; GPU1 completely idle (0 MiB) — a 12B at Q4 fits one card, so the second GPU does nothing.

Quality: I asked it about my actual field

Speed is easy; is it useful for real work? I gave it four bioinformatics questions and checked the answers honestly:

Question	Verdict
RNA-seq normalization (raw vs TPM vs FPKM; DESeq2 input)	✅ Correct and precise
Pandas function to filter a DESeq2 results table	✅ Correct, clean, usable
Troubleshoot an implausibly high DEG count	✅ Good — batch effects, PCA, outliers, covariates
What a small HEIDI p-value means (SMR/colocalization)	❌ Confidently backwards

That last one is the lesson. HEIDI is a niche test: a small p-value means the locus fails (heterogeneity/linkage — you filter it out). Q4 Gemma told me a small p-value means a single causal gene — the exact opposite. It was fluent and sure of itself. If you don't already know the answer, that's the dangerous kind of wrong.

The payoff: Q4 vs Q8

So I pulled Q8_0 (12.7 GB) and rebuilt it text-only the same way. At 12.7 GB it no longer fits one card — Ollama splits it across both 1080 Tis (~7 GB each). Same questions:

	Q4_K_M	Q8_0
Size / GPUs	7 GB / 1 card (GPU1 idle)	12.7 GB / 2 cards (~7 GB each)
Speed	~28 tok/s	~19.5 tok/s (−30%)
Token glitches	`self-さattention` etc.	gone — clean ✅
HEIDI answer	backwards ❌	correct ✅ ("small p = fails, filter it out")

Less quantization bought three things: the glitches disappeared, it got the niche domain detail right, and — because the bigger file overflows one card — the otherwise-idle second 1080 Ti finally did work. The cost was ~30% throughput.

(Honesty note: I asked Q8 the HEIDI question with a more pointed framing than Q4, so that single comparison isn't perfectly controlled. The token-glitch difference, on identical prompts, is unambiguous.)

When does the second 1080 Ti actually help?

Combining this with an earlier 35B-MoE run, a clear rule emerges:

Model fits one card (12B Q4): second GPU is idle — useless.
Model overflows one card (12B Q8, or a 35B): it spills to the second card, which now helps.

The second 1080 Ti isn't about speed; it's about fitting a bigger or higher-precision model.

Honest Limitations

One model, two quants, one box; your tok/s will vary with CPU, RAM, and context length.
Q8 HEIDI test used a more direct prompt — suggestive, not a controlled A/B.
Quality judged on a handful of prompts, not a benchmark suite.
Ollama 0.30.2's Gemma 4 support is early (the CLIP crash, the reasoning-channel behavior); later versions may change this.

Reproduce

ollama pull hf.co/unsloth/gemma-4-12b-it-GGUF:Q4_K_M       # or :Q8_0 for the 2-card run
ollama show --modelfile hf.co/unsloth/gemma-4-12b-it-GGUF:Q4_K_M > m.Modelfile
# remove the mmproj/CLIP `FROM` line, keep the text GGUF
ollama create gemma4-12b-text -f m.Modelfile
# then call /api/chat with "think": false

FAQ

Q: Can a GTX 1080 Ti run Gemma 4 12B?

Yes — ~28 tok/s at Q4 on a single card, ~19.5 tok/s at Q8 across two. Just strip the vision projector (it crashes Ollama 0.30.2) and disable the reasoning channel with think: false.

Q: Q4 or Q8?

Q4 for speed and casual use (one card). Q8 when correctness matters: on my field's questions it removed the token glitches and fixed an answer Q4 got backwards — at ~30% lower speed, and it needs both cards.

Q: Why did the second GPU sit idle at Q4?

A 12B at Q4 is ~7 GB and fits one 11 GB card, so Ollama uses one GPU. Only when the model overflows one card (Q8, or a larger model) does the second card get used.

Resources

Model: unsloth/gemma-4-12b-it-GGUF
Related: 35B MoE on 2× 1080 Ti · Ollama

Running 35B–400B LLMs on a GPU-less Cluster to Mine 10,000 Papers — and the 4 Bugs That Almost Ruined the Data

byeongsoo kang — Wed, 03 Jun 2026 05:55:34 +0000

A field report from building a CPU-only, distributed LLM pipeline for large-scale scientific literature extraction. No GPUs. A lot of quantization. And four silent data-quality bugs that taught me more than the happy path ever did.

The constraint that started it all

Our team runs an internal research cluster: a couple dozen older x86 servers, plenty of RAM, zero GPUs. The mandate was to extract structured data — effect sizes, the entity each one describes, and the direction of effect — from ~10,000 full-text research papers, so a downstream meta-analysis could pool them.

The obvious 2024-era answer is "send it to a hosted LLM API." That wasn't on the table for data-governance reasons: the corpus had to stay on-prem. So the real question became:

Can you do serious LLM extraction at the 10k-document scale with CPUs only?

Spoiler: yes — but the interesting part isn't the throughput. It's that correctness, not speed, turned out to be the hard problem. Let me walk through the architecture, then the four bugs that each silently corrupted the data in a different way.

The stack

Everything is open source and CPU-friendly:

llama.cpp serving quantized GGUF models over its OpenAI-ish HTTP endpoint. We ran a MoE model (~35B total / ~3B active, Q8) as the high-throughput workhorse on 8 nodes, and a ~400B model (Q3) on a dedicated node for the heaviest pass.
BGE-M3 (1024-dim) for embeddings, also on llama.cpp, across 8 nodes.
Qdrant as the vector DB.
Plain Python (requests + ThreadPoolExecutor) for orchestration. No Ray, no fancy scheduler — just a queue and one worker bound per node, because each llama.cpp server runs --parallel 1: on CPU, inference is memory-bandwidth bound, so one in-flight request already saturates the memory bus and batching buys little.
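The orchestration pattern described in the last bullet (a shared queue, one worker bound to each node, one request in flight per node) can be sketched as follows. The send callable is a stand-in: in a real pipeline it would POST to that node's llama.cpp endpoint, and the node names here are placeholders.

```python
import queue
from concurrent.futures import ThreadPoolExecutor


def run_on_nodes(node_urls, items, send):
    """One worker thread per node; each pulls the next item as soon as it is free."""
    todo = queue.Queue()
    for idx, item in enumerate(items):
        todo.put((idx, item))
    results = [None] * len(items)

    def worker(node_url):
        while True:
            try:
                idx, item = todo.get_nowait()
            except queue.Empty:
                return  # queue drained, this node is done
            # Only one request is ever in flight per node (matches --parallel 1)
            results[idx] = send(node_url, item)

    with ThreadPoolExecutor(max_workers=len(node_urls)) as pool:
        futures = [pool.submit(worker, url) for url in node_urls]
        for future in futures:
            future.result()  # re-raise any exception from a worker
    return results


def fake_send(node_url, sentence):
    """Placeholder for the HTTP call; returns which node handled the item."""
    return node_url, sentence.upper()


print(run_on_nodes(["node-1", "node-2"], ["a", "b", "c"], fake_send))
```

Pulling from a shared queue gives load balancing for free: a node that finishes early simply takes the next sentence.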

Each node is a dual-socket Xeon, ~36 cores total (AVX-512), no accelerator. The 35B MoE generated ~6 tokens/s per node; with 8 nodes load-balanced, a sentence took ~10s end to end and the full 14k-sentence extraction finished in a few hours.

MoE was the unlock for CPU: ~3B active parameters per token means it generates at a usable rate even without a GPU, while delivering quality far above what its ~3B active count alone would suggest.

The ~400B Q3 model was reserved for a separate, earlier abstract-level pass — a different job at a different scale, out of scope for this post — where its stronger one-shot reading paid off. On a single CPU node it ran at low single-digit tokens/s, so routing the sentence-level corpus through it was never viable; everything below is the 35B MoE.

RAG is not extraction (the distinction that bites everyone)

First, a clarification I had to make repeatedly, because it confuses people (it confused me):

RAG (embed → vector search) answers questions. You chunk text, embed it, and at query time retrieve the top-k most semantically similar passages to ground an LLM's answer. Great for "find me passages about X."
Extraction for meta-analysis needs numbers — every effect size from every paper, aggregated. That is exhaustive structured extraction, not retrieval.

A vector DB stores d = -0.45 as a text token inside an embedding. It will happily find that sentence by meaning, but it cannot compute over the number. If your goal is to pool effect sizes, embeddings are the wrong tool. You want extraction.

So the pipeline is a hybrid: a cheap mechanical pass to find candidate sentences, then an LLM to interpret them.

10k full-text papers
   │
   ├─ ① regex pre-filter  (mechanical, no understanding)
   │     keep sentences that have a number near a target-entity keyword
   │     → ~14k candidate sentences
   │
   └─ ② LLM mapping       (the judgment step)
         each sentence → {entity, metric, direction, value, measure_type}
         → structured JSON for the meta-analysis

Regex is the funnel; the LLM is the brain. Neither replaces the other.
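Step ① might look something like the sketch below. The keyword list, the 60-character window, and the number pattern are all illustrative assumptions, not the pipeline's actual rules; the point is only that this pass is mechanical and cheap.

```python
import re

NUMBER = r"-?\d+(?:\.\d+)?"  # integers and decimals, optionally negative


def make_prefilter(keywords, window=60):
    """Build a predicate: keyword and number within `window` characters of each other."""
    kw = "|".join(re.escape(k) for k in keywords)
    pattern = re.compile(
        rf"\b(?:{kw})\b.{{0,{window}}}?{NUMBER}"    # keyword first, number after
        rf"|{NUMBER}.{{0,{window}}}?\b(?:{kw})\b",  # or number first, keyword after
        re.IGNORECASE,
    )
    return lambda sentence: pattern.search(sentence) is not None


keep = make_prefilter(["treatment"])  # hypothetical target-entity keyword
sentences = [
    "The treatment reduced scores (d = -0.45).",
    "The treatment was well tolerated.",
    "We enrolled 120 patients.",
]
candidates = [s for s in sentences if keep(s)]
print(candidates)
```

A filter like this is deliberately loose: false positives are cheap because the LLM in step ② rejects them, while false negatives are lost for good.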

Now the fun part.

Bug #1 — the chunker that silently deleted 79% of the data

The embedding side (the RAG corpus) had its own chunking pipeline. It looked fine. Counts looked fine. Then someone asked a simple question — "how many points are actually in the collection?" — and the numbers didn't add up: ~1M chunks generated, ~217k points in the DB.

A 78% gap. Where did 800k chunks go?

The culprit was the point ID. Each chunk got an ID derived from (paper_id, chunk_index). Reasonable — except chunk_index was reset to 0 at the start of every section:

for section in sections:
    for j, chunk in enumerate(chunk_text(section)):   # j resets per section!
        point_id = make_id(paper_id, j)               # collision: (abstract,0) == (methods,0)
        upsert(point_id, ...)

So a paper's abstract chunk-0 and its methods chunk-0 and its results chunk-0 all hashed to the same point ID. Qdrant upserts are idempotent by ID, so each new section silently overwrote the previous one. Every paper collapsed to roughly max(chunks in any single section) points.

I confirmed it by replaying the raw chunks: 27,222 chunks across a sample → only 5,672 unique (paper_id, chunk_index) pairs. 79.2% collision on the sample, closely matching the 78% gap across the full DB (the small delta is just sampling — one is a replayed subset, the other the whole collection).

The fix is a one-liner — make chunk_index a running counter across the whole paper (and derive the ID with a deterministic hash like hashlib/UUID, not Python's per-process hash(), so IDs stay stable across runs) — but the lesson isn't the fix. It's that a silent overwrite produces a database that looks completely healthy: green status, fast queries, plausible counts. Nothing errors. You only catch it if you reconcile "things I generated" against "things that landed."
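
A minimal sketch of that fix, assuming a paper arrives as a list of sections, each already split into chunks. The namespace string and helper names are hypothetical; the point is only the running counter and the deterministic `uuid.uuid5` ID.

```python
import uuid

# Hypothetical namespace; any fixed UUID works as long as it never changes.
NAMESPACE = uuid.uuid5(uuid.NAMESPACE_URL, "example.org/rag-corpus")


def make_point_id(paper_id: str, chunk_index: int) -> str:
    """Same (paper_id, chunk_index) gives the same UUID in every process."""
    return str(uuid.uuid5(NAMESPACE, f"{paper_id}:{chunk_index}"))


def point_ids_for_paper(paper_id, sections):
    """sections: list of lists of chunk strings; the counter never resets."""
    ids = []
    running = 0
    for section_chunks in sections:
        for _chunk in section_chunks:
            ids.append(make_point_id(paper_id, running))
            running += 1
    return ids


ids = point_ids_for_paper("paper-001", [["a0", "a1"], ["m0"], ["r0", "r1", "r2"]])
```

Six chunks now produce six distinct IDs, and re-running the script reproduces exactly the same ones, so a repeated upsert stays idempotent without overwriting neighbouring sections.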

Reconcile your pipeline's input count against its output count at every hop. Silent data loss doesn't throw.
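
One way to make that reconciliation mechanical is a small check between hops. This is a sketch with invented names; where the `actual` count comes from (for example the vector DB's own point count) is left to the caller.

```python
def reconcile(stage: str, expected: int, actual: int, tolerance: float = 0.0) -> None:
    """Raise if a hop lost or gained more than `tolerance` (a fraction) of its records."""
    if expected < 0 or actual < 0:
        raise ValueError("counts must be non-negative")
    gap = abs(expected - actual)
    if gap > tolerance * expected:
        pct = 100.0 * gap / expected if expected else float("inf")
        raise RuntimeError(f"{stage}: expected {expected}, got {actual} ({pct:.1f}% gap)")


reconcile("chunking", 100, 100)  # passes silently

try:
    reconcile("upsert", 1_000_000, 217_000)  # the rounded counts from Bug #1
    message = ""
except RuntimeError as err:
    message = str(err)
```

With the rounded numbers from this bug, the check fails loudly instead of leaving a healthy-looking collection behind.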

Bug #2 — recursive chunking that duplicated 75% of the text

While fixing #1, I re-ran the chunker on a fresh corpus and a sample paper produced 7,588 chunks, of which only 1,897 were unique — 75% duplicates.

The XML parser walked sections like this:

for sec in body.findall(".//sec"):          # ALL <sec>, including nested ones
    paragraphs = sec.findall(".//p")          # ALL <p>, recursively

In journal XML, sections nest. A parent <sec> contains child <sec>s. .//p is recursive, so the parent emitted all of its children's paragraphs — and then each child <sec> was visited separately and emitted them again. Deeply nested papers (a conference-proceedings document with 600 sub-sections was the worst) exploded.

Fix: take direct-child paragraphs only (sec.findall("p")), plus a within-paper dedup as a safety net. Chunks dropped to the honest count, embedding time dropped with it.
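
The difference is easy to reproduce with the standard library. The sketch below uses a made-up three-level document, not the real journal XML, and `xml.etree.ElementTree` rather than whichever parser the pipeline used.

```python
import xml.etree.ElementTree as ET

XML = """
<body>
  <sec id="s1">
    <p>intro</p>
    <sec id="s1.1">
      <p>detail A</p>
      <sec id="s1.1.1"><p>detail B</p></sec>
    </sec>
  </sec>
</body>
""".strip()


def paragraphs(body, recursive):
    """Collect paragraph text per <sec>; recursive=True reproduces the bug."""
    path = ".//p" if recursive else "p"
    out = []
    for sec in body.findall(".//sec"):   # visits every <sec>, nested ones too
        out.extend(p.text for p in sec.findall(path))
    return out


body = ET.fromstring(XML)
buggy = paragraphs(body, recursive=True)
fixed = paragraphs(body, recursive=False)
```

Three real paragraphs become six under the recursive path, because each paragraph is emitted once per ancestor section. The deeper the nesting, the worse the blow-up.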

.// in XPath is a footgun when your tree is recursive and you also iterate the tree.

Bug #3 — the reasoning model that never stops thinking

Onto the extraction LLM — the 35B MoE workhorse. It's a reasoning model that emits a <think>…</think> block before its answer. The first run capped generation at 512 tokens with stop=["\n\n"]. Result: 0% parse rate. The \n\n stop fired inside the thinking block, truncating mid-thought; no JSON ever appeared.

OK, remove the bad stop, give it room. Bump to 1024 tokens. Now ~42% parse — better, but a third of outputs were still <think> with no </think>: the model hit the token cap still reasoning.

So give it more room. 2048 tokens, 600-second timeout, quality-first. I ran a single hard sentence as a test. It generated 6,144 characters in 269 seconds and still hadn't closed the think block — it was literally mid-sentence, "Let's draft the JSON:", when it ran out of budget. At that rate, 14k sentences would take ~5 days and still fail on the hard ones.

The model wasn't slow. It was non-terminating: on ambiguous inputs it reasoned in circles and never committed to an answer. More tokens didn't help; it just thought more.

The fix is a known trick for reasoning models in raw-completion mode: pre-close the think block in the prompt so the model skips deliberation and answers directly:

prompt = f"...<|im_start|>assistant\n<think>\n\n</think>\n\n"
#                                  ^ empty, pre-closed → no open-ended reasoning

Latency dropped from "minutes, maybe never" to ~10 seconds, deterministically. The whole 14k run finished in hours, at 99.96% parse.

A reasoning model with no cap on its thinking is a liability for bulk structured output. If you don't need the chain-of-thought, close it.
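
The "~5 days" and "hours" figures can be sanity-checked with back-of-the-envelope arithmetic. The sketch assumes the 14k sentences split evenly across the 8 nodes mentioned in the takeaways, and that every sentence costs the same as the single test sentence, which is a simplification.

```python
SENTENCES = 14_000   # ~14k candidate sentences
NODES = 8            # even split across nodes is an assumption


def wall_clock_hours(seconds_per_sentence: float) -> float:
    """Total wall-clock hours if the work is spread evenly over NODES."""
    return SENTENCES * seconds_per_sentence / NODES / 3600


thinking_days = wall_clock_hours(269) / 24   # the 269 s test that never closed <think>
no_think_hours = wall_clock_hours(10)        # ~10 s per sentence with <think> pre-closed
```

Under those assumptions the thinking run lands a bit above five days and the no-think run a bit under five hours, in line with the figures quoted here.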

Bug #4 — empty outputs, and a one-character fix

No-think mode had its own quirk: on ~14% of the harder sentences, the model returned completely empty output. Not bad JSON — nothing. Deterministic (temperature 0), so retries reproduced the emptiness exactly.

The model, forced to answer immediately, was "blanking" on sentences it found ambiguous. The fix was almost insultingly small: seed the assistant turn with an opening bracket so the model is already inside a JSON array and must continue it:

prompt = f"...<|im_start|>assistant\n<think>\n\n</think>\n\n["
#                                                          ^ forces JSON to start

(You then prepend the [ back when parsing, since the completion only returns what comes after the prompt.) This recovered 298 of 301 empties → 99.86% parse on the hard subset.
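
Put together, the prompt seeding and the parsing step might look like the sketch below. The user-turn layout, the instruction text, and both function names are assumptions for illustration; only the assistant prefix comes from the snippets above.

```python
import json


def build_prompt(instructions: str, sentence: str) -> str:
    """ChatML-style raw-completion prompt: think block pre-closed, '[' seeded."""
    return (
        f"<|im_start|>user\n{instructions}\n\nSentence: {sentence}<|im_end|>\n"
        "<|im_start|>assistant\n<think>\n\n</think>\n\n["
    )


def parse_completion(completion: str):
    """Re-attach the seeded '[' and decode the first JSON value; None on failure."""
    text = "[" + completion
    try:
        value, _end = json.JSONDecoder().raw_decode(text)
    except json.JSONDecodeError:
        return None
    return value if isinstance(value, list) else None


prompt = build_prompt("Extract effects as a JSON array.", "d = -0.45 in patients")
parsed = parse_completion('{"entity": "x", "value": -0.45}]\ntrailing text')
empty = parse_completion("]")
failed = parse_completion("")
```

Using `raw_decode` rather than `json.loads` tolerates trailing text after the closing bracket, and a truly empty completion comes back as `None` so it can be counted rather than silently dropped.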

When a model can output "nothing," constrain the output space so "nothing" isn't reachable.

The bug that wasn't a bug: precision vs. recall

The last lesson is subtler. The first extraction run mapped a number whenever a sentence had a number near a target-entity keyword. The audit found ~50% of the mapped "effect sizes" weren't the target effect at all — they were regression-predictor t-values (age, sex, medication), correlations with secondary task scores, even positional coordinates (x = -28) the regex had grabbed as if they were measurements.

That noise produced a confident-but-spurious aggregate signal. Garbage in, significant garbage out.

The fix had two halves, and I got both wrong in instructive ways first:

Filter at the source. Only feed the LLM sentences from papers that are actually about the topic, with a target entity and a real effect statistic. (First attempt: filter at the sentence level — too aggressive, it dropped real effects whose context lived in the neighboring sentence. Second attempt: filter at the paper level — recovered ~1,100 real effects the sentence filter had thrown away.)
Make the prompt relationship-aware. Tell the model what kind of number counts and what to reject (predictors, task-performance correlations, coordinates, cluster stats), with a worked rejection example.

But I over-corrected: my first sharpened prompt rejected so aggressively it returned [] for valid patient-vs-control effects too (1/15 on a sanity sample). The filter and the prompt were fighting — the filter guaranteed the paper was on-topic, but the prompt still demanded an explicit topic keyword in the sentence. Once I told the prompt "you can trust that this sentence is from an on-topic paper; extract the entity's effect and only reject these specific noise types," recall snapped back (9/15) with zero coordinate leakage.

Precision in the mapped set went from ~49% to ~66% at the sentence level; at the paper level (meaning every paper that contributes an effect is genuinely on-topic) it was 100%. Total entries dropped from ~4,900 to ~1,700, and almost everything removed was noise. The residual ~34% sentence-level noise isn't pooled blind, but be precise about what catches it: the load-bearing filter downstream is entity normalization against a controlled vocabulary, where off-target entities (age, sex, medication) get dropped, backed by a validation gate. (Stratifying by measure type and dedup are cleanup, not misclassification removal: a predictor t-value mislabeled as a target effect sails right through those.) The mapping's job is to maximize signal and flag; the controlled-vocabulary step is where the final noise is supposed to die.
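
For readers who have not built one, a controlled-vocabulary gate can be very small. This is a sketch with a made-up vocabulary and entry shape, not the project's actual normalizer.

```python
# Hypothetical vocabulary: surface form -> canonical entity.
VOCAB = {
    "hippocampus": "hippocampus",
    "hippocampal volume": "hippocampus",
    "amygdala": "amygdala",
}


def normalize_entries(entries, vocab=VOCAB):
    """Keep entries whose entity maps into the vocabulary; return (kept, dropped)."""
    kept, dropped = [], []
    for entry in entries:
        key = str(entry.get("entity", "")).strip().lower()
        canonical = vocab.get(key)
        if canonical is None:
            dropped.append(entry)          # off-target entity, e.g. a covariate
        else:
            kept.append({**entry, "entity": canonical})
    return kept, dropped


kept, dropped = normalize_entries([
    {"entity": "Hippocampal volume", "value": -0.45},
    {"entity": "age", "value": 2.1},
])
```

Returning the dropped entries alongside the kept ones makes it possible to audit what the gate removed, in the same spirit as reconciling counts at every hop.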

The most dangerous extraction failure isn't a crash or a low parse rate. It's clean-looking data that's confidently wrong. Audit what your pipeline includes, not just what it drops.

Takeaways

CPU + quantized MoE is genuinely viable for 10k-document LLM work: at ~6 tok/s/node across 8 nodes, the full 14k-sentence extraction finished in a few hours. The bottleneck was never compute — it was correctness.
Reconcile counts at every hop. Both data-loss bugs (#1, #2) were invisible from status dashboards; only input-vs-output reconciliation surfaced them.
Reasoning models need a thinking budget — or none at all — for bulk structured output. Pre-closing <think> and seeding the output bracket turned a 5-day non-terminating job into a few-hour deterministic one.
For extraction, precision is the silent killer. A permissive regex + a literal LLM will hand you a statistically significant result built on coordinates and covariates. Filter at the right granularity, and tell the model what to reject.

None of these are exotic. They're the unglamorous correctness work that sits between "the demo runs" and "the numbers are trustworthy" — which, for anything feeding a real analysis, is the whole job.

This pipeline powered the large-scale literature extraction behind our chronic-stress scoping-review preprint (Research Square).

Tools used: llama.cpp, BGE-M3, Qdrant, Python. All on-prem, all CPU.

A MOGONET-Style Multi-Omics Biomarker Pipeline: Why a Near-Random Graph Net Still Earns Its Place

byeongsoo kang — Wed, 03 Jun 2026 05:24:09 +0000

TL;DR (Quick Answer)

This is an honest engineering write-up of a MOGONET-style multi-omics consensus biomarker pipeline built as an internal R&D project at sysofti.

The headline — on a small synthetic cohort (n=30), the graph network alone scores near-random in leak-free 5-fold cross-validation (AUC 0.53 ± 0.16). Yet as one voter in a 5-evidence consensus, the top-10 ranking is 90% real markers (9 of 10 are known periodontitis genes).
The lesson — a single model that looks weak in honest evaluation can still be a useful voter. That contrast is the whole point of the consensus design, and we show it with data.
What it is — per-omics Graph Convolutional Networks (GCN) over a sample-similarity graph, attention-fused, contributing to a consensus score alongside differential-expression hubs, Random Forest, a DNN, and co-expression modules.
What it is *not* — the official MOGONET. We dropped the original's VCDN fusion for attention fusion. Call it "MOGONET-based." All numbers are from synthetic data with embedded ground-truth markers — code validation, not a clinical claim.

If you're implementing multi-omics integration, the parts you can't get from the paper are below: the real results, the leakage-aware evaluation, and the bugs we hit.

What MOGONET Is (the One-Line Mental Model)

MOGONET (Multi-Omics Graph cOnvolutional NETwork) learns a separate GCN per omics view on a sample-similarity graph (patients as nodes, edges by feature similarity), then fuses the per-view embeddings for classification and biomarker discovery. Reference: Wang et al. 2021, Nature Communications 12:3445; the GCN itself is Kipf & Welling 2017.

Mental model: "build one graph net per omics layer, let each form an opinion, then combine those opinions."

What We Simplified — and Why

The original MOGONET fuses views with a View Correlation Discovery Network (VCDN). We replaced it with attention-weighted fusion:

Why — with tiny cohorts (tens of samples), VCDN's extra parameters were a liability; attention fusion gave a simpler intermediate-fusion scheme that still up-weights the more informative omics per sample.
The tradeoff — we lose the explicit cross-view correlation modeling that is part of MOGONET's original contribution. So this is honestly MOGONET-based, not a reimplementation. The source docstring says as much: "Simplified implementation of MOGONET."

Architecture

Input: X_views = [omics1 (n×p1), omics2 (n×p2), ...]   (n = common samples)
  └─ per-view StandardScaler
  └─ per-view k-NN (cosine) adjacency  (n×n)
ViewEncoder (per omics):  GraphConv(p→128) → BN → ReLU → GraphConv(128→64)
  → view embedding (n×64)
Attention fusion:  softmax(Linear(64→1)) over views → weighted sum (n×64)
Classifier:  Linear(64→32) → ReLU → Linear(32→n_classes)

import torch
import torch.nn as nn
import torch.nn.functional as F

class GraphConvLayer(nn.Module):
    def __init__(self, in_features, out_features):
        super().__init__()
        self.linear = nn.Linear(in_features, out_features)
    def forward(self, x, adj):
        return torch.mm(adj, self.linear(x))   # propagate over the sample graph

class MOGONET(nn.Module):
    def __init__(self, input_dims, hidden_dim=128, latent_dim=64, n_classes=2):
        super().__init__()
        self.encoders = nn.ModuleList([ViewEncoder(d, hidden_dim, latent_dim) for d in input_dims])
        self.attention = nn.Linear(latent_dim, 1)
        self.classifier = nn.Sequential(nn.Linear(latent_dim, 32), nn.ReLU(), nn.Linear(32, n_classes))
    def forward(self, views, adjs):
        embeddings = [enc(x, adj) for enc, x, adj in zip(self.encoders, views, adjs)]
        stacked = torch.stack(embeddings, dim=0)                       # n_views × n × latent
        attn = F.softmax(self.attention(stacked).squeeze(-1), dim=0)   # per-view, per-sample
        fused = (stacked * attn.unsqueeze(-1)).sum(dim=0)              # n × latent
        return self.classifier(fused)

Sample-similarity graph — k-NN (cosine), no self-loops on purpose (see below):

import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

def build_adjacency(X, k=5):
    sim = cosine_similarity(X)
    adj = np.zeros_like(sim)
    for i in range(len(sim)):
        top_k = np.argsort(sim[i])[-k-1:-1]      # top-k neighbours; assumes self is the single most similar entry
        adj[i, top_k] = sim[i, top_k]
        adj[top_k, i] = sim[top_k, i]            # symmetrize
    row_sum = adj.sum(axis=1, keepdims=True)
    row_sum[row_sum == 0] = 1                    # guard zero-sum rows against division by zero
    return adj / row_sum

The Engineering Decisions That Mattered

Sample-node graph, not feature graph. Nodes are patients; edges are patient-patient similarity. Same-group patients cluster, so the GCN smooths group signal.
No self-loops — on purpose. Standard GCN uses Ahat = A + I so a node keeps its own features. We deliberately omit the self-loop so each node's representation is built purely from its sample-neighborhood, pushing the model toward group structure rather than individual raw features. It is a tradeoff (you give up the node's own signal each layer), and we flag it as a choice, not an accident.
Per-view scaling + common-sample intersection. Each omics standardized independently; only samples present in all views are used.
Consensus over a single model. MOGONET is one of five evidence sources by design — Hub (DE+PPI), ML (Random Forest), DL (DNN), WGCNA co-expression, and MOGONET — with a multi-evidence bonus:

n_sources = len(scores)                                 # assumed: one entry per evidence source that scored this feature
avg_score = sum(scores.values()) / max(n_sources, 1)
composite = avg_score * (1 + 0.3 * (n_sources - 1))   # reward agreement across sources

As the results show, this design choice is what makes the pipeline useful despite any single model being weak.
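
To see what the bonus does to the scale, here is the same formula wrapped in a function. It assumes `scores` maps each contributing source to a score in [0, 1] and that the source count is simply `len(scores)`; both are readings of the snippet, not confirmed details.

```python
def composite_score(scores: dict, bonus: float = 0.3) -> float:
    """Average per-source score, scaled up by `bonus` per additional agreeing source."""
    n_sources = len(scores)
    if n_sources == 0:
        return 0.0
    avg_score = sum(scores.values()) / n_sources
    return avg_score * (1 + bonus * (n_sources - 1))


one_source = composite_score({"ML": 1.0})
four_sources = composite_score({"Hub": 1.0, "ML": 1.0, "DL": 1.0, "MOGONET": 1.0})
```

A perfect score from a single source tops out at 1.0, while four agreeing sources can reach 1.9, so composite values above 1 are expected for multi-source hits.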

Results (Synthetic Data, with Ground Truth)

We validate on a synthetic periodontitis case-control set (3 omics — transcriptomics 500, proteomics 200, metabolomics 100 features × 30 samples, 15 disease / 15 control, seed-fixed) with known biomarkers deliberately embedded: up-regulated inflammatory genes (MMP8, MMP9, IL1B, IL6, TNF, RANKL, CTSK, TLR4 …) and down-regulated bone-formation genes (COL1A1, RUNX2, SP7, BGLAP, OPG …). Embedding known markers gives ground truth — you can check whether the pipeline recovers them, which is impossible on a real cohort.

Note on sources: the pipeline defines five evidence sources, but in this run WGCNA returned no co-expression hubs, so four sources actually contributed (Hub, ML, DL, MOGONET).

The consensus ranking surfaces real markers

Of 793 candidate features, the top-30 consensus included 13 of the 25 embedded markers. The ranking is strikingly clean at the top:

Rank	Gene	Composite	Sources	Known marker
1	MMP8	1.888	4	★
2	COL1A1	1.212	3	★
3	MMP9	1.020	4	★
4	IL6	1.000	1	★
5	IL1B	0.900	4	★
6	METAB_0031	0.866	1	—
7	TLR4	0.856	3	★
8	RANKL	0.838	3	★
9	CTSK	0.803	3	★
10	SP7	0.678	3	★
11	MYD88	0.672	3	★

Precision@10 = 0.90 — 9 of the top 10 are known markers (only METAB_0031 is not).
Recall@10 = 0.36, Recall@20 = 0.52 (9 then 13 of 25 known markers); it plateaus by 20 because a few embedded markers were given weak synthetic signal (e.g. TNF, fold-change ≈ 1.1).
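
Both metrics are simple counts. The sketch below recomputes the top-10 figures from the table; the known-marker set contains only the markers named in this write-up, so the total of 25 is passed in separately rather than derived.

```python
TOP10 = ["MMP8", "COL1A1", "MMP9", "IL6", "IL1B",
         "METAB_0031", "TLR4", "RANKL", "CTSK", "SP7"]

# Only the markers named in this post; the full embedded set has 25 entries.
NAMED_KNOWN = {"MMP8", "MMP9", "IL1B", "IL6", "TNF", "RANKL", "CTSK", "TLR4",
               "COL1A1", "RUNX2", "SP7", "BGLAP", "OPG", "MYD88"}
N_KNOWN_TOTAL = 25


def precision_recall_at_k(ranked, known, k, n_known_total):
    """Precision = hits / k; recall = hits / total number of known markers."""
    hits = sum(1 for gene in ranked[:k] if gene in known)
    return hits / k, hits / n_known_total


p10, r10 = precision_recall_at_k(TOP10, NAMED_KNOWN, 10, N_KNOWN_TOTAL)
```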

More evidence = more trustworthy

Breaking the top-30 down by which sources agreed makes the consensus logic concrete:

4 sources → 3 genes, all 3 known (100%): MMP8, MMP9, IL1B.
3 sources → 17 genes, 9 known.
2 sources (DL + MOGONET) → 8 genes, 0 known — pure noise.
1 source → 2 genes, 1 known.

The signal lives where independent methods agree. A gene flagged by four sources was always real here; genes flagged by only two were not.

The honest part: the graph net alone is near-random

We cross-validated MOGONET as a standalone classifier, rebuilding the sample graph from training folds only to avoid leakage:

MOGONET 5-fold CV AUC = 0.53 ± 0.16 (folds: 0.44, 0.44, 0.78, 0.33, 0.67)

That is barely above chance. With n=30 (six test samples per fold) and a transductive sample-graph model, a single GCN simply cannot generalize here — and its training AUC near 1.0 is mostly the leakage and the injected signal talking. This is exactly why MOGONET is wired in as one voter, not the decision-maker. The consensus result above is strong because it doesn't trust any single model, including this one.

Honest Limitations

Simplified model. No VCDN fusion — attention instead. "MOGONET-based," not a reimplementation.
MOGONET is a weak standalone classifier here (CV AUC 0.53). Useful only in aggregate. It also assigns a score to all 793 features, so its vote alone does not narrow the candidate list.
Synthetic, small (n=30). Results validate the code's ability to recover injected signal — not clinical performance. External cohorts are required for any real claim.
Single run (seed 42). Known markers are stable at the top; the unnamed GENE_xxxx candidates shuffle on re-runs.
Self-loop omission is a design choice with a cost — worth A/B testing against the standard A + I formulation.
Feature importance is an approximation (first-layer weight magnitude), not a gradient-based attribution.

What Broke Along the Way (Real Notes)

Zero-sum adjacency → NaN. If a sample's k-NN cosine similarities summed to zero, row-normalization divided by zero and propagated NaNs. Fixed with a row_sum[row_sum == 0] = 1 guard.
Attribute-name mismatches (fixed twice). Pulling feature importance broke on AttributeError when the sklearn-wrapper conventions clashed with the nn.Module attribute names (view_encoders → encoders, model → model_).
Common-sample collapse. When omics measured different sample sets, the intersection shrank fast. Added a "≥6 common samples" guard that skips gracefully instead of crashing.
MOGONET scores everything. It assigns weight to all 793 features, so it appeared in all top-30 entries — the multi-evidence bonus is what keeps it honest.
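
The common-sample guard is the easiest of these to sketch. The function name and return convention below are invented for illustration; only the threshold of 6 comes from the note above.

```python
MIN_COMMON_SAMPLES = 6  # threshold from the guard described above


def common_samples(view_sample_ids, min_common=MIN_COMMON_SAMPLES):
    """Sorted sample IDs present in every view, or None if too few remain."""
    if not view_sample_ids:
        return None
    shared = set(view_sample_ids[0])
    for ids in view_sample_ids[1:]:
        shared &= set(ids)
    if len(shared) < min_common:
        return None   # caller skips this step instead of crashing
    return sorted(shared)


enough = common_samples([
    ["s1", "s2", "s3", "s4", "s5", "s6", "s7"],
    ["s1", "s2", "s3", "s4", "s5", "s6"],
    ["s6", "s5", "s4", "s3", "s2", "s1", "s9"],
])
too_few = common_samples([["s1", "s2", "s3"], ["s2", "s3", "s4"]])
```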

What We'd Improve Next

Report consensus performance under the same leak-free CV, not just MOGONET's.
A/B test self-loops (Ahat = A + I).
Gradient-based attribution (Integrated Gradients) instead of first-layer weights.
Add VCDN fusion and compare head-to-head with attention fusion.
External multi-omics cohort for real-world validation.

FAQ

Q: Is this the official MOGONET implementation?

No — a simplified, MOGONET-based design: per-omics GCN with attention fusion, without the original's VCDN view-correlation network.

Q: If MOGONET's CV AUC is only 0.53, why keep it?

Because it is one voter in a five-source consensus, not the classifier. Single models overfit small cohorts; consensus rewards agreement across independent methods, and that ranking recovered known markers at 90% precision in the top 10. A weak voter still adds signal when combined.

Q: Why validate on synthetic data?

Embedded known markers give ground truth, so you can measure recovery (recall/precision) — impossible on a real cohort where the answer is unknown. It validates the code, not clinical utility.

Q: Why omit GCN self-loops?

Intentional: without a self-loop, each node's representation comes purely from its sample-neighborhood, pushing the model toward group structure rather than individual features. It is a tradeoff worth A/B testing, not a universal recommendation.
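
For the A/B test, the only change is whether the identity is added before normalization. A dependency-free sketch on a toy 3-node graph follows. It uses plain row normalization as in build_adjacency, whereas the standard GCN formulation normalizes symmetrically, so treat it as an illustration of the toggle rather than of either exact model.

```python
def row_normalize(adj, self_loops=False):
    """Row-normalize a square adjacency (list of lists); optionally use A + I."""
    out = []
    for i, raw_row in enumerate(adj):
        row = [float(v) for v in raw_row]
        if self_loops:
            row[i] += 1.0
        total = sum(row)
        if total == 0:
            total = 1.0   # same zero-row guard as in build_adjacency
        out.append([v / total for v in row])
    return out


A = [[0.0, 1.0, 0.0],
     [1.0, 0.0, 1.0],
     [0.0, 1.0, 0.0]]
without = row_normalize(A)
with_loops = row_normalize(A, self_loops=True)
```

Without the self-loop, node 0 is represented entirely by its neighbour; with it, half of the weight stays on the node's own features.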

Q: Can I use this on my own multi-omics data?

Yes — the classifier is sklearn-compatible (fit/predict/predict_proba). Build the sample graph from training data only to avoid leakage, and don't over-read AUC on small cohorts.

Resources

Reference implementation (clean, standalone, MIT): github.com/shoo99/mogonet_lite
Original paper: Wang T. et al. (2021), MOGONET integrates multi-omics data via graph convolutional networks for biomarker discovery, Nat Commun 12:3445.

Running a 35B MoE (Qwen3.6-35B-A3B) on 2x GTX 1080 Ti in 2026 — Real Benchmarks, and Does the Second GPU Actually Help?

byeongsoo kang — Wed, 03 Jun 2026 05:18:42 +0000

TL;DR (Quick Answer)

I actually ran Qwen3.6-35B-A3B — a 35B-parameter mixture-of-experts model (only 3B active per token) — on a pair of 8-year-old GTX 1080 Ti cards (22 GB combined). Real, measured numbers:

Generation speed: ~20 tokens/sec on 2× 1080 Ti (IQ4_XS quant), stable across runs (19.4 / 21.4 / 20.0).
Single GPU: ~16.8 tok/s. CPU-only (i9-14900K): ~17.1 tok/s. The second 1080 Ti buys only ~20% over one card — and, the kicker, the GPUs barely beat a modern CPU here (~+18%), because the MoE experts stay mmap'd in CPU RAM regardless. See the honest update below.
It only "fits" because of the MoE + CPU-mmap trick. ~13 GB of the model sits on the two GPUs; ~18 GB of expert weights are mmap'd from CPU RAM, and only the active 3B runs each token.
Quant matters for 22 GB: the default qwen3.6:35b-a3b tag is 23.9 GB and spills to CPU. You want ≤ IQ4_XS (~17.7 GB) to keep it (mostly) on the GPUs.

Bottom line: a 35B MoE is genuinely usable on this box in 2026 — but the honest workhorse turned out to be the i9-14900K CPU; the used 1080 Ti cards add only ~18%. Pick a sparse MoE and a quant that mostly fits — and know that for an offload-heavy MoE, a fast CPU + RAM bandwidth matters as much as the GPUs.

The setup (and one gotcha)

GPUs: 2× NVIDIA GeForce GTX 1080 Ti (11 GB each, 22 GB total), Pascal, compute capability 6.1.
Driver: 581.57 (Windows host, used via WSL2 passthrough). This matters — recent Ollama bundles CUDA 13, which refuses drivers older than 570. On the older 560 driver it silently fell back to CPU (total_vram=0). Updating to 581 fixed it.
Ollama: v0.30.2. Interesting detail: its cuda_v13 build skips Pascal ("compute capability not in compiled architectures", cc 6.1), so it auto-falls back to the bundled cuda_v12 build to use the 1080 Ti. Good to know if you're on old hardware.

Why a "35B" model runs on old cards at all

Qwen3.6-35B-A3B is a mixture-of-experts (MoE): 35B total parameters, but only ~3B are active for any given token. So the compute per token is small (3B-class), even though all the experts must be available in memory.

That's the whole reason this works on Pascal: the GTX 1080 Ti has no tensor cores and modest FP16, so a dense 35B would crawl. A sparse 3B-active MoE keeps the per-token math light, and the bottleneck shifts to where the weights live — which is exactly what the dual-GPU question is about.

Quant fit on 22 GB

You can't just ollama pull qwen3.6:35b-a3b — that default is 23.9 GB and won't sit on 22 GB of VRAM. Measured GGUF sizes:

Quant	Size	Fits 22 GB?
Q3_K_M	~16.6–17.1 GB	✅ comfortable
IQ4_XS	~17.7 GB	✅ best quality that fits
Q4_K_S	~21 GB	⚠️ too tight (spills with KV cache)
Q4_K_M / default	23.9 GB+	❌ offloads to CPU

I used IQ4_XS.
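
The table's verdicts boil down to "file size plus some headroom must stay under 22 GB". A sketch of that check is below. The 2 GB headroom for KV cache and runtime buffers is an assumed round number, not something I measured, and the Q3_K_M entry uses the upper end of its measured range.

```python
VRAM_GB = 22.0
HEADROOM_GB = 2.0   # assumed allowance for KV cache and buffers, not measured

QUANTS_GB = {"Q3_K_M": 17.1, "IQ4_XS": 17.7, "Q4_K_S": 21.0, "Q4_K_M": 23.9}


def fits(size_gb, vram_gb=VRAM_GB, headroom_gb=HEADROOM_GB):
    """Nominal fit: model file plus headroom must not exceed total VRAM."""
    return size_gb + headroom_gb <= vram_gb


verdict = {name: fits(size) for name, size in QUANTS_GB.items()}
```

This is only the nominal size check. A quant that passes it can still end up partly in CPU RAM depending on how the runtime places the layers.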

Results: single vs dual 1080 Ti

Same model (IQ4_XS), same prompt, num_predict=256, measured via Ollama's /api/generate:

Config	Generation	Prefill	Model on GPU	Model on CPU (mmap)
CPU only (i9-14900K)	~17.1 tok/s	—	0 GB	whole model in RAM
1× GTX 1080 Ti	~16.8 tok/s	~50 tok/s	~3 GB	~18 GB+
2× GTX 1080 Ti	~20.3 tok/s	~50 tok/s	~13 GB (4 + 9.3)	~18 GB

Under load, the busier card drew up to ~101 W and GPU utilization sat around 26–33%. That is telling: the cards spend much of their time waiting, because the CPU-mmap'd experts are the bottleneck, not raw GPU FLOPs.

Update (2026-06-03): the honest punchline, after an r/ollama reader pushed back ("those numbers are slow for A3B"). I measured CPU-only on the same box, an Intel i9-14900K (32 threads, DDR5): ~17.1 tok/s. That's basically tied with a single 1080 Ti, and both GPUs combined are only ~18% ahead of it. So for this offload-heavy MoE, the old Pascal cards barely beat a modern CPU: the 14900K does most of the work and the GPUs mostly shave overhead. The honest framing isn't "a 35B runs on 2× 1080 Ti" so much as "a 35B MoE runs on a fast desktop CPU, and old GPUs add ~18%." When the experts have to live in CPU RAM, your CPU and memory bandwidth, not the GPU, set the ceiling. (On hardware where the whole MoE is VRAM-resident, the GPU story would look very different.)

So, does the second 1080 Ti help? A little: ~+20% over one card, ~+18% over CPU-only, by keeping ~10 GB more of the model in VRAM (~13 GB versus ~3 GB). But not 2×, and not the win you'd hope: an MoE that overflows your combined VRAM is gated by the CPU-side experts in every config here.
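
The percentages follow directly from the table. A quick check, using the rounded tok/s values quoted there:

```python
RATES = {"cpu": 17.1, "1x_gpu": 16.8, "2x_gpu": 20.3}   # tok/s from the results table


def gain_pct(new: float, base: float) -> float:
    """Relative speedup of `new` over `base`, in percent."""
    return 100.0 * (new / base - 1.0)


dual_vs_single = gain_pct(RATES["2x_gpu"], RATES["1x_gpu"])
dual_vs_cpu = gain_pct(RATES["2x_gpu"], RATES["cpu"])
```

Both gains land around 20%, nowhere near the 100% that "twice the GPUs" might suggest.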

Reproduce it

# (driver must be 570+ for current Ollama; check with: nvidia-smi)
ollama pull hf.co/bartowski/Qwen_Qwen3.6-35B-A3B-GGUF:IQ4_XS

# generate + read the eval rate
curl -s http://127.0.0.1:11434/api/generate -d '{
  "model": "hf.co/bartowski/Qwen_Qwen3.6-35B-A3B-GGUF:IQ4_XS",
  "prompt": "Explain mixture-of-experts in 150 words.",
  "stream": false,
  "options": {"num_predict": 256}
}'
# tokens/sec = eval_count / (eval_duration / 1e9)

To force a single GPU for comparison, start the server with CUDA_VISIBLE_DEVICES=0 ollama serve.
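
If you would rather not do the division by hand, the rates can be computed from the JSON that /api/generate returns. The sample response below is made up for illustration, not a measured result; the field names are the ones Ollama reports, with durations in nanoseconds.

```python
import json


def generation_rate(response: dict) -> float:
    """Generated tokens per second (eval_duration is in nanoseconds)."""
    return response["eval_count"] / (response["eval_duration"] / 1e9)


def prefill_rate(response: dict) -> float:
    """Prompt tokens processed per second."""
    return response["prompt_eval_count"] / (response["prompt_eval_duration"] / 1e9)


# Made-up response fragment for illustration only.
sample = json.loads(
    '{"eval_count": 256, "eval_duration": 12800000000,'
    ' "prompt_eval_count": 55, "prompt_eval_duration": 1100000000}'
)
gen = generation_rate(sample)
prefill = prefill_rate(sample)
```

Run it a few times and compare the spread. A single run on a warm or cold model can be misleading.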

Honest Limitations

One quant, one model, one box. IQ4_XS on 2× 1080 Ti; your tokens/sec will shift with quant, context length, CPU, and RAM speed.
Prefill measured on a short prompt (~55 tokens) — treat ~50 tok/s as a ballpark; long-context prefill on Pascal will be slower.
IQ4_XS is a ~4-bit quant — fine for chat/drafting, but it's not full-precision quality.
MoE-specific. These conclusions (the modest dual-GPU gain, the CPU-mmap behavior) are about this sparse MoE. A dense model that fully fits VRAM would scale differently across two cards.
A few runs, not a statistical study — numbers are representative, not p-valued.

FAQ

Q: Can a GTX 1080 Ti really run a 35B model in 2026?

A sparse MoE one, yes — Qwen3.6-35B-A3B at IQ4_XS ran ~20 tok/s on two of them. A dense 35B would not be usable. The 3B-active design is what makes it work.

Q: Will a second 1080 Ti double my speed?

No. Here it added ~20%. The MoE experts stay memory-mapped in CPU RAM in both single- and dual-GPU setups, so the second card helps but doesn't scale linearly.

Q: Why did Ollama ignore my GPU until I updated the driver?

Recent Ollama bundles CUDA 13, which requires NVIDIA driver ≥ 570. On an older driver it falls back to CPU silently. Update the driver; Ollama then uses its cuda_v12 build for Pascal cards.

Q: Which quant should I use on 22 GB?

IQ4_XS (~17.7 GB) for the best quality that stays (mostly) on the GPUs; Q3_K_M if you want more headroom for context. Avoid the 23.9 GB default — it spills to CPU.

Resources

Model: Qwen3.6-35B-A3B GGUF (bartowski)
Ollama · benchmark via /api/generate (eval_count / eval_duration).