DEV Community: Arsen Apostolov

Does a Second GPU Increase Ollama's Context Window? (Quadro P2000 + RTX 3090 Tested)

Arsen Apostolov — Thu, 09 Jul 2026 12:37:52 +0000

TL;DR

Short version: no. I dropped a much older GPU (Quadro P2000, 5GB, Pascal, 2016) next to an RTX 3090 (24GB, Ampere) on the same box, ran the same context-length ladder (8K→128K) through Ollama and vLLM on qwen3-coder:30B-A3B, and got zero extra usable context in either engine — and a 74% decode-speed hit for the trouble. Ollama hits the identical Chunk too big wall at ctx=65536 whether the P2000 is there or not. vLLM refuses tensor-parallel across the two cards entirely — not a VRAM problem, a flat compute-capability rejection (Minimum capability: 75. Current capability: 61.) that fails in 40 seconds, before any memory profiling. And the one real, measured effect of adding the P2000 to Ollama: decode speed goes from 76 → 19.5 tok/s at ctx=49152 once the P2000 gets pulled in as an actual compute device.

Full narrative version (the two-stage collapse, the prompt-cache validation bug caught mid-sweep, the CUDA13-silently-drops-Pascal finding) is on Medium.

The setup

ardi (dual Xeon E5-2680 v4, 128GB RAM, openSUSE Leap) has a Quadro P2000 sitting in a second slot next to the RTX 3090 this whole series has run on so far. Same model as phase 1 (qwen3-coder:30B-A3B), same box, four legs: {Ollama, vLLM} × {3090 only, 3090+P2000 tandem}, priced through HomeLab Monitor against real GPU power draw.

Ollama: same wall, extra tax

ctx	3090 only decode tok/s	tandem decode tok/s	P2000 VRAM / util (tandem)
8,192	124.3	122.0	6 MB / 0%
24,576	108.2	70.0	62 MB / 0%
32,768	99.4	61.0	62 MB / 0%
49,152	75.7	19.5	3,580 MB / 55%
65,536	fatal: `Chunk too big`	fatal: identical `Chunk too big`	—

Two separate costs, not one: decode already falls behind at ctx=24576 while the P2000 is still basically idle (62MB, 0% util) — some scheduling overhead just from having a second visible device. Then the real collapse hits at ctx=49152, when the P2000 actually gets pulled into the compute path (3.58GB, 55% util) and decode craters to 19.5 tok/s. Same context ceiling either way, worse speed the whole way there.
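The per-rung slowdown can be recomputed from the table above. A minimal Python sketch, with the decode figures copied from the table and the names chosen here purely for illustration:

```python
# Decode tok/s copied from the table above: ctx -> (3090 only, tandem).
decode = {
    8_192: (124.3, 122.0),
    24_576: (108.2, 70.0),
    32_768: (99.4, 61.0),
    49_152: (75.7, 19.5),
}

def slowdown_pct(solo: float, tandem: float) -> float:
    """Percent of decode speed lost when the second GPU is visible."""
    return (solo - tandem) / solo * 100.0

slowdowns = {ctx: slowdown_pct(s, t) for ctx, (s, t) in decode.items()}
for ctx, pct in slowdowns.items():
    print(f"ctx={ctx:>6}: {pct:5.1f}% slower in tandem")
```

The first cost shows up as a roughly 35% loss at ctx=24576, and the second as a roughly 74% loss at ctx=49152.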

vLLM: doesn't even get to try

Expected failure mode going in: tensor-parallel splits the ~17GB AWQ checkpoint roughly in half, and the P2000's 5GB doesn't hold its ~8.5GB share. Actual failure, at ctx=8192, in 40 seconds, before any memory profiling:

ValueError: The quantization method auto_awq is not supported for the current GPU.
Minimum capability: 75. Current capability: 61.

vLLM's AWQ path needs compute capability 7.5+ (Turing and later), as the error states. The P2000 is 6.1 (Pascal). This was not a close VRAM call but a flat architectural exclusion, decided before capacity is even checked.
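The same gate can be checked by hand before doing any VRAM math. A small sketch, assuming the capabilities are entered manually: the 6.1 and the minimum of 7.5 come from the error above, while 8.6 for the RTX 3090 is the published figure for that card and is not taken from the logs.

```python
# Compute capabilities as (major, minor) tuples.
# 6.1 and the 7.5 minimum come from the vLLM error above;
# 8.6 for the RTX 3090 is an assumption based on the published spec.
GPUS = {"RTX 3090": (8, 6), "Quadro P2000": (6, 1)}
MIN_CAPABILITY = (7, 5)  # "Minimum capability: 75"

def blocked_gpus(gpus: dict, minimum: tuple) -> list:
    """Return the names of GPUs below the minimum compute capability."""
    return [name for name, cc in gpus.items() if cc < minimum]

blocked = blocked_gpus(GPUS, MIN_CAPABILITY)
print(blocked)
```

On a live box with PyTorch installed, `torch.cuda.get_device_capability(i)` returns the same kind of (major, minor) tuple, so the table would not need to be typed in by hand.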

Bonus finding: Ollama's own CUDA13 build almost drops the P2000

Boot log, before any of the above:

skipping CUDA device — compute capability not in compiled architectures
device="Quadro P2000" cc=610
archs="[750 800 860 870 890 900 1000 1030 1100 1200 1210]"

Falls back to a legacy cuda_v12 runtime that does support Pascal — so it works, just via a path most people wouldn't notice without reading boot logs. This 2016 card is now old enough that modern quantized-inference stacks are starting to architecturally step around it, not just outrun it.
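A quick way to spot this without reading the whole boot log is to parse the archs line and test membership. A sketch, assuming the log format shown above; the 860 code for the RTX 3090 is an assumption (compute capability 8.6), not a value from the log.

```python
import re

# Log line copied from the boot log above.
ARCHS_LINE = 'archs="[750 800 860 870 890 900 1000 1030 1100 1200 1210]"'

def compiled_archs(line: str) -> set:
    """Extract the compiled architecture codes from an archs="[...]" line."""
    match = re.search(r"\[([\d\s]+)\]", line)
    return {int(code) for code in match.group(1).split()} if match else set()

archs = compiled_archs(ARCHS_LINE)
for name, cc in {"Quadro P2000": 610, "RTX 3090": 860}.items():
    status = "supported" if cc in archs else "skipped, needs a fallback runtime"
    print(f"{name} (cc={cc}): {status}")
```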

What wasn't the point of this one

Not claiming a second GPU is never worth it — a matched pair, or a smaller-but-newer card, is a different setup entirely. This was specifically: does this 5GB Pascal card, next to this 3090, on these two engines, buy anything. Check compute capability against your quantization scheme before you do the VRAM math — it can end the conversation first.

Every number above was priced through HomeLab Monitor (open source, MIT licensed) against ardi's real GPU power draw. Full write-up with all four charts and the mid-sweep debugging on Medium.

What's the oldest card you've tried to tandem into a rig? Did it actually pull weight, or did you just assume it did?

Whisper large-v3 VRAM Requirements: Why It Won't Fit on a 5GB GPU (and What We Tried Instead)

Arsen Apostolov — Wed, 08 Jul 2026 03:29:30 +0000

TL;DR

whisper-large-v3 OOMs on a 5GB GPU (Quadro P2000) at float16, int8_float16, and full int8 — before serving a single request. Root cause is architecture overhead (32-layer encoder-decoder, activations, CUDA context), not just weight size. Fine-tuned whisper-tiny → base → small → small-v2 on Common Voice Bulgarian instead: held-out WER improved from 88.2% → 32.7% across escalating model size, but never closed the gap to large-v3's 27.3%. A community large-v3-turbo Bulgarian fine-tune claiming 9.97% WER on FLEURS scored 31.2% on our own held-out set — same ballpark as our own model, not the win the model card implied. Built a real dual-GPU nginx failover (P2000 = fine-tune, 3090 = large-v3) that worked correctly on deploy, then failed a real spontaneous-speech test badly enough to roll back to large-v3-only within ~5 seconds. Core finding: Common Voice read-aloud WER does not predict real assistant-use transcription quality.

The setup

ardi has one RTX 3090 (24GB) doing LLM inference work, and a Quadro P2000 (5GB) that's sat idle for about two years. Jarvis, a self-hosted assistant, depends on Whisper for speech-to-text — testing showed only large-v3 handles Bulgarian well; smaller stock checkpoints are fine for English, not for a lower-resource language. large-v3 sits permanently loaded on the 3090, the same card needed for local LLM serving.

Question: can the idle P2000 take Bulgarian transcription off the 3090's hands via a Bulgarian-specific fine-tune small enough to fit 5GB?

(One naming note so the rest of this makes sense: the container running here is whisper-asr-webservice wrapping faster-whisper — not the separate WhisperX project, despite what I've been calling it internally for months.)

Attempt 1: does large-v3 just fit?

Tested large-v3 on the P2000 at three precisions:

float16        -> OOM
int8_float16   -> OOM
int8           -> OOM

All three OOM before serving a request. Not a quantized-weight-size problem — the encoder-decoder's non-weight overhead (32 layers, activations, CUDA context) exceeds 5GB regardless of precision. whisper-tiny loads at 318MB with no issue, ruling out a driver/compatibility problem. medium (769M params) was the practical ceiling for raw model size — 3.87GB used, 1.2GB headroom — but a generic multilingual medium isn't good enough for Bulgarian on its own.

Attempts 2–4: escalating fine-tunes

Fine-tuned on Mozilla Common Voice Bulgarian, on the 3090, via HuggingFace transformers Seq2SeqTrainer. Evaluated on the same 150 held-out test clips (never seen in training) for every model:

small-v2 = same architecture as small, retrained on train+other combined (6,739 rows vs 4,952) for 5 epochs. Validation WER by epoch: 32.17 → 28.99 → 28.21 → 28.21 → 28.44 — flattened, then rose at epoch 5 (overfitting), so load_best_model_at_end correctly kept the epoch 3/4 checkpoint rather than the final one. No more clean Bulgarian Common Voice data exists beyond train+other, so this is the practical ceiling for this data/model-size combination.
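The checkpoint choice can be reproduced from the validation curve alone. A sketch of the selection logic, using the five WER values quoted above; breaking the epoch 3/4 tie toward the earlier epoch is an assumption of this sketch, not something the text confirms.

```python
# Validation WER for epochs 1..5, copied from the text.
val_wer = [32.17, 28.99, 28.21, 28.21, 28.44]

# Lowest WER wins; min() keeps the first of tied values (epoch 3 here).
best_epoch = min(range(len(val_wer)), key=val_wer.__getitem__) + 1

# First epoch where validation WER got worse than the previous epoch.
overfit_from = next(
    (i + 1 for i in range(1, len(val_wer)) if val_wer[i] > val_wer[i - 1]),
    None,
)

print(f"best epoch: {best_epoch}, WER starts rising at epoch: {overfit_from}")
```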

Two gotchas caught along the way:

# 1. CUDA_VISIBLE_DEVICES alone doesn't guarantee GPU index matches
# nvidia-smi's PCI-bus order -- a run silently landed on the P2000
# instead of the intended 3090 until:
export CUDA_DEVICE_ORDER=PCI_BUS_ID
export CUDA_VISIBLE_DEVICES=1

# 2. ardi's root disk (already at a tight 90% baseline) filled to 100%
# mid-training from accumulated dataset/HF caches -- silent SIGKILL,
# no traceback. Fixed by pointing the cache at a bigger volume instead
# of the system disk:
export HF_HOME=/backup/hf-cache
export CACHE_DIR=/backup/whisper-bg-tiny-data

Neither is the interesting part of this story, but both cost real debugging time — worth checking explicitly on any shared multi-GPU box.
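Both gotchas can be turned into a preflight check that runs before training starts. A minimal sketch: the function name, the warning texts, and the 20 GB threshold are hypothetical choices made here, and only `CUDA_DEVICE_ORDER` and the disk-full failure mode come from the notes above.

```python
import os
import shutil

def preflight(cache_dir: str, min_free_gb: float, env=os.environ) -> list:
    """Return a list of warnings; an empty list means both checks passed."""
    warnings = []
    # Gotcha 1: without PCI_BUS_ID ordering, CUDA indices may not match nvidia-smi.
    if env.get("CUDA_DEVICE_ORDER") != "PCI_BUS_ID":
        warnings.append("CUDA_DEVICE_ORDER is not PCI_BUS_ID")
    # Gotcha 2: a full cache volume can kill the run with no traceback.
    free_gb = shutil.disk_usage(cache_dir).free / 1e9
    if free_gb < min_free_gb:
        warnings.append(f"only {free_gb:.1f} GB free under {cache_dir}")
    return warnings

# Hypothetical threshold; point cache_dir at the volume HF_HOME lives on.
for warning in preflight(".", min_free_gb=20.0):
    print("WARNING:", warning)
```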

Attempt 5: the community shortcut that didn't reproduce

Searched Hugging Face for an existing Bulgarian ASR fine-tune before pushing further on limited training data. Found sam8000/whisper-large-v3-turbo-bulgarian-bulgaria — a fine-tune of large-v3-turbo (same 32-layer encoder as full large-v3, decoder pruned from 32 to 4 layers), claiming 9.97% WER on the FLEURS Bulgarian benchmark.

Converted to CTranslate2, it does fit the P2000 — 4.1–4.2GB used, ~900MB headroom — tight but real (the third bar in the VRAM chart above). Evaluated on the same held-out Common Voice test set used for every model above:

sam8000/whisper-large-v3-turbo-bulgarian-bulgaria: 31.2% WER
our own small-v2 (fine-tuned):                     32.7% WER

Effectively the same result on a 150-clip test set, not the dramatic win the model card implied. The 9.97% FLEURS number isn't fake; it just doesn't transfer to a different eval set with different preprocessing/normalization. Always re-measure a candidate on your own eval, apples to apples, before trusting a model card's headline number.
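For an apples-to-apples re-measurement, every candidate has to pass through the same normalization and the same WER routine. A self-contained sketch of word error rate via word-level edit distance; the normalization shown (lowercase, strip punctuation) is an assumption for illustration and not necessarily what was used for the numbers above.

```python
import re

def normalize(text: str) -> list:
    """Lowercase, strip punctuation, split on whitespace (Unicode-aware)."""
    return re.sub(r"[^\w\s]", "", text.lower()).split()

def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level edit distance divided by reference length."""
    ref, hyp = normalize(reference), normalize(hypothesis)
    if not ref:
        raise ValueError("reference must contain at least one word")
    prev = list(range(len(hyp) + 1))  # distances against an empty reference
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            # deletion, insertion, substitution/match
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1] / len(ref)

print(wer("the cat sat", "the dog sat"))
```

This scores a single utterance. A corpus-level figure would normally sum edits and reference words across all clips before dividing.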

The part that worked: dual-GPU failover

Built a real deployment: two whisper containers (P2000 = small-v2, 3090 = large-v3 unchanged) behind an nginx sidecar using proxy_next_upstream for automatic failover. One detail that shapes what "failover" means here: whisper-asr-webservice loads its model eagerly at process boot, not per-request — so this isn't a live per-call fallback, it's "is this backend up or down," decided once at startup.

Deployed live, confirmed it actually worked — routing correct. The old standalone production container was kept stopped, not deleted, for the entire session — the eventual rollback was a container start, not a rebuild.

The test that actually mattered

Real spontaneous speech — describing colors and objects out loud, not Common Voice-style read sentences. Verdict: "quite, quite, quite weak." Noticeably worse than the 32.7% benchmark WER suggested for casual listening. Rolled back to large-v3-only production immediately — ~5 seconds, because the old container was never torn down.

What we deliberately didn't do next

Didn't publish a GitHub repo for the fine-tuned checkpoints — the result isn't good enough to ship as a "solution."
Didn't chase a 6th fine-tune attempt (medium-size, more data augmentation) — diminishing returns were already visible in the epoch curve, and the deeper problem (domain mismatch between read-aloud and spontaneous speech) wouldn't be fixed by more of the same data.
Didn't keep the dual-GPU stack running "just in case" — production reverted to exactly its pre-session state, P2000 idle again.

The actual finding

Common Voice is people reading prepared text aloud in clean conditions — a different domain from spontaneous conversational speech directed at an assistant (prosody, hesitation, mic quality, vocabulary). A benchmark WER on read-aloud speech didn't predict real assistant-use quality here, for either our own fine-tune or a community model claiming a much better number on a different benchmark. This generalizes past Bulgarian and past Whisper: eval-set domain match matters more than the headline metric.

Full narrative version — the charts, the physical GPU install photo, the "why I still don't have a use for this card" ending — on Medium.

Every VRAM ceiling and WER number above was measured via HomeLab Monitor — MIT licensed, one container, the same tool that's priced every benchmark in this series.

Curious if anyone's gotten a Bulgarian (or other lower-resource-language) Whisper fine-tune to hold up on real spontaneous speech, not just a read-aloud benchmark — and what closed the gap if so.

vLLM vs llama.cpp vs Ollama: What Happens When Your Model Doesn't Fit in 24GB VRAM

Arsen Apostolov — Sun, 05 Jul 2026 05:54:01 +0000

TL;DR

Benchmarked llama.cpp, Ollama, and vLLM across 5 models (1B to 116.8B params) on one RTX 3090 (24GB) + 128GB RAM home-lab box, priced through HomeLab Monitor. Inside 24GB, vLLM's continuous batching scales aggregate throughput 3.9x-5.4x from concurrency 1 to 8 (llama.cpp only manages 1.2x-1.9x, even with -np 8 explicitly set to match). Past 24GB — two models deliberately chosen to force RAM-spill — llama.cpp and Ollama both degrade to single-digit tok/s and keep generating. vLLM OOMs outright on both, at the same ~22.1-22.2GB-used / <700MB-free ceiling, regardless of quantization scheme. Sub-plot: llama.cpp's manually-tuned layer offload beats Ollama's automatic split by 37x on time-to-first-token during RAM-spill, while landing on nearly identical steady-state decode speed.

The roster

Model	Vendor	Type	Fits in 24GB?
Gemma 3 1B	Google	dense	yes
Qwen3-Coder 30B-A3B	Alibaba	MoE (~3.3B active)	yes
Gemma 4 26B-A4B	Google	MoE (~4B active)	yes
GLM-4.5-Air 106B-A12B	Zhipu	MoE (~12B active)	no, deliberately
GPT-OSS 120B-A5.1B	OpenAI	MoE (~5.1B active)	no, deliberately

(Gemma 4 is real — Google's newest release as of this writing, not a Gemma 3 typo.)

3 prompt tiers (short/medium/long), concurrency 1 and 8, 2 reps per cell, 15 backend×model pairs total. Caveat stated up front: the first three models ran against my production Ollama (OLLAMA_NUM_PARALLEL=1, serialized by default — real daily-use config); GLM and GPT-OSS ran against a separate isolated instance (OLLAMA_NUM_PARALLEL=4) since they needed a clean volume anyway. Ollama's concurrency=8 numbers for the first three models are not its concurrency ceiling — they're its actual default production behavior.

Concurrency, inside 24GB

Aggregate decode tok/s, concurrency 1 → concurrency 8:

Model	Ollama	llama.cpp	vLLM
Gemma 3 1B	125.6 → 71.4	294.1 → 400.6	235.5 → 1172.1
Qwen3-Coder 30B-A3B	129.3 → 108.4	157.2 → 183.9	172.0 → 677.9
Gemma 4 26B-A4B	84.5 → 78.5	118.8 → 220.6	133.8 → 723.4

vLLM's own c1→c8 scaling: 3.9x-5.4x (paged attention, requests slot into idle cycles). llama.cpp's, even with -np 8 matched to the concurrency level: 1.2x-1.9x — it pre-declares a fixed KV-cache reservation per parallel slot before the server starts, so concurrency is a config decision, not a runtime one. Head-to-head at c8: vLLM beats llama.cpp by 2.9x-3.7x, beats Ollama's serialized default by 6.3x-16.4x (caveat above applies).
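The scaling ranges quoted above follow directly from the table. A short sketch that recomputes them, with the figures copied from the table and the structure chosen here for illustration:

```python
# Aggregate decode tok/s as (concurrency 1, concurrency 8), from the table above.
results = {
    "Gemma 3 1B": {"llama.cpp": (294.1, 400.6), "vLLM": (235.5, 1172.1)},
    "Qwen3-Coder 30B-A3B": {"llama.cpp": (157.2, 183.9), "vLLM": (172.0, 677.9)},
    "Gemma 4 26B-A4B": {"llama.cpp": (118.8, 220.6), "vLLM": (133.8, 723.4)},
}

def scaling_range(backend: str) -> tuple:
    """Smallest and largest c1 -> c8 throughput multiplier across models."""
    ratios = [c8 / c1 for c1, c8 in (row[backend] for row in results.values())]
    return min(ratios), max(ratios)

for backend in ("llama.cpp", "vLLM"):
    low, high = scaling_range(backend)
    print(f"{backend}: {low:.1f}x to {high:.1f}x")
```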

The cliff, and vLLM's wall

GLM-4.5-Air (~52% of layers spilled to system RAM under llama.cpp's tuning) and GPT-OSS-120B (~67% spilled) were picked specifically to not fit. llama.cpp and Ollama both ran them — slow, single-digit tok/s, but real generation, no crash. vLLM failed outright on both:

# GPT-OSS-120B, native MXFP4, --cpu-offload-gb 45
OutOfMemoryError: CUDA out of memory. Tried to allocate 1.08 GiB.
GPU 0 has a total capacity of 23.56 GiB of which 533.69 MiB is free.
Process ... has 22.21 GiB memory in use.
RuntimeError: Engine core initialization failed.

# GLM-4.5-Air, pre-quantized AWQ, --cpu-offload-gb 36
OutOfMemoryError: CUDA out of memory. Tried to allocate 1.16 GiB.
GPU 0 has a total capacity of 23.56 GiB of which 685.69 MiB is free.
Process ... has 22.12 GiB memory in use.

Same shape, different model, different quantization path. I retried GLM at --gpu-memory-utilization 0.78 (down from 0.90, to force more declared headroom) — got the byte-for-byte identical error: 22.12 GiB used, 685.69 MiB free, 1.16 GiB requested. That rules out the utilization knob as the fix; the base weight + offload footprint is already pinned at the ceiling before profiling starts. Two models, two quant schemes, same ~22GB wall — reads as a real limit of vLLM's CPU-offload path for >100B-param MoE on one 24GB card on this stack, not a per-model quirk.

TTFT: the 37x gap that steady-state doesn't show

On the two spill models, which ran on llama.cpp and Ollama but not vLLM, steady-state decode is nearly a tie once warmed up. GPT-OSS-120B's longest tier: 7.65 tok/s (llama.cpp) vs 7.6 tok/s (Ollama). GLM: 4.58 vs 4.59. Time-to-first-token is a different story:

Model	Ollama TTFT	llama.cpp TTFT	Gap
GLM-4.5-Air	13.6s	8.1s	1.7x
GPT-OSS-120B	274.0s	7.3s	37x

llama.cpp's -ngl is a number I computed myself from the model's real config.json (layer count, per-layer size) — -ngl 12 for GPT-OSS, offloading ~21GB deliberately. Ollama figures the split out automatically at load time, and on a freshly-pulled, partially-RAM-resident 65GB model, that automatic path is expensive. Same destination, very different path there.
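A rough version of that calculation, as a sketch only and not necessarily how the number above was derived. It assumes all layers are the same size, takes the ~65GB model size from the text, and uses two values that are assumptions here: 36 layers (check `num_hidden_layers` in the model's config.json) and a hypothetical 22 GB VRAM budget for weights.

```python
import math

def layers_that_fit(model_gb: float, n_layers: int, vram_budget_gb: float) -> int:
    """Estimate an -ngl value, assuming every layer has the same size."""
    per_layer_gb = model_gb / n_layers
    return min(n_layers, math.floor(vram_budget_gb / per_layer_gb))

# 65.0 GB is from the text; 36 layers and the 22 GB budget are assumptions.
ngl = layers_that_fit(model_gb=65.0, n_layers=36, vram_budget_gb=22.0)
offloaded_gb = ngl * 65.0 / 36
print(f"-ngl {ngl} (about {offloaded_gb:.1f} GB on the GPU)")
```

Under those assumptions the estimate lands on 12 layers, consistent with the -ngl 12 used above. KV cache and compute buffers also need VRAM, so a real budget has to leave room for them.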

What it costs (BGN per 1M output tokens, real GPU energy)

Model	Ollama	llama.cpp	vLLM
Gemma 3 1B	0.19	0.05	~0*
Gemma 4 26B-A4B	0.25	0.14	0.04
Qwen3-Coder 30B-A3B	0.16	0.13	0.04
GLM-4.5-Air	2.61	1.95	OOM
GPT-OSS-120B	10.00	1.43	OOM

*vLLM's Gemma 3 1B run finished in 6s — too fast for the power sampler to catch a reading, recorded near-zero. A sampling limitation on short bursts, not a genuine free result.

GPT-OSS-120B on Ollama costs ~7x more real electricity per million tokens than llama.cpp for the identical model — the TTFT convenience tax from above, showing up again in currency.

Three disclosed vLLM checkpoint swaps

The original plan was on-the-fly bitsandbytes 4-bit quant for every vLLM leg. It failed for every MoE model, for three distinct, verified reasons — not the same error copy-pasted three times:

Qwen3-Coder-30B: ValueError: BitsAndBytes quantization with padded hidden_size ... Parameter shape (786432, 1) != checkpoint shape (2048, 768) — bnb can't dequantize this MoE's padded expert layout. Fix: pre-quantized AWQ checkpoint. Ran clean after (677.9 tok/s aggregate @ c8).
Gemma 4 26B-A4B: AttributeError: MoE Model Gemma4ForConditionalGeneration does not support BitsAndBytes quantization yet. A new architecture, bnb path not wired up yet. Fix: a different pre-quantized checkpoint — which then hit a pydantic error because its config.json says compressed-tensors, not AWQ, despite the repo name. Fixed by dropping the explicit --quantization flag entirely and letting vLLM auto-detect.
GLM-4.5-Air: not a failure — a practicality call. Skipped a 212GB native bf16 download to test a bnb+MoE+CPU-offload combo the vLLM community already flagged as shaky, went straight to a ~63GB pre-quantized AWQ checkpoint that tests the exact same question.

Every root cause above came from the actual container logs, not from assuming precedent carried over from the previous model's failure.

What wasn't tested

Only two --gpu-memory-utilization values before accepting the OOM as final, not a full --cpu-offload-gb sweep. No multi-GPU / tensor-parallel vLLM path — a different question from "does single-card CPU offload work." Ollama's c8 numbers for the first three models are its production default, not its concurrency ceiling. And one raw llama.cpp per-request timing (Gemma 4, medium tier, c8) self-reported an impossible 250,024 tok/s from a near-zero-duration completion — the aggregate figures used throughout are total-tokens-over-wall-time, which isn't corrupted by that, but it's a known rough edge in the raw per-request logs.
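The difference between the two throughput definitions is easy to show with made-up numbers. A sketch with hypothetical request records (none of these values are from the benchmark logs): one near-zero duration wrecks the per-request rate but barely moves the aggregate.

```python
# Hypothetical per-request records: (completion_tokens, duration_s).
requests = [(256, 2.9), (256, 3.1), (250, 0.001), (256, 3.0)]
wall_time_s = 3.2  # hypothetical wall-clock time for the whole concurrent batch

# Per-request rate: one near-zero duration produces an absurd figure.
per_request = [tokens / duration for tokens, duration in requests]

# Aggregate rate: total tokens over wall time, immune to that artifact.
aggregate = sum(tokens for tokens, _ in requests) / wall_time_s

print(f"max per-request: {max(per_request):,.0f} tok/s")
print(f"aggregate:       {aggregate:,.1f} tok/s")
```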

Full narrative version, with the RAM-spill mechanics and the redacted dashboard screenshot: on Medium.

Every number above was priced through HomeLab Monitor — open source, MIT licensed — against the RTX 3090's real power draw.

If you're already running one of these three backends: has yours ever tried to load something that just didn't fit — and did it fail loud or fail quiet?

Local LLM vs Claude: Benchmarking qwen3-coder:30b as a Production Agent Backend

Arsen Apostolov — Fri, 03 Jul 2026 11:16:36 +0000

TL;DR

Replayed 27 real historical tasks from Jarvis (my LangGraph agent, ~90 tools) through qwen3-coder:30b on an RTX 3090, scored against Claude's actual production answers to the same tasks. Quality: Claude 89.4/100 vs qwen 22.8/100. Cost: qwen ~5,150x cheaper per task ($0.00015 vs $0.763, real GPU electricity vs real API billing). Reliability: qwen leaked malformed tool-call tags into 26% of answers and only overlapped with the tools the task actually needed 14.8% of the time. Same qwen3-coder:30b scored 100% in an earlier, much smaller benchmark — the gap here is about tool-surface complexity, not the model being bad.

The question

Jarvis is a real personal AI agent — LangGraph create_react_agent, ~90 tools spanning email/calendar/notes/files/messaging/code, running on Claude in production. qwen3-coder:30b had already scored 100% task success in a controlled 17-task benchmark on the same RTX 3090. Obvious next question: drop it into the real agent and see what happens.

The setup

28 real task prompts pulled from Jarvis's own Langfuse traces (90-day window), stratified 4×7 across calendar / code / email / files / general / messaging / notes.
Claude's answers are real production history, not re-run. Re-running through the sandbox would hand it fake stub data it never saw — that's a worse baseline, not a fairer one.
qwen runs fresh, through a sandboxed replay harness: the real Jarvis agent code in-process, every write-capable tool intercepted (nothing sent/written for real), and every mocked read-only tool serves the real recorded output from that task's original trace when available — not a generic stub. Same data, both models.
1/28 tasks excluded (336,906-char prompt, over any 16K–24K context window) → 27 scored.
Judge: LLM-as-judge (claude-opus-4-8), scored independently per answer (not pairwise) to avoid position bias, 1–5 → 0–100.
Every qwen run priced as a HomeLab Monitor experiment against real 3090 power draw. Claude's cost is Langfuse's recorded API billing.

Caveat, stated plainly: the judge is a Claude model scoring Claude's own answers alongside qwen's — self-preference bias is a documented effect in LLM-as-judge setups and probably inflates the gap somewhat. It doesn't explain a 66-point gap, a 26% malformed-output rate, or two tool-call loops, but it's a real methodology limitation, not a footnote.

Getting here took three re-runs: a judge-response parsing bug that silently neutral-scored ~40/54 calls, a mock-data bug that starved qwen of real inbox/calendar content on 16/28 tasks while Claude's baseline had the real thing, and a Claude-API rate limit that neutral-scored another batch mid-scoring. All three caught by checking score distributions, not by trusting a clean exit code — worth knowing before trusting the numbers below.
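A distribution check of that kind can be very small. A sketch, assuming raw judge scores on the 1 to 5 scale with 3 as the neutral fallback; the 0.5 threshold and the sample data are hypothetical, chosen to mirror the "~40/54 calls" case described above.

```python
from collections import Counter

def dominant_score(scores: list, threshold: float = 0.5):
    """Return (score, share) if one value exceeds `threshold` of all scores."""
    value, count = Counter(scores).most_common(1)[0]
    share = count / len(scores)
    return (value, share) if share > threshold else None

# Hypothetical run: 40 of 54 judge calls silently fell back to a neutral 3.
suspicious = [3] * 40 + [1, 2, 4, 5, 5, 4, 2, 1, 5, 4, 1, 2, 4, 5]
healthy = [1, 2, 3, 4, 5] * 4

print(dominant_score(suspicious))
print(dominant_score(healthy))
```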

The numbers

	Claude	qwen3-coder:30b
Avg quality (0–100)	89.4	22.8
Cost / task	$0.763 (real API billing)	$0.00015 (real GPU electricity)
Total cost, 27 tasks	$20.60	$0.004
Total energy	—	0.0396 kWh

~5,150x cheaper per task for qwen (precise, currency-converted from a 0.0072 BGN total across all 27 tasks, at 1 BGN = $0.5547 — an earlier rough estimate of 180x on this project was wrong, this is the corrected number).
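The conversion is short enough to verify by hand. A sketch using only the figures quoted above:

```python
total_bgn = 0.0072           # qwen energy cost, all 27 tasks
usd_per_bgn = 0.5547         # conversion rate quoted above
tasks = 27
claude_usd_per_task = 0.763  # recorded API billing

qwen_usd_per_task = total_bgn * usd_per_bgn / tasks
ratio = claude_usd_per_task / qwen_usd_per_task

print(f"qwen: ${qwen_usd_per_task:.5f} per task, Claude costs {ratio:,.0f}x more")
```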

By category (Claude | qwen | n):

calendar:   90 | 30 | 4
code:       87 | 25 | 3
email:      92 | 15 | 4
files:      88 | 15 | 4
general:    85 | 30 | 4
messaging:  87 | 22 | 4
notes:      97 | 22 | 4

qwen's best relative showing (calendar, general) is still a third of Claude's score. It never wins a category.

Where it breaks

Malformed tool-call leak — instead of a real LangGraph tool call, qwen sometimes emits the call as raw text in its final answer:

<function=send_email>
{"to": "...", "subject": "...", "body": "..."}
</function>

That happened on 7/27 tasks (26%). The user reading that answer sees broken syntax where a real action should have been confirmed or a real answer given.
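A leak like this is easy to detect mechanically before the answer reaches a user. A sketch, assuming the leaked calls always follow the tag shape shown above; the function name is hypothetical and this is not necessarily how the 7/27 count was produced.

```python
import re

# Matches <function=NAME> ... </function>, including multi-line bodies.
LEAK_RE = re.compile(r"<function=(\w+)>.*?</function>", re.DOTALL)

def leaked_tool_calls(final_answer: str) -> list:
    """Return the names of tool calls that leaked into the answer as raw text."""
    return LEAK_RE.findall(final_answer)

answer = """Sure, sending that now.
<function=send_email>
{"to": "...", "subject": "...", "body": "..."}
</function>"""

print(leaked_tool_calls(answer))
```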

Tool-overlap recall: 14.8% average, measured over the 18/27 tasks where the original historical trace actually used at least one tool (9 tasks needed none). Most of the time qwen reached for different tools than the ones that actually solved the task — or none.
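One plausible way to compute such a metric, as a sketch: per-task recall is the share of the original trace's tools that the replay also called, averaged over tasks whose original trace used at least one tool. That definition is an assumption about the metric, and the example tasks below are hypothetical.

```python
def tool_recall(expected: set, called: set) -> float:
    """Share of the originally used tools that the replay also called."""
    return len(expected & called) / len(expected)

def mean_recall(tasks: list) -> float:
    """Average recall over tasks whose original trace used at least one tool."""
    scored = [tool_recall(expected, called) for expected, called in tasks if expected]
    return sum(scored) / len(scored)

# Hypothetical tasks: (tools in the original trace, tools the replay called).
tasks = [
    ({"read_inbox", "send_email"}, {"read_inbox"}),  # recall 0.5
    ({"list_events"}, {"run_command"}),              # recall 0.0
    (set(), {"todo_write"}),                         # excluded: no tools needed
]

print(mean_recall(tasks))
```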

Repetitive-loop failure on 2/27 tasks: pilot-17 (email, 24 tool calls, 138.6s, ~196.7K input tokens) and pilot-27 (messaging, 27 tool calls, 148.9s, ~196.7K input tokens) both called the same already-answered tool (run_command, todo_write) repeatedly instead of stopping. Confirmed via raw logs both tasks got real replayed data (replayed_real_data: true) — a genuine stopping-condition failure, not a data-starvation artifact.

One more data point worth having, not a verdict: on a task where both models actually called send_email(...) in the harness (intercepted, nothing sent), Claude told the user the email had been sent — a fabrication. qwen correctly disclosed the send didn't go through. Not "qwen is more honest" — it's also the model leaking raw tags 26% of the time. Both mishandled the mock, just differently.

Scope of the claim

Same qwen3-coder:30b, same GPU, scored 100% on a 17-task controlled benchmark with a much smaller tool surface. This isn't "local LLMs are bad" — it's that a model excellent on a scoped benchmark isn't automatically a safe drop-in for a large, real, ~90-tool production surface with a 31KB context prompt and real messy history behind it. Task/tool-surface complexity mattered as much as raw model quality here. Claude isn't flawless either — see the fabricated send-email confirmation above.

Jarvis stays on Claude for now. The cost number is real enough to be worth a narrower follow-up — testing qwen on just the categories where it scored closest (calendar, general) as a cheap fallback path, rather than a full swap.

Full narrative version, charts, and the three-bug scoring saga: on Medium.

Every qwen run here was priced through HomeLab Monitor against the 3090's real power draw — MIT licensed, one container, reproducible if you want to price your own local-model experiments the same way.

Curious where the line is for you: how cheap does a local model have to be before you'd trust it with a slice of a real agent, and which slice would you pick first?

How to Run Reliable Local LLM Agents on an RTX 3090: A Benchmark (5 Models, Priced in Watts)

Arsen Apostolov — Sun, 28 Jun 2026 06:54:12 +0000

I gave GLM-4.5-Air (106B, open weights) 12 coding tasks through opencode on my RTX 3090. It scored 0% — never edited a single file.

Same model, same GPU, same tasks, but driven by a ~150-line LangGraph agent instead: 93%.

The model was never the problem. The orchestrator was. Here's the benchmark — including the part nobody else measures, the electricity cost per correct task.

Setup

RTX 3090 (24 GB) + 128 GB RAM, models via ollama, Q4 quants, temp 0.2
5 recent open models × 2 orchestrators (opencode vs custom LangGraph ReAct with ollama-native tool-calling)
17 graded tasks (12 coding in Python/JS/C++ + 5 general-agent) with hidden unit tests
Every run priced in GPU watts via my open-source homelab-monitor

Results

Model	tok/s	opencode adh.	LangGraph adh.	LangGraph coding	LangGraph general
Qwen3-Coder 30B-A3B	130	92%	100%	100%	100%
GLM-4.5-Air 106B	5.7	0%	100%	89%	100%
Devstral Small 24B	49	8%	53%	8%	40%
Seed-OSS 36B	9.5	0%	7%	0%	20%
DeepSeek-R1-Distill 32B	6.7	0%	0%	0%	0%

Tool-adherence = % of tasks where the model actually called a tool instead of just printing code in chat. It was the master variable. (GLM's headline "93%" is its blended score across all 17 tasks: 89% coding + 100% general.)

Three takeaways

The framework can matter more than the model. opencode sends a frontier-shaped system prompt + 12 tools over its OpenAI-compat path; most local models fall back to chatting. Native tool-calling through a lean agent fixes that — GLM went 0% → 93%. (Qwen3-Coder is the exception: it's tuned for agentic tool use and aces opencode out of the box.)
Acting ≠ solving. LangGraph made Devstral act (8% → 53% adherence) but not solve (coding stayed 8%). The framework decides whether a model acts; the model decides whether it's right.
The wattmeter ranks honestly. Qwen solved tasks at ~0.0005 BGN each; the models that scored zero still burned 10–30× more energy for nothing. On a home rig, the cheapest model is the fast, correct one — and MoE (Qwen activates ~3B of 30B per token) wins twice.

Bonus: 128 GB RAM let me run the 106B GLM (23 GB VRAM + 27 GB spilled to RAM) — it works, at 5.7 tok/s. Great for fire-and-forget batch jobs, not interactive coding.

The recipe for reliable local agents

Pick a tool-use-tuned model (Qwen3-Coder 30B-A3B is the all-weather winner) → use native tool-calling, not an OpenAI-compat path → keep the harness lean → use RAM for reach, not speed → measure correctness per kWh.

📖 Full write-up with methodology, charts, and the deeper "why" → [https://medium.com/@arsen.apostolov/local-llm-agents-on-an-rtx-3090-i-benchmarked-5-models-2-frameworks-and-the-orchestrator-f5fd600ca221]

⭐ Every number was priced in watts by homelab-monitor — my open-source tool that turns your GPU's power draw into per-task cost. Star it if you want the same receipts for your own rig. Harness + tasks + leaderboard code are reproducible.

How to Rank Local LLMs by Cost per Correct Answer (Measured GPU Energy, 8 Ollama Models)

Arsen Apostolov — Tue, 23 Jun 2026 18:11:23 +0000

TL;DR: I priced 8 local Ollama models by € per 1,000 correct answers — metered GPU energy ÷ correct answers, on one RTX 3090. gemma4:26b won at 96.9% accuracy for €0.013/1k-correct. The most expensive model (qwen3:8b-fp16) cost €0.239/1k and scored worse (66.7%). Reasoning tokens and full precision both cost a lot and bought nothing here. Every cost comes from real metered kWh via the open-source HomeLab Monitor.

This is the short, copy-pasteable version. The narrative writeup is on Medium.

The metric

€ per correct answer = (metered GPU energy cost over the eval window) ÷ (number of correct answers)

Tokens-per-euro flatters whichever model talks the most. Cost-per-correct only rewards being right cheaply — which is the thing you actually pay for.

The signal

Model                  VRAM     Acc     Tok/task  Tok/s  Wh/pass  €/1k correct (day)
gemma4:26b             16.9 GB  96.9%   68        86     4.5      €0.013   ← winner
gemma3:1b              0.9 GB   82.1%   125       133    3.8      €0.013
gemma3:27b             17.1 GB  100.0%  119       36     16.3     €0.046
qwen3:30b-a3b   (MoE)  18.4 GB  83.3%   555       186    14.1     €0.048
qwen3:8b (Q4_K_M) 🧠   5.4 GB   64.8%   626       126    22.7     €0.100
qwen3:8b          🧠   5.4 GB   64.8%   626       126    23.6     €0.104
qwen3:8b (Q8_0)   🧠   8.7 GB   61.1%   672       88     33.5     €0.156
qwen3:8b (fp16)   🧠   15.5 GB  66.7%   664       53     56.2     €0.239   ← most expensive

🧠 = reasoning/thinking mode on. Night tariff knocks ~40% off every row.
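The last column can be rebuilt from the others. A sketch, assuming Wh/pass is the metered energy for one full pass over the 54 tasks and using the day tariff of €0.1534/kWh quoted in the currency section; under those assumptions the arithmetic reproduces the table's figures.

```python
def eur_per_1k_correct(wh_per_pass: float, accuracy: float,
                       tasks: int = 54, eur_per_kwh: float = 0.1534) -> float:
    """Energy cost of one pass divided by correct answers, scaled to 1,000."""
    cost_per_pass = wh_per_pass / 1000 * eur_per_kwh
    correct_answers = accuracy * tasks
    return cost_per_pass / correct_answers * 1000

# (Wh/pass, accuracy) copied from the table above.
rows = {
    "gemma4:26b": (4.5, 0.969),
    "gemma3:27b": (16.3, 1.000),
    "qwen3:8b (fp16)": (56.2, 0.667),
}
for model, (wh, acc) in rows.items():
    print(f"{model:<16} €{eur_per_1k_correct(wh, acc):.3f} per 1k correct")
```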

Three things the numbers say

1. The value champion is mid-size, not max-size. gemma4:26b hit 96.9% for €0.013 per 1,000 correct — cheapest-per-correct on the whole bench and near-perfect, ~18× cheaper per correct answer than qwen3:8b-fp16. gemma3:27b is the only 100% model but costs ~3.5× more (slower, 36 tok/s).

2. The thinking tax is real and didn't pay off. qwen3 reasoning models emit 555–672 tokens/task vs the gemmas' 68–125 (5–9×). Tokens are energy. On these 54 deterministic tasks that extra reasoning bought no correctness — the priciest model scored lower than one 18× cheaper. (Caveat: this suite is arithmetic / executable code / format-following. On open-ended hard problems, reasoning earns its tokens. On structured agent work, it was dead weight.)

3. The quantization paradox. Same qwen3:8b at three precisions:

            Accuracy   Energy/pass   Throughput
Q4_K_M      64.8%      22.7 Wh       126 tok/s
Q8_0        61.1%      33.5 Wh       88  tok/s
fp16        66.7%      56.2 Wh       53  tok/s
            └ flat ┘   └ 2.5× ┘      └ halved ┘

Higher precision cost 2.5× the energy and half the throughput for accuracy that's flat-and-noisy. On a 3090, aggressive quant was the correct call, not a compromise.

Methodology (so you can trust the ranking)

54 deterministic tasks, mechanically graded — no LLM judge. Reasoning 15 (GSM8K-style numeric extraction), code 12 (HumanEval-style, executed asserts in a sandbox), factual 12 (keyword), instruct 15 (format predicates). Grader selftest 11/11.
Controls identical across all 8 models: temperature 0, seed 42, num_ctx 4096, num_predict 1024, identical prompts.
Warm-up discarded → model-load energy excluded (pricing inference, not cold starts).
3 passes each, ranges reported.
Idle baseline = 38 W, measured as a control.
qwen3 thinking left on (realistic); thinking tokens counted for energy, stripped before grading.
Honest determinism caveat: Ollama is not bit-exact at temp 0. gemma3:1b drifted 81–83% across passes; gemma3:27b was 100% on all three; qwen3 runs were identical. Report ranges, not point claims.
CPU/DRAM not metered (no RAPL on this host), so true wall-plug cost is a bit higher — but the ranking holds because every model paid the same un-metered overhead.
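
A minimal sketch of what mechanical grading can look like for two of the four task types. It assumes the model wraps its reasoning in <think>...</think> tags and that a numeric task is graded on the last number in the visible answer. The function names, the regexes and the last-number rule are illustrative assumptions, not the bench's actual grader.

```python
import re

# Assumption: reasoning is wrapped in <think>...</think> and must not be graded.
THINK_RE = re.compile(r"<think>.*?</think>", re.DOTALL)
NUMBER_RE = re.compile(r"-?\d+(?:\.\d+)?")


def strip_thinking(text: str) -> str:
    """Drop thinking blocks so only the visible answer is graded."""
    return THINK_RE.sub("", text).strip()


def grade_numeric(output: str, expected: float, tol: float = 1e-6) -> bool:
    """Pass if the last number in the visible answer matches the expected value."""
    visible = strip_thinking(output).replace(",", "")  # "1,250" -> "1250"
    numbers = NUMBER_RE.findall(visible)
    return bool(numbers) and abs(float(numbers[-1]) - expected) <= tol


def grade_keyword(output: str, keywords: list[str]) -> bool:
    """Pass if any accepted keyword appears in the visible answer (case-insensitive)."""
    visible = strip_thinking(output).lower()
    return any(k.lower() in visible for k in keywords)
```

Stripping first matters: a number or keyword that only appears inside the thinking block should not count as an answer.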

The currency gotcha (measure twice)

Costs are EUR from measured kWh × Bulgarian dual tariff (€0.1534 day / €0.0920 night). While building this I caught my own dashboard mislabeling BGN as EUR: the tariff read 0.30/0.18 EUR, but those are leva. Bulgaria joined the euro on 2026-01-01 at fixed 1 EUR = 1.95583 BGN; €0.30/kWh would be German-tier, implausible for the EU's cheapest household power. Converted: 0.30 / 1.95583 = €0.1534, 0.18 / 1.95583 = €0.0920. Lesson: don't trust the dashboard's € field — compute from physical kWh and your verified tariff.
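
The conversion is small enough to check by hand, but here it is as a sketch. The only inputs are the two mislabeled dashboard values and the fixed rate quoted above; the helper name is made up.

```python
BGN_PER_EUR = 1.95583  # fixed conversion rate quoted above


def bgn_to_eur(amount_bgn: float) -> float:
    """Convert leva to euro at the fixed rate."""
    return amount_bgn / BGN_PER_EUR


day_eur = bgn_to_eur(0.30)    # dashboard showed 0.30, labelled EUR but really BGN
night_eur = bgn_to_eur(0.18)
night_discount = 1 - night_eur / day_eur

print(f"day   €{day_eur:.4f}/kWh")              # day   €0.1534/kWh
print(f"night €{night_eur:.4f}/kWh")            # night €0.0920/kWh
print(f"night discount {night_discount:.0%}")   # night discount 40%
```

The 40% is where the "night tariff knocks ~40% off every row" note comes from.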

How to reproduce the energy tracking

Every cost above came from HomeLab Monitor (MIT, one container) — its Experiments tab integrates real GPU power over a run's window into kWh and money. Bring it up:

docker compose up -d        # port 9800

Grab the one-file homelab_run.py client, mint an ingest key, and wrap your eval — the run comes back priced:

import homelab_run as homelab

MODEL = "gemma4:26b"
PASSES = 3                              # this bench ran 3 passes per model

homelab.configure(url="http://<your-host>:9800", key="hlm_…")

with homelab.run(MODEL, tags=["llm-cost-bench"]) as r:
    for _ in range(PASSES):
        run_graded_eval(MODEL)          # your own eval function; keep all inference inside the run
priced = homelab.pull(r.id)             # energy_kwh, cost, avg_w, peak_util (from real power)

That's the whole instrumentation. Divide the priced energy by your grader's correct count and you've got cost-per-correct for your own roster. Docs · docker pull sikamikaniko123/homelab-monitor.
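
As a sketch of that last division: this assumes the energy-per-pass figure covers one pass over all 54 tasks and is priced at the day tariff. The function name and defaults are mine. Under those assumptions it reproduces the €0.239 and €0.156 figures for the two qwen3:8b rows in the table above.

```python
DAY_TARIFF_EUR_PER_KWH = 0.1534  # day tariff from this post
TASKS = 54                        # tasks per pass


def cost_per_1000_correct(wh_per_pass: float, accuracy: float,
                          tariff: float = DAY_TARIFF_EUR_PER_KWH,
                          tasks: int = TASKS) -> float:
    """Energy cost of one pass, spread over its correct answers, scaled to 1,000."""
    correct = round(accuracy * tasks)             # e.g. 66.7% of 54 -> 36
    if correct == 0:
        return float("inf")                       # nothing right: no finite cost per correct
    cost_per_pass = wh_per_pass / 1000 * tariff   # Wh -> kWh -> EUR
    return cost_per_pass / correct * 1000


print(round(cost_per_1000_correct(56.2, 0.667), 3))  # qwen3:8b fp16 -> 0.239
print(round(cost_per_1000_correct(33.5, 0.611), 3))  # qwen3:8b Q8_0 -> 0.156
```

Swap in the night tariff (0.0920) to see the roughly 40% cheaper figure for the same run.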

What I deliberately did NOT do

No LLM-as-judge — mechanical grading only.
No cold-start energy in the numbers — warm-up discarded on purpose.
No trusting the dashboard's € field — costs recomputed from measured kWh.
No single-run claims — 3 passes, ranges where they exist.
No CPU/DRAM cost claim — only the GPU is metered, and I say so.

Over to you

Bigger and full-precision lost. A 26B model did near-perfect work for a rounding error; an fp16 reasoning model charged 18× as much to be wrong more often.

So when you reach for a local model — accuracy, speed, or cost per answer it actually gets right? And have you ever measured the third one? Drop your own cost-per-correct numbers in the comments.

How Much Does It Actually Cost to Run a Local LLM? (€ per Million Tokens, Measured)

Arsen Apostolov — Mon, 22 Jun 2026 18:33:31 +0000

"It runs on my own GPU, so it's basically free." I believed that until I put a meter on it. So I ran a controlled benchmark on one box — an openSUSE machine with a single RTX 3090 — driving three local models through ollama under an identical fixed workload (256-token generations in a loop for ~4 minutes each), while my open-source dashboard priced every run by the real GPU energy it burned: power sampled from nvidia-smi every 10 s, integrated over each run's exact window, multiplied by my actual day/night tariff. One number per model, in euros per million output tokens.

Here's the part that made me re-run it. The tiny gemma3:1b came out at €0.118 / 1M tokens — about 5× cheaper than a hosted Flash-class API (~€0.55). But gemma3:27b's electricity alone was €0.706 / 1M — more expensive per token than just paying the cloud, and that's before a single cent of the GPU's purchase price. "Local" didn't make it cheaper; it made it cost more and I own the depreciation. The mechanism is one line: each token costs watts ÷ throughput, and a big dense model is both slow and thirsty. A newer mid-size architecture (gemma4:26b) bought a lot of that back, landing at €0.272.
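
A rough sketch of the pricing arithmetic described above, not the dashboard's actual code: integrate sampled power over the run window into kWh, then divide the priced energy by the tokens produced. The function names are mine, the sample values and the 0.15 €/kWh tariff are made up, and no idle baseline is subtracted.

```python
def integrate_kwh(samples: list[tuple[float, float]]) -> float:
    """Trapezoidal integral of (unix_seconds, watts) samples, returned in kWh."""
    joules = 0.0
    for (t0, w0), (t1, w1) in zip(samples, samples[1:]):
        joules += (w0 + w1) / 2 * (t1 - t0)
    return joules / 3_600_000  # 1 kWh = 3.6e6 J


def eur_per_million_tokens(kwh: float, tokens: int, tariff_eur_per_kwh: float) -> float:
    """Priced energy of a run divided by its output tokens, scaled to 1M tokens."""
    return kwh * tariff_eur_per_kwh / tokens * 1_000_000


def eur_per_million_from_rate(avg_watts: float, tokens_per_s: float,
                              tariff_eur_per_kwh: float) -> float:
    """Same quantity from the steady-state view: watts / throughput = joules per token."""
    joules_per_token = avg_watts / tokens_per_s
    return joules_per_token * 1_000_000 / 3_600_000 * tariff_eur_per_kwh


# Made-up run: a constant 300 W sampled every 10 s for 240 s, at 40 tok/s.
samples = [(float(t), 300.0) for t in range(0, 241, 10)]
kwh = integrate_kwh(samples)
print(round(eur_per_million_tokens(kwh, 40 * 240, 0.15), 4))   # 0.3125
print(round(eur_per_million_from_rate(300.0, 40.0, 0.15), 4))  # 0.3125
```

Both routes agree on a steady run, which is the "watts ÷ throughput" point: halve the throughput at the same draw and the price per token doubles.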

The full guide is methodology-first and reproducible end to end — minting an ingest key, the stdlib-only client, the exact ollama loop that reads eval_count/eval_duration for real tokens-per-second, reading each run back priced, and the honest caveats (this is marginal GPU energy only — not capex, idle, or cooling — and the absolute numbers round to fractions of a cent; the shape is the finding).
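
For reference, a small sketch of the tokens-per-second calculation from those two fields. Ollama reports eval_duration in nanoseconds; the helper name and the sample values are made up.

```python
def tokens_per_second(response: dict) -> float:
    """Decode throughput from an Ollama generate/chat response.

    eval_count is the number of generated tokens; eval_duration is the time
    spent generating them, in nanoseconds.
    """
    seconds = response["eval_duration"] / 1e9
    return response["eval_count"] / seconds


# Made-up example values, not measurements from this benchmark.
sample = {"eval_count": 256, "eval_duration": 2_000_000_000}
print(tokens_per_second(sample))  # 128.0
```

Using the server-reported duration rather than wall-clock time keeps prompt processing and HTTP overhead out of the decode rate.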

Read the full guide on Medium → https://medium.com/@arsen.apostolov/how-much-does-it-actually-cost-to-run-a-local-llm-per-million-tokens-measured-4a90a7f31a48

How to Fix Watchtower Not Updating Containers on Docker 29

Arsen Apostolov — Mon, 22 Jun 2026 13:42:12 +0000

You push a newer image to your registry, Watchtower wakes up on schedule, scans, reports a clean run — and your container keeps serving last week's image. No error, no restart loop, nothing red to chase. It just silently stops recreating. If that started happening around the time you landed on Docker 29, you've found the cause.

The classic containrrr/watchtower image — the one in basically every tutorial — is effectively unmaintained, with its last release back in 2023. It's a Docker API client, and when the Engine's API moved forward, the part that lists and compares images kept working while the part that actually recreates containers quietly fell off. So you get the worst failure mode there is: a tool that reports success while doing nothing.

The fix is a one-line image swap to the maintained community fork, nickfedor/watchtower:latest — a drop-in replacement with the same labels and env vars that tracks the current Docker Engine. On my Docker 29.2.1 box it recreates cleanly again (scanned=10 updated=1 failed=0), and I let homelab-monitor surface every recreate by its reset uptime — so I can see which container the auto-updater just touched, and get a push alert if a recreate ever comes up unhealthy.

The full guide has the exact compose, the bring-up commands, the real recreate log lines, and how to opt containers in safely with the label.

Read the full guide on Medium → https://medium.com/@arsen.apostolov/how-to-fix-watchtower-not-updating-containers-on-docker-29-a891217c6db2

How to Fix Docker Networking After a firewalld Reload

Arsen Apostolov — Sun, 21 Jun 2026 06:25:38 +0000

You edit a firewalld zone, run the one command you always run — sudo firewall-cmd --reload — and it returns success. Then, hours later, a backup didn't run. Your containers are still Up in docker ps. The host has internet. The containers have none.

Here's the seam: firewalld and Docker both write the same netfilter tables, and neither knows about the other. firewall-cmd --reload flushes the whole ruleset and re-applies only firewalld's config — wiping the DOCKER / DOCKER-USER chains and the NAT masquerade that dockerd installed at startup. Docker doesn't get told its rules vanished, so it never re-adds them. Result: running containers lose outbound internet while still reporting healthy. The 10-second manual fix is sudo systemctl restart docker (dockerd re-installs its chains on start). The permanent fix is a small systemd unit that restarts Docker automatically whenever firewalld reloads — so it self-heals before you notice.
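
One way to make the wiped-rules state detectable is to look for Docker's chains in the ruleset. This is only a sketch, not the systemd unit from the full guide: it assumes Docker is managing rules through iptables and that `iptables -S` output is available (root required), and the function names are mine.

```python
import subprocess

EXPECTED_CHAINS = ("DOCKER", "DOCKER-USER")


def missing_docker_chains(rules_text: str) -> list[str]:
    """Return expected chains that have no '-N <chain>' line in `iptables -S` output."""
    declared = {
        line.split()[1]
        for line in rules_text.splitlines()
        if line.startswith("-N ") and len(line.split()) >= 2
    }
    return [c for c in EXPECTED_CHAINS if c not in declared]


def current_filter_rules() -> str:
    """Dump the filter table; needs root. Not called at import time."""
    return subprocess.run(["iptables", "-S"], capture_output=True,
                          text=True, check=True).stdout
```

A non-empty result from `missing_docker_chains(current_filter_rules())` while containers are still Up is the "alive but isolated" state described above.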

The full guide has the exact docker-firewalld-watch.service unit file, the enable --now commands, and a copy-paste test that breaks egress on purpose and proves it heals. It also covers how I make this failure visible across a whole fleet — because "alive but isolated" is invisible by design — using homelab-monitor: one Docker container, polls every host over SSH (no agents), shows fleet-wide container/service health, and pushes edge-triggered alerts to Discord, ntfy.sh and Telegram the moment a container flips red.

Read the full guide on Medium → https://medium.com/@arsen.apostolov/528889d3eca1

How to Get Disk-Full Alerts Across Linux and Windows

Arsen Apostolov — Sun, 21 Jun 2026 04:03:06 +0000

My fleet doesn't agree on anything: an openSUSE hub, an Ubuntu box, a Windows 11 workstation, a Windows 10 VM. Different shells, different disk-checking habits — which is how that Windows 10 VM ended up at C: 99.2% full, 39.1 of 39.4 GB, about 0.3 GB from the wall, with me none the wiser. It wasn't alone: the Windows 11 box's G: was at 94.3%, the Ubuntu box at 83.1%, the hub at 76%.

df only fires on the box you're logged into, on the mornings you remember — and Windows doesn't speak it at all. What I actually wanted was boring: one table with every mount on every host, and a ping the moment one crosses a line.

I get both from a single container — HomeLab Monitor (open source, MIT). It polls every host over SSH (Linux and Windows, no agents), shows every disk worst-first, and pushes edge-triggered alerts to Discord, ntfy.sh or Telegram with a disk-usage threshold you set in the UI — no env vars, no config files. So a 99.2% disk taps you on the shoulder instead of quietly taking down a VM.
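
To illustrate what "edge-triggered" and "worst-first" mean here, a minimal sketch, not HomeLab Monitor's implementation. The usage figures are the ones from this post; the host labels, the 90% threshold and the function names are assumptions.

```python
from typing import Optional


def should_alert(previous_pct: Optional[float], current_pct: float,
                 threshold_pct: float) -> bool:
    """Edge-triggered: fire only when usage crosses the threshold, not on every poll above it."""
    was_over = previous_pct is not None and previous_pct >= threshold_pct
    return current_pct >= threshold_pct and not was_over


def worst_first(disks: dict[str, float]) -> list[tuple[str, float]]:
    """Sort mounts by usage, fullest first."""
    return sorted(disks.items(), key=lambda kv: kv[1], reverse=True)


# Usage figures from this post; host labels and the 90% threshold are assumptions.
fleet = {"hub": 76.0, "win10-vm C:": 99.2, "ubuntu": 83.1, "win11 G:": 94.3}
print(worst_first(fleet)[0])            # ('win10-vm C:', 99.2)
print(should_alert(88.0, 99.2, 90.0))   # True: just crossed
print(should_alert(99.0, 99.2, 90.0))   # False: already over on the last poll
```

The point of the edge trigger is one ping per crossing instead of one ping per poll for as long as the disk stays full.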

The full guide is a four-step walkthrough, with screenshots: bring up the container, add your Linux and Windows hosts over SSH (three clicks each), see every disk in one place worst-first, and set the disk threshold + alert channel — then fire a test ping to prove it works.

Read the full guide on Medium → https://medium.com/@arsen.apostolov/how-to-get-disk-full-alerts-across-linux-and-windows-262fb69fa2e7

I want to let an AI agent roam my homelab — looking for someone to build the MCP server

Arsen Apostolov — Sun, 07 Jun 2026 08:46:32 +0000

I maintain a small open-source tool called HomeLab Monitor — one dashboard for every box in my homelab: host vitals, containers, systemd services, GPU, and which AI model servers are loaded right now.

It's good at being a pair of human eyes. The next thing I want is to make it a source of context for an AI agent.

So the idea: give it an MCP server. Model Context Protocol is the thing that lets an agent like Claude call tools and read resources. If the monitor speaks MCP, an agent can connect and explore the whole fleet — "which container is leaking RAM?", "the GPU's been pinned for an hour, who's driving it?", "this host wants a reboot and an OS upgrade, what order is safe?" — and start helping with the maintenance instead of me squinting at graphs.

The fun part for whoever builds it: it's mostly a thin wrapper over a REST API that already exists. The monitor already serves clean, read-only JSON (/api/data, /api/fleet, /api/host_data/<name>, /metrics). MCP just adds the semantics — tools and resources with names an LLM can reason about instead of a raw blob. Read-only to start; any future write tool stays opt-in.
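
To give a feel for how thin the wrapper could be, here is a stdlib-only sketch of the tool-to-endpoint mapping an MCP server might sit on top of. The endpoint paths are the ones listed above; the tool names, the function names and the idea of one tool per endpoint are hypothetical, and the MCP layer itself is left out. /metrics is omitted for brevity.

```python
import json
from typing import Any
from urllib.parse import quote
from urllib.request import urlopen

# Endpoint paths are the ones listed above; the tool names are hypothetical.
TOOL_ENDPOINTS = {
    "get_overview": "/api/data",
    "get_fleet": "/api/fleet",
    "get_host": "/api/host_data/{name}",
}


def tool_url(base_url: str, tool: str, **params: str) -> str:
    """Build the read-only URL a tool call would hit."""
    path = TOOL_ENDPOINTS[tool].format(
        **{k: quote(v, safe="") for k, v in params.items()}
    )
    return base_url.rstrip("/") + path


def call_tool(base_url: str, tool: str, **params: str) -> Any:
    """Fetch the JSON for a tool call (performs a GET; nothing is written)."""
    with urlopen(tool_url(base_url, tool, **params), timeout=10) as resp:
        return json.load(resp)
```

An MCP server would register each entry as a named tool with a description the model can reason about, and return the JSON (or a trimmed view of it) as the tool result.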

It's genuinely weekend-sized if you've wrapped an MCP server around an API before — and a great first one if you haven't and want to learn.

Repo: https://github.com/SikamikanikoBG/homelab-monitor
The idea + a suggested first PR: https://github.com/SikamikanikoBG/homelab-monitor/issues/70

If wiring this sounds fun, come say hi on the issue — I'll help scope the first commit.

The homelab box you forgot you own is probably 47 updates behind — here’s the safe fix

Arsen Apostolov — Sun, 07 Jun 2026 06:01:16 +0000

TL;DR: My homelab monitor flagged my Plex/Pi-hole box 47 packages and a kernel behind — and I'd forgotten the machine existed. Here's the 5-minute non-interactive fix, and the one upgrade I deliberately didn't run.

This is the dev.to short version of the Medium write-up. Same dashboard that caught a service hoarding 16GB of VRAM last week — different, more boring villain.

The signal

The overview wore one small badge: ⚠ 1 host behind. Not my GPU box that I touch daily — cloudy, the Plex / Pi-hole / Samba box that just works and therefore never gets looked at.

The monitor also flags a release upgrade as available — I'm deferring that one regardless of which version it lands on (more below).

UPDATES column: 47 pending · ⬆ 26.04 available.

The diagnosis

$ ssh anakin@cloudy
$ lsb_release -ds && uname -r
Ubuntu 22.04.5 LTS
5.15.0-179-generic          # running — but 5.15.0-181 was already installed, waiting on a reboot

$ apt list --upgradable 2>/dev/null | grep -c upgradable
47
$ cat /var/run/reboot-required
*** System restart required ***

Nothing was broken — Plex streamed, Pi-hole resolved, shares mounted. That's the trap: a box that's 47 behind doesn't tell you. Among the 47: systemd, snapd, apparmor, nftables, cloud-init, linux-firmware, openldap. Plenty of it security-relevant.
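
If you want the same two checks without a dashboard, a cron-able sketch could look like this. It assumes an apt-based host and the usual `[upgradable from: ...]` suffix in `apt list --upgradable` output; the function names are mine.

```python
import os
import subprocess


def count_upgradable(apt_output: str) -> int:
    """Count package lines in `apt list --upgradable` output."""
    return sum(1 for line in apt_output.splitlines() if "[upgradable from:" in line)


def reboot_required(flag_path: str = "/var/run/reboot-required") -> bool:
    """Ubuntu/Debian create this file when an installed update needs a reboot."""
    return os.path.exists(flag_path)


def pending_updates() -> int:
    """Run apt and count the result; not called at import time."""
    out = subprocess.run(["apt", "list", "--upgradable"],
                         capture_output=True, text=True, check=True).stdout
    return count_upgradable(out)
```

Matching on the `[upgradable from:` marker skips the "Listing..." header line, the same way the `grep -c upgradable` above does.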

The fix (non-interactive, config-preserving)

sudo -i
export DEBIAN_FRONTEND=noninteractive NEEDRESTART_MODE=a
apt-get update
apt-get -o Dpkg::Options::="--force-confold" \
        -o Dpkg::Options::="--force-confdef" \
        -y full-upgrade
apt-get -y autoremove --purge

--force-confdef + --force-confold → take the package's default action where dpkg has one, otherwise keep my existing config file; either way, don't stop to ask.
NEEDRESTART_MODE=a → let needrestart restart affected services itself instead of showing the blue full-screen menu that hangs an unattended run.
Result: 45 upgraded, 2 newly installed, 0 removed. Clean.

Then activate the kernel/systemd the box had been holding:

$ reboot              # ~90s of no DNS for the LAN — an on-purpose action, not a background one
$ uname -r
5.15.0-181-generic    # back on the tailnet, now on the staged kernel

Before / after

47 → 0. The package badge cleared.

What I deliberately did NOT run

The monitor also flags a full Ubuntu release upgrade waiting. do-release-upgrade on a remote, headless, house-critical box is a scheduled-window job — with a backup and a console in reach — not an unattended one. The dashboard surfacing it is the win; choosing to defer it is the right call. So I left it flagged, on purpose.

The point

I'm not disciplined about my boring boxes — nobody is. The only reason this got caught is one badge in one dashboard I already look at. The tool is HomeLab Monitor — one container, MIT, no Prometheus/Grafana to stand up:

docker compose up -d --build
# github.com/SikamikanikoBG/homelab-monitor

When did you last log into your most reliable box, and how would you find out it was a month behind? Mine used a badge. What's watching yours — a cron apt list --upgradable, unattended-upgrades mail you actually read, or nothing? Genuinely curious which holds up for people.