Rob

Posted on • Originally published at vibescoder.dev

Model Showdown Round 2: Adding Gemma, Kimi, and 579 GB of Stubborn Optimism

At the end of Round 1, we promised a rematch. More models. Fixed settings. Harder questions about what "local inference" really means when you push past what fits in VRAM.

This is that rematch.

We added two models that the Coder dev team specifically requested: Gemma 4 from Google (27B parameters, fits comfortably on the RTX 5090) and Kimi K2 from Moonshot AI (1 trillion parameters, does not fit in anything reasonable). We also reran every model from Round 1 with fixes for the configuration issues that tripped up three of them.

The results changed the leaderboard significantly.

What We Fixed from Round 1

Round 1 had three avoidable failures:

  1. Qwen hit the token limit — scored 28/100 because the output was capped at 4,096 tokens and the code got truncated mid-f-string. The model was generating at 1,510 tok/s. It wasn't slow. We just cut it off.

  2. Codestral and DeepSeek built interactive menus — both interpreted "commands: add, list, complete, delete" as while True: input() loops instead of CLI argument parsers. The code worked perfectly if you used it interactively. Our automated test suite couldn't.

  3. Context windows varied — each model had different settings, making the comparison uneven.

For Round 2:

| Setting | Round 1 | Round 2 |
| --- | --- | --- |
| num_predict (max output tokens) | 4,096 | 16,384 |
| num_ctx (context window) | Varied | 16,384 for all |
| Prompt clarity | "Commands: add, list, complete, delete" | "using argparse or sys.argv, NOT interactive input" |
| Model management | Random loading | Auto-unload previous, preload next |

Same prompt. Same task. Same validation. Just fair settings this time.
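
For reference, those settings ride along as options on every Ollama request. A Round 2 call looks roughly like the sketch below; the endpoint and option names are Ollama's standard API, while the model name, prompt placeholder, and keep_alive-based unload are illustrative assumptions rather than a dump of our actual harness.

import requests

PROMPT = "Build a todo CLI..."  # placeholder, not the actual Round 2 benchmark prompt

payload = {
    "model": "gemma4",                 # whichever contender is up next
    "messages": [{"role": "user", "content": PROMPT}],
    "stream": False,
    "keep_alive": 0,                   # unload the model as soon as the response finishes
    "options": {
        "num_ctx": 16384,              # context window, fixed for every model
        "num_predict": 16384,          # max output tokens, the fix for Round 1's truncation
    },
}
resp = requests.post("http://localhost:11434/api/chat", json=payload, timeout=600)
print(resp.json()["message"]["content"])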

Adding Gemma 4

Google released Gemma 4 while we were writing the Round 1 results. The 27B parameter model downloads as a 9.6 GB file through Ollama — the smallest of our serious contenders.

ollama pull gemma4

That's it. Model pulled, loaded onto the 5090 in seconds, registered in Coder's admin panel as another OpenAI-compatible model on the existing Ollama provider. The entire setup was one command and two form fields.

After Round 1's configuration adventure with five different models, this felt almost anticlimactic. In the best possible way.
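
For the record, a one-minute smoke test through that OpenAI-compatible endpoint looks something like this (a sketch using the openai Python package; Ollama ignores the API key, and the model name matches the pull above):

from openai import OpenAI

# Ollama serves an OpenAI-compatible API under /v1 on its default port
client = OpenAI(base_url="http://localhost:11434/v1", api_key="ollama")

resp = client.chat.completions.create(
    model="gemma4",
    messages=[{"role": "user", "content": "Write a haiku about VRAM."}],
)
print(resp.choices[0].message.content)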

Adding Kimi K2 (The Hard Way)

Kimi K2 is a different story entirely.

The numbers: 1 trillion total parameters, 32 billion active per token (Mixture of Experts architecture), 256K context window. The quantized model (Q4_K_M) is 579 GB across 13 shard files. Our RTX 5090 has 32 GB of VRAM.

We knew this going in. Round 1's post explicitly said Kimi would need API testing because it's too large for local. But this blog is about pushing boundaries with consumer hardware, and "it probably won't work" isn't a reason not to try. It's the reason to try.

Step 1: Getting llama.cpp Built

Ollama doesn't offer Kimi K2 for local inference — only a cloud-hosted variant. So we went to llama.cpp, the C++ inference engine that supports loading models larger than VRAM via memory-mapped NVMe offloading.

Building it required installing half of Ubuntu's dev toolchain:

sudo apt install -y cmake build-essential nvidia-cuda-toolkit
cd ~ && git clone https://github.com/ggerganov/llama.cpp.git
cd llama.cpp
cmake -B build -DGGML_CUDA=ON
cmake --build build --config Release -j$(nproc)

First roadblock: cmake wasn't installed. Fixed with apt.

Second roadblock: CUDA toolkit not found. Fixed with nvidia-cuda-toolkit.

Third roadblock: nvcc fatal: Unsupported gpu architecture 'compute_120a'. The RTX 5090 is Blackwell (compute capability 12.0), but the CUDA toolkit in Ubuntu's apt repos is stuck at CUDA 12.0, which is too old to know that architecture exists. The fix was targeting an older compatible architecture:

cmake -B build -DGGML_CUDA=ON -DCMAKE_CUDA_ARCHITECTURES=89

Compute capability 8.9 (Ada Lovelace) code runs fine on the 5090 via backward compatibility. Not ideal, but it builds.
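
If you want to know what number to hand CMAKE_CUDA_ARCHITECTURES before the build fails, the driver will tell you the card's compute capability. A small sketch; the compute_cap query field assumes a reasonably recent nvidia-smi:

import subprocess

# Prints e.g. "12.0" on Blackwell or "8.9" on Ada Lovelace
cap = subprocess.run(
    ["nvidia-smi", "--query-gpu=compute_cap", "--format=csv,noheader"],
    capture_output=True, text=True, check=True,
).stdout.strip()
print(f"GPU compute capability: {cap}")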

Step 2: Downloading 579 GB

Next: the Hugging Face CLI. Which required pip. Which was externally managed. Which required --break-system-packages. Which installed but wasn't on PATH. Which turned out to be deprecated in favor of the hf CLI. Which required python3.12-venv. Which left behind a broken virtual environment that needed manual cleanup.

sudo apt install -y python3-pip python3.12-venv
pip install huggingface-hub[cli] --break-system-packages
rm -rf ~/.hf-cli
curl -LsSf https://hf.co/cli/install.sh | bash
source ~/.bashrc

Then the actual download:

~/.local/bin/hf download unsloth/Kimi-K2-Instruct-GGUF --include "*Q4_K_M*" --local-dir ~/models/kimi-k2

The download started reporting 384 GB, then revised upward to 432 GB, then 481 GB, before finally settling at the full 579 GB. The HF CLI discovers shards progressively — it didn't know the full file list upfront.

Kimi K2 mid-download — 327 GB down, revising the total upward as new shards are discovered.

3 hours and 27 minutes later, 13 shard files totaling 579 GB sat on the NVMe. At ~370 Mbps sustained throughput.
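
In hindsight, most of the pip-and-PATH detour could probably have been skipped by driving the download from Python instead. A sketch using huggingface_hub's snapshot_download; we ran the hf CLI above, so treat this as an untested alternative:

import os
from huggingface_hub import snapshot_download

# Same repo and quant filter as the CLI command above
snapshot_download(
    repo_id="unsloth/Kimi-K2-Instruct-GGUF",
    allow_patterns=["*Q4_K_M*"],                       # only the Q4_K_M shard set
    local_dir=os.path.expanduser("~/models/kimi-k2"),
)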

Step 3: The VRAM Math

First attempt: 10 GPU layers. Tried to allocate 94 GB on a 32 GB card. Dead.

The math: 94 GB / 10 layers ≈ 9.4 GB per layer. With 32 GB of VRAM, that's roughly 3 layers maximum. MoE architectures make each layer massive because every expert's weights live in the same layer.

We settled on 2 GPU layers (confirmed working, 3 was borderline). That means ~18 GB on the GPU, the remaining ~560 GB paging from NVMe via memory-mapped I/O. The OS's virtual memory system handles the page faults — when inference needs weights that aren't in RAM, it reads them from the NVMe on demand.
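
The back-of-envelope version of that math, with the layer count as a rough assumption rather than a spec-sheet number:

# Rough layer budget for an MoE model paging from NVMe
model_size_gb = 579      # Q4_K_M shards on disk
n_layers = 60            # approximate transformer layer count for Kimi K2
vram_gb = 32             # RTX 5090
headroom_gb = 4          # KV cache, CUDA context, fragmentation (a guess)

per_layer_gb = model_size_gb / n_layers                        # ~9.7 GB per layer
layers_that_fit = int((vram_gb - headroom_gb) // per_layer_gb)
print(f"{per_layer_gb:.1f} GB/layer -> about {layers_that_fit} layers on the GPU")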

Step 4: The Conversation Mode Bug

Here's where it got interesting. llama.cpp's llama-cli has a --no-conversation flag that's supposed to run a single prompt and exit. It doesn't work. Every run dropped into an interactive > prompt, waiting for input. Our benchmark script would hang indefinitely.

We tried:

  • --no-conversation flag (ignored)
  • --no-display-prompt flag (still conversational)
  • Piping prompt via -p with -e flag (still conversational)

llama-cli ignoring --no-conversation and dropping into an interactive prompt, hanging the benchmark script.

Three benchmark attempts. Three hangs. The script captured zero timing data from Kimi because it was waiting for a conversation that would never end.

Step 5: The Fix — llama-server

Instead of fighting the CLI, we ditched it. llama.cpp ships with llama-server, which exposes an OpenAI-compatible HTTP API — the exact same interface Ollama uses. We wrote a standalone benchmark script that:

  1. Starts llama-server as a background process
  2. Polls /health until the 579 GB model finishes loading
  3. Sends the benchmark prompt to /v1/chat/completions with streaming
  4. Captures every metric programmatically — TTFT, total time, tokens, tok/s
  5. Runs the full validation suite
  6. Shuts down the server

No conversation mode. No stopwatch. No manual intervention.

server_cmd = [
    LLAMA_SERVER,
    "-m", MODEL_PATH,
    "--n-gpu-layers", str(N_GPU_LAYERS),
    "--mmap",
    "-c", str(CTX_SIZE),
    "--port", str(PORT),
]
server_proc = subprocess.Popen(server_cmd, ...)

# Wait for 579 GB to load into memory
wait_for_server(PORT, timeout=900)

# Hit the same API as Ollama
url = f"http://127.0.0.1:{PORT}/v1/chat/completions"
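
The two pieces the excerpt leans on, wait_for_server and the streaming metric capture, look roughly like this. It's a simplified sketch of the script rather than a drop-in copy: llama-server answers /health with 200 once the model is loaded, and the /v1 endpoint streams OpenAI-style "data:" chunks.

import time
import requests

PORT = 8080
PROMPT = "Build a todo CLI..."   # placeholder for the benchmark prompt

def wait_for_server(port, timeout=900):
    """Poll /health until llama-server reports the model is loaded."""
    deadline = time.time() + timeout
    while time.time() < deadline:
        try:
            if requests.get(f"http://127.0.0.1:{port}/health", timeout=5).status_code == 200:
                return
        except requests.exceptions.ConnectionError:
            pass                     # server process is still starting up
        time.sleep(5)
    raise TimeoutError("model never finished loading")

wait_for_server(PORT)                # ~375 seconds for Kimi K2

# Stream the completion so the first chunk timestamps time-to-first-token
start = time.time()
resp = requests.post(
    f"http://127.0.0.1:{PORT}/v1/chat/completions",
    json={"messages": [{"role": "user", "content": PROMPT}], "stream": True},
    stream=True,
)
ttft = None
for line in resp.iter_lines():
    if line.startswith(b"data: ") and not line.endswith(b"[DONE]"):
        if ttft is None:
            ttft = time.time() - start   # first streamed chunk = first token
print(f"TTFT: {ttft:.1f}s, total: {time.time() - start:.1f}s")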

It worked on the first try. The model loaded in 375 seconds (6.3 minutes), then generation began.

Kimi K2 generating code at 0.6 tokens per second. Every character is paging through 579 GB of weights on an NVMe drive.

The Results: Performance

Raw benchmark output — Qwen finishing its run and Kimi K2 kicking off next.

| Model | TTFT | Total Time | Output Tokens | Tok/s | Lines |
| --- | --- | --- | --- | --- | --- |
| Codestral 22B | 1.75s | 10.01s | 826 | 82.5 | 80 |
| Devstral 24B | 2.11s | 9.97s | 703 | 70.5 | 98 |
| Gemma 4 27B | 3.92s | 11.77s | 1,966 | 167.1 | 171 |
| DeepSeek R1 14B | 7.21s | 12.44s | 1,451 | 116.7 | 84 |
| Qwen 3.5 MoE 35B | 27.00s | 35.23s | 5,020 | 142.5 | 144 |
| Kimi K2 1T | 68.90s | 1,140.94s | 686 | 0.6 | 87 |

Gemma 4 immediately stands out. 167 tok/s is the fastest generation speed of any model we've tested that also scored perfectly — faster than Sonnet 4.6's 104 tok/s from Round 1. It wrote 1,966 tokens (171 lines) in under 12 seconds.

Devstral remains the wall-clock champion at 9.97 seconds total, though Codestral edges it on TTFT (1.75s vs 2.11s).

Kimi K2 is in a different universe. 68.9 seconds before the first token appeared (that's prompt evaluation at 1.5 tok/s across 579 GB of weights). Then 19 minutes of generation at 0.6 tok/s. Total wall clock including model load: 25 minutes.

The Results: Quality

| Model | Syntax valid | Features (x/10) | Functional (x/7) | Score |
| --- | --- | --- | --- | --- |
| Gemma 4 27B | Yes | 10/10 | 7/7 | 100 |
| Devstral 24B | Yes | 10/10 | 7/7 | 100 |
| DeepSeek R1 14B | Yes | 10/10 | 7/7 | 100 |
| Qwen 3.5 MoE 35B | Yes | 10/10 | 7/7 | 100 |
| Codestral 22B | Yes | 9/10 | 7/7 | 94 |
| Kimi K2 1T | Yes | 10/10 | 6/7 | 94 |

Four out of six models scored 100. That's up from three in Round 1.

The Fixes Worked

Qwen: 28 → 100. With the token limit raised from 4,096 to 16,384, Qwen wrote 5,020 tokens — a complete 144-line program with ANSI color codes, proper error handling, and clean argparse subparsers. The speed is still absurd (142.5 tok/s with a 27-second cold start), but now it finishes what it starts.

DeepSeek R1: 60 → 100. The clarified prompt ("NOT interactive input") worked. DeepSeek built an argparse-based CLI with proper subparsers, colorama integration, and structured error handling. It still uses <think> blocks (the 1,451 tokens include reasoning), but the final code is correct.

Codestral: 60 → 94. Also switched to argparse, passing all 7 functional tests. But it missed error handling entirely — no try/except blocks, no input validation. Its complete command also silently deletes the record instead of marking it done. Functional but sloppy.
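
For reference, the CLI shape most of the models converged on once the prompt said argparse looks roughly like this (an illustrative sketch, not any model's verbatim output):

import argparse

def main():
    parser = argparse.ArgumentParser(description="Todo CLI")
    sub = parser.add_subparsers(dest="command", required=True)

    add = sub.add_parser("add", help="add a task")
    add.add_argument("title")

    sub.add_parser("list", help="list all tasks")

    complete = sub.add_parser("complete", help="mark a task as done")
    complete.add_argument("task_id", type=int)

    delete = sub.add_parser("delete", help="delete a task")
    delete.add_argument("task_id", type=int)

    args = parser.parse_args()
    # ...dispatch to the SQLite-backed handler for args.command...

if __name__ == "__main__":
    main()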

The New Models

Gemma 4 wrote the most polished code of any model in either round. 171 lines with a dedicated Colors class for ANSI escape codes, emoji status indicators (✅, ⏳, 🎉, 🗑️), full try/except/finally blocks on every database operation, and a clean argparse architecture. It writes like a senior developer who actually cares about user experience.

Kimi K2 wrote clean, minimal code — 87 lines using with-statement context managers for database connections (the most Pythonic approach of any model), proper sys.exit(1) on errors, and a formatted table output. It scored 94 instead of 100 because one functional test failed: the delete command reported "Task 2 not found" due to the model storing its database at ~/.todo.db (a global path) instead of a relative path. Stale data from an earlier test run interfered. The code logic is correct — it's a test isolation issue, not a bug.
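
The pattern in question looks roughly like this (an illustrative sketch rather than Kimi's actual output; the hard-coded home-directory path is what let stale data from an earlier run leak into the delete test):

import sqlite3
from pathlib import Path

DB_PATH = Path.home() / ".todo.db"   # a global path: every run shares the same database

def complete_task(task_id: int) -> None:
    # sqlite3 connections are context managers: commit on success, roll back on error
    with sqlite3.connect(DB_PATH) as conn:
        cur = conn.execute("UPDATE tasks SET done = 1 WHERE id = ?", (task_id,))
        if cur.rowcount == 0:
            raise SystemExit(f"Task {task_id} not found")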

Style Comparison: How Each Model Writes

The code style differences are telling:

Gemma 4 (171 lines): Enterprise polish. ANSI color class, emoji, docstrings on every function, defensive error handling everywhere. The code you'd put in a demo.

Qwen 3.5 (144 lines): Also polished — ANSI codes, structured table output, exit-on-error patterns. More defensive than Gemma but less decorative.

Devstral (98 lines): Minimal and correct. Flat functions, no class, CURRENT_TIMESTAMP in SQL. The code you'd actually ship.

Kimi K2 (87 lines): Even more minimal. Context managers everywhere, zero waste. Reads like it was written by someone who's read a lot of production Python.

DeepSeek R1 (84 lines): Compact with colorama dependency — the only model that imported an external library. Risky in an isolated test environment.

Codestral (80 lines): The shortest, and it shows. No error handling, buggy complete command. Brevity at the cost of correctness.

The Speed Tiers

Round 2 reveals two distinct performance tiers for local inference:

Tier 1: VRAM-Native (~10-35 seconds)

Models that fit entirely in the RTX 5090's 32 GB VRAM. Response times competitive with cloud APIs.

| Model | Size | Total Time | Tok/s |
| --- | --- | --- | --- |
| Devstral 24B | 14 GB | 9.97s | 70.5 |
| Codestral 22B | 12 GB | 10.01s | 82.5 |
| Gemma 4 27B | 9.6 GB | 11.77s | 167.1 |
| DeepSeek R1 14B | 9 GB | 12.44s | 116.7 |
| Qwen 3.5 MoE 35B | 23 GB | 35.23s | 142.5 |

Tier 2: NVMe-Offloaded (~19 minutes)

Models too large for VRAM, paging from NVMe via mmap. Functional but glacial.

| Model | Size | Total Time | Tok/s |
| --- | --- | --- | --- |
| Kimi K2 1T | 579 GB | 1,141s | 0.6 |

The gap between tiers is two orders of magnitude. Gemma 4 at 167 tok/s vs Kimi K2 at 0.6 tok/s. Both wrote correct code. One took 12 seconds, the other took 19 minutes.

This isn't a criticism of Kimi K2 — it's a 1 trillion parameter model running on hardware that costs less than a month of cloud API credits. The fact that it works at all is the story. The fact that it wrote correct, clean, well-structured code is the punchline.

Round 1 vs Round 2: Combined Leaderboard

| Model | Round | Size | Tok/s | Score |
| --- | --- | --- | --- | --- |
| Gemma 4 27B | R2 | 9.6 GB | 167.1 | 100 |
| Sonnet 4.6 | R1 | Cloud | 104.2 | 100 |
| Devstral 24B | R2 | 14 GB | 70.5 | 100 |
| Opus 4.6 | R1 | Cloud | 74.3 | 100 |
| Qwen 3.5 MoE 35B | R2 | 23 GB | 142.5 | 100 |
| DeepSeek R1 14B | R2 | 9 GB | 116.7 | 100 |
| Codestral 22B | R2 | 12 GB | 82.5 | 94 |
| Kimi K2 1T | R2 | 579 GB | 0.6 | 94 |

Gemma 4 is now the fastest model with a perfect score — local or cloud. A 9.6 GB model running on consumer hardware, outperforming Anthropic's Sonnet 4.6 on raw throughput while matching it on code quality.

The local-vs-cloud gap hasn't just closed. On this task, local won.

What We Learned

Configuration matters more than model selection. Three models went from failing to perfect with two setting changes. If your local models are underperforming, check your token limits and prompt clarity before blaming the model.

The prompt is still the variable. Round 1's "ambiguous CLI" issue was a prompt problem, not a model problem. Six words ("NOT interactive input") fixed two models.

VRAM is the cliff. The performance difference between "fits in VRAM" and "doesn't fit in VRAM" is more than 100x. There's no gradual degradation — you're either generating at 70-167 tok/s or you're at 0.6. If your model fits, you're competitive with cloud. If it doesn't, you're watching paint dry.

Big models can still write good code slowly. Kimi K2 at 0.6 tok/s is impractical for interactive coding. But for batch processing, overnight code generation, or "I need an answer and I don't care when" use cases, a 1T model on consumer NVMe is a real option that didn't exist a year ago.

Gemma 4 is the new default. Fastest throughput, perfect score, smallest download, most polished output. If you're running a homelab with a single GPU, it's the model to install first.

What's Next: Gemma vs Opus — A Real Fight

Round 1 tested a toy todo app. Round 2 fixed the settings and added models. Both rounds answered a useful question: can local models write correct code for a well-defined task?

The answer is yes. Four out of six scored perfect. That question is settled.

The next question is harder: can a local model replace my daily driver on a real task?

My daily driver is Opus 4.6. It's what I use for everything on vibescoder.dev — features, refactors, debugging, the works. It's also a cloud model with per-token costs, rate limits, and a dependency on someone else's infrastructure.

Gemma 4 just beat every model in the benchmark on speed and matched the best on quality. It runs locally on my 5090 at 167 tok/s with zero API costs. The obvious question: can it actually do the job?

Round 3 will be a head-to-head. Gemma 4 vs Opus 4.6, same task, but not a toy. We're going to pick a real feature from the vibescoder.dev backlog — something that touches multiple files, requires architectural decisions, and has enough ambiguity to separate a good model from a great one. The kind of task I'd normally hand to Opus without thinking.

If Gemma holds up, local-first AI coding isn't just viable for benchmarks. It's viable for production.

By the Numbers

  • 6 local models benchmarked (up from 4 local + 2 cloud in Round 1)
  • 4 perfect scores (up from 3)
  • 579 GB downloaded over 3 hours 27 minutes for Kimi K2
  • 375 seconds to load 579 GB into memory-mapped NVMe
  • 68.9 seconds for Kimi K2's first token
  • 1,140 seconds (19 minutes) for Kimi K2's total generation
  • 9.6 GB for Gemma 4 — smallest model, highest score + speed
  • 167.1 tok/s from Gemma 4 — fastest perfect-scoring model across both rounds
  • 0.6 tok/s from Kimi K2 — slowest, but correct
  • 16,384 token limit that saved Qwen from another truncation
  • 2 GPU layers (out of ~60+) that fit in VRAM for Kimi K2
  • 3 Round 1 bugs fixed by configuration changes, not model changes
  • 1 llama-cli conversation mode bug worked around with llama-server
  • 0 API costs for everything
