Rob

Posted on • Originally published at vibescoder.dev

Model Showdown Round 2: Adding Gemma, Kimi, and 579 GB of Stubborn Optimism

At the end of Round 1, we promised a rematch. More models. Fixed settings. Harder questions about what "local inference" really means when you push past what fits in VRAM.

This is that rematch.

We added two models that the Coder dev team specifically requested: Gemma 4 from Google (27B parameters, fits comfortably on the RTX 5090) and Kimi K2 from Moonshot AI (1 trillion parameters, does not fit in anything reasonable). We also reran every model from Round 1 with fixes for the configuration issues that tripped up three of them.

The results changed the leaderboard significantly.

What We Fixed from Round 1

Round 1 had three avoidable failures:

  1. Qwen hit the token limit — scored 28/100 because the output was capped at 4,096 tokens and the code got truncated mid-f-string. The model was generating at 1,510 tok/s. It wasn't slow. We just cut it off.

  2. Codestral and DeepSeek built interactive menus — both interpreted "commands: add, list, complete, delete" as while True: input() loops instead of CLI argument parsers. The code worked perfectly if you used it interactively. Our automated test suite couldn't.

  3. Context windows varied — each model had different settings, making the comparison uneven.

For Round 2:

| Setting | Round 1 | Round 2 |
| --- | --- | --- |
| num_predict (max output tokens) | 4,096 | 16,384 |
| num_ctx (context window) | Varied | 16,384 for all |
| Prompt clarity | "Commands: add, list, complete, delete" | "using argparse or sys.argv, NOT interactive input" |
| Model management | Random loading | Auto-unload previous, preload next |

Same prompt. Same task. Same validation. Just fair settings this time.
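
For reference, those settings ride along as options on every Ollama request. A Round 2 call looks roughly like the sketch below; the endpoint and option names are Ollama's standard API, while the model name, prompt placeholder, and keep_alive-based unload are illustrative assumptions rather than a dump of our actual harness.

import requests

PROMPT = "Build a todo CLI..."  # placeholder, not the actual Round 2 benchmark prompt

payload = {
    "model": "gemma4",                 # whichever contender is up next
    "messages": [{"role": "user", "content": PROMPT}],
    "stream": False,
    "keep_alive": 0,                   # unload the model as soon as the response finishes
    "options": {
        "num_ctx": 16384,              # context window, fixed for every model
        "num_predict": 16384,          # max output tokens, the fix for Round 1's truncation
    },
}
resp = requests.post("http://localhost:11434/api/chat", json=payload, timeout=600)
print(resp.json()["message"]["content"])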

Adding Gemma 4

Google released Gemma 4 while we were writing the Round 1 results. The 27B parameter model downloads as a 9.6 GB file through Ollama — the smallest of our serious contenders.

ollama pull gemma4

That's it. Model pulled, loaded onto the 5090 in seconds, registered in Coder's admin panel as another OpenAI-compatible model on the existing Ollama provider. The entire setup was one command and two form fields.

After Round 1's configuration adventure with five different models, this felt almost anticlimactic. In the best possible way.
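
For the record, a one-minute smoke test through that OpenAI-compatible endpoint looks something like this (a sketch using the openai Python package; Ollama ignores the API key, and the model name matches the pull above):

from openai import OpenAI

# Ollama serves an OpenAI-compatible API under /v1 on its default port
client = OpenAI(base_url="http://localhost:11434/v1", api_key="ollama")

resp = client.chat.completions.create(
    model="gemma4",
    messages=[{"role": "user", "content": "Write a haiku about VRAM."}],
)
print(resp.choices[0].message.content)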

Adding Kimi K2 (The Hard Way)

Kimi K2 is a different story entirely.

The numbers: 1 trillion total parameters, 32 billion active per token (Mixture of Experts architecture), 256K context window. The quantized model (Q4_K_M) is 579 GB across 13 shard files. Our RTX 5090 has 32 GB of VRAM.

We knew this going in. Round 1's post explicitly said Kimi would need API testing because it's too large for local. But this blog is about pushing boundaries with consumer hardware, and "it probably won't work" isn't a reason not to try. It's the reason to try.

Step 1: Getting llama.cpp Built

Ollama doesn't offer Kimi K2 for local inference — only a cloud-hosted variant. So we went to llama.cpp, the C++ inference engine that supports loading models larger than VRAM via memory-mapped NVMe offloading.

Building it required installing half of Ubuntu's dev toolchain:

sudo apt install -y cmake build-essential nvidia-cuda-toolkit
cd ~ && git clone https://github.com/ggerganov/llama.cpp.git
cd llama.cpp
cmake -B build -DGGML_CUDA=ON
cmake --build build --config Release -j$(nproc)

First roadblock: cmake wasn't installed. Fixed with apt.

Second roadblock: CUDA toolkit not found. Fixed with nvidia-cuda-toolkit.

Third roadblock: nvcc fatal: Unsupported gpu architecture 'compute_120a'. The RTX 5090 is Blackwell (compute capability 12.0), but the CUDA toolkit in Ubuntu's apt repos is stuck at CUDA 12.0, which is too old to know that architecture exists. The fix was targeting an older compatible architecture:

cmake -B build -DGGML_CUDA=ON -DCMAKE_CUDA_ARCHITECTURES=89

Compute capability 8.9 (Ada Lovelace) code runs fine on the 5090 via backward compatibility. Not ideal, but it builds.
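
If you want to know what number to hand CMAKE_CUDA_ARCHITECTURES before the build fails, the driver will tell you the card's compute capability. A small sketch; the compute_cap query field assumes a reasonably recent nvidia-smi:

import subprocess

# Prints e.g. "12.0" on Blackwell or "8.9" on Ada Lovelace
cap = subprocess.run(
    ["nvidia-smi", "--query-gpu=compute_cap", "--format=csv,noheader"],
    capture_output=True, text=True, check=True,
).stdout.strip()
print(f"GPU compute capability: {cap}")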

Step 2: Downloading 579 GB

Next: the Hugging Face CLI. Which required pip. Which was externally managed. Which required --break-system-packages. Which installed but wasn't on PATH. Which turned out to be deprecated in favor of the hf CLI. Which required python3.12-venv. Which left behind a broken virtual environment that needed manual cleanup.

sudo apt install -y python3-pip python3.12-venv
pip install huggingface-hub[cli] --break-system-packages
rm -rf ~/.hf-cli
curl -LsSf https://hf.co/cli/install.sh | bash
source ~/.bashrc

Then the actual download:

~/.local/bin/hf download unsloth/Kimi-K2-Instruct-GGUF --include "*Q4_K_M*" --local-dir ~/models/kimi-k2

The download started reporting 384 GB, then revised upward to 432 GB, then 481 GB, before finally settling at the full 579 GB. The HF CLI discovers shards progressively — it didn't know the full file list upfront.

Kimi K2 mid-download — 327 GB down, revising the total upward as new shards are discovered.

3 hours and 27 minutes later, 13 shard files totaling 579 GB sat on the NVMe. At ~370 Mbps sustained throughput.
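
In hindsight, most of the pip-and-PATH detour could probably have been skipped by driving the download from Python instead. A sketch using huggingface_hub's snapshot_download; we ran the hf CLI above, so treat this as an untested alternative:

import os
from huggingface_hub import snapshot_download

# Same repo and quant filter as the CLI command above
snapshot_download(
    repo_id="unsloth/Kimi-K2-Instruct-GGUF",
    allow_patterns=["*Q4_K_M*"],                       # only the Q4_K_M shard set
    local_dir=os.path.expanduser("~/models/kimi-k2"),
)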

Step 3: The VRAM Math

First attempt: 10 GPU layers. Tried to allocate 94 GB on a 32 GB card. Dead.

The math: 94 GB / 10 layers ≈ 9.4 GB per layer. With 32 GB of VRAM, that's roughly 3 layers maximum. MoE architectures make each layer massive because every expert's weights live in the same layer.

We settled on 2 GPU layers (confirmed working, 3 was borderline). That means ~18 GB on the GPU, the remaining ~560 GB paging from NVMe via memory-mapped I/O. The OS's virtual memory system handles the page faults — when inference needs weights that aren't in RAM, it reads them from the NVMe on demand.
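
The back-of-envelope version of that math, with the layer count as a rough assumption rather than a spec-sheet number:

# Rough layer budget for an MoE model paging from NVMe
model_size_gb = 579      # Q4_K_M shards on disk
n_layers = 60            # approximate transformer layer count for Kimi K2
vram_gb = 32             # RTX 5090
headroom_gb = 4          # KV cache, CUDA context, fragmentation (a guess)

per_layer_gb = model_size_gb / n_layers                        # ~9.7 GB per layer
layers_that_fit = int((vram_gb - headroom_gb) // per_layer_gb)
print(f"{per_layer_gb:.1f} GB/layer -> about {layers_that_fit} layers on the GPU")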

Step 4: The Conversation Mode Bug

Here's where it got interesting. llama.cpp's llama-cli has a --no-conversation flag that's supposed to run a single prompt and exit. It doesn't work. Every run dropped into an interactive > prompt, waiting for input. Our benchmark script would hang indefinitely.

We tried:

  • --no-conversation flag (ignored)
  • --no-display-prompt flag (still conversational)
  • Piping prompt via -p with -e flag (still conversational)

llama-cli ignoring --no-conversation and dropping into an interactive prompt, hanging the benchmark script.

Three benchmark attempts. Three hangs. The script captured zero timing data from Kimi because it was waiting for a conversation that would never end.

Step 5: The Fix — llama-server

Instead of fighting the CLI, we ditched it. llama.cpp ships with llama-server, which exposes an OpenAI-compatible HTTP API — the exact same interface Ollama uses. We wrote a standalone benchmark script that:

  1. Starts llama-server as a background process
  2. Polls /health until the 579 GB model finishes loading
  3. Sends the benchmark prompt to /v1/chat/completions with streaming
  4. Captures every metric programmatically — TTFT, total time, tokens, tok/s
  5. Runs the full validation suite
  6. Shuts down the server

No conversation mode. No stopwatch. No manual intervention.

server_cmd = [
    LLAMA_SERVER,
    "-m", MODEL_PATH,
    "--n-gpu-layers", str(N_GPU_LAYERS),
    "--mmap",
    "-c", str(CTX_SIZE),
    "--port", str(PORT),
]
server_proc = subprocess.Popen(server_cmd, ...)

# Wait for 579 GB to load into memory
wait_for_server(PORT, timeout=900)

# Hit the same API as Ollama
url = f"http://127.0.0.1:{PORT}/v1/chat/completions"
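
The two pieces the excerpt leans on, wait_for_server and the streaming metric capture, look roughly like this. It's a simplified sketch of the script rather than a drop-in copy: llama-server answers /health with 200 once the model is loaded, and the /v1 endpoint streams OpenAI-style "data:" chunks.

import time
import requests

PORT = 8080
PROMPT = "Build a todo CLI..."   # placeholder for the benchmark prompt

def wait_for_server(port, timeout=900):
    """Poll /health until llama-server reports the model is loaded."""
    deadline = time.time() + timeout
    while time.time() < deadline:
        try:
            if requests.get(f"http://127.0.0.1:{port}/health", timeout=5).status_code == 200:
                return
        except requests.exceptions.ConnectionError:
            pass                     # server process is still starting up
        time.sleep(5)
    raise TimeoutError("model never finished loading")

wait_for_server(PORT)                # ~375 seconds for Kimi K2

# Stream the completion so the first chunk timestamps time-to-first-token
start = time.time()
resp = requests.post(
    f"http://127.0.0.1:{PORT}/v1/chat/completions",
    json={"messages": [{"role": "user", "content": PROMPT}], "stream": True},
    stream=True,
)
ttft = None
for line in resp.iter_lines():
    if line.startswith(b"data: ") and not line.endswith(b"[DONE]"):
        if ttft is None:
            ttft = time.time() - start   # first streamed chunk = first token
print(f"TTFT: {ttft:.1f}s, total: {time.time() - start:.1f}s")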

It worked on the first try. The model loaded in 375 seconds (6.3 minutes), then generation began.

Kimi K2 generating code at 0.6 tokens per second. Every character is paging through 579 GB of weights on an NVMe drive.

The Results: Performance

Raw benchmark output — Qwen finishing its run and Kimi K2 kicking off next.

| Model | TTFT | Total Time | Output Tokens | Tok/s | Lines |
| --- | --- | --- | --- | --- | --- |
| Codestral 22B | 1.75s | 10.01s | 826 | 82.5 | 80 |
| Devstral 24B | 2.11s | 9.97s | 703 | 70.5 | 98 |
| Gemma 4 27B | 3.92s | 11.77s | 1,966 | 167.1 | 171 |
| DeepSeek R1 14B | 7.21s | 12.44s | 1,451 | 116.7 | 84 |
| Qwen 3.5 MoE 35B | 27.00s | 35.23s | 5,020 | 142.5 | 144 |
| Kimi K2 1T | 68.90s | 1,140.94s | 686 | 0.6 | 87 |

Gemma 4 immediately stands out. 167 tok/s is the fastest generation speed of any model we've tested that also scored perfectly — faster than Sonnet 4.6's 104 tok/s from Round 1. It wrote 1,966 tokens (171 lines) in under 12 seconds.

Devstral remains the wall-clock champion at 9.97 seconds total, though Codestral edges it on TTFT (1.75s vs 2.11s).

Kimi K2 is in a different universe. 68.9 seconds before the first token appeared (that's prompt evaluation at 1.5 tok/s across 579 GB of weights). Then 19 minutes of generation at 0.6 tok/s. Total wall clock including model load: 25 minutes.

The Results: Quality

| Model | Syntax valid | Features (x/10) | Functional (x/7) | Score |
| --- | --- | --- | --- | --- |
| Gemma 4 27B | Yes | 10/10 | 7/7 | 100 |
| Devstral 24B | Yes | 10/10 | 7/7 | 100 |
| DeepSeek R1 14B | Yes | 10/10 | 7/7 | 100 |
| Qwen 3.5 MoE 35B | Yes | 10/10 | 7/7 | 100 |
| Codestral 22B | Yes | 9/10 | 7/7 | 94 |
| Kimi K2 1T | Yes | 10/10 | 6/7 | 94 |

Four out of six models scored 100. That's up from three in Round 1.

The Fixes Worked

Qwen: 28 → 100. With the token limit raised from 4,096 to 16,384, Qwen wrote 5,020 tokens — a complete 144-line program with ANSI color codes, proper error handling, and clean argparse subparsers. The speed is still absurd (142.5 tok/s with a 27-second cold start), but now it finishes what it starts.

DeepSeek R1: 60 → 100. The clarified prompt ("NOT interactive input") worked. DeepSeek built an argparse-based CLI with proper subparsers, colorama integration, and structured error handling. It still uses <think> blocks (the 1,451 tokens include reasoning), but the final code is correct.

Codestral: 60 → 94. Also switched to argparse, passing all 7 functional tests. But it missed error handling entirely — no try/except blocks, no input validation. Its complete command also silently deletes the record instead of marking it done. Functional but sloppy.
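
For reference, the CLI shape most of the models converged on once the prompt said argparse looks roughly like this (an illustrative sketch, not any model's verbatim output):

import argparse

def main():
    parser = argparse.ArgumentParser(description="Todo CLI")
    sub = parser.add_subparsers(dest="command", required=True)

    add = sub.add_parser("add", help="add a task")
    add.add_argument("title")

    sub.add_parser("list", help="list all tasks")

    complete = sub.add_parser("complete", help="mark a task as done")
    complete.add_argument("task_id", type=int)

    delete = sub.add_parser("delete", help="delete a task")
    delete.add_argument("task_id", type=int)

    args = parser.parse_args()
    # ...dispatch to the SQLite-backed handler for args.command...

if __name__ == "__main__":
    main()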

The New Models

Gemma 4 wrote the most polished code of any model in either round. 171 lines with a dedicated Colors class for ANSI escape codes, emoji status indicators (✅, ⏳, 🎉, 🗑️), full try/except/finally blocks on every database operation, and a clean argparse architecture. It writes like a senior developer who actually cares about user experience.

Kimi K2 wrote clean, minimal code — 87 lines using with-statement context managers for database connections (the most Pythonic approach of any model), proper sys.exit(1) on errors, and a formatted table output. It scored 94 instead of 100 because one functional test failed: the delete command reported "Task 2 not found" due to the model storing its database at ~/.todo.db (a global path) instead of a relative path. Stale data from an earlier test run interfered. The code logic is correct — it's a test isolation issue, not a bug.
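
The pattern in question looks roughly like this (an illustrative sketch rather than Kimi's actual output; the hard-coded home-directory path is what let stale data from an earlier run leak into the delete test):

import sqlite3
from pathlib import Path

DB_PATH = Path.home() / ".todo.db"   # a global path: every run shares the same database

def complete_task(task_id: int) -> None:
    # sqlite3 connections are context managers: commit on success, roll back on error
    with sqlite3.connect(DB_PATH) as conn:
        cur = conn.execute("UPDATE tasks SET done = 1 WHERE id = ?", (task_id,))
        if cur.rowcount == 0:
            raise SystemExit(f"Task {task_id} not found")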

Style Comparison: How Each Model Writes

The code style differences are telling:

Gemma 4 (171 lines): Enterprise polish. ANSI color class, emoji, docstrings on every function, defensive error handling everywhere. The code you'd put in a demo.

Qwen 3.5 (144 lines): Also polished — ANSI codes, structured table output, exit-on-error patterns. More defensive than Gemma but less decorative.

Devstral (98 lines): Minimal and correct. Flat functions, no class, CURRENT_TIMESTAMP in SQL. The code you'd actually ship.

Kimi K2 (87 lines): Even more minimal. Context managers everywhere, zero waste. Reads like it was written by someone who's read a lot of production Python.

DeepSeek R1 (84 lines): Compact with colorama dependency — the only model that imported an external library. Risky in an isolated test environment.

Codestral (80 lines): The shortest, and it shows. No error handling, buggy complete command. Brevity at the cost of correctness.

The Speed Tiers

Round 2 reveals two distinct performance tiers for local inference:

Tier 1: VRAM-Native (~10-35 seconds)

Models that fit entirely in the RTX 5090's 32 GB VRAM. Response times competitive with cloud APIs.

| Model | Size | Total Time | Tok/s |
| --- | --- | --- | --- |
| Devstral 24B | 14 GB | 9.97s | 70.5 |
| Codestral 22B | 12 GB | 10.01s | 82.5 |
| Gemma 4 27B | 9.6 GB | 11.77s | 167.1 |
| DeepSeek R1 14B | 9 GB | 12.44s | 116.7 |
| Qwen 3.5 MoE 35B | 23 GB | 35.23s | 142.5 |

Tier 2: NVMe-Offloaded (~19 minutes)

Models too large for VRAM, paging from NVMe via mmap. Functional but glacial.

| Model | Size | Total Time | Tok/s |
| --- | --- | --- | --- |
| Kimi K2 1T | 579 GB | 1,141s | 0.6 |

The gap between tiers is two orders of magnitude. Gemma 4 at 167 tok/s vs Kimi K2 at 0.6 tok/s. Both wrote correct code. One took 12 seconds, the other took 19 minutes.

This isn't a criticism of Kimi K2 — it's a 1 trillion parameter model running on hardware that costs less than a month of cloud API credits. The fact that it works at all is the story. The fact that it wrote correct, clean, well-structured code is the punchline.

Round 1 vs Round 2: Combined Leaderboard

| Model | Round | Size | Tok/s | Score |
| --- | --- | --- | --- | --- |
| Gemma 4 27B | R2 | 9.6 GB | 167.1 | 100 |
| Sonnet 4.6 | R1 | Cloud | 104.2 | 100 |
| Devstral 24B | R2 | 14 GB | 70.5 | 100 |
| Opus 4.6 | R1 | Cloud | 74.3 | 100 |
| Qwen 3.5 MoE 35B | R2 | 23 GB | 142.5 | 100 |
| DeepSeek R1 14B | R2 | 9 GB | 116.7 | 100 |
| Codestral 22B | R2 | 12 GB | 82.5 | 94 |
| Kimi K2 1T | R2 | 579 GB | 0.6 | 94 |

Gemma 4 is now the fastest model with a perfect score — local or cloud. A 9.6 GB model running on consumer hardware, outperforming Anthropic's Sonnet 4.6 on raw throughput while matching it on code quality.

The local-vs-cloud gap hasn't just closed. On this task, local won.

What We Learned

Configuration matters more than model selection. Three models went from failing to perfect with two setting changes. If your local models are underperforming, check your token limits and prompt clarity before blaming the model.

The prompt is still the variable. Round 1's "ambiguous CLI" issue was a prompt problem, not a model problem. Six words ("NOT interactive input") fixed two models.

VRAM is the cliff. The performance difference between "fits in VRAM" and "doesn't fit in VRAM" is more than 100x. There's no gradual degradation — you're either generating at 70-167 tok/s or you're at 0.6. If your model fits, you're competitive with cloud. If it doesn't, you're watching paint dry.

Big models can still write good code slowly. Kimi K2 at 0.6 tok/s is impractical for interactive coding. But for batch processing, overnight code generation, or "I need an answer and I don't care when" use cases, a 1T model on consumer NVMe is a real option that didn't exist a year ago.

Gemma 4 is the new default. Fastest throughput, perfect score, smallest download, most polished output. If you're running a homelab with a single GPU, it's the model to install first.

What's Next: Gemma vs Opus — A Real Fight

Round 1 tested a toy todo app. Round 2 fixed the settings and added models. Both rounds answered a useful question: can local models write correct code for a well-defined task?

The answer is yes. Four out of six scored perfect. That question is settled.

The next question is harder: can a local model replace my daily driver on a real task?

My daily driver is Opus 4.6. It's what I use for everything on vibescoder.dev — features, refactors, debugging, the works. It's also a cloud model with per-token costs, rate limits, and a dependency on someone else's infrastructure.

Gemma 4 just beat every model in the benchmark on speed and matched the best on quality. It runs locally on my 5090 at 167 tok/s with zero API costs. The obvious question: can it actually do the job?

Round 3 will be a head-to-head. Gemma 4 vs Opus 4.6, same task, but not a toy. We're going to pick a real feature from the vibescoder.dev backlog — something that touches multiple files, requires architectural decisions, and has enough ambiguity to separate a good model from a great one. The kind of task I'd normally hand to Opus without thinking.

If Gemma holds up, local-first AI coding isn't just viable for benchmarks. It's viable for production.

By the Numbers

  • 6 local models benchmarked (up from 4 local + 2 cloud in Round 1)
  • 4 perfect scores (up from 3)
  • 579 GB downloaded over 3 hours 27 minutes for Kimi K2
  • 375 seconds to load 579 GB into memory-mapped NVMe
  • 68.9 seconds for Kimi K2's first token
  • 1,140 seconds (19 minutes) for Kimi K2's total generation
  • 9.6 GB for Gemma 4 — smallest model, highest score + speed
  • 167.1 tok/s from Gemma 4 — fastest perfect-scoring model across both rounds
  • 0.6 tok/s from Kimi K2 — slowest, but correct
  • 16,384 token limit that saved Qwen from another truncation
  • 2 GPU layers (out of ~60+) that fit in VRAM for Kimi K2
  • 3 Round 1 bugs fixed by configuration changes, not model changes
  • 1 llama-cli conversation mode bug worked around with llama-server
  • 0 API costs for everything
