DEV Community

kiwi_tech

Project Kiwi Reborn: Migrating to Ubuntu Server and Breaking the 16k Context Barrier

■ The Great Migration: Why Ubuntu?
For a while, my Minecraft AI agent, Kiwi, was running its "brain" (a Gemma 4 31B model) on Windows 10. While it worked, we were hitting a hard ceiling. Between the "Windows VRAM tax" and the overhead of a desktop OS, we were stuck at an 8,192-token (8k) context limit. In the world of AI agents, memory is everything. If the context window is too small, the bot forgets its goals and repeats its mistakes.

To give Kiwi the "long-term memory" it deserves, I decided to migrate the entire inference engine to Ubuntu Server 24.04.

■ The Segmentation Fault Nightmare
Migration wasn't as simple as "plug and play." My rig is a bit of a "Chimera"—a Core i9-10900X paired with four different GPUs (RTX 3060, 3050, GTX 1660 Super, and 1660 Ti).

Setting up a multi-GPU hybrid inference on Linux led to a brutal cycle of Segmentation Faults.

  • The Layer Trap: Offloading too many layers to the GPUs would cause instant crashes during the initial memory allocation.
  • The Memory Map Wall: The default mmap behavior tried to reserve massive amounts of virtual address space (over 500GB!), which triggered OS-level protections.
  • The 16k Hurdle: Moving from 8k to 16k context exponentially increases memory pressure, especially without Flash Attention (due to the older GTX 1660 cards).
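To see why a huge virtual reservation is not the same thing as using physical RAM, here is a minimal Python sketch. It is illustrative only: it maps an anonymous region rather than a model file, but the mechanism is the same one that lets a process "reserve" hundreds of gigabytes of address space while committing almost nothing.

```python
import mmap

# Reserve 1 GiB of anonymous virtual address space. Until the pages are
# actually written to, almost no physical RAM is committed -- the OS only
# hands out address space. A loader that mmaps a huge model file behaves
# similarly, which is how the 500GB+ reservation can appear.
SIZE = 1 << 30  # 1 GiB
region = mmap.mmap(-1, SIZE)
print(len(region))  # address space reserved, barely any RSS used
region.close()
```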

■ Finding the "Golden Recipe"
After hours of tweaking, crashing, and log-reading, I finally found a stable configuration that keeps the server healthy while using almost every megabyte of my 48GB of RAM and 29GB of VRAM.

The Winning Command:

```bash
llm-server ~/model/google_gemma-4-31B-it-Q4_K_M.gguf \
  --ctx-size 16384 \
  --port 8081 \
  --n-gpu-layers 48 \
  --no-mmap \
  --batch-size 512 \
  --ubatch-size 512 \
  --backend llama
```

Key Technical Takeaways:

  1. --n-gpu-layers 48: By offloading 48 of the 60 layers to the GPUs and leaving the rest to the 10900X, I created just enough "breathing room" in the VRAM for the massive 16k KV cache.

  2. --no-mmap: This was the silver bullet. It forced the server to allocate physical RAM and load the model into it up front, instead of memory-mapping the file into virtual address space, which stopped the Segmentation Faults in their tracks.

  3. Automatic Quantization: The server was smart enough to automatically compress the KV cache to Q8_0, allowing the 16k context to fit even on my mixed-generation GPU setup.
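A back-of-the-envelope calculation shows why the KV-cache quantization matters at 16k. This sketch assumes Gemma-like dimensions (60 layers, 8 KV heads, head dim 128 — hypothetical numbers; the real model's config may differ):

```python
def kv_cache_bytes(n_layers: int, ctx: int, n_kv_heads: int,
                   head_dim: int, bytes_per_elt: int) -> int:
    # K and V each store n_kv_heads * head_dim values per layer per token,
    # hence the factor of 2.
    return 2 * n_layers * ctx * n_kv_heads * head_dim * bytes_per_elt

GIB = 1 << 30
f16 = kv_cache_bytes(60, 16384, 8, 128, 2)  # 2 bytes per element at f16
q8  = kv_cache_bytes(60, 16384, 8, 128, 1)  # 1 byte per element at Q8_0
print(f"f16: {f16 / GIB:.2f} GiB, Q8_0: {q8 / GIB:.2f} GiB")
```

With these assumed dimensions, Q8_0 halves the cache from roughly 3.75 GiB to under 2 GiB, which is exactly the kind of headroom a mixed-generation GPU setup needs.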

■ Wiping the Slate: Starting from Zero
With the server finally stable, I noticed a new problem: Kiwi’s old memories were haunting it. Because I expanded the context to 16k, the bot was now vividly "remembering" all the buggy code and syntax errors it made back on Windows. It was stuck in a loop of referencing its own past failures.

To fix this "AI trauma," I’ve decided to go for a total reset:

  • Wiping the ChromaDB (long-term memory).
  • Deleting all learned skills.
  • Resetting the Minecraft world.
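The reset itself can be as simple as deleting the state directories. A minimal sketch, assuming hypothetical paths for where Kiwi keeps its ChromaDB store, skills, and world save (adjust to your own layout):

```python
import shutil
from pathlib import Path

# Hypothetical locations -- not Kiwi's actual directory layout.
STATE_DIRS = [
    Path("memory/chromadb"),  # long-term vector memory
    Path("skills"),           # learned skill scripts
    Path("world"),            # Minecraft world save
]

def reset_state(dirs: list[Path]) -> None:
    """Delete each state directory if it exists."""
    for d in dirs:
        if d.is_dir():
            shutil.rmtree(d)

if __name__ == "__main__":
    reset_state(STATE_DIRS)
```

After the wipe, the Minecraft server still needs a restart so it regenerates the world from a fresh seed.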

We are starting from Day 1 with a brand-new brain and a massive 16,384-token memory. While our real-world inference speed is a humble 2.51 tok/s due to the hybrid CPU/GPU split, the stability and intelligence of a 31B model with 16k context are a massive upgrade.

Time to see what a "clean" Kiwi can do.
