This article was originally published on runaihome.com
GGML_ASSERT(n_inputs < GGML_SCHED_MAX_SPLIT_INPUTS) Error with Gemma4:e2b
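The crash occurs when llama.cpp's backend scheduler (part of ggml) builds a graph split that needs more input tensors copied in from another backend than the compile-time limit GGML_SCHED_MAX_SPLIT_INPUTS allows. The limit is fixed at build time and has differed between ggml versions, so check the source bundled with your Ollama build for the exact value. The likely trigger is partial offloading: when WSL 2's memory limits leave the model divided between GPU and system memory, the compute graph is cut into more splits, each with more cross-device inputs. Gemma4's architecture and extended context window compound the problem on constrained systems.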
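Before changing any settings, it can help to confirm how much memory WSL 2 and the GPU actually expose. The following is a minimal diagnostic sketch, assuming a Linux shell inside WSL 2 with /proc/meminfo available. The nvidia-smi and ollama ps steps are skipped if those tools are missing.

```shell
#!/bin/sh
# Report the RAM that WSL 2 exposes to Linux (values in /proc/meminfo are in kB).
mem_total_kb=$(awk '/^MemTotal:/ {print $2}' /proc/meminfo)
mem_avail_kb=$(awk '/^MemAvailable:/ {print $2}' /proc/meminfo)
echo "RAM visible to WSL: $(( ${mem_total_kb:-0} / 1024 )) MiB total, $(( ${mem_avail_kb:-0} / 1024 )) MiB available"

# Report GPU memory if the NVIDIA tools are installed.
if command -v nvidia-smi >/dev/null 2>&1; then
    nvidia-smi --query-gpu=name,memory.total,memory.used --format=csv \
        || echo "nvidia-smi is installed but could not query the GPU"
else
    echo "nvidia-smi not found; skipping GPU memory report"
fi

# Show how any loaded model is divided between CPU and GPU.
if command -v ollama >/dev/null 2>&1; then
    ollama ps || echo "ollama ps failed; is the Ollama server running?"
else
    echo "ollama not found; skipping loaded-model report"
fi
```

If the available RAM is far below what the Windows host has, the cap set in .wslconfig is a reasonable first thing to check.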
Fix 1: Limit GPU Layer Offloading
Reduce the number of model layers offloaded to the GPU so that fewer tensors have to cross between devices. Ollama controls this through the num_gpu option, which is passed per request or stored in a Modelfile:
# Request 24 GPU layers for this call (adjust 24-28 based on your GPU VRAM)
curl http://localhost:11434/api/generate -d '{"model": "gemma4:e2b", "prompt": "Hello", "options": {"num_gpu": 24}}'
For NVIDIA GPUs with 8GB VRAM, start with num_gpu set to 20. For 12GB+ VRAM, try 28. If the error persists, decrease by 4 layers until stable.
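The step-down advice above can be sketched as a small loop. This is an illustrative script, not an official Ollama tool: it assumes the server listens on the default http://localhost:11434 and passes the layer count through the num_gpu request option. The starting value, step, and test prompt are placeholders to adjust.

```shell
#!/bin/sh
MODEL="gemma4:e2b"
HOST="http://localhost:11434"   # default Ollama address; change if yours differs
START=28                        # first layer count to try
STEP=4                          # how many layers to drop after each failure
MIN=0                           # 0 keeps every layer on the CPU

# Build the list of layer counts to try, highest first.
candidates=""
n=$START
while [ "$n" -ge "$MIN" ]; do
    candidates="${candidates:+$candidates }$n"
    n=$((n - STEP))
done
echo "Will try num_gpu values: $candidates"

# Send one short, non-streaming request with the given layer count.
# curl -f turns an HTTP error (for example, a crashed runner) into a non-zero exit.
try_layers() {
    curl -sf --max-time 300 "$HOST/api/generate" \
        -d "{\"model\":\"$MODEL\",\"prompt\":\"Hello\",\"stream\":false,\"options\":{\"num_gpu\":$1}}" \
        >/dev/null
}

if command -v curl >/dev/null 2>&1 && curl -sf --max-time 5 "$HOST/api/version" >/dev/null; then
    for n in $candidates; do
        if try_layers "$n"; then
            echo "Request succeeded with num_gpu=$n"
            break
        fi
        echo "num_gpu=$n failed; retrying with fewer layers"
        sleep 5   # give the server time to clean up the failed runner
    done
else
    echo "Ollama server not reachable at $HOST; start it and rerun"
fi
```

A single successful request does not guarantee stability under longer prompts, so treat the reported value as a starting point.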
Fix 2: Reduce Context Window Size
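The extended 8K+ context window in Gemma4:e2b enlarges the buffers the scheduler has to place, which makes the limit easier to hit. Cap the context at 2048 tokens by setting num_ctx from inside the session:
# Start the model, then type the /set command at the >>> prompt
ollama run gemma4:e2b
/set parameter num_ctx 2048
Alternatively, if the model crashes before the prompt appears, set the default context length in the Ollama server's environment before the server starts (if Ollama runs as a systemd service, set the variable in the service's environment instead):
export OLLAMA_CONTEXT_LENGTH=2048
ollama serve
With a smaller context the KV cache shrinks, so more of the model fits on a single device and the scheduler is less likely to exceed its input limit.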
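To keep the smaller context across sessions, one option is to bake it into a model variant with a Modelfile. The sketch below assumes the base model has already been pulled. The variant name gemma4-e2b-ctx2048 is a hypothetical label chosen for illustration.

```shell
#!/bin/sh
NUM_CTX=2048
VARIANT="gemma4-e2b-ctx2048"   # hypothetical name; pick any you like

# Write a Modelfile that inherits from the base model and caps the context.
WORKDIR=$(mktemp -d)
MODELFILE="$WORKDIR/Modelfile"
cat > "$MODELFILE" <<EOF
FROM gemma4:e2b
PARAMETER num_ctx $NUM_CTX
EOF
cat "$MODELFILE"

# Register the variant if the ollama CLI is available.
if command -v ollama >/dev/null 2>&1; then
    ollama create "$VARIANT" -f "$MODELFILE" \
        || echo "ollama create failed; is the server running and the base model pulled?"
else
    echo "ollama not found; Modelfile left at $MODELFILE"
fi
```

After the variant is created, ollama run gemma4-e2b-ctx2048 starts with the capped context. A PARAMETER num_gpu line can be added to the same file to pin the GPU layer count as well.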