DEV Community

0xkoji

Run Codex CLI with Local LLM - Gemma4 with llama.cpp on WSL2

Requirements

  • llama.cpp
  • Node.js (if you install Codex with npm)

I'm using an NVIDIA GeForce RTX 3070.

Step 1. Install Codex

First, install Codex on WSL.
If Node.js isn't installed yet, I recommend installing it with mise.


npm install -g @openai/codex@latest

# or use curl
curl -fsSL https://chatgpt.com/codex/install.sh | sh
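Before moving on, it can help to confirm that the tools are actually on your PATH. The following is a minimal sketch for a POSIX shell. `check_tool` is a helper name made up for this example, not part of Codex or npm.

```shell
# check_tool is a hypothetical helper: it reports whether a command is on PATH.
check_tool() {
  if command -v "$1" >/dev/null 2>&1; then
    echo "$1: found"
  else
    echo "$1: missing"
    return 1
  fi
}

# node and npm only matter if you installed Codex via npm.
for tool in node npm codex; do
  check_tool "$tool" || true
done
```

If `codex` shows up as missing right after installing, opening a new shell session so PATH is reloaded is usually the first thing to try.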

CLI – Codex | OpenAI Developers: Pair with Codex in your terminal (developers.openai.com)

Step 2. Create .codex folder

To use a local LLM through llama.cpp, Codex needs a config.toml. Run Codex once first so that it creates its config folder:

codex

You don't need to set anything up here. Just press Ctrl + C to exit.

Step 3. Create config.toml

After you run Codex once, your WSL home directory will have a .codex folder.
Open config.toml with whatever editor you like.

vim ~/.codex/config.toml

config.toml

[model_providers.llama]
name = "llama.cpp"
base_url = "http://localhost:8080/v1"
wire_api = "responses"
stream_idle_timeout_ms = 10000000
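If you would rather not open an editor, the same provider block can be appended from the shell. This is only a sketch. `CONFIG_DIR` and `CONFIG_FILE` are variable names chosen for this example, and the TOML content is the block shown above.

```shell
# CONFIG_DIR is a name chosen for this sketch; ~/.codex is the folder from Step 2.
CONFIG_DIR="${CONFIG_DIR:-$HOME/.codex}"
CONFIG_FILE="$CONFIG_DIR/config.toml"

mkdir -p "$CONFIG_DIR"

# Append the provider block only if it is not already there,
# so running this twice does not duplicate the section.
if ! grep -q '^\[model_providers\.llama\]' "$CONFIG_FILE" 2>/dev/null; then
  cat >> "$CONFIG_FILE" <<'EOF'
[model_providers.llama]
name = "llama.cpp"
base_url = "http://localhost:8080/v1"
wire_api = "responses"
stream_idle_timeout_ms = 10000000
EOF
fi

# Print the result so you can check it by eye.
cat "$CONFIG_FILE"
```

The port in `base_url` has to match the `--port` you pass to llama-server in Step 4.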

Step 4. Run llama-server to serve Gemma-4

If llama.cpp isn't built or installed yet, you will need to build it yourself or install it via Homebrew.


For this article, I used google--gemma-4-12B-it-Q4_K_M.gguf

Model files: baxin/quantized-models at main (huggingface.co)

My folder structure

drwxr-xr-x 30 root root 4096 Jun  8 23:00 llama.cpp
drwxr-xr-x  6 root root 4096 Jun  8 02:21 quantization
ls -l quantization/
total 276
-rw-r--r-- 1 root root      0 Jun  6 13:00 README.md
drwxr-xr-x 2 root root   4096 Jun  8 23:29 gguf <-- google--gemma-4-12B-it-Q4_K_M.gguf is here
-rw-r--r-- 1 root root     90 Jun  6 13:00 main.py
-rw-r--r-- 1 root root     26 Jun  6 13:17 mise.toml
drwxr-xr-x 3 root root   4096 Jun  8 02:33 models
-rw-r--r-- 1 root root    307 Jun  6 13:00 pyproject.toml
-rwxr-xr-x 1 root root   4060 Jun  8 02:21 quantize.sh
-rw-r--r-- 1 root root 254959 Jun  6 13:00 uv.lock

Now start llama-server.
-c: the context size must be larger than the prompt Codex sends, which was 7959 tokens in my case. The first time, I set -c 4096 and got the following error.
{"error":{"code":400,"message":"request (7959 tokens) exceeds the available context size (4096 tokens), try increasing it","type":"exceed_context_size_error","n_prompt_tokens":7959,"n_ctx":4096}}
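The two numbers that matter in that error are `n_prompt_tokens` and `n_ctx`. As a rough sketch, they can be pulled out of the JSON with plain grep and cut, so jq isn't required. `json_int` is a helper name invented for this example, and the error body is the one quoted above.

```shell
# The error body quoted above, saved in a variable for illustration.
err='{"error":{"code":400,"message":"request (7959 tokens) exceeds the available context size (4096 tokens), try increasing it","type":"exceed_context_size_error","n_prompt_tokens":7959,"n_ctx":4096}}'

# json_int is a hypothetical helper: it extracts one integer field by name.
json_int() {
  printf '%s' "$1" | grep -o "\"$2\":[0-9]*" | cut -d: -f2
}

prompt_tokens=$(json_int "$err" n_prompt_tokens)
n_ctx=$(json_int "$err" n_ctx)

# How many more tokens of context the request needed.
shortfall=$((prompt_tokens - n_ctx))
echo "prompt=$prompt_tokens ctx=$n_ctx shortfall=$shortfall"
```

The prompt alone already exceeds 4096 tokens here, and the conversation grows from there, so it makes sense to pick a `-c` value with plenty of headroom rather than one just above 7959.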

cd llama.cpp
./build/bin/llama-server -m ../quantization/gguf/gemma-4-12B-it-qat-UD-Q4_K_XL.gguf -c 100000 --port 8080
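Loading the model can take a while, so it may be worth checking that the server is answering before starting Codex. The sketch below polls llama-server's `/health` endpoint with curl. `wait_for_llama` and its default of 30 attempts are made up for this example, and the URL assumes the `--port 8080` used above.

```shell
# wait_for_llama is a hypothetical helper: it polls the server until it
# answers, or gives up after a number of attempts (one second apart).
wait_for_llama() {
  url="${1:-http://localhost:8080/health}"
  attempts="${2:-30}"
  i=0
  while [ "$i" -lt "$attempts" ]; do
    # -f makes curl fail on HTTP errors, e.g. while the model is still loading.
    if curl -fsS --max-time 2 "$url" >/dev/null 2>&1; then
      echo "llama-server is ready at $url"
      return 0
    fi
    i=$((i + 1))
    sleep 1
  done
  echo "llama-server did not respond at $url" >&2
  return 1
}

# Usage, from a second terminal:
# wait_for_llama && echo "ok to start Codex"
```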

Step 5. Run Codex

Now it's time to run Codex.
Open a new tab/session and run Codex.

codex --model ./quantization/gguf/gemma-4-12B-it-qat-UD-Q4_K_XL.gguf -c model_provider=llama --search --dangerously-bypass-approvals-and-sandbox
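That command is long to retype, so one option is to wrap it in a small shell function. This is just a sketch: `codex_local`, `CODEX_BIN`, and `MODEL` are names invented here, and the flags are the ones from the command above.

```shell
# CODEX_BIN and MODEL are names made up for this sketch.
# Setting CODEX_BIN=echo previews the command without launching Codex.
CODEX_BIN="${CODEX_BIN:-codex}"
MODEL="${MODEL:-./quantization/gguf/gemma-4-12B-it-qat-UD-Q4_K_XL.gguf}"

# Same provider override as the command above; extra flags pass through "$@".
codex_local() {
  "$CODEX_BIN" --model "$MODEL" -c model_provider=llama "$@"
}

# Usage (uncomment to launch):
# codex_local --search --dangerously-bypass-approvals-and-sandbox
```

Keeping `--dangerously-bypass-approvals-and-sandbox` out of the function body means you have to type it deliberately each time you want it.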

I sent the very simple prompt below and got a working todo app.
The app lets me add a new task, check off a task, and delete a task.

Can you create a simple todo app with reactjs and typescript

[Screenshot: todo app]
