0xkoji

Posted on Jun 11 • Edited on Jun 13

WSL2 + llama.cpp + Codex CLI = Local LLM Power

#ai #llm #gemma

Requirements

llama.cpp
Node.js (if you install Codex with npm)

I'm using an NVIDIA GeForce RTX 3070.

Step 1. Install codex

First, install Codex on WSL.
If Node.js isn’t installed yet, I recommend installing it with mise.

Related post: Migrating from asdf to mise without the headaches

npm install -g @openai/codex@latest

# or use curl
curl -fsSL https://chatgpt.com/codex/install.sh | sh
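Before moving on, it can help to confirm that the tools are actually on your PATH. The helper below is only a sketch: `missing_tools` is a hypothetical name, not part of Codex or llama.cpp, and it relies on nothing more than the standard `command -v` check.

```shell
# Print the name of every command in the argument list that is not on PATH.
# Returns 0 when nothing is missing, 1 otherwise.
# (missing_tools is a hypothetical helper for illustration.)
missing_tools() {
  rc=0
  for tool in "$@"; do
    if ! command -v "$tool" >/dev/null 2>&1; then
      echo "$tool"
      rc=1
    fi
  done
  return "$rc"
}

# Example: check the tools this article relies on.
# missing_tools node npm codex
```

If the function prints nothing, every command it checked is installed.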

CLI – Codex | OpenAI Developers (developers.openai.com)

Step 2. Create `.codex` folder

To use a local LLM with llama.cpp, Codex needs a config.toml. First, run Codex once so that it creates the .codex folder:

codex

You don't need to set anything up here. Just press Ctrl+C to exit.

Step 3. Create config.toml

Once you have run Codex, your WSL home directory will contain a .codex folder.
Create config.toml there with whatever editor you like.

vim ~/.codex/config.toml

config.toml

[model_providers.llama]
name = "llama.cpp"
base_url = "http://localhost:8080/v1"
wire_api = "responses"
stream_idle_timeout_ms = 10000000
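If you would rather not open an editor, the same file can be written from the shell. This is a minimal sketch: `write_llama_provider` is a hypothetical helper, and because it assumes you may already have a config.toml you care about, it refuses to overwrite an existing file.

```shell
# Write the provider block from this article to the given file.
# Refuses to overwrite an existing file so a real config is never clobbered.
# (write_llama_provider is a hypothetical helper for illustration.)
write_llama_provider() {
  target="$1"
  if [ -e "$target" ]; then
    echo "refusing to overwrite $target" >&2
    return 1
  fi
  mkdir -p "$(dirname "$target")"
  cat > "$target" <<'EOF'
[model_providers.llama]
name = "llama.cpp"
base_url = "http://localhost:8080/v1"
wire_api = "responses"
stream_idle_timeout_ms = 10000000
EOF
}

# Example:
# write_llama_provider "$HOME/.codex/config.toml"
```

If the file already exists, add the block by hand instead.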

Step 4. Run llama-server to serve Gemma-4

If llama.cpp isn't built or installed yet, you will need to build it yourself or install it via Homebrew.

Related post: Run Gemma-4 12B on WSL2 with llama.cpp

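For this article, I used google--gemma-4-12B-it-Q4_K_M.gguf. Note that the commands in Steps 4 and 5 point at a different file name (gemma-4-12B-it-qat-UD-Q4_K_XL.gguf), so replace that path with the GGUF file you actually downloaded.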

baxin/quantized-models at main (huggingface.co)

My folder structure

drwxr-xr-x 30 root root 4096 Jun  8 23:00 llama.cpp
drwxr-xr-x  6 root root 4096 Jun  8 02:21 quantization

ls -l quantization/
total 276
-rw-r--r-- 1 root root      0 Jun  6 13:00 README.md
drwxr-xr-x 2 root root   4096 Jun  8 23:29 gguf <-- google--gemma-4-12B-it-Q4_K_M.gguf is here
-rw-r--r-- 1 root root     90 Jun  6 13:00 main.py
-rw-r--r-- 1 root root     26 Jun  6 13:17 mise.toml
drwxr-xr-x 3 root root   4096 Jun  8 02:33 models
-rw-r--r-- 1 root root    307 Jun  6 13:00 pyproject.toml
-rwxr-xr-x 1 root root   4060 Jun  8 02:21 quantize.sh
-rw-r--r-- 1 root root 254959 Jun  6 13:00 uv.lock

Next, start llama-server.
The -c option sets the context size, which has to be larger than the prompt Codex sends (7959 tokens in my case). The first time, I set -c 4096 and got the following error:
{"error":{"code":400,"message":"request (7959 tokens) exceeds the available context size (4096 tokens), try increasing it","type":"exceed_context_size_error","n_prompt_tokens":7959,"n_ctx":4096}}
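The error above already contains the two numbers that matter: n_prompt_tokens (what Codex sent) and n_ctx (what the server was started with). As a rough sketch, the hypothetical helpers below pull the prompt size out of such an error and check whether a candidate -c value covers it. The optional reply budget is an assumption added for illustration, not a value taken from llama.cpp or Codex.

```shell
# Read a llama-server context-size error (JSON) on stdin and print n_prompt_tokens.
# (Both helpers are hypothetical names used for illustration.)
prompt_tokens_from_error() {
  grep -o '"n_prompt_tokens":[0-9]*' | head -n 1 | cut -d: -f2
}

# Succeed when n_ctx covers the prompt plus an optional reply budget.
# Usage: context_fits <n_ctx> <n_prompt_tokens> [reply_budget]
context_fits() {
  n_ctx="$1"
  n_prompt="$2"
  reply_budget="${3:-0}"
  [ "$n_ctx" -ge $((n_prompt + reply_budget)) ]
}

# Example with the numbers from the error above:
# context_fits 4096 7959   || echo "4096 is too small"
# context_fits 100000 7959 && echo "100000 is enough for this prompt"
```

The prompt grows as the session goes on, so a value with plenty of headroom is the safer choice.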

cd llama.cpp
./build/bin/llama-server -m ../quantization/gguf/gemma-4-12B-it-qat-UD-Q4_K_XL.gguf -c 100000 --port 8080
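Loading the model can take some time, and Codex will fail to connect if it starts before the server is listening. The sketch below is a generic retry loop; `retry_until` is a hypothetical helper, and the commented example assumes llama-server's `/health` endpoint on the port used above.

```shell
# Run a command repeatedly until it succeeds or the attempts run out.
# Usage: retry_until <attempts> <delay_seconds> <command> [args...]
# (retry_until is a hypothetical helper for illustration.)
retry_until() {
  attempts="$1"
  delay="$2"
  shift 2
  i=1
  while [ "$i" -le "$attempts" ]; do
    if "$@" >/dev/null 2>&1; then
      return 0
    fi
    # Only wait when another attempt is still coming.
    if [ "$i" -lt "$attempts" ]; then
      sleep "$delay"
    fi
    i=$((i + 1))
  done
  return 1
}

# Example (run in the second terminal before starting Codex):
# retry_until 60 1 curl -fsS http://localhost:8080/health && echo "llama-server is ready"
```

Because the probe command is passed in as arguments, the same loop works with any other readiness check you prefer.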

Step 5. Run Codex

Now it's time to run Codex.
Open a new tab or session and run:

codex --model ./quantization/gguf/gemma-4-12B-it-qat-UD-Q4_K_XL.gguf -c model_provider=llama --search --dangerously-bypass-approvals-and-sandbox

I sent the very simple prompt below and got a working todo app.
The app lets me add a new task, check off a task, and delete a task.

Can you create a simple todo app with reactjs and typescript
