Every query to Claude Code means sending my source code to Anthropic's servers. For proprietary codebases, that's a non-starter. With vLLM and LiteLLM, I can point Claude Code at my own hardware - keeping my code on my network while maintaining the same workflow.
The Architecture
The trick is that Claude Code expects the Anthropic Messages API, but local inference servers speak OpenAI's API format. LiteLLM bridges this gap. It accepts Anthropic-formatted requests and translates them to OpenAI format for my local vLLM instance.
The stack looks like this:
Claude Code → LiteLLM (port 4000) → vLLM (port 8000) → Local GPU
One environment variable makes it work:
export ANTHROPIC_BASE_URL="http://localhost:4000"
Claude Code now sends all requests to my local LiteLLM proxy, which forwards them to vLLM running my model of choice.
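Before launching Claude Code, it's worth hitting the proxy directly with an Anthropic-formatted request. This is a minimal sketch that assumes LiteLLM's /v1/messages passthrough is available; the model name is arbitrary and just needs to match the claude-* wildcard route configured later:

# Anthropic Messages API request against the local LiteLLM proxy.
# LiteLLM translates it to OpenAI format and forwards it to vLLM.
curl -s http://localhost:4000/v1/messages \
  -H "content-type: application/json" \
  -H "anthropic-version: 2023-06-01" \
  -H "x-api-key: not-needed" \
  -d '{
    "model": "claude-sonnet-4",
    "max_tokens": 128,
    "messages": [{"role": "user", "content": "Say hello in one sentence."}]
  }'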
The vLLM Configuration
I'm running Qwen3-Coder 30B A3B, a Mixture of Experts model with 30 billion total parameters but only 3 billion active per forward pass. The AWQ quantization brings memory requirements down enough to split it across my dual MI60 GPUs using tensor parallelism:
services:
  vllm:
    image: nalanzeyu/vllm-gfx906:v0.11.2-rocm6.3
    container_name: vllm
    devices:
      - /dev/kfd:/dev/kfd
      - /dev/dri/card1:/dev/dri/card1
      - /dev/dri/card2:/dev/dri/card2
      - /dev/dri/renderD128:/dev/dri/renderD128
      - /dev/dri/renderD129:/dev/dri/renderD129
    shm_size: 16g
    environment:
      - HIP_VISIBLE_DEVICES=0,1
    command:
      - python
      - -m
      - vllm.entrypoints.openai.api_server
      - --model
      - QuantTrio/Qwen3-Coder-30B-A3B-Instruct-AWQ
      - --tensor-parallel-size
      - "2"
      - --max-model-len
      - "65536"
      - --gpu-memory-utilization
      - "0.9"
      - --enable-auto-tool-choice
      - --tool-call-parser
      - qwen3_coder
The --enable-auto-tool-choice and --tool-call-parser qwen3_coder flags are essential for agentic use. They let the model emit tool calls that Claude Code expects.
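A quick way to confirm the parser works before involving Claude Code is to send a tool-enabled request straight to vLLM's OpenAI-compatible endpoint. The get_weather tool below is just a placeholder; with the flags above, the response should come back with a structured tool_calls entry rather than plain text:

# Ask vLLM's OpenAI-compatible endpoint to call a dummy tool.
curl -s http://localhost:8000/v1/chat/completions \
  -H "content-type: application/json" \
  -d '{
    "model": "QuantTrio/Qwen3-Coder-30B-A3B-Instruct-AWQ",
    "messages": [{"role": "user", "content": "What is the weather in Berlin?"}],
    "tools": [{
      "type": "function",
      "function": {
        "name": "get_weather",
        "description": "Get the current weather for a city",
        "parameters": {
          "type": "object",
          "properties": {"city": {"type": "string"}},
          "required": ["city"]
        }
      }
    }]
  }'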
The LiteLLM Translation Layer
LiteLLM maps Claude model names to the local vLLM endpoint. The wildcard pattern catches any model Claude Code requests:
model_list:
  - model_name: claude-*
    litellm_params:
      model: hosted_vllm/QuantTrio/Qwen3-Coder-30B-A3B-Instruct-AWQ
      api_base: http://vllm:8000/v1
      api_key: "not-needed"
    model_info:
      max_tokens: 65536
      max_input_tokens: 57344
      max_output_tokens: 8192

litellm_settings:
  drop_params: true
  request_timeout: 600
  modify_params: true

general_settings:
  disable_key_check: true
A few settings to note:
- drop_params: true silently ignores Anthropic-specific parameters that don't translate to OpenAI format
- modify_params: true allows LiteLLM to adjust parameters as needed for the target API
- disable_key_check: true skips API key validation since we're running locally
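A quick check that the wildcard does what it should: any model name starting with claude- lands on the same vLLM backend. The specific names below are arbitrary; only the prefix matters:

# Two different Claude model names, one local backend behind the wildcard.
for m in claude-sonnet-4 claude-3-5-haiku; do
  curl -s http://localhost:4000/v1/chat/completions \
    -H "content-type: application/json" \
    -d "{\"model\": \"$m\", \"messages\": [{\"role\": \"user\", \"content\": \"Reply with the word ok.\"}], \"max_tokens\": 16}"
  echo
done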
Practical Usage
With everything running, Claude Code works exactly as normal:
export ANTHROPIC_BASE_URL="http://localhost:4000"
cd my-project
claude
The experience is nearly identical to using Anthropic's API, with a few caveats:
- Token throughput: My dual MI60 setup does roughly 25-30 tokens/second with ~175ms time-to-first-token. No rate limiting, no queue times, no network latency.
- Context limits: I cap at 64K tokens. Claude Opus can handle 200K.
- Model capability: Qwen3-Coder is excellent for coding tasks, but Claude has broader knowledge and better instruction following.
The upside is obvious: zero API costs, complete data sovereignty, and the ability to run Claude Code on air-gapped networks.
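Since the only switch is a single environment variable, I don't set it globally. A small shell function (just a convenience wrapper, not part of Claude Code) keeps the hosted API as the default and routes to local inference on demand:

# claude-local: run Claude Code against the local LiteLLM proxy
# without changing the default behaviour of `claude`.
claude-local() {
  ANTHROPIC_BASE_URL="http://localhost:4000" claude "$@"
}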
Agentic File Creation
The real test of Claude Code compatibility isn't chat. It's whether the model can create files, run commands, and iterate on a codebase. The --tool-call-parser qwen3_coder flag handles the translation between Qwen's XML-style tool calls and the OpenAI tool format that LiteLLM expects.
To verify this works end-to-end, I asked Claude Code to build a complete Flask application:
export ANTHROPIC_BASE_URL="http://localhost:4000"
cd /tmp && mkdir flask-test && cd flask-test
claude --dangerously-skip-permissions -p \
"Build a Flask todo app with SQLite persistence, \
modern UI with gradients and animations, \
mobile responsive design, and full CRUD operations."
The model created a complete project structure:
flask_todo_app/
├── app.py # Flask routes and SQLite setup
├── requirements.txt # Dependencies
├── run_app.sh # Launch script
├── static/
│ ├── css/
│ │ └── style.css # Gradients, animations, hover effects
│ └── js/
│ └── script.js # Client-side interactions
└── templates/
└── index.html # Jinja2 template with responsive layout
The generated app.py includes proper SQLite initialization:
from flask import Flask, render_template, request, redirect, url_for
import sqlite3

app = Flask(__name__)

def init_db():
    conn = sqlite3.connect('todos.db')
    c = conn.cursor()
    c.execute('''CREATE TABLE IF NOT EXISTS todos
                 (id INTEGER PRIMARY KEY AUTOINCREMENT,
                  task TEXT NOT NULL,
                  completed BOOLEAN DEFAULT FALSE)''')
    conn.commit()
    conn.close()

init_db()

@app.route('/')
def index():
    conn = sqlite3.connect('todos.db')
    c = conn.cursor()
    c.execute('SELECT id, task, completed FROM todos ORDER BY id DESC')
    todos = c.fetchall()
    conn.close()
    return render_template('index.html', todos=todos)
The CSS includes gradients, glass-morphism effects, and animations:
body {
    font-family: 'Poppins', sans-serif;
    background: linear-gradient(135deg, #667eea, #764ba2);
    min-height: 100vh;
    padding: 20px;
}

.container {
    max-width: 800px;
    margin: 0 auto;
}

.header {
    text-align: center;
    padding: 40px 0;
    color: white;
    text-shadow: 0 2px 4px rgba(0,0,0,0.1);
}
After activating the venv and running the app, everything works. Add a task, toggle it complete, delete it. The database persists across restarts.
cd flask_todo_app
source venv/bin/activate
python app.py
# Visit http://localhost:5000
The full generation took about five minutes across multiple agentic iterations. Each file is a separate tool call: the model generates, Claude Code executes, the result comes back, and the model plans the next step. The 91% prefix cache hit rate shows vLLM efficiently reusing context across the multi-turn loop.
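That hit rate comes from vLLM's Prometheus metrics endpoint. The exact metric names vary between vLLM versions, but grepping /metrics for the prefix cache counters is enough to see the reuse:

# vLLM exposes Prometheus metrics on the same port as the API.
# Metric names differ across versions; look for the prefix cache counters.
curl -s http://localhost:8000/metrics | grep -i prefix_cache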
This confirms the agentic workflow functions correctly. The model reads the prompt, plans a file structure, emits tool calls to create directories and write files, and produces a functional application. All inference happens locally on the MI60s. No code leaves my network.
I have not yet tested this on a larger codebase. A small Flask app is one thing; a multi-thousand-line refactor is another. The 64K context limit will eventually become a constraint, and I expect the model to struggle with complex architectural decisions that the real Claude handles gracefully. For now, this works well for focused, scoped tasks.
Choosing a Model
For Claude Code compatibility, you want:
- Strong tool use: The model must emit structured tool calls reliably
- Code focus: Qwen3-Coder works well; DeepSeek Coder and CodeLlama variants should also be viable
- Sufficient context: I used 64K; smaller context windows may work but I haven't tested them
In my testing, Qwen3-Coder-30B-A3B handles straightforward coding tasks well. For complex refactoring or architectural decisions, the real Claude API is still the better choice.
If you don't have 64GB of VRAM, smaller models like Qwen2.5-Coder-7B or Qwen3-8B should fit on a single 16GB or 24GB card. I haven't tested these configurations, so I can't speak to their context limits or how well they handle Claude Code's agentic workflows.
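For reference, a single-GPU invocation would drop tensor parallelism, shrink the context, and swap the model. Something like the sketch below, which I have not tested; the tool-call parser choice (hermes is commonly used for Qwen2.5 models) should be verified for whatever model you pick:

# Untested single-GPU sketch for a smaller model on a 16-24GB card.
# Verify the tool-call parser for your model family before relying on it.
python -m vllm.entrypoints.openai.api_server \
  --model Qwen/Qwen2.5-Coder-7B-Instruct \
  --max-model-len 32768 \
  --gpu-memory-utilization 0.9 \
  --host 0.0.0.0 \
  --port 8000 \
  --enable-auto-tool-choice \
  --tool-call-parser hermes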
In any case, the key is adjusting your workflow: instead of broad "refactor this module" prompts, break work into tighter, more focused requests. More prompts with narrower scope play to a smaller model's strengths.
Running the Stack
The full configuration lives in a single compose file:
services:
  vllm:
    image: nalanzeyu/vllm-gfx906:v0.11.2-rocm6.3
    container_name: vllm
    restart: unless-stopped
    ports:
      - "8000:8000"
    devices:
      - /dev/kfd:/dev/kfd
      - /dev/dri/card1:/dev/dri/card1
      - /dev/dri/card2:/dev/dri/card2
      - /dev/dri/renderD128:/dev/dri/renderD128
      - /dev/dri/renderD129:/dev/dri/renderD129
    group_add:
      - "44"
      - "992"
    shm_size: 16g
    volumes:
      - /mnt/cache/huggingface:/root/.cache/huggingface:rw
    environment:
      - HIP_VISIBLE_DEVICES=0,1
    command:
      - python
      - -m
      - vllm.entrypoints.openai.api_server
      - --model
      - QuantTrio/Qwen3-Coder-30B-A3B-Instruct-AWQ
      - --tensor-parallel-size
      - "2"
      - --max-model-len
      - "65536"
      - --gpu-memory-utilization
      - "0.9"
      - --host
      - "0.0.0.0"
      - --port
      - "8000"
      - --enable-auto-tool-choice
      - --tool-call-parser
      - qwen3_coder

  litellm:
    image: litellm/litellm:v1.80.15-stable
    container_name: litellm
    restart: unless-stopped
    ports:
      - "4000:4000"
    volumes:
      - ./litellm-config.yaml:/app/config.yaml:ro
    command:
      - --config
      - /app/config.yaml
      - --port
      - "4000"
      - --host
      - "0.0.0.0"
    depends_on:
      - vllm
Start it with nerdctl (or docker):
nerdctl compose -f coder.yaml up -d
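The 30B model takes a few minutes to load, so I wait for both endpoints to respond before pointing anything at them:

# Wait for vLLM to finish loading the model, then confirm LiteLLM answers.
until curl -sf http://localhost:8000/v1/models > /dev/null; do
  echo "waiting for vllm..."
  sleep 10
done
curl -sf http://localhost:4000/v1/models > /dev/null && echo "litellm ready"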
From any machine on my network, I can point Claude Code at Feynman (my GPU workstation) and get local inference. When I'm done, tear it down with:
nerdctl compose -f coder.yaml down
The Verdict
This setup won't replace the Claude API for everyone. If you need maximum capability, Anthropic's hosted models are still the best option. But for those of us who care about where our code goes, local inference means complete data sovereignty. Proprietary code never leaves my network. Plus there's something satisfying about seeing your own GPUs light up every time you ask Claude Code a question.
