Donald Cruver

Originally published at hullabalooing.cruver.ai

Running Claude Code with Local LLMs via vLLM and LiteLLM

Every query to Claude Code means sending my source code to Anthropic's servers. For proprietary codebases, that's a non-starter. With vLLM and LiteLLM, I can point Claude Code at my own hardware - keeping my code on my network while maintaining the same workflow.

The Architecture

The trick is that Claude Code expects the Anthropic Messages API, but local inference servers speak OpenAI's API format. LiteLLM bridges this gap. It accepts Anthropic-formatted requests and translates them to OpenAI format for my local vLLM instance.

The stack looks like this:

Claude Code → LiteLLM (port 4000) → vLLM (port 8000) → Local GPU

One environment variable makes it work:

export ANTHROPIC_BASE_URL="http://localhost:4000"

Claude Code now sends all requests to my local LiteLLM proxy, which forwards them to vLLM running my model of choice.

The vLLM Configuration

I'm running Qwen3-Coder 30B A3B, a Mixture of Experts model with 30 billion total parameters but only 3 billion active per forward pass. The AWQ quantization brings memory requirements down enough to split it across my dual MI60 GPUs using tensor parallelism:

services:
  vllm:
    image: nalanzeyu/vllm-gfx906:v0.11.2-rocm6.3
    container_name: vllm
    devices:
      - /dev/kfd:/dev/kfd
      - /dev/dri/card1:/dev/dri/card1
      - /dev/dri/card2:/dev/dri/card2
      - /dev/dri/renderD128:/dev/dri/renderD128
      - /dev/dri/renderD129:/dev/dri/renderD129
    shm_size: 16g
    environment:
      - HIP_VISIBLE_DEVICES=0,1
    command:
      - python
      - -m
      - vllm.entrypoints.openai.api_server
      - --model
      - QuantTrio/Qwen3-Coder-30B-A3B-Instruct-AWQ
      - --tensor-parallel-size
      - "2"
      - --max-model-len
      - "65536"
      - --gpu-memory-utilization
      - "0.9"
      - --enable-auto-tool-choice
      - --tool-call-parser
      - qwen3_coder

The --enable-auto-tool-choice and --tool-call-parser qwen3_coder flags are essential for agentic use: they let vLLM parse the model's tool-call output into the structured tool calls that Claude Code expects.
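A quick way to check the tool-calling path in isolation is to send vLLM's OpenAI-compatible endpoint a request with a tool definition and confirm the response contains a structured tool_calls entry rather than plain text. This is an illustrative sanity check (the get_weather tool is made up), and it assumes port 8000 is published on the host as in the full compose file later in the post:

curl -s http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "QuantTrio/Qwen3-Coder-30B-A3B-Instruct-AWQ",
    "messages": [{"role": "user", "content": "What is the weather in Boston right now?"}],
    "tools": [{
      "type": "function",
      "function": {
        "name": "get_weather",
        "description": "Get the current weather for a city",
        "parameters": {
          "type": "object",
          "properties": {"city": {"type": "string"}},
          "required": ["city"]
        }
      }
    }]
  }'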

The LiteLLM Translation Layer

LiteLLM maps Claude model names to the local vLLM endpoint. The wildcard pattern catches any model Claude Code requests:

model_list:
  - model_name: claude-*
    litellm_params:
      model: hosted_vllm/QuantTrio/Qwen3-Coder-30B-A3B-Instruct-AWQ
      api_base: http://vllm:8000/v1
      api_key: "not-needed"
    model_info:
      max_tokens: 65536
      max_input_tokens: 57344
      max_output_tokens: 8192

litellm_settings:
  drop_params: true
  request_timeout: 600
  modify_params: true

general_settings:
  disable_key_check: true

A few settings to note:

  • drop_params: true silently ignores Anthropic-specific parameters that don't translate to OpenAI format
  • modify_params: true allows LiteLLM to adjust parameters as needed for the target API
  • disable_key_check: true skips API key validation since we're running locally
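With the stack up, the translation layer can be exercised directly before involving Claude Code. The request below uses Anthropic's Messages format against LiteLLM, and any model name matching the claude-* wildcard should be routed to the local vLLM backend. It's an illustrative check, with headers mirroring what an Anthropic client would send:

curl -s http://localhost:4000/v1/messages \
  -H "Content-Type: application/json" \
  -H "x-api-key: not-needed" \
  -H "anthropic-version: 2023-06-01" \
  -d '{
    "model": "claude-sonnet-test",
    "max_tokens": 128,
    "messages": [{"role": "user", "content": "Say hello from the local stack."}]
  }'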

Practical Usage

With everything running, Claude Code works exactly as normal:

export ANTHROPIC_BASE_URL="http://localhost:4000"

cd my-project
claude

The experience is nearly identical to using Anthropic's API, with a few caveats:

  • Token throughput: My dual MI60 setup does roughly 25-30 tokens/second with ~175ms time-to-first-token. No rate limiting, no queue times, no network latency.
  • Context limits: I cap at 64K tokens. Claude Opus can handle 200K.
  • Model capability: Qwen3-Coder is excellent for coding tasks, but Claude has broader knowledge and better instruction following.

The upside is obvious: zero API costs, complete data sovereignty, and the ability to run Claude Code on air-gapped networks.

Agentic File Creation

The real test of Claude Code compatibility isn't chat. It's whether the model can create files, run commands, and iterate on a codebase. The --tool-call-parser qwen3_coder flag handles the translation between Qwen's XML-style tool calls and the OpenAI tool format that LiteLLM expects.

To verify this works end-to-end, I asked Claude Code to build a complete Flask application:

export ANTHROPIC_BASE_URL="http://localhost:4000"

cd /tmp && mkdir flask-test && cd flask-test
claude --dangerously-skip-permissions -p \
  "Build a Flask todo app with SQLite persistence, \
   modern UI with gradients and animations, \
   mobile responsive design, and full CRUD operations."

The model created a complete project structure:

flask_todo_app/
├── app.py              # Flask routes and SQLite setup
├── requirements.txt    # Dependencies
├── run_app.sh          # Launch script
├── static/
│   ├── css/
│   │   └── style.css   # Gradients, animations, hover effects
│   └── js/
│       └── script.js   # Client-side interactions
└── templates/
    └── index.html      # Jinja2 template with responsive layout

The generated app.py includes proper SQLite initialization:

from flask import Flask, render_template, request, redirect, url_for
import sqlite3

app = Flask(__name__)

def init_db():
    conn = sqlite3.connect('todos.db')
    c = conn.cursor()
    c.execute('''CREATE TABLE IF NOT EXISTS todos
                 (id INTEGER PRIMARY KEY AUTOINCREMENT,
                  task TEXT NOT NULL,
                  completed BOOLEAN DEFAULT FALSE)''')
    conn.commit()
    conn.close()

init_db()

@app.route('/')
def index():
    conn = sqlite3.connect('todos.db')
    c = conn.cursor()
    c.execute('SELECT id, task, completed FROM todos ORDER BY id DESC')
    todos = c.fetchall()
    conn.close()
    return render_template('index.html', todos=todos)

The CSS includes gradients, glass-morphism effects, and animations:

body {
    font-family: 'Poppins', sans-serif;
    background: linear-gradient(135deg, #667eea, #764ba2);
    min-height: 100vh;
    padding: 20px;
}

.container {
    max-width: 800px;
    margin: 0 auto;
}

.header {
    text-align: center;
    padding: 40px 0;
    color: white;
    text-shadow: 0 2px 4px rgba(0,0,0,0.1);
}

After activating the venv and running the app, everything works. Add a task, toggle it complete, delete it. The database persists across restarts.

(Screenshot: Flask Todo App generated by Claude Code with local LLM)

cd flask_todo_app
source venv/bin/activate
python app.py
# Visit http://localhost:5000

The full generation took about five minutes across multiple agentic iterations. Each file is a separate tool call: the model generates, Claude Code executes, the result comes back, and the model plans the next step. The 91% prefix cache hit rate shows vLLM efficiently reusing context across the multi-turn loop.
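If you want to watch those cache numbers yourself, vLLM reports prefix cache statistics in its periodic log output and exposes Prometheus counters on the API server. A rough way to peek at them (exact metric names vary across vLLM versions):

curl -s http://localhost:8000/metrics | grep -i prefix_cache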

This confirms the agentic workflow functions correctly. The model reads the prompt, plans a file structure, emits tool calls to create directories and write files, and produces a functional application. All inference happens locally on the MI60s. No code leaves my network.

I have not yet tested this on a larger codebase. A small Flask app is one thing; a multi-thousand-line refactor is another. The 64K context limit will eventually become a constraint, and I expect the model to struggle with complex architectural decisions that the real Claude handles gracefully. For now, this works well for focused, scoped tasks.

Choosing a Model

For Claude Code compatibility, you want:

  • Strong tool use: The model must emit structured tool calls reliably
  • Code focus: Qwen3-Coder works well; DeepSeek Coder and CodeLlama variants should also be viable
  • Sufficient context: I used 64K; smaller context windows may work but I haven't tested them

In my testing, Qwen3-Coder-30B-A3B handles straightforward coding tasks well. For complex refactoring or architectural decisions, the real Claude API is still the better choice.

If you don't have 64GB of VRAM, smaller models like Qwen2.5-Coder-7B or Qwen3-8B should fit on a single 16GB or 24GB card. I haven't tested these configurations, so I can't speak to their context limits or how well they handle Claude Code's agentic workflows.
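For reference, a single-card setup would mostly mean swapping the model and dropping tensor parallelism. The sketch below is untested; the model, context length, and tool-call parser are assumptions to adapt, not a verified configuration:

python -m vllm.entrypoints.openai.api_server \
  --model Qwen/Qwen2.5-Coder-7B-Instruct \
  --max-model-len 32768 \
  --gpu-memory-utilization 0.9 \
  --host 0.0.0.0 \
  --port 8000 \
  --enable-auto-tool-choice \
  --tool-call-parser hermes  # parser differs by model family; check the vLLM docs for your model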

In any case, the key is adjusting your workflow: instead of broad "refactor this module" prompts, break work into tighter, more focused requests. More prompts of narrower scope play to a smaller model's strengths.
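In practice that looks like several small, targeted invocations rather than one sweeping request. These prompts are hypothetical examples, not taken from the test run above:

claude -p "Add a /health endpoint to app.py that returns JSON status"
claude -p "Write pytest tests for the todo CRUD routes"
claude -p "Extract the SQLite helpers in app.py into a separate db.py module"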

Running the Stack

The full configuration lives in a single compose file:

services:
  vllm:
    image: nalanzeyu/vllm-gfx906:v0.11.2-rocm6.3
    container_name: vllm
    restart: unless-stopped
    ports:
      - "8000:8000"
    devices:
      - /dev/kfd:/dev/kfd
      - /dev/dri/card1:/dev/dri/card1
      - /dev/dri/card2:/dev/dri/card2
      - /dev/dri/renderD128:/dev/dri/renderD128
      - /dev/dri/renderD129:/dev/dri/renderD129
    group_add:
      - "44"
      - "992"
    shm_size: 16g
    volumes:
      - /mnt/cache/huggingface:/root/.cache/huggingface:rw
    environment:
      - HIP_VISIBLE_DEVICES=0,1
    command:
      - python
      - -m
      - vllm.entrypoints.openai.api_server
      - --model
      - QuantTrio/Qwen3-Coder-30B-A3B-Instruct-AWQ
      - --tensor-parallel-size
      - "2"
      - --max-model-len
      - "65536"
      - --gpu-memory-utilization
      - "0.9"
      - --host
      - "0.0.0.0"
      - --port
      - "8000"
      - --enable-auto-tool-choice
      - --tool-call-parser
      - qwen3_coder

  litellm:
    image: litellm/litellm:v1.80.15-stable
    container_name: litellm
    restart: unless-stopped
    ports:
      - "4000:4000"
    volumes:
      - ./litellm-config.yaml:/app/config.yaml:ro
    command:
      - --config
      - /app/config.yaml
      - --port
      - "4000"
      - --host
      - "0.0.0.0"
    depends_on:
      - vllm

Start it with nerdctl (or docker):

nerdctl compose -f coder.yaml up -d

From any machine on my network, I can point Claude Code at Feynman (my GPU workstation) and get local inference; the only client-side change is the base URL, sketched below. When I'm done, tear it down with:

nerdctl compose -f coder.yaml down
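Pointing a remote client at the workstation is just the base URL again. The hostname here is illustrative; use whatever name or IP reaches the GPU box on your network:

export ANTHROPIC_BASE_URL="http://feynman:4000"
claude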

The Verdict

This setup won't replace the Claude API for everyone. If you need maximum capability, Anthropic's hosted models are still the best option. But for those of us who care about where our code goes, local inference means complete data sovereignty. Proprietary code never leaves my network. Plus there's something satisfying about seeing your own GPUs light up every time you ask Claude Code a question.
