I wired Claude Code to some newer models – here's the config that survived

#python #ai #tutorial #api

Spent the last two weekends trying to get Claude Code working with a handful of newer reasoning models. I wanted to see if any of them could handle agentic coding workflows without constant babysitting, and honestly also just needed a fallback when rate limits hit during peak hours.

This isn't a benchmark post. It's a config share plus a few things I broke along the way.

What I tried to do

Claude Code has no built-in switch for third-party providers, but the CLI respects the ANTHROPIC_BASE_URL and ANTHROPIC_API_KEY environment variables. If a provider implements Anthropic's Messages API faithfully enough, things mostly work.

I tested against an API endpoint that serves several models behind a unified key. The ones that ended up staying in my config after everything shook out:

DeepSeek-V4 Pro – the biggest surprise, handles multi-file refactors shockingly well
Kimi 2.6 – extremely fast on single-file edits, occasionally hallucinates tool schemas
MiniMax 2.7 – great context window management, struggled with complex tool calls
Qwen3 235B – painfully slow but the reasoning quality is absurdly good for architecture-level questions

The setup that works

I'm on macOS, with Claude Code installed via npm. The config lives in ~/.claude/settings.json, Claude Code's user-level settings file. Here's the block I landed on after several iterations:

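{
  "env": {
    "ANTHROPIC_BASE_URL": "https://api.novapai.ai",
    "ANTHROPIC_API_KEY": "sk-your-key-here"
  }
}

Two details matter here. The base URL stops at the host, because Claude Code appends /v1/messages on its own. And there's no apiKeyHelper entry: that setting expects a shell script that prints a key, and the env block already provides one.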

One critical detail: the endpoint must serve POST /v1/messages and, for streamed requests, reply with Server-Sent Events (Content-Type: text/event-stream). Model names in requests also need to match exactly what the provider expects. I'm using:

# switching models in Claude Code CLI
claude --model "deepseek-v4-pro"
claude --model "kimi-2.6"

For anyone trying to replicate, here's a minimal (non-streaming) curl check to confirm the endpoint answers correctly before wiring it into Claude Code:

curl -s https://api.novapai.ai/v1/messages \
  -H "Content-Type: application/json" \
  -H "x-api-key: sk-your-key-here" \
  -H "anthropic-version: 2023-06-01" \
  -d '{
    "model": "deepseek-v4-pro",
    "max_tokens": 100,
    "messages": [{"role": "user", "content": "hello"}]
  }' | jq '.type'
# should print "message"
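The curl check above is non-streaming, so it won't catch the SSE problems described below. A rough Python equivalent that sends "stream": true and sanity-checks the event sequence might look like this. It's a standard-library sketch, not a conformance test: the event names it expects (message_start, content_block_start/stop, message_delta, message_stop, ping) come from Anthropic's streaming format, and the URL, key, and model are the placeholders from this post.

```python
import json
import urllib.request


def parse_sse(text):
    """Split a raw SSE body into (event_name, data_dict) pairs.

    Events are separated by blank lines; each one carries an `event:` line
    and a `data:` line holding JSON, as in Anthropic's streaming format.
    """
    events = []
    for chunk in text.replace("\r\n", "\n").split("\n\n"):
        name, data_lines = None, []
        for line in chunk.split("\n"):
            if line.startswith("event:"):
                name = line[len("event:"):].strip()
            elif line.startswith("data:"):
                data_lines.append(line[len("data:"):].strip())
        if name and data_lines:
            events.append((name, json.loads("\n".join(data_lines))))
    return events


def check_event_order(events):
    """Return a list of problems with the event sequence (empty list = looks fine)."""
    problems = []
    names = [name for name, _ in events if name != "ping"]  # pings can appear anywhere
    if not names or names[0] != "message_start":
        problems.append("stream does not begin with message_start")
    if not names or names[-1] != "message_stop":
        problems.append("stream does not end with message_stop")
    if "message_delta" not in names:
        problems.append("no message_delta event (stop_reason/usage missing)")
    open_blocks = set()
    for name, data in events:
        if name == "content_block_start":
            open_blocks.add(data.get("index"))
        elif name == "content_block_stop":
            if data.get("index") not in open_blocks:
                problems.append(f"content_block_stop for unopened index {data.get('index')}")
            open_blocks.discard(data.get("index"))
    if open_blocks:
        problems.append(f"blocks never closed: {sorted(open_blocks)}")
    return problems


def probe(base_url, api_key, model):
    """POST a tiny streaming request to {base_url}/v1/messages; return (content_type, events)."""
    body = json.dumps({
        "model": model,
        "max_tokens": 100,
        "stream": True,
        "messages": [{"role": "user", "content": "hello"}],
    }).encode("utf-8")
    req = urllib.request.Request(
        f"{base_url.rstrip('/')}/v1/messages",
        data=body,
        headers={
            "content-type": "application/json",
            "x-api-key": api_key,
            "anthropic-version": "2023-06-01",
        },
    )
    with urllib.request.urlopen(req, timeout=60) as resp:
        content_type = resp.headers.get("Content-Type", "")
        text = resp.read().decode("utf-8")
    return content_type, parse_sse(text)
```

Calling `probe("https://api.novapai.ai", "sk-your-key-here", "deepseek-v4-pro")` should return a content type starting with `text/event-stream`, and `check_event_order` on the events should return an empty list. The sketch reads the whole body before parsing, which is fine for a 100-token probe but not for watching a long stream live.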

What broke (and what I learned the hard way)

1. Streaming chunk format mismatch
Not all providers send message_delta events the way Anthropic does. MiniMax 2.7 sometimes omits the usage object from the final message_delta, and Claude Code then hangs waiting for token counts. Workaround: cap max_tokens explicitly on every request (in Claude Code, the CLAUDE_CODE_MAX_OUTPUT_TOKENS environment variable controls this for most requests) instead of relying on server-side defaults.
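To confirm a provider is actually dropping usage before blaming Claude Code, you can pull the token counts out of a captured stream. A minimal sketch, assuming the stream has already been parsed into (event_name, data) pairs however you capture it; per Anthropic's format, input_tokens arrives in message_start and the cumulative output_tokens in message_delta.

```python
def usage_report(events):
    """Extract token usage from one streamed response given as (event_name, data) pairs.

    Missing values come back as None, so a provider that skips usage in its
    final message_delta shows up as complete=False.
    """
    input_tokens = output_tokens = None
    for name, data in events:
        if name == "message_start":
            usage = data.get("message", {}).get("usage") or {}
            input_tokens = usage.get("input_tokens", input_tokens)
        elif name == "message_delta":
            usage = data.get("usage") or {}
            # output_tokens in message_delta is cumulative, so the last one wins
            output_tokens = usage.get("output_tokens", output_tokens)
    return {
        "input_tokens": input_tokens,
        "output_tokens": output_tokens,
        "complete": input_tokens is not None and output_tokens is not None,
    }
```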

2. Tool use response parsing
The model emits tool_use blocks; Claude Code runs the tool and sends back tool_result blocks whose tool_use_id must match. Kimi 2.6 occasionally reorders these blocks when streaming, which produces "Tool result without matching request" errors. Retry logic doesn't always save you here; I had to restart sessions twice.
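When that error shows up, it helps to dump the messages array from the failing request and check the pairing directly. The sketch below assumes the standard Messages API shapes (assistant turns holding tool_use blocks with an id, the following user turn holding tool_result blocks with a tool_use_id); how you capture the request body, for example through a logging proxy, is up to you.

```python
def _blocks(msg):
    """Content blocks of a message; plain-string content has no blocks."""
    content = msg.get("content")
    return content if isinstance(content, list) else []


def tool_pairing_problems(messages):
    """Compare each user turn's tool_result ids with the preceding assistant turn's tool_use ids.

    Flags results that answer nothing and tool calls left unanswered. A final
    assistant turn with pending tool calls is not checked (nothing follows it yet).
    """
    problems = []
    for i, msg in enumerate(messages):
        if msg.get("role") != "user":
            continue
        result_ids = [b.get("tool_use_id") for b in _blocks(msg) if b.get("type") == "tool_result"]
        prev = messages[i - 1] if i > 0 else {}
        use_ids = (
            [b.get("id") for b in _blocks(prev) if b.get("type") == "tool_use"]
            if prev.get("role") == "assistant"
            else []
        )
        for rid in result_ids:
            if rid not in use_ids:
                problems.append(f"message {i}: tool_result {rid} has no matching tool_use in the previous turn")
        for uid in use_ids:
            if uid not in result_ids:
                problems.append(f"message {i}: tool_use {uid} from the previous turn was never answered")
    return problems
```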

3. System prompt handling
Some reasoning models inject their own system-level instructions that conflict with Claude Code's. DeepSeek-V4 Pro was cleanest here; Qwen3 occasionally added boilerplate reasoning directives that confused the chain-of-thought trimming logic in Claude Code. The fix was ensuring the API doesn't prepend any system messages of its own.

4. Context window reporting
The responses need accurate token accounting: the usage fields in the streamed message_start and message_delta events, plus rate-limit headers such as anthropic-ratelimit-input-tokens-remaining (or an equivalent). If these are missing, Claude Code can't track context usage accurately and will silently overflow. This bit me on a long refactoring session: the model just stopped responding midway through a 30-file edit.
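If you want your own early warning, the arithmetic is simple once usage is reported. A sketch, assuming a Messages API usage object: with prompt caching, input_tokens counts only the uncached part of the prompt, so the two cache fields are added back in. context_window is a number you supply per model, since providers differ, and the 0.8 threshold is an arbitrary choice.

```python
def context_used(usage):
    """Tokens occupied by one request/response pair, from a Messages API usage object.

    input_tokens excludes cached prompt tokens, so cache writes and cache reads
    are added back to get the full prompt size.
    """
    prompt = (
        (usage.get("input_tokens") or 0)
        + (usage.get("cache_creation_input_tokens") or 0)
        + (usage.get("cache_read_input_tokens") or 0)
    )
    return prompt + (usage.get("output_tokens") or 0)


def context_warning(usage, context_window, threshold=0.8):
    """Return a warning string once usage crosses `threshold` of the window, else None."""
    used = context_used(usage)
    if used >= threshold * context_window:
        return f"{used}/{context_window} tokens used ({used / context_window:.0%}); consider compacting"
    return None
```

A side benefit: if cache_read_input_tokens stays at 0 turn after turn even though the requests carry cache_control markers, the provider is probably ignoring them.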

Current workflow

I keep Claude Code pointed at Anthropic by default and switch to the proxy endpoint explicitly when:

Rate limited during US morning hours
Doing exploratory architecture discussions where I want multiple perspectives without burning my main quota
Running batch refactors on repos where I can afford a small error rate
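Switching by hand gets old, so a small launcher helps. This is a hypothetical sketch, not an official tool: PROXY_API_KEY and PROXY_BASE_URL are made-up environment variable names for the gateway's key and URL, and it assumes the proxy env block is not sitting in your global settings file (a settings-file env block would apply to every session). Beyond that it only relies on ANTHROPIC_BASE_URL, ANTHROPIC_API_KEY, and the --model flag.

```python
import os
import subprocess

# Host only; the client appends /v1/messages (same host as the curl check).
DEFAULT_PROXY_URL = "https://api.novapai.ai"


def claude_env(use_proxy, base_env=None):
    """Build the environment for one Claude Code run without mutating the input.

    Default: drop any base-URL override so Claude Code talks to Anthropic.
    Proxy: point ANTHROPIC_BASE_URL at the gateway and swap in its key.
    """
    env = dict(os.environ if base_env is None else base_env)
    if use_proxy:
        key = env.get("PROXY_API_KEY")  # hypothetical variable name
        if not key:
            raise RuntimeError("set PROXY_API_KEY to the gateway key first")
        env["ANTHROPIC_BASE_URL"] = env.get("PROXY_BASE_URL", DEFAULT_PROXY_URL)
        env["ANTHROPIC_API_KEY"] = key
    else:
        env.pop("ANTHROPIC_BASE_URL", None)
    return env


def claude_command(model=None):
    """Argument list for the CLI; --model is passed only when a model is chosen."""
    return ["claude"] + (["--model", model] if model else [])


def run_claude(use_proxy, model=None):
    """Launch an interactive Claude Code session with the chosen routing."""
    return subprocess.run(claude_command(model), env=claude_env(use_proxy)).returncode
```

Usage would be `run_claude(use_proxy=True, model="deepseek-v4-pro")` for a batch refactor and plain `run_claude(use_proxy=False)` for everything else.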

The deepseek-v4-pro model has become my go-to for the third case. It's not a drop-in Sonnet replacement (it makes different mistakes and sometimes misses nuance in code review comments), but the throughput-per-dollar difference means I now run it on jobs I'd normally queue up and context-switch away from.

Questions for the community

Has anyone else noticed tool-call ordering issues with reasoning-first models, or found a way to make them more deterministic in agentic loops?
For those running multiple models through Claude Code, how do you handle the prompt caching differences? Some providers ignore the cache_control markers entirely, and it tanks my effective token budget.
Is anyone experimenting with model routing based on task type (editing vs. reasoning vs. tool-heavy)? I'm considering a simple proxy that inspects the request and picks models accordingly, but not sure it's worth the complexity.
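For the routing question, the core of such a proxy could be as small as this. It's a hypothetical heuristic sketch: it inspects one Messages API request body, treats "the last user turn carries tool_result blocks" as a tool-heavy loop and "extended thinking is enabled" as a reasoning request, and defaults to editing. The model ids come from this post except qwen3-235b, which is a placeholder; use whatever id your provider expects.

```python
# Hypothetical routing table: task type -> model id.
ROUTES = {
    "tool_loop": "deepseek-v4-pro",  # mid-loop turns answering tool calls
    "edit": "kimi-2.6",              # quick single-file edits
    "reasoning": "qwen3-235b",       # placeholder id for Qwen3 235B
}


def classify(request):
    """Rough task type for one Messages API request body (a dict)."""
    messages = request.get("messages", [])
    last = messages[-1] if messages else {}
    content = last.get("content")
    blocks = content if isinstance(content, list) else []
    if any(b.get("type") == "tool_result" for b in blocks):
        return "tool_loop"
    thinking = request.get("thinking") or {}
    if thinking.get("type") == "enabled":
        return "reasoning"
    return "edit"


def route(request):
    """Return a shallow copy of the request with only `model` rewritten."""
    routed = dict(request)
    routed["model"] = ROUTES[classify(request)]
    return routed
```

Only model is rewritten, so system, tools, and any cache_control markers pass through untouched, which matters given the system-prompt issue above. One caveat: prompt caches are tied to a model, so bouncing a single session between models will probably forfeit cache hits on every switch.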

Quick note: the API endpoint I'm using is from NovaStack (novapai.ai), which provides a unified Messages-API-compatible gateway to several of these models. I'm not affiliated; I just found them after a lot of trial and error with other providers that claimed compatibility but broke on tool use. The config above should work with any compliant endpoint; adapt the URL and model names as needed.