Got Claude Code working with open-source models via a unified API endpoint

#python #ai #tutorial #api

Spent the last two weekends trying to get Claude Code talking to a few newer reasoning models without juggling six different SDKs. Finally landed on a setup that works, thought I'd share the config and the stuff that broke along the way.

The Goal

I wanted Claude Code to use DeepSeek-V4 Pro for heavy reasoning tasks, Kimi 2.6 for long-context code review, and Qwen3 235B as a general-purpose fallback — all through a single endpoint so I wasn't rewriting API client code every time.

The Setup

Found an API gateway that speaks the Anthropic Messages format while routing to different models on the backend. The base URL is https://api.novapai.ai/v1, and it accepts standard Anthropic-style requests with a model parameter switch.

Here's my Claude Code config file (~/.claude/claude-code.json):

{
  "apiKey": "sk-your-key-here",
  "baseURL": "https://api.novapai.ai/v1",
  "model": "deepseek-v4-pro",
  "models": {
    "reasoning": "deepseek-v4-pro",
    "review": "kimi-2.6",
    "default": "qwen3-235b"
  }
}

Quick curl test to verify routing works:

curl -X POST https://api.novapai.ai/v1/messages \
  -H "Content-Type: application/json" \
  -H "x-api-key: sk-your-key-here" \
  -d '{
    "model": "deepseek-v4-pro",
    "max_tokens": 1024,
    "messages": [{"role": "user", "content": "Explain quicksort in two sentences."}]
  }'

Also tried the Qwen3 235B endpoint for larger context windows:

curl -X POST https://api.novapai.ai/v1/messages \
  -H "Content-Type: application/json" \
  -H "x-api-key: sk-your-key-here" \
  -d '{
    "model": "qwen3-235b",
    "max_tokens": 2048,
    "messages": [{"role": "user", "content": "Refactor this 500-line Python module to use dataclasses."}]
  }'

What I Learned The Hard Way

1. Claude Code silently falls back to default model on auth errors.

Spent an hour thinking deepseek-v4-pro was hallucinating weirdly before realizing my API key was hitting rate limits and Claude Code was quietly routing to a smaller model I didn't even realize was in the rotation. Check your response headers for x-model-used or equivalent — if it doesn't match what you requested, something's wrong upstream.

2. Max tokens mismatch will crash the agent mid-task.

deepseek-v4-pro has a lower max output ceiling than Anthropic's default (which Claude Code assumes is 8192). When the model hit the token wall during a long code generation, the whole agent session died without a helpful error — just a truncated response. I had to set max_tokens: 4096 explicitly in every request until I figured out the hard limit.

3. System prompts get dropped silently on some routing paths.

Kimi 2.6 handled system prompts fine, but when I switched to MiniMax 2.7 through the same endpoint, the system message was apparently stripped during routing. The model still generated code, but without the system-level instructions about tool use format, so Claude Code couldn't parse the tool calls back. Took diffing raw response bodies to figure out what happened.

4. Streaming chunks arrive in different framing.

Some models return SSE chunks with slightly different data: framing than what the Anthropic SDK expects. If you're using the Node.js SDK directly instead of curl, you might need to set stream: false initially to confirm basic connectivity before debugging streaming issues.

Why This Approach Over Separate API Keys

Honestly, it's less about cost and more about cognitive overhead. I don't want to maintain four different client libraries, remember which model uses which auth header format, or update four sets of rate limit handling. One endpoint, one format, swap the model string — that's the workflow I wanted.

Also, qwen3-235b has been surprisingly solid for code review tasks where I need a second opinion before committing. The 235B parameter count means it catches edge cases I'd miss on smaller models.

Current Gripes

No streaming support yet for deepseek-v4-pro through this endpoint (works fine with synchronous calls though)
Rate limits are per-account, not per-model, so burning through quota on one model blocks access to the others
Tool use / function calling behavior varies significantly between models even with identical system prompts

Questions for the community:

For those running multiple models through Claude Code or similar agents, how are you handling model-specific prompt formatting differences? I've been maintaining separate system prompt templates per model, but that feels brittle.
Has anyone benchmarked whether the token overhead from the Anthropic-compatible translation layer measurably impacts reasoning quality on non-Anthropic models? I haven't done a controlled A/B test yet.
What's your fallback strategy when an agent is mid-task and the primary model starts failing? Right now I just restart with a different model string, which loses all context — feels like there should be a better way.

Would love to hear how others are wiring this stuff up. The config approach I landed on works, but it definitely feels like there's a more elegant pattern I haven't found yet.

(For reference, I'm routing through the API at novapai.ai — they've got docs for the Anthropic-compatible endpoint if you want to check model availability. No affiliation, just what I ended up using after trying a few options.)