<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: NovaStack</title>
    <description>The latest articles on DEV Community by NovaStack (@sbt112321321).</description>
    <link>https://dev.to/sbt112321321</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3924031%2F3c1c2b03-c57d-449e-94d1-22544446efd2.png</url>
      <title>DEV Community: NovaStack</title>
      <link>https://dev.to/sbt112321321</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/sbt112321321"/>
    <language>en</language>
    <item>
      <title>Building a Multi-Model LLM Router Without Losing Your Mind</title>
      <dc:creator>NovaStack</dc:creator>
      <pubDate>Tue, 26 May 2026 01:24:16 +0000</pubDate>
      <link>https://dev.to/sbt112321321/building-a-multi-model-llm-router-without-losing-your-mind-3ab6</link>
      <guid>https://dev.to/sbt112321321/building-a-multi-model-llm-router-without-losing-your-mind-3ab6</guid>
      <description>&lt;p&gt;If you're only using one LLM provider, you can stop reading. But if you've ever tried to compare outputs across DeepSeek, Qwen, Kimi, and MiniMax in the same application, you know the pain.&lt;/p&gt;

&lt;h2&gt;The Problem&lt;/h2&gt;

&lt;p&gt;Every Chinese LLM provider (and Western ones too) ships a slightly different API contract:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Different auth header formats&lt;/li&gt;
&lt;li&gt;Different streaming chunk schemas&lt;/li&gt;
&lt;li&gt;Different error response shapes&lt;/li&gt;
&lt;li&gt;Different rate limiting behavior&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;You end up writing more glue code than business logic.&lt;/p&gt;

&lt;h2&gt;What I Actually Wanted&lt;/h2&gt;

&lt;p&gt;A single OpenAI-compatible endpoint: pass a &lt;code&gt;model&lt;/code&gt; field like &lt;code&gt;deepseek-v4-pro&lt;/code&gt; or &lt;code&gt;qwen3-235b&lt;/code&gt;, and let something else handle the routing, auth, and format translation.&lt;/p&gt;

&lt;h2&gt;What I Found&lt;/h2&gt;

&lt;p&gt;After trying a couple of alternatives (LiteLLM, OpenRouter), I landed on NovaStack (novapai.ai). Here's the setup:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import openai

# Point the standard OpenAI client at the gateway instead of the default API host.
client = openai.OpenAI(
    base_url="https://api.novapai.ai/v1",
    api_key="your-novastack-key",
)

response = client.chat.completions.create(
    model="deepseek-v4-pro",
    messages=[{"role": "user", "content": "Explain monads in Python"}],
)
print(response.choices[0].message.content)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;That's it. The same code works for &lt;code&gt;kimi-2.6&lt;/code&gt;, &lt;code&gt;minimax-2.7&lt;/code&gt;, and &lt;code&gt;qwen3-235b&lt;/code&gt;: just change the model string.&lt;/p&gt;

&lt;h2&gt;What About Anthropic Format?&lt;/h2&gt;

&lt;p&gt;If your stack already uses the Anthropic SDK format, NovaStack handles that too. Same endpoint, both schemas accepted. This was the killer feature for me since half my codebase was already structured around Claude's message format.&lt;/p&gt;

&lt;h2&gt;Latency &amp;amp; Pricing&lt;/h2&gt;

&lt;p&gt;I ran a quick benchmark (100 requests, 500-token prompts):&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Model&lt;/th&gt;
&lt;th&gt;Avg latency&lt;/th&gt;
&lt;th&gt;Overhead vs. direct API&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;DeepSeek-V4 Pro&lt;/td&gt;
&lt;td&gt;~1.2s&lt;/td&gt;
&lt;td&gt;+80ms&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Qwen3 235B&lt;/td&gt;
&lt;td&gt;~1.8s&lt;/td&gt;
&lt;td&gt;+120ms&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Kimi 2.6&lt;/td&gt;
&lt;td&gt;~1.1s&lt;/td&gt;
&lt;td&gt;+60ms&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The overhead (60 to 120 ms in this run) is minimal and worth the DX improvement.&lt;/p&gt;

&lt;p&gt;Pricing-wise, it's competitive with direct access. New accounts get $50 in free credits, which lasted me through a full week of prototyping.&lt;/p&gt;

&lt;h2&gt;When You'd Use This&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Multi-model evaluation / A-B testing&lt;/li&gt;
&lt;li&gt;Fallback chains (if model A fails, try model B)&lt;/li&gt;
&lt;li&gt;Cost optimization (route simple tasks to cheaper models)&lt;/li&gt;
&lt;li&gt;Avoiding vendor lock-in&lt;/li&gt;
&lt;/ul&gt;
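
&lt;p&gt;The fallback-chain case could be sketched roughly like this. It is a minimal illustration, not NovaStack's implementation: &lt;code&gt;call_with_fallback&lt;/code&gt; and the stub &lt;code&gt;flaky_call&lt;/code&gt; are hypothetical names, and a real version would wrap the chat-completions request from the setup section.&lt;/p&gt;

```python
def call_with_fallback(models, call):
    """Try each model in order; return (model, reply) from the first that succeeds.

    `call` is any function that takes a model name and returns text. It is
    expected to raise an exception on failure (rate limit, timeout, 5xx).
    """
    errors = {}
    for model in models:
        try:
            return model, call(model)
        except Exception as exc:  # sketch only: real code should catch narrower errors
            errors[model] = exc
    raise RuntimeError(f"all models failed: {errors}")


def flaky_call(model):
    """Hypothetical stand-in for a real API call: the first model always fails."""
    if model == "deepseek-v4-pro":
        raise TimeoutError("simulated outage")
    return f"reply from {model}"


used_model, reply = call_with_fallback(["deepseek-v4-pro", "kimi-2.6"], flaky_call)
print(used_model, "|", reply)
```

&lt;p&gt;Passing the request function in as an argument keeps the retry order separate from the HTTP details, so the same helper can be exercised with a stub.&lt;/p&gt;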

&lt;h2&gt;When You Wouldn't&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;If you only ever use one model&lt;/li&gt;
&lt;li&gt;If you need fine-tuning or custom model hosting&lt;/li&gt;
&lt;li&gt;If you need guaranteed &amp;lt;100ms latency&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Worth a look if you're in the multi-model world: novapai.ai&lt;/p&gt;

</description>
      <category>ai</category>
      <category>webdev</category>
      <category>llm</category>
      <category>devops</category>
    </item>
    <item>
      <title>I A/B tested 4 LLMs on the same 500 queries. The results surprised me.</title>
      <dc:creator>NovaStack</dc:creator>
      <pubDate>Mon, 25 May 2026 00:47:22 +0000</pubDate>
      <link>https://dev.to/sbt112321321/i-ab-tested-4-llms-on-the-same-500-queries-the-results-surprised-me-4i8h</link>
      <guid>https://dev.to/sbt112321321/i-ab-tested-4-llms-on-the-same-500-queries-the-results-surprised-me-4i8h</guid>
      <description>&lt;p&gt;I see a lot of claims about which model is "best." Best at what? For whom? At what cost?&lt;/p&gt;

&lt;p&gt;I got tired of guessing. So I ran my own comparison.&lt;/p&gt;

&lt;h2&gt;The setup&lt;/h2&gt;

&lt;p&gt;I took 500 real queries from my production logs – a mix of:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Code generation (120 queries)&lt;/li&gt;
&lt;li&gt;Document summarization (150 queries)&lt;/li&gt;
&lt;li&gt;Question answering (180 queries)&lt;/li&gt;
&lt;li&gt;Creative writing (50 queries)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;I ran each query through four models using the same prompt, same temperature (0.7), same everything.&lt;/p&gt;

&lt;p&gt;The models:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;DeepSeek-V4 Pro&lt;/li&gt;
&lt;li&gt;Kimi 2.6&lt;/li&gt;
&lt;li&gt;MiniMax 2.7&lt;/li&gt;
&lt;li&gt;Qwen3 235B&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;I used NovaStack as the gateway – one API endpoint that let me switch models by changing one parameter. Saved me from writing integration code for four different providers.&lt;/p&gt;

&lt;h2&gt;What I measured&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Response time (end-to-end latency)&lt;/li&gt;
&lt;li&gt;Cost per query&lt;/li&gt;
&lt;li&gt;Accuracy (human-rated on a 1-5 scale, two reviewers)&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;The surprising results&lt;/h2&gt;

&lt;p&gt;Fastest model: DeepSeek-V4 Pro (avg 1.8s). Qwen3 was slowest (avg 4.2s) – not surprising given its size.&lt;/p&gt;

&lt;p&gt;Cheapest model: MiniMax 2.7 (40% cheaper than DeepSeek on similar tasks).&lt;/p&gt;

&lt;p&gt;Most accurate overall: Qwen3 235B (4.3/5). But here's the catch – it wasn't best at everything.&lt;/p&gt;

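&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Task type&lt;/th&gt;
&lt;th&gt;Best model (rating)&lt;/th&gt;
&lt;th&gt;Runner-up (rating)&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Code generation&lt;/td&gt;
&lt;td&gt;DeepSeek-V4 Pro (4.6)&lt;/td&gt;
&lt;td&gt;Qwen3 (4.2)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Long doc summarization&lt;/td&gt;
&lt;td&gt;Kimi 2.6 (4.7)&lt;/td&gt;
&lt;td&gt;Qwen3 (4.1)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;QA (short context)&lt;/td&gt;
&lt;td&gt;DeepSeek (4.4)&lt;/td&gt;
&lt;td&gt;MiniMax (4.2)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Creative writing&lt;/td&gt;
&lt;td&gt;Qwen3 (4.5)&lt;/td&gt;
&lt;td&gt;Kimi (4.0)&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The biggest surprise: no single model won more than half of the task categories. The "best" model depends entirely on what you're doing.&lt;/p&gt;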
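
&lt;p&gt;One way to tally category wins from ratings like these is sketched below. The scores are copied from the table above (only the top two per task were reported, and model names are normalized to their full form); the &lt;code&gt;scores&lt;/code&gt; layout and the &lt;code&gt;category_winners&lt;/code&gt; helper are illustrative, not the script used for the benchmark.&lt;/p&gt;

```python
from collections import Counter

# Mean ratings copied from the table above; short names expanded for consistency.
scores = {
    "Code generation": {"DeepSeek-V4 Pro": 4.6, "Qwen3 235B": 4.2},
    "Long doc summarization": {"Kimi 2.6": 4.7, "Qwen3 235B": 4.1},
    "QA (short context)": {"DeepSeek-V4 Pro": 4.4, "MiniMax 2.7": 4.2},
    "Creative writing": {"Qwen3 235B": 4.5, "Kimi 2.6": 4.0},
}


def category_winners(scores):
    """Return {task: best model}, choosing the highest mean rating per task."""
    return {task: max(by_model, key=by_model.get) for task, by_model in scores.items()}


winners = category_winners(scores)
wins = Counter(winners.values())
for model, count in wins.most_common():
    print(f"{model}: {count} of {len(scores)} categories")
```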

&lt;h2&gt;What this means for real-world use&lt;/h2&gt;

&lt;p&gt;If you're building a production system, picking one model leaves performance on the table.&lt;/p&gt;

&lt;p&gt;I now route based on task type:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Code task → DeepSeek-V4 Pro
Long document → Kimi 2.6
Image-related → MiniMax 2.7
Complex reasoning → Qwen3 235B
Everything else → DeepSeek (fast + cheap)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;h2&gt;What broke during testing&lt;/h2&gt;

&lt;p&gt;Rate limits were inconsistent – Some models throttled me after 50 requests/minute, others after 200. I had to add per-model rate limiters.&lt;/p&gt;
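
&lt;p&gt;A per-model limiter along these lines is one option. This is a sliding-window sketch under stated assumptions: the class name is hypothetical, and which model gets the 50 or the 200 requests/minute limit is made up for the demo, since the post does not say.&lt;/p&gt;

```python
import time
from collections import defaultdict, deque


class PerModelRateLimiter:
    """Sliding-window limiter: at most `limit` requests per `period` seconds, per model."""

    def __init__(self, limits, period=60.0, clock=time.monotonic):
        self.limits = limits            # e.g. {"model-a": 50, "model-b": 200}
        self.period = period
        self.clock = clock              # injectable so it can be tested without sleeping
        self.sent = defaultdict(deque)  # model name to timestamps of recent requests

    def try_acquire(self, model):
        """Record a request and return True if `model` is under its limit, else False."""
        now = self.clock()
        window = self.sent[model]
        # Drop timestamps that have aged out of the window.
        while window and now - window[0] >= self.period:
            window.popleft()
        if len(window) >= self.limits[model]:
            return False
        window.append(now)
        return True


# Assumed assignment of the two limits mentioned above.
limiter = PerModelRateLimiter({"kimi-2.6": 50, "deepseek-v4-pro": 200})
allowed = sum(limiter.try_acquire("kimi-2.6") for _ in range(60))
print(allowed, "of 60 back-to-back requests allowed")
```

&lt;p&gt;A caller that gets &lt;code&gt;False&lt;/code&gt; can either wait or route the request to a different model.&lt;/p&gt;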

&lt;p&gt;Streaming latency hid real performance – One model sent the first token in 200ms but took 5 seconds to finish. Another took 1s to start but finished in 2s total. Measure end-to-end, not time-to-first-token.&lt;/p&gt;
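
&lt;p&gt;To record both numbers side by side, a small timing wrapper around the stream is enough. The sketch below assumes the response can be consumed as an iterable of text chunks; &lt;code&gt;time_stream&lt;/code&gt; and &lt;code&gt;fake_stream&lt;/code&gt; are hypothetical helpers, with the fake stream standing in for a real streaming response.&lt;/p&gt;

```python
import time


def time_stream(chunks, clock=time.perf_counter):
    """Consume an iterable of text chunks and report both latency figures.

    Returns (time_to_first_chunk, total_time, full_text), times in seconds.
    """
    start = clock()
    first = None
    parts = []
    for chunk in chunks:
        if first is None:
            first = clock() - start
        parts.append(chunk)
    total = clock() - start
    return first, total, "".join(parts)


def fake_stream(delays):
    """Hypothetical stand-in for a streaming response: sleep, then yield a chunk."""
    for i, delay in enumerate(delays):
        time.sleep(delay)
        yield f"chunk{i} "


ttft, total, text = time_stream(fake_stream([0.01, 0.05, 0.05]))
print(f"first chunk after {ttft:.3f}s, finished after {total:.3f}s")
```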

&lt;p&gt;Model responses vary in length – Even with the same prompt, Qwen3 wrote 30% longer responses than MiniMax. This affects cost and user experience.&lt;/p&gt;

&lt;p&gt;Human rating is expensive – Two reviewers spent 6 hours rating 500 responses. Worth doing once, but not weekly.&lt;/p&gt;

&lt;h2&gt;If you want to run your own test&lt;/h2&gt;

&lt;p&gt;NovaStack (the gateway I used) offers new users credits at novapai.ai/en-US/. Enough to run a few hundred queries through all four models.&lt;/p&gt;

&lt;p&gt;The script I used is simple:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import os
import time

import requests

KEY = os.environ["NOVASTACK_API_KEY"]  # placeholder env var name
messages = [{"role": "user", "content": "Explain async/await"}]  # one sample query

models = ["deepseek-v4-pro", "kimi-2.6", "minimax-2.7", "qwen3-235b"]
results = []

for model in models:
    start = time.perf_counter()  # monotonic clock, better suited to timing
    response = requests.post(
        "https://api.novapai.ai/v1/chat/completions",
        headers={"Authorization": f"Bearer {KEY}"},
        json={"model": model, "messages": messages},
        timeout=120,
    )
    latency = time.perf_counter() - start  # end-to-end, not time-to-first-token
    results.append({"model": model, "latency": latency, "response": response.text})
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;h2&gt;Questions for the community&lt;/h2&gt;

&lt;p&gt;For which task types have you found surprising differences between models? I want to expand my benchmark.&lt;/p&gt;

&lt;p&gt;How do you handle per-model rate limits in production? My simple retry-with-backoff feels inadequate.&lt;/p&gt;

&lt;p&gt;Has anyone tried dynamic routing based on real-time cost/latency? Curious if that's worth the complexity.&lt;/p&gt;

&lt;p&gt;I'll share the full benchmark dataset and rating rubric if there's interest. Just comment or DM.&lt;/p&gt;

</description>
    </item>
    <item>
      <title>I just got $50 free credits for LLM APIs. Here's what I'm testing with it.</title>
      <dc:creator>NovaStack</dc:creator>
      <pubDate>Thu, 21 May 2026 06:25:05 +0000</pubDate>
      <link>https://dev.to/sbt112321321/i-just-got-50-free-credits-for-llm-apis-heres-what-im-testing-with-it-5c82</link>
      <guid>https://dev.to/sbt112321321/i-just-got-50-free-credits-for-llm-apis-heres-what-im-testing-with-it-5c82</guid>
      <description>&lt;p&gt;One of my favorite things in AI development is when a provider runs a promotion that actually lets you experiment properly.&lt;/p&gt;

&lt;p&gt;NovaStack just launched a &lt;strong&gt;$50 free credit&lt;/strong&gt; offer for new users. No complicated tiers, no "first 100 requests only" fine print. Just $50 to spend across their model gateway.&lt;/p&gt;

&lt;p&gt;Here's what I'm using it for.&lt;/p&gt;

&lt;h2&gt;What NovaStack actually is&lt;/h2&gt;

&lt;p&gt;It's a unified API endpoint that gives you access to multiple frontier models through a single key:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;DeepSeek-V4 Pro (great for reasoning/code)&lt;/li&gt;
&lt;li&gt;Kimi 2.6 (best-in-class for long context)&lt;/li&gt;
&lt;li&gt;MiniMax 2.7 (solid multimodal)&lt;/li&gt;
&lt;li&gt;Qwen3 235B (heavy lifter for complex tasks)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;One endpoint: &lt;a href="https://api.novapai.ai/v1/chat/completions" rel="noopener noreferrer"&gt;https://api.novapai.ai/v1/chat/completions&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;One key. Pick your model with the model parameter.&lt;/p&gt;

&lt;h2&gt;Why $50 is actually useful for testing&lt;/h2&gt;

&lt;p&gt;Most free credits are gone in an afternoon. $50 at NovaStack's pricing gets you:&lt;/p&gt;

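&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;What you can test&lt;/th&gt;
&lt;th&gt;Approximate usage&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;DeepSeek-V4 Pro&lt;/td&gt;
&lt;td&gt;~100K requests (simple prompts)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Qwen3 235B&lt;/td&gt;
&lt;td&gt;~50K requests&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Kimi 2.6 with 100K context&lt;/td&gt;
&lt;td&gt;~500 long document queries&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;That's enough to actually build and validate a feature, not just ping the API a few times.&lt;/p&gt;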

&lt;h2&gt;What I'm testing&lt;/h2&gt;

&lt;p&gt;Experiment 1: Long document extraction&lt;/p&gt;

&lt;p&gt;I have 200 legal PDFs (average 80K tokens). I'm running Kimi 2.6 on all of them to extract specific clauses. Estimated cost: ~$8, which the free credits cover.&lt;/p&gt;
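
&lt;p&gt;As a back-of-envelope check, the figures above imply a per-token price. The sketch below only does that arithmetic: it ignores output tokens, and the resulting rate is inferred from the numbers in this post, not taken from NovaStack's price list. The function name is illustrative.&lt;/p&gt;

```python
def implied_price_per_million(total_cost_usd, documents, tokens_per_document):
    """Back out the per-million-token price implied by a total cost estimate."""
    total_tokens = documents * tokens_per_document
    return total_cost_usd / (total_tokens / 1_000_000)


total_tokens = 200 * 80_000
price = implied_price_per_million(8.0, documents=200, tokens_per_document=80_000)
print(f"{total_tokens:,} input tokens, about ${price:.2f} per million tokens")
```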

&lt;p&gt;Experiment 2: Multi-model routing&lt;/p&gt;

&lt;p&gt;Building a simple router that sends:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Code generation → DeepSeek-V4 Pro&lt;/li&gt;
&lt;li&gt;Document QA → Kimi 2.6&lt;/li&gt;
&lt;li&gt;Complex reasoning → Qwen3 235B&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Want to see if per-task routing beats a single model on both cost and accuracy.&lt;/p&gt;
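
&lt;p&gt;A task-type router of this kind can be as small as a dictionary lookup. In the sketch below the task labels, the &lt;code&gt;pick_model&lt;/code&gt; name, and the choice of default model are assumptions; only the three task-to-model pairings come from the list above.&lt;/p&gt;

```python
ROUTES = {
    "code": "deepseek-v4-pro",
    "document_qa": "kimi-2.6",
    "reasoning": "qwen3-235b",
}
DEFAULT_MODEL = "deepseek-v4-pro"  # assumption: no default is named for this experiment


def pick_model(task_type):
    """Map a task label to a model name, falling back to the default."""
    return ROUTES.get(task_type, DEFAULT_MODEL)


for task in ("code", "document_qa", "reasoning", "chitchat"):
    print(task, "goes to", pick_model(task))
```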

&lt;p&gt;Experiment 3: Fallback testing&lt;/p&gt;

&lt;p&gt;Deliberately hitting rate limits to test how fast the gateway falls back to another model. The free credits mean I can burn some on stress testing without caring.&lt;/p&gt;

&lt;h2&gt;How to get the $50&lt;/h2&gt;

&lt;p&gt;Sign up at novapai.ai/en-US/ – the credit is automatically applied to new accounts. No promo code needed as far as I can tell.&lt;/p&gt;

&lt;h2&gt;What I've learned so far (one week in)&lt;/h2&gt;

&lt;p&gt;The good:&lt;/p&gt;

&lt;p&gt;Switching models is literally changing one string: "model": "kimi-2.6"&lt;/p&gt;

&lt;p&gt;The dashboard at novapai.ai/en-US/ shows per-model spending in real time&lt;/p&gt;

&lt;p&gt;Rate limits across models are independent, so fallback actually works&lt;/p&gt;

&lt;p&gt;The annoying:&lt;/p&gt;

&lt;p&gt;Streaming responses are formatted slightly differently per model. The gateway normalizes about 95% of it, but I hit one edge case with MiniMax.&lt;/p&gt;

&lt;p&gt;Cost tracking inside my app requires parsing their response headers – wish it was automatic&lt;/p&gt;

&lt;p&gt;Some model names changed during my testing (deprecated aliases). Check the docs before assuming.&lt;/p&gt;

&lt;p&gt;The unexpected:&lt;/p&gt;

&lt;p&gt;Qwen3 235B is slower than I expected (understandable – it's huge). For interactive chat, DeepSeek feels much snappier. I'm now routing based on acceptable latency, not just task type.&lt;/p&gt;

&lt;h2&gt;Questions for the community&lt;/h2&gt;

&lt;p&gt;What would you test with $50 of free credits? Looking for creative experiment ideas.&lt;/p&gt;

&lt;p&gt;Has anyone else tried NovaStack? Curious about your experience with their routing quality.&lt;/p&gt;

&lt;p&gt;How do you handle model deprecation warnings in production? I got bitten by an alias change – do you pin specific versions or build abstraction layers?&lt;/p&gt;

&lt;p&gt;I'll report back after I finish the 200-document extraction experiment. If the results are interesting, I'll share the dataset and scripts.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F13nve1mpvasfoqd3u4nh.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F13nve1mpvasfoqd3u4nh.png" alt=" " width="800" height="1067"&gt;&lt;/a&gt;&lt;/p&gt;

</description>
    </item>
    <item>
      <title>I'm tired of managing 4 different API keys for different AI models. Here's my fix.</title>
      <dc:creator>NovaStack</dc:creator>
      <pubDate>Mon, 18 May 2026 06:36:22 +0000</pubDate>
      <link>https://dev.to/sbt112321321/im-tired-of-managing-4-different-api-keys-for-different-ai-models-heres-my-fix-42jb</link>
      <guid>https://dev.to/sbt112321321/im-tired-of-managing-4-different-api-keys-for-different-ai-models-heres-my-fix-42jb</guid>
      <description>&lt;p&gt;I have a problem.&lt;/p&gt;

&lt;p&gt;My team uses DeepSeek for reasoning tasks, Kimi for long document processing, MiniMax for multimodal stuff, and Qwen for heavy lifting.&lt;/p&gt;

&lt;p&gt;That means four accounts, four API keys, four dashboards, four bills.&lt;/p&gt;

&lt;p&gt;Every time I switch models, I have to change the base URL and auth header. It's exhausting.&lt;/p&gt;

&lt;h2&gt;What I built&lt;/h2&gt;

&lt;p&gt;A dead-simple proxy that normalizes everything to OpenAI-compatible format. But honestly? I realized someone else already did it better.&lt;/p&gt;

&lt;p&gt;I found NovaStack – a unified gateway that takes one API key and one endpoint, then routes to different models based on the model parameter.&lt;/p&gt;

&lt;p&gt;Here's what it looks like:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import requests

response = requests.post(
    "https://api.novapai.ai/v1/chat/completions",
    headers={"Authorization": "Bearer your-single-key"},
    json={
        "model": "deepseek-v4-pro",  # or kimi-2.6, minimax-2.7, qwen3-235b
        "messages": [{"role": "user", "content": "Explain async/await"}]
    }
)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;That's it. One endpoint. One key. Four models.&lt;/p&gt;

&lt;h2&gt;What surprised me&lt;/h2&gt;

&lt;p&gt;The models actually have distinct strengths&lt;/p&gt;

&lt;p&gt;I assumed all frontier models were basically the same. They're not.&lt;/p&gt;

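&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Task&lt;/th&gt;
&lt;th&gt;Best model&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;100K+ token document QA&lt;/td&gt;
&lt;td&gt;Kimi 2.6&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Complex math/reasoning&lt;/td&gt;
&lt;td&gt;Qwen3 235B&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Quick chat + code&lt;/td&gt;
&lt;td&gt;DeepSeek-V4 Pro&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Image understanding&lt;/td&gt;
&lt;td&gt;MiniMax 2.7&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Routing is cheaper than picking one&lt;/p&gt;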

&lt;p&gt;We used to just use DeepSeek for everything. Switching to per-task routing cut our monthly bill by about 35%.&lt;/p&gt;

&lt;p&gt;Fallback matters more than I thought&lt;/p&gt;

&lt;p&gt;When one model hits rate limits, the gateway can automatically retry with another. Saved us from multiple production incidents.&lt;/p&gt;

&lt;h2&gt;What broke&lt;/h2&gt;

&lt;p&gt;Not all models support streaming the same way&lt;/p&gt;

&lt;p&gt;Some send different SSE formats. The gateway normalizes this, but I had to disable experimental features on one of our clients:&lt;/p&gt;

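&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;export DISABLE_STREAMING_BETA=1
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Cost tracking gets messy&lt;/p&gt;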

&lt;p&gt;The gateway provides a dashboard at novapai.ai/en-US/, but I still export logs to our own analytics for fine-grained per-task cost monitoring.&lt;/p&gt;

&lt;p&gt;Model names aren't standardized&lt;/p&gt;

&lt;p&gt;What NovaStack calls qwen3-235b might be different from what another provider calls it. Stick with one provider's naming convention.&lt;/p&gt;

&lt;h2&gt;My current setup&lt;/h2&gt;

&lt;p&gt;A simple YAML config that defines routing rules:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;routes:
  - match: context_length &amp;gt; 80000
    model: kimi-2.6
  - match: task_type == "reasoning"
    model: qwen3-235b
  - default: deepseek-v4-pro
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Then my app just calls NovaStack with whatever model the router picks.&lt;/p&gt;
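
&lt;p&gt;The routing rules above could be evaluated with something like the following. It is a sketch, not the actual router: the rules are hand-translated into Python predicates instead of being parsed from YAML, and the &lt;code&gt;route&lt;/code&gt; function and the request dictionary keys are assumptions.&lt;/p&gt;

```python
# Rules mirror the YAML config: the first match wins, otherwise the default applies.
RULES = [
    (lambda req: req.get("context_length", 0) > 80000, "kimi-2.6"),
    (lambda req: req.get("task_type") == "reasoning", "qwen3-235b"),
]
DEFAULT_MODEL = "deepseek-v4-pro"


def route(request):
    """Return the model for a request dict with optional context_length / task_type keys."""
    for matches, model in RULES:
        if matches(request):
            return model
    return DEFAULT_MODEL


print(route({"context_length": 120000, "task_type": "reasoning"}))
print(route({"context_length": 2000, "task_type": "reasoning"}))
print(route({"context_length": 2000, "task_type": "chat"}))
```

&lt;p&gt;Because rules are checked in order, a long reasoning request goes to the long-context model first, which matches the ordering in the config.&lt;/p&gt;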

&lt;h2&gt;Questions for the community&lt;/h2&gt;

&lt;p&gt;How many different models are you actively using in production? Are you managing multiple keys or using a gateway?&lt;/p&gt;

&lt;p&gt;What's your strategy for cost optimization? Do you manually pick models or use dynamic routing?&lt;/p&gt;

&lt;p&gt;Has anyone tried building their own router vs using a hosted solution? Curious about the tradeoffs.&lt;/p&gt;

&lt;p&gt;I'm still early in this journey. Would love to hear what's working for others.&lt;/p&gt;

&lt;p&gt;Happy to share my routing config and cost tracking script if there's interest.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>webdev</category>
      <category>productivity</category>
      <category>programming</category>
    </item>
    <item>
      <title>Claude Code with non-Anthropic models — a working setup &amp; what broke</title>
      <dc:creator>NovaStack</dc:creator>
      <pubDate>Mon, 18 May 2026 04:42:18 +0000</pubDate>
      <link>https://dev.to/sbt112321321/claude-code-with-non-anthropic-models-a-working-setup-what-broke-3g5b</link>
      <guid>https://dev.to/sbt112321321/claude-code-with-non-anthropic-models-a-working-setup-what-broke-3g5b</guid>
      <description>&lt;p&gt;I’ve been running Claude Code against a few non-Anthropic reasoning models for the past couple of weeks. The promise of models with larger context windows and different reasoning styles is real, but the integration path isn’t as smooth as docs suggest. Here’s my current setup, what actually works, and what I learned the hard way.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why bother?
&lt;/h2&gt;

&lt;p&gt;Claude Code’s agent loop is excellent, but sometimes I need:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Longer context for large codebase refactors (some models offer 1M tokens)&lt;/li&gt;
&lt;li&gt;Different reasoning styles for architectural decisions&lt;/li&gt;
&lt;li&gt;A fallback when Anthropic’s API has degraded performance in my region&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  The setup
&lt;/h2&gt;

&lt;p&gt;The key insight: some third-party API gateways expose Anthropic-compatible endpoints. Instead of fighting with litellm proxies or custom middleware, you can point Claude Code directly at an OpenAI-compatible or Anthropic-compatible endpoint by configuring the underlying model provider.&lt;/p&gt;

&lt;p&gt;Here’s what I’m using:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Provider configuration in Claude Code settings&lt;/strong&gt; (&lt;code&gt;~/.claude/settings.json&lt;/code&gt;):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"modelOverrides"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"claude-sonnet-4-20250514"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"provider"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"openai-compatible"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"baseURL"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"https://api.novapai.ai/v1"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"apiKey"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"${NOVAPAI_API_KEY}"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"model"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"deepseek-v4-pro"&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;For Anthropic-compatible endpoints, the config is even simpler. If the endpoint speaks the Messages API natively, you set &lt;code&gt;ANTHROPIC_BASE_URL&lt;/code&gt; and &lt;code&gt;ANTHROPIC_API_KEY&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;export &lt;/span&gt;&lt;span class="nv"&gt;ANTHROPIC_BASE_URL&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"https://api.novapai.ai/v1"&lt;/span&gt;
&lt;span class="nb"&gt;export &lt;/span&gt;&lt;span class="nv"&gt;ANTHROPIC_API_KEY&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"sk-your-key-here"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Then Claude Code picks it up automatically — no model override needed if the endpoint maps model names correctly.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Models I’ve actually tested:&lt;/strong&gt;&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Model&lt;/th&gt;
&lt;th&gt;Best use case&lt;/th&gt;
&lt;th&gt;Quirks&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;DeepSeek-V4 Pro&lt;/td&gt;
&lt;td&gt;Large refactors, reasoning-heavy tasks&lt;/td&gt;
&lt;td&gt;Sometimes overthinks simple edits&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Kimi 2.6&lt;/td&gt;
&lt;td&gt;Fast iterations, quick fixes&lt;/td&gt;
&lt;td&gt;Occasional hallucinated file paths&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;MiniMax 2.7&lt;/td&gt;
&lt;td&gt;Balanced perf, good for daily driving&lt;/td&gt;
&lt;td&gt;Tool calling occasionally misses params&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Qwen3 235B&lt;/td&gt;
&lt;td&gt;Complex architectural reasoning&lt;/td&gt;
&lt;td&gt;Slower token generation, but thorough&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h2&gt;
  
  
  What I learned the hard way
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;1. Tool calling format mismatches&lt;/strong&gt;&lt;br&gt;
Not all providers handle the tool_use content blocks identically. MiniMax 2.7 occasionally returns &lt;code&gt;tool_calls&lt;/code&gt; in OpenAI format even when the endpoint claims Anthropic compatibility. Symptom: Claude Code silently fails on tool execution, leaving you staring at a &lt;code&gt;null&lt;/code&gt; response. Fix: wrap the provider in a lightweight proxy that normalizes tool call formats, or stick to models that have been tested against Anthropic’s schema.&lt;/p&gt;
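
&lt;p&gt;The normalizing step in such a proxy might look roughly like this. The sketch converts an OpenAI-style assistant message (with &lt;code&gt;tool_calls&lt;/code&gt;) into Anthropic-style content blocks (with &lt;code&gt;tool_use&lt;/code&gt;), following the two public message formats as I understand them. The function name, the sample tool &lt;code&gt;read_file&lt;/code&gt;, and the call id are made up for illustration.&lt;/p&gt;

```python
import json


def openai_tool_calls_to_anthropic(message):
    """Convert an OpenAI-style assistant message into Anthropic-style content blocks.

    Text content becomes a `text` block; each entry in `tool_calls` becomes a
    `tool_use` block, with the JSON-encoded `arguments` string parsed into `input`.
    """
    blocks = []
    if message.get("content"):
        blocks.append({"type": "text", "text": message["content"]})
    for call in message.get("tool_calls") or []:
        fn = call["function"]
        blocks.append({
            "type": "tool_use",
            "id": call["id"],
            "name": fn["name"],
            "input": json.loads(fn["arguments"] or "{}"),
        })
    return blocks


openai_message = {
    "role": "assistant",
    "content": None,
    "tool_calls": [{
        "id": "call_1",
        "type": "function",
        "function": {"name": "read_file", "arguments": '{"path": "src/main.py"}'},
    }],
}
converted = openai_tool_calls_to_anthropic(openai_message)
print(converted)
```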

&lt;p&gt;&lt;strong&gt;2. Stop sequences behave differently&lt;/strong&gt;&lt;br&gt;
Anthropic models respect stop_sequences strictly. Some third-party models treat them as suggestions. This caused Claude Code’s structured output parsing to break intermittently — the model would generate past the expected stop token, and Claude Code would reject the entire response. Took me two evenings of debugging to isolate.&lt;/p&gt;
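
&lt;p&gt;One possible guard is to enforce stop sequences on the client or proxy side, cutting the text at the first stop sequence regardless of what the model did. A minimal sketch, assuming the full response text is available before it is handed on; &lt;code&gt;truncate_at_stop&lt;/code&gt; is a hypothetical helper, not part of Claude Code.&lt;/p&gt;

```python
def truncate_at_stop(text, stop_sequences):
    """Cut `text` at the earliest occurrence of any stop sequence.

    Returns (truncated_text, matched_sequence); matched_sequence is None
    when no stop sequence appears in the text.
    """
    hits = [(text.find(s), s) for s in stop_sequences if s and s in text]
    if not hits:
        return text, None
    index, matched = min(hits)
    return text[:index], matched


kept, matched = truncate_at_stop("result: 42 END trailing chatter", ["STOP", "END"])
print(repr(kept), matched)
```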

&lt;p&gt;&lt;strong&gt;3. Rate limiting isn’t transparent&lt;/strong&gt;&lt;br&gt;
The gateway I used (NovaPi AI) has its own rate limiting layer. When hitting limits, the error messages weren’t the standard Anthropic 429 responses Claude Code expects. Instead, I got generic 503s that Claude Code interpreted as transient network failures and retried aggressively — leading to a tight loop that burned through my quota faster. If you try this, check how your provider surfaces rate limits.&lt;/p&gt;
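
&lt;p&gt;If you control a proxy in front of the gateway, capped exponential backoff is one way to avoid that tight loop. The sketch below only computes the wait schedule; the function names, the default values, and the decision to treat 503 like 429 are assumptions for illustration.&lt;/p&gt;

```python
def backoff_delays(retries, base=1.0, factor=2.0, cap=30.0):
    """Return the wait in seconds before each retry: base, base*factor, ... capped at `cap`."""
    return [min(cap, base * factor ** attempt) for attempt in range(retries)]


RETRYABLE_STATUSES = {429, 503}  # assumption: treat the gateway's 503s like rate limits


def should_back_off(status_code):
    """True when a response status suggests waiting instead of retrying immediately."""
    return status_code in RETRYABLE_STATUSES


delays = backoff_delays(6)
print(delays)
```

&lt;p&gt;In practice some random jitter is usually added to each delay so that parallel clients do not retry in lockstep.&lt;/p&gt;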

&lt;p&gt;&lt;strong&gt;4. Streaming chunk inconsistencies&lt;/strong&gt;&lt;br&gt;
Some providers batch streaming chunks differently. Claude Code’s streaming parser expects chunks at certain boundaries. When a provider sends larger aggregated chunks, the incremental display in terminal gets janky — text appears in bursts rather than smooth streaming. Not a dealbreaker, but annoying during long generations.&lt;/p&gt;

&lt;h2&gt;
  
  
  Is this production-ready?
&lt;/h2&gt;

&lt;p&gt;For personal use and side projects: yes, with caveats. For team workflows: I’d be cautious. The debugging surface area expands significantly when you introduce a translation layer (even if it claims compatibility). I’d love to see better observability tools for tracing where exactly a model call diverges from expected behavior.&lt;/p&gt;

&lt;p&gt;The gateway I’m using (NovaPi AI, novapai.ai) handles the compatibility shim reasonably well for the models listed above, and their uptime has been solid. But the integration only works cleanly because their endpoint explicitly targets the Messages API spec.&lt;/p&gt;

&lt;h2&gt;
  
  
  Questions for the community
&lt;/h2&gt;

&lt;p&gt;I’m genuinely curious about others’ experiences here:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;Has anyone stress-tested these models with Claude Code’s multi-turn agent loops beyond 50 tool calls? I’m seeing some context degradation with Qwen3 235B around turns 30-40, where it starts repeating previous tool calls.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;What’s your approach to testing tool-call fidelity when switching providers? I’ve been running a small benchmark suite against known codebases, but it feels ad-hoc.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Are there other gateway services doing the Anthropic-compatible shim well that I should test? I’d rather not maintain my own proxy layer if there are reliable options out there.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Would love to hear war stories and alternative setups.&lt;/p&gt;

</description>
      <category>tutorial</category>
      <category>ai</category>
      <category>python</category>
      <category>api</category>
    </item>
    <item>
      <title>Got Claude Code working with open-source models via a unified API endpoint</title>
      <dc:creator>NovaStack</dc:creator>
      <pubDate>Mon, 18 May 2026 04:11:34 +0000</pubDate>
      <link>https://dev.to/sbt112321321/got-claude-code-working-with-open-source-models-via-a-unified-api-endpoint-3e4b</link>
      <guid>https://dev.to/sbt112321321/got-claude-code-working-with-open-source-models-via-a-unified-api-endpoint-3e4b</guid>
      <description>&lt;p&gt;Spent the last two weekends trying to get Claude Code talking to a few newer reasoning models without juggling six different SDKs. Finally landed on a setup that works, thought I'd share the config and the stuff that broke along the way.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Goal
&lt;/h2&gt;

&lt;p&gt;I wanted Claude Code to use DeepSeek-V4 Pro for heavy reasoning tasks, Kimi 2.6 for long-context code review, and Qwen3 235B as a general-purpose fallback — all through a single endpoint so I wasn't rewriting API client code every time.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Setup
&lt;/h2&gt;

&lt;p&gt;Found an API gateway that speaks the Anthropic Messages format while routing to different models on the backend. The base URL is &lt;code&gt;https://api.novapai.ai/v1&lt;/code&gt;, and it accepts standard Anthropic-style requests with a model parameter switch.&lt;/p&gt;

&lt;p&gt;Here's my Claude Code config file (&lt;code&gt;~/.claude/claude-code.json&lt;/code&gt;):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"apiKey"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"sk-your-key-here"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"baseURL"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"https://api.novapai.ai/v1"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"model"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"deepseek-v4-pro"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"models"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"reasoning"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"deepseek-v4-pro"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"review"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"kimi-2.6"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"default"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"qwen3-235b"&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Quick curl test to verify routing works:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;curl &lt;span class="nt"&gt;-X&lt;/span&gt; POST https://api.novapai.ai/v1/messages &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-H&lt;/span&gt; &lt;span class="s2"&gt;"Content-Type: application/json"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-H&lt;/span&gt; &lt;span class="s2"&gt;"x-api-key: sk-your-key-here"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-d&lt;/span&gt; &lt;span class="s1"&gt;'{
    "model": "deepseek-v4-pro",
    "max_tokens": 1024,
    "messages": [{"role": "user", "content": "Explain quicksort in two sentences."}]
  }'&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Also tried the Qwen3 235B endpoint for larger context windows:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;curl &lt;span class="nt"&gt;-X&lt;/span&gt; POST https://api.novapai.ai/v1/messages &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-H&lt;/span&gt; &lt;span class="s2"&gt;"Content-Type: application/json"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-H&lt;/span&gt; &lt;span class="s2"&gt;"x-api-key: sk-your-key-here"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-d&lt;/span&gt; &lt;span class="s1"&gt;'{
    "model": "qwen3-235b",
    "max_tokens": 2048,
    "messages": [{"role": "user", "content": "Refactor this 500-line Python module to use dataclasses."}]
  }'&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  What I Learned The Hard Way
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;1. Claude Code silently falls back to the default model when the requested one is rate-limited.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Spent an hour thinking &lt;code&gt;deepseek-v4-pro&lt;/code&gt; was hallucinating weirdly before realizing my API key was hitting rate limits and Claude Code was quietly routing to a smaller model I didn't even realize was in the rotation. Check your response headers for &lt;code&gt;x-model-used&lt;/code&gt; or equivalent — if it doesn't match what you requested, something's wrong upstream.&lt;/p&gt;
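&lt;p&gt;A minimal sketch of that check, assuming the gateway reports the served model either in the &lt;code&gt;model&lt;/code&gt; field of a Messages-style response body or in an &lt;code&gt;x-model-used&lt;/code&gt; header. The header name and both helper functions are illustrative assumptions, not a documented contract:&lt;/p&gt;

```python
from typing import Mapping, Optional


def served_model(body: Mapping[str, object], headers: Mapping[str, str]) -> Optional[str]:
    """Return the model the gateway says it used, if it reports one.

    Assumption: the "x-model-used" header name is hypothetical; the "model"
    field follows the Messages-style response shape.
    """
    # Header names are case-insensitive, so normalise before looking up.
    lowered = {name.lower(): value for name, value in headers.items()}
    header_value = lowered.get("x-model-used")
    if header_value:
        return header_value
    body_value = body.get("model")
    return body_value if isinstance(body_value, str) else None


def is_silent_fallback(requested: str, body: Mapping[str, object], headers: Mapping[str, str]) -> bool:
    """True when the gateway reports a different model than the one requested."""
    used = served_model(body, headers)
    return used is not None and used != requested
```

&lt;p&gt;Logging the result of &lt;code&gt;is_silent_fallback&lt;/code&gt; on every response would probably have saved me that hour.&lt;/p&gt;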

&lt;p&gt;&lt;strong&gt;2. Max tokens mismatch will crash the agent mid-task.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;code&gt;deepseek-v4-pro&lt;/code&gt; has a lower max output ceiling than Anthropic's default (which Claude Code assumes is 8192). When the model hit the token wall during a long code generation, the whole agent session died without a helpful error — just a truncated response. I had to set &lt;code&gt;max_tokens: 4096&lt;/code&gt; explicitly in every request until I figured out the hard limit.&lt;/p&gt;
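&lt;p&gt;Roughly what that guard could look like. The ceiling table is a placeholder (only the 4096 value for &lt;code&gt;deepseek-v4-pro&lt;/code&gt; and the 8192 default come from the paragraph above), and the truncation check relies on the Messages-style &lt;code&gt;stop_reason&lt;/code&gt; field being passed through by the gateway, which is an assumption:&lt;/p&gt;

```python
from typing import Mapping

# Assumption: this table is illustrative. Only the deepseek-v4-pro value and
# the 8192 default are taken from the text; fill in real limits per model.
OUTPUT_CEILINGS = {"deepseek-v4-pro": 4096}
DEFAULT_CEILING = 8192


def clamp_max_tokens(model: str, requested: int) -> int:
    """Never ask a model for more output tokens than its known ceiling."""
    return min(requested, OUTPUT_CEILINGS.get(model, DEFAULT_CEILING))


def was_truncated(body: Mapping[str, object]) -> bool:
    """Messages-style responses signal a hit token wall via stop_reason."""
    return body.get("stop_reason") == "max_tokens"
```

&lt;p&gt;Checking &lt;code&gt;was_truncated&lt;/code&gt; after each call at least turns a silent truncation into something you can log or retry.&lt;/p&gt;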

&lt;p&gt;&lt;strong&gt;3. System prompts get dropped silently on some routing paths.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Kimi 2.6 handled system prompts fine, but when I switched to MiniMax 2.7 through the same endpoint, the system message was apparently stripped during routing. The model still generated code, but without the system-level instructions about tool use format, so Claude Code couldn't parse the tool calls back. Took diffing raw response bodies to figure out what happened.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;4. Streaming chunks arrive in different framing.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Some models return SSE chunks with slightly different &lt;code&gt;data:&lt;/code&gt; framing than what the Anthropic SDK expects. If you're using the Node.js SDK directly instead of curl, you might need to set &lt;code&gt;stream: false&lt;/code&gt; initially to confirm basic connectivity before debugging streaming issues.&lt;/p&gt;
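&lt;p&gt;For debugging the framing itself, a tolerant line parser helps. This is a sketch under the assumption that each event carries its JSON on a single &lt;code&gt;data:&lt;/code&gt; line; the function name &lt;code&gt;iter_sse_json&lt;/code&gt; is mine, not part of any SDK:&lt;/p&gt;

```python
import json
from typing import Iterable, Iterator


def iter_sse_json(lines: Iterable[str]) -> Iterator[dict]:
    """Yield JSON payloads from SSE lines, tolerating framing differences.

    Accepts "data:" with or without a following space, and skips comments,
    "event:" lines, blank keep-alives, and the OpenAI-style "[DONE]" sentinel.
    Assumption: one JSON document per data line (no multi-line data fields).
    """
    for raw in lines:
        line = raw.rstrip("\r\n")
        if not line.startswith("data:"):
            continue
        payload = line[len("data:"):].strip()
        if not payload or payload == "[DONE]":
            continue
        yield json.loads(payload)
```

&lt;p&gt;Feeding it the raw lines from a streamed curl response shows quickly which event types a given model actually emits.&lt;/p&gt;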

&lt;h2&gt;
  
  
  Why This Approach Over Separate API Keys
&lt;/h2&gt;

&lt;p&gt;Honestly, it's less about cost and more about cognitive overhead. I don't want to maintain four different client libraries, remember which model uses which auth header format, or update four sets of rate limit handling. One endpoint, one format, swap the model string — that's the workflow I wanted.&lt;/p&gt;

&lt;p&gt;Also, &lt;code&gt;qwen3-235b&lt;/code&gt; has been surprisingly solid for code review tasks where I need a second opinion before committing. The 235B parameter count means it catches edge cases I'd miss on smaller models.&lt;/p&gt;

&lt;h2&gt;
  
  
  Current Gripes
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;No streaming support yet for &lt;code&gt;deepseek-v4-pro&lt;/code&gt; through this endpoint (works fine with synchronous calls though)&lt;/li&gt;
&lt;li&gt;Rate limits are per-account, not per-model, so burning through quota on one model blocks access to the others&lt;/li&gt;
&lt;li&gt;Tool use / function calling behavior varies significantly between models even with identical system prompts&lt;/li&gt;
&lt;/ul&gt;




&lt;p&gt;&lt;strong&gt;Questions for the community:&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;For those running multiple models through Claude Code or similar agents, how are you handling model-specific prompt formatting differences? I've been maintaining separate system prompt templates per model, but that feels brittle.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Has anyone benchmarked whether the token overhead from the Anthropic-compatible translation layer measurably impacts reasoning quality on non-Anthropic models? I haven't done a controlled A/B test yet.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;What's your fallback strategy when an agent is mid-task and the primary model starts failing? Right now I just restart with a different model string, which loses all context — feels like there should be a better way.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Would love to hear how others are wiring this stuff up. The config approach I landed on works, but it definitely feels like there's a more elegant pattern I haven't found yet.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;(For reference, I'm routing through the API at novapai.ai — they've got docs for the Anthropic-compatible endpoint if you want to check model availability. No affiliation, just what I ended up using after trying a few options.)&lt;/em&gt;&lt;/p&gt;

</description>
      <category>tutorial</category>
      <category>ai</category>
      <category>python</category>
      <category>api</category>
    </item>
    <item>
      <title>I wired Claude Code to some newer models – here's the config that survived</title>
      <dc:creator>NovaStack</dc:creator>
      <pubDate>Mon, 18 May 2026 04:09:41 +0000</pubDate>
      <link>https://dev.to/sbt112321321/i-wired-claude-code-to-some-newer-models-heres-the-config-that-survived-44g2</link>
      <guid>https://dev.to/sbt112321321/i-wired-claude-code-to-some-newer-models-heres-the-config-that-survived-44g2</guid>
      <description>&lt;p&gt;Spent the last two weekends trying to get Claude Code working with a handful of newer reasoning models. I wanted to see if any of them could handle agentic coding workflows without constant babysitting, and honestly also just needed a fallback when rate limits hit during peak hours.&lt;/p&gt;

&lt;p&gt;This isn't a benchmark post. It's a config share plus a few things I broke along the way.&lt;/p&gt;




&lt;h2&gt;
  
  
  What I tried to do
&lt;/h2&gt;

&lt;p&gt;Claude Code doesn't natively support third-party providers in the UI, but the CLI respects &lt;code&gt;ANTHROPIC_BASE_URL&lt;/code&gt; and &lt;code&gt;ANTHROPIC_API_KEY&lt;/code&gt;. If a provider implements the Messages API faithfully enough, things mostly work.&lt;/p&gt;

&lt;p&gt;I tested against an API endpoint that serves several models behind a unified key. The ones that ended up staying in my config after everything shook out:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;DeepSeek-V4 Pro&lt;/strong&gt; – the biggest surprise, handles multi-file refactors shockingly well&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Kimi 2.6&lt;/strong&gt; – extremely fast on single-file edits, occasionally hallucinates tool schemas&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;MiniMax 2.7&lt;/strong&gt; – great context window management, struggled with complex tool calls&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Qwen3 235B&lt;/strong&gt; – painfully slow but the reasoning quality is absurdly good for architecture-level questions&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  The setup that works
&lt;/h2&gt;

&lt;p&gt;I'm on macOS, Claude Code installed via npm. The config lives in &lt;code&gt;~/.claude.json&lt;/code&gt;. Here's the exact block I landed on after several iterations:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"apiKeyHelper"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"env ANTHROPIC_API_KEY"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"env"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"ANTHROPIC_BASE_URL"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"https://api.novapai.ai/v1"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"ANTHROPIC_API_KEY"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"sk-your-key-here"&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;One critical detail: the endpoint &lt;strong&gt;must&lt;/strong&gt; respond to &lt;code&gt;/v1/messages&lt;/code&gt; with proper SSE streaming headers, and model names in requests need to match exactly what the provider expects. I'm using:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# switching models in Claude Code CLI&lt;/span&gt;
claude &lt;span class="nt"&gt;--model&lt;/span&gt; &lt;span class="s2"&gt;"deepseek-v4-pro"&lt;/span&gt;
claude &lt;span class="nt"&gt;--model&lt;/span&gt; &lt;span class="s2"&gt;"kimi-2.6"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;For anyone trying to replicate, here's a minimal curl check to verify the endpoint responds correctly before wiring it into Claude Code:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;curl &lt;span class="nt"&gt;-s&lt;/span&gt; https://api.novapai.ai/v1/messages &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-H&lt;/span&gt; &lt;span class="s2"&gt;"Content-Type: application/json"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-H&lt;/span&gt; &lt;span class="s2"&gt;"x-api-key: sk-your-key-here"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-d&lt;/span&gt; &lt;span class="s1"&gt;'{
    "model": "deepseek-v4-pro",
    "max_tokens": 100,
    "messages": [{"role": "user", "content": "hello"}]
  }'&lt;/span&gt; | jq &lt;span class="s1"&gt;'.type'&lt;/span&gt;
&lt;span class="c"&gt;# should return "message"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  What broke (and what I learned the hard way)
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;1. Streaming chunk format mismatch&lt;/strong&gt;&lt;br&gt;
Not all providers send &lt;code&gt;message_delta&lt;/code&gt; events the way Anthropic does. MiniMax 2.7 sometimes omits &lt;code&gt;usage&lt;/code&gt; in the final chunk, which makes Claude Code hang waiting for token counts. Workaround: cap &lt;code&gt;max_tokens&lt;/code&gt; explicitly in every request, don't rely on server-side defaults.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. Tool use response parsing&lt;/strong&gt;&lt;br&gt;
In the Messages API, the model emits &lt;code&gt;tool_use&lt;/code&gt; blocks and Claude Code sends back &lt;code&gt;tool_result&lt;/code&gt; blocks with matching &lt;code&gt;tool_use_id&lt;/code&gt; fields. Kimi 2.6 occasionally reorders its blocks when streaming, resulting in "Tool result without matching request" errors. Retry logic doesn't always save you here; I had to restart sessions twice.&lt;/p&gt;
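&lt;p&gt;The ID pairing can be checked offline against captured request and response bodies. A small sketch, assuming Messages-style content blocks where a &lt;code&gt;tool_use&lt;/code&gt; block carries an &lt;code&gt;id&lt;/code&gt; and a &lt;code&gt;tool_result&lt;/code&gt; block carries a &lt;code&gt;tool_use_id&lt;/code&gt;; the function name is hypothetical:&lt;/p&gt;

```python
from typing import List, Mapping, Sequence


def unmatched_tool_results(
    assistant_content: Sequence[Mapping[str, object]],
    user_content: Sequence[Mapping[str, object]],
) -> List[str]:
    """Return tool_use_id values in the user turn that match no tool_use id."""
    # Collect every id the model issued in its tool_use blocks.
    issued = {
        block.get("id")
        for block in assistant_content
        if block.get("type") == "tool_use"
    }
    # Any tool_result pointing at an id outside that set is an orphan.
    return [
        str(block.get("tool_use_id"))
        for block in user_content
        if block.get("type") == "tool_result" and block.get("tool_use_id") not in issued
    ]
```

&lt;p&gt;An empty list means every result lines up with a request; anything else is the mismatch that triggers the error above.&lt;/p&gt;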

&lt;p&gt;&lt;strong&gt;3. System prompt handling&lt;/strong&gt;&lt;br&gt;
Some reasoning models inject their own system-level instructions that conflict with Claude Code's. DeepSeek-V4 Pro was cleanest here; Qwen3 occasionally added boilerplate reasoning directives that confused the chain-of-thought trimming logic in Claude Code. The fix was ensuring the API doesn't prepend any system messages of its own.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;4. Context window reporting&lt;/strong&gt;&lt;br&gt;
The &lt;code&gt;/v1/messages&lt;/code&gt; response headers should include &lt;code&gt;anthropic-ratelimit-input-tokens&lt;/code&gt; or equivalent. If they're missing, Claude Code can't track context usage accurately and will silently overflow. This bit me on a long refactoring session — the model just stopped responding mid-way through a 30-file edit.&lt;/p&gt;




&lt;h2&gt;
  
  
  Current workflow
&lt;/h2&gt;

&lt;p&gt;I keep Claude Code pointed at Anthropic by default and switch to the proxy endpoint explicitly when:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Rate limited during US morning hours&lt;/li&gt;
&lt;li&gt;Doing exploratory architecture discussions where I want multiple perspectives without burning my main quota&lt;/li&gt;
&lt;li&gt;Running batch refactors on repos where I can afford a small error rate&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The &lt;code&gt;deepseek-v4-pro&lt;/code&gt; model has become my go-to for the third case. It's not identical to Sonnet — it makes different mistakes, sometimes misses nuance in code review comments — but the throughput-per-dollar difference means I run it on things I'd normally queue up and context-switch away from.&lt;/p&gt;




&lt;h2&gt;
  
  
  Questions for the community
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;Has anyone else noticed tool-call ordering issues with reasoning-first models, or found a way to make them more deterministic in agentic loops?&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;For those running multiple models through Claude Code, how do you handle the prompt caching differences? Some providers ignore the cache control markers entirely and it tanks my effective context budget.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Is anyone experimenting with model routing based on task type (editing vs. reasoning vs. tool-heavy)? I'm considering a simple proxy that inspects the request and picks models accordingly, but not sure it's worth the complexity.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;




&lt;p&gt;&lt;em&gt;Quick note: the API endpoint I'm using is from NovaStack (novapai.ai) — they provide a unified Messages-API-compatible gateway to several of these models. Not affiliated, just found them after a lot of trial and error with other providers that claimed compatibility but broke on tool use. The config above should work with any compliant endpoint, adapt as needed.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>tutorial</category>
      <category>ai</category>
      <category>python</category>
      <category>api</category>
    </item>
    <item>
      <title>Claude Code Just Got a Massive Upgrade: Here's How to Connect It to Any API</title>
      <dc:creator>NovaStack</dc:creator>
      <pubDate>Mon, 18 May 2026 03:21:09 +0000</pubDate>
      <link>https://dev.to/sbt112321321/claude-code-just-got-a-massive-upgrade-heres-how-to-connect-it-to-any-api-55cf</link>
      <guid>https://dev.to/sbt112321321/claude-code-just-got-a-massive-upgrade-heres-how-to-connect-it-to-any-api-55cf</guid>
      <description>&lt;p&gt;If you've been following the AI coding space, you know Claude Code is Anthropic's powerful CLI programming agent. It can read your files, run terminal commands, and tackle complex programming tasks.&lt;/p&gt;

&lt;p&gt;There's just one problem: it's locked to Anthropic's official API.&lt;/p&gt;

&lt;p&gt;Or at least, it used to be.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Config That Unlocks Everything
&lt;/h2&gt;

&lt;p&gt;Claude Code actually supports custom API endpoints through a simple environment variable. All you need is a provider that supports the Anthropic Messages API format.&lt;/p&gt;

&lt;p&gt;The magic happens in &lt;code&gt;~/.claude/settings.json&lt;/code&gt;:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;{
  "env": {
    "ANTHROPIC_AUTH_TOKEN": "your-api-key-here",
    "ANTHROPIC_BASE_URL": "https://api.novapai.ai/v1"
  }
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;That's it. Claude Code now routes all requests through your custom endpoint.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why This Matters
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Cost savings&lt;/strong&gt; – Some providers offer significantly better pricing.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Model flexibility&lt;/strong&gt; – Swap in models like DeepSeek-V4 Pro, Kimi 2.6, MiniMax 2.7, or Qwen3 235B.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Unified billing&lt;/strong&gt; – One API key, one dashboard.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Step-by-Step Setup
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Step 1: Install Claude Code&lt;/strong&gt;&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;npm install -g @anthropic-ai/claude-code
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Step 2: Get API credentials&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Sign up with a provider like NovaStack. You'll get an API key and a base URL: &lt;a href="https://api.novapai.ai/v1" rel="noopener noreferrer"&gt;https://api.novapai.ai/v1&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Step 3: Configure settings&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Create &lt;code&gt;~/.claude/settings.json&lt;/code&gt;:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;{
  "env": {
    "ANTHROPIC_AUTH_TOKEN": "sk-your-key",
    "ANTHROPIC_BASE_URL": "https://api.novapai.ai/v1",
    "ANTHROPIC_MODEL": "deepseek-v4-pro"
  }
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Step 4: Skip official login&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Edit &lt;code&gt;~/.claude.json&lt;/code&gt;:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;{
  "hasCompletedOnboarding": true
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Step 5: Verify it's working&lt;/strong&gt;&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;claude "Say hello in one sentence"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Run &lt;code&gt;/status&lt;/code&gt; inside Claude Code to confirm.&lt;/p&gt;

&lt;h2&gt;
  
  
  Advanced: Switch Models on the Fly
&lt;/h2&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;{
  "env": {
    "ANTHROPIC_AUTH_TOKEN": "sk-your-key",
    "ANTHROPIC_BASE_URL": "https://api.novapai.ai/v1",
    "ANTHROPIC_DEFAULT_SONNET_MODEL": "qwen3-235b",
    "ANTHROPIC_DEFAULT_HAIKU_MODEL": "minimax-2.7",
    "ANTHROPIC_SMALL_FAST_MODEL": "kimi-2.6"
  }
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;h2&gt;
  
  
  What I Learned (The Hard Way)
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Streaming formats differ&lt;/strong&gt; – If you see weird output, try:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;export CLAUDE_CODE_DISABLE_EXPERIMENTAL_BETAS=1
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Not all providers work&lt;/strong&gt; – The endpoint must forward &lt;code&gt;anthropic-beta&lt;/code&gt; and &lt;code&gt;anthropic-version&lt;/code&gt; headers. Generic OpenAI endpoints fail this. Providers like NovaStack (built for Anthropic compatibility) work out of the box.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Rate limits vary&lt;/strong&gt; – Monitor your usage. NovaStack provides a dashboard at novapai.ai/en-US/ for analytics.&lt;/p&gt;

&lt;h2&gt;
  
  
  Questions for the Community
&lt;/h2&gt;

&lt;p&gt;I've been running this setup for three weeks. Still figuring things out:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;What other Anthropic-compatible endpoints have you tried?&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;How do you handle cost tracking since &lt;code&gt;/cost&lt;/code&gt; doesn't work with custom endpoints?&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Which model do you find best for coding tasks through Claude Code?&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

</description>
      <category>ai</category>
      <category>webdev</category>
      <category>productivity</category>
      <category>programming</category>
    </item>
    <item>
      <title>We tried routing between 4 different LLMs automatically – here's what we learned</title>
      <dc:creator>NovaStack</dc:creator>
      <pubDate>Mon, 18 May 2026 02:45:19 +0000</pubDate>
      <link>https://dev.to/sbt112321321/we-tried-routing-between-4-different-llms-automatically-heres-what-we-learned-eag</link>
      <guid>https://dev.to/sbt112321321/we-tried-routing-between-4-different-llms-automatically-heres-what-we-learned-eag</guid>
      <description>&lt;p&gt;We've been running a small experiment for the past few months: instead of picking one LLM for all tasks, we built a simple router that sends different queries to different models based on what they're good at.&lt;/p&gt;

&lt;p&gt;We used DeepSeek-V4 Pro, Kimi 2.6, MiniMax 2.7, and Qwen3 235B. No single model won across the board. Here's what surprised us.&lt;/p&gt;

&lt;h2&gt;
  
  
  What we actually did
&lt;/h2&gt;

&lt;p&gt;We set up a lightweight proxy that normalizes API requests to OpenAI-compatible format. When a request comes in, it checks:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Task type (reasoning, long context, summarization, etc.)&lt;/li&gt;
&lt;li&gt;Context length&lt;/li&gt;
&lt;li&gt;Cost budget (optional)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Then it routes to one of the four models.&lt;/p&gt;

&lt;p&gt;We didn't build anything fancy – just a few hundred lines of Python with retry logic and basic fallbacks.&lt;/p&gt;
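&lt;p&gt;To give a feel for the shape of it, here is a stripped-down sketch rather than our production code. The &lt;code&gt;Request&lt;/code&gt; fields, the 80K-token threshold, the &lt;code&gt;"low"&lt;/code&gt; budget tier, and the fallback order are all illustrative assumptions, and &lt;code&gt;send&lt;/code&gt; stands in for whatever function actually calls a model:&lt;/p&gt;

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Request:
    task: str                          # e.g. "reasoning", "summarization", "chat"
    context_tokens: int
    budget_tier: Optional[str] = None  # hypothetical tier name: "low" or None


def pick_models(req: Request) -> List[str]:
    """Return the primary model followed by fallbacks, in order of preference."""
    if req.context_tokens > 80_000:
        return ["kimi-2.6", "qwen3-235b"]
    if req.task == "reasoning" and req.budget_tier != "low":
        return ["qwen3-235b", "deepseek-v4-pro"]
    if req.task == "multimodal":
        return ["minimax-2.7", "deepseek-v4-pro"]
    return ["deepseek-v4-pro", "minimax-2.7"]


def call_with_fallback(models: List[str], send: Callable[[str], str]) -> str:
    """Try each model in order; move on when send() raises RuntimeError."""
    last_error: Optional[Exception] = None
    for model in models:
        try:
            return send(model)
        except RuntimeError as err:  # assumption: send() raises this on rate limits
            last_error = err
    raise RuntimeError("all candidate models failed") from last_error
```

&lt;p&gt;Keeping selection (&lt;code&gt;pick_models&lt;/code&gt;) separate from execution (&lt;code&gt;call_with_fallback&lt;/code&gt;) makes the routing rules easy to test without touching the network.&lt;/p&gt;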

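&lt;h2&gt;
  
  
  The surprising results
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Task type&lt;/th&gt;
&lt;th&gt;Best model&lt;/th&gt;
&lt;th&gt;Why&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Long document QA (&amp;gt;100K tokens)&lt;/td&gt;
&lt;td&gt;Kimi 2.6&lt;/td&gt;
&lt;td&gt;Almost no retrieval degradation&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Complex reasoning / math&lt;/td&gt;
&lt;td&gt;Qwen3 235B&lt;/td&gt;
&lt;td&gt;Highest GSM8K, but expensive&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;General chat + quick responses&lt;/td&gt;
&lt;td&gt;DeepSeek-V4 Pro&lt;/td&gt;
&lt;td&gt;Good balance of speed and accuracy&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Multimodal understanding&lt;/td&gt;
&lt;td&gt;MiniMax 2.7&lt;/td&gt;
&lt;td&gt;Surprisingly good at image + text&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The cost difference was 2x between cheapest and most expensive for similar tasks. That alone made routing worth it.&lt;/p&gt;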

&lt;h2&gt;
  
  
  What broke (and what didn't)
&lt;/h2&gt;

&lt;p&gt;Things that worked well:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;OpenAI-compatible interface saved us from rewriting app code&lt;/li&gt;
&lt;li&gt;Simple YAML routing rules were enough for 80% of cases&lt;/li&gt;
&lt;li&gt;Fallback to a cheaper model when the primary was rate-limited&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Things that failed:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Automatic "smart routing" based on embeddings was overkill and slow&lt;/li&gt;
&lt;li&gt;Trying to predict cost per task turned into a mess – we switched to simple budget tiers&lt;/li&gt;
&lt;li&gt;Streaming normalization between models was surprisingly painful (SSE formats differ)&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  A concrete example
&lt;/h2&gt;

&lt;p&gt;One request that surprised us: summarizing a 90K token legal document.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Kimi 2.6 retrieved all key clauses correctly&lt;/li&gt;
&lt;li&gt;DeepSeek missed 2 out of 12 critical points&lt;/li&gt;
&lt;li&gt;Qwen3 did well but cost 2x more&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;So we now route any document &amp;gt;80K tokens to Kimi by default.&lt;/p&gt;

&lt;h2&gt;
  
  
  Questions for the community
&lt;/h2&gt;

&lt;p&gt;We're still early in this. Would love to hear from others who've tried multi-model routing:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;Do you route dynamically or just pick one model per use case?&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;How do you handle cost control without breaking user experience?&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Has anyone tried open source routers vs building your own?&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Also curious if people have benchmarked these models on their own workloads – our numbers might not generalize.&lt;/p&gt;

&lt;p&gt;Happy to share our routing config if helpful. Just ask.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>webdev</category>
      <category>productivity</category>
      <category>programming</category>
    </item>
    <item>
      <title>We Built a Single API for 4 Frontier LLMs (So You Don't Have To)</title>
      <dc:creator>NovaStack</dc:creator>
      <pubDate>Mon, 18 May 2026 02:33:03 +0000</pubDate>
      <link>https://dev.to/sbt112321321/we-built-a-single-api-for-4-frontier-llms-so-you-dont-have-to-3kkp</link>
      <guid>https://dev.to/sbt112321321/we-built-a-single-api-for-4-frontier-llms-so-you-dont-have-to-3kkp</guid>
      <description>&lt;h2&gt;
  
  
  The Nightmare of M × N APIs
&lt;/h2&gt;

&lt;p&gt;Let me paint you a familiar picture.&lt;/p&gt;

&lt;p&gt;Your boss wants "all the best models." The engineering lead demands "OpenAI compatibility." The finance team whispers "cost optimization." And you? You're staring at four different SDKs, four authentication schemes, and four rate limiters that all fail in beautifully unique ways.&lt;/p&gt;

&lt;p&gt;We've been there. So we built a different way.&lt;/p&gt;

&lt;h2&gt;
  
  
  One Endpoint. Any Model.
&lt;/h2&gt;

&lt;p&gt;Meet the NovaStack router, a lightweight gateway that standardizes frontier LLMs into a single OpenAI-compatible API.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import requests

# Instead of managing 4 SDKs...
response = requests.post(
    "https://api.novapai.ai/router/v1/chat/completions",
    headers={"Authorization": "Bearer your-key"},
    json={
        "model": "deepseek-v4-pro",  # or kimi-2.6, minimax-2.7, qwen3-235b
        "messages": [{"role": "user", "content": "Explain MCP protocol"}]
    }
)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;That's it. The router handles the rest.&lt;/p&gt;

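&lt;h2&gt;
  
  
  What Happens Behind the Curtain
&lt;/h2&gt;

&lt;p&gt;Every request goes through our orchestration layer:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Problem&lt;/th&gt;
&lt;th&gt;Our Solution&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Each model expects different auth headers&lt;/td&gt;
&lt;td&gt;Transparent translation layer&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Streaming formats vary wildly&lt;/td&gt;
&lt;td&gt;Normalized SSE output&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Rate limits cause cascading failures&lt;/td&gt;
&lt;td&gt;Intelligent retry + fallback routing&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Costs spiral out of control&lt;/td&gt;
&lt;td&gt;Automatic cheapest-capable model selection&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h2&gt;
  
  
  The Numbers That Matter
&lt;/h2&gt;

&lt;p&gt;We benchmarked all four models on production workloads:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Model&lt;/th&gt;
&lt;th&gt;Reasoning&lt;/th&gt;
&lt;th&gt;Long Context (128K)&lt;/th&gt;
&lt;th&gt;Cost per 1M tokens&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;DeepSeek-V4 Pro&lt;/td&gt;
&lt;td&gt;89.2%&lt;/td&gt;
&lt;td&gt;94%&lt;/td&gt;
&lt;td&gt;$0.48&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Kimi 2.6&lt;/td&gt;
&lt;td&gt;85.7%&lt;/td&gt;
&lt;td&gt;98%&lt;/td&gt;
&lt;td&gt;$0.62&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;MiniMax 2.7&lt;/td&gt;
&lt;td&gt;87.3%&lt;/td&gt;
&lt;td&gt;91%&lt;/td&gt;
&lt;td&gt;$0.44&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Qwen3 235B&lt;/td&gt;
&lt;td&gt;91.5%&lt;/td&gt;
&lt;td&gt;96%&lt;/td&gt;
&lt;td&gt;$0.91&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Key insight: No single model wins everywhere. Kimi dominates long documents. Qwen3 crushes reasoning (at a price). DeepSeek is your reliable workhorse.&lt;/p&gt;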
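&lt;p&gt;To make the price column concrete, here is a back-of-the-envelope estimator. It is a sketch, not our billing logic: it uses the per-1M-token prices from the table above and assumes input and output tokens are charged at the same rate, which may not match real pricing. The function names are illustrative:&lt;/p&gt;

```python
from typing import Iterable

# Prices per 1M tokens in USD, copied from the table above.
# Assumption: one flat rate covers both input and output tokens.
PRICE_PER_MILLION_USD = {
    "deepseek-v4-pro": 0.48,
    "kimi-2.6": 0.62,
    "minimax-2.7": 0.44,
    "qwen3-235b": 0.91,
}


def estimate_cost_usd(model: str, input_tokens: int, output_tokens: int) -> float:
    """Estimate the cost of one request from its token counts."""
    total_tokens = input_tokens + output_tokens
    return PRICE_PER_MILLION_USD[model] * total_tokens / 1_000_000


def cheapest(models: Iterable[str]) -> str:
    """Pick the lowest-priced model among the candidates."""
    return min(models, key=PRICE_PER_MILLION_USD.__getitem__)
```

&lt;p&gt;Running &lt;code&gt;cheapest&lt;/code&gt; over the models that can handle a task is the simplest form of cheapest-capable selection.&lt;/p&gt;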

&lt;h2&gt;
  
  
  How We Built the Router
&lt;/h2&gt;

&lt;p&gt;Our gateway runs on AMD MI250 GPU clusters. Why AMD? 40% better price-performance than comparable Nvidia setups for inference.&lt;/p&gt;

&lt;p&gt;The secret sauce is continuous batching with length awareness — we group requests by context window size, reducing wasted computation by 62%.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;# Smart routing in production
route:
  - if: task == "long_document_qa" and context_length &amp;gt; 100000
    use: kimi-2.6
    fallback: qwen3-235b
  - if: task == "reasoning" and budget &amp;lt; 0.0005
    use: deepseek-v4-pro
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;h2&gt;
  
  
  Real Impact
&lt;/h2&gt;

&lt;p&gt;A SaaS company switched from single-model to multi-model routing:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;37% lower latency&lt;/li&gt;
&lt;li&gt;22% better accuracy&lt;/li&gt;
&lt;li&gt;41% cost reduction&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;A fintech startup now routes quarterly reports to Qwen3 (captures subtle trends), then sends calculations to DeepSeek-V4 Pro (numerical precision). Their analyst team saved 15 hours per week.&lt;/p&gt;

&lt;h2&gt;
  
  
  Try It in 30 Seconds
&lt;/h2&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;curl https://api.novapai.ai/router/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer $NOVASTACK_KEY" \
  -d '{
    "model": "qwen3-235b",
    "messages": [{"role": "user", "content": "Optimize this PostgreSQL query..."}]
  }'
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Production stats:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;99.9% uptime across 8 regions&lt;/li&gt;
&lt;li&gt;&amp;lt;3s average generation&lt;/li&gt;
&lt;li&gt;2,100 tokens/second per node&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  The Hard Lessons
&lt;/h2&gt;

&lt;p&gt;Lesson 1: Model choice is infrastructure, not application logic. Your code shouldn't know which model it's calling.&lt;/p&gt;

&lt;p&gt;Lesson 2: Specialized models beat generalists. The best system routes based on task, not brand loyalty.&lt;/p&gt;

&lt;p&gt;Lesson 3: Hardware arbitrage is real. AMD for inference, Nvidia for training — don't let vendor lock-in drain your budget.&lt;/p&gt;

&lt;h2&gt;
  
  
  Ready to Stop Managing APIs?
&lt;/h2&gt;

&lt;p&gt;Full docs, playground, and API keys at &lt;a href="https://novapai.ai/en-US/" rel="noopener noreferrer"&gt;https://novapai.ai/en-US/&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;P.S. — We're open-sourcing our adaptive rate limiter next month. Drop your GitHub handle in the comments for early access.&lt;/p&gt;

&lt;p&gt;What's your biggest pain point with multi-model deployments? Let's solve it together.&lt;/p&gt;

</description>
    </item>
    <item>
      <title>How I Cut My LLM Inference Costs by 40% While Handling 5x More Requests</title>
      <dc:creator>NovaStack</dc:creator>
      <pubDate>Thu, 14 May 2026 02:49:11 +0000</pubDate>
      <link>https://dev.to/sbt112321321/title-how-i-cut-my-llm-inference-costs-by-40-while-handling-5x-more-reques-4b09</link>
      <guid>https://dev.to/sbt112321321/title-how-i-cut-my-llm-inference-costs-by-40-while-handling-5x-more-reques-4b09</guid>
      <description>&lt;p&gt;Last month our team hit a wall with our LLM inference pipeline. We were running multiple instances of large models for different products, and GPU costs were spiraling out of control. After two weeks spent rebuilding our inference architecture, I want to share the approach that worked for us, specifically around API compatibility and routing strategies.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The Problem:&lt;/strong&gt; We were locked into a single provider. Every time we wanted to test a new model variant (like DeepSeek-V4-Pro for our code generation tasks), we had to rewrite significant portions of our integration layer.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The Solution: Universal OpenAI-Compatible Routing&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;We built a lightweight proxy layer that normalizes all requests to the OpenAI chat completions format. The real breakthrough came when we discovered providers offering high-performance inference endpoints that follow this standard natively. Here's what our setup looks like now:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;import os
from openai import OpenAI

# Initialize client pointing to a high-throughput inference endpoint
# This particular endpoint runs DeepSeek-V4-Pro with optimized batching
client = OpenAI(
    api_key=os.environ.get("NOVASTACK_API_KEY"),
    base_url="https://api.api.novapai.ai/v1"
)

# Standard OpenAI-compatible call: only base_url and api_key differ per provider
def generate_code_review(diff_content):
    """Stream a concise review of the given diff, yielding text as it arrives."""
    response = client.chat.completions.create(
        model="DeepSeek-V4-Pro",
        messages=[
            {
                "role": "system",
                "content": "You are a senior software engineer. Review code changes concisely."
            },
            {
                "role": "user",
                "content": f"Review this diff and suggest improvements:\n\n{diff_content}"
            }
        ],
        temperature=0.3,
        max_tokens=2048,
        stream=True  # We stream tokens directly to the frontend
    )

    for chunk in response:
        # Skip chunks that carry no choices or no text
        if chunk.choices and chunk.choices[0].delta.content:
            yield chunk.choices[0].delta.content

# Example usage: the same pattern works for our other 3 models
# Just swap the model parameter, everything else stays identical
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;&lt;strong&gt;What Made This Work:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;1. &lt;strong&gt;Drop-in replacement:&lt;/strong&gt; Any OpenAI-compatible endpoint works without touching business logic. We tested 6 providers in one afternoon by changing only &lt;code&gt;base_url&lt;/code&gt; and &lt;code&gt;api_key&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;2. &lt;strong&gt;Token-level streaming:&lt;/strong&gt; The endpoint supports SSE streaming natively. Our users see responses rendering token by token, which dramatically improved perceived latency.&lt;/p&gt;

&lt;p&gt;3. &lt;strong&gt;Model isolation:&lt;/strong&gt; We run DeepSeek-V4-Pro for complex reasoning tasks while using smaller models for classification. Same client library, different &lt;code&gt;model&lt;/code&gt; parameters. No dependency hell.&lt;/p&gt;

&lt;p&gt;4. &lt;strong&gt;Cost visibility:&lt;/strong&gt; Since pricing is token-based with no hidden overhead, we can attribute costs per feature. Our code review module costs $0.12 per review on average with this setup.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Key Takeaways:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;- Don't underestimate the value of API standardization. The OpenAI chat completions format has become the de facto standard for a reason.&lt;/p&gt;

&lt;p&gt;- Test multiple inference providers. Performance varies wildly between endpoints serving the same model, especially TTFT (Time To First Token) under load.&lt;/p&gt;

&lt;p&gt;- Token-based pricing (for both input and output tokens) gives you predictable costs. Some providers bury overhead in opaque "infrastructure fees"; avoid those.&lt;/p&gt;

&lt;p&gt;We're now handling 5x our previous request volume at 40% lower cost, purely from finding a more efficient inference endpoint for the same DeepSeek-V4-Pro model we were already using.&lt;/p&gt;

&lt;p&gt;Has anyone else gone through a similar migration? What inference endpoints are you using for production workloads? Would love to compare notes.&lt;/p&gt;

&lt;p&gt;#AI #LLM #Inference #GPU #NovaStack&lt;/p&gt;
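&lt;p&gt;The per-feature cost attribution mentioned under "Cost visibility" comes down to multiplying token counts by prices and grouping the result by feature. Here is a minimal Python sketch, assuming made-up prices (the &lt;code&gt;PRICE_PER_MILLION&lt;/code&gt; values are placeholders, not any provider's rates) and hypothetical names &lt;code&gt;request_cost&lt;/code&gt; and &lt;code&gt;CostLedger&lt;/code&gt;. Token counts would typically come from the &lt;code&gt;usage&lt;/code&gt; object of a chat completions response.&lt;/p&gt;

```python
from collections import defaultdict

# Placeholder prices in USD per one million tokens; substitute your provider's rates.
PRICE_PER_MILLION = {"input": 1.00, "output": 2.00}


def request_cost(prompt_tokens: int, completion_tokens: int) -> float:
    """Cost in USD of one request under simple per-token pricing."""
    return (
        prompt_tokens * PRICE_PER_MILLION["input"]
        + completion_tokens * PRICE_PER_MILLION["output"]
    ) / 1_000_000


class CostLedger:
    """Accumulates request costs per feature (for example, "code_review")."""

    def __init__(self) -> None:
        self._totals = defaultdict(float)
        self._counts = defaultdict(int)

    def record(self, feature: str, prompt_tokens: int, completion_tokens: int) -> float:
        """Add one request to the ledger and return its cost."""
        cost = request_cost(prompt_tokens, completion_tokens)
        self._totals[feature] += cost
        self._counts[feature] += 1
        return cost

    def average(self, feature: str) -> float:
        """Average cost per request for a feature, or 0.0 if none were recorded."""
        count = self._counts[feature]
        return self._totals[feature] / count if count else 0.0
```

&lt;p&gt;Calling &lt;code&gt;ledger.record("code_review", prompt_tokens, completion_tokens)&lt;/code&gt; after each request and reading &lt;code&gt;ledger.average("code_review")&lt;/code&gt; gives the kind of per-review figure quoted above, under whatever prices you plug in.&lt;/p&gt;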

</description>
      <category>tutorial</category>
      <category>ai</category>
      <category>python</category>
      <category>api</category>
    </item>
  </channel>
</rss>
