DEV Community: Novita AI

Best Text-to-Speech APIs in 2026: 8 Providers Compared

Novita AI — Wed, 06 May 2026 14:12:21 +0000

Kimi K2.6 on Novita AI: API Pricing ($0.95/$4.00), SWE-Bench & Agentic Coding

Novita AI — Wed, 29 Apr 2026 14:59:50 +0000

Kimi K2.6: Open-Source Agent for 13-Hour Coding Sessions

Your coding agent halts after 20 minutes, burns through context, and leaves you with a half-finished PR. You switch to a closed frontier model: it lasts longer but costs 5× more per run. Kimi K2.6, Moonshot AI's newly open-sourced model, is built specifically to break that trade-off. In Moonshot's demonstrations it sustained 13-hour autonomous sessions spanning 4,000+ tool calls, and it scores 58.6% on SWE-Bench Pro, edging out GPT-5.4 (57.7%) and outperforming Claude Opus 4.6 (53.4%), at a fraction of the closed-model price. (Benchmarks sourced from kimi.com/blog/kimi-k2-6.)

TL;DR: Kimi K2.6 is now on Novita AI. 1T MoE open-source model, 256K context, 58.6% SWE-Bench Pro — built for long-horizon agentic coding. Try free via OpenAI-compatible API.

In short: Kimi K2.6 is a 1-trillion-parameter open-source MoE model (32B activated) from Moonshot AI, specialized for agentic coding, long-horizon task execution, and multi-agent coordination — with a 256K context window and OpenAI-compatible API access on Novita AI.

Try Kimi K2.6 on Novita AI →

What Is Kimi K2.6?

Kimi K2.6 is an open-source, native multimodal agentic model released by Moonshot AI in April 2026. It is a direct evolution of Kimi K2.5 — the same MoE architecture, now significantly improved for real-world long-horizon tasks, coding-driven UI generation, and coordinated multi-agent execution.

At its core, K2.6 is a 1-trillion-parameter Mixture-of-Experts (MoE) model with only 32B parameters activated per token — giving it frontier-class reasoning at compute costs closer to a dense 30B model. The architecture uses Multi-head Latent Attention (MLA), SwiGLU activations, 384 experts with 8 selected per token, and a 256K-token context window. The model is released under a modified MIT license.
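
To make "384 experts with 8 selected per token" concrete, here is a toy sketch of top-k expert routing in plain Python. It is not Moonshot AI's implementation: the router scores are random stand-ins for a learned router layer, and `route_token` is a hypothetical helper. Only the two counts (384 and 8) come from the specs above.

```python
import math
import random

NUM_EXPERTS = 384   # from the K2.6 spec above
TOP_K = 8           # experts selected per token, from the K2.6 spec above


def route_token(router_logits, top_k=TOP_K):
    """Pick the top_k experts for one token and softmax-normalize their weights.

    router_logits: one score per expert (random here; learned in a real model).
    Returns (expert_index, weight) pairs, highest score first, weights summing to 1.
    """
    # Indices of the top_k highest-scoring experts.
    chosen = sorted(range(len(router_logits)),
                    key=lambda i: router_logits[i], reverse=True)[:top_k]
    # Softmax over the chosen experts only (subtract the max for stability).
    peak = max(router_logits[i] for i in chosen)
    exps = [math.exp(router_logits[i] - peak) for i in chosen]
    total = sum(exps)
    return [(i, e / total) for i, e in zip(chosen, exps)]


rng = random.Random(0)
logits = [rng.gauss(0.0, 1.0) for _ in range(NUM_EXPERTS)]
selected = route_token(logits)
print(len(selected), "of", NUM_EXPERTS, "experts active for this token")
```

Because only the selected experts run for a given token, most expert weights sit idle on any single forward pass, which is the intuition behind a 1T-parameter model activating roughly 32B parameters per token.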

Key capabilities at a glance:

Long-horizon coding — sustained autonomous execution across hours and thousands of tool calls
Multi-language generalization — strong performance in Rust, Go, Python, and niche languages like Zig
Coding-driven design — turns prompts and visual inputs into production-ready front-end interfaces
Agent Swarm scaling — coordinates up to 300 sub-agents across 4,000 parallel steps
Native multimodal — processes images and text natively via the MoonViT vision encoder
Function calling & structured output — OpenAI-compatible tool use, ideal for building agent pipelines and RAG systems

What Makes Kimi K2.6 Different from Other Open-Source Models?

Long-Horizon Coding

Most LLMs degrade after a few hundred tool calls. K2.6 was explicitly trained for multi-hour, multi-thousand-call sessions. In one benchmark task, it deployed a local Qwen3.5-0.8B model on a Mac, rewrote its inference engine in Zig over 12 hours and 4,000+ tool calls, and improved throughput from ~15 to ~193 tokens/sec — roughly 20% faster than LM Studio. In another, it autonomously refactored an 8-year-old financial matching engine (exchange-core) across a 13-hour session, executing 12 optimization strategies and modifying 4,000+ lines of code for a 185% throughput gain.

Kimi Code Bench: K2.6 scores 68.2 vs K2.5's 57.4 (+19%). [Source: Kimi Official Blog]

According to Moonshot AI's launch blog, beta partners including Baseten, Blackbox.ai, Factory.ai, and Fireworks.ai noted that K2.6 maintains "architectural integrity over extended coding sessions" and surfaces "non-obvious bugs that would normally take significant developer time to uncover."

Coding-Driven Design

K2.6 can generate structured front-end layouts, interactive elements, scroll-triggered animations, and lightweight full-stack workflows — authentication, session management, database operations — from a simple text or image prompt. Moonshot AI's internal Kimi Design Bench, covering Visual Input Tasks, Landing Page Construction, Full-Stack App Development, and General Creative Programming, shows K2.6 competitive with Google AI Studio across all four categories.

Kimi Design Bench: K2.6 (47.5%) outperforms Google AI Studio (31.4%) on UI generation tasks. [Source: Kimi Official Blog]

Elevated Agent Swarm

K2.6 scales the agent swarm architecture from K2.5's 100 sub-agents / 1,500 steps to 300 sub-agents executing across 4,000 coordinated steps simultaneously. The coordinator dynamically assigns tasks to agents based on skill profiles, detects failures, reassigns work, and manages the full lifecycle from initiation to validation. Outputs span documents, websites, slides, and spreadsheets — produced in a single autonomous run. Moonshot AI's own marketing team uses a K2.6-backed Claw Group internally, with specialized agents for demo creation, benchmarking, social media, and video production all coordinated by K2.6.
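
The coordinator behavior described above (skill-based assignment, failure detection, reassignment) can be sketched in a few lines. This is a simplified, synchronous illustration and not Moonshot AI's swarm implementation: the `Agent` class, the `coordinate` function, and the task format are all hypothetical, and `Agent.run` stands in for a real sub-agent call.

```python
from dataclasses import dataclass, field


@dataclass
class Agent:
    name: str
    skills: set
    healthy: bool = True
    completed: list = field(default_factory=list)

    def run(self, task):
        # Stand-in for a real sub-agent call (for example, an LLM request).
        if not self.healthy:
            raise RuntimeError(f"{self.name} failed on {task['id']}")
        self.completed.append(task["id"])
        return f"{task['id']} done by {self.name}"


def coordinate(tasks, agents, max_attempts=3):
    """Assign each task to a skilled agent; on failure, reassign to another."""
    results = {}
    for task in tasks:
        tried = set()
        for _ in range(max_attempts):
            candidates = [a for a in agents
                          if task["skill"] in a.skills and a.name not in tried]
            if not candidates:
                break
            # Prefer the least-loaded agent with a matching skill.
            agent = min(candidates, key=lambda a: len(a.completed))
            tried.add(agent.name)
            try:
                results[task["id"]] = agent.run(task)
                break
            except RuntimeError:
                continue  # failure detected: loop again to reassign
        results.setdefault(task["id"], None)  # None marks an unresolved task
    return results


agents = [
    Agent("scraper-1", {"scrape"}, healthy=False),
    Agent("scraper-2", {"scrape"}),
    Agent("writer-1", {"write"}),
]
tasks = [
    {"id": "t1", "skill": "scrape"},
    {"id": "t2", "skill": "write"},
    {"id": "t3", "skill": "design"},
]
results = coordinate(tasks, agents)
print(results)
```

Here `t1` is first handed to the failing `scraper-1` and then reassigned to `scraper-2`, while `t3` stays unresolved because no agent advertises the `design` skill. A real swarm would run agents concurrently and add a validation step.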

Kimi Claw Bench: K2.6 scores 65.5 vs K2.5's 59.6 (+9.9%) on multi-step agent tasks. [Source: Kimi Official Blog]

Proactive Background Agents

One of the more striking K2.6 use cases from Moonshot's own RL infrastructure team: a K2.6-backed agent ran autonomously for 5 days, handling monitoring, incident response, and system operations — persistent context, multi-threaded task management, and full-cycle execution from alert to resolution, without human intervention. This kind of persistent, 24/7 background agent is a specific design target for K2.6.

How Does Kimi K2.6 Perform on Agentic Coding Benchmarks?

K2.6 competes directly with top closed models. It leads on the benchmarks most relevant to agentic coding workflows:

Coding Benchmarks (Last verified: 2026-04-21, source: kimi.com/blog/kimi-k2-6)

Benchmark	Kimi K2.6	GPT-5.4 (xhigh)	Claude Opus 4.6 (max)	Gemini 3.1 Pro (thinking)	Kimi K2.5
SWE-Bench Pro	58.6	57.7	53.4	54.2	50.7
SWE-Bench Verified	80.2	—	80.8	80.6	76.8
SWE-Bench Multilingual	76.7	—	77.8	76.9	73.0
Terminal-Bench 2.0	66.7	65.4	65.4	68.5	50.8
LiveCodeBench (v6)	89.6	—	88.8	91.7	85.0

Agentic Benchmarks (Last verified: 2026-04-21)

Benchmark	Kimi K2.6	GPT-5.4 (xhigh)	Claude Opus 4.6 (max)	Gemini 3.1 Pro	Kimi K2.5
HLE-Full w/ tools	54.0	52.1	53.0	51.4	50.2
DeepSearchQA (f1-score)	92.5	78.6	91.3	81.9	89.0
BrowseComp	83.2	82.7	83.7	85.9	74.9
OSWorld-Verified	73.1	75.0	72.7	—	63.3
Toolathlon	50.0	54.6	47.2	48.8	27.8

The headline: K2.6 posts the top SWE-Bench Pro score in this comparison (58.6%). On DeepSearchQA it beats GPT-5.4 by a wide margin (92.5 vs. 78.6) and Claude Opus 4.6 narrowly (92.5 vs. 91.3), and on Terminal-Bench 2.0 it sits 1.3 points ahead of both (66.7 vs. 65.4). Gemini 3.1 Pro edges it on Terminal-Bench (68.5 vs. 66.7) and LiveCodeBench (91.7 vs. 89.6). Its reasoning scores (AIME 2026: 96.4%, GPQA-Diamond: 90.5%) are competitive but trail Gemini and GPT-5.4: this is a coding-first model, not a math olympiad specialist.

How to Use Kimi K2.6 on Novita AI

Option 1: Playground

Navigate to Kimi K2.6 on Novita AI and click Try in Playground. No API key needed to start.

Option 2: API (Python)

Kimi K2.6 is fully OpenAI-compatible. Swap in the Novita base URL and your API key:

pip install openai

from openai import OpenAI

client = OpenAI(
    api_key="YOUR_NOVITA_API_KEY",
    base_url="https://api.novita.ai/v3/openai",
)

response = client.chat.completions.create(
    model="moonshotai/kimi-k2.6",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Your prompt here"}
    ],
    max_tokens=8192,
    temperature=0.7,
)

print(response.choices[0].message.content)

Get your API key at novita.ai/settings.
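
Since the article lists OpenAI-compatible function calling among K2.6's capabilities, here is a minimal sketch of one tool round-trip using the standard OpenAI `tools` format. The `count_lines` tool, the `dispatch` helper, and `run_with_tools` are hypothetical names for illustration, and the sketch assumes the Novita endpoint returns tool calls in the usual OpenAI shape. Pass in the `client` created in the snippet above.

```python
import json


def count_lines(text: str) -> int:
    """Hypothetical local tool: count the lines in a string."""
    return len(text.splitlines())


# JSON-schema description of the tool, in the OpenAI "tools" format.
TOOLS = [{
    "type": "function",
    "function": {
        "name": "count_lines",
        "description": "Count the number of lines in a piece of text.",
        "parameters": {
            "type": "object",
            "properties": {"text": {"type": "string"}},
            "required": ["text"],
        },
    },
}]

LOCAL_TOOLS = {"count_lines": count_lines}


def dispatch(name: str, arguments_json: str) -> str:
    """Run a requested tool locally and return its result as a JSON string."""
    if name not in LOCAL_TOOLS:
        return json.dumps({"error": f"unknown tool: {name}"})
    args = json.loads(arguments_json)
    return json.dumps({"result": LOCAL_TOOLS[name](**args)})


def run_with_tools(client, prompt: str, model: str = "moonshotai/kimi-k2.6") -> str:
    """One tool round-trip: ask, execute any requested tools, then ask again."""
    messages = [{"role": "user", "content": prompt}]
    first = client.chat.completions.create(model=model, messages=messages, tools=TOOLS)
    reply = first.choices[0].message
    if not reply.tool_calls:
        return reply.content
    messages.append(reply)  # keep the assistant turn that requested the tools
    for call in reply.tool_calls:
        messages.append({
            "role": "tool",
            "tool_call_id": call.id,
            "content": dispatch(call.function.name, call.function.arguments),
        })
    second = client.chat.completions.create(model=model, messages=messages, tools=TOOLS)
    return second.choices[0].message.content
```

A long-running agent would wrap this in a loop that keeps going until the model stops requesting tools, with a cap on the number of rounds.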

Option 3: Third-Party Tools

Because Novita's API is OpenAI-compatible, Kimi K2.6 works out of the box with LangChain, LlamaIndex, OpenWebUI, and coding assistants like Cursor or Continue. Point the base URL to https://api.novita.ai/v3/openai and set the model name to moonshotai/kimi-k2.6.

When Should You Use Kimi K2.6 Instead of GPT-4o or Claude?

Scenario 1: Long-Running Engineering Agents

K2.6 is well-suited for long-running engineering agents — legacy codebase refactoring, CI/CD pipeline debugging, and infrastructure optimization. Its Kimi Code Bench results and the exchange-core case study show it maintains task coherence across thousands of tool calls without drifting from the original objective.

Scenario 2: Design-to-Code Pipelines

Designers drop a mockup; K2.6 produces a working React/HTML/CSS implementation with animations and responsive layouts. The model's native multimodal input (via MoonViT) means it processes the image reference directly rather than relying on a verbal description. This makes it a strong backbone for AI-assisted UI generation workflows.

Scenario 3: Multi-Agent Orchestration

When you need to coordinate specialized agents in parallel — one scraping data, another writing analysis, a third formatting output — K2.6 acts as the coordinator layer. Its 300-agent / 4,000-step architecture makes it a practical choice for content pipelines, research workflows, or any task where parallel specialization reduces latency compared to sequential single-agent runs.

Scenario 4: Migrating from Claude or GPT-4o Agent Pipelines

If you're running agentic coding workflows on Claude Opus or GPT-4o and looking to cut costs without sacrificing reliability, K2.6 is a strong open-source drop-in. Its SWE-Bench Pro score (58.6%) exceeds both Claude Opus 4.6 (53.4%) and GPT-5.4 (57.7%) on the same benchmark. Because the API is OpenAI-compatible, migration is a configuration change (base URL, API key, and model ID) rather than a rewrite.

How Much Does Kimi K2.6 Cost on Novita AI?

Kimi K2.6 on Novita AI is priced as follows (Last verified: 2026-04-21):

Model	Input ($/M tokens)	Cache Read ($/M tokens)	Output ($/M tokens)	Context
Kimi K2.6	$0.95	$0.16	$4.00	256K (262,144 tokens)
Kimi K2.5	$0.60	$0.10	$3.00	256K (262,144 tokens)

For long-horizon agentic runs where cache hit rates are high, the $0.16/M cache-read price makes extended autonomous sessions materially cheaper than the headline input price suggests.
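
A quick back-of-the-envelope calculation shows how much the cache-read price matters. The sketch below uses the K2.6 prices from the table above; the token counts and the 90% cache hit rate are hypothetical, and it assumes cached input tokens are billed at the cache-read rate while the rest are billed at the input rate.

```python
# Prices in USD per million tokens, taken from the table above.
KIMI_K26_PRICES = {"input": 0.95, "cache_read": 0.16, "output": 4.00}


def estimate_cost(input_tokens, output_tokens, cache_hit_rate, prices=KIMI_K26_PRICES):
    """Estimate USD cost when a fraction of input tokens is served from cache."""
    if not 0.0 <= cache_hit_rate <= 1.0:
        raise ValueError("cache_hit_rate must be between 0 and 1")
    cached = input_tokens * cache_hit_rate
    fresh = input_tokens - cached
    total = (fresh * prices["input"]
             + cached * prices["cache_read"]
             + output_tokens * prices["output"])
    return total / 1_000_000


# Hypothetical agent run: 50M input tokens in total, 2M output tokens.
no_cache = estimate_cost(50_000_000, 2_000_000, cache_hit_rate=0.0)
mostly_cached = estimate_cost(50_000_000, 2_000_000, cache_hit_rate=0.9)
print(f"no cache: ${no_cache:.2f}, 90% cache hits: ${mostly_cached:.2f}")
```

Under these assumptions the run drops from $55.50 to $19.95, with output tokens ($8.00) becoming the largest remaining share of the bill.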

What Are the Technical Specs of Kimi K2.6?

Property	Value
Architecture	Mixture-of-Experts (MoE)
Total Parameters	1T
Activated Parameters	32B
Number of Layers	61 (incl. 1 dense layer)
Number of Experts	384
Selected Experts per Token	8
Context Length	256K tokens
Attention Mechanism	MLA (Multi-head Latent Attention)
Vision Encoder	MoonViT
Vocabulary Size	160K
License	Modified MIT

Full architecture details, weights, and evaluation code available on the Kimi K2.6 HuggingFace model card. Benchmark methodology published on the Moonshot AI blog.

Is Kimi K2.6 the Right Model for Your Agent Pipeline?

Bottom line: Kimi K2.6 is one of the strongest open-source models for long-horizon agentic coding as of April 2026. Its SWE-Bench Pro score of 58.6% tops GPT-5.4, Gemini 3.1 Pro, and Claude Opus 4.6 in Moonshot's comparison, and its MoE architecture keeps inference costs reasonable while offering a 256K context window. Together, these make it a compelling alternative to Claude or GPT-4o for agent pipeline developers.

It is not the top reasoning model overall — GPT-5.4 and Gemini 3.1 Pro lead on pure math (AIME, HLE without tools). But for developers building coding agents, design-to-code pipelines, or multi-agent orchestration systems, K2.6 is a strong open-source option available on the Novita AI API today.

Try Kimi K2.6 Free →

FAQ

What is Kimi K2.6?

Kimi K2.6 is an open-source, native multimodal agentic model from Moonshot AI, released in April 2026. It is a 1-trillion-parameter Mixture-of-Experts model (32B activated) with a 256K context window, built for long-horizon coding, autonomous agent execution, and multi-agent swarm coordination.

How do I access Kimi K2.6 via API on Novita AI?

Use the OpenAI Python SDK with base_url="https://api.novita.ai/v3/openai" and model ID moonshotai/kimi-k2.6. Get your API key at novita.ai/settings. No special SDK or wrapper required.

How does Kimi K2.6 compare to Claude Opus 4.6 for coding tasks?

On SWE-Bench Pro, Kimi K2.6 scores 58.6% vs. Claude Opus 4.6's 53.4% — a 5-point gap on real-world software engineering tasks. K2.6 also beats Claude on DeepSearchQA (92.5% vs. 91.3%) and Terminal-Bench 2.0 (66.7% vs. 65.4%); Gemini 3.1 Pro tops Terminal-Bench at 68.5%. For pure reasoning benchmarks like AIME or HLE without tools, Claude Opus 4.6 holds a slight edge.

What is the context window for Kimi K2.6?

Kimi K2.6 supports a 256K-token context window (262,144 tokens). On Novita AI, both the context length and max output are set to 262,144 tokens, making it suitable for long-document analysis and sustained multi-turn agentic sessions.

What is the pricing for Kimi K2.6 on Novita AI?

On Novita AI, Kimi K2.6 is priced at $0.95 per million input tokens, $0.16 per million cache-read tokens, and $4.00 per million output tokens. The 256K context window and max output are both included. View current pricing on Novita AI.

Novita AI is an AI & Agent Cloud for developers — offering 200+ models via serverless API alongside Agent Sandbox infrastructure and GPU Cloud. Build, scale, and deploy AI applications without managing infrastructure. Get started at novita.ai.

DeepSeek-V4-Flash on Novita AI: Fast Reasoning at Lower Cost

Novita AI — Wed, 29 Apr 2026 14:58:49 +0000

DeepSeek-V4-Flash on Novita AI: 1M Context at $0.14/M Tokens

Most open-source models with reasoning capabilities force a trade-off: small context windows, slow throughput, or prices that climb above $1/M tokens the moment you enable extended thinking. DeepSeek-V4-Flash sidesteps that entirely — 284B parameters, only 13B activated per inference, a native 1,048,576-token context window, and three selectable reasoning modes. At $0.14/M input tokens, it lands in a category where reasoning-capable models rarely compete.

TL;DR: DeepSeek-V4-Flash is now available via Novita AI. 284B MoE model, 1M token context, selectable reasoning modes. $0.14/M input. OpenAI-compatible API.

In short: DeepSeek-V4-Flash is a MoE model from DeepSeek AI that brings 1M-token context and adjustable reasoning depth to developers who need throughput without the closed-model price premium. As of today, it's available through the Novita AI API.

What Is DeepSeek-V4-Flash?

DeepSeek-V4-Flash is a Mixture-of-Experts (MoE) language model from DeepSeek AI, released as part of the DeepSeek-V4 series alongside the larger DeepSeek-V4-Pro. The model has 284B total parameters with 13B activated at inference — keeping per-token compute cost low while retaining the parameter capacity of a much larger model.

Key capabilities at a glance:

284B total / 13B activated parameters — MoE architecture, low inference cost
1,048,576-token context window (1M tokens) — enabled by Hybrid Attention Architecture
Three reasoning modes: Non-think (fast), Think (step-by-step), Think Max (maximum reasoning budget)
Function calling support — tool use, structured outputs, JSON mode
Trained on 32T+ tokens with multi-stage post-training (SFT, RL with GRPO, on-policy distillation)
MIT License — weights available for download on HuggingFace; commercial use permitted
FP4 + FP8 mixed precision — MoE expert weights in FP4, remaining layers in FP8

Key Features: Why DeepSeek-V4-Flash Stands Out

Selectable Reasoning Depth Without Switching Models

Most models lock you into a single inference mode: either reasoning-on or reasoning-off. DeepSeek-V4-Flash gives you three distinct operating modes on the same API endpoint:

Mode	Characteristics	Best For
Non-think	Fast, no chain-of-thought	High-volume tasks, chat, summarization
Think	Step-by-step reasoning, balanced	Complex Q&A, code generation, analysis
Think Max	Maximum reasoning budget	Math competitions, hard coding tasks, benchmarks

The gap between modes is significant: on GPQA Diamond, V4-Flash Non-think scores 71.2 vs Think at 87.4 and Think Max at 88.1. On LiveCodeBench, Think Max reaches 91.6 vs Non-think's 55.2. You choose cost vs quality per request — no infrastructure change required.
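
One way to act on the cost-versus-quality trade-off is to pick the lightest mode whose published score clears a target. The sketch below is only an illustration: `cheapest_mode` is a hypothetical helper, the mode strings are labels rather than API values, the scores are the GPQA Diamond and LiveCodeBench figures reported in this article, and benchmark scores are at best a rough proxy for your own workload.

```python
# Modes ordered from lightest to heaviest reasoning budget.
MODES = ["non-think", "think", "think-max"]

# Scores reported in this article for DeepSeek-V4-Flash.
SCORES = {
    "GPQA Diamond": {"non-think": 71.2, "think": 87.4, "think-max": 88.1},
    "LiveCodeBench": {"non-think": 55.2, "think": 88.4, "think-max": 91.6},
}


def cheapest_mode(benchmark, min_score):
    """Return the lightest mode whose published score meets min_score, else None."""
    for mode in MODES:
        if SCORES[benchmark][mode] >= min_score:
            return mode
    return None


print(cheapest_mode("GPQA Diamond", 85))    # think
print(cheapest_mode("LiveCodeBench", 90))   # think-max
```

The jump from Non-think to Think buys far more than the jump from Think to Think Max on both benchmarks, so Think is a sensible default when you have no task-specific data.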

Hybrid Attention Architecture for 1M-Token Context

Native million-token context is harder than it sounds. DeepSeek-V4-Flash achieves it through a purpose-built Hybrid Attention Architecture that combines two mechanisms:

Compressed Sparse Attention (CSA) — dramatically reduces the attention compute budget for long sequences
Heavily Compressed Attention (HCA) — compresses KV cache footprint for 1M-context inference

The result: inference over 1M-token inputs with manageable FLOP and memory cost. For workloads like codebase analysis, legal document review, or long-session agents, this architecture makes the difference between feasible and prohibitive.

MoE Efficiency: 13B Activated at 284B Scale

The 284B/13B activated ratio is where the cost efficiency comes from. Only 13B parameters are active per forward pass, keeping latency and per-token cost close to a 13B dense model — while the full 284B parameter pool provides knowledge capacity comparable to a much larger dense network. The FP4 + FP8 mixed precision further reduces memory bandwidth pressure on expert weights.

Strong Post-Training Pipeline

DeepSeek-V4-Flash follows a two-stage post-training process: first, domain-specific expert cultivation via SFT and reinforcement learning with GRPO; then, unified model consolidation through on-policy distillation. This produces a single model with differentiated capability profiles across coding, reasoning, and general knowledge — not a generic instruction-follower.

Benchmark Performance

The benchmark story for DeepSeek-V4-Flash is about reasoning mode selection. In Non-think mode, it behaves like an efficient 13B-activated model. Dial up to Think Max and it reaches a different tier entirely.

DeepSeek-V4-Flash performance across modes vs frontier models [Source: DeepSeek AI / HuggingFace]

Performance Across Reasoning Modes

Below are V4-Flash's scores across key benchmarks, comparing all three operating modes:

Benchmark	V4-Flash Non-Think	V4-Flash Think	V4-Flash Think Max
LiveCodeBench (Pass@1)	55.2	88.4	91.6
GPQA Diamond (Pass@1)	71.2	87.4	88.1
HMMT 2026 Feb (Pass@1)	40.8	91.9	94.8
IMOAnswerBench (Pass@1)	41.9	85.1	88.4
Codeforces Rating	—	2816	3052
SWE Verified (Resolved)	73.7	78.6	79.0
MRCR 1M (MMR)	37.5	76.9	78.7
MCPAtlas (Pass@1)	64.0	67.4	69.0
MMLU-Pro (EM)	83.0	86.4	86.2

Last verified: 2026-04-27. Source: DeepSeek-V4 technical report and HuggingFace model card.

How V4-Flash Compares to Competitors

V4-Flash Think Max (79.0 SWE Verified, 91.6 LiveCodeBench) competes with models running at much higher per-token cost. It doesn't top every leaderboard — V4-Pro Max leads on most frontier benchmarks — but for developers looking at cost-per-task rather than raw peak performance, the trade-off is favorable:

Benchmark	V4-Flash Max	V4-Pro Max	Claude Opus 4.6 Max	Gemini 3.1 Pro High
LiveCodeBench (Pass@1)	91.6	93.5	88.8	91.7
GPQA Diamond (Pass@1)	88.1	90.1	91.3	94.3
SWE Verified (Resolved)	79.0	80.6	80.8	80.6
HMMT 2026 Feb (Pass@1)	94.8	95.2	96.2	94.7
MRCR 1M (MMR)	78.7	83.5	92.9	76.3

Last verified: 2026-04-27. Claude Opus 4.6 Max and Gemini 3.1 Pro High figures sourced from the DeepSeek-V4 technical report (V4-Pro frontier comparison table). These scores were not measured head-to-head against V4-Flash in that report.

Notably, V4-Flash Think Max on MRCR 1M (78.7) beats Gemini 3.1 Pro High (76.3) on the long-context retrieval task — the benchmark that most directly maps to 1M-context use cases. On SWE Verified, all four models cluster between 79–81, making V4-Flash competitive in the real-world coding agent category at a fraction of the closed-model price.

How to Use DeepSeek-V4-Flash via Novita AI

Option 1: Playground (No Code)

Test the model directly in your browser at the Novita AI model console. No API key required to start — switch between Non-think, Think, and Think Max modes via the chat interface.

Option 2: API (Python)

DeepSeek-V4-Flash uses the OpenAI-compatible API. Use the model ID deepseek/deepseek-v4-flash with the Novita base URL:

from openai import OpenAI

client = OpenAI(
    base_url="https://api.novita.ai/v3/openai",
    api_key="YOUR_NOVITA_API_KEY",
)

response = client.chat.completions.create(
    model="deepseek/deepseek-v4-flash",
    messages=[{"role": "user", "content": "Your prompt here"}]
)
print(response.choices[0].message.content)

To enable Think or Think Max mode, pass the reasoning parameter in the request body:

from openai import OpenAI

client = OpenAI(
    base_url="https://api.novita.ai/v3/openai",
    api_key="YOUR_NOVITA_API_KEY",
)

# Think Max mode — maximum reasoning budget
response = client.chat.completions.create(
    model="deepseek/deepseek-v4-flash",
    messages=[{"role": "user", "content": "Solve: x^4 - 5x^2 + 4 = 0"}],
    extra_body={"reasoning": {"effort": "high"}}  # "low" = Think, "high" = Think Max
)
print(response.choices[0].message.content)

Get your API key at novita.ai/settings.
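
If you switch modes per request, a small helper keeps the mapping in one place. This sketch assumes the mapping stated in this article ("low" for Think, "high" for Think Max, parameter omitted for Non-think); `reasoning_kwargs` and the mode labels are hypothetical names, so check the Novita API documentation before relying on them.

```python
def reasoning_kwargs(mode):
    """Translate a mode label into keyword arguments for chat.completions.create."""
    effort_by_mode = {"think": "low", "think-max": "high"}
    if mode == "non-think":
        return {}  # omit the reasoning parameter entirely for the fast mode
    if mode not in effort_by_mode:
        raise ValueError(f"unknown mode: {mode!r}")
    return {"extra_body": {"reasoning": {"effort": effort_by_mode[mode]}}}


print(reasoning_kwargs("think"))
```

With the `client` from the snippets above, a call then looks like `client.chat.completions.create(model="deepseek/deepseek-v4-flash", messages=messages, **reasoning_kwargs("think-max"))`.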

Option 3: Third-Party Tools

Because Novita AI exposes an OpenAI-compatible endpoint, DeepSeek-V4-Flash works out of the box with:

LangChain / LlamaIndex — use ChatOpenAI with base_url="https://api.novita.ai/v3/openai"
OpenWebUI — add as a custom OpenAI-compatible endpoint
Continue.dev / Cursor — configure as a custom model with the Novita base URL

Pricing

DeepSeek-V4-Flash is priced consistently across major providers. All figures are per million tokens, as of 2026-04-27:

Provider	Input ($/M)	Output ($/M)	Cache Read ($/M)	Max Context
Novita AI	$0.14	$0.28	$0.028	1,048,576 tokens
DeepSeek Official	$0.14	$0.28	$0.028	131,072 tokens
SiliconFlow	$0.14	$0.28	$0.028	65,536 tokens
DeepInfra	$0.14	$0.28	—	16,384 tokens

The per-token rate is the same everywhere — but max context varies significantly. Novita AI offers the full 1M token context window. DeepInfra caps at 16,384 tokens. If your workload involves long documents, codebases, or multi-turn agents, Novita is the practical choice.
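
Since price is identical across providers, the practical question is which ones can hold your request at all. The sketch below filters the providers in the table above by context size; `providers_that_fit` is a hypothetical helper, and it assumes the prompt and the reserved output share one context window.

```python
# Max context per provider in tokens, copied from the table above.
MAX_CONTEXT = {
    "Novita AI": 1_048_576,
    "DeepSeek Official": 131_072,
    "SiliconFlow": 65_536,
    "DeepInfra": 16_384,
}


def providers_that_fit(prompt_tokens, reserved_output_tokens=0):
    """List providers whose context window holds the prompt plus reserved output."""
    needed = prompt_tokens + reserved_output_tokens
    return [name for name, limit in MAX_CONTEXT.items() if limit >= needed]


print(providers_that_fit(100_000, reserved_output_tokens=8_000))
print(providers_that_fit(500_000))
```

A 100K-token prompt with 8K tokens reserved for output fits only the two largest windows, and a 500K-token prompt fits only the 1M window.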

Recommended Use Cases

Autonomous Coding Agents

V4-Flash's 1M context window means an agent can load an entire codebase into context without chunking. Combined with 79.0 SWE Verified in Think Max mode, it handles multi-file refactors and debugging without losing state between turns.
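
Loading a codebase "without chunking" still needs a size check before you send it. The sketch below concatenates source files into one prompt and skips anything that would exceed a token budget. `pack_repo` is a hypothetical helper, and the 4-characters-per-token figure is a rough heuristic rather than the model's tokenizer, so treat the estimate as approximate.

```python
import os
import tempfile

CONTEXT_TOKENS = 1_048_576   # V4-Flash context window on Novita AI
CHARS_PER_TOKEN = 4          # rough heuristic, not the model's tokenizer


def pack_repo(root, extensions=(".py", ".md"), budget_tokens=CONTEXT_TOKENS):
    """Concatenate source files under root into one prompt string.

    Files that would push the estimated token count past budget_tokens are
    skipped and reported. Returns (prompt_text, skipped_paths).
    """
    parts, skipped, used = [], [], 0
    for dirpath, dirnames, filenames in os.walk(root):
        dirnames.sort()  # deterministic traversal order
        for filename in sorted(filenames):
            if not filename.endswith(extensions):
                continue
            path = os.path.join(dirpath, filename)
            with open(path, encoding="utf-8", errors="replace") as handle:
                body = handle.read()
            chunk = f"### FILE: {os.path.relpath(path, root)}\n{body}\n"
            cost = len(chunk) // CHARS_PER_TOKEN + 1
            if used + cost > budget_tokens:
                skipped.append(path)
                continue
            parts.append(chunk)
            used += cost
    return "".join(parts), skipped


# Tiny demonstration on a throwaway directory.
with tempfile.TemporaryDirectory() as repo:
    with open(os.path.join(repo, "main.py"), "w", encoding="utf-8") as f:
        f.write("print('hello')\n")
    with open(os.path.join(repo, "notes.bin"), "w", encoding="utf-8") as f:
        f.write("ignored")
    prompt, skipped = pack_repo(repo)
print(prompt)
```

In practice you would set `budget_tokens` below the full window to leave room for instructions, conversation history, and the model's output.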

Long-Document QA and RAG

In Think Max mode, V4-Flash scores 78.7 on MRCR 1M (Multi-Round Coreference Resolution), a benchmark that measures retrieval accuracy over a genuine 1M-token window. For question answering over legal documents, academic papers, or long technical specs, that result suggests V4-Flash keeps retrieving accurately at context lengths far beyond 32K tokens, where many models start to degrade.

Math and Science Reasoning

With Think Max, V4-Flash scores 94.8% on HMMT 2026 February (competition math). The selectable modes let you tune cost against accuracy: use Think for standard problems and Think Max for the hard ones. No request is locked into a fixed compute budget; you choose per call.

Production APIs with Caching

At $0.028/M for cache reads, one-fifth of the $0.14/M input price, repeated system prompts and tool schemas become a minor cost at scale. Chatbot products and API wrappers that re-inject the same context on every call benefit most from cache-read pricing.

Frequently Asked Questions

What is DeepSeek-V4-Flash?

DeepSeek-V4-Flash is a 284B-parameter Mixture-of-Experts language model developed by DeepSeek AI, released on 2026-04-23. It activates only 13B parameters per forward pass, making it significantly faster and cheaper than dense models of comparable capability. It supports a 1,048,576-token context window and three reasoning modes: Non-think (fast), Think, and Think Max (maximum reasoning budget).

How is DeepSeek-V4-Flash different from DeepSeek-V4-Pro?

V4-Flash is the lighter, faster variant optimized for speed and cost. V4-Pro is the flagship model with higher peak benchmark scores (e.g., 93.5 vs 91.6 on LiveCodeBench Think Max). V4-Flash "achieves comparable reasoning performance to the Pro version when given a larger thinking budget" — in practice, V4-Flash Think Max closes most of the gap against V4-Pro Think Max at lower per-token cost.

What does "Flash" mean in the model name?

Flash signals a speed-optimized variant, consistent with how Google uses the term for Gemini Flash. DeepSeek-V4-Flash prioritizes lower latency and cost over raw maximum accuracy, with the thinking modes available when you need to close the performance gap.

Does DeepSeek-V4-Flash support a 1M context window on Novita AI?

Yes. Novita AI exposes the full 1,048,576-token context window — the largest available across all current providers for this model. Max completion tokens on Novita is 393,216.

How do I switch reasoning modes via the API?

Pass extra_body={"reasoning": {"effort": "low"}} for Think mode, or set "effort": "high" for Think Max. Omit the parameter entirely for Non-think (fast) mode. The API is OpenAI-compatible, so no SDK changes are required.

What is the pricing for DeepSeek-V4-Flash on Novita AI?

As of 2026-04-27: $0.14/M input tokens, $0.28/M output tokens, $0.028/M cache read tokens. This matches DeepSeek's official pricing and is consistent across providers — the differentiator on Novita is the full 1M context window and reliable uptime.

Is DeepSeek-V4-Flash open source?

Yes. The model weights are available on HuggingFace under the MIT License — confirmed in the official DeepSeek-V4 repository. Self-hosting and commercial use are permitted under MIT terms. Using it via Novita AI's API requires no self-hosting at all.

Start Using DeepSeek-V4-Flash Today

DeepSeek-V4-Flash is now available via Novita AI with the full 1M context window, competitive pricing, and zero infrastructure overhead. You pick the reasoning mode; Novita handles the rest.

→ Try DeepSeek-V4-Flash backed by Novita AI

→ Novita AI LLM API documentation

DeepSeek-V4-Pro on Novita AI: 1M Context, #1 LiveCodeBench Score

Novita AI — Wed, 29 Apr 2026 14:58:41 +0000

DeepSeek-V4-Pro: 1M Context, #1 on LiveCodeBench, Open-Source Frontier

You're evaluating open-source models for a production coding agent. You need something that handles large codebases—entire repos, not just single files—and actually resolves GitHub issues without hallucinating tool calls. Every model you try either falls apart beyond 128K tokens or lags behind GPT-4o on the benchmarks that matter for real engineering tasks.

TL;DR: DeepSeek-V4-Pro is a 1.6T-parameter open-source MoE model delivering #1 LiveCodeBench score (93.5) and 1M-token context. Available now via Novita AI.

DeepSeek-V4-Pro changes this calculus. It's a 1.6-trillion-parameter MoE model with a true 1M-token context window, the highest published score on LiveCodeBench (93.5 Pass@1), and Codeforces Rating 3206—both #1 among all evaluated models including closed frontier APIs. In short: it's the best open-source model available today for competitive coding and large-context agentic tasks, released under MIT license. As of today, it's available via Novita AI.

Try DeepSeek-V4-Pro Now →

What Is DeepSeek-V4-Pro?

DeepSeek-V4-Pro is the flagship model in DeepSeek's V4 series, released April 24, 2026. It sits above the lightweight DeepSeek-V4-Flash (284B total / 13B active) and is positioned as a preview of DeepSeek's current frontier capabilities—what they describe as the "best open-source model available today" for knowledge and coding. The model is trained on over 32 trillion tokens and fine-tuned through a two-stage pipeline: domain-expert SFT + GRPO reinforcement learning, followed by on-policy distillation. The full technical details are in DeepSeek's paper DeepSeek-V4: Towards Highly Efficient Million-Token Context Intelligence.

Key specs at a glance:

Architecture: Mixture-of-Experts (MoE) with Hybrid Attention — Compressed Sparse Attention (CSA) + Heavily Compressed Attention (HCA)
Parameters: 1.6T total / 49B activated per forward pass
Context window: 1,048,576 tokens (1M)
Precision: FP4 (MoE experts) + FP8 mixed
Reasoning modes: Non-think (fast), Think (standard CoT), Max (maximum reasoning budget)
Capabilities: Function calling, structured outputs, reasoning, 1M-context retrieval
License: MIT

Key Features

Hybrid Attention for Efficient 1M-Token Context

Most models claiming "long context" either truncate silently or degrade sharply beyond 128K tokens. DeepSeek-V4-Pro's Hybrid Attention Architecture—combining Compressed Sparse Attention (CSA) and Heavily Compressed Attention (HCA) alongside Manifold-Constrained Hyper-Connections (mHC)—is designed from the ground up for efficient million-token processing. In practice: MRCR 1M scores 83.5 (memory recall across 1M context) and CorpusQA 1M hits 62.0, both while maintaining coherent reasoning over the full window. For agents that need to ingest an entire codebase, a day's worth of logs, or a book-length document in a single call, this is the architecture that makes it viable without specialized infrastructure.

#1 on LiveCodeBench and Codeforces — The Coding Model That Actually Competes

DeepSeek-V4-Pro scores 93.5 on LiveCodeBench (Pass@1) and 3206 on Codeforces Rating—both the highest published scores in the comparison table, beating Claude Opus 4.6 Max (88.8 / no rating), Gemini 3.1 Pro High (91.7 / 3052), and GPT-5.4 xHigh (no LCB score / 3168). On SWE-Verified (real-world GitHub issue resolution), it hits 80.6, on par with Claude Opus 4.6 Max (80.8) and Gemini 3.1 Pro (80.6). For teams building coding agents where "can it actually fix the bug" matters more than theoretical MMLU scores, V4-Pro is the open-source option that directly competes with closed frontier APIs.

Three Reasoning Modes — Match Compute to the Task

DeepSeek-V4-Pro exposes three inference modes through the same API endpoint:

Non-think: No chain-of-thought. Fast, low latency—suitable for classification, extraction, structured output tasks where reasoning overhead is wasteful.
Think: Standard CoT reasoning. The default for coding, math, and multi-step tasks.
Max (V4-Pro Max): Extended reasoning budget. Use when accuracy matters more than speed—complex proofs, hard competitive programming problems, deep debugging sessions.

All three modes are accessible via the deepseek/deepseek-v4-pro model ID on Novita AI. Switching between them is a prompt-level instruction, not a different endpoint, which means you can implement adaptive mode selection in your application without changing API config.

Agentic and Tool Use Performance

Beyond coding benchmarks, V4-Pro holds its own on agentic evaluations. BrowseComp: 83.4 (vs Claude Opus 83.7, Gemini 85.9—within 2.5 points of the frontier). MCPAtlas Public: 73.6, second only to Claude Opus 4.6 (73.8). Toolathlon: 51.8, third overall. These aren't "leads all models" results, but they confirm that V4-Pro is a capable general-purpose agentic model, not just a benchmark-optimized coding specialist. Combined with native function calling support, it's a practical choice for agents that need to browse, call tools, and reason in a single session.

Benchmark Performance

The table below covers the benchmarks from DeepSeek's official comparison. "V4-Pro" refers to the DeepSeek-V4-Pro Max (extended reasoning) mode—the same model accessible via the deepseek/deepseek-v4-pro API ID on Novita.

DeepSeek-V4-Pro performance across coding, reasoning, and agentic benchmarks. [Source: DeepSeek HuggingFace]

Benchmark	DeepSeek-V4-Pro	Claude Opus 4.6	Gemini 3.1 Pro	GPT-5.4
LiveCodeBench (Pass@1)	93.5 ✓	88.8	91.7	—
Codeforces Rating	3206 ✓	—	3052	3168
SWE-Verified	80.6	80.8	80.6	—
SWE Pro	55.4	57.3	54.2	57.7
BrowseComp	83.4	83.7	85.9	82.7
MCPAtlas Public	73.6	73.8	69.2	67.2
GPQA Diamond	90.1	91.3	94.3	93.0
HLE (Pass@1)	37.7	40.0	44.4	39.8
IMOAnswerBench	89.8	75.3	81.0	91.4
HMMT 2026 Feb	95.2	96.2	94.7	97.7
MRCR 1M (MMR)	83.5	92.9	76.3	—
CorpusQA 1M	62.0	71.7	53.8	—
Terminal Bench 2.0	67.9	65.4	68.5	75.1

✓ = highest published score in this comparison. Last verified: 2026-04-25. Scores reflect "Max" / extended reasoning mode where applicable. Source: DeepSeek HuggingFace model card.

Honest read: On knowledge benchmarks (GPQA Diamond, HLE), Gemini 3.1 Pro and GPT-5.4 are clearly ahead. V4-Pro's edge is in coding, where LiveCodeBench and Codeforces are unambiguous #1 scores, and in long-context retrieval relative to other open-source models. For math reasoning, the picture is mixed: on IMOAnswerBench, V4-Pro (89.8) beats Claude Opus 4.6 (75.3) and Gemini 3.1 Pro (81.0) but narrowly trails GPT-5.4 (91.4), and on HMMT 2026 it also trails GPT-5.4 (95.2 vs 97.7).

How to Use DeepSeek-V4-Pro on Novita AI

Option 1: Playground (No Code)

Test directly at novita.ai/models/model-detail/deepseek-deepseek-v4-pro. No API key required to explore. Set the system prompt to activate Think or Non-think mode.

Option 2: API (Python)

from openai import OpenAI

client = OpenAI(
    base_url="https://api.novita.ai/v3/openai",
    api_key="YOUR_NOVITA_API_KEY",
)

# Standard (Think mode)
response = client.chat.completions.create(
    model="deepseek/deepseek-v4-pro",
    messages=[
        {"role": "user", "content": "Implement a Rust async runtime from scratch."}
    ],
)
print(response.choices[0].message.content)

Get your API key at novita.ai/settings. The same model ID works for all three reasoning modes—pass mode instructions in the system prompt or use DeepSeek's documented mode-switching syntax.

Option 3: Third-Party Tools

Since Novita AI is OpenAI-API-compatible, you can drop in deepseek/deepseek-v4-pro as the model ID in Cursor (custom OpenAI provider), Claude Code-compatible setups, LangChain, LlamaIndex, or any OpenAI SDK-based framework. Just point base_url to https://api.novita.ai/v3/openai.

curl https://api.novita.ai/v3/openai/chat/completions \
  -H "Authorization: Bearer YOUR_NOVITA_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{"model":"deepseek/deepseek-v4-pro","messages":[{"role":"user","content":"Implement a Rust async runtime."}]}'

Use Cases

Full-codebase analysis and refactoring: With 1M-token context, you can pass an entire medium-sized repository in one call. Ask V4-Pro to find architectural issues, generate migration guides, or refactor patterns across 50+ files simultaneously—without chunking or retrieval hacks.

Competitive programming and hard algorithm problems: Codeforces Rating 3206 puts V4-Pro in the top tier for algorithmic problem solving. Use it for generating solutions to competitive programming challenges, verifying complexity proofs, or stress-testing edge cases in production algorithms.

GitHub issue resolution agents: SWE-Verified 80.6 places V4-Pro on par with Claude Opus 4.6 on real-world bug fixing. Combined with function calling and long context, it can read issue descriptions, browse code history, and generate patches without losing track across large repos.

Long-document reasoning: Legal contracts, research papers, technical specifications, audit logs—V4-Pro's 1M context means you're not forced to summarize or chunk before analysis. CorpusQA 1M (62.0) and MRCR 1M (83.5) confirm retrieval accuracy holds at full context length.

Math and science tutoring / problem generation: IMOAnswerBench 89.8 (beats all closed models except GPT-5.4's 91.4) makes V4-Pro a strong choice for generating competition-level math problems, verifying proofs, or building STEM education tools where mathematical reasoning is the bottleneck.
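For the full-codebase use case above, it helps to check whether a repository is likely to fit in the 1M-token window before sending it. The sketch below is a rough pre-flight estimate, not a real token count: the 4-characters-per-token ratio, the file extensions, and the output reserve are all assumptions, and the model's actual tokenizer will give different numbers.

```python
import os

CONTEXT_WINDOW_TOKENS = 1_048_576  # context size stated in this article
CHARS_PER_TOKEN = 4  # assumption: rough heuristic, not the model's real tokenizer


def estimate_tokens(text: str) -> int:
    """Estimate a token count from a fixed characters-per-token ratio."""
    return len(text) // CHARS_PER_TOKEN + 1


def estimate_repo_tokens(root: str, extensions=(".py", ".rs", ".ts", ".md")) -> int:
    """Walk a directory and sum the estimated tokens of matching files."""
    total = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            if name.endswith(extensions):
                path = os.path.join(dirpath, name)
                with open(path, encoding="utf-8", errors="ignore") as handle:
                    total += estimate_tokens(handle.read())
    return total


def fits_in_context(token_count: int, reserve_for_output: int = 32_000) -> bool:
    """Check that the prompt leaves room for the reply (the reserve is an assumption)."""
    return token_count + reserve_for_output <= CONTEXT_WINDOW_TOKENS
```

Calling fits_in_context(estimate_repo_tokens("path/to/repo")) gives a quick yes/no before you assemble the prompt. If the estimate lands near the limit, treat it as a "no" until you have measured with a real tokenizer.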

Pricing

Model	Input ($/M tokens)	Cache Read ($/M tokens)	Output ($/M tokens)
DeepSeek-V4-Pro (Novita)	$1.74	$0.145	$3.48
DeepSeek-V4-Flash (Novita)	$0.10	—	$0.50
Claude Opus 4.6 (Anthropic)	$15.00	$1.50	$75.00
Gemini 3.1 Pro (Google)	$1.25	$0.31	$10.00
GPT-5.4 (OpenAI)	$10.00	$2.50	$40.00

Last verified: 2026-04-25. Novita pricing from novita.ai/pricing. Competitor pricing: Claude from anthropic.com (unverified), Gemini from ai.google.dev (unverified), GPT-5.4 from platform.openai.com (unverified).

Via Novita AI, V4-Pro is roughly 8× cheaper than Claude Opus 4.6 for input tokens, and 21× cheaper for output. Compared to Gemini 3.1 Pro, V4-Pro's input price is somewhat higher ($1.74 vs $1.25 per million tokens), but its output is 2.9× cheaper. For coding agents with long context and multi-turn sessions, where output tokens dominate costs, the gap compounds fast.
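To see how these per-token prices translate into a bill, here is a small sketch that prices a single agent run from the table above. The workload size (2M input tokens, 500K output tokens) is a made-up example, and cache-read pricing is left out for simplicity.

```python
# Prices in USD per 1M tokens, copied from the pricing table above.
PRICES = {
    "DeepSeek-V4-Pro (Novita)": {"input": 1.74, "output": 3.48},
    "Claude Opus 4.6": {"input": 15.00, "output": 75.00},
    "Gemini 3.1 Pro": {"input": 1.25, "output": 10.00},
    "GPT-5.4": {"input": 10.00, "output": 40.00},
}


def run_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """USD cost of one run, ignoring cache reads."""
    price = PRICES[model]
    return (input_tokens * price["input"] + output_tokens * price["output"]) / 1_000_000


def price_ratio(model: str, baseline: str, kind: str) -> float:
    """How many times more expensive `model` is than `baseline` for input or output."""
    return PRICES[model][kind] / PRICES[baseline][kind]


if __name__ == "__main__":
    # Hypothetical agent run: 2M input tokens, 500K output tokens.
    for name in PRICES:
        print(f"{name}: ${run_cost(name, 2_000_000, 500_000):.2f}")
```

For that hypothetical run, V4-Pro comes to $5.22 and Claude Opus 4.6 to $67.50. Swap in your own token counts to see where the difference matters for your workload.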

Migrating from DeepSeek-V3 or DeepSeek-R1

If you're currently running DeepSeek-V3 or R1 on Novita, upgrading to V4-Pro is a one-line model ID change. The API is OpenAI-compatible, same endpoint, same request format. V4-Pro's three reasoning modes give you the flexibility to replicate both V3 (non-think mode) and R1-style deep reasoning (Max mode) from a single model—without maintaining separate deployments. If you're migrating from another provider's model (GPT-4o, Claude 3.5, etc.), point your existing OpenAI SDK client to base_url="https://api.novita.ai/v3/openai" and swap the model ID.

Conclusion

Bottom line: DeepSeek-V4-Pro is the strongest open-source model available for coding tasks, with the top scores in this comparison on LiveCodeBench and Codeforces, and it's the only model in its tier that handles a genuine 1M-token context window. It doesn't lead every benchmark (Gemini 3.1 Pro holds the edge on knowledge recall, and Claude Opus 4.6 leads on long-context retrieval), but for teams building coding agents, fixing GitHub issues at scale, or processing massive documents, V4-Pro delivers frontier-class performance at a fraction of closed-model API costs. It is available now on Novita AI, alongside 200+ model APIs and OpenAI-compatible infrastructure.

Try DeepSeek-V4-Pro via Novita AI →

FAQ

What is DeepSeek-V4-Pro?

DeepSeek-V4-Pro is a 1.6-trillion-parameter Mixture-of-Experts language model from DeepSeek AI, released April 2026. It activates 49B parameters per forward pass, supports 1,048,576 tokens of context, and currently leads all publicly evaluated models on LiveCodeBench (93.5) and Codeforces Rating (3206). It's available under the MIT license and via Novita AI.

How do I access DeepSeek-V4-Pro via API?

Use model ID deepseek/deepseek-v4-pro with base_url="https://api.novita.ai/v3/openai" and your Novita API key from novita.ai/settings. The endpoint is OpenAI SDK-compatible—no custom SDK required.

How does DeepSeek-V4-Pro compare to Claude Opus 4.6 and Gemini 3.1 Pro?

V4-Pro leads on coding: LiveCodeBench 93.5 (vs Opus 4.6 88.8, Gemini 91.7) and Codeforces 3206 (vs Gemini 3052). On knowledge benchmarks like GPQA Diamond and HLE, Gemini 3.1 Pro leads. On long-context retrieval (MRCR 1M), Claude Opus leads. V4-Pro is the best open-source choice for coding-heavy and agentic workloads—closed models maintain edges in raw factual recall.

What is DeepSeek-V4-Pro's context window?

1,048,576 tokens (1M). The model is specifically architected for long-context efficiency using Hybrid Attention (CSA + HCA). MRCR 1M scores 83.5 and CorpusQA 1M hits 62.0, confirming usable retrieval accuracy at full context length.

How much does DeepSeek-V4-Pro cost on Novita AI?

$1.74/M input tokens, $3.48/M output tokens, $0.145/M cache read. This makes it approximately 8× cheaper than Claude Opus 4.6 for input and 21× cheaper for output. Last verified: 2026-04-25.

Ling-2.6-1T: The 1T Model That Skips the Reasoning Tax

Novita AI — Wed, 29 Apr 2026 14:49:19 +0000

Most capable open-source models make you choose: raw intelligence or token efficiency. Thinking models burn 3–5× more tokens per request. Smaller non-reasoning models cut costs but cap capability. Ling-2.6-1T is built to break that tradeoff.

Ling-2.6-1T is a trillion-scale comprehensive flagship model from Ant Group (inclusionAI), designed for immediate task execution. Built on MLA + Hybrid Linear Attention architecture, it achieves a superior intelligence-to-token ratio: strong benchmark performance with minimal output token overhead. On AIME26, it significantly outperforms other non-thinking models. On agent execution benchmarks — SWE-bench Verified, BFCLv4, TAU2-Bench, Claw-Eval — it reaches open-source SOTA. Now exclusively backed by Novita AI as the inference provider.

In short: Ling-2.6-1T delivers comprehensive frontier capability for agent workloads — complex reasoning, tool use, multi-step execution, and long-context instruction following — at a fraction of the token cost of thinking models.

TL;DR: Ling-2.6-1T is a 1-trillion-parameter open-source MoE model from Ant Group, available via Novita AI API. Best for agent workloads requiring frontier capability (SWE-bench SOTA) without thinking-model token overhead. Non-reasoning architecture keeps output tokens lean while matching reasoning-model benchmark performance.

Try Ling-2.6-1T backed by Novita AI

What Is Ling-2.6-1T?

Ling-2.6-1T is the latest flagship model from inclusionAI, the AI research arm of Ant Group (AntLingAGI). It’s a 1-trillion-parameter Mixture-of-Experts model — the largest FP8-trained foundation model released to date — trained on 20T+ high-quality tokens with over 40% reasoning-dense data in later stages.

Unlike thinking models (DeepSeek-R1, QwQ) that output long chain-of-thought traces before answering, Ling-2.6-1T uses a “fast thinking” mechanism: it internalizes reasoning without externalizing verbose thought chains. This keeps token output lean while maintaining strong analytical depth. ~50B parameters activate per token, making inference practical at 1T scale.

Architecture: MLA + Hybrid Linear Attention, 1T total parameters, ~50B active params per token
Context window: 262,144 tokens (via YaRN rope scaling), max output 32,768 tokens
Training: FP8 mixed-precision, 20T+ tokens, >40% reasoning-dense data
Paradigm: Fast-thinking — internalized reasoning, no verbose chain-of-thought output
License: MIT — fully open weights
Availability: Exclusively backed by Novita AI (OpenRouter provider)

Key Features: Why Ling-2.6-1T Stands Out

Superior Intelligence-to-Token Ratio

Thinking models produce impressive results but inflate your token bill — hundreds of reasoning tokens before the actual answer. Ling-2.6-1T was trained with Evolutionary Chain-of-Thought (Evo-CoT) in mid-training, internalizing reasoning rather than externalizing it. The result: strong benchmark scores on AIME26 (outperforming other non-thinking models), LiveCodeBench, and Omni-MATH — without paying for the thought process. Per the official model card, it achieves intelligence-output efficiency on par with GPT-5.4 (Non-Reasoning), representing a major leap over its predecessor Ling-1T. For high-throughput production workloads, this directly reduces cost.

Open-Source SOTA on Agent Execution

Agent workloads require more than math and coding in isolation — they require tool use, multi-step execution, and reliable instruction following under real-world conditions. Ling-2.6-1T reaches open-source SOTA across the key agent benchmarks (per inclusionAI model card):

SWE-bench Verified — real-world software engineering task resolution
BFCLv4 — Berkeley Function-Calling Leaderboard v4, complex tool-use
TAU2-Bench — long-horizon agentic task completion
Claw-Eval — multi-turn command execution
PinchBench — composite agent capability evaluation

On LiveCodeBench (Aug 2024–May 2025), it scores 61.68, outperforming DeepSeek-V3.1 (48.02), Kimi-K2-0905 (48.95), and GPT-5-main (48.57) by more than 12 points. For front-end generation, its ArtifactsBench score is 59.31, second only to Gemini-2.5-Pro(lowthink) at 60.28 in this comparison group (per inclusionAI model card).

Long Context + Instruction Following

With 262,144-token context (YaRN rope scaling), Ling-2.6-1T can hold entire codebases, long documents, or extended multi-turn agent conversations in a single call. On the MRCR benchmark (16K–256K context range), it consistently maintains retrieval accuracy — a critical requirement for agent pipelines that process long tool outputs or document corpora. IFBench score is 56.9%, demonstrating strong complex instruction-following under extended context.

Benchmark Performance

Independent measurements from Artificial Analysis place Ling-2.6-1T at an Intelligence Index of 33.6 — better than 73% of 495 measured models, and #2 in the open-weights large non-reasoning class. Below are self-reported scores from the inclusionAI model card (comparing against DeepSeek-V3.1-terminus, Kimi-K2-0905, GPT-5-main, and Gemini-2.5-Pro(lowthink)), followed by independently verified AA scores.

Math & Reasoning (per inclusionAI model card)

Benchmark	Ling-2.6-1T	DeepSeek-V3.1	Kimi-K2-0905	GPT-5-main	Gemini-2.5-Pro*
AIME26	70.42	55.21	50.16	59.43	70.10
Omni-MATH	74.46	64.77	62.42	61.09	72.02
OptMATH	57.68	35.99	35.84	39.16	42.77
FinanceReasoning	87.45	86.44	84.83	86.28	86.65
BBEH	47.34	42.86	34.83	39.75	29.08
KOR-Bench	76.00	73.76	73.20	70.56	59.68
ARC-AGI-1	43.81	14.69	22.19	14.06	18.94

*Gemini-2.5-Pro(lowthink). Source: inclusionAI model card. Last verified: 2026-04-24.

Code Performance (per inclusionAI model card)

Benchmark	Ling-2.6-1T	DeepSeek-V3.1	Kimi-K2-0905	GPT-5-main	Gemini-2.5-Pro*
LiveCodeBench	61.68	48.02	48.95	48.57	45.43
MultiPL-E	77.91	77.68	73.54	76.66	71.48
CodeForces Rating	1901	1582	1574	1120	1675
FullStack Bench	56.55	55.48	54.00	50.92	48.19
ArtifactsBench	59.31	43.29	44.87	41.04	60.28
Aider Code Editing	83.65	88.16	85.34	84.40	89.85

*Gemini-2.5-Pro(lowthink). Source: inclusionAI model card. Last verified: 2026-04-24. Note: model version names (e.g. "gpt-5-main", "DeepSeek-V3.1-terminus") are as reported by inclusionAI and may not correspond to publicly released versions.

Agent Execution Benchmarks (per inclusionAI model card)

Ling-2.6-1T reaches open-source SOTA across agent-specific evaluations. Exact competitor scores are not published for all benchmarks; results listed as reported in the official model card.

Benchmark	What It Measures	Ling-2.6-1T
SWE-bench Verified	Real-world GitHub issue resolution	Open-source SOTA
BFCLv4	Complex multi-step function/tool calling	Open-source SOTA
TAU2-Bench	Long-horizon agent task completion	Open-source SOTA
Claw-Eval	Multi-turn command execution	Open-source SOTA
PinchBench	Composite agent capability	Open-source SOTA
IFBench	Complex instruction following	56.9%

Source: inclusionAI model card. "Open-source SOTA" as claimed by inclusionAI; independent per-score data not yet available. Last verified: 2026-04-24.

Independent Benchmarks (Artificial Analysis)

Metric	Ling-2.6-1T	Notes
AA Intelligence Index	33.6	Better than 73% of 495 models
AA Coding Index	33.0	Better than 78% of models
AA Agentic Index	48.2	Better than 80% of models
GPQA Diamond	75.2%	Graduate-level scientific reasoning
τ²-Bench Telecom	89.8%	Conversational agent tasks
IFBench	56.9%	Instruction-following
Output Speed	67.7 tok/s	Via Novita AI on OpenRouter

Source: Artificial Analysis. Last verified: 2026-04-24.

How to Use Ling-2.6-1T on Novita AI

Option 1: Playground (No Code)

Try the model instantly at novita.ai/models/model-detail/inclusionai-ling-2.6-1t — no setup required. Useful for quickly testing prompts before integrating into your app.

Option 2: API (Python)

Ling-2.6-1T is fully OpenAI-compatible. Swap in your Novita API key and the model ID:

from openai import OpenAI

client = OpenAI(
    base_url="https://api.novita.ai/v3/openai",
    api_key="YOUR_NOVITA_API_KEY",
)

response = client.chat.completions.create(
    model="inclusionai/ling-2.6-1t",
    messages=[{"role": "user", "content": "Your prompt here"}],
    temperature=0.7,
    top_p=0.95,
)

print(response.choices[0].message.content)

Get your API key at novita.ai/settings. The model also supports streaming, function calling via tool_use, and structured output.
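As a rough illustration of function calling, the sketch below builds the request arguments in the OpenAI Chat Completions format, where tool definitions go in the tools parameter. The get_weather tool is hypothetical, and this assumes Novita's endpoint accepts the same schema as OpenAI's, so check the Novita docs before relying on it.

```python
import json

# Hypothetical tool: the name, description, and parameters are illustrative only.
get_weather_tool = {
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Look up the current weather for a city.",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}


def build_request(prompt: str, stream: bool = False) -> dict:
    """Build keyword arguments for client.chat.completions.create(**kwargs)."""
    return {
        "model": "inclusionai/ling-2.6-1t",
        "messages": [{"role": "user", "content": prompt}],
        "tools": [get_weather_tool],
        "stream": stream,
    }


def parse_tool_arguments(raw_arguments: str) -> dict:
    """Tool-call arguments arrive as a JSON string in the OpenAI format."""
    return json.loads(raw_arguments)
```

Pass the result to the client from the snippet above with client.chat.completions.create(**build_request("What's the weather in Paris?")). If the model decides to call the tool, the arguments string on the returned tool call can be decoded with parse_tool_arguments.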

Option 3: Third-Party Tools

Since Novita AI is OpenAI-compatible, Ling-2.6-1T works with any tool that accepts a custom base URL — including Cursor, Claude Code, OpenWebUI, LangChain, and LlamaIndex. Set base URL to https://api.novita.ai/v3/openai and model to inclusionai/ling-2.6-1t.

Use Cases

Ling-2.6-1T’s combination of 1T-parameter capacity, fast-thinking paradigm, and 262K context makes it a strong fit for:

Coding Agents: With a CodeForces rating of 1901 and strong LiveCodeBench scores, it handles competitive-level programming tasks. Pair it with Novita’s Agent Sandbox for fully isolated code execution without managing infrastructure.
Financial Analysis: 87.45 on FinanceReasoning (#1 in its comparison group per inclusionAI model card) makes it suitable for automated report analysis, earnings summarization, and quantitative research workflows.
Front-End Generation: The Hybrid Syntax–Function–Aesthetics reward in training specifically targets UI code quality. ArtifactsBench score of 59.31 is the second-highest in its comparison group — only 0.97 points behind Gemini-2.5-Pro(lowthink).
Long-Document Processing: 262,144-token context handles multi-hundred-page documents, full repository analysis, or extended legal/research corpora in a single call.
High-Volume Production APIs: Non-reasoning paradigm means predictable token counts and lower latency variance — important when you’re running thousands of requests per day.

Migrating From DeepSeek V3 or Kimi K2?

If you’re currently using DeepSeek V3 or Kimi K2 via another provider, switching to Ling-2.6-1T backed by Novita AI is a one-line change — same OpenAI-compatible API, same request format. The model ID becomes inclusionai/ling-2.6-1t.

On coding tasks, Ling-2.6-1T outperforms both DeepSeek-V3.1 and Kimi-K2-0905 on LiveCodeBench (61.68 vs 48.02 and 48.95), and on math reasoning it leads both on AIME26 and OptMATH. If your workloads are reasoning-heavy but you don’t want chain-of-thought verbosity, this is the cleaner upgrade path versus switching to a thinking model.

Pricing

Model	Input ($/1M tokens)	Output ($/1M tokens)	Context
Ling-2.6-1T (Novita AI)	$0.30	$2.50	262,144
DeepSeek V3.2	$0.28	$0.42	128K
Qwen3-235B-A22B	$0.455	$1.82	131K
Kimi K2 (OpenRouter)	$0.57	$2.30	131K

Novita AI pricing via novita.ai. Competitor pricing via OpenRouter. Last verified: 2026-04-24.

Ling-2.6-1T’s output pricing ($2.50/M) is higher than DeepSeek V3.2 — the tradeoff is meaningfully stronger benchmark performance on reasoning and coding tasks. If token cost per call is the primary constraint, Ling-2.6-flash (104B params, 7.4B active) is the cheaper sibling and also exclusively available via Novita AI.
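Whether the higher output price pays off depends on how many extra tokens the alternative emits. The sketch below is one way to reason about it: it prices a single request and computes the output-token multiplier at which a cheaper-per-token but more verbose model costs the same. The request size (2,000 input tokens, 500 output tokens) is a made-up example, and using DeepSeek V3.2's table prices as the stand-in for the verbose model is an assumption for illustration only.

```python
def request_cost(input_tokens, output_tokens, input_price, output_price, output_multiplier=1.0):
    """USD cost of one request; prices are per 1M tokens."""
    billed_output = output_tokens * output_multiplier
    return (input_tokens * input_price + billed_output * output_price) / 1_000_000


def break_even_multiplier(input_tokens, output_tokens, lean, verbose):
    """Output multiplier at which the verbose model costs as much as the lean one.

    `lean` and `verbose` are (input_price, output_price) tuples in USD per 1M tokens.
    """
    lean_cost = request_cost(input_tokens, output_tokens, *lean)
    verbose_input_cost = input_tokens * verbose[0] / 1_000_000
    verbose_output_unit = output_tokens * verbose[1] / 1_000_000
    return (lean_cost - verbose_input_cost) / verbose_output_unit


LING_2_6_1T = (0.30, 2.50)    # from the pricing table above
DEEPSEEK_V3_2 = (0.28, 0.42)  # from the pricing table above
```

With these prices and that example request, break_even_multiplier(2000, 500, LING_2_6_1T, DEEPSEEK_V3_2) comes out at about 6.1, so the comparison is sensitive to both the price gap and the real verbosity of the model you are replacing. Plug in your own prices and measured token counts.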

Free tier: Ling-2.6-1T is available for free via the inclusionai/ling-2.6-1t:free endpoint on OpenRouter, exclusively provided by Novita AI. This free window is time-limited — check current availability at openrouter.ai/inclusionai/ling-2.6-1t:free.

Conclusion

Bottom line: Ling-2.6-1T is currently the strongest open-weight non-reasoning model for competitive math and coding benchmarks, and the strongest open-source option if you need 262K context without paying for chain-of-thought verbosity. It’s not the cheapest option per token, but for complex reasoning tasks where thinking models would inflate your bill, it’s the most practical frontier open-source alternative available today.

Exclusively backed by Novita AI — the only provider offering both Ling-2.6-1T and Ling-2.6-flash on OpenRouter — you get a stable inference endpoint, 99.9% uptime, and OpenAI-compatible API without managing the 32-GPU minimum deployment yourself.

Get Started with Ling-2.6-1T

FAQ

What is Ling-2.6-1T?

Ling-2.6-1T is a 1-trillion-parameter Mixture-of-Experts language model developed by Ant Group (inclusionAI). It activates roughly 50B parameters per token, supports a 262,144-token context window, and is designed as a fast-thinking, non-reasoning model — strong benchmark performance without chain-of-thought overhead. MIT-licensed and fully open weights.

How do I access Ling-2.6-1T via API?

Set base_url="https://api.novita.ai/v3/openai" and model="inclusionai/ling-2.6-1t" in any OpenAI-compatible client. Get your API key at novita.ai/settings. It’s also accessible via OpenRouter using the same model ID.

How does Ling-2.6-1T compare to DeepSeek V3?

On self-reported benchmarks (inclusionAI model card), Ling-2.6-1T outperforms DeepSeek-V3.1 on AIME26 (70.42 vs 55.21), LiveCodeBench (61.68 vs 48.02), and ARC-AGI-1 (43.81 vs 14.69). DeepSeek V3.2 scores higher on the Artificial Analysis Intelligence Index (42 vs 34), but Ling-2.6-1T offers a larger context window (262K vs 128K) at similar pricing ($0.30/M input).

What is Ling-2.6-1T’s context window?

262,144 tokens (extended from 128K native via YaRN rope scaling). Maximum output length is 32,768 tokens.

Is Ling-2.6-1T free to use?

Yes, temporarily. The inclusionai/ling-2.6-1t:free endpoint on OpenRouter is provided exclusively by Novita AI. The free window is time-limited. The paid tier via Novita AI is $0.30/M input and $2.50/M output tokens.

How to Use Kimi-K2 in Claude Code on Windows and Mac

Novita AI — Tue, 15 Jul 2025 09:46:32 +0000

Claude Code offers more powerful agent capabilities than traditional code editors like Cursor. By integrating Kimi-K2 through Novita AI’s platform, developers can access enterprise-grade AI functionality at a fraction of the cost. This guide covers setting up Kimi-K2 with Claude Code on both Windows and Mac systems.

What is Claude Code

Source: http://www.anthropic.com/claude-code

Claude Code is an agentic command-line tool for delegating coding tasks to AI. It offers more powerful agent abilities than traditional code editors such as Cursor.

This innovative tool enables developers to delegate complex coding tasks directly from their terminal. It transforms natural language descriptions into fully functional code, making it an indispensable asset for modern development workflows.

The tool operates as an interactive session where developers can describe their requirements in plain English. Claude Code intelligently generates, modifies, and optimizes code accordingly. Its advanced understanding of context and project structure allows it to make informed decisions about code architecture, dependencies, and implementation patterns.

Why Use Kimi-K2 in Claude Code

Kimi-K2 presents a compelling alternative to traditional Claude models, offering similar capabilities at significantly reduced costs. The economic advantages are substantial:

Kimi-K2 on Novita AI: $0.57 per 1M input tokens and $2.3 per 1M output tokens
Claude Sonnet: $3 per 1M input tokens and $15 per 1M output tokens

This represents an 81% cost reduction for input tokens and an 85% reduction for output tokens.
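The percentages above follow directly from the listed prices. A quick sketch of the arithmetic, in case you want to rerun it with updated pricing:

```python
def percent_reduction(old_price: float, new_price: float) -> float:
    """Percentage saved when moving from old_price to new_price."""
    return (old_price - new_price) / old_price * 100


# Prices in USD per 1M tokens, as listed above.
input_saving = percent_reduction(3.00, 0.57)    # about 81%
output_saving = percent_reduction(15.00, 2.30)  # about 85%

print(f"Input: {input_saving:.0f}% cheaper, output: {output_saving:.0f}% cheaper")
```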

Beyond cost savings, Kimi-K2 through Novita AI provides an anthropic-compatible LLM API with higher rate limits than official channels. This compatibility ensures seamless integration with existing Claude Code workflows while offering improved performance and reliability.

The combination delivers enterprise-grade AI capabilities without the premium pricing. This makes advanced AI development accessible to a broader range of developers and organizations.

Getting Your API Key on Novita AI

Sign up for a Novita AI account to get started with free trial credits. Navigate to the Key Management page in your dashboard and click “Create New Key.”

Copy the generated API key immediately and store it securely – it won’t be displayed again. You’ll need this key for the configuration steps below.

Installing Claude Code

Before installing Claude Code, ensure your system meets the minimum requirements. Node.js 18 or higher must be installed on your local environment. You can verify your Node.js version by running node --version in your terminal.
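If you'd rather script this check, for example in a team setup script, a minimal Python sketch could look like the following. The function names are illustrative, and it assumes node --version prints a string like v20.11.1.

```python
import re
import shutil
import subprocess

MIN_NODE_MAJOR = 18  # minimum stated in this guide


def parse_node_major(version_output: str) -> int:
    """Extract the major version from output such as 'v20.11.1'."""
    match = re.match(r"v?(\d+)\.", version_output.strip())
    if match is None:
        raise ValueError(f"Unrecognized Node.js version string: {version_output!r}")
    return int(match.group(1))


def node_is_supported() -> bool:
    """Return True when a local `node` binary reports version 18 or higher."""
    node_path = shutil.which("node")
    if node_path is None:
        return False  # Node.js is not on PATH
    result = subprocess.run(
        [node_path, "--version"], capture_output=True, text=True, check=True
    )
    return parse_node_major(result.stdout) >= MIN_NODE_MAJOR
```

Call node_is_supported() before running the install commands below. A False result means Node.js is either missing or older than 18.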

For Windows

Open Command Prompt and execute the following commands:

cmd

npm install -g @anthropic-ai/claude-code

npx win-claude-code@latest

The global installation ensures Claude Code is accessible from any directory on your system. The npx win-claude-code@latest command downloads and runs the latest Windows-specific version.

For Mac

Open Terminal and run:

bash

npm install -g @anthropic-ai/claude-code

Mac users can proceed directly with the global installation without requiring additional platform-specific commands. The installation process automatically configures the necessary dependencies and PATH variables.

Setting Up Environment Variables

Environment variables configure Claude Code to use Kimi-K2 through Novita AI’s API endpoints. These variables tell Claude Code where to send requests and how to authenticate.

For Windows

Open Command Prompt and set the following environment variables:

cmd

set ANTHROPIC_BASE_URL=https://api.novita.ai/anthropic

set ANTHROPIC_AUTH_TOKEN=<Novita API Key>

set ANTHROPIC_MODEL=moonshotai/kimi-k2-instruct

set ANTHROPIC_SMALL_FAST_MODEL=moonshotai/kimi-k2-instruct

Replace <Novita API Key> with your actual API key obtained from the Novita AI platform. These variables remain active for the current session and must be reset if you close the Command Prompt.

For Mac

Open Terminal and export the following environment variables:

bash

export ANTHROPIC_BASE_URL="https://api.novita.ai/anthropic"

export ANTHROPIC_AUTH_TOKEN="<Novita API Key>"

export ANTHROPIC_MODEL="moonshotai/kimi-k2-instruct"

export ANTHROPIC_SMALL_FAST_MODEL="moonshotai/kimi-k2-instruct"
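Because these variables only live in the current terminal session, it's easy to launch Claude Code with one of them missing. The sketch below is an optional sanity check you could run first. It only confirms that the four variables from this guide are set and that no value still looks like a placeholder such as <Novita API Key>; it does not validate the key itself.

```python
import os

REQUIRED_VARS = (
    "ANTHROPIC_BASE_URL",
    "ANTHROPIC_AUTH_TOKEN",
    "ANTHROPIC_MODEL",
    "ANTHROPIC_SMALL_FAST_MODEL",
)


def missing_vars(environ=None) -> list:
    """Return the required variables that are unset or still hold a placeholder."""
    env = os.environ if environ is None else environ
    missing = []
    for name in REQUIRED_VARS:
        value = env.get(name, "").strip()
        if not value or value.startswith("<"):
            missing.append(name)
    return missing


if __name__ == "__main__":
    problems = missing_vars()
    if problems:
        print("Missing or placeholder values:", ", ".join(problems))
    else:
        print("Environment looks ready for Claude Code.")
```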

Starting Claude Code

With installation and configuration complete, you can now start Claude Code in your project directory. Navigate to your desired project location using the cd command:

cd <your-project-directory>

claude .

The dot (.) parameter instructs Claude Code to operate in the current directory. Upon startup, you’ll see the Claude Code prompt appear in an interactive session.

This indicates the tool is ready to receive your instructions. The interface provides a clean, intuitive environment for natural language programming interactions.

Building Your First Project

Claude Code excels at transforming detailed project descriptions into functional applications. After entering your prompt, press Enter to begin the task. Claude Code will analyze your requirements, create the necessary files, implement the functionality, and provide a complete project structure with documentation.

Here’s an example of how to create a Python Flask web app with an MBTI personality guessing game:

Using Claude Code in VSCode or Cursor

Claude Code integrates seamlessly with popular development environments. It enhances your existing workflow rather than replacing it.

You can use Claude Code directly in the terminal within VSCode or Cursor. This maintains access to your familiar development tools while leveraging AI assistance.

Additionally, Claude Code plugins are available for both VSCode and Cursor. These plugins provide deeper integration with these editors, offering inline AI assistance, code suggestions, and project management features directly within your IDE interface.

The terminal integration allows you to run Claude Code commands without leaving your development environment. This creates a streamlined workflow for AI-assisted development.

Help and Documentation Resources

Claude Code includes comprehensive help documentation accessible through the /help command. This command displays available commands, usage examples, and troubleshooting information.

The help system is context-aware, providing relevant information based on your current project and session state.

For additional support, Novita AI provides extensive documentation. This covers advanced configuration options, API usage patterns, and best practices.

The Anthropic documentation offers detailed information about Claude Code’s capabilities and features.

Conclusion

Kimi-K2 integration with Claude Code through Novita AI delivers enterprise-grade capabilities at significantly reduced costs. The combination transforms natural language descriptions into functional code, dramatically accelerating development workflows. Start your journey with Kimi-K2 and Claude Code today to experience the future of AI-assisted programming.

Novita AI is an AI cloud platform that offers developers an easy way to deploy AI models using our simple API, while also providing an affordable and reliable GPU cloud for building and scaling.

Access Free DeepSeek R1 0528 API Now

Novita AI — Thu, 29 May 2025 08:48:55 +0000

We’re excited to announce that DeepSeek AI’s latest model, DeepSeek R1 0528, released today, is officially available in the Novita AI Model Library. We are also the official inference partner for DeepSeek R1 0528 on Hugging Face, supporting the community in bringing advanced models to production.

For a limited time, new users can claim $10 in free credits to explore and build with DeepSeek-R1 0528’s advanced reasoning capabilities.

Here’s the current DeepSeek-R1 0528 pricing on Novita AI:

DeepSeek-R1-0528: $0.7 / M input tokens, $2.5 / M output tokens

How to Access DeepSeek R1 0528 on Novita AI

Getting started with DeepSeek R1 0528 is fast, simple, and risk-free on Novita AI. Thanks to the Referral Program, you’ll receive $10 in free credits — enough to fully explore DeepSeek R1 0528’s power, build prototypes, and even launch your first use case without any upfront cost.

Use the Playground (No Coding Required)

Instant Access: Sign up, claim your free credits, and start experimenting with DeepSeek R1 0528 and other top models in seconds.
Interactive UI: Test prompts, chain-of-thought reasoning, and visualize results in real time.
Model Comparison: Effortlessly switch between Qwen 3, Llama 4, DeepSeek, and more to find the perfect fit for your needs.

Explore DeepSeek R1 0528 Demo Now

Integrate via API (For Developers)

Seamlessly connect DeepSeek R1 0528 to your applications, workflows, or chatbots with Novita AI’s unified REST API — no need to manage model weights or infrastructure. Novita AI offers multi-language SDKs (Python, Node.js, cURL, and more) and advanced parameter controls for power users.

Option 1: Direct API Integration (Python Example)

To get started, simply use the code snippet below:

from openai import OpenAI

client = OpenAI(
    base_url="https://api.novita.ai/v3/openai",
    api_key="YOUR_NOVITA_API_KEY",  # replace with your own key from novita.ai/settings
)
model = "deepseek/deepseek-r1-0528"
stream = True # or False
max_tokens = 2048
system_content = "Be a helpful assistant"
temperature = 1
top_p = 1
min_p = 0
top_k = 50
presence_penalty = 0
frequency_penalty = 0
repetition_penalty = 1
response_format = { "type": "text" }
chat_completion_res = client.chat.completions.create(
    model=model,
    messages=[
        {
            "role": "system",
            "content": system_content,
        },
        {
            "role": "user",
            "content": "Hi there!",
        }
    ],
    stream=stream,
    max_tokens=max_tokens,
    temperature=temperature,
    top_p=top_p,
    presence_penalty=presence_penalty,
    frequency_penalty=frequency_penalty,
    response_format=response_format,
    extra_body={
      "top_k": top_k,
      "repetition_penalty": repetition_penalty,
      "min_p": min_p
    }
  )
if stream:
    for chunk in chat_completion_res:
        print(chunk.choices[0].delta.content or "", end="")
else:
    print(chat_completion_res.choices[0].message.content)

Key Features:

Unified endpoint: /v3/openai supports OpenAI’s Chat Completions API format.
Flexible controls: Adjust temperature, top-p, penalties, and more for tailored results.
Streaming & batching: Choose your preferred response mode.
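When streaming is on, you often want the full reply as one string as well as the live output. The helper below is a minimal sketch that joins the text deltas from a streamed response. The collect_stream and fake_chunk names are illustrative; fake_chunk only mimics the shape of a streamed chunk so the helper can be tried without an API call.

```python
from types import SimpleNamespace


def collect_stream(chunks) -> str:
    """Join the text deltas from a streamed chat completion into one string."""
    parts = []
    for chunk in chunks:
        if not chunk.choices:  # some chunks may carry no choices
            continue
        delta = chunk.choices[0].delta.content
        if delta:  # skip None or empty deltas
            parts.append(delta)
    return "".join(parts)


def fake_chunk(text):
    """Stand-in for a streamed chunk, used only to exercise collect_stream offline."""
    return SimpleNamespace(
        choices=[SimpleNamespace(delta=SimpleNamespace(content=text))]
    )
```

With the snippet above and stream = True, collect_stream(chat_completion_res) returns the complete reply once the stream ends.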

Option 2: Multi-Agent Workflows with OpenAI Agents SDK

Build advanced multi-agent systems by integrating Novita AI with the OpenAI Agents SDK:

Plug-and-play: Use Novita AI’s LLMs in any OpenAI Agents workflow.
Supports handoffs, routing, and tool use: Design agents that can delegate, triage, or run functions, all powered by Novita AI’s models.
Python integration: Simply point the SDK to Novita’s endpoint (https://api.novita.ai/v3/openai) and use your API key.

Connect DeepSeek R1 0528 API on Third-Party Platforms

Hugging Face: Use DeepSeek R1 0528 in Spaces, pipelines, or with the Transformers library via Novita AI endpoints.
Agent & Orchestration Frameworks: Easily connect Novita AI with partner platforms like Continue, AnythingLLM, LangChain, Dify and Langflow through official connectors and step-by-step integration guides.
OpenAI-Compatible API: Enjoy hassle-free migration and integration with tools such as Cline and Cursor, designed for the OpenAI API standard.

Novita AI is the All-in-one cloud platform that empowers your AI ambitions. Integrated APIs, serverless, GPU Instance — the cost-effective tools you need. Eliminate infrastructure, start free, and make your AI vision a reality.

Access Free DeepSeek R1 0528 API Now

Novita AI — Thu, 29 May 2025 08:42:14 +0000

For a limited time, new users can claim $10 in free credits to explore and build with DeepSeek-R1 0528’s advanced reasoning capabilities.

Here’s the current DeepSeek-R1 0528 pricing on Novita AI:

DeepSeek-R1-0528: $0.70 / M input tokens, $2.50 / M output tokens

How to Access DeepSeek R1 0528 on Novita AI

Getting started with DeepSeek R1 0528 is fast, simple, and risk-free on Novita AI. Thanks to the Referral Program, you’ll receive $10 in free credits, enough to fully explore DeepSeek R1 0528’s power, build prototypes, and even launch your first use case without any upfront cost.

Use the Playground (No Coding Required)

Instant Access: Sign up, claim your free credits, and start experimenting with DeepSeek R1 0528 and other top models in seconds.
Interactive UI: Test prompts, chain-of-thought reasoning, and visualize results in real time.
Model Comparison: Effortlessly switch between Qwen 3, Llama 4, DeepSeek, and more to find the perfect fit for your needs.

Explore DeepSeek R1 0528 Demo Now

Integrate via API (For Developers)

Seamlessly connect DeepSeek R1 0528 to your applications, workflows, or chatbots with Novita AI’s unified REST API, with no need to manage model weights or infrastructure. Novita AI offers multi-language SDKs (Python, Node.js, cURL, and more) and advanced parameter controls for power users.

Option 1: Direct API Integration (Python Example)

To get started, simply use the code snippet below:

from openai import OpenAI

client = OpenAI(
    base_url="https://api.novita.ai/v3/openai",
    api_key="<YOUR Novita AI API Key>",
)

model = "deepseek/deepseek-r1-0528"
stream = True  # or False
max_tokens = 2048
system_content = """Be a helpful assistant"""
temperature = 1
top_p = 1
min_p = 0
top_k = 50
presence_penalty = 0
frequency_penalty = 0
repetition_penalty = 1
response_format = {"type": "text"}

chat_completion_res = client.chat.completions.create(
    model=model,
    messages=[
        {
            "role": "system",
            "content": system_content,
        },
        {
            "role": "user",
            "content": "Hi there!",
        },
    ],
    stream=stream,
    max_tokens=max_tokens,
    temperature=temperature,
    top_p=top_p,
    presence_penalty=presence_penalty,
    frequency_penalty=frequency_penalty,
    response_format=response_format,
    # Parameters outside the standard OpenAI schema are passed through extra_body
    extra_body={
        "top_k": top_k,
        "repetition_penalty": repetition_penalty,
        "min_p": min_p,
    },
)

if stream:
    # Print each piece of the reply as it arrives
    for chunk in chat_completion_res:
        print(chunk.choices[0].delta.content or "", end="")
else:
    print(chat_completion_res.choices[0].message.content)

Key Features:

Unified endpoint: /v3/openai supports OpenAI’s Chat Completions API format.
Flexible controls: Adjust temperature, top-p, penalties, and more for tailored results.
Streaming & batching: Choose your preferred response mode.

Option 2: Multi-Agent Workflows with OpenAI Agents SDK

Build advanced multi-agent systems by integrating Novita AI with the OpenAI Agents SDK:

Plug-and-play: Use Novita AI’s LLMs in any OpenAI Agents workflow.
Supports handoffs, routing, and tool use: Design agents that can delegate, triage, or run functions, all powered by Novita AI’s models.
Python integration: Simply point the SDK to Novita’s endpoint (https://api.novita.ai/v3/openai) and use your API key.

Connect DeepSeek R1 0528 API on Third-Party Platforms

Hugging Face: Use DeepSeek R1 0528 in Spaces, pipelines, or with the Transformers library via Novita AI endpoints.
Agent & Orchestration Frameworks: Easily connect Novita AI with partner platforms like Continue, AnythingLLM, LangChain, Dify and Langflow through official connectors and step-by-step integration guides.
OpenAI-Compatible API: Enjoy hassle-free migration and integration with tools such as Cline and Cursor, designed for the OpenAI API standard.

Showcase

Prompt: Write a Python program that shows a ball bouncing inside a spinning hexagon. The ball should be affected by gravity and friction, and it must bounce off the rotating walls realistically.

Prompt: Build a pilot game

Prompt: Build a PDF summary web app + UI concept


Qwen 3 Now Available on Novita AI — Claim Your $10 Free Credits

Novita AI — Fri, 23 May 2025 03:30:00 +0000

We’re excited to announce a strategic partnership with SGLang, a fast serving engine for large language models and vision language models. Through this collaboration, Novita AI will provide high-performance GPU cloud resources for SGLang’s ongoing research, benchmarking, and optimization efforts.

SGLang is a leading inference engine that co-designs a structured generation language with a highly optimized runtime, enabling performance gains such as efficient RadixAttention cache reuse and zero-overhead batch scheduling for large language and vision-language models. By aligning language-level control with backend optimizations, it lets developers build complex generation workflows, multi-modal applications, and parallel inference pipelines reliably and at scale. SGLang is supported by leading institutions including NVIDIA, AMD, xAI, Oracle Cloud, Google Cloud, LinkedIn, and Cursor, alongside research groups at Stanford, the University of California, Berkeley, and the University of California, Los Angeles, evidence of strong community engagement and broad industry adoption.

“SGLang’s integration of language-level primitives with runtime optimizations demonstrates the value of aligning software and hardware to unlock new performance levels,” said Junyu Huang, Co-Founder & COO at Novita AI. “By contributing our infrastructure and expertise, we’ve already supported the development of SGLang’s first end-to-end multi-turn reinforcement learning (RL) framework and the Prism multi-large language model serving system, and remain committed to fueling its ongoing innovations for developers everywhere.”

“We’re thrilled to partner with the SGLang team,” added Junyu Huang. “Having supported their RL framework and multi-LLM serving system, we’re excited to see these achievements accelerate their work and bring powerful inference performance to applications across industries.”

Novita AI is also collaborating on SGLang’s large-scale expert parallelism project, an open-source implementation designed to approach the throughput benchmarks detailed in the official DeepSeek blog.

This collaboration reflects Novita AI’s ongoing commitment to advancing an open ecosystem of inference engines and supporting diverse research initiatives through shared infrastructure and joint development efforts.

Through collaborations with pioneering open-source projects like SGLang, Novita AI continues to advance its mission of democratizing AI, making cutting-edge inference capabilities readily available to developers worldwide.

About Novita AI

Novita AI is an AI cloud platform that helps developers easily deploy AI models through a simple API, backed by affordable and reliable GPU cloud infrastructure. By supporting open-source libraries for LLM inference and serving, Novita AI is driving the future of AI and encouraging innovation across the industry.

LLM Dedicated Endpoint on Novita AI: Custom Models, Usage-Based Pricing, and DevOps-Free Scaling

Novita AI — Wed, 14 May 2025 04:00:00 +0000

Want to ship your own fine-tuned LLMs, without babysitting GPUs or racking up idle costs?

Novita AI’s LLM Dedicated Endpoint gives you true flexibility: run your custom models, pay only for tokens used, and let Novita handle deployment and scaling.

Compared to LLM Public APIs, it’s your stack, your way. Compared to raw GPU hosting, you get predictable pricing and a pro team to keep your models running smoothly.

What is an LLM Dedicated Endpoint?

An LLM Dedicated Endpoint is your own private API for running any model you want: fine-tuned, proprietary, or mainstream. No noisy neighbors, no shared resources. Novita AI handles all the infra; you just send requests.

Key Features

Bring Your Own Model: Deploy your fine-tuned or custom LLMs.
No Idle GPU Bills: Pay only for tokens used (usage-based, not hourly).
Auto-Scales Instantly: Handles spikes, no manual scaling.
Full Isolation: Dedicated compute, your data only.
Enterprise Uptime, Low Latency: SLAs for mission-critical apps.
Zero-DevOps: Monitoring, scaling, and patching done for you.

LLM Public Endpoints vs LLM Dedicated Endpoint

Novita AI offers two LLM API flavors—pick what fits your workflow:

1. LLM Public Endpoints

What: Plug-and-play APIs for open-source models like Llama, DeepSeek, Qwen, Gemma, and more.

When to use: Prototyping, hackathons, projects with standard LLMs.

Why:
Fast to integrate
No servers or infra
Scale to production

2. LLM Dedicated Endpoint

What: Your own API for custom/fine-tuned models, including proprietary LLMs.

When to use: When you need control, privacy, or custom models (think: internal tools, production SaaS, unique data).

Why:
Private, dedicated resources
Custom SLAs and scaling
Usage-based pricing
Expert deployment and monitoring

TL;DR:

Need standard models, fast? Go Public Endpoints.

Need your own model, full control, and pro support? Go LLM Dedicated Endpoint.

Why Developers Love It

Drop-in API: Keep your code—just update the endpoint URL.
No Cloud Headaches: No need for Dockerfiles, GPU quotas, or on-call alerts.
Transparent Pricing: No surprises. Billed for tokens, with optional daily minimums.
24/7 Support: Hit a snag? Ping Novita’s support team.

How to Get Started

Ready to deploy?

Contact Novita AI Sales
Share your requirements (QPS, latency, model type)
Novita sets up your endpoint—no DevOps needed
Update your API URL and ship!

Conclusion

LLM Dedicated Endpoint on Novita AI is the dev-friendly way to run custom models with no ops, no idle GPU costs, and no guesswork. You focus on building, Novita keeps your models running—secure, scalable, and fast.

Ready to launch your own LLM? Book a Demo.

Frequently Asked Questions

How does Novita handle scaling during traffic spikes?

Resources auto-scale based on real-time demand. You’re only billed for actual usage, not reserved capacity.

Can I migrate from a Novita public API to a Dedicated Endpoint?

Yes—just update the endpoint URL. 100% API compatibility means no code changes are required.
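
One way to keep that switch to a single configuration change is to read the base URL from the environment. This is a minimal sketch under stated assumptions: `NOVITA_DEDICATED_BASE_URL` is a hypothetical variable name chosen for illustration, and the actual dedicated endpoint URL is whatever Novita provides for your deployment.

```python
import os
from typing import Mapping

PUBLIC_BASE_URL = "https://api.novita.ai/v3/openai"


def resolve_base_url(env: Mapping[str, str]) -> str:
    """Return the dedicated endpoint URL if configured, else the public one."""
    # NOVITA_DEDICATED_BASE_URL is a hypothetical name used only in this sketch.
    return env.get("NOVITA_DEDICATED_BASE_URL") or PUBLIC_BASE_URL


# Pass the result as base_url when constructing the OpenAI-compatible client.
print(resolve_base_url(os.environ))
```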

What if I need guaranteed uptime and latency?

Novita offers custom SLAs for uptime, latency, and throughput, tailored to your needs.

How is billing handled?

You pay only for tokens processed, with a minimum daily token commitment. No idle GPU bills.

Novita AI is an AI cloud platform that offers developers an easy way to deploy AI models using our simple API, while also providing the affordable and reliable GPU cloud for building and scaling.

Qwen 3 Now Available on Novita AI - Claim Your $10 Free Credits

Novita AI — Tue, 29 Apr 2025 10:09:20 +0000

Alibaba’s cutting-edge Qwen 3 large language models are now live on Novita AI’s Model API platform! Instantly access the latest Qwen3-235B-A22B, Qwen3-30B-A3B, and Qwen3-32B models, all with a 128,000-token context window and industry-leading performance.

For a limited time, new users can claim $10 in free credits to explore and build with Qwen 3.

Here’s the current Qwen 3 lineup and pricing on Novita AI:

Qwen3-235B-A22B: $0.20 / M input tokens, $0.80 / M output tokens
Qwen3-30B-A3B: $0.10 / M input tokens, $0.45 / M output tokens
Qwen3-32B: $0.10 / M input tokens, $0.45 / M output tokens
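
To see what these rates mean for a concrete workload, the per-million-token prices can be turned into a quick estimate. The sketch below uses only the prices listed above; the workload size (2M input tokens, 500K output tokens) and the helper name `estimate_cost` are made up for illustration.

```python
# USD per million tokens as (input, output), copied from the price list above.
PRICES = {
    "Qwen3-235B-A22B": (0.20, 0.80),
    "Qwen3-30B-A3B": (0.10, 0.45),
    "Qwen3-32B": (0.10, 0.45),
}


def estimate_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Estimate the USD cost of a workload from per-million-token prices."""
    input_price, output_price = PRICES[model]
    return (input_tokens * input_price + output_tokens * output_price) / 1_000_000


# Hypothetical workload: 2M input tokens and 500K output tokens.
for name in PRICES:
    print(f"{name}: ${estimate_cost(name, 2_000_000, 500_000):.3f}")
```

Under these assumptions the flagship model comes to about $0.80 for the whole workload, so the $10 in free credits covers a fair amount of prototyping.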

Power your chatbots, apps, and workflows with state-of-the-art language models — Qwen 3 is just an API call away.

What is Qwen 3?

Qwen 3 is the latest and most advanced family of large language models developed by Alibaba Cloud’s Qwen team. Building on the experience of QwQ and Qwen2.5, Qwen 3 sets a new standard for open-source AI with major improvements in reasoning, multilingualism, and agentic abilities.

Key Features of Qwen 3

Dense and Mixture-of-Experts (MoE) models in various sizes: Qwen 3 is available in both dense and MoE architectures, ranging from lightweight 0.6B and 1.7B models up to large-scale 32B (dense) and flagship 30B-A3B and 235B-A22B (MoE) variants.
Hybrid thinking modes: The model allows seamless switching between thinking mode (for complex, step-by-step logical reasoning, math, and code generation) and non-thinking mode (for fast, efficient, general-purpose chat).
Significantly enhanced reasoning: Qwen 3 surpasses previous Qwen models in mathematics, code generation, and commonsense logical reasoning. It also offers more stable and controllable reasoning budgets for different tasks.
Superior human preference alignment: The model excels in creative writing, role-playing, multi-turn dialogues, and instruction following, resulting in more natural, engaging conversations.
Advanced agentic capabilities: Qwen 3 is designed for agent-based workflows, supporting seamless integration with external tools and precise function calling in both reasoning modes. This enables state-of-the-art performance in complex, agent-driven tasks.
Robust multilingual support: Supporting 119 languages and dialects, Qwen 3 is capable of high-quality multilingual instruction following and translation, opening the door for truly global applications.

Benchmarks and Performance

The Qwen 3 series demonstrates industry-leading performance across a comprehensive suite of AI benchmarks, excelling in coding, mathematics, general reasoning, and multilingual understanding.

Flagship Model: Qwen3-235B-A22B

The flagship model, Qwen3-235B-A22B, consistently achieves top or near-top results when compared with the most advanced models available today, such as DeepSeek-R1, OpenAI-o1, OpenAI-o3-mini, Grok-3 Beta, and Gemini-2.5-Pro.

Complex Reasoning: Highest scores on ArenaHard (95.6), outperforming or matching all competitors.
Mathematics: Leading results on AIME’24 (85.7) and AIME’25 (81.5), well ahead of most commercial and open-source models.
Coding: Exceptional performance on LiveCodeBench (70.7) and CodeForces Elo (2056), confirming its strength in software and algorithmic tasks.
Multilingual & General Capabilities: Qwen3-235B-A22B achieves strong results on LiveBench and MultiF, demonstrating robust real-world and multilingual understanding.

Model Efficiency and Scalability

Qwen 3’s architectural innovations also translate to outstanding performance at smaller model sizes:

Qwen3-32B (Dense): Delivers results just behind the flagship, still outperforming most alternative models across all categories.
Qwen3-30B-A3B (MoE): Outperforms QwQ-32B despite using only a tenth of the activated parameters, showcasing Qwen’s efficiency and smart scaling.
Qwen3-4B (Dense): Even this compact model can rival the performance of much larger models like Qwen2.5-72B-Instruct, especially on reasoning and multilingual tasks.

How to Access Qwen 3 on Novita AI

Getting started with Qwen 3 is fast, simple, and risk-free on Novita AI. Thanks to the Referral Program, you’ll receive $10 in free credits — enough to fully explore Qwen 3’s power, build prototypes, and even launch your first use case without any upfront cost.

Use the Playground (No Coding Required)

Instant Access: Sign up, claim your free credits, and start experimenting with Qwen 3 and other top models in seconds.
Interactive UI: Test prompts, chain-of-thought reasoning, and visualize results in real time.
Model Comparison: Effortlessly switch between Qwen 3, Llama 4, DeepSeek, and more to find the perfect fit for your needs.

Integrate via API (For Developers)

Seamlessly connect Qwen 3 to your applications, workflows, or chatbots with Novita AI’s unified REST API — no need to manage model weights or infrastructure. Novita AI offers multi-language SDKs (Python, Node.js, cURL, and more) and advanced parameter controls for power users.

Option 1: Direct API Integration (Python Example)

To get started, simply use the code snippet below:

from openai import OpenAI

client = OpenAI(
    base_url="https://api.novita.ai/v3/openai",
    api_key="<YOUR Novita AI API Key>",
)

model = "qwen/qwen3-235b-a22b-fp8"
stream = True # or False
max_tokens = 2048
system_content = """Be a helpful assistant"""
temperature = 1
top_p = 1
min_p = 0
top_k = 50
presence_penalty = 0
frequency_penalty = 0
repetition_penalty = 1
response_format = { "type": "text" }

chat_completion_res = client.chat.completions.create(
    model=model,
    messages=[
        {
            "role": "system",
            "content": system_content,
        },
        {
            "role": "user",
            "content": "Hi there!",
        }
    ],
    stream=stream,
    max_tokens=max_tokens,
    temperature=temperature,
    top_p=top_p,
    presence_penalty=presence_penalty,
    frequency_penalty=frequency_penalty,
    response_format=response_format,
    extra_body={
      "top_k": top_k,
      "repetition_penalty": repetition_penalty,
      "min_p": min_p
    }
  )

if stream:
    for chunk in chat_completion_res:
        print(chunk.choices[0].delta.content or "", end="")
else:
    print(chat_completion_res.choices[0].message.content)

Option 2: Multi-Agent Workflows with OpenAI Agents SDK

Build advanced multi-agent systems by integrating Novita AI with the OpenAI Agents SDK:

Plug-and-play: Use Novita AI’s LLMs in any OpenAI Agents workflow.
Supports handoffs, routing, and tool use: Design agents that can delegate, triage, or run functions, all powered by Novita AI’s models.
Python integration: Simply point the SDK to Novita’s endpoint (https://api.novita.ai/v3/openai) and use your API key.

Connect Qwen 3 API on Third-Party Platforms

Hugging Face: Use Qwen 3 in Spaces, pipelines, or with the Transformers library via Novita AI endpoints.
Agent & Orchestration Frameworks: Easily connect Novita AI with partner platforms like Continue, AnythingLLM, LangChain, Dify and Langflow through official connectors and step-by-step integration guides.
OpenAI-Compatible API: Enjoy hassle-free migration and integration with tools such as Cline and Cursor, designed for the OpenAI API standard.

Best Practices for Optimal Qwen 3 Performance

1. Sampling Parameter Settings

Thinking Mode

enable_thinking=True

Temperature: 0.6

TopP: 0.95

TopK: 20

MinP: 0

Tip: Avoid greedy decoding to prevent degraded performance or repetitive outputs.

Non-Thinking Mode

enable_thinking=False

Temperature: 0.7

TopP: 0.8

TopK: 20

MinP: 0
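
The two presets above can be kept in one small helper so that switching modes does not mean retyping numbers. This is a sketch, not an official client feature: `sampling_kwargs` is an illustrative name, and it deliberately leaves out `enable_thinking`, because how that flag is passed depends on the serving framework or provider.

```python
def sampling_kwargs(thinking: bool) -> dict:
    """Return the recommended sampling settings as keyword arguments."""
    if thinking:
        temperature, top_p = 0.6, 0.95
    else:
        temperature, top_p = 0.7, 0.8
    return {
        "temperature": temperature,
        "top_p": top_p,
        # top_k and min_p are not standard OpenAI parameters, so they go in extra_body
        "extra_body": {"top_k": 20, "min_p": 0},
    }


print(sampling_kwargs(thinking=True))
# Usage with an OpenAI-compatible client:
#   client.chat.completions.create(model=model, messages=messages, **sampling_kwargs(True))
```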

Repetition Control

For supported frameworks, adjust presence_penalty between 0 and 2 to reduce repetitions.

Note: Higher values may cause some language mixing or a slight decrease in model performance.

2. Output Length Recommendations

For most queries, set the output length to 32,768 tokens.
For complex benchmarking tasks (such as math or programming competitions), increase the max output length to 38,912 tokens for more comprehensive responses.
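
These output budgets take a sizeable share of the context window. Assuming the 128,000-token window quoted for the Qwen 3 models on Novita AI is shared between prompt and output (an assumption, not something stated above), the remaining room for the prompt works out as follows; `prompt_budget` is an illustrative helper.

```python
CONTEXT_WINDOW = 128_000  # context window quoted for Qwen 3 on Novita AI


def prompt_budget(max_output_tokens: int, context_window: int = CONTEXT_WINDOW) -> int:
    """Tokens left for the prompt once the output budget is reserved."""
    if max_output_tokens >= context_window:
        raise ValueError("output budget must be smaller than the context window")
    return context_window - max_output_tokens


print(prompt_budget(32_768))  # 95232
print(prompt_budget(38_912))  # 89088
```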

3. Standardizing Output Format

Math Problems: Include this in your prompt: “Please reason step by step, and put your final answer within \boxed{}.”
Multiple-Choice Questions: Standardize responses using a JSON field: “Please show your choice in the answer field with only the choice letter, e.g., "answer": "C".”
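
Standardized formats pay off when answers are checked automatically. The helpers below are a minimal sketch of how such replies might be parsed; the function names are illustrative, and the boxed-answer pattern is deliberately simple (it does not handle nested braces such as a fraction inside the box).

```python
import re
from typing import Optional

BOXED = re.compile(r"\\boxed\{([^{}]*)\}")
CHOICE = re.compile(r'"answer"\s*:\s*"([A-Z])"')


def extract_boxed(text: str) -> Optional[str]:
    """Return the content of the last boxed answer in text, if any."""
    matches = BOXED.findall(text)
    return matches[-1].strip() if matches else None


def extract_choice(text: str) -> Optional[str]:
    """Return the choice letter from the first "answer" field in text, if any."""
    match = CHOICE.search(text)
    return match.group(1) if match else None


print(extract_boxed(r"Adding the terms gives \boxed{42}."))  # 42
print(extract_choice('{"answer": "C"}'))  # C
```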

4. Conversation History Management

In multi-turn conversations, include only the final output in the chat history. Omit any intermediate “thinking” content.
If using a Jinja2 chat template, this is handled automatically. For other frameworks, ensure this practice is followed manually.
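
For frameworks where this has to be done by hand, the thinking content can be stripped before a reply is stored. The sketch below assumes the reasoning arrives inline between `<think>` and `</think>` tags in the message content; some servers return it in a separate field instead, in which case nothing needs stripping. The helper names are illustrative.

```python
import re

THINK_BLOCK = re.compile(r"<think>.*?</think>", re.DOTALL)


def strip_thinking(reply: str) -> str:
    """Remove <think>...</think> spans and surrounding whitespace."""
    return THINK_BLOCK.sub("", reply).strip()


def append_assistant_turn(history: list, reply: str) -> list:
    """Return a new history with only the final answer of the reply appended."""
    return history + [{"role": "assistant", "content": strip_thinking(reply)}]


history = [{"role": "user", "content": "What is 2 + 2?"}]
raw_reply = "<think>Simple addition: 2 + 2 = 4.</think>\n\nThe answer is 4."
history = append_assistant_turn(history, raw_reply)
print(history[-1]["content"])  # The answer is 4.
```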

By following these recommendations, you’ll ensure Qwen 3 consistently delivers accurate, high-quality results across all use cases.

Conclusion

Qwen 3 delivers best-in-class performance for coding, reasoning, and multilingual tasks — no matter the project size. Ready to see it in action?

Try the Qwen 3 demo on Novita AI now and claim your free credits!

Originally published on Novita AI


Earn $500 Free Credits: Build Faster with Deepseek, Llama & Qwen on Novita AI

Novita AI — Thu, 24 Apr 2025 05:46:08 +0000

Novita AI is offering an exclusive, limited-time opportunity! With the Referral Program, you can earn up to $500 in LLM API credits by simply referring your friends. Here’s the best part: both you and your referral will receive $10 in credits, unlocking access to top-tier models like DeepSeek, Llama and Qwen.

These credits can power your next big project across integrations such as Hugging Face, AnythingLLM, Langflow, Continue, Helicone, Dify, Cursor, LobeChat, and many more. Don’t miss out on the opportunity to supercharge your AI applications.

👉 Sign up for the Novita AI Referral Program and begin earning credits now.

Why Novita AI is the Trusted Choice

Artificial Analysis, a leading AI model evaluation platform, ranks Novita AI alongside industry leaders such as Together AI and Fireworks AI, reinforcing Novita AI’s reputation as a trusted choice for developers worldwide.

Additionally, OpenRouter recognizes Novita AI as one of the most cost-effective LLM API providers.

Novita AI also serves as the official inference provider on Hugging Face.

4 Easy Steps to Claim $10 in API Credits

1. Visit the Referral Program Page: Head to the official page to begin.
2. Enter Your Invite Code: Use either the official invite code 5W10UA or your personal one to get started.
3. Create Your Novita AI Account: Sign up using your email, Google, Hugging Face or GitHub account.
4. Verify Your GitHub Account: Complete the verification process to unlock your credits.

3 Ways to Share and Earn Up to $500 in Credits

Earn up to $500 in LLM API credits by referring others. Here’s how you can share and earn:

1. Copy Your Referral Link: https://novita.ai/referral?invited_code=xxx
2. Copy Your Own Referral Code
3. Share on Social Media: Post your referral link on platforms like Twitter (X), LinkedIn, Facebook, or anywhere else developers are hanging out.

The more you share, the more you earn!

LLM API on Novita AI

You can use your credits across the entire range of LLM APIs available on Novita AI. Below is a comprehensive list of all supported LLM APIs on the platform.

LLM API

Integrated Projects & SDKs

Novita AI supports seamless integration with many leading open-source projects and developer tools. Once you get the LLM API credits, you can call the API on the following platforms.

Novita AI & OpenAI Agents SDK
Novita AI & AnythingLLM
Novita AI & Dify
Novita AI & Helicone
Novita AI & Hugging Face
Novita AI & Langflow
Novita AI & Continue
Novita AI & Cursor
Novita AI & LangChain
Novita AI & Skyvern
Novita AI & LobeChat
Novita AI & ai-gradio
Novita AI & Langfuse
Novita AI & Verba
Novita AI & Portkey
Novita AI & DocsGPT
Novita AI & LlamaIndex
Novita AI & LoLLMS WebUI
Novita AI & CodeCompanion.nvim
Novita AI & Page Assist
Novita AI & DeepSearcher
Start Earning and Building with Novita AI Today!

Don’t miss out on the chance to earn up to $500 in credits, unlock powerful LLM API models, and supercharge your projects with Novita AI. Whether you’re building AI-powered tools, developing advanced agents, or creating the next big thing in AI, Novita AI is your trusted partner.

👉 Sign up now, share your link, and start building!


DEV Community: Novita AI

Best Text-to-Speech APIs in 2026: 8 Providers Compared

Kimi K2.6 on Novita AI: API Pricing ($0.95/$4.00), SWE-Bench & Agentic Coding

Kimi K2.6: Open-Source Agent for 13-Hour Coding Sessions

What Is Kimi K2.6?

What Makes Kimi K2.6 Different from Other Open-Source Models?

Long-Horizon Coding

Coding-Driven Design

Elevated Agent Swarm

Proactive Background Agents

How Does Kimi K2.6 Perform on Agentic Coding Benchmarks?

How to Use Kimi K2.6 on Novita AI

Option 1: Playground

Option 2: API (Python)

Option 3: Third-Party Tools

When Should You Use Kimi K2.6 Instead of GPT-4o or Claude?

Scenario 1: Long-Running Engineering Agents

Scenario 2: Design-to-Code Pipelines

Scenario 3: Multi-Agent Orchestration

Scenario 4: Migrating from Claude or GPT-4o Agent Pipelines

How Much Does Kimi K2.6 Cost on Novita AI?

What Are the Technical Specs of Kimi K2.6?

Is Kimi K2.6 the Right Model for Your Agent Pipeline?

FAQ

What is Kimi K2.6?

How do I access Kimi K2.6 via API on Novita AI?

How does Kimi K2.6 compare to Claude Opus 4.6 for coding tasks?

What is the context window for Kimi K2.6?

What is the pricing for Kimi K2.6 on Novita AI?

DeepSeek-V4-Flash on Novita AI: Fast Reasoning at Lower Cost

DeepSeek-V4-Flash backed by Novita AI: 1M Context at $0.14/M Tokens

What Is DeepSeek-V4-Flash?

Key Features: Why DeepSeek-V4-Flash Stands Out

Selectable Reasoning Depth Without Switching Models

Hybrid Attention Architecture for 1M-Token Context

MoE Efficiency: 13B Activated at 284B Scale

Strong Post-Training Pipeline

Benchmark Performance

Performance Across Reasoning Modes

How V4-Flash Compares to Competitors

How to Use DeepSeek-V4-Flash via Novita AI

Option 1: Playground (No Code)

Option 2: API (Python)

Option 3: Third-Party Tools

Pricing

Recommended Use Cases

Autonomous Coding Agents

Long-Document QA and RAG

Math and Science Reasoning

Production APIs with Caching

Frequently Asked Questions

What is DeepSeek-V4-Flash?

How is DeepSeek-V4-Flash different from DeepSeek-V4-Pro?

What does "Flash" mean in the model name?

Does DeepSeek-V4-Flash support a 1M context window backed by Novita AI?

How do I switch reasoning modes via the API?

What is the pricing for DeepSeek-V4-Flash backed by Novita AI?

Is DeepSeek-V4-Flash open source?

Start Using DeepSeek-V4-Flash Today

Recommended Articles

DeepSeek-V4-Pro on Novita AI: 1M Context, #1 LiveCodeBench Score

DeepSeek-V4-Pro: 1M Context, #1 on LiveCodeBench, Open-Source Frontier

What Is DeepSeek-V4-Pro?

Key Features

Hybrid Attention for Efficient 1M-Token Context

#1 on LiveCodeBench and Codeforces — The Coding Model That Actually Competes

Three Reasoning Modes — Match Compute to the Task

Agentic and Tool Use Performance

Benchmark Performance

How to Use DeepSeek-V4-Pro backed by Novita AI

Option 1: Playground (No Code)

Option 2: API (Python)

Option 3: Third-Party Tools

Use Cases

Pricing

Migrating from DeepSeek-V3 or DeepSeek-R1

Conclusion

FAQ

What is DeepSeek-V4-Pro?

How do I access DeepSeek-V4-Pro via API?