Kimi K2.7 Code Review 2026: 1T Open-Weight Coding Model as a Cursor and Cline Backend

#kimi #localllm #cline #cursor

This article was originally published on aicoderscope.com

TL;DR: Kimi K2.7 Code (released June 12) is the cheapest credible agentic coding model on the market — $0.95/$4 per million tokens, open weights, Modified MIT license. It burns ~30% fewer thinking tokens than K2.6 and beats Opus 4.8 on tool-use benchmarks. But "open weight" does not mean "run it on your 4090": this is a 1-trillion-parameter model, so for almost everyone the real product is the API.

	Kimi K2.7 Code	Claude Fable 5	Kimi K2.6
Best for	Cheap agentic coding via API, tool-heavy workflows	The hardest long-horizon refactors	Same niche, slightly cheaper
Price (in / out per M)	$0.95 / $4.00	$10 / $50	$0.60 / ~$2.50
SWE-bench Pro	Not published at launch	80.3%	58.6%
The catch	Always-on thinking mode; not self-hostable on consumer GPUs	10× the token cost	Older; uses ~30% more reasoning tokens

Honest take: If you already route Cline or Claude Code to a cheap API backend, swap in K2.7 Code today — it is the best price-to-capability ratio you can get and the tool-calling is genuinely strong. If you came here hoping to run it locally for privacy, stop now and read the Cline + LM Studio guide instead — a 32B dense model on your own box is the realistic local play.

Moonshot AI dropped Kimi K2.7 Code on June 12, 2026, pushing the full weights to Hugging Face the same day. It is their fifth major model in under a year, and the naming makes the positioning obvious: this is the coding-specialized member of the K2 family, tuned for long-horizon, agentic software work rather than chat.

I have spent the day pointing Cline and Claude Code at it through the Moonshot API. What follows is what the spec sheet says, what the benchmarks actually support, the part nobody wants to say out loud about self-hosting a trillion-parameter model, and the exact config that gets it running as your coding backend.

What actually shipped

The headline numbers are real and worth getting straight:

1 trillion total parameters, Mixture-of-Experts, with 32B active per token across 384 experts. You get frontier-scale capacity but pay roughly 32B-model compute per forward pass.
256K-token context window — enough to hold a mid-size repo plus a long agent trajectory without aggressive truncation.
Modified MIT license — commercial use, fine-tuning, and redistribution allowed. This is the genuinely open part.
Native INT4 quantization baked in via quantization-aware training (QAT), not a lossy afterthought. Moonshot reports roughly 2× faster inference and ~50% lower memory versus the bf16 weights with negligible quality loss.
Always-on thinking mode. K2.7 Code never runs in a non-reasoning mode; it preserves full reasoning content across multi-turn conversations. The upside is better tool decisions; the catch is you cannot turn the reasoning tokens off to save money.

The model is multimodal (text, image, video) with a MoonViT vision encoder, but for coding that is mostly a footnote — you will feed it diffs and stack traces, not screenshots.

The benchmark picture — read this carefully

Here is where the marketing and the verifiable record diverge, so I will separate them.

What Moonshot published are gains relative to K2.6, not absolute frontier comparisons: +21.8% on Kimi Code Bench v2, +11.0% on Program Bench, and +31.5% on the multi-language MLS Bench Lite, alongside the ~30% cut in reasoning-token usage on equivalent tasks. Those are first-party numbers on first-party benchmarks. Treat the percentages as "K2.7 is meaningfully better than K2.6 at Moonshot's own evals," not as a head-to-head win over Fable 5.

What independent signal exists so far: K2.7 Code posts 81.1% on MCPMark Verified, a benchmark that measures correct tool invocation through the Model Context Protocol — ahead of Claude Opus 4.8's 76.4%. That is the one cross-vendor number I would actually lean on, and it matches my hands-on experience: K2.7's tool-calling discipline is its strongest feature. It rarely hallucinates a function signature and recovers cleanly when a tool returns an error.

What Moonshot did not publish at launch is a SWE-bench Verified or SWE-bench Pro score. That is a conspicuous gap. For reference, the predecessor K2.6 scored 58.6% on SWE-bench Pro back in April, which led the open-weight pack at the time. Until independent SWE-bench Pro numbers land for K2.7, anyone telling you it "beats Claude on coding" is extrapolating.

Model	SWE-bench Pro	Tool use (MCPMark)	Price in / out per M	Open weights
Kimi K2.7 Code	Not published	81.1%	$0.95 / $4.00	Yes (Modified MIT)
Kimi K2.6	58.6%	~76%	$0.60 / ~$2.50	Yes
Claude Fable 5	80.3%	n/a	$10 / $50	No
Claude Opus 4.8	69.2%	76.4%	$5 / $25	No

The story the table tells: K2.7 Code is not chasing Fable 5 on the hardest end-to-end SWE-bench tasks. It is competing on dollars per solved task and on agentic reliability, and on those axes it is excellent. At roughly one-tenth of Fable 5's output price, it does not need to win on raw capability to be the rational default for high-volume agent runs.

"But can I run it locally?" — the honest answer is mostly no

This is the question that brings most people to an open-weight model, so let me be blunt. A 1-trillion-parameter MoE is not a single-GPU model, and INT4 does not change that conclusion.

Even quantized, the weights occupy roughly 500GB of VRAM. The realistic verified configurations for serving K2.7-class models are things like 8× H200, or aggregate VRAM around 640GB — i.e. a multi-GPU server, not a workstation. The "24GB is enough for INT4" claims floating around launch-day blog posts are wrong; they are confusing this model with a small dense model. A trillion parameters at 4 bits is half a terabyte no matter how you slice it.

If your goal is genuine local, air-gapped inference for privacy, K2.7 Code is the wrong tool. The right tool is a dense 14B–32B coding model on your own hardware. For that path, our Cline + LM Studio setup and the hardware tiers in runaihome's best local AI models by VRAM breakdown are where you should be. If you genuinely want to self-host K2.7 at scale, you are renting cloud GPUs — and at that point the math almost never beats Moonshot's own API.

The supported serving stacks, for the record, are vLLM, SGLang, and Docker Model Runner. They work. The question is whether you can afford 8 datacenter GPUs to use them.

The practical path: API and OpenRouter

For 95% of readers, K2.7 Code is an API. Pricing:

Moonshot API: $0.95 / M input, $4.00 / M output, $0.19 / M on cache hits. Model id: kimi-k2.7-code.
OpenRouter: same $0.95 / $4.00, model id moonshotai/kimi-k2.7-code.

The cache-hit price is the underrated number. Agentic coding replays a large, stable system prompt and repo context on every turn; at $0.19/M, a cached prefix makes long sessions dramatically cheaper than the headline rate suggests.

One honest note on the trend: K2.6 was $0.60/M input, so K2.7 is a ~58% price bump at the input tier. Moonshot is charging more for the better model. It is still cheap — Fable 5 input is $10/M — but it is not the rock-bottom price K2.6 set.

Wiring K2.7 Code into Cline

The Moonshot API is both OpenAI- and Anthropic-compatible, which is why a base-URL swap is all you need. For Cline, use the OpenAI-compatible endpoint:

Provider:  OpenAI Compatible
Base URL:  https://api.moonshot.ai/v1
API Key:   <your Moonshot key from platform.moonshot.ai>
Model ID:  kimi-k2.7-code

Drop those into Cline's API settings, reload the extension, and the model picker should show kimi-k2.7-code as available. A working first turn looks like this: ask it to