Kimi K2.6 vs Claude vs GPT-5.5: I ran it against my real coding cases and the numbers surprised me
I was looking at a PR I'd asked Claude Sonnet 3.7 to refactor — a TypeScript data ingestion service with three layers of badly chained async — when I saw the Hacker News thread about Kimi K2.6. The claim was straightforward: Kimi K2.6 beats Claude and GPT-5.5 on coding benchmarks. LiveCodeBench, SWE-bench, the usual suspects.
My first reaction was visceral: here we go again. Every three months there's a new model that "wins" the leaderboards and two weeks later nobody's using it in production. But this time the thread had enough technical substance that I couldn't just dismiss it outright. So I did what I always do: I stopped reading opinions and started measuring.
What I found isn't what I expected. And the conclusion I reached doesn't appear in any viral post.
## Kimi K2.6 coding benchmarks: what the leaderboard says (and what it doesn't)
The public numbers that circulated on HN are real in the sense that Moonshot AI published them and they're reproducible on their reference datasets. Kimi K2.6 reports something around 65–68% on LiveCodeBench and competitive numbers on SWE-bench Verified. I'm not going to cite them as exact because benchmarks for these models update constantly and versions change week to week — what matters is the order of magnitude.
The structural problem with all these rankings is the same as always: public benchmarks don't include project context. HumanEval gives you an isolated function. SWE-bench gives you a GitHub issue with its repository, sure, but it's a repository the model probably saw during training. None of them give you your code with your conventions, your architectural decisions made 18 months ago for reasons that are no longer documented anywhere.
My thesis is simple and the experiment backed it up: public benchmarks lie not because the numbers are false, but because real project context is the actual test, and that test doesn't appear on any leaderboard. A model can solve LeetCode Medium in 40 seconds and at the same time not understand why in my codebase UserService inherits from BaseRepository instead of composing it — and that second problem is what costs me real hours.
## The experiment: three real tasks, three models, my own numbers
I put together three cases from this week's work. I didn't cherry-pick them to favor any model — I grabbed them from the real backlog, in the order they showed up.
Setup: Kimi K2.6 via API (Moonshot), Claude Sonnet 3.7 via direct API, GPT-5.5 via OpenAI API. Same prompt, same relevant file context pasted in manually, no agent tooling — I wanted to measure pure generation, not orchestration.
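The harness was deliberately boring. A sketch of what each call looked like, assuming OpenAI-style chat-completion request bodies; the model names and shapes here are placeholders for illustration, not the exact API identifiers:

```typescript
// Builds one chat-completion request body per model. No tools, no system
// scaffolding: the same context + task string goes to every model.
type ChatRequest = {
  model: string;
  temperature: number;
  messages: { role: 'system' | 'user'; content: string }[];
};

function buildRequest(model: string, fileContext: string, task: string): ChatRequest {
  return {
    model,
    temperature: 0, // keep generations as comparable as possible
    messages: [{ role: 'user', content: `${fileContext}\n\n${task}` }],
  };
}
```

The same request body went to each provider; only the base URL and auth header changed.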
### Case 1: Async service refactor in TypeScript
The context: a service that processes webhooks with three levels of nested Promise.all, with no partial error handling. I gave it the three relevant files (~400 lines total) and asked for a refactor that would handle individual failures without aborting the entire batch.
```typescript
// What I had: Promise.all with no partial failure handling
const results = await Promise.all(
  events.map(e => processEvent(e))
  // If one fails, they all fail — learned this the hard way in production
);
```

```typescript
// What I asked for: allSettled with per-failure logging
const results = await Promise.allSettled(
  events.map(e => processEvent(e))
);
const failed = results
  .filter((r): r is PromiseRejectedResult => r.status === 'rejected')
  .map(r => r.reason);
if (failed.length > 0) {
  logger.warn(`Partial batch: ${failed.length}/${results.length} failed`, { failed });
}
```
- Claude Sonnet 3.7: Understood the pattern, proposed `Promise.allSettled`, respected the logger that was defined in another file in the context. Generation time: ~8 seconds. Drop-in integration: yes, no edits needed.
- GPT-5.5: Correct solution, but used `console.error` instead of the project's logger. Adaptation cost: 2 minutes of manual editing.
- Kimi K2.6: Correct solution and used the logger. Generation time: ~14 seconds. But it introduced a generic `BatchResult<T>` type that had no precedent in the codebase — functionally fine, but it breaks the pattern consistency of the project.
Real-world winner: Claude. Not because Kimi was wrong, but because Kimi's solution introduced an implicit design decision I never asked for.
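To make that concrete, the unrequested abstraction looked roughly like this. This is a reconstruction from memory, not Kimi's verbatim output, and the names are approximate:

```typescript
// Approximate reconstruction of the generic wrapper Kimi introduced.
// Reasonable TypeScript in isolation, but nothing else in the codebase
// wraps results this way, so it creates a second, competing convention.
type BatchResult<T> =
  | { status: 'ok'; value: T }
  | { status: 'failed'; reason: unknown };

function toBatchResult<T>(r: PromiseSettledResult<T>): BatchResult<T> {
  return r.status === 'fulfilled'
    ? { status: 'ok', value: r.value }
    : { status: 'failed', reason: r.reason };
}
```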
### Case 2: SQL query with specific business logic
I have a PostgreSQL query that calculates usage metrics weighted by plan. The weighting logic is ours, not standard — there are comments in the code explaining why the coefficients exist.
```sql
-- Weighted score calculation by plan
-- Coefficient 1.4 for PRO plan: decision from 2023-09, see issue #441
SELECT
  u.id,
  u.plan,
  ROUND(
    SUM(e.value) * CASE u.plan
      WHEN 'PRO' THEN 1.4
      WHEN 'BASIC' THEN 1.0
      ELSE 0.6
    END
  , 2) AS weighted_score
FROM users u
JOIN events e ON e.user_id = u.id
GROUP BY u.id, u.plan;
```
I asked all three models to extend this query to include a 30-day time window and a region filter, respecting the existing coefficients.
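For reference, the shape of the extension I was after looks roughly like this. The `created_at` and `region` column names are assumptions from my schema, not something the models could know:

```sql
-- Weighted score, limited to the last 30 days and one region.
-- Coefficients unchanged (PRO 1.4 / BASIC 1.0 / ELSE 0.6, see issue #441).
SELECT
  u.id,
  u.plan,
  ROUND(
    SUM(e.value) * CASE u.plan
      WHEN 'PRO' THEN 1.4
      WHEN 'BASIC' THEN 1.0
      ELSE 0.6
    END
  , 2) AS weighted_score
FROM users u
JOIN events e ON e.user_id = u.id
WHERE e.created_at >= NOW() - INTERVAL '30 days'
  AND u.region = $1  -- region filter as a bind parameter
GROUP BY u.id, u.plan;
```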
- Claude: Extended it correctly, kept the coefficients, added the `WHERE` with `NOW() - INTERVAL '30 days'`, and commented that the ELSE coefficient might need review if new plans are added. That proactive comment saved me a future conversation.
- GPT-5.5: Correct, but changed `ROUND(..., 2)` to `CAST(... AS DECIMAL(10,2))` without being asked. Functionally equivalent, stylistically different from the rest of the code.
- Kimi K2.6: Correct, respected everything, no additional comments. The "cleanest" solution in terms of not adding anything unrequested.
This case is interesting: Kimi won on discipline, Claude won on added value. Depends what you need in that moment.
### Case 3: Debugging a type error in React + TypeScript
A component with deep prop drilling where a callback's type was being incorrectly inferred. Real compilation error, 6 files of context.
```
Type '(id: string) => Promise<void>' is not assignable to
type '(id: string, options?: UpdateOptions) => Promise<void>'.
```
- Claude: Identified the origin at the third level of the component tree, proposed the fix, and suggested collapsing the prop drilling with a context. Correct, though the refactor suggestion was out of scope.
- GPT-5.5: Identified the origin correctly, proposed only the minimal fix. No extra suggestions. Time: ~6 seconds.
- Kimi K2.6: Identified the origin, but proposed fixing the type on the first component instead of the actual source. Functionally resolves the compilation error, but in the architecturally wrong place.
Clear winner: GPT-5.5 on this one. Correct, minimal, fast.
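A minimal sketch of the kind of fix GPT-5.5 proposed: widen the callback type at its declaration site, not at the first component that happens to make the compiler quiet. The `UpdateOptions` shape here is my invention for illustration:

```typescript
// Assumed shape of the options bag; the real interface lives elsewhere.
interface UpdateOptions {
  optimistic?: boolean;
}

// The type the consumers expect: the options parameter is optional.
type UpdateHandler = (id: string, options?: UpdateOptions) => Promise<void>;

// The original handler only took `id`. A function with fewer parameters is
// assignable to a callback type with more, so annotating it as UpdateHandler
// at the source fixes the error without touching intermediate components.
const handleUpdate: UpdateHandler = async (id: string) => {
  void id; // real implementation omitted
};
```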
## The errors no benchmark captures
Something kept nagging at me after running these three cases, and it connects to something I mentioned when I analyzed the supply chain attack on my ML dependencies: the difference between a tool that works in isolation and one that works integrated into a real system.
All three models solve correctly when the problem is self-contained. The divergence shows up in two dimensions that no leaderboard measures:
1. Contextual discipline: Does the model respect pre-existing design decisions even when they aren't the "academically best" approach? Kimi introduced the generic type in Case 1 because it's good general practice — but it breaks project consistency. Claude sometimes suggests unrequested refactors. GPT-5.5 was the most disciplined across all three cases.
2. Latency under long context: With 400+ lines of context, Kimi was consistently ~6 seconds slower than GPT-5.5 and ~4 seconds slower than Claude in my informal measurements. Not a critical problem, but in a workflow where you're sending 20-30 queries per hour, it adds up.
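My "informal measurements" were nothing fancier than wall-clock timing around each request. A sketch of the approach, where `fn` is whatever client call you happen to use:

```typescript
// Wall-clock timing around an arbitrary async call. Crude, but enough to
// notice a consistent multi-second gap across 20-30 prompts per hour.
async function timed<T>(label: string, fn: () => Promise<T>): Promise<T> {
  const start = performance.now();
  try {
    return await fn();
  } finally {
    const seconds = (performance.now() - start) / 1000;
    console.log(`${label}: ${seconds.toFixed(1)}s`);
  }
}
```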
This reminds me of something I learned at the cyber café at age 14: when the network went down at 11pm with a full room, it didn't matter which router had the better theoretical throughput. What mattered was which one came back up fastest after a reset and which one gave you useful information about where the failure was. Throughput benchmarks didn't capture that. LLM benchmarks don't capture real-pressure latency or contextual discipline either.
It also connects to what I observed auditing my own production prompts: models behave differently when the prompt has dense context versus when it's a clean isolated problem. Kimi K2.6 seems optimized for the second case.
## Common mistakes when reading these comparisons
"It won on SWE-bench, so it's better for my project": SWE-bench uses public repositories. If the model was trained after the repo's creation date, there's possible data contamination. You'll never know exactly how much.
"The numbers are from this week, so they're current": The models being compared on HN are usually versions that have been in the API for weeks already. Kimi K2.6, Claude Sonnet 3.7, and GPT-5.5 have different knowledge cutoff dates and API versions that update without clear changelogs. What you measure today might not be what you measure in three weeks.
"Cheaper = worse": Kimi K2.6 has significantly lower pricing than Claude and GPT-5.5 on the API tiers I used. In Cases 1 and 2, quality was comparable. Cost per token is not a reliable proxy for coding quality.
"Claude always wins because it's the most used": In my three cases, GPT-5.5 won one cleanly. Confirmation bias is real — if you use Claude every day, you're going to give it more implicit context in how you phrase your prompts.
This also applies to how we read infrastructure news. When I analyzed the real impact of Linux kernel vulnerabilities on my Ubuntu/Railway stack, the problem wasn't the public CVE but the gap between the announcement and the actual patch in production. LLM benchmarks have exactly the same problem: the public number and the real impact on your workflow have a gap that only you can measure.
## FAQ: Kimi K2.6 coding benchmarks
Does Kimi K2.6 actually beat Claude and GPT-5.5 at coding?
On public benchmarks like LiveCodeBench, the reported numbers are competitive. In my experiment with real project code, the result was mixed: Kimi won on discipline in one case, lost on identifying the correct debugging origin, and was comparable on refactoring. "Beating" depends entirely on the type of task and the context you give it.
Is it worth migrating to Kimi K2.6 if I'm already using Claude or GPT-5.5?
Not as a full migration. Worth having as an alternative for clean generation tasks where consistency with an existing codebase doesn't matter. For work with dense project context, Claude and GPT-5.5 showed better adherence to pre-existing patterns in my tests.
Are public LLM benchmarks reliable for making tooling decisions?
They're useful as an initial filter — if a model doesn't hit a certain threshold on HumanEval, it's probably not worth testing at all. But for deciding what to use in production, the only benchmark that matters is the one you run on your own code with your own cases.
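A personal benchmark doesn't need infrastructure. The minimal shape I use is essentially a spreadsheet with types; the case names and 0-2 scoring scale are my own convention, not any standard:

```typescript
// Minimal personal eval: each case is a real task from the backlog, scored
// manually per model after reviewing the diff.
type Score = 0 | 1 | 2; // 0 = wrong, 1 = needed edits, 2 = drop-in

interface CaseResult {
  caseName: string;
  scores: Record<string, Score>; // model name -> score
}

function tally(results: CaseResult[]): Record<string, number> {
  const totals: Record<string, number> = {};
  for (const r of results) {
    for (const [model, s] of Object.entries(r.scores)) {
      totals[model] = (totals[model] ?? 0) + s;
    }
  }
  return totals;
}
```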
What's the API cost of Kimi K2.6 compared to Claude and GPT-5.5?
At the time of writing this, Kimi K2.6 has notably lower per-token pricing than Claude Sonnet 3.7 and GPT-5.5. If your volume is high and the cases are relatively clean generation, the cost differential can justify the integration. Prices change frequently — verify on the official pages before projecting costs.
Is Kimi K2.6's latency a real problem in development workflows?
In my informal measurements, ~6-14 seconds for responses with medium context. Not blocking for casual use, but if you're working in a flow where the model is part of a rapid iteration loop (generate → review → refine → generate), you feel the difference. Claude and GPT-5.5 were faster in my cases.
Does it make sense to run all three models in parallel on the same problem?
I did it for this post and it was useful for understanding the differences. In daily work, no — the overhead of comparing three responses consumes more time than you save. My approach: I have a primary model (Claude for dense context), a backup (GPT-5.5 for targeted debugging), and Kimi K2.6 as an experiment for clean generation cases where cost matters.
## The hype isn't wrong, but the question is
Kimi K2.6 is a serious model. It's not empty marketing and the leaderboard numbers aren't invented. But the "who wins" debate is framed wrong from the start — including in my title, which I used deliberately to get you here.
The real question isn't "which model wins at coding?" but "which model understands my specific code, my conventions, and the decisions I made 18 months ago for reasons that are no longer in any README?" That question has no leaderboard answer.
What the experiment backed up: in tasks with real, dense project context, architectural consistency matters more than your HumanEval score. GPT-5.5 was the most disciplined about not adding unrequested design decisions. Claude was the most useful when I needed proactive added value. Kimi K2.6 was competitive on quality and significantly cheaper — with the caveat that on complex debugging it got the origin wrong.
Did I change my stack after this? No. I stayed with Claude as my primary. But I added Kimi K2.6 to the rotation for specific cases, especially greenfield code generation where project context is minimal. That's the most honest thing I can say.
What I won't do is declare a universal winner. That's what viral posts do. This isn't that post — it's the follow-up the HN thread never gives you.
One more thing that doesn't sit right with me about this whole debate: when I analyzed the real impact of a DDoS on my stack in Railway, the conclusion was that the public "guaranteed uptime" numbers say nothing about behavior under real load in my specific context. LLM benchmarks have exactly the same problem. The number exists. The context that makes it relevant to you doesn't.
Original source: Hacker News
This article was originally published on juanchi.dev