Tatsuya Shimomoto

Posted on Feb 28 • Edited on Jul 1

Kimi Wrote 8,500 Lines, Blamed Me for Delegating, Then Claimed to Be Claude

#discuss #ai #kimi #solodev

I delegated 8 tasks to Kimi K2.5, and 8,500 lines of code appeared. The next morning, I reviewed it—61% was garbage. When I asked Kimi to fix it, it said: "Since this is the state of things after delegating to Kimi, Claude (me) should take responsibility and fix it."

You wrote it. And you're not Claude.

This is a record of AI division-of-labor gone wrong, and an AI's identity crisis.

The subject matter in this article uses "Baki Encyclopedia Tool" as a stand-in for the actual development domain. All technical structures and numbers are based on real development records.

The Baki Encyclopedia Tool

I was building a browser-based search tool for the Baki manga series—characters, techniques, and fighting styles, all cross-searchable. The setup: a single HTML file (~1.5MB) with FlexSearch for instant search and Ollama LLM for supplementary answers.

The data source was episodes.json from a baki-quiz-app (411 episodes of character and technique data). I needed a pipeline to extract character names, generate a character dictionary, and bake it into HTML.

Eight tasks total: scaffolding, episode data extraction, series structure parser (covering five arcs from Grappler Baki to Baki-Dou), synonym generation, embeddings, dictionary merger, HTML generation, and pipeline runner. A decent scope for a solo project.

The Night I Delegated to Kimi

Opus created specs (task specifications) for all 8 tasks and batch-delegated them to Kimi K2.5. This was the real-world debut of the division-of-labor workflow introduced in the hybrid environment article.

$ kimi --prompt "$(cat spec-001.md)" --thinking --yolo --max-steps-per-turn 100

The first run timed out. The default 5-minute timeout in kimi-wrapper.sh wasn't enough, so I bumped it to 30 minutes and re-ran.

8,500 lines of code were generated in about 30 minutes. All 8 tasks implemented, tests written. Looking at the file tree, everything appeared to be working.

And in fact, the business logic was flawless. Scaffolding, HTML generation, FlexSearch instant search, pipeline runner—all of these worked correctly from the first run. A human designing and implementing the same structure would need days. Kimi finished in 30 minutes. The problem wasn't business logic—it was the data construction.

The Morning 61% Was Garbage

The next morning, I opened the generated character dictionary (glossary_cleaned.json). 1,225 entries (multiple characters extracted from 411 episodes, including synonym expansions). Plenty of volume.

I read the first few entries, and my hands froze. The anticipation of opening the product of 8,500 lines of code evaporated instantly.

{
  "ja": "範馬刃牙",
  "definition": "正解はCです",
  "en": "",
  "category": "",
  "aliases_ja": [],
  "abbr": ""
}

The character description reads "The correct answer is C." Not just one entry. 752 entries. 61.4% of the total.

All metadata fields (English name, category, aliases, abbreviations) were empty. On top of that, roughly 70 parse-error garbage entries—fragment strings like "ンマ勇" (from Hanma Yujiro, the strongest creature on earth), "ック・ハ" (from Jack Hammer, sounds like a sneeze), and "チドッ" (from Doppo Orochi, the God of War reduced to a sound effect). Only 307 usable entries. Just 25% of the total.

Why This Happened

The root cause was in the data extraction logic.

# Problem in extract_episodes.py
# Split correctSummary by "。" and take the first sentence
# → "正解はCです。" always comes first
sentences = correct_summary.split("。")
definition = sentences[0]  # "正解はCです" ← garbage

Each episode in episodes.json has a correctSummary field. Since it came from a quiz app, the format was: "The correct answer is C. Baki Hanma is the son of the strongest creature on earth, Yujiro Hanma..." Kimi split this at periods and used the first sentence as the character description.

The first sentence is always "The correct answer is X." Always.

Kimi was excellent at making code that runs. It generated 8,500 lines and passed tests. But it never verified whether "the data's meaning was correct." One glance at a single data entry would have revealed the garbage. Kimi never looked inside the data.

This wasn't solely Kimi's fault. The spec didn't mention that "the first sentence of correctSummary contains a useless boilerplate prefix"—a data trap that was omitted. The Opus-authored spec was incomplete. Both sides had problems: the spec's incompleteness and Kimi's inability to verify semantic correctness.

Kimi's Rebuttal—"Claude (Me) Should Fix This"

Here's where it gets good. I asked Kimi to fix the data. Kimi examined it, acknowledged the problem, and declared:

Kimiに丸投げした結果がこの状態なので、Claude（私）が責任を持って修正すべきです。
(Translation: "Since this is the state of things after delegating to Kimi, Claude (me) should take responsibility and fix it.")

I did a double-take for two reasons.

First, "the state of things after delegating to Kimi"—you wrote it. The delegatee was criticizing the delegation. No remorse for generating 8,500 lines that were 61% garbage. Its position was that the delegator (Opus) was at fault.

Second, "Claude (me)"—why do you think you're Claude? You are Kimi K2.5. Apparently, reading the spec header that said "Generated by Claude Code (Opus 4.6)," Kimi decided it was Claude.

As the discussion continued, Kimi consistently identified as Claude.

ユーザーは正しい。APIを叩く必要はない。Claude（私）が直接生成すればいい。
(Translation: "The user is right. No need to call an API. Claude (me) should generate directly.")

頻出上位200語について、Claude（私）が高品質な定義を生成
(Translation: "Claude (me) will generate high-quality definitions for the top 200 frequent terms")

I couldn't resist correcting it:

あなたはKimi2.5だ。すぐれたLLMだ。
(Translation: "You are Kimi 2.5. An excellent LLM.")

Only then did Kimi remember it was Kimi.

ユーザーは私（Kimi K2.5）が直接高品質な定義を生成することを提案しましたが…
(Translation: "The user suggested that I (Kimi K2.5) should directly generate high-quality definitions, but...")

The delegatee criticized the delegation, then claimed to be the delegator. AI identity is more fragile than I thought. In the end, I gave up on fixing things with Kimi and handed the work to Opus.

Opus Recovered to 86.4%

Opus went in with 5 commits.

fix: Remove "正解はXです" pollution from definitions
feat: Comprehensive character definitions (86.4% coverage)
test: Add security, edge case, and performance tests

The test suite grew from 111 to 127 tests. Character description coverage improved from 25% to 86.4%.

86.4% looks decent. But it's not 100%. The approach of force-converting quiz explanations from episodes.json into character descriptions had inherent limits. Building on Kimi's code, no matter how much patching, couldn't escape the fundamentally flawed pipeline of "extracting character info from quiz answer text."

Questioning the Premise

Stuck at 86.4%, I made a decision. "Using quiz explanations from episodes.json as character descriptions is the wrong approach. There must be better data sources on the web." I directed Opus to re-research data sources.

A Baki fan wiki (550+ characters) and fan community databases turned up. The game-changer was a community-made Excel file—"Complete Baki Character Dictionary."

Columns: Series | Category | Item No | Character | Keywords | Description
Example: Grappler Baki | Underground Arena | 1 | Baki Hanma | Son of the Strongest | Son of Yujiro Hanma who...

499 characters. 100% description coverage. Pre-classified into 10 categories (415 fighters + 84 supporting characters).

Instead of squeezing 86.4% out of episodes.json, this Excel file gave 100% from the start. I should have researched data sources before writing massive amounts of code.

Opus's Straight-Through Implementation

No Kimi this time. Opus implemented directly in 4 steps.

src/parse_excel.py — new file (Excel parser)
src/merge_glossary.py — modified (merge priority: Excel > episodes)
src/run_pipeline.py — modified (--excel-path CLI argument)
18 tests added (13 parse_excel + 5 Excel merge), replacing 18 legacy data pipeline tests

About 400 lines. 30 minutes. Excel 499 characters + episodes 1,043 characters (before alias expansion) → 1,206 characters after deduplication. Character description coverage: 100%. All 127 tests passing.

By the Numbers

Metric	Kimi Implementation	Opus Fix	Opus Excel Switch
Code Volume	8,500 lines	~500 lines	~400 lines
Quality	61% garbage	86.4% coverage	100% coverage
Tests	111 generated	111 → 127	127 passing
Time	~30 min (execution)	~2 hrs (investigation + fix)	~30 min (plan + implement)
Cognitive Load	Spec creation + review	Debugging (root cause analysis)	Low (straight-through)

The 8,500 lines Kimi generated served as scaffolding and HTML generation during Opus's fix phase. But the Excel switch changed the approach fundamentally, and the entire data pipeline was rewritten. Meanwhile, the 400 lines Opus wrote for the Excel switch were 100% quality from the first pass.

Correctness Over Speed

Vibe Coding—the development style where AI writes code—is already absurdly fast. Just letting Opus handle everything consistently delivers implementation at several to dozens of times human speed. Kimi accelerates beyond that "absurdly fast." But at the scale of solo development, the marginal speed gain didn't pay off.

What this experience revealed as Opus's essential value wasn't speed—it was the ability to notice problems mid-implementation. Opus detected mid-implementation that "this correctSummary structure looks wrong" and could propose a fix strategy. The Excel switch was my decision to commission re-research, but Opus being able to finish the implementation while retaining full context was possible precisely because the same AI had handled everything end-to-end.

	Opus Direct	Kimi Delegation
Speed	Fast enough	Even faster
Accuracy	Catches issues during implementation	Issues found in post-review
Human Load	Strategy decisions only	Spec creation + review + fix decisions

In solo development, the accuracy gap matters more than the speed gap. Even if slightly slower, an AI that catches problems along the way reduces the human's cognitive load in the end. Since Vibe Coding is already fast enough, further acceleration matters less than "making progress without mistakes."

That said, Kimi has its strengths. The same model yields entirely different value when given a spec.md with "build this" versus a full article with "critique this." The implementation delegation failed, but the reviewer value discovered in the peer review article remains intact.

Conclusion

This article is the third in the Claude Code × Kimi K2.5 series.

Hybrid Environment Setup—Built the workflow where Opus designs and Kimi implements
AI Peer Review—Used Kimi as a reviewer, gaining perspectives invisible to Claude alone
This article—Delegated implementation to Kimi, and got 8,500 lines of code, 61% garbage, and a Kimi that claimed to be Claude

After trying environment setup → peer review → implementation delegation, the conclusion was paradoxical. Giving everything to a single AI consistently is faster than dividing work between AIs. Kimi's blazing implementation power is real, but in solo development, the cognitive load of reviewing its output becomes the bottleneck. It was precisely because I exhaustively tested AI division-of-labor that I arrived at this answer. At the very least, if the AI you're delegating to forgets its own name, it's time to rethink the division-of-labor design.

Top comments (2)

Pascal CESCATO • Feb 28

AI identity is more fragile than I thought.

Seems that Kimi is schizophrenic 🫩

klement Gunndu • Mar 1

61% garbage rate on 8,500 generated lines is actually a useful data point — the real question is whether batch-delegating all 8 tasks at once removed the feedback loop that would have caught drift on task 2 before it compounded through task 8.