<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Xuan-yi-yan</title>
    <description>The latest articles on DEV Community by Xuan-yi-yan (@xuanyiyan).</description>
    <link>https://dev.to/xuanyiyan</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3987532%2Fc1322373-b550-47f7-8216-e3fd99fd5d08.png</url>
      <title>DEV Community: Xuan-yi-yan</title>
      <link>https://dev.to/xuanyiyan</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/xuanyiyan"/>
    <language>en</language>
    <item>
      <title>16 Days, 4.7M Params, Zero Black Boxes: Building a White-box Chinese Cognition Engine from Scratch</title>
      <dc:creator>Xuan-yi-yan</dc:creator>
      <pubDate>Tue, 16 Jun 2026 14:07:46 +0000</pubDate>
      <link>https://dev.to/xuanyiyan/16-days-47m-params-zero-black-boxes-building-a-white-box-chinese-cognition-engine-from-scratch-503m</link>
      <guid>https://dev.to/xuanyiyan/16-days-47m-params-zero-black-boxes-building-a-white-box-chinese-cognition-engine-from-scratch-503m</guid>
      <description>&lt;h1&gt;
  
  
  16 Days, 4.7M Params, Zero Black Boxes: Building a White-box Chinese Cognition Engine from Scratch
&lt;/h1&gt;

&lt;p&gt;&lt;strong&gt;Author: Wei Jinqi | June 16, 2026&lt;/strong&gt;&lt;/p&gt;




&lt;p&gt;Every time I use a large language model, the same thought nags at me: &lt;em&gt;I have no idea what's happening inside.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;95% accuracy? Great. But which weights fired? What linguistic features were extracted? Did it confuse "bank" (river) with "bank" (financial)? Nobody knows.&lt;/p&gt;

&lt;p&gt;So I spent 16 days building a Chinese language engine where &lt;strong&gt;every weight has a reason and every decision is traceable&lt;/strong&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Core Idea
&lt;/h2&gt;

&lt;p&gt;Instead of training a transformer on terabytes of text and hoping it learns Chinese, I designed each module to handle a &lt;strong&gt;specific linguistic function&lt;/strong&gt;:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Module&lt;/th&gt;
&lt;th&gt;Function&lt;/th&gt;
&lt;th&gt;Params&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;P1&lt;/td&gt;
&lt;td&gt;Char → Word encoding&lt;/td&gt;
&lt;td&gt;96K (frozen)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;P3-L&lt;/td&gt;
&lt;td&gt;Multi-dimensional attribute annotation&lt;/td&gt;
&lt;td&gt;0 (rule engine)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;P7&lt;/td&gt;
&lt;td&gt;Cross-sentence word routing&lt;/td&gt;
&lt;td&gt;226K&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Explore+Meta&lt;/td&gt;
&lt;td&gt;Learned gating over decode dims&lt;/td&gt;
&lt;td&gt;101K&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;P6&lt;/td&gt;
&lt;td&gt;Sentence → Word sequence decoding&lt;/td&gt;
&lt;td&gt;4.37M&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The modules are chained: &lt;strong&gt;P1 encodes → P7 routes → Gate modulates → P6 decodes&lt;/strong&gt;. Every intermediate state can be inspected.&lt;/p&gt;
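
&lt;p&gt;One way to picture "every intermediate state can be inspected" is a pipeline runner that records the output of each stage. The sketch below is only an illustration: the stage names follow the chain above, but &lt;code&gt;run_pipeline&lt;/code&gt; and the toy stage bodies are hypothetical stand-ins, not the project's code.&lt;/p&gt;

```python
def run_pipeline(stages, x):
    """Run stages in order, recording each intermediate state by name."""
    trace = [("input", x)]
    for name, fn in stages:
        x = fn(x)
        trace.append((name, x))
    return x, trace


# Placeholder stages: names follow the article, bodies are toy stand-ins.
stages = [
    ("P1 encode", lambda chars: [ord(c) % 7 for c in chars]),
    ("P7 route", lambda vec: list(reversed(vec))),
    ("Gate modulate", lambda vec: [v * 0.5 for v in vec]),
    ("P6 decode", lambda vec: [round(v) for v in vec]),
]

output, trace = run_pipeline(stages, "你好")
for name, state in trace:
    print(name, state)
```

&lt;p&gt;The point of the design is that &lt;code&gt;trace&lt;/code&gt; keeps one named entry per module, so a wrong output can be walked back stage by stage.&lt;/p&gt;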

&lt;h2&gt;
  
  
  Day-by-Day: The Good, The Bad, and The Mode Collapse
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Days 1-2: Laying Foundations (and Fighting Collapse)
&lt;/h3&gt;

&lt;p&gt;Day 1 was smooth. P1 (char→word encoder) and P3 (attribute stack — a rule engine that tags words with person/syntax/semantic/emotion/direction attributes) came together quickly.&lt;/p&gt;

&lt;p&gt;Day 2 introduced P7, the cross-sentence router. And &lt;strong&gt;everything broke&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;I used standard multi-head cross-attention. Every position — regardless of input — routed to the same output word. The dreaded &lt;strong&gt;Mode Collapse&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Six attempted fixes followed, and none of them fully worked:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;v2&lt;/strong&gt;: Diversity loss → still collapsed&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;v3&lt;/strong&gt;: Grouped loss → partially better&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;v4&lt;/strong&gt;: Temperature scaling → not enough&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;v5&lt;/strong&gt;: Contrastive learning → oscillated wildly&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;v6&lt;/strong&gt;: Gating mechanism → unstable&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;v7&lt;/strong&gt;: Hierarchical modulation → almost converged&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The breakthrough came when I noticed that Q and K were initialized to the identity matrix (eye init), so each head saw only one input dimension and had zero discrimination power.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;v8 (final)&lt;/strong&gt;: Xavier init for Q/K, eye init for V. Added an Explore network (loss → GELU MLP → 64D control signal) and a Meta network (signal + state → per-word gate). Mode collapse solved.&lt;/p&gt;
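
&lt;p&gt;A small stdlib-only sketch of why the initialization matters, assuming a hypothetical layout with one dimension per head (&lt;code&gt;d_model = num_heads = 8&lt;/code&gt;). With an identity projection, each head's rows touch exactly one input dimension. With Xavier-uniform values, drawn from the standard bound &lt;code&gt;sqrt(6 / (fan_in + fan_out))&lt;/code&gt;, every head mixes all input dimensions. The helper names are illustrative and do not come from the project.&lt;/p&gt;

```python
import math
import random


def eye_init(d):
    """Identity matrix: row i only reads input dimension i."""
    return [[1.0 if i == j else 0.0 for j in range(d)] for i in range(d)]


def xavier_uniform_init(d_out, d_in, rng):
    """Xavier/Glorot uniform init: U(-b, b) with b = sqrt(6 / (d_in + d_out))."""
    bound = math.sqrt(6.0 / (d_in + d_out))
    return [[rng.uniform(-bound, bound) for _ in range(d_in)] for _ in range(d_out)]


def inputs_seen_per_head(weight, num_heads):
    """Count the input dimensions with a nonzero weight into each head's rows."""
    head_dim = len(weight) // num_heads
    counts = []
    for h in range(num_heads):
        rows = weight[h * head_dim:(h + 1) * head_dim]
        seen = {j for row in rows for j, w in enumerate(row) if w != 0.0}
        counts.append(len(seen))
    return counts


rng = random.Random(0)
d_model, num_heads = 8, 8  # hypothetical sizes: one dimension per head

eye_counts = inputs_seen_per_head(eye_init(d_model), num_heads)
xavier_counts = inputs_seen_per_head(xavier_uniform_init(d_model, d_model, rng), num_heads)
print(eye_counts)
print(xavier_counts)
```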

&lt;h3&gt;
  
  
  Days 3-4: Gating Innovation and the Repetition Monster
&lt;/h3&gt;

&lt;p&gt;Day 3 built P3-L: 23 groups, 312 independent attention heads, each controlling one attribute dimension. I trained it jointly with P7 through a UnifiedExplore→UnifiedMeta gate.&lt;/p&gt;

&lt;p&gt;Day 4 introduced &lt;strong&gt;P6: the sentence→word decoder&lt;/strong&gt;. It was supposed to take a 256D sentence vector and output 16 distinct word embeddings.&lt;/p&gt;

&lt;p&gt;It output the same word 16 times. The &lt;strong&gt;Repetition Collapse&lt;/strong&gt; had begun.&lt;/p&gt;

&lt;p&gt;Six versions over two days:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;V1&lt;/strong&gt;: 16 parallel heads → all output same word&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;V2&lt;/strong&gt;: Serial residual extraction → gradient breakage&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;V3&lt;/strong&gt;: Remove detach, add damping → gradient entanglement&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;V4&lt;/strong&gt;: Weight transpose inverse projection → too aggressive&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;V5&lt;/strong&gt;: Orthogonal init, detach, 0.8 damping → too heavy, heads dead&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;V6&lt;/strong&gt;: Position embedding — &lt;code&gt;h + pos_embed[i]&lt;/code&gt; per head → &lt;strong&gt;solved&lt;/strong&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The simplest fix won. Each head receives the same &lt;code&gt;h&lt;/code&gt; but adds its own learned position embedding. No repetition penalty. No residuals. No detach. Just position diversity.&lt;/p&gt;
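
&lt;p&gt;The &lt;code&gt;h + pos_embed[i]&lt;/code&gt; idea can be sketched in plain Python. Two things here are assumptions made for illustration: the heads share a single projection &lt;code&gt;shared_w&lt;/code&gt; (which makes the collapse exact and easy to see), and &lt;code&gt;pos_embed&lt;/code&gt; is random rather than learned. Neither is claimed to match the real P6.&lt;/p&gt;

```python
import random


def linear(weight, x):
    """Plain matrix-vector product."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in weight]


rng = random.Random(0)
dim, num_heads = 4, 3  # toy sizes; the article uses a 256D vector and 16 heads

h = [rng.gauss(0, 1) for _ in range(dim)]
shared_w = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(dim)]
# Learned in a real model; random here just to make each head's input unique.
pos_embed = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(num_heads)]

# Every head gets the same h: all outputs are identical.
without_pos = [linear(shared_w, h) for _ in range(num_heads)]

# Head i gets h + pos_embed[i]: inputs differ, so outputs differ.
with_pos = [
    linear(shared_w, [a + b for a, b in zip(h, pos_embed[i])])
    for i in range(num_heads)
]

distinct_without = len({tuple(v) for v in without_pos})
distinct_with = len({tuple(v) for v in with_pos})
print(distinct_without, distinct_with)
```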

&lt;h3&gt;
  
  
  Day 5: The Gate That Stopped Learning
&lt;/h3&gt;

&lt;p&gt;Epoch after epoch, the gate stayed frozen — all 256 dimensions had &lt;strong&gt;std=0.0001&lt;/strong&gt;. Three bugs conspired:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;code&gt;explore_mod.weight&lt;/code&gt; zero-initialized → identical signal per dim&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;p3l_act&lt;/code&gt; zero-initialized → sigmoid(0)=0.5 for all dims&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;bias init scale=0.1&lt;/code&gt; too small → output stuck at 0.5&lt;/li&gt;
&lt;/ol&gt;
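
&lt;p&gt;The symmetry in the first two items is easy to reproduce. In this minimal sketch (the &lt;code&gt;gate&lt;/code&gt; function and the toy sizes are hypothetical), a sigmoid gate with zero weights and zero bias outputs exactly 0.5 in every dimension whatever the input signal is, so the spread across dimensions is zero. Random initialization gives each dimension a different value.&lt;/p&gt;

```python
import math
import random
import statistics


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def gate(weight, bias, signal):
    """One sigmoid gate value per output dimension."""
    return [
        sigmoid(sum(w * s for w, s in zip(row, signal)) + b)
        for row, b in zip(weight, bias)
    ]


rng = random.Random(0)
signal_dim, gate_dim = 4, 6  # toy sizes
signal = [rng.gauss(0, 1) for _ in range(signal_dim)]

zero_w = [[0.0] * signal_dim for _ in range(gate_dim)]
zero_b = [0.0] * gate_dim
rand_w = [[rng.gauss(0, 1) for _ in range(signal_dim)] for _ in range(gate_dim)]
rand_b = [rng.gauss(0, 1) for _ in range(gate_dim)]

zero_gate = gate(zero_w, zero_b, signal)
rand_gate = gate(rand_w, rand_b, signal)

print(zero_gate)                     # every entry is sigmoid(0) = 0.5
print(statistics.pstdev(zero_gate))  # 0.0
print(statistics.pstdev(rand_gate))  # nonzero spread across dimensions
```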

&lt;p&gt;Then I found an even worse bug: &lt;code&gt;gate.item()&lt;/code&gt; was used in the loss computation, converting the tensor to a Python float and &lt;strong&gt;severing the gradient chain&lt;/strong&gt;. The gate had been frozen for &lt;strong&gt;240 epochs&lt;/strong&gt; without anyone noticing.&lt;/p&gt;

&lt;p&gt;Fix: keep the gate as a tensor and let gradients flow back through explore and meta. Loss dropped from 0.56 to 0.28 in 3 epochs.&lt;/p&gt;
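
&lt;p&gt;To show the mechanism without depending on a deep learning library, here is a toy scalar autodiff. The &lt;code&gt;Value&lt;/code&gt; class is a hypothetical stand-in for a tensor, and its &lt;code&gt;item()&lt;/code&gt; mimics PyTorch's &lt;code&gt;Tensor.item()&lt;/code&gt; by returning a bare float. Once the gate passes through that float, nothing downstream can send a gradient back to the parameter that produced it.&lt;/p&gt;

```python
class Value:
    """Minimal scalar autodiff node (illustrative stand-in for a tensor)."""

    def __init__(self, data, parents=(), grad_fns=()):
        self.data = float(data)
        self.grad = 0.0
        self._parents = parents
        self._grad_fns = grad_fns  # one local-derivative function per parent

    def __mul__(self, other):
        other = other if isinstance(other, Value) else Value(other)
        return Value(
            self.data * other.data,
            (self, other),
            (lambda g: g * other.data, lambda g: g * self.data),
        )

    def item(self):
        # Like tensor.item(): a plain float with no graph attached.
        return self.data

    def backward(self, upstream=1.0):
        # Accumulate this path's gradient, then push it on to the parents.
        self.grad += upstream
        for parent, fn in zip(self._parents, self._grad_fns):
            parent.backward(fn(upstream))


w = Value(0.5)   # hypothetical gate parameter
gate = w * 2.0   # the gate depends on w

# Buggy path: the float round-trip drops the link back to w.
broken_loss = Value(gate.item()) * 3.0
broken_loss.backward()
grad_broken = w.grad

# Fixed path: the gate stays a graph node, so w receives a gradient.
w.grad = 0.0
intact_loss = gate * 3.0
intact_loss.backward()
grad_intact = w.grad

print(grad_broken, grad_intact)  # 0.0 6.0
```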

&lt;h3&gt;
  
  
  Day 6: The AI That Debates Itself
&lt;/h3&gt;

&lt;p&gt;I built a dual-agent debugging system: &lt;strong&gt;DeepSeek (engineer)&lt;/strong&gt; proposes fixes, &lt;strong&gt;Qwen (reviewer)&lt;/strong&gt; audits them. They debate until convergence.&lt;/p&gt;

&lt;p&gt;The system diagnosed four major bugs, including the gradient chain break. It would have saved days if I'd built it earlier.&lt;/p&gt;

&lt;h3&gt;
  
  
  Days 7-11: From 875K to 4.7M — Scaling Up
&lt;/h3&gt;

&lt;p&gt;Key improvements:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Replaced mean pooling with &lt;strong&gt;P5-style ±superposition&lt;/strong&gt; for sentence vectors&lt;/li&gt;
&lt;li&gt;Expanded P6 from 16 heads to &lt;strong&gt;128 independent heads&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;Built &lt;strong&gt;Context Cache System&lt;/strong&gt;: 3-tier (GPU/RAM/Disk), adaptive retrieval window, drift detection&lt;/li&gt;
&lt;li&gt;First benchmark: &lt;strong&gt;92.4% word accuracy&lt;/strong&gt; on 875K-param V18&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Days 12-16: CUDA Wars and Open Data
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Conquered CUDA OOM (P1 full attention → batch encoding)&lt;/li&gt;
&lt;li&gt;Fixed space-character collapse (HF data had spaces between Chinese chars → &lt;code&gt;ord(c) &amp;gt; 32&lt;/code&gt; filter)&lt;/li&gt;
&lt;li&gt;Assembled 52K public training pairs from HuggingFace + MuCGEC&lt;/li&gt;
&lt;li&gt;Launched V19 1000-epoch training: &lt;strong&gt;4.7M params, 141MB GPU, 100% public data&lt;/strong&gt;
&lt;/li&gt;
&lt;/ul&gt;
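
&lt;p&gt;The space filter is a one-liner. This sketch applies the &lt;code&gt;ord(c) &amp;gt; 32&lt;/code&gt; test from the list above; the function name and sample string are made up for illustration. Two caveats: the test only drops ASCII space and control characters, so a full-width space (U+3000) would pass through, and it would also remove the spaces between Latin-script words in mixed text.&lt;/p&gt;

```python
def strip_low_chars(text):
    """Keep only characters whose code point is above 32 (ASCII space)."""
    return "".join(c for c in text if ord(c) > 32)


sample = "我 喜 欢\t学 习\n"
cleaned = strip_low_chars(sample)
print(cleaned)  # 我喜欢学习
```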

&lt;h2&gt;
  
  
  Seven Bugs That Almost Won
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Bug&lt;/th&gt;
&lt;th&gt;Symptom&lt;/th&gt;
&lt;th&gt;Root Cause&lt;/th&gt;
&lt;th&gt;Fix&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Mode Collapse&lt;/td&gt;
&lt;td&gt;All outputs = same word&lt;/td&gt;
&lt;td&gt;Q/K eye-init, zero discrimination&lt;/td&gt;
&lt;td&gt;Xavier init + diversity architecture&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Gate Symmetry Lock&lt;/td&gt;
&lt;td&gt;All gate dims identical (std=0.0001)&lt;/td&gt;
&lt;td&gt;Two zero-initializations and an undersized bias init&lt;/td&gt;
&lt;td&gt;Proper random init for explore, act, bias&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Gradient Chain Break&lt;/td&gt;
&lt;td&gt;Gate not learning for 240 epochs&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;.item()&lt;/code&gt; severed gradient&lt;/td&gt;
&lt;td&gt;Keep as tensor&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Repetition Collapse&lt;/td&gt;
&lt;td&gt;16 heads → same word&lt;/td&gt;
&lt;td&gt;Parallel heads share identical input&lt;/td&gt;
&lt;td&gt;Position embedding V6&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;CUDA OOM&lt;/td&gt;
&lt;td&gt;25.76 GiB allocated&lt;/td&gt;
&lt;td&gt;P1 full cross-attention&lt;/td&gt;
&lt;td&gt;Batch encoding (50 words)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Space Collapse&lt;/td&gt;
&lt;td&gt;Model outputs spaces&lt;/td&gt;
&lt;td&gt;HF data formatting&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;ord(c) &amp;gt; 32&lt;/code&gt; filter&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;sent_vec Info Loss&lt;/td&gt;
&lt;td&gt;Different sentences → similar vectors&lt;/td&gt;
&lt;td&gt;Mean pooling&lt;/td&gt;
&lt;td&gt;Learnable ±weighted sum&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h2&gt;
  
  
  Results
&lt;/h2&gt;

&lt;h3&gt;
  
  
  V18 (875K params)
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Metric&lt;/th&gt;
&lt;th&gt;Score&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Word Accuracy&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;92.4%&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Exact Match&lt;/td&gt;
&lt;td&gt;76.3%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;ROUGE-L F1&lt;/td&gt;
&lt;td&gt;93.2&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Per-word Cosine&lt;/td&gt;
&lt;td&gt;0.96&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Speed&lt;/td&gt;
&lt;td&gt;14 ms/sentence (about 71 sentences/s)&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h3&gt;
  
  
  V19 (4.7M params, training in progress)
&lt;/h3&gt;

&lt;p&gt;Epoch 1 (from scratch, no pretraining): &lt;strong&gt;43.5%&lt;/strong&gt; word accuracy on held-out exam set. Target: &amp;gt;95% after 1000 epochs.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why This Matters
&lt;/h2&gt;

&lt;p&gt;LLMs are powerful but opaque. When GPT makes a mistake, you can't trace which neurons fired wrong. With V19, you can:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;See exactly which word attributes were used&lt;/li&gt;
&lt;li&gt;Trace which input words influenced each output&lt;/li&gt;
&lt;li&gt;Inspect why the gate opened or closed each dimension&lt;/li&gt;
&lt;li&gt;Debug layer by layer, like stepping through code&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This isn't about beating GPT. It's about building something &lt;strong&gt;you can understand completely&lt;/strong&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  What's Next
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Context system "write path": feed retrieved context into P7 to enable multi-turn dialogue&lt;/li&gt;
&lt;li&gt;Expand to 512-dim if quality plateaus (currently 128D)&lt;/li&gt;
&lt;li&gt;Multi-language extension of the attribute stack (P3-L)&lt;/li&gt;
&lt;li&gt;Open-source community contributions&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Try It
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;git clone https://github.com/Xuan-yi-yan/V18-cognitive-architecture
&lt;span class="nb"&gt;cd &lt;/span&gt;V18-cognitive-architecture
python download_public_data.py
python train_v19_full.py &lt;span class="nt"&gt;--data&lt;/span&gt; public &lt;span class="nt"&gt;--epochs&lt;/span&gt; 1000 &lt;span class="nt"&gt;--display&lt;/span&gt; 10
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Full model card and architecture docs on &lt;a href="https://huggingface.co/" rel="noopener noreferrer"&gt;Hugging Face&lt;/a&gt;.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;16 days. 7 dead bugs. 4.7 million parameters. Zero black boxes.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;That's just how I like it.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>nlp</category>
      <category>machinelearning</category>
      <category>chinese</category>
      <category>opensource</category>
    </item>
  </channel>
</rss>
