<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: phanngoc-0847</title>
    <description>The latest articles on DEV Community by phanngoc-0847 (@phanngoc0847).</description>
    <link>https://dev.to/phanngoc0847</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.us-east-2.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F353017%2F17e77f01-e693-4b91-aaff-4088c05bc721.png</url>
      <title>DEV Community: phanngoc-0847</title>
      <link>https://dev.to/phanngoc0847</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/phanngoc0847"/>
    <language>en</language>
    <item>
      <title>DeepSeek: The Open-Source AI That Shook the Industry</title>
      <dc:creator>phanngoc-0847</dc:creator>
      <pubDate>Sun, 21 Jun 2026 05:34:41 +0000</pubDate>
      <link>https://dev.to/phanngoc0847/deepseek-the-open-source-ai-that-shook-the-industry-1hp4</link>
      <guid>https://dev.to/phanngoc0847/deepseek-the-open-source-ai-that-shook-the-industry-1hp4</guid>
      <description>&lt;blockquote&gt;
&lt;p&gt;In January 2025, a Chinese AI lab quietly released a model that sent shockwaves through Silicon Valley — and permanently changed how the world thinks about AI development costs.&lt;/p&gt;
&lt;/blockquote&gt;




&lt;h2&gt;
  
  
  🤔 What is DeepSeek?
&lt;/h2&gt;

&lt;p&gt;DeepSeek is a Chinese AI research lab that drew global attention in early 2025. Its flagship models, &lt;strong&gt;DeepSeek-V3&lt;/strong&gt; and &lt;strong&gt;DeepSeek-R1&lt;/strong&gt;, reported benchmark results comparable to GPT-4o, Claude 3.5 Sonnet, and OpenAI o1 at a fraction of the reported training cost.&lt;/p&gt;




&lt;h2&gt;
  
  
  💥 Why Did It Shake the Industry?
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Factor&lt;/th&gt;
&lt;th&gt;DeepSeek&lt;/th&gt;
&lt;th&gt;Western Competitors&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Training Cost&lt;/td&gt;
&lt;td&gt;~$5.6 million (reported for the final V3 training run, 2.788M H800 GPU hours)&lt;/td&gt;
&lt;td&gt;Hundreds of millions&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Model Weights&lt;/td&gt;
&lt;td&gt;Open source ✅&lt;/td&gt;
&lt;td&gt;Mostly closed&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Reasoning (AIME)&lt;/td&gt;
&lt;td&gt;Matches o1 🏆&lt;/td&gt;
&lt;td&gt;o1-level&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Compute Required&lt;/td&gt;
&lt;td&gt;Highly optimized&lt;/td&gt;
&lt;td&gt;Massive GPU clusters&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;ul&gt;
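&lt;li&gt;💰 &lt;strong&gt;Cost efficiency&lt;/strong&gt;: about $5.6M reported for the final V3 training run (excluding prior research and ablation experiments) vs. the hundreds of millions reported for comparable Western models&lt;/li&gt;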
&lt;li&gt;🔓 &lt;strong&gt;Open weights&lt;/strong&gt;: Freely available for fine-tuning and local deployment&lt;/li&gt;
&lt;li&gt;🧠 &lt;strong&gt;Reasoning&lt;/strong&gt;: DeepSeek-R1 performs on par with OpenAI o1 on AIME 2024 and MATH-500&lt;/li&gt;
&lt;li&gt;⚡ &lt;strong&gt;Novel architecture&lt;/strong&gt;: MLA, MoE, MTP, FP8, DualPipe innovations&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  🏗️ Architecture Deep Dive
&lt;/h2&gt;

&lt;blockquote&gt;
&lt;p&gt;Based on DeepSeek-V3 (arxiv: 2412.19437) and DeepSeek-R1 (2501.12948) technical reports&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h3&gt;
  
  
  1. Multi-Head Latent Attention (MLA)
&lt;/h3&gt;

&lt;p&gt;Standard multi-head attention caches full key and value tensors for every attention head. MLA compresses them into &lt;strong&gt;low-rank latent vectors&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;128&lt;/strong&gt; attention heads × 128 dims/head&lt;/li&gt;
&lt;li&gt;KV compressed to a &lt;strong&gt;512-dim&lt;/strong&gt; latent vector, plus a small decoupled RoPE key, instead of full-rank keys and values
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;93.3% KV cache reduction&lt;/strong&gt; and &lt;strong&gt;5.76× maximum generation throughput&lt;/strong&gt;, as reported for DeepSeek-V2 (which introduced MLA) relative to the dense DeepSeek 67B
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Standard MHA:  cache(K,V) per head  →  O(num_heads × d_head)
DeepSeek MLA:  cache(latent_KV)     →  O(512)  ← 93% smaller
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Paper detail&lt;/strong&gt;: MLA performs low-rank joint compression using a compressed latent vector &lt;code&gt;c_KV ∈ ℝ^d_c&lt;/code&gt; where &lt;code&gt;d_c &amp;lt;&amp;lt; d_h × n_h&lt;/code&gt;. Decoupled RoPE-carrying keys are maintained separately to preserve positional encoding fidelity.&lt;/p&gt;
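
&lt;p&gt;To make the cache sizes concrete, here is a minimal back-of-the-envelope sketch. It counts cached values per token per layer, using the head count, head dimension, and latent dimension listed above. The 64-dim decoupled RoPE key is an assumption taken from the V3 report, and the function names are hypothetical. The per-layer ratio it prints is not the same quantity as the 93.3% figure, which is a whole-model comparison against DeepSeek 67B in the V2 report.&lt;/p&gt;

```python
# Per-token, per-layer KV cache size, counted in cached values (not bytes).
# Head count, head dim and latent dim come from the article; the 64-dim
# decoupled RoPE key is an assumption taken from the DeepSeek-V3 report.


def mha_cache_values(n_heads, d_head):
    """Standard multi-head attention caches one key and one value per head."""
    return 2 * n_heads * d_head


def mla_cache_values(d_latent, d_rope):
    """MLA caches the shared latent vector plus the decoupled RoPE key."""
    return d_latent + d_rope


mha = mha_cache_values(n_heads=128, d_head=128)
mla = mla_cache_values(d_latent=512, d_rope=64)
reduction = 1 - mla / mha

print(f"MHA: {mha} values, MLA: {mla} values, reduction: {reduction:.1%}")
```

&lt;p&gt;With these dimensions, plain MHA would cache 32,768 values per token per layer and MLA caches 576.&lt;/p&gt;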

&lt;h3&gt;
  
  
  2. DeepSeekMoE — Sparse Activation at Scale
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;&lt;/th&gt;
&lt;th&gt;V2&lt;/th&gt;
&lt;th&gt;V3&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Total Params&lt;/td&gt;
&lt;td&gt;236B&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;671B&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Active/Token&lt;/td&gt;
&lt;td&gt;21B&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;37B&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Experts/Layer&lt;/td&gt;
&lt;td&gt;160 routed + 2 shared&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;256 routed + 1 shared&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Top-K&lt;/td&gt;
&lt;td&gt;8 (2 shared + 6 routed)&lt;/td&gt;
&lt;td&gt;
&lt;strong&gt;9&lt;/strong&gt; (1 shared + 8 routed)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Activation&lt;/td&gt;
&lt;td&gt;~9%&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;~5.5%&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Training cost vs dense&lt;/td&gt;
&lt;td&gt;-42.5% (vs DeepSeek 67B)&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;-82%&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;blockquote&gt;
&lt;p&gt;671B total params but only 37B fire per token — like a 671-doctor hospital where only 37 attend each patient.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;strong&gt;Routing constraint&lt;/strong&gt;: node-limited routing restricts each token to at most M=4 compute nodes, ensuring communication locality across 2,048 H800 GPUs.&lt;/p&gt;

&lt;h3&gt;
  
  
  3. Auxiliary-Loss-Free Load Balancing (ALF-LB)
&lt;/h3&gt;

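&lt;p&gt;Traditional MoE training adds an auxiliary loss to prevent routing collapse, but a large auxiliary loss can hurt model quality. DeepSeek instead adds a &lt;strong&gt;per-expert bias term&lt;/strong&gt; to the routing scores. The bias is adjusted by a simple rule rather than learned by gradient descent, and it affects only which experts are selected, not the gating weights:&lt;/p&gt;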

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Standard: Loss = task_loss + λ × aux_balance_loss  ← degrades quality
DeepSeek: Route = top-K(affinity_score + bias_k)   ← bias not in gradient!
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Dynamic adjustment: after each training step, the bias of an overloaded expert is decreased by γ and the bias of an underloaded expert is increased by γ, where γ is a fixed update speed. No gradient flows through the balance signal.&lt;/p&gt;
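
&lt;p&gt;The update rule above can be sketched in a few lines of plain Python. This is a toy illustration under stated assumptions, not DeepSeek's implementation: the function names, the affinity numbers, and the choice of γ = 0.25 are all hypothetical, and a real router works on batched tensors.&lt;/p&gt;

```python
# Toy sketch of bias-based load balancing; all names here are hypothetical.


def route(affinity, bias, k):
    """Select the k experts with the highest affinity + bias.

    The bias only influences selection; gating weights would still be
    computed from the raw affinity scores.
    """
    ranked = sorted(range(len(affinity)),
                    key=lambda e: affinity[e] + bias[e],
                    reverse=True)
    return ranked[:k]


def update_bias(bias, load, gamma):
    """Decrease the bias of overloaded experts, increase it for underloaded ones."""
    mean_load = sum(load) / len(load)
    updated = []
    for b, n in zip(bias, load):
        if n > mean_load:
            updated.append(b - gamma)
        elif n == mean_load:
            updated.append(b)
        else:
            updated.append(b + gamma)
    return updated


# Four tokens whose raw affinities all favour expert 0 (made-up numbers).
tokens = [
    [0.9, 0.5, 0.4, 0.3],
    [0.8, 0.6, 0.2, 0.1],
    [0.7, 0.3, 0.5, 0.2],
    [0.9, 0.2, 0.3, 0.6],
]
bias = [0.0, 0.0, 0.0, 0.0]
history = []
for step in range(2):
    load = [0, 0, 0, 0]
    for affinity in tokens:
        for expert in route(affinity, bias, k=1):
            load[expert] += 1
    history.append(load)
    bias = update_bias(bias, load, gamma=0.25)

print(history)
```

&lt;p&gt;In this toy run every token initially picks expert 0. After a single bias update the same tokens spread across experts 1, 2, and 3, without any extra loss term.&lt;/p&gt;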

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Method&lt;/th&gt;
&lt;th&gt;Val Loss&lt;/th&gt;
&lt;th&gt;Imbalance&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Auxiliary loss&lt;/td&gt;
&lt;td&gt;3.690&lt;/td&gt;
&lt;td&gt;0.074&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;ALF-LB&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;3.646&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;0.090&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

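&lt;p&gt;In this comparison the bias-based method reaches a lower validation loss (3.646 vs 3.690) while expert imbalance stays in the same range (0.090 vs 0.074), so the quality gain costs only a slightly less even load.&lt;/p&gt;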

&lt;h3&gt;
  
  
  4. FP8 Training — First at 671B Scale
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Aspect&lt;/th&gt;
&lt;th&gt;Detail&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Memory saving&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;50% vs BF16&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Speed gain&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;Theoretical 2× GEMM speed vs BF16&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Quality loss&lt;/td&gt;
&lt;td&gt;Relative loss error &amp;lt; 0.25% vs BF16 baseline&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Activation format&lt;/td&gt;
&lt;td&gt;1×128 tile-wise (per-token, 128 channels)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Weight format&lt;/td&gt;
&lt;td&gt;128×128 block-wise (input × output channels)&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

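&lt;p&gt;&lt;strong&gt;Key trick&lt;/strong&gt;: FP8 GEMM accumulation on H800 Tensor Cores retains only about 14 bits of precision, so DeepSeek promotes partial sums to FP32 accumulation every 128 elements along the inner dimension to prevent numerical drift. Fine-grained grouping (1×128 tiles) also handles outlier activations far better than per-tensor quantization, because each tile gets its own scaling factor.&lt;/p&gt;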
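
&lt;p&gt;A minimal sketch of why per-tile scaling helps with outliers. It only computes scaling factors and does no real FP8 rounding. The constant 448 is the largest finite value of the E4M3 format; the function name and the sample activations are hypothetical, and all-zero tiles are not handled.&lt;/p&gt;

```python
# Compare one scale for a whole row against one scale per 1 x 128 tile.
# No actual FP8 casting happens here; this only shows the scaling factors.

FP8_E4M3_MAX = 448.0  # largest finite value representable in E4M3


def tile_scales(row, tile=128):
    """One scale per 1 x tile slice so each slice maps onto the FP8 range."""
    scales = []
    for start in range(0, len(row), tile):
        chunk = row[start:start + tile]
        scales.append(max(abs(x) for x in chunk) / FP8_E4M3_MAX)
    return scales


row = [0.5] * 256
row[3] = 200.0  # a single outlier activation in the first tile

per_tensor = max(abs(x) for x in row) / FP8_E4M3_MAX
per_tile = tile_scales(row)

print("per-tensor scale:", per_tensor)
print("per-tile scales: ", per_tile)
```

&lt;p&gt;With a single scale, the outlier forces every small value in the row to share a coarse step size. With per-tile scales, only the first tile pays that price and the second tile keeps a scale 400 times finer.&lt;/p&gt;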

&lt;h3&gt;
  
  
  5. DualPipe — Smarter Pipeline Parallelism
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Standard 1F1B:  [F][F][F][F][ bubble ][ bubble ][B][B][B][B]
DualPipe:       [F][F][B][F][B][F][B][B]  ← computation + comm overlapped
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;DualPipe feeds micro-batches from &lt;strong&gt;both pipeline ends simultaneously&lt;/strong&gt; and pairs each forward chunk with a backward chunk, so that communication for one chunk runs while the other computes. The ratio of GPU SMs dedicated to communication versus computation is tuned manually.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Result&lt;/strong&gt;: fewer pipeline bubbles than 1F1B or ZeroBubble and near-zero all-to-all communication overhead. In the report's example schedule with 8 pipeline-parallel ranks and 20 micro-batches, most communication is hidden behind computation.&lt;/p&gt;

&lt;h3&gt;
  
  
  6. Multi-Token Prediction (MTP): Looking Beyond the Next Token
&lt;/h3&gt;

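&lt;p&gt;Standard LLMs are trained to predict one token at a time. DeepSeek-V3's MTP objective additionally trains each position to predict &lt;strong&gt;further future tokens&lt;/strong&gt; through sequential modules that keep the causal chain intact, rather than through independent parallel heads. The table shows the general scheme with three MTP depths; the V3 report itself sets D=1, so each position predicts one extra token beyond the next one:&lt;/p&gt;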

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Depth&lt;/th&gt;
&lt;th&gt;Predicts&lt;/th&gt;
&lt;th&gt;Block&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Main model&lt;/td&gt;
&lt;td&gt;token t+1&lt;/td&gt;
&lt;td&gt;Main Transformer&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;MTP depth 1&lt;/td&gt;
&lt;td&gt;token t+2&lt;/td&gt;
&lt;td&gt;TRM₁ (dedicated)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;MTP depth 2&lt;/td&gt;
&lt;td&gt;token t+3&lt;/td&gt;
&lt;td&gt;TRM₂ (dedicated)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;MTP depth 3&lt;/td&gt;
&lt;td&gt;token t+4&lt;/td&gt;
&lt;td&gt;TRM₃ (dedicated)&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Each depth &lt;code&gt;k&lt;/code&gt; &lt;strong&gt;shares the embedding layer and output head&lt;/strong&gt; with the main model. A projection matrix &lt;code&gt;M_k&lt;/code&gt; combines the hidden representation from the previous depth with the embedding of the token &lt;code&gt;k&lt;/code&gt; positions ahead, which keeps the causal chain complete at every depth.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Training loss&lt;/strong&gt;: &lt;code&gt;L_MTP = (λ/D) × Σ L_MTP^k&lt;/code&gt; — weighted contribution alongside primary language modeling loss.&lt;/p&gt;
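
&lt;p&gt;As a small worked example of the formula above, the sketch below averages the per-depth losses, scales them by λ, and adds the result to the main language modeling loss. The loss values and λ = 0.5 are made up for illustration, and the function names are hypothetical.&lt;/p&gt;

```python
# Combine the main next-token loss with the weighted MTP losses.
# The numbers used below are illustrative, not taken from any report.


def mtp_loss(depth_losses, lam):
    """L_MTP = (lam / D) * sum of the per-depth losses, with D = number of depths."""
    depth = len(depth_losses)
    return lam / depth * sum(depth_losses)


def total_loss(main_loss, depth_losses, lam):
    """Primary language modeling loss plus the weighted MTP term."""
    return main_loss + mtp_loss(depth_losses, lam)


extra = mtp_loss([2.5, 3.0], lam=0.5)
total = total_loss(2.0, [2.5, 3.0], lam=0.5)
print(extra, total)
```

&lt;p&gt;Dividing by D keeps the MTP term on the same scale regardless of how many depths are used, so λ alone controls its weight.&lt;/p&gt;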

&lt;blockquote&gt;
&lt;p&gt;💡 MTP is primarily a &lt;strong&gt;training objective&lt;/strong&gt;. At inference the extra modules can simply be discarded, leaving the main model to run on its own, or they can be reused for speculative decoding to speed up generation.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h3&gt;
  
  
  7. GRPO — Emergent Reasoning via Pure RL
&lt;/h3&gt;

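&lt;p&gt;DeepSeek-R1-Zero showed that reasoning behavior can &lt;strong&gt;emerge from pure RL&lt;/strong&gt;, without supervised fine-tuning on human-annotated reasoning chains. The released DeepSeek-R1 adds a small cold-start SFT stage before RL to improve readability. The core GRPO loop:&lt;/p&gt;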

&lt;ol&gt;
&lt;li&gt;Sample G responses per math/code question&lt;/li&gt;
&lt;li&gt;Score with rule-based verifier (objective ground truth)&lt;/li&gt;
&lt;li&gt;Optimize relative to group average (no value network needed)&lt;/li&gt;
&lt;/ol&gt;
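
&lt;p&gt;Step 3 can be sketched as a group-relative advantage: each sampled response is scored against the mean and spread of its own group, so no separate value network is needed. This is a minimal sketch, assuming binary rewards from a rule-based verifier. The function name, the use of the population standard deviation, and the small &lt;code&gt;eps&lt;/code&gt; term are assumptions for illustration.&lt;/p&gt;

```python
# Group-relative advantages for one prompt with G sampled responses.
from statistics import mean, pstdev


def group_advantages(rewards, eps=1e-8):
    """Normalise each reward against its own group's mean and spread."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    # eps avoids division by zero when every response gets the same reward.
    return [(r - mu) / (sigma + eps) for r in rewards]


# Rewards from a rule-based verifier: 1.0 = correct answer, 0.0 = incorrect.
rewards = [1.0, 0.0, 0.0, 1.0]
adv = group_advantages(rewards)
print(adv)
```

&lt;p&gt;Correct responses receive a positive advantage and incorrect ones a negative advantage of the same size. If all responses in a group score the same, every advantage is zero and that prompt contributes no learning signal.&lt;/p&gt;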

&lt;p&gt;&lt;strong&gt;Emergent behaviors&lt;/strong&gt;: self-reflection, self-verification, dynamic strategy switching.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;DeepSeek-R1 &lt;strong&gt;edges out OpenAI o1&lt;/strong&gt; on AIME 2024 (79.8% vs 79.2% pass@1). R1-Zero, trained with RL alone and no SFT, reaches 71.0% on the same benchmark.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h3&gt;
  
  
  Architecture Summary
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Innovation&lt;/th&gt;
&lt;th&gt;Key Metric&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Multi-Head Latent Attention&lt;/td&gt;
&lt;td&gt;93.3% KV cache reduction, 5.76× throughput (V2 vs DeepSeek 67B)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Sparse MoE&lt;/td&gt;
&lt;td&gt;5.5% activation (671B params, 37B active)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;ALF-LB&lt;/td&gt;
&lt;td&gt;0.044 lower validation loss than aux-loss balancing&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;FP8 Training&lt;/td&gt;
&lt;td&gt;2× speed, 50% memory, &amp;lt;0.25% quality loss&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;DualPipe&lt;/td&gt;
&lt;td&gt;Near-zero all-to-all comm overhead&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Multi-Token Prediction&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;Extra future-token objective (D=1 in V3), causal chain, shared embeddings&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;GRPO + RL&lt;/td&gt;
&lt;td&gt;R1 scores 79.8% vs o1's 79.2% on AIME 2024; R1-Zero learns to reason without SFT&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;




&lt;h2&gt;
  
  
  📈 Market Impact
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;📉 NVIDIA lost roughly &lt;strong&gt;$600B&lt;/strong&gt; in market cap in a single day (January 27, 2025), about a week after R1 was released&lt;/li&gt;
&lt;li&gt;🔄 &lt;strong&gt;Efficiency alongside raw compute&lt;/strong&gt;: renewed attention to how much capability can be gained from engineering rather than scale alone&lt;/li&gt;
&lt;li&gt;🌍 Democratized frontier AI for developers worldwide&lt;/li&gt;
&lt;li&gt;🏃 Added pressure on OpenAI, Google, and Meta around pricing and open-weight releases&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  🛠️ Run DeepSeek Locally
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;ollama pull deepseek-r1:8b
ollama run deepseek-r1:8b
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



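&lt;p&gt;The &lt;code&gt;deepseek-r1:8b&lt;/code&gt; tag is a small distilled variant, not the full 671B model, which is why it fits on consumer hardware. Use cases: local code assistants, private RAG pipelines, domain fine-tuning, self-hosted inference.&lt;/p&gt;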
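
&lt;p&gt;Once the model is pulled, it can also be called from code through Ollama's local REST API. The sketch below uses only the Python standard library and assumes Ollama is running on its default port 11434 on the same machine. The helper names &lt;code&gt;build_request&lt;/code&gt; and &lt;code&gt;ask&lt;/code&gt; are hypothetical.&lt;/p&gt;

```python
# Minimal client for a locally running Ollama server (assumed on port 11434).
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"


def build_request(prompt, model="deepseek-r1:8b"):
    """Build a non-streaming generate request for the local Ollama server."""
    payload = {"model": model, "prompt": prompt, "stream": False}
    return urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def ask(prompt):
    """Send the prompt and return the generated text."""
    with urllib.request.urlopen(build_request(prompt)) as resp:
        return json.loads(resp.read())["response"]


# Requires a running Ollama server with the model already pulled:
# print(ask("Explain mixture-of-experts in two sentences."))
```

&lt;p&gt;Setting &lt;code&gt;stream&lt;/code&gt; to false returns the whole answer as one JSON object, which keeps the example short. For interactive use, streaming is usually the better choice.&lt;/p&gt;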




&lt;h2&gt;
  
  
  🎯 Conclusion
&lt;/h2&gt;

&lt;p&gt;DeepSeek proved the AI race isn't won by the biggest budget. By combining MLA, sparse MoE, MTP, FP8, DualPipe, and GRPO with open-source values, they democratized frontier AI and forced the entire industry to rethink its assumptions.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;References:&lt;/strong&gt; &lt;a href="https://arxiv.org/abs/2412.19437" rel="noopener noreferrer"&gt;DeepSeek-V3&lt;/a&gt; · &lt;a href="https://arxiv.org/abs/2501.12948" rel="noopener noreferrer"&gt;DeepSeek-R1&lt;/a&gt; · &lt;a href="https://arxiv.org/abs/2405.04434" rel="noopener noreferrer"&gt;DeepSeek-V2&lt;/a&gt;&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Have you tried DeepSeek? Share in the comments! 👇&lt;/em&gt;&lt;/p&gt;

</description>
      <category>deepseek</category>
      <category>ai</category>
      <category>machinelearning</category>
      <category>opensource</category>
    </item>
  </channel>
</rss>
