FlashMLA Official GitHub Repo: FlashMLA - deepseek-ai - GitHub
DeepSeek Official Announcement of FlashMLA on X:
Hacker News Discussion: DeepSeek Open Source FlashMLA – MLA Decoding Kernel for Hopper GPUs | Hacker News
DeepSeek Open Source Week series
Day 1: FlashMLA
🚀 Day 1 of #OpenSourceWeek: FlashMLA
Honored to share FlashMLA - our efficient MLA decoding kernel for Hopper GPUs, optimized for variable-length sequences and now in production.
✅ BF16 support
✅ Paged KV cache (block size 64)
⚡ 3000 GB/s memory-bound & 580 TFLOPS compute-bound on H800
🔗 Explore on GitHub: https://github.com/deepseek-ai/FlashMLA
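For a sense of how a decode step would drive the kernel, here is a rough sketch modeled on the usage example in the FlashMLA README. The shapes (one query token per request, a single 576-wide latent KV head with a 512-dim value slice, 64-token pages) and the exact argument list are assumptions on my part and may differ from the current API.

```python
# Rough decode-step sketch modeled on the FlashMLA README usage example.
# Shapes and argument names are assumptions; check the repo for the current API.
import torch
from flash_mla import get_mla_metadata, flash_mla_with_kvcache

batch, s_q, h_q, h_kv = 4, 1, 128, 1          # decoding: 1 new token per request
d, dv, block_size = 576, 512, 64              # MLA latent width / value width / page size
device, dtype = "cuda", torch.bfloat16

cache_seqlens = torch.randint(64, 4096, (batch,), dtype=torch.int32, device=device)
max_blocks = (int(cache_seqlens.max()) + block_size - 1) // block_size
block_table = torch.arange(batch * max_blocks, dtype=torch.int32,
                           device=device).view(batch, max_blocks)
kvcache = torch.randn(batch * max_blocks, block_size, h_kv, d, dtype=dtype, device=device)
q = torch.randn(batch, s_q, h_q, d, dtype=dtype, device=device)

# Plan the split-KV schedule once per decode step; reuse it across all layers.
tile_scheduler_metadata, num_splits = get_mla_metadata(cache_seqlens, s_q * h_q // h_kv, h_kv)

o, lse = flash_mla_with_kvcache(
    q, kvcache, block_table, cache_seqlens, dv,
    tile_scheduler_metadata, num_splits, causal=True,
)
print(o.shape)  # expected: (batch, s_q, h_q, dv)
```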
Day 2: DeepEP
🚀 Day 2 of #OpenSourceWeek: DeepEP
Excited to introduce DeepEP - the first open-source EP communication library for MoE model training and inference.
✅ Efficient and optimized all-to-all communication
✅ Both intranode and internode support with NVLink and RDMA
✅ High-throughput kernels for training and inference prefilling
✅ Low-latency kernels for inference decoding
✅ Native FP8 dispatch support
✅ Flexible GPU resource control for computation-communication overlapping
🔗 GitHub: https://github.com/deepseek-ai/DeepEP
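DeepEP's dispatch/combine kernels accelerate the token shuffle that expert parallelism needs: each rank sends every token to the rank hosting its routed expert, runs the local experts, then sends the outputs back. The sketch below shows that pattern with plain torch.distributed collectives, kept to equal-sized splits for simplicity; it is not DeepEP's API, just the communication it optimizes.

```python
# Not DeepEP's API: a plain torch.distributed sketch of the expert-parallel
# all-to-all pattern (dispatch -> local experts -> combine).
# Launch with: torchrun --nproc_per_node=<N> moe_a2a_sketch.py
import torch
import torch.distributed as dist

def main():
    dist.init_process_group("nccl")
    rank, world = dist.get_rank(), dist.get_world_size()
    torch.cuda.set_device(rank % torch.cuda.device_count())

    hidden, tokens_per_peer = 16, 4
    # Chunk i of `send` holds the tokens this rank routes to experts on rank i.
    send = torch.randn(world * tokens_per_peer, hidden, device="cuda")

    # Dispatch: after this call, each rank holds the tokens routed to its experts.
    recv = torch.empty_like(send)
    dist.all_to_all_single(recv, send)

    expert_out = torch.relu(recv)  # stand-in for the local expert MLPs

    # Combine: the symmetric all-to-all returns each token's output to its source rank.
    combined = torch.empty_like(expert_out)
    dist.all_to_all_single(combined, expert_out)

    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```

In a real MoE layer the splits are ragged and the transfers are overlapped with computation, which is exactly what DeepEP's NVLink/RDMA kernels and FP8 dispatch are built to handle.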
Day 3: DeepGEMM
🚀 Day 3 of #OpenSourceWeek: DeepGEMM
Introducing DeepGEMM - an FP8 GEMM library that supports both dense and MoE GEMMs, powering V3/R1 training and inference.
⚡ Up to 1350+ FP8 TFLOPS on Hopper GPUs
✅ No heavy dependency, as clean as a tutorial
✅ Fully Just-In-Time compiled
✅ Core logic at ~300 lines - yet outperforms expert-tuned kernels across most matrix sizes
✅ Supports dense layout and two MoE layouts
🔗 GitHub: https://github.com/deepseek-ai/DeepGEMM
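To make the block-scaled FP8 GEMM idea concrete, here is a pure-PyTorch reference of what such a kernel computes: quantize each 128-wide block of the inputs to float8_e4m3fn with its own scale, then multiply with the scales reapplied and accumulate in higher precision. The block size and per-row scaling granularity are illustrative assumptions, not DeepGEMM's exact layout, and this reference is of course nowhere near 1350 TFLOPS.

```python
# Pure-PyTorch reference for what a block-scaled FP8 GEMM computes.
# Illustrative only: the real kernel fuses scaling into the Hopper tensor-core
# pipeline instead of dequantizing like this.
import torch

BLOCK = 128
FP8_MAX = 448.0  # largest normal value of float8_e4m3fn

def quant_per_block(x: torch.Tensor):
    """Quantize each 1 x BLOCK slice of the last dim to FP8 with its own scale."""
    m, k = x.shape
    xb = x.view(m, k // BLOCK, BLOCK)
    scale = xb.abs().amax(dim=-1, keepdim=True).clamp(min=1e-4) / FP8_MAX
    q = (xb / scale).to(torch.float8_e4m3fn)
    return q.view(m, k), scale.squeeze(-1)          # scales: (m, k // BLOCK)

def ref_block_scaled_gemm(a, a_scale, b, b_scale):
    """Dequantize blockwise, accumulate in FP32, return BF16 (an 'NT' GEMM: D = A @ B^T)."""
    m, k = a.shape
    n = b.shape[0]
    a32 = (a.to(torch.float32).view(m, k // BLOCK, BLOCK) * a_scale.unsqueeze(-1)).view(m, k)
    b32 = (b.to(torch.float32).view(n, k // BLOCK, BLOCK) * b_scale.unsqueeze(-1)).view(n, k)
    return (a32 @ b32.t()).to(torch.bfloat16)

a_q, a_s = quant_per_block(torch.randn(256, 512))
b_q, b_s = quant_per_block(torch.randn(384, 512))
print(ref_block_scaled_gemm(a_q, a_s, b_q, b_s).shape)  # torch.Size([256, 384])
```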
Day 4: Optimized Parallelism Strategies
🚀 Day 4 of #OpenSourceWeek: Optimized Parallelism Strategies
✅ DualPipe - a bidirectional pipeline parallelism algorithm for computation-communication overlap in V3/R1 training.
🔗 https://github.com/deepseek-ai/DualPipe
✅ EPLB - an expert-parallel load balancer for V3/R1.
🔗 https://github.com/deepseek-ai/eplb
📊 Analyze computation-communication overlap in V3/R1.
🔗 https://github.com/deepseek-ai/profile-data
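Of the three releases, EPLB is the easiest to picture with a toy example: given measured per-expert loads, replicate hot experts and pack the replicas so every GPU carries a similar share. The greedy sketch below only illustrates that idea; it is not EPLB's actual hierarchical/global rebalancing algorithm, and the function and numbers are made up for illustration.

```python
# Toy sketch of expert-parallel load balancing (not EPLB's actual algorithm):
# replicate the heaviest experts, then greedily pack replicas onto GPUs.
import heapq

def balance(expert_load: list[float], num_gpus: int, num_replicas: int) -> list[list[int]]:
    assert num_replicas >= len(expert_load)
    # Give extra replicas to the heaviest experts; each replica carries an equal
    # share of its expert's load.
    counts = [1] * len(expert_load)
    for _ in range(num_replicas - len(expert_load)):
        worst = max(range(len(expert_load)), key=lambda e: expert_load[e] / counts[e])
        counts[worst] += 1

    replicas = [(expert_load[e] / counts[e], e)
                for e in range(len(expert_load)) for _ in range(counts[e])]
    replicas.sort(reverse=True)  # longest-processing-time-first packing

    heap = [(0.0, g) for g in range(num_gpus)]   # (current load, gpu id)
    placement = [[] for _ in range(num_gpus)]
    for load, e in replicas:
        total, g = heapq.heappop(heap)
        placement[g].append(e)
        heapq.heappush(heap, (total + load, g))
    return placement

# Expert 0 is hot, so it gets extra replicas and its load spreads across GPUs.
print(balance([90, 30, 20, 10, 5, 5], num_gpus=4, num_replicas=8))
```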
Day 5: 3FS
🚀 Day 5 of #OpenSourceWeek: 3FS, Thruster for All DeepSeek Data Access
Fire-Flyer File System (3FS) - a parallel file system that utilizes the full bandwidth of modern SSDs and RDMA networks.
⚡ 6.6 TiB/s aggregate read throughput in a 180-node cluster
⚡ 3.66 TiB/min throughput on GraySort benchmark in a 25-node cluster
⚡ 40+ GiB/s peak throughput per client node for KVCache lookup
🧬 Disaggregated architecture with strong consistency semantics
✅ Training data preprocessing, dataset loading, checkpoint saving/reloading, embedding vector search & KVCache lookups for inference in V3/R1
🔥 3FS → https://github.com/deepseek-ai/3FS
⛲ Smallpond - data processing framework on 3FS → https://github.com/deepseek-ai/smallpond
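3FS itself is consumed as a file system (a mount plus native clients), so the quickest way to touch it from Python is through Smallpond. The snippet below is adapted from the kind of quickstart shown in the smallpond README; the file paths are placeholders, and the exact method names should be verified against the repo before use.

```python
# Adapted from a smallpond-style quickstart; paths are placeholders and the API
# surface should be verified against https://github.com/deepseek-ai/smallpond.
import smallpond

sp = smallpond.init()

# Load a Parquet dataset (on 3FS or any POSIX path) into a distributed DataFrame.
df = sp.read_parquet("prices.parquet")

# Hash-partition so each partition can be processed independently with DuckDB SQL.
df = sp.repartition(df, hash_by="ticker")
df = sp.partial_sql("SELECT ticker, min(price), max(price) FROM {0} GROUP BY ticker", df)

# Write results back out as Parquet and inspect them locally.
sp.write_parquet(df, "output/")
print(df.to_pandas())
```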