Andy

Flash MLA curated references

Flash MLA Official GitHub Repo: FlashMLA - deepseek-ai - GitHub

DeepSeek Official Announcement of Flash MLA on X:

Hacker News Discussion: DeepSeek Open Source FlashMLA – MLA Decoding Kernel for Hopper GPUs | Hacker News

DeepSeek Open Source Week series

Day 1: Flash MLA

πŸš€ Day 1 of #OpenSourceWeek: FlashMLA

Honored to share FlashMLA - our efficient MLA decoding kernel for Hopper GPUs, optimized for variable-length sequences and now in production.

βœ… BF16 support
βœ… Paged KV cache (block size 64)
⚑ 3000 GB/s memory-bound & 580 TFLOPS compute-bound on H800

πŸ”— Explore on GitHub: https://github.com/deepseek-ai/FlashMLA
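
For context on the "paged KV cache (block size 64)" bullet, here is a minimal NumPy sketch of the paged-KV idea: a block table maps each sequence's logical token positions onto fixed-size physical blocks, so variable-length sequences can share one cache pool. This is purely illustrative (the names and shapes are made up), not FlashMLA's CUDA kernel or Python API.

```python
import numpy as np

BLOCK_SIZE = 64  # FlashMLA's paged KV cache uses 64-token blocks

# Hypothetical illustration of a paged KV cache: physical blocks plus a block
# table that maps each sequence's logical positions onto those blocks.
num_blocks, head_dim = 16, 8
kv_cache = np.zeros((num_blocks, BLOCK_SIZE, head_dim), dtype=np.float32)

# Block table: row s lists the physical blocks backing sequence s, in order.
block_table = np.array([[3, 7, 11], [0, 5, 9]])
cache_seqlens = np.array([130, 70])  # variable-length sequences

def write_kv(seq_id: int, pos: int, kv_vec: np.ndarray) -> None:
    """Store one token's KV vector at logical position `pos` of sequence `seq_id`."""
    block = block_table[seq_id, pos // BLOCK_SIZE]   # which physical block
    offset = pos % BLOCK_SIZE                        # slot inside that block
    kv_cache[block, offset] = kv_vec

def read_kv(seq_id: int) -> np.ndarray:
    """Gather the full KV history of a sequence from its scattered blocks."""
    n = cache_seqlens[seq_id]
    blocks = block_table[seq_id, : (n + BLOCK_SIZE - 1) // BLOCK_SIZE]
    return kv_cache[blocks].reshape(-1, head_dim)[:n]

write_kv(0, 129, np.ones(head_dim))   # token 129 lands in block 11, slot 1
print(read_kv(0).shape)               # (130, 8)
```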

Day 2: DeepEP

πŸš€ Day 2 of #OpenSourceWeek: DeepEP

Excited to introduce DeepEP - the first open-source EP communication library for MoE model training and inference.

βœ… Efficient and optimized all-to-all communication
βœ… Both intranode and internode support with NVLink and RDMA
βœ… High-throughput kernels for training and inference prefilling
βœ… Low-latency kernels for inference decoding
βœ… Native FP8 dispatch support
βœ… Flexible GPU resource control for computation-communication overlapping

πŸ”— GitHub: https://github.com/deepseek-ai/DeepEP
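
As context for the all-to-all bullets, below is a tiny single-process sketch of the MoE dispatch/combine pattern that an expert-parallel (EP) all-to-all implements: tokens are bucketed by the rank hosting their routed expert, exchanged, processed by local experts, then combined back into the original order. DeepEP does this with NVLink/RDMA kernels; every name here is illustrative only.

```python
import numpy as np

# Hypothetical single-process simulation of MoE dispatch -> all-to-all -> combine.
num_ranks, experts_per_rank, hidden = 4, 2, 8
num_experts = num_ranks * experts_per_rank

tokens = np.random.randn(16, hidden).astype(np.float32)
topk_expert = np.random.randint(0, num_experts, size=(16,))  # top-1 routing

# Dispatch: each token is sent to the rank that hosts its routed expert.
send_buckets = {r: [] for r in range(num_ranks)}
for t, e in enumerate(topk_expert):
    send_buckets[e // experts_per_rank].append(t)

# "All-to-all": every rank receives the tokens routed to its local experts.
recv = {r: tokens[send_buckets[r]] for r in range(num_ranks)}

# Each rank applies its local experts (identity here), then combine restores order.
out = np.empty_like(tokens)
for r in range(num_ranks):
    out[send_buckets[r]] = recv[r]  # would be expert_fn(recv[r]) in a real MoE layer

assert np.allclose(out, tokens)
```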

Day 3: DeepGEMM

πŸš€ Day 3 of #OpenSourceWeek: DeepGEMM

Introducing DeepGEMM - an FP8 GEMM library that supports both dense and MoE GEMMs, powering V3/R1 training and inference.

⚑ Up to 1350+ FP8 TFLOPS on Hopper GPUs
βœ… No heavy dependency, as clean as a tutorial
βœ… Fully Just-In-Time compiled
βœ… Core logic at ~300 lines - yet outperforms expert-tuned kernels across most matrix sizes
βœ… Supports dense layout and two MoE layouts

πŸ”— GitHub: https://github.com/deepseek-ai/DeepGEMM
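
To make the FP8 angle concrete, here is an illustrative sketch (not DeepGEMM's API) of the basic recipe behind scaled FP8 GEMM: keep the operands in a narrow 8-bit float range with per-row scales, multiply the quantized values, and fold the scales back into the higher-precision output. The actual rounding to FP8 is omitted, so the error versus the fp32 reference stays near zero here.

```python
import numpy as np

E4M3_MAX = 448.0  # largest normal value in the FP8 E4M3 format

def quantize_fp8(x: np.ndarray):
    """Per-row scaling so each row fits the E4M3 range (simulated in fp32)."""
    scale = np.abs(x).max(axis=-1, keepdims=True) / E4M3_MAX
    scale = np.where(scale == 0, 1.0, scale)
    return x / scale, scale        # a real kernel would round to FP8 here

A = np.random.randn(4, 64).astype(np.float32)
B = np.random.randn(64, 3).astype(np.float32)

A_q, A_s = quantize_fp8(A)
B_q, B_s = quantize_fp8(B.T)       # quantize B along its reduction dimension

C = (A_q @ B_q.T) * A_s * B_s.T    # dequantize by folding scales into the output
print(np.abs(C - A @ B).max())     # small error vs. the fp32 reference
```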

Day 4: Optimized Parallelism Strategies

πŸš€ Day 4 of #OpenSourceWeek: Optimized Parallelism Strategies

βœ… DualPipe - a bidirectional pipeline parallelism algorithm for computation-communication overlap in V3/R1 training.
πŸ”— https://github.com/deepseek-ai/DualPipe

βœ… EPLB - an expert-parallel load balancer for V3/R1.
πŸ”— https://github.com/deepseek-ai/eplb

πŸ“Š Analyze computation-communication overlap in V3/R1.
πŸ”— https://github.com/deepseek-ai/profile-data
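
For a feel of what an expert-parallel load balancer has to solve, here is a plain greedy heuristic that places experts (weighted by a measured load) onto GPUs so the heaviest GPU stays as light as possible. This illustrates the problem, not EPLB's actual algorithm; the loads below are made up.

```python
import heapq

# Hypothetical sketch: greedily assign the heaviest remaining expert to the
# currently lightest GPU, balancing total routed load across expert-parallel ranks.
expert_load = [9.0, 7.5, 6.0, 5.5, 3.0, 2.5, 1.0, 0.5]  # e.g. tokens routed per expert
num_gpus = 4

heap = [(0.0, gpu, []) for gpu in range(num_gpus)]  # (current load, gpu id, experts)
heapq.heapify(heap)

for expert, load in sorted(enumerate(expert_load), key=lambda kv: -kv[1]):
    gpu_load, gpu, experts = heapq.heappop(heap)     # lightest GPU so far
    heapq.heappush(heap, (gpu_load + load, gpu, experts + [expert]))

for gpu_load, gpu, experts in sorted(heap, key=lambda x: x[1]):
    print(f"GPU {gpu}: experts {experts}, load {gpu_load}")
```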

Day 5: 3FS

πŸš€ Day 5 of #OpenSourceWeek: 3FS, Thruster for All DeepSeek Data Access

Fire-Flyer File System (3FS) - a parallel file system that utilizes the full bandwidth of modern SSDs and RDMA networks.

⚑ 6.6 TiB/s aggregate read throughput in a 180-node cluster
⚑ 3.66 TiB/min throughput on GraySort benchmark in a 25-node cluster
⚑ 40+ GiB/s peak throughput per client node for KVCache lookup
🧬 Disaggregated architecture with strong consistency semantics
βœ… Training data preprocessing, dataset loading, checkpoint saving/reloading, embedding vector search & KVCache lookups for inference in V3/R1

πŸ“₯ 3FS β†’ https://github.com/deepseek-ai/3FS
β›² Smallpond - data processing framework on 3FS β†’ https://github.com/deepseek-ai/smallpond
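
A quick back-of-envelope check on those headline numbers (illustrative arithmetic only, derived from the figures quoted above):

```python
# Per-node throughput implied by the aggregate figures in the announcement.
aggregate_read = 6.6 * 1024          # 6.6 TiB/s expressed in GiB/s
nodes = 180
print(f"~{aggregate_read / nodes:.1f} GiB/s read per node")          # ~37.5 GiB/s

graysort = 3.66 * 1024 / 60          # 3.66 TiB/min expressed in GiB/s
print(f"~{graysort / 25:.1f} GiB/s GraySort throughput per node")    # ~2.5 GiB/s
```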
