DeepSeek has released FlashMLA, an open-source MLA (Multi-head Latent Attention) decoding kernel optimized for Hopper GPUs. The kernel supports BF16 and a paged KV cache with a block size of 64, reaching up to 3000 GB/s memory bandwidth in memory-bound configurations and 580 TFLOPS in compute-bound configurations on an H800 SXM5 GPU with CUDA 12.6.
FlashMLA is designed to make inference more efficient by handling variable-length sequences natively, reducing the memory and compute overhead of attention during decoding.
https://twitter.com/deepseek_ai/status/1893836827574030466
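To illustrate why a paged KV cache helps with variable-length batches, the sketch below (plain Python, not DeepSeek's code; all names are illustrative) compares the token slots a padded contiguous cache must allocate against a paged cache that allocates 64-token blocks per sequence:

```python
from math import ceil

BLOCK_SIZE = 64  # FlashMLA's paged KV cache block size

def padded_cache_tokens(seq_lens):
    """Contiguous cache: every sequence is padded to the batch maximum."""
    return len(seq_lens) * max(seq_lens)

def paged_cache_tokens(seq_lens, block_size=BLOCK_SIZE):
    """Paged cache: each sequence allocates only the 64-token blocks it needs."""
    return sum(ceil(n / block_size) * block_size for n in seq_lens)

# A batch of variable-length sequences, as seen during decoding.
seq_lens = [100, 900, 350, 4096]
print(padded_cache_tokens(seq_lens))  # 4 * 4096 = 16384 token slots
print(paged_cache_tokens(seq_lens))   # 128 + 960 + 384 + 4096 = 5568 token slots
```

With one long sequence in the batch, padding wastes nearly two-thirds of the cache here, which is the waste that block-granular paging avoids.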
- DeepSeek’s FlashMLA targets NVIDIA’s Hopper architecture, whose Transformer Engine uses 8-bit floating-point precision to boost transformer performance by up to 6x over the previous generation, according to NVIDIA’s Hopper announcement for accelerated data-center computing.
- The optimizations take advantage of Hopper Tensor Cores, which support mixed FP8 and FP16 precision to accelerate transformer workloads; the released FlashMLA kernel itself runs in BF16, with its gains focused on variable-length sequence decoding.
- FlashMLA runs with CUDA 12.6 on Hopper GPUs such as the H800 SXM5, achieving up to 3000 GB/s memory bandwidth and 580 TFLOPS of compute; Hopper chips are built on TSMC’s 4N process with over 80 billion transistors.
- The kernel incorporates a paged KV cache with a block size of 64, which reduces memory fragmentation and allocation overhead and maps well to Hopper’s parallel execution model.
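The block-table indexing behind such a paged cache can be sketched in a few lines. This is a simplified illustration, not FlashMLA's CUDA implementation; it assumes a per-sequence block table mapping logical 64-token blocks to physical block ids:

```python
BLOCK_SIZE = 64  # matches FlashMLA's paged KV cache block size

def physical_slot(block_table, pos, block_size=BLOCK_SIZE):
    """Translate a logical token position into (physical_block, offset).

    block_table maps logical block index -> physical block id, so the
    blocks of one sequence need not be contiguous in cache memory.
    """
    logical_block = pos // block_size
    offset = pos % block_size
    return block_table[logical_block], offset

# Example: a sequence whose three blocks are scattered in the cache pool.
block_table = [7, 2, 11]  # logical blocks 0, 1, 2 live in physical 7, 2, 11
print(physical_slot(block_table, 70))  # token 70 -> physical block 2, offset 6
```

Because the lookup is a divide and a modulo per token, the indirection adds negligible arithmetic while letting the runtime grow each sequence's cache one 64-token block at a time.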

