FlashAttention-3, the newest release in the FlashAttention series, targets the central bottleneck of the Transformer architecture: the attention layer, whose cost grows quadratically with sequence length. That bottleneck is especially consequential for large language models (LLMs) and for applications that require long-context processing.
Historically, the FlashAttention series (FlashAttention and FlashAttention-2) has reshaped how attention runs on GPUs by minimizing reads and writes between high-bandwidth memory (HBM) and on-chip SRAM. The approach has been widely adopted by deep-learning libraries to speed up Transformer training and inference, and it has contributed to a dramatic growth in LLM context lengths: from 2-4K tokens in models like GPT-3, to 128K tokens in GPT-4, and up to 1 million tokens in models like Llama 3.
Despite these advances, FlashAttention-2 reaches only about 35% of the theoretical maximum FLOPS on the H100 GPU, leaving a sizable gap between achievable and realized performance. FlashAttention-3 aims to close this gap by exploiting new capabilities of modern GPUs. It introduces three techniques to speed up attention on Hopper GPUs: exploiting the asynchrony of the Tensor Cores and TMA to overlap computation and data movement, interleaving block-wise matrix multiplication and softmax operations, and using block quantization together with incoherent processing to leverage hardware support for FP8 low-precision computation.
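To make the block-wise structure behind these techniques concrete, here is a minimal, unfused PyTorch sketch of tiled attention with an online softmax. It is illustrative only (the function name and tiling are not the library's API): the real FlashAttention-3 kernel fuses these loops into a single CUDA kernel and pipelines the two matrix multiplications with the softmax.

```python
import torch

def blockwise_attention(q, k, v, block_size=128):
    """Single-head attention computed one K/V block at a time with an online softmax."""
    seqlen, head_dim = q.shape
    scale = head_dim ** -0.5
    out = torch.zeros_like(q)
    row_max = torch.full((seqlen, 1), float("-inf"))
    row_sum = torch.zeros(seqlen, 1)

    for start in range(0, seqlen, block_size):
        kb = k[start:start + block_size]               # one K tile
        vb = v[start:start + block_size]               # one V tile
        scores = (q @ kb.T) * scale                    # block-wise matmul: Q K^T

        new_max = torch.maximum(row_max, scores.max(dim=-1, keepdim=True).values)
        correction = torch.exp(row_max - new_max)      # rescale previously accumulated results
        p = torch.exp(scores - new_max)                # block-wise softmax numerator

        row_sum = row_sum * correction + p.sum(dim=-1, keepdim=True)
        out = out * correction + p @ vb                # second matmul: P V, accumulated
        row_max = new_max

    return out / row_sum                               # final softmax normalization

# Sanity check against naive attention.
q, k, v = (torch.randn(1024, 64) for _ in range(3))
ref = torch.softmax((q @ k.T) * 64 ** -0.5, dim=-1) @ v
assert torch.allclose(blockwise_attention(q, k, v), ref, atol=1e-4)
```

The key point is that each K/V block is visited once, partial results are rescaled as the running row maximum changes, and nothing the size of the full attention matrix is ever materialized.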
A key ingredient of FlashAttention-3 is exploiting the asynchrony of the Tensor Cores and the Tensor Memory Accelerator (TMA). Through warp specialization, producer warps move data with the TMA while consumer warps run the matrix multiplications and softmax, so memory transfers overlap with computation instead of serializing with it. This keeps the GPU's compute units busy rather than idle while waiting on memory.
FlashAttention-3 also supports low-precision FP8 computation, which roughly doubles Tensor Core throughput relative to FP16. On its own, lower precision would increase quantization error; FlashAttention-3 counters this with incoherent processing, multiplying the query and key by a random orthogonal matrix to spread out outlier values before quantization, which substantially reduces that error. This combination of speed and controlled accuracy loss makes FlashAttention-3 well suited to high-performance LLM workloads.
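As a rough intuition for why incoherent processing helps, the toy sketch below rotates Q and K by a shared random orthogonal matrix before quantizing. It uses simple per-tensor 8-bit fake quantization as a stand-in for FP8, so it only illustrates the principle; the actual FlashAttention-3 FP8 path (with block quantization) lives inside the CUDA kernel and differs in detail.

```python
import torch

def quantize_dequantize(x, bits=8):
    # Per-tensor symmetric fake quantization: outliers inflate max|x| and coarsen the grid.
    scale = x.abs().max() / (2 ** (bits - 1) - 1)
    return torch.round(x / scale) * scale

torch.manual_seed(0)
q = torch.randn(1024, 64)
k = torch.randn(1024, 64)
q[:, 0] *= 50.0  # inject an outlier feature, as often seen in LLM activations

exact = q @ k.T

# Direct quantization: the outlier column forces a large scale, hurting all other values.
err_plain = (quantize_dequantize(q) @ quantize_dequantize(k).T - exact).abs().mean()

# Incoherent processing: rotate Q and K by the same random orthogonal matrix M.
m, _ = torch.linalg.qr(torch.randn(64, 64))
err_rot = (quantize_dequantize(q @ m) @ quantize_dequantize(k @ m).T - exact).abs().mean()

print(f"mean |error| without rotation: {err_plain:.4f}")
print(f"mean |error| with rotation:    {err_rot:.4f}")  # typically noticeably smaller
```

Because M is orthogonal, (QM)(KM)^T equals QK^T exactly, so the rotation costs nothing in correctness while spreading outlier energy across dimensions and shrinking the quantization step.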
In terms of speed, FlashAttention-3 is 1.5 to 2 times faster than FlashAttention-2 in FP16, peaking at around 740 TFLOPS, roughly 75% of the H100 GPU's theoretical maximum. With FP8, FlashAttention-3 approaches 1.2 PFLOPS. Accuracy holds up as well: the FP8 version shows 2.6 times lower numerical error than a baseline FP8 attention implementation.
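As a quick sanity check on the utilization figure, the arithmetic below divides the quoted throughput by assumed H100 SXM dense Tensor Core peaks; the peak numbers are assumptions, not taken from the FlashAttention-3 report.

```python
# Assumed H100 SXM dense Tensor Core peaks (not from the FlashAttention-3 report).
H100_FP16_PEAK_TFLOPS = 989.0
H100_FP8_PEAK_TFLOPS = 1979.0

fa3_fp16_tflops = 740.0   # FP16 figure quoted above
fa3_fp8_tflops = 1200.0   # "close to 1.2 PFLOPS" quoted above

print(f"FP16 utilization: {fa3_fp16_tflops / H100_FP16_PEAK_TFLOPS:.0%}")  # ~75%
print(f"FP8 utilization:  {fa3_fp8_tflops / H100_FP8_PEAK_TFLOPS:.0%}")    # ~61%
```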
FlashAttention-3's improvements build on NVIDIA's CUTLASS library, whose Hopper-specific abstractions give the kernel access to features such as TMA and the new warpgroup matrix-multiply (WGMMA) instructions. This rewrite by Dao AI Lab delivers substantial efficiency gains, which translate into practical model improvements such as longer context lengths and faster inference.
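For readers who want to try the kernels, the snippet below shows a minimal call through the flash-attn package's flash_attn_func interface. This is the general FlashAttention interface; FlashAttention-3's Hopper-specific build is distributed separately and its exact import path may differ, so treat this as an illustrative sketch that assumes a CUDA GPU and the flash-attn package installed.

```python
import torch
from flash_attn import flash_attn_func

batch, seqlen, nheads, headdim = 2, 4096, 16, 64
q, k, v = (
    torch.randn(batch, seqlen, nheads, headdim, dtype=torch.bfloat16, device="cuda")
    for _ in range(3)
)

# Fused attention: inputs are (batch, seqlen, nheads, headdim); causal masking enabled.
out = flash_attn_func(q, k, v, causal=True)
print(out.shape)  # torch.Size([2, 4096, 16, 64])
```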
In conclusion, FlashAttention-3 marks a notable step in the design and implementation of attention mechanisms for large language models. By aligning algorithmic innovations with hardware capabilities, Dao AI Lab has shown that targeted, hardware-aware optimization can yield large performance gains. Such advances will be important for pushing the limits of LLMs and their applications across domains.