
LASP: A Streamlined Machine Learning Technique Designed for Linear Attention-Based Language Models

Researchers from the Shanghai AI Laboratory and TapTap have developed a Linear Attention Sequence Parallel (LASP) technique that optimizes sequence parallelism for linear transformers, side-stepping the limitations imposed by the memory capacity of a single GPU.

Large language models, due to their significant size and long sequences, can place a considerable strain on graphics processing units (GPUs). To help manage this, Sequence Parallelism (SP) techniques are applied, which partition a long sequence into several sub-sequences and train them on multiple GPUs in parallel. Yet, present-day SP methods are not designed around the characteristics of linear attention, leading to inefficient parallelism and poor usability.
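To make the partitioning concrete, here is a minimal sketch (our own illustration, not the authors' code) of how sequence parallelism splits one long sequence along the token dimension so that each GPU holds only its own contiguous sub-sequence:

```python
import torch

def split_sequence(x: torch.Tensor, world_size: int, rank: int) -> torch.Tensor:
    """x: (batch, seq_len, hidden). Each rank keeps only its own contiguous sub-sequence."""
    chunks = torch.chunk(x, world_size, dim=1)  # divide seq_len into world_size pieces
    return chunks[rank]

x = torch.randn(2, 8192, 1024)                       # (batch, seq_len, hidden)
local_x = split_sequence(x, world_size=8, rank=3)    # -> (2, 1024, 1024) on GPU 3
```

Each GPU then only needs memory for its own chunk's activations, which is what lets the total sequence length scale with the number of devices.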

The new approach, LASP, employs point-to-point (P2P) communication for the efficient exchange of states among GPUs within and across nodes. Crucially, LASP does not depend on attention head partitioning, making it applicable to multi-head, multi-query, and grouped-query attention alike. LASP optimizes sequence parallelism on linear transformers with a tiling approach that partitions input sequences into sub-sequence chunks distributed across GPUs.
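A hedged sketch of the P2P state hand-off idea follows; the function name, shapes, and serialization order are our assumptions for illustration, not the official LASP implementation. Each rank receives the accumulated KV state covering all earlier tokens from its predecessor, folds in its own chunk's contribution, and forwards the result to the next rank:

```python
import torch
import torch.distributed as dist  # assumes the process group is already initialized

def exchange_kv_state(local_kv: torch.Tensor) -> torch.Tensor:
    """local_kv: (head_dim, head_dim) KV state computed from this rank's chunk."""
    rank, world_size = dist.get_rank(), dist.get_world_size()
    prev_state = torch.zeros_like(local_kv)
    if rank > 0:
        dist.recv(prev_state, src=rank - 1)      # state summarizing all earlier tokens
    updated = prev_state + local_kv              # fold in this chunk's contribution
    if rank < world_size - 1:
        dist.send(updated, dst=rank + 1)         # pass it on to the next sub-sequence
    return prev_state                            # what this rank needs for inter-chunk attention
```

Because only a fixed-size state tensor is sent between neighboring ranks, the message size does not grow with the number of tokens in each chunk.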

This technique divides the attention computation into intra-chunk and inter-chunk parts, exploiting the so-called 'right-product' property of linear attention. Intra-chunk computation uses conventional attention, while inter-chunk computation applies the kernel trick against the cached state. LASP also includes data distribution, forward pass, and backward pass mechanisms to boost parallel processing efficiency.
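The toy, unnormalized single-head sketch below (our simplification; the real kernels are fused and run on the GPU) shows the split: the intra-chunk part is ordinary masked attention over the local chunk, while the inter-chunk part multiplies the queries directly against a fixed-size (d × d) KV state accumulated from all previous chunks:

```python
import torch

def lasp_chunk_forward(q, k, v, prev_state):
    """q, k, v: (chunk_len, d); prev_state: (d, d) = sum of k_i v_i^T over earlier chunks."""
    chunk_len = q.shape[0]
    # Intra-chunk: conventional "left-product" attention with a causal mask.
    scores = q @ k.transpose(-1, -2)                              # (chunk_len, chunk_len)
    mask = torch.tril(torch.ones(chunk_len, chunk_len, dtype=torch.bool))
    intra = scores.masked_fill(~mask, 0.0) @ v                    # (chunk_len, d)
    # Inter-chunk: "right-product" kernel trick, queries attend to the cached state.
    inter = q @ prev_state                                        # (chunk_len, d)
    # Update the running KV state for the next chunk.
    new_state = prev_state + k.transpose(-1, -2) @ v              # (d, d)
    return intra + inter, new_state
```

The key point is that the inter-chunk term never materializes an attention matrix over earlier tokens; all history is compressed into the (d × d) state, which is exactly what makes the cross-GPU communication cheap.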

When tested, LASP showed a significant throughput improvement for linear attention thanks to its efficient communication design. Notably, it outperformed both DeepSpeed Ulysses and Megatron in throughput by 38% and 136%, respectively, at a 256K sequence length on a 1B-parameter model.

Moreover, LASP – equipped with system optimizations such as kernel fusion and KV state caching – can support longer sequence lengths on the same cluster, reaching up to 2048K for the 1B model and 512K for the 7B model. As a byproduct of its efficient design, LASP is also compatible with all batch-level Distributed Data Parallel (DDP) methods, such as PyTorch/Legacy DDP, FSDP, and ZeRO-series optimizers.
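The compatibility follows from the fact that DDP partitions work along the batch dimension while LASP partitions along the sequence dimension. A hedged illustration (the group handling here is an assumption, not LASP's published API) of composing the two with stock PyTorch:

```python
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

def build_model(linear_attention_model: torch.nn.Module,
                data_parallel_group: dist.ProcessGroup) -> torch.nn.Module:
    # Gradients are all-reduced only inside the data-parallel group; the
    # sequence-parallel dimension is handled separately by LASP's P2P communication.
    return DDP(linear_attention_model, process_group=data_parallel_group)
```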

This research marks a breakthrough for SP strategies targeting linear attention, enabling such models to scale to long sequences and alleviating the single-GPU memory constraint. Furthermore, LASP's communication overhead is independent of sequence length: the linear attention intermediate states exchanged between GPUs have a fixed size regardless of how long each sub-sequence is.

In conclusion, LASP presents a potent solution to the current limitations of SP methods applied to linear transformers. It leverages the properties of linear attention to boost parallelism efficiency and overall usability. Its use of P2P communication, kernel fusion, and KV state caching reduces communication traffic while improving GPU cluster utilization. Its compatibility with batch-level DDP methods makes it highly practical for large-scale distributed training. Experiments underline LASP's advantages in scalability, speed, memory usage, and convergence performance compared with traditional SP methods.
