The rapid development of Transformer models in natural language processing (NLP) has brought significant challenges, particularly the memory required to train these large-scale models. A recent paper addresses this problem with a methodology called MINI-SEQUENCE TRANSFORMER (MST), which optimizes memory usage during long-sequence training without compromising performance.
Techniques such as multi-query attention and grouped-query attention have successfully reduced memory usage during inference by shrinking the key-value cache. However, architectural changes in models like Llama3, such as a much larger vocabulary and additional layers, continue to exacerbate memory pressure during training.
MST, proposed by researchers from Caltech and CMU, partitions the input sequence and processes it iteratively as smaller mini-sequences, greatly reducing intermediate activation memory. It also integrates activation recomputation, a technique in which the activations of selected layers are recalculated during the backward pass rather than stored. Together, these save memory in both the forward and backward passes, and MST requires only minimal code changes to existing training frameworks.
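To make the idea concrete, here is a minimal sketch, assuming a PyTorch-style training stack, of how a feed-forward block could be processed as mini-sequences with activation recomputation. The class name, chunk count, and layer sizes are illustrative assumptions, not the paper's implementation.

```python
# Minimal sketch (not the authors' code): process an MLP block in mini-sequences,
# recomputing each chunk's activations in the backward pass via checkpointing.
import torch
import torch.nn as nn
from torch.utils.checkpoint import checkpoint

class MiniSeqMLP(nn.Module):
    def __init__(self, hidden=1024, intermediate=4096, num_mini_seq=4):
        super().__init__()
        self.up = nn.Linear(hidden, intermediate, bias=False)
        self.down = nn.Linear(intermediate, hidden, bias=False)
        self.act = nn.SiLU()
        self.num_mini_seq = num_mini_seq

    def _block(self, x):
        # The large [*, intermediate] activation exists only for one chunk at a time.
        return self.down(self.act(self.up(x)))

    def forward(self, x):  # x: [batch, seq, hidden]
        outs = []
        for chunk in x.chunk(self.num_mini_seq, dim=1):
            # Recompute this chunk's intermediate activations during backward
            # instead of keeping them for the whole sequence.
            outs.append(checkpoint(self._block, chunk, use_reentrant=False))
        return torch.cat(outs, dim=1)

mlp = MiniSeqMLP()
x = torch.randn(2, 8192, 1024, requires_grad=True)
y = mlp(x)            # peak intermediate memory is roughly 1/num_mini_seq of a naive pass
y.sum().backward()
```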
MST also extends to distributed settings by combining with DeepSpeed-Ulysses, which partitions the input tensor of each Transformer layer along the sequence dimension so computation can proceed in parallel across multiple GPUs. This further reduces activation memory requirements while remaining compatible with sequence-parallelism techniques such as Megatron-LM's and Ring Attention.
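For intuition, the following is a rough, hypothetical sketch of the Ulysses-style all-to-all re-sharding that such sequence parallelism relies on: activations arrive sharded along the sequence dimension, are re-sharded so each GPU holds the full sequence for a subset of attention heads, and are re-sharded back after attention. This is neither DeepSpeed's nor the paper's code; it assumes GPUs with the NCCL backend and a torchrun launch, and all names and shapes are illustrative.

```python
# Hypothetical sketch of Ulysses-style sequence-parallel attention (not DeepSpeed's code).
# Launch with: torchrun --nproc_per_node=<num_gpus> ulysses_sketch.py  (requires GPUs + NCCL)
import torch
import torch.distributed as dist
import torch.nn.functional as F

def _all_to_all(x, split_dim, cat_dim):
    """Exchange equal chunks of x across ranks: split along split_dim, concat along cat_dim."""
    world = dist.get_world_size()
    inputs = [c.contiguous() for c in x.chunk(world, dim=split_dim)]
    outputs = [torch.empty_like(c) for c in inputs]
    dist.all_to_all(outputs, inputs)
    return torch.cat(outputs, dim=cat_dim)

def seq_parallel_attention(q, k, v):
    """q, k, v: [batch, local_seq, heads, head_dim], with the sequence sharded across ranks."""
    # Re-shard: heads split across ranks, full sequence gathered on each rank.
    q, k, v = (_all_to_all(t, split_dim=2, cat_dim=1) for t in (q, k, v))
    out = F.scaled_dot_product_attention(
        q.transpose(1, 2), k.transpose(1, 2), v.transpose(1, 2)
    ).transpose(1, 2)                      # [batch, seq, heads/world, head_dim]
    # Re-shard back: sequence split across ranks, all heads gathered on each rank.
    return _all_to_all(out, split_dim=1, cat_dim=2)

if __name__ == "__main__":
    dist.init_process_group("nccl")
    rank = dist.get_rank()
    torch.cuda.set_device(rank % torch.cuda.device_count())
    device = torch.cuda.current_device()
    batch, seq, heads, head_dim = 1, 8192, 8, 64
    local_seq = seq // dist.get_world_size()
    q, k, v = (torch.randn(batch, local_seq, heads, head_dim, device=device) for _ in range(3))
    out = seq_parallel_attention(q, k, v)   # [batch, local_seq, heads, head_dim]
    print(f"rank {rank}: output shape {tuple(out.shape)}")
    dist.destroy_process_group()
```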
The authors evaluated MST on Llama3-8B and Llama2 models and reported substantial increases in the maximum trainable sequence length. Throughout, MST maintained training throughput comparable to standard long-sequence training, so the memory savings come without a performance penalty.
MST's scalability in distributed settings was also highlighted: the trainable sequence length scales linearly with the number of GPUs. The memory savings were most pronounced for the LM-Head component, whose vocabulary-sized logits dominate activation memory for long sequences; MST sharply reduced this usage with minimal effect on execution time.
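As an illustration of why the LM-Head benefits so much, here is a small, hypothetical PyTorch sketch that computes the language-modeling loss over mini-sequences so that only one chunk's vocabulary-sized logits are materialized at a time, with checkpointing so those logits are rebuilt rather than stored for the backward pass. Function names and sizes are illustrative, not the paper's API.

```python
# Hypothetical sketch: mini-sequence LM-Head + loss, so full [seq, vocab] logits
# are never materialized at once. Names and sizes are illustrative.
import torch
import torch.nn.functional as F
from torch.utils.checkpoint import checkpoint

def _chunk_loss(h, y, w):
    logits = h @ w.T                                   # [batch, mini_seq, vocab]
    return F.cross_entropy(logits.flatten(0, 1), y.flatten(), reduction="sum")

def chunked_lm_head_loss(hidden, labels, lm_head_weight, num_mini_seq=16):
    """hidden: [batch, seq, d_model]; labels: [batch, seq]; lm_head_weight: [vocab, d_model]."""
    total = hidden.new_zeros(())
    for h, y in zip(hidden.chunk(num_mini_seq, dim=1), labels.chunk(num_mini_seq, dim=1)):
        # Checkpointing recomputes this chunk's logits in the backward pass,
        # so only one chunk's logits are ever live at a time.
        total = total + checkpoint(_chunk_loss, h, y, lm_head_weight, use_reentrant=False)
    return total / labels.numel()

# Usage: a 128K-entry vocabulary (as in Llama3) with an 8K-token sequence.
d_model, vocab, seq = 1024, 128256, 8192
hidden = torch.randn(1, seq, d_model, requires_grad=True)
labels = torch.randint(0, vocab, (1, seq))
weight = torch.randn(vocab, d_model, requires_grad=True)
loss = chunked_lm_head_loss(hidden, labels, weight)
loss.backward()
```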
MST represents a promising solution to the memory challenges of training large-scale Transformer models, optimizing memory usage through mini-sequence processing and activation recomputation. Not only does it effectively reduce the memory footprint, it also maintains high efficiency and accuracy, potentially improving the scalability and performance of long-sequence NLP training.