Large language models (LLMs) such as GPT-3 and Llama-2, encompassing billions of parameters, have dramatically advanced our ability to understand and generate human language. However, the considerable computational resources required to train and deploy these models present a significant challenge, especially in resource-limited settings. The primary issue with deploying LLMs is their sheer size, which demands extensive computational power and memory. In practice, this forces teams to train multiple versions of the same model, each trading off efficiency against accuracy to match the resources available.
Researchers from NVIDIA and the University of Texas at Austin have introduced FLEXTRON, a flexible model architecture and post-training optimization framework that addresses this problem. The architecture uses a nested elastic structure that adjusts dynamically to specific latency and accuracy targets during inference, so a single pre-trained model can serve a range of deployment scenarios. FLEXTRON converts a pre-trained LLM into an elastic model using a sample-efficient training method and routing algorithms that select a sub-network at inference time.
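To make the routing idea concrete, here is a minimal PyTorch sketch of a learned router that maps a latency budget to one of several nested sub-network widths. The module name, the scalar budget encoding, and the set of candidate widths are illustrative assumptions, not FLEXTRON's actual implementation.

```python
# Sketch (not the authors' code): a router that picks an elastic
# sub-network width from a latency target.
import torch
import torch.nn as nn

class LatencyRouter(nn.Module):
    """Maps a scalar latency budget to a distribution over candidate widths."""
    def __init__(self, num_widths: int, hidden: int = 32):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(1, hidden),
            nn.ReLU(),
            nn.Linear(hidden, num_widths),
        )

    def forward(self, latency_budget: torch.Tensor) -> torch.Tensor:
        # Softmax over candidate widths; at inference the argmax picks
        # one nested sub-network to execute.
        return torch.softmax(self.net(latency_budget), dim=-1)

router = LatencyRouter(num_widths=4)
budget = torch.tensor([[0.5]])          # normalized latency target
choice = router(budget).argmax(dim=-1)  # index of the selected sub-network
print(choice)
```

Because every candidate is a nested slice of the same pre-trained weights, switching widths requires no retraining; only the router decides how much of the model runs for a given budget.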
FLEXTRON also includes elastic Multi-Layer Perceptron (MLP) and elastic Multi-Head Attention (MHA) layers. Because MHA layers account for a significant share of LLM runtime and memory usage, the elastic MHA layers improve overall efficiency by selecting a subset of attention heads based on the input data. This is especially valuable in resource-scarce scenarios, where it allows more efficient use of the available memory and processing power.
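The sketch below illustrates the elastic-attention idea: an attention layer that executes only the first k of its heads, together with the matching slice of the output projection. The assumption that heads are pre-sorted by importance and the `num_active` argument are simplifications for illustration, not the paper's exact formulation.

```python
# Sketch of an elastic multi-head attention layer that runs only a
# prefix of its heads. Assumes heads are ordered so any prefix is a
# valid sub-network.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ElasticMHA(nn.Module):
    def __init__(self, d_model: int, num_heads: int):
        super().__init__()
        assert d_model % num_heads == 0
        self.num_heads = num_heads
        self.d_head = d_model // num_heads
        self.qkv = nn.Linear(d_model, 3 * d_model)
        self.out = nn.Linear(d_model, d_model)

    def forward(self, x: torch.Tensor, num_active: int) -> torch.Tensor:
        B, T, _ = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)

        # Reshape to (B, heads, T, d_head), keep the first num_active heads.
        def split(t):
            return t.view(B, T, self.num_heads, self.d_head) \
                    .transpose(1, 2)[:, :num_active]

        q, k, v = split(q), split(k), split(v)
        attn = F.scaled_dot_product_attention(q, k, v)
        attn = attn.transpose(1, 2).reshape(B, T, num_active * self.d_head)

        # Use only the slice of the output projection matching the
        # active heads, so compute scales with num_active.
        w = self.out.weight[:, : num_active * self.d_head]
        return attn @ w.t() + self.out.bias

x = torch.randn(2, 8, 512)
mha = ElasticMHA(d_model=512, num_heads=8)
print(mha(x, num_active=4).shape)  # torch.Size([2, 8, 512])
```

Running half the heads roughly halves the attention computation while the output dimensionality, and therefore the rest of the network, is unchanged.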
Performance evaluations show that FLEXTRON outperforms multiple end-to-end trained models and other elastic networks in both efficiency and accuracy. It performs especially well on the GPT-3 and Llama-2 model families while using only 7.63% of the tokens consumed during the original pre-training, which translates into significant savings in both computational resources and time. In conclusion, FLEXTRON addresses the need for efficient model deployment across varied computational environments: its flexible, adaptable architecture optimizes resource use and performance from a single pre-trained model. The framework underscores the potential for innovation in overcoming the obstacles associated with large language models.