Neural Magic, an AI solutions provider, has announced a breakthrough in AI model compression with the introduction of a fully quantized FP8 version of Meta’s Llama 3.1 405B model. This is significant because it allows the massive model to fit on a single 8xH100 or 8xA100 system without the out-of-memory (OOM) errors that the full-precision FP16/BF16 weights would otherwise cause.
The FP8 model addresses these memory constraints and roughly doubles inference speed by using memory and compute more efficiently, eliminating the need for CPU offloading or for distributing the load across multiple nodes.
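A back-of-the-envelope calculation shows why halving the weight precision matters here. The sketch below uses illustrative figures (weights only, ignoring KV cache and activation overhead), not vendor-published measurements:

```python
# Rough weight-memory estimate: why 405B parameters overflow an 8-GPU node
# at 16-bit precision but fit once quantized to FP8 (1 byte per parameter).

PARAMS = 405e9          # Llama 3.1 405B parameter count
GPU_MEMORY_GB = 80      # per-GPU HBM on an 80 GB H100/A100
NUM_GPUS = 8

def weight_footprint_gb(bytes_per_param: float) -> float:
    """Weight storage only; KV cache and activations need extra headroom."""
    return PARAMS * bytes_per_param / 1e9

fp16_gb = weight_footprint_gb(2.0)   # ~810 GB, exceeds 8 x 80 GB = 640 GB
fp8_gb = weight_footprint_gb(1.0)    # ~405 GB, leaves room for the KV cache

print(f"FP16 weights: ~{fp16_gb:.0f} GB vs. {GPU_MEMORY_GB * NUM_GPUS} GB available")
print(f"FP8 weights:  ~{fp8_gb:.0f} GB vs. {GPU_MEMORY_GB * NUM_GPUS} GB available")
```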
Neural Magic has released two versions of the model, Meta-Llama-3.1-405B-Instruct-FP8-dynamic and Meta-Llama-3.1-405B-Instruct-FP8, which differ in whether activation scales are computed dynamically at runtime or fixed ahead of time. Both retain Meta Llama 3.1’s architecture, which is multilingual and designed for assistant-style chat, though the intended use of these releases is limited to English and to lawful applications.
Neural Magic achieved this efficiency by quantizing both weights and activations to the FP8 data type. Weight quantization is symmetric and per-channel, meaning a single linear scale per output dimension maps the original values to their FP8 representations.
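To make the per-channel scheme concrete, here is a minimal sketch of symmetric per-channel FP8 weight quantization in PyTorch. The helper name, the E4M3 format choice, and the toy matrix are assumptions for illustration; this is not Neural Magic’s actual quantization pipeline:

```python
import torch

FP8_E4M3_MAX = 448.0  # largest finite value representable in FP8 E4M3

def quantize_per_channel_fp8(weight: torch.Tensor):
    """Symmetric per-channel (per output dimension) FP8 quantization sketch.

    weight: [out_features, in_features] linear-layer weight.
    Returns FP8 weights plus one scale per output channel, so that
    dequantized ~= fp8_weight.float() * scale[:, None].
    """
    # One scale per output row, chosen so the row's largest magnitude maps to FP8 max.
    abs_max = weight.abs().amax(dim=1, keepdim=True)          # [out_features, 1]
    scale = (abs_max / FP8_E4M3_MAX).clamp(min=1e-12)
    quantized = (weight / scale).clamp(-FP8_E4M3_MAX, FP8_E4M3_MAX)
    return quantized.to(torch.float8_e4m3fn), scale.squeeze(1)

# Quantize a toy weight matrix and check the round-trip error.
w = torch.randn(4096, 4096)
w_fp8, scales = quantize_per_channel_fp8(w)
w_dequant = w_fp8.float() * scales[:, None]
print("max abs error:", (w - w_dequant).abs().max().item())
```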
The optimized model can be deployed efficiently with the vLLM backend, using Python libraries such as `vllm` and `transformers`. Its performance was assessed on several benchmarks, including MMLU, ARC-Challenge, and GSM-8K, and the results show near-perfect recovery of the original model’s accuracy.
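A minimal serving sketch with vLLM might look like the following. The Hugging Face repository ID is inferred from the model name above, and `tensor_parallel_size=8` assumes an 8xH100 or 8xA100 node:

```python
from vllm import LLM, SamplingParams

# Load the FP8 checkpoint sharded across 8 GPUs via tensor parallelism.
llm = LLM(
    model="neuralmagic/Meta-Llama-3.1-405B-Instruct-FP8-dynamic",  # assumed repo ID
    tensor_parallel_size=8,
)

sampling = SamplingParams(temperature=0.7, max_tokens=256)
outputs = llm.generate(["Summarize the benefits of FP8 quantization."], sampling)
print(outputs[0].outputs[0].text)
```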
Neural Magic’s fully quantized FP8 release of Meta’s Llama 3.1 405B reduces memory requirements and boosts inference speed, creating new possibilities for scalable and efficient AI applications. The success of this quantization, achieved with minimal loss in accuracy, points to further innovations in the field and makes powerful AI models more accessible and practical for a wider range of users.