Slow inference has held back the deployment of generative AI in real-world applications. Inference speed refers to the time a model takes to produce an output after receiving a prompt or input. Because generative AI models create text, images, and other outputs, they require complex calculations and significant processing power, which becomes costly at scale. Faster inference is therefore fundamental to practical use: it enables quicker processing, smoother user experiences, shorter turnaround times, and the capacity to handle larger workloads.
Recognizing the pressing need for faster inference, researchers at NVIDIA have developed the NVIDIA TensorRT Model Optimizer, a library of advanced model optimization techniques. The Optimizer provides comprehensive support for techniques such as post-training quantization (PTQ) and sparsity. Quantization in particular reduces a generative AI model's memory usage and increases its computational speed by converting the model's data into lower-precision formats.
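To make the idea of converting data into a lower-precision format concrete, here is a minimal sketch of symmetric 8-bit post-training quantization with simple max calibration. It is a generic illustration in plain Python, not the Model Optimizer's API or its calibration algorithms; the function names and the sample weights are hypothetical.

```python
# Illustrative symmetric INT8 post-training quantization.
# Generic sketch only: this is NOT the TensorRT Model Optimizer API.

def calibrate_scale(values, num_bits=8):
    """Max calibration: map the largest magnitude onto the largest integer level."""
    qmax = 2 ** (num_bits - 1) - 1  # 127 for 8 bits
    max_abs = max(abs(v) for v in values)
    return max_abs / qmax if max_abs > 0 else 1.0

def quantize(values, scale, num_bits=8):
    """Round each value to the nearest integer level and clamp to the signed range."""
    qmax = 2 ** (num_bits - 1) - 1
    return [max(-qmax, min(qmax, round(v / scale))) for v in values]

def dequantize(q_values, scale):
    """Map integer levels back to approximate floating-point values."""
    return [q * scale for q in q_values]

# Hypothetical weights standing in for one small slice of a trained model.
weights = [0.82, -1.27, 0.05, 0.4, -0.63]

scale = calibrate_scale(weights)
q_weights = quantize(weights, scale)      # small integers, storable in 1 byte each
restored = dequantize(q_weights, scale)   # approximation used at inference time
max_error = max(abs(w - r) for w, r in zip(weights, restored))

print("scale:", scale)
print("quantized:", q_weights)
print("max abs error:", max_error)
```

Each weight is stored as a one-byte integer plus a shared scale, and the rounding error per weight is bounded by half the scale. Production calibration methods choose the scale more carefully than this max-based rule, which is where accuracy is preserved or lost.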
PTQ reduces model complexity and speeds up inference while maintaining accuracy. The Optimizer's advanced calibration algorithms allow a Falcon 180B model, for example, to fit on a single NVIDIA H200 GPU. The NVIDIA TensorRT Model Optimizer also supports the different algorithms needed for accurate quantization, and does so without compromising 4-bit floating-point inference accuracy.
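A rough back-of-the-envelope calculation shows why precision decides whether a model of this size fits on one GPU. The sketch below counts weight storage only and ignores activations, the KV cache, and quantization metadata such as scales. The 141 GB memory capacity for the H200 is an assumption brought in for illustration, and the text above does not state which precision was used for Falcon 180B.

```python
# Weight-only memory estimate for a 180-billion-parameter model.
# Assumptions (not from the article): H200 capacity of 141 GB, 1 GB = 1e9 bytes.

NUM_PARAMS = 180e9       # approximate parameter count implied by the name "Falcon 180B"
H200_MEMORY_GB = 141     # assumed GPU memory capacity

def weight_memory_gb(num_params, bits_per_weight):
    """Memory needed to store the weights alone, in gigabytes."""
    return num_params * bits_per_weight / 8 / 1e9

# Compare 16-bit, 8-bit, and 4-bit weight formats against the assumed capacity.
fits = {}
for bits in (16, 8, 4):
    size_gb = weight_memory_gb(NUM_PARAMS, bits)
    fits[bits] = size_gb <= H200_MEMORY_GB
    print(f"{bits}-bit weights: {size_gb:.0f} GB, fits on one GPU: {fits[bits]}")
```

Under these assumptions, only the 4-bit format leaves the weights small enough for a single device, with the remaining memory available for runtime state.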
The NVIDIA TensorRT Model Optimizer has been tested rigorously on a variety of models to evaluate its effectiveness across different tasks. In image generation, the results showed that optimized models could produce images with quality on par with the FP16 baseline while improving inference speed by approximately 35 to 45 percent.
To summarize, the NVIDIA TensorRT Model Optimizer offers an efficient answer to slow inference in generative AI models. With comprehensive support for advanced optimization techniques such as post-training quantization and sparsity, it enables developers to reduce model complexity and accelerate inference without compromising accuracy. Quantization Aware Training (QAT) further improves 4-bit floating-point inference accuracy. As the MLPerf Inference v4.0 results and benchmarking data show, the Model Optimizer delivers significant performance improvements.