
Speed up NLP inference using ONNX Runtime on AWS Graviton processors

ONNX is an open-source format for machine learning models that provides interoperability across frameworks and platforms, and ONNX Runtime is the engine used to run inference and training on ONNX models. AWS Graviton3 processors are optimized for machine learning workloads and support instructions that accelerate these tasks. The ONNX Runtime 1.17.0 release takes advantage of some of these instructions, improving performance by up to 65% for natural language processing models on AWS Graviton3-based Amazon Elastic Compute Cloud (EC2) instances.

The post explains how to run ONNX Runtime inference on AWS Graviton3-based EC2 instances and how to configure it to use the optimized General Matrix Multiply (GEMM) kernels. It also demonstrates the resulting speedup through benchmarks.

The optimized GEMM kernels are part of ONNX Runtime's Microsoft Linear Algebra Subroutine (MLAS) backend, which backs the default CPU execution provider. AWS Graviton3-based EC2 instances support the bfloat16 format and Matrix Multiplication instructions, which accelerate deep learning operators. The AWS team implemented MLAS kernels that use these instructions, and the resulting optimized GEMM kernels have been incorporated into the ONNX Runtime CPU execution provider.

These optimizations ship as part of the ONNX Runtime 1.17.0 release and are activated by setting specific session options, as sketched below.
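As a minimal sketch, the bfloat16 fast math GEMM kernels can be turned on through a session configuration entry. The key name `mlas.enable_gemm_fastmath_arm64_bfloat16` is the one associated with the ONNX Runtime 1.17.0 Graviton3 optimizations; verify it against the release notes for the version you use, and treat the model path as a placeholder.

```python
import onnxruntime as ort

# Session options with the bfloat16 fast math GEMM kernels enabled.
# The config key below is the one associated with ONNX Runtime 1.17.0 on
# AWS Graviton3; confirm it against the release notes for your version.
sess_options = ort.SessionOptions()
sess_options.add_session_config_entry(
    "mlas.enable_gemm_fastmath_arm64_bfloat16", "1"
)

# Load an ONNX model (path is a placeholder) with the CPU execution
# provider, which uses the MLAS backend on Graviton3.
session = ort.InferenceSession(
    "model.onnx",
    sess_options,
    providers=["CPUExecutionProvider"],
)
```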

Comparative benchmarks of inference throughput for the unoptimized and optimized configurations, using an fp32 model and ONNX Runtime 1.17.1, showed throughput improvements of up to 65%, with similar gains in inference latency. Tests with an int8 quantized model showed improvements of up to 30% in throughput and comparable gains in latency. These benchmarks were run on an AWS Graviton3-based c7g.4xl EC2 instance.
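For illustration only, a throughput comparison of this kind could be set up roughly as follows. This is not the post's actual benchmark harness; the model path and the input names and shapes (a BERT-like model at batch 1, sequence length 128) are assumptions.

```python
import time
import numpy as np
import onnxruntime as ort

def make_session(model_path, fastmath):
    # Build a CPU session, optionally enabling the bfloat16 fast math kernels.
    opts = ort.SessionOptions()
    if fastmath:
        opts.add_session_config_entry(
            "mlas.enable_gemm_fastmath_arm64_bfloat16", "1")
    return ort.InferenceSession(model_path, opts,
                                providers=["CPUExecutionProvider"])

def throughput(session, feed, runs=100):
    # Measure inferences per second over a fixed number of runs.
    for _ in range(10):            # warm-up
        session.run(None, feed)
    start = time.time()
    for _ in range(runs):
        session.run(None, feed)
    return runs / (time.time() - start)

# Placeholder inputs for a BERT-like model: batch 1, sequence length 128.
# Real models may expect different input names (e.g. token_type_ids).
feed = {
    "input_ids": np.ones((1, 128), dtype=np.int64),
    "attention_mask": np.ones((1, 128), dtype=np.int64),
}

for fastmath in (False, True):
    sess = make_session("bert-base-cased.onnx", fastmath)
    print(f"fastmath={fastmath}: {throughput(sess, feed):.1f} inferences/sec")
```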

The post concludes with step-by-step instructions for running inference for the fp32 model in bfloat16 fast math mode and for the int8 quantized model using the ONNX Runtime benchmarking script.
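For context, an int8 model of the kind referenced above is typically produced with ONNX Runtime's dynamic quantization API. A minimal sketch follows; the file names are placeholders, and this is not necessarily the exact procedure used in the post.

```python
from onnxruntime.quantization import quantize_dynamic, QuantType

# Quantize the fp32 model's weights to int8; file names are placeholders.
quantize_dynamic(
    model_input="bert-base-cased.onnx",
    model_output="bert-base-cased-int8.onnx",
    weight_type=QuantType.QInt8,
)
```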

These speedups should encourage developers to adopt the optimizations. Users who do not observe similar performance gains on AWS Graviton are invited to open an issue on the project's GitHub page. The author, Sunita Nadampalli, is a software developer at AWS focused on improving the performance of machine learning and HPC workloads on Arm SoCs.
