PyTorch has introduced TK-GEMM, an optimized Triton FP8 GEMM (General Matrix-Matrix Multiply) kernel designed to accelerate FP8 inference for large language models (LLMs) such as Llama3. The work addresses a long-standing inefficiency of standard PyTorch eager execution, where each operation in an LLM launches its own GPU kernel and the accumulated launch overhead slows inference.
Running LLMs in FP8 precision with stock PyTorch execution is typically inefficient because every operation incurs its own GPU kernel launch. Triton kernels address this challenge by letting developers write kernels tailored to specific hardware, including Nvidia GPUs, and fuse several operations into a single kernel launch, often via torch.compile(), which significantly reduces launch overhead and boosts performance. Triton kernels can also target the specialized FP8 Tensor Cores on Nvidia GPUs, improving computational efficiency over the FP16 Tensor Core paths that PyTorch uses by default through its cuBLAS backend.
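As a rough illustration of the fusion idea (not the TK-GEMM kernel itself), the sketch below uses torch.compile to collapse a chain of elementwise operations, which would each launch a separate CUDA kernel under eager execution, into a single generated Triton kernel. The function, shapes, and dtype are illustrative assumptions.

```python
import torch

def fused_activation(x: torch.Tensor, scale: torch.Tensor) -> torch.Tensor:
    # Three elementwise ops: eager mode launches one kernel per op.
    y = torch.nn.functional.silu(x)   # activation
    y = y * scale                     # rescale
    return y + 1.0                    # bias shift

# torch.compile can fuse the chain into a single Triton kernel launch.
compiled = torch.compile(fused_activation)

if torch.cuda.is_available():
    x = torch.randn(4096, 4096, device="cuda", dtype=torch.float16)
    scale = torch.tensor(0.5, device="cuda", dtype=torch.float16)
    out = compiled(x, scale)          # one fused launch instead of three
```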
TK-GEMM uses SplitK parallelization to improve Llama3-70B performance by decomposing the work along the k-dimension and launching additional thread blocks that compute partial output sums. This finer-grained work decomposition yields substantial speedups over the base Triton GEMM implementation: experimental results show up to a 1.94 times speedup over the base Triton matmul, a 1.87 times speedup over cuBLAS FP8, and a 1.71 times speedup over cuBLAS FP16 for Llama3-70B inference problem sizes.
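To make the decomposition concrete, here is a minimal plain-PyTorch sketch of the SplitK idea: the reduction dimension K is split into slices, each slice produces a partial output, and the partials are summed at the end. In the actual TK-GEMM kernel this happens inside Triton, with the extra slices mapped to additional thread blocks; the SPLIT_K value and shapes below are arbitrary example values.

```python
import torch

def splitk_matmul(a: torch.Tensor, b: torch.Tensor, split_k: int = 4) -> torch.Tensor:
    m, k = a.shape
    k2, n = b.shape
    assert k == k2 and k % split_k == 0
    chunk = k // split_k
    partial = torch.zeros(split_k, m, n, dtype=torch.float32, device=a.device)
    for s in range(split_k):                      # each slice maps to extra thread blocks in the kernel
        a_s = a[:, s * chunk:(s + 1) * chunk]
        b_s = b[s * chunk:(s + 1) * chunk, :]
        partial[s] = a_s.float() @ b_s.float()    # partial sum over one K slice
    return partial.sum(dim=0)                     # reduce the partial outputs into the final result

a = torch.randn(128, 1024)
b = torch.randn(1024, 256)
assert torch.allclose(splitk_matmul(a, b), a.float() @ b.float(), atol=1e-3)
```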
Additionally, CUDA graphs provide further end-to-end speedup by reducing kernel launch latency. By capturing a sequence of kernels into a graph and replaying it, developers eliminate most of the per-kernel CPU launch overhead, which translates into substantial performance gains in production environments.
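Below is a minimal sketch of CUDA graph capture and replay using PyTorch's torch.cuda.CUDAGraph API, assuming a CUDA device and a static-shape workload; the matmul stands in for whatever decoding step one would capture in practice.

```python
import torch

def step(x: torch.Tensor, w: torch.Tensor) -> torch.Tensor:
    return x @ w  # stand-in for the real inference step

if torch.cuda.is_available():
    x = torch.randn(16, 8192, device="cuda", dtype=torch.float16)
    w = torch.randn(8192, 8192, device="cuda", dtype=torch.float16)

    # Warm up on a side stream before capture, as recommended by the capture API.
    s = torch.cuda.Stream()
    s.wait_stream(torch.cuda.current_stream())
    with torch.cuda.stream(s):
        for _ in range(3):
            y = step(x, w)
    torch.cuda.current_stream().wait_stream(s)

    # Record the kernels into a graph once.
    g = torch.cuda.CUDAGraph()
    with torch.cuda.graph(g):
        y = step(x, w)

    # Replay: inputs are updated in place because the graph reads the same buffers.
    x.copy_(torch.randn_like(x))
    g.replay()  # re-launches the captured kernels with minimal CPU launch overhead
```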
In conclusion, PyTorch has introduced a new approach to accelerating FP8 inference for large language models such as Llama3 using Triton kernels. The optimized TK-GEMM kernel addresses the inefficiencies of standard PyTorch execution and of cuBLAS FP8 compute, combining SplitK parallelization with CUDA graphs for end-to-end speedup. The result is a notable performance improvement for Llama3-70B inference problem sizes on Nvidia H100 GPUs and a promising advance in deep learning inference optimization.