The rapid development of Large Language Models (LLMs) has transformed multiple areas, including generative AI, natural language understanding, and natural language processing. However, hardware constraints have often limited the ability to run these models on devices such as laptops, desktops, and mobile phones. In response, the PyTorch team has developed Torchchat, a versatile framework designed to optimize LLM performance, specifically for models like Llama 3 and 3.1, across diverse computing environments. It enables local inference on a range of devices, which could make powerful AI models more widely accessible.
PyTorch 2 forms the foundation of Torchchat, providing strong performance for CUDA-based LLM execution. Torchchat extends that foundation to other environments, including mobile platforms, and to both Python and C++. The library offers a comprehensive toolkit for deploying LLMs locally, with support for export, quantization, and evaluation.
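As a rough illustration of that workflow, the sketch below drives the Torchchat CLI from Python with `subprocess`. The `download` and `generate` subcommands, the `--quantize` flag, and the `linear:int4`/`groupsize` config keys follow the torchchat README at the time of writing, but treat them as assumptions that may differ across releases rather than a definitive invocation.

```python
import json
import subprocess

# Quantization config in torchchat's JSON format. The "linear:int4"
# scheme and "groupsize" key mirror examples from the torchchat README;
# verify both against your installed version.
quant_config = {"linear:int4": {"groupsize": 256}}
with open("quant_config.json", "w") as f:
    json.dump(quant_config, f)

# Download model weights (Llama 3 requires Hugging Face access),
# then run a single quantized generation locally.
subprocess.run(["python3", "torchchat.py", "download", "llama3"], check=True)
subprocess.run(
    [
        "python3", "torchchat.py", "generate", "llama3",
        "--quantize", "quant_config.json",
        "--prompt", "Explain local LLM inference in one sentence.",
    ],
    check=True,
)
```

Writing the quantization settings to a JSON file rather than hard-coding them keeps the same config reusable across the `generate`, `export`, and `eval` subcommands.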
A distinctive feature of Torchchat is its ability to run local inference across multiple platforms. In Python, users can interact with Torchchat through a command-line interface (CLI) or through its REST API, which can also be reached from a web browser, a convenient setup for researchers and developers. For C++, Torchchat uses PyTorch’s AOTInductor backend to produce a desktop-compatible binary for efficient LLM execution on x86 platforms. For mobile devices, Torchchat uses ExecuTorch to export a ‘.pte’ binary file for on-device inference, answering the growing demand for AI on mobile platforms.
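To make the Python path concrete, here is a minimal sketch of querying a locally running Torchchat server (started with something like `python3 torchchat.py server llama3`). It assumes the server exposes an OpenAI-style `/v1/chat/completions` endpoint on port 5000, as the torchchat README suggests; the port, model name, and response schema are assumptions to check against your installed version.

```python
import json
import urllib.request

# Assumed endpoint: torchchat's server mimics the OpenAI chat-completions
# API on localhost:5000 (per the README); adjust if your version differs.
URL = "http://127.0.0.1:5000/v1/chat/completions"

payload = {
    "model": "llama3",
    "messages": [
        {"role": "user", "content": "Summarize what torchchat does."},
    ],
}

req = urllib.request.Request(
    URL,
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(req) as resp:
    body = json.load(resp)

# OpenAI-style responses carry the generated text under choices[0].message.
print(body["choices"][0]["message"]["content"])
```

Because the API follows the OpenAI schema, existing OpenAI-compatible client code can typically be pointed at the local server with only the base URL changed.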
Torchchat has performed well across platforms. The PyTorch team has published benchmarks demonstrating its flexibility and efficiency by running Llama 3 on multiple systems, including an Apple MacBook Pro M1 Max with 64GB of RAM, where Llama 3 8B Instruct ran with strong throughput, and a Linux x86 machine pairing an Intel Xeon Platinum 8339HC CPU with an A100 GPU, which also delivered strong results. On mobile, 4-bit GPTQ quantization via ExecuTorch sustained more than 8 tokens per second (T/s) on the Samsung Galaxy S23 and iPhone.
In summary, Torchchat provides a flexible and efficient way to run powerful AI models on a variety of devices, marking significant progress in local LLM inference. It enables developers and researchers to install and optimize LLMs more easily, broadening the scope for AI work from desktop applications to mobile deployments. Credit for the work goes to the PyTorch team behind the project.