Large language models (LLMs) used for natural language processing (NLP) tasks have grown significantly in size. This increase in size dramatically improves their performance, with larger models scoring better on tasks such as reading comprehension. However, larger models also require more computation and are more costly to deploy.
Even so, the role of larger models in machine learning (ML)-based product development cannot be overstated. Because building these products often involves experimenting with newer and larger models, there is a pressing need for efficient and cost-effective ways to run them.
One such solution is ‘speculative sampling’, a technique designed to make large language model inference more computationally efficient. It improves inference throughput and reduces the time per output token (TPOT) by using a smaller, faster ‘draft’ model to propose multiple tokens, which are then verified by the larger, slower ‘target’ model.
The speculative process works over an adjustable window: the draft model proposes the next few tokens, and the target model verifies them while contributing one guaranteed correct token of its own. When the draft’s tokens are accepted, several tokens are produced per expensive target-model pass and decoding speeds up; when they are rejected, the target model’s token is used instead, so accuracy is preserved.
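The core of the technique can be summarized in a short loop. The sketch below is a minimal, framework-agnostic illustration in Python, assuming `draft` and `target` are callables that return a next-token probability distribution for a given token sequence; it is not the AWS Neuron implementation, and in practice the target scores all proposed positions in a single batched forward pass.

```python
import torch

def speculative_step(prefix, draft, target, k):
    """Propose k tokens with the draft model, then verify them with the target."""
    # 1. Draft model proposes k tokens autoregressively (cheap).
    proposed, draft_probs = [], []
    seq = list(prefix)
    for _ in range(k):
        p = draft(seq)                          # distribution over the next token
        tok = torch.multinomial(p, 1).item()
        proposed.append(tok)
        draft_probs.append(p)
        seq.append(tok)

    # 2. Target model scores the prefix plus each proposal (a real
    #    implementation batches these k + 1 positions into one forward pass).
    target_probs = [target(list(prefix) + proposed[:i]) for i in range(k + 1)]

    # 3. Accept each proposed token with probability min(1, q(x) / p(x)),
    #    where q is the target's probability and p is the draft's.
    accepted = []
    for i, tok in enumerate(proposed):
        q, p = target_probs[i][tok], draft_probs[i][tok]
        if torch.rand(()) < torch.clamp(q / p, max=1.0):
            accepted.append(tok)
        else:
            # Rejected: resample from the residual distribution max(q - p, 0),
            # then stop; the remaining draft tokens are discarded.
            residual = torch.clamp(target_probs[i] - draft_probs[i], min=0.0)
            accepted.append(torch.multinomial(residual / residual.sum(), 1).item())
            return accepted

    # 4. All k proposals were accepted: the target contributes one extra token.
    accepted.append(torch.multinomial(target_probs[k], 1).item())
    return accepted
```

The speedup therefore depends on how often the draft’s proposals are accepted: the higher the acceptance rate, the more tokens are emitted per expensive target-model pass.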
The effectiveness of this approach is demonstrated using a Llama-2-70B target paired with a Llama-2-7B draft on Amazon EC2 Inf2 instances and Trainium-powered EC2 Trn1 instances. Given the size of these models, a technique called tensor parallelism is used to distribute the model’s weights across multiple NeuronCores, which aggregates memory bandwidth and improves throughput.
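Tensor parallelism itself is conceptually simple. The toy PyTorch sketch below illustrates the idea with a single column-sharded linear layer; the layer shapes and tp_degree are illustrative, and the real sharding on Inf2/Trn1 is performed by the Neuron compiler and runtime when the model is compiled for a given number of NeuronCores.

```python
import torch

def column_parallel_linear(x, weight, tp_degree):
    """Compute x @ weight with the weight matrix sharded column-wise across devices."""
    shards = torch.chunk(weight, tp_degree, dim=1)       # one shard per core
    partial_outputs = [x @ shard for shard in shards]    # each core computes its slice
    return torch.cat(partial_outputs, dim=-1)            # gather the partial results

x = torch.randn(1, 4096)             # one token's hidden state
weight = torch.randn(4096, 11008)    # e.g. an MLP projection in Llama-2-7B
out = column_parallel_linear(x, weight, tp_degree=8)
assert torch.allclose(out, x @ weight, atol=1e-2)        # matches the unsharded layer
```

Sharding each layer this way means every NeuronCore holds only a fraction of the weights and streams them from its own memory, which is why tensor parallelism helps with large, memory-bandwidth-bound models.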
Further, speculative sampling allows for some customization: developers can create custom token acceptors if they want more deterministic responses. The technique has been integrated into LLM inference on AWS Inferentia and Trainium, where both the draft and target models need to be loaded with speculative sampling functionality enabled, as sketched below.
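The sketch below shows roughly how a draft/target pair might be loaded with speculative sampling enabled using the transformers-neuronx library. The model paths, tp_degree, window size k, sequence length, and the exact import locations and method names (LlamaForSampling, enable_speculative_decoder, SpeculativeGenerator) are assumptions to check against the Neuron SDK documentation for your release.

```python
# Hedged sketch only: names, paths, and parameters are assumptions, not a
# verified recipe; consult the AWS Neuron SDK docs for your release.
from transformers import AutoTokenizer
from transformers_neuronx import LlamaForSampling
from transformers_neuronx.speculation import SpeculativeGenerator

k = 4  # speculation window: tokens proposed by the draft per target verification

# Smaller, faster draft model (Llama-2-7B), sharded across 32 NeuronCores.
draft = LlamaForSampling.from_pretrained("Llama-2-7b", batch_size=1,
                                         tp_degree=32, amp="f32")
draft.to_neuron()

# Larger, slower target model (Llama-2-70B), compiled with a k-token
# verification pass so it can check the draft's proposals in one shot.
target = LlamaForSampling.from_pretrained("Llama-2-70b", batch_size=1,
                                          tp_degree=32, amp="f32")
target.enable_speculative_decoder(k)
target.to_neuron()

# The generator runs the propose/verify loop; a custom token acceptor (for
# example, one that only accepts exact greedy matches for more deterministic
# output) could be supplied in place of the default acceptance rule.
generator = SpeculativeGenerator(draft, target, k)

tokenizer = AutoTokenizer.from_pretrained("Llama-2-7b")
input_ids = tokenizer("Speculative sampling lets a 70B model",
                      return_tensors="pt").input_ids
output_ids = generator.sample(input_ids=input_ids, sequence_length=256)
print(tokenizer.decode(output_ids[0]))
```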
Speculative sampling not only allows the use of larger LLMs for better accuracy but also retains much of the speed and responsiveness of smaller LLMs, loosening the trade-off between model size, accuracy, speed, and cost. It opens opportunities for developers to incorporate LLMs more deeply into their applications without drastically compromising hardware utilization.