Transformer-based generative Large Language Models (LLMs) have shown significant strength across a wide range of Natural Language Processing (NLP) tasks. Among those benefiting are application developers, who interact with LLMs through APIs offered by AI companies such as Google, OpenAI, and Baidu on their language-model-as-a-service (LMaaS) platforms.
In the LMaaS scenario, developers send user input messages, together with application-specific instructions, to the LLM service. LMaaS providers aim to improve quality of service and serve more customers by reducing response times and increasing throughput.
Existing serving systems such as TensorFlow Serving and Triton Inference Server, however, handle queries in a first-come, first-served (FCFS) manner with a fixed batch size. To avoid out-of-memory (OOM) errors, the batch size is kept conservatively small, which leaves much of the GPUs' capacity for parallel computation unused.
One proposed remedy for these limitations is continuous batching, which dynamically removes completed requests from a batch and admits new ones mid-generation. Other strategies, such as model quantization and pruning, reduce memory consumption, though they may compromise the quality of the generated output.
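To make the idea of continuous batching concrete, here is a minimal sketch of the scheduling loop. The `engine.step()` and `engine.is_finished()` helpers are hypothetical stand-ins for a real inference backend and are not part of any system named in this article.

```python
from collections import deque

class ContinuousBatcher:
    """Illustrative continuous batching loop (not a production implementation)."""

    def __init__(self, engine, max_batch_size):
        self.engine = engine
        self.max_batch_size = max_batch_size
        self.queue = deque()   # requests waiting to be admitted
        self.active = []       # requests currently in the running batch

    def submit(self, request):
        self.queue.append(request)

    def run_step(self):
        # Admit new requests whenever a slot frees up, instead of waiting
        # for the whole batch to finish as a fixed-size FCFS batcher would.
        while self.queue and len(self.active) < self.max_batch_size:
            self.active.append(self.queue.popleft())
        # Generate one token for every active request.
        self.engine.step(self.active)
        # Drop completed requests immediately so their slots can be reused.
        self.active = [r for r in self.active if not self.engine.is_finished(r)]
```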
Against this backdrop, a team of AI researchers from China has proposed a system named Magnus. Magnus uses application-level and user-level semantic information, together with the length of the user's input, to predict the generation length of each request. The system consists of a batch scheduler, an adaptive batcher, a serving time estimator, and a request generation length predictor. User input features and application- and user-level semantic features are fed to a random forest regressor to estimate generation lengths.
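The sketch below illustrates a generation-length predictor of this kind. The exact feature set is only described at a high level in the article, so the feature layout here (an input embedding, application- and user-level feature vectors, and the input length) and all numeric values are assumptions for illustration.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

def build_features(input_embedding, app_features, user_features, input_length):
    # Concatenate semantic features with the raw input length (assumed layout).
    return np.concatenate([input_embedding, app_features, user_features, [input_length]])

# Train on historical requests whose generation lengths are known.
X_train = np.random.rand(1000, 32)           # placeholder feature matrix
y_train = np.random.randint(1, 512, 1000)    # placeholder generation lengths (tokens)

length_predictor = RandomForestRegressor(n_estimators=100)
length_predictor.fit(X_train, y_train)

# Predict the expected number of generated tokens for a new request.
x_new = build_features(
    input_embedding=np.random.rand(16),
    app_features=np.random.rand(8),
    user_features=np.random.rand(7),
    input_length=42,
)
predicted_length = length_predictor.predict([x_new])[0]
```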
The goal is efficient batching: requests with similar predicted lengths are grouped together and an appropriate batch size is chosen for each group, reducing the wasted computation that arises when short requests are batched with much longer ones. The batch scheduler employs a highest response ratio next (HRRN) policy to shorten queueing delays and lower response times, while a KNN-regression-based serving time estimator predicts batch serving times to support scheduling and improve quality of service.
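The following sketch shows how an HRRN policy can be combined with a KNN serving-time estimator. The features used for the estimator (batch size and maximum predicted generation length) and the batch representation are assumptions; the paper may use different inputs.

```python
import time
import numpy as np
from sklearn.neighbors import KNeighborsRegressor

# Fit the serving-time estimator on records of previously executed batches.
serving_time_estimator = KNeighborsRegressor(n_neighbors=5)
X_hist = np.random.rand(500, 2)        # [batch_size, max_predicted_length] (placeholder)
y_hist = np.random.rand(500) * 10.0    # observed serving times in seconds (placeholder)
serving_time_estimator.fit(X_hist, y_hist)

def response_ratio(batch, now):
    """HRRN priority: (waiting time + estimated serving time) / estimated serving time."""
    est = serving_time_estimator.predict(
        [[len(batch["requests"]), batch["max_predicted_length"]]]
    )[0]
    waiting = now - batch["arrival_time"]
    return (waiting + est) / est

def pick_next_batch(pending_batches):
    # Dispatch the batch with the highest response ratio: long-waiting batches
    # are not starved, while short batches still move through quickly.
    now = time.time()
    return max(pending_batches, key=lambda b: response_ratio(b, now))
```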
The approach has already shown benefits in tests of the Magnus prototype on NVIDIA V100 GPUs. Compared to baseline methods, Magnus reportedly increased request throughput by up to 234% and reduced response times by nearly 90%. The team's findings illustrate the effectiveness of leveraging generation length estimates to improve batched serving, and the technique could substantially change how LLMs are served in the LMaaS sector. Further details are available in the team's paper.