Large Language Models (LLMs) have revolutionized a wide range of artificial intelligence (AI) applications, yet serving multiple LLMs efficiently remains challenging because of their immense computational requirements. Current approaches, such as spatial partitioning that dedicates a separate group of GPUs to each LLM, fall short: the lack of concurrency between models leaves resources underutilized and hurts performance.
Many existing efforts to improve LLM serving focus on smaller models or on serving a single LLM, whereas practical deployments must handle many models at once. That is where MuxServe comes in. Developed by researchers from The Chinese University of Hong Kong, Shanghai AI Laboratory, Huazhong University of Science and Technology, and other institutions, MuxServe applies spatial-temporal multiplexing to serve multiple LLMs efficiently.
MuxServe tackles the GPU utilization problem by flexibly colocating multiple LLMs on shared hardware. The system formulates an optimization problem that determines how best to group LLMs into serving units so as to maximize GPU utilization. A unified resource manager enables effective multiplexing by dynamically allocating streaming multiprocessor (SM) resources and maintaining a head-wise cache that lets colocated models share KV cache memory. This allows MuxServe to serve LLMs with widely varying popularity and resource needs while improving overall system utilization.
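To make the head-wise cache idea concrete, here is a minimal Python sketch, not MuxServe's actual implementation: a shared GPU memory pool is carved into fixed-size per-head blocks so that colocated models with different attention-head counts can draw from the same free list. All names here (`HeadwiseKVCache`, `allocate`, `release`) and the block granularity are illustrative assumptions.

```python
# Sketch of a head-wise KV cache: the shared memory pool is divided into
# per-head blocks, so models of different sizes can share one pool.
# This is an illustration of the idea, not MuxServe's real code.

class HeadwiseKVCache:
    def __init__(self, num_blocks: int):
        # Each block holds the K/V tensors of one attention head for one
        # fixed-length chunk of tokens; all colocated models draw blocks
        # from the same free list.
        self.free_blocks = list(range(num_blocks))
        self.allocations = {}  # request_id -> list of block ids

    def allocate(self, request_id: str, num_heads: int, num_chunks: int):
        needed = num_heads * num_chunks
        if len(self.free_blocks) < needed:
            raise MemoryError("KV cache pool exhausted")
        blocks = [self.free_blocks.pop() for _ in range(needed)]
        self.allocations.setdefault(request_id, []).extend(blocks)
        return blocks

    def release(self, request_id: str):
        # Return all blocks of a finished request to the shared pool.
        self.free_blocks.extend(self.allocations.pop(request_id, []))


# Two colocated models with different head counts share one pool:
pool = HeadwiseKVCache(num_blocks=1024)
pool.allocate("req-7b", num_heads=32, num_chunks=4)
pool.allocate("req-13b", num_heads=40, num_chunks=2)
pool.release("req-7b")
```

Because allocation happens at head granularity rather than per model, memory freed by one model's completed requests is immediately reusable by any other colocated model.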
Significantly, MuxServe demonstrates superior performance in both synthetic and real-world scenarios, even when LLM popularity varies widely. It achieves this by colocating LLMs according to their popularity and computational requirements, combining a greedy placement algorithm, adaptive batch scheduling, and the unified resource manager. Together these yield efficient spatial-temporal partitioning and up to 1.8 times higher throughput than existing systems.
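The greedy placement step can be illustrated with a short sketch, again under stated assumptions rather than as the paper's exact algorithm: models are sorted by an estimated load that folds together popularity (request rate) and per-request compute cost, and each is assigned to the GPU mesh with the most remaining headroom. The `Mesh` class, the load estimate, and the per-GPU load heuristic below are all hypothetical.

```python
# Sketch of a greedy placement heuristic: place the heaviest models
# first, each onto the GPU mesh with the most free capacity.
# Illustrative only; names and the load model are assumptions.

from dataclasses import dataclass, field


@dataclass
class Mesh:
    gpus: int
    load: float = 0.0
    models: list = field(default_factory=list)


def greedy_place(models: list[tuple[str, float]],
                 meshes: list[Mesh]) -> list[Mesh]:
    """models: (name, estimated_load) pairs, where estimated_load folds
    together popularity (request rate) and per-request compute cost."""
    # Heaviest models go first so they claim the freest meshes.
    for name, load in sorted(models, key=lambda m: m[1], reverse=True):
        # Pick the mesh with the lowest load per GPU (most headroom).
        target = min(meshes, key=lambda m: m.load / m.gpus)
        target.models.append(name)
        target.load += load
    return meshes


meshes = greedy_place(
    models=[("llm-65b", 8.0), ("llm-13b", 3.0), ("llm-7b", 1.5)],
    meshes=[Mesh(gpus=4), Mesh(gpus=2)],
)
for mesh in meshes:
    print(mesh.gpus, "GPUs ->", mesh.models)
```

Placing the heaviest models first is the classic longest-processing-time heuristic for load balancing; it prevents any single mesh from saturating before the lighter, less popular models are assigned.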
In conclusion, MuxServe represents a significant step forward in LLM serving. It addresses the challenge of serving multiple LLMs concurrently through an innovative colocation approach driven by model popularity, yielding better GPU utilization. MuxServe's adaptability to varied LLM sizes and request patterns makes it well suited to the growing demands of LLM deployment. As AI continues to develop, MuxServe offers a promising foundation for efficient and scalable multi-LLM serving, and this study, together with the project's ongoing work, could prove important for the future of AI.