Large Language Models (LLMs) require an appropriate inference backend to serve requests efficiently, and the choice of backend directly influences both user experience and operational costs. A recent study by the BentoML Engineering Team benchmarked several backends to better understand their performance when serving LLMs, focusing on vLLM, LMDeploy, MLC-LLM, TensorRT-LLM, and Hugging Face TGI. The experiments were carried out on the Llama 3 8B and 70B 4-bit quantized models on an A100 80GB GPU instance.
The study evaluated the backends using two key metrics. The first is Time to First Token (TTFT), which measures how long it takes from the moment a user sends a request until the first token is generated; a shorter TTFT translates to stronger perceived responsiveness and higher user satisfaction. The second is Token Generation Rate (TGR), which measures how many tokens the model generates per second; a higher TGR indicates that the backend can handle a heavier load of requests.
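As a rough illustration of how these two metrics can be measured from the client side, the sketch below streams a single request against an OpenAI-compatible endpoint (which vLLM, LMDeploy, and TGI can all expose) and records the time to the first streamed chunk as well as the chunk rate afterwards. This is a minimal sketch, not the study's benchmark harness: the endpoint URL and model name are placeholders, and streamed chunks are only an approximation of individual tokens.

```python
import time
from openai import OpenAI

# Placeholder endpoint and credentials; point base_url at whichever
# backend you are testing (most expose an OpenAI-compatible server).
client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")


def measure_request(prompt: str,
                    model: str = "meta-llama/Meta-Llama-3-8B-Instruct"):
    """Return (ttft_seconds, approx_tokens_per_second) for one streamed request."""
    start = time.perf_counter()
    first_token_time = None
    n_chunks = 0

    stream = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        max_tokens=256,
        stream=True,
    )
    for chunk in stream:
        if not chunk.choices or not chunk.choices[0].delta.content:
            continue
        if first_token_time is None:
            # TTFT: time until the first generated token arrives.
            first_token_time = time.perf_counter()
        n_chunks += 1

    end = time.perf_counter()
    ttft = first_token_time - start
    # Approximate generation rate: streamed chunks per second after the
    # first token (chunks roughly correspond to tokens, but not exactly).
    tgr = (n_chunks - 1) / (end - first_token_time) if n_chunks > 1 else 0.0
    return ttft, tgr


print(measure_request("Explain speculative decoding in one paragraph."))
```

A real benchmark would issue many such requests concurrently (for example with 10, 50, and 100 simultaneous users) and aggregate the results, but the per-request measurement follows the same idea.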
In testing the Llama 3 8B model, the study found that LMDeploy delivered the highest TGR and a notably short TTFT, particularly with ten concurrent users. MLC-LLM achieved a slightly lower TGR, and its TTFT increased significantly at 100 concurrent users. vLLM recorded the lowest TTFT across all concurrency levels, but its TGR was less favorable than LMDeploy's and MLC-LLM's.
With the Llama 3 70B model, LMDeploy maintained the highest TGR and the shortest TTFT across all concurrency levels. TensorRT-LLM achieved a TGR similar to LMDeploy's but saw its TTFT increase significantly at 100 concurrent users. vLLM again kept a consistently low TTFT, but its TGR lagged due to a lack of optimization for quantized models.
Selecting an inference backend involves more than raw performance; factors such as quantization support, hardware compatibility, and developer experience also weigh into the decision.
In conclusion, LMDeploy provides the strongest performance for high-load scenarios thanks to its superior TTFT and TGR. vLLM offers consistently low latency, which is critical for applications that need quick response times. MLC-LLM's performance degraded under sustained load and requires further optimization for stress-heavy workloads. These findings are valuable for developers and enterprises deploying LLMs, making it easier to select the inference backend that best serves a given application. Integrating these backends with platforms like BentoML can further streamline deployment, helping ensure performance and scalability.
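As a rough sketch of what such an integration can look like, the example below wraps vLLM's offline engine in a BentoML 1.2-style service. The model name, resource settings, and sampling parameters are illustrative rather than the configuration used in the study, and production deployments typically rely on vLLM's asynchronous engine for better concurrency.

```python
import bentoml
from vllm import LLM, SamplingParams


# Minimal sketch of a BentoML service backed by vLLM; not the exact
# setup from the benchmark study.
@bentoml.service(resources={"gpu": 1}, traffic={"timeout": 300})
class Llama3Service:
    def __init__(self) -> None:
        # Model checkpoint is illustrative; swap in the model you deploy.
        self.engine = LLM(model="meta-llama/Meta-Llama-3-8B-Instruct")

    @bentoml.api
    def generate(self, prompt: str, max_tokens: int = 256) -> str:
        params = SamplingParams(temperature=0.7, max_tokens=max_tokens)
        outputs = self.engine.generate([prompt], params)
        return outputs[0].outputs[0].text
```

Once defined, the service can be served locally with the `bentoml serve` CLI pointed at the file that defines it, which exposes an HTTP endpoint suitable for the kind of concurrency testing described above.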