Large Language Models (LLMs) require an appropriate inference backend to serve requests efficiently, and the choice of backend directly influences both user experience and operational costs. A recent study by the BentoML Engineering Team benchmarked several backends to better understand their performance when serving LLMs, focusing on vLLM, LMDeploy, MLC-LLM, TensorRT-LLM, and Hugging Face TGI. The experiments were carried out on the Llama 3 8B and 70B 4-bit quantized models on an A100 80GB GPU instance.
The study evaluated the backends using two key metrics. The first is Time to First Token (TTFT), which measures how long it takes from the moment a user sends a request until the first token is generated; a shorter TTFT translates to stronger perceived responsiveness and higher user satisfaction. The second is Token Generation Rate (TGR), which measures how many tokens the model generates per second; a higher TGR indicates that the backend can handle a heavier load of requests.
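As a rough illustration of how these two metrics can be measured from the client side, the sketch below streams a single request against an OpenAI-compatible endpoint (which vLLM, LMDeploy, and TGI can all expose) and records the time to the first streamed chunk as well as the chunk rate afterwards. This is a minimal sketch, not the study's benchmark harness: the endpoint URL and model name are placeholders, and streamed chunks are only an approximation of individual tokens.

```python
import time
from openai import OpenAI

# Placeholder endpoint and credentials; point base_url at whichever
# backend you are testing (most expose an OpenAI-compatible server).
client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")


def measure_request(prompt: str,
                    model: str = "meta-llama/Meta-Llama-3-8B-Instruct"):
    """Return (ttft_seconds, approx_tokens_per_second) for one streamed request."""
    start = time.perf_counter()
    first_token_time = None
    n_chunks = 0

    stream = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        max_tokens=256,
        stream=True,
    )
    for chunk in stream:
        if not chunk.choices or not chunk.choices[0].delta.content:
            continue
        if first_token_time is None:
            # TTFT: time until the first generated token arrives.
            first_token_time = time.perf_counter()
        n_chunks += 1

    end = time.perf_counter()
    ttft = first_token_time - start
    # Approximate generation rate: streamed chunks per second after the
    # first token (chunks roughly correspond to tokens, but not exactly).
    tgr = (n_chunks - 1) / (end - first_token_time) if n_chunks > 1 else 0.0
    return ttft, tgr


print(measure_request("Explain speculative decoding in one paragraph."))
```

A real benchmark would issue many such requests concurrently (for example with 10, 50, and 100 simultaneous users) and aggregate the results, but the per-request measurement follows the same idea.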
In testing the Llama 3 8B model, the study found that LMDeploy delivered the highest TGR and a notably short TTFT, particularly with ten concurrent users. MLC-LLM achieved a slightly lower TGR, and its TTFT increased significantly at 100 concurrent users. vLLM recorded the lowest TTFT across all concurrency levels, but its TGR was less favorable than LMDeploy's and MLC-LLM's.
With the Llama 3 70B model, LMDeploy maintained the highest TGR and the shortest TTFT across all concurrency levels. TensorRT-LLM achieved a TGR similar to LMDeploy's but saw its TTFT increase significantly at 100 concurrent users. vLLM again kept a consistently low TTFT, but its TGR lagged due to a lack of optimization for quantized models.
Selecting an inference backend involves more than raw performance; factors such as quantization support, hardware compatibility, and developer experience also weigh into the decision.
In conclusion, LMDeploy provides the strongest performance for high-load scenarios thanks to its superior TTFT and TGR. vLLM offers consistently low latency, which is critical for applications that need quick response times. MLC-LLM's performance degraded under sustained load and requires further optimization for stress-heavy workloads. These findings are valuable for developers and enterprises deploying LLMs, making it easier to select the inference backend that best serves a given application. Integrating these backends with platforms like BentoML can further streamline deployment, helping ensure performance and scalability.
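As a rough sketch of what such an integration can look like, the example below wraps vLLM's offline engine in a BentoML 1.2-style service. The model name, resource settings, and sampling parameters are illustrative rather than the configuration used in the study, and production deployments typically rely on vLLM's asynchronous engine for better concurrency.

```python
import bentoml
from vllm import LLM, SamplingParams


# Minimal sketch of a BentoML service backed by vLLM; not the exact
# setup from the benchmark study.
@bentoml.service(resources={"gpu": 1}, traffic={"timeout": 300})
class Llama3Service:
    def __init__(self) -> None:
        # Model checkpoint is illustrative; swap in the model you deploy.
        self.engine = LLM(model="meta-llama/Meta-Llama-3-8B-Instruct")

    @bentoml.api
    def generate(self, prompt: str, max_tokens: int = 256) -> str:
        params = SamplingParams(temperature=0.7, max_tokens=max_tokens)
        outputs = self.engine.generate([prompt], params)
        return outputs[0].outputs[0].text
```

Once defined, the service can be served locally with the `bentoml serve` CLI pointed at the file that defines it, which exposes an HTTP endpoint suitable for the kind of concurrency testing described above.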