Evaluating the performance of large language model (LLM) inference systems is difficult with conventional metrics. Existing measurements such as Time To First Token (TTFT), Time Between Tokens (TBT), normalized latency, and Time Per Output Token (TPOT) fail to capture the full user experience during actual, real-time interactions. The shortfall is most visible in applications like chat and translation, where the speed and steadiness of the response directly shape user satisfaction. Because TTFT and TBT assess only individual token latencies, they cannot guarantee smooth, consistent token generation, which is essential in real-time applications. To overcome these limitations, researchers from Georgia Institute of Technology, Microsoft Research India, and Intel AI Lab have introduced Metron, a novel framework for evaluating LLM inference performance.
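To make the limitation concrete, here is a minimal sketch (not Metron's code; the function name and return format are illustrative) of how the conventional metrics are typically derived from per-token arrival timestamps:

```python
def conventional_metrics(request_start: float, token_times: list[float]) -> dict:
    """token_times: wall-clock arrival time of each streamed output token, in seconds."""
    ttft = token_times[0] - request_start                    # Time To First Token
    # Time Between Tokens: the gap before each subsequent token
    gaps = [b - a for a, b in zip(token_times, token_times[1:])]
    # Time Per Output Token: the mean decode-phase gap
    tpot = sum(gaps) / len(gaps) if gaps else 0.0
    return {"ttft": ttft, "tbt_per_token": gaps, "tpot": tpot}
```

Note how TPOT averages away exactly what matters for streaming: a long stall followed by a burst of tokens can yield the same mean as a perfectly steady stream, even though the two feel very different to the user.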
Metron introduces new metrics, the fluidity-index and fluid token generation rate, to capture the nuances of real-time, streaming LLM interactions more accurately. By setting token-level deadlines and measuring the fraction of deadlines met, the fluidity-index grounds the definition of user experience in concrete, per-token targets. It also accommodates unavoidable variance, such as scheduling delays and fluctuating token generation rates, rather than penalizing every small hiccup. Using the fluidity-index, the framework dynamically adjusts deadlines based on observed real-time performance, allowing it to evaluate both proprietary and open-source LLM inference systems on their capacity to manage user requests without compromising responsiveness.
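The following is a minimal sketch of the fluidity-index idea, not Metron's actual implementation: each token gets a deadline, the score is the fraction of deadlines met, and after a miss subsequent deadlines are rebased on the late token's actual arrival, so a single stall is not double-counted against every following token. The function name and the split into a TTFT deadline plus a per-token (TBT) deadline are assumptions for illustration.

```python
def fluidity_index(request_start: float, token_times: list[float],
                   ttft_deadline: float, tbt_deadline: float) -> float:
    """Fraction of token-level deadlines met for one request (illustrative sketch)."""
    if not token_times:
        return 0.0
    met = 0
    deadline = request_start + ttft_deadline      # deadline for the first token
    for arrival in token_times:
        if arrival <= deadline:
            met += 1
            deadline += tbt_deadline              # next deadline follows the schedule
        else:
            deadline = arrival + tbt_deadline     # rebase after a miss
    return met / len(token_times)
```

A score of 1.0 means every token arrived on time; values below 1.0 quantify how often the stream fell behind the target cadence.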
Compared with traditional measures, Metron offers a more accurate assessment of LLM inference systems. Metrics like the fluidity-index and fluid token generation rate reveal significant differences in user experience that TTFT or TBT alone often miss. For instance, in an evaluation of vLLM and Sarathi-Serve, Sarathi-Serve maintained a fluidity-index above 0.9 for 99% of requests while sustaining a throughput of 600 tokens per second, whereas vLLM's sustainable capacity was three times lower because of generation stalls. This suggests that Metron's approach can surface performance differences that matter for user experience in real-world scenarios.
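A comparison of this kind reduces to a system-level service check: the fraction of requests whose fluidity-index clears a target. Here is a hypothetical sketch (reusing the illustrative fluidity_index function above; the 0.9 target mirrors the reported evaluation):

```python
def fraction_fluid(requests: list[tuple[float, list[float]]],
                   ttft_deadline: float, tbt_deadline: float,
                   target: float = 0.9) -> float:
    """Share of requests (start_time, token_times) with fluidity-index >= target."""
    scores = [fluidity_index(start, times, ttft_deadline, tbt_deadline)
              for start, times in requests]
    return sum(s >= target for s in scores) / len(scores)
```

Reporting "99% of requests scored above 0.9" in this form captures per-user smoothness in a way that aggregate TTFT or TBT percentiles do not.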
In conclusion, Metron marks a significant advance in the performance evaluation of LLM inference systems. By focusing on realistic, user-facing measures that capture the intricacies of real-time token generation, it exposes problems that conventional metrics hide and offers a reliable way to assess and improve user experience in real-world applications. With its thoroughly user-centric approach, Metron is likely to drive noticeable improvements in the performance of LLM serving frameworks.