Evaluating the performance of large language model (LLM) inference systems is difficult with conventional metrics. Time To First Token (TTFT), Time Between Tokens (TBT), normalized latency, and Time Per Output Token (TPOT) each fail to capture the full user experience during real-time interactions. Such…
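
To make the four metrics named above concrete, here is a minimal sketch of how they are commonly computed from per-token timestamps. The `RequestTrace` type, its field names, and the example timings are hypothetical and chosen only for illustration. The formulas follow widely used conventions (TPOT as the mean decode time per token after the first, normalized latency as end-to-end latency divided by output length) and may differ from the exact definitions a given system or paper uses.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class RequestTrace:
    """Hypothetical record of one request: arrival time and the
    wall-clock time at which each output token was emitted (seconds)."""
    arrival: float
    token_times: List[float]


def ttft(trace: RequestTrace) -> float:
    # Time To First Token: delay from arrival to the first output token.
    return trace.token_times[0] - trace.arrival


def tbt(trace: RequestTrace) -> List[float]:
    # Time Between Tokens: gap between each pair of consecutive tokens.
    times = trace.token_times
    return [later - earlier for earlier, later in zip(times, times[1:])]


def tpot(trace: RequestTrace) -> float:
    # Time Per Output Token (assumed definition): decode time averaged
    # over the tokens that follow the first one.
    n = len(trace.token_times)
    if n < 2:
        return 0.0
    return (trace.token_times[-1] - trace.token_times[0]) / (n - 1)


def normalized_latency(trace: RequestTrace) -> float:
    # Normalized latency (assumed definition): end-to-end latency
    # divided by the number of output tokens.
    return (trace.token_times[-1] - trace.arrival) / len(trace.token_times)


if __name__ == "__main__":
    # Illustrative timings: three quick tokens, then a long stall.
    example = RequestTrace(arrival=0.0, token_times=[0.5, 0.6, 0.7, 1.7])
    print("TTFT:", ttft(example))
    print("TBT gaps:", tbt(example))
    print("TPOT:", tpot(example))
    print("Normalized latency:", normalized_latency(example))
```

In this illustrative trace, the averaged metrics (TPOT of about 0.4 s, normalized latency of about 0.425 s) give no hint that the last token arrived after a stall of about 1 s, which is one way a single summary number can miss what the user actually experiences.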
