The evaluation of Large Language Models (LLMs) requires a systematic and multi-layered approach to accurately identify areas of improvement and limitations. As these models advance and become more intricate, their assessment presents greater challenges due to the diversity of tasks they are required to execute. Current benchmarks often employ non-precise, simplistic criteria such as "helpfulness"…
