Large Language Models (LLMs) have surpassed previous generations of language models on various tasks, sometimes even equating or surpassing human performance. However, it's challenging to evaluate their true capabilities due to potential contamination in testing datasets or a lack of datasets that accurately assess their abilities.
Most studies assessing LLMs have focused primarily on the English…
