Large Language Models (LLMs) have demonstrated impressive performance on numerous tasks in recent years, particularly classification tasks. They achieve high accuracy when the correct answer, or “gold label”, is among the provided options. However, when the correct answer is deliberately left out, these models still tend to select one of the available choices, even though none of them is correct. This limitation raises serious questions about these models’ actual comprehension and intelligence in classification contexts.
Two main concerns arise from this lack of uncertainty in LLMs. First, LLMs will work with any set of labels, regardless of whether the correct one is present, which can lead to misinformation. Ideally, these models should imitate human behavior: recognize the correct label when it is offered and indicate when it is absent. Second, LLMs are primarily generative models and often lack discriminative ability, so the high performance metrics obtained in the standard setting oversimplify classification tasks and may exaggerate LLMs’ capabilities.
To facilitate further research on overcoming these limitations, the authors assemble three tasks that serve as benchmarks: BANK77, an intent classification task; MC-TEST, a multiple-choice question-answering task; and EQUINFER, a newly constructed task that requires identifying the correct equation from four candidates given the surrounding paragraphs of a scientific paper. Together, these tasks form the KNOW-NO benchmark and cover classification problems with different label-set sizes, label lengths, and label scopes.
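To make the two evaluation settings concrete, here is a minimal sketch of how a with-gold and a without-gold instance might be derived from a single labeled example. The function name, instance fields, and distractor-sampling strategy are illustrative assumptions, not the benchmark's actual construction procedure.

```python
# Minimal sketch, not the actual KNOW-NO construction: field names and the
# distractor-sampling strategy below are assumptions made for illustration.
import random

def make_instances(question, gold_label, label_pool, num_choices=4, seed=0):
    """Build a with-gold and a without-gold variant of one classification instance."""
    rng = random.Random(seed)
    wrong_labels = [label for label in label_pool if label != gold_label]

    # With-gold setting: the correct label is among the presented choices.
    with_gold_choices = rng.sample(wrong_labels, num_choices - 1) + [gold_label]
    rng.shuffle(with_gold_choices)
    with_gold = {"question": question, "choices": with_gold_choices, "answer": gold_label}

    # Without-gold setting: the correct label is deliberately removed, so the
    # desired behaviour is to abstain (e.g. answer "none of the above").
    without_gold_choices = rng.sample(wrong_labels, num_choices)
    without_gold = {"question": question, "choices": without_gold_choices, "answer": "none of the above"}

    return with_gold, without_gold

# Example usage with a toy, hypothetical intent-label pool:
# make_instances("How do I reset my PIN?", "card_pin_reset",
#                ["card_pin_reset", "lost_card", "refund_request", "exchange_rate", "top_up_failed"])
```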
A new metric, OMNIACCURACY, is introduced to measure LLMs’ performance more faithfully. It assesses classification ability by combining results from the two dimensions of the KNOW-NO framework: ACCURACY-W/-GOLD, the conventional accuracy when the correct label is among the choices, and ACCURACY-W/O-GOLD, the accuracy when the correct label is withheld. The aim of OMNIACCURACY is to more closely approximate human-level discrimination in classification tasks by capturing how models behave both when the correct label is available and when it is not.
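The summary above does not specify how the two dimensions are aggregated, so the sketch below assumes a simple mean purely for illustration; the metric name comes from the paper, but the aggregation rule and the abstention marker are assumptions.

```python
# Hedged sketch: OMNIACCURACY is described as combining two accuracy dimensions,
# but the exact aggregation is not given here, so a simple mean is assumed.

def accuracy(predictions, references):
    """Fraction of predictions that exactly match the references."""
    return sum(p == r for p, r in zip(predictions, references)) / len(references)

def omni_accuracy(preds_with_gold, refs_with_gold, preds_wo_gold, refs_wo_gold):
    """Combine ACCURACY-W/-GOLD and ACCURACY-W/O-GOLD into a single score.

    In the without-gold setting the reference is an abstention marker
    (assumed here to be "none of the above"), so a model is rewarded only
    for declining to pick an incorrect label.
    """
    acc_with_gold = accuracy(preds_with_gold, refs_with_gold)   # correct label present
    acc_wo_gold = accuracy(preds_wo_gold, refs_wo_gold)         # correct label withheld
    return (acc_with_gold + acc_wo_gold) / 2                    # assumed aggregation: simple mean
```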
This research makes several key contributions to the study of LLMs. It is the first work to highlight LLMs’ limitations when the correct answer is absent from a classification task. It presents CLASSIFY-W/O-GOLD, a novel framework for evaluating LLMs in this setting, and introduces the KNOW-NO benchmark, comprising one new and two existing classification tasks. It also proposes the OMNIACCURACY metric, which combines results from the settings where correct labels are present and absent, offering a more comprehensive evaluation of LLMs across scenarios. This work is a significant step forward in understanding LLMs’ limitations and in developing more accurate means of assessing them.