Large Language Models (LLMs) have become crucial tools for various tasks, such as answering factual questions and generating content. However, their reliability is often questionable because they frequently provide confident but inaccurate responses. Currently, no standardized method exists for assessing the trustworthiness of their responses. To evaluate LLMs’ performance and resilience to input changes, researchers have developed numerous methods, including FLASK, PromptBench, and injecting noise into prompts.
Building on this line of work, a team of researchers from VISA has introduced an innovative, model-agnostic, and unsupervised method for assessing the robustness of any black-box LLM. The team’s approach measures local deviation from harmonicity, denoted γ, to analyze LLM stability and explainability. The experiments show a positive correlation between γ and the frequency of misleading or false answers, which indicates the effectiveness of this approach.
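To make the harmonicity framing concrete, the note below sketches one natural formalization. Here f stands for the map from a prompt to the embedding of the model’s answer, and small perturbations of a prompt play the role of its neighborhood; the exact expression is an assumption based on the description in this article, not a quotation from the paper.

```latex
% Assumed formalization (sketch): f maps a prompt x to the embedding of the
% model's answer; x_1, ..., x_k are light perturbations of x.
% f is locally harmonic at x when its value matches its local average:
f(x) \;\approx\; \frac{1}{k}\sum_{i=1}^{k} f(x_i)
% and \gamma quantifies the deviation from this mean-value property, here
% taken as the angle between the two embedding vectors:
\qquad
\gamma(x) \;=\; \angle\!\Big( f(x),\; \frac{1}{k}\sum_{i=1}^{k} f(x_i) \Big)
```

In words, a perfectly harmonic model answers a prompt the same way, on average, as it answers nearby rewordings of it, and γ quantifies how far a given answer strays from that average.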
The team also developed an algorithm that computes γ for a given input to assess an LLM’s robustness on that input. The algorithm perturbs the prompt, embeds the model’s outputs, and measures the angle between the average output embedding over the perturbed inputs and the embedding of the original output; a sketch of this computation appears below. Minor grammatical perturbations typically yield small γ values and trustworthy responses, whereas larger variations usually produce higher γ values and reduced trustworthiness.
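The paper’s code is not reproduced here, but the angle computation just described can be sketched in a few lines of Python. In the sketch below, gamma_score, query_llm, the all-MiniLM-L6-v2 embedder, and the caller-supplied perturbation list are placeholders chosen for illustration, not the authors’ implementation.

```python
import numpy as np
from sentence_transformers import SentenceTransformer

# Assumed off-the-shelf sentence-embedding model; the paper's choice may differ.
embedder = SentenceTransformer("all-MiniLM-L6-v2")

def gamma_score(query_llm, prompt, perturbed_prompts):
    """Estimate gamma for one prompt against a black-box LLM.

    query_llm         -- callable: prompt string -> answer string (any black-box API)
    perturbed_prompts -- lightly reworded variants of `prompt` (paraphrases, typos, etc.)

    Returns the angle (in radians) between the embedding of the original answer
    and the mean embedding of the answers to the perturbed prompts.
    """
    original_vec = embedder.encode(query_llm(prompt))
    perturbed_vecs = np.stack([embedder.encode(query_llm(p)) for p in perturbed_prompts])
    mean_vec = perturbed_vecs.mean(axis=0)

    cosine = np.dot(original_vec, mean_vec) / (
        np.linalg.norm(original_vec) * np.linalg.norm(mean_vec)
    )
    return float(np.arccos(np.clip(cosine, -1.0, 1.0)))
```

A small returned angle means the answers to the perturbed prompts point in roughly the same direction as the original answer in embedding space; a large angle means that small input changes moved the answer substantially.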
The researchers measured the correlation between γ, robustness, and trustworthiness across different LLMs and question-answer corpora. They evaluated seven prominent models, including GPT-4, ChatGPT, and Smaug-72B, on three QA corpora: Web QA, TruthfulQA, and Programming QA. The results showed that γ values below 0.05 generally denote trustworthy responses and that larger LLMs tend to have lower γ values, implying higher trustworthiness. This finding suggests that GPT-4 often leads in both answer quality and certified trustworthiness.
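Treating the reported cutoff as a heuristic, a caller could flag answers for review as sketched below; the 0.05 value reflects the paper’s aggregate finding and would need re-validation for a specific model and domain.

```python
def flag_for_review(gamma_value: float, threshold: float = 0.05) -> bool:
    """Heuristic gate based on the reported cutoff; tune per model and domain."""
    return gamma_value >= threshold
```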
In summary, the research introduces a principled method for assessing the robustness of LLM responses through γ values, which provides insight into their trustworthiness. In addition, it proposes correlating γ with human annotations as a practical metric for assessing LLM reliability across different models and domains. Across all tested models and domains, human ratings confirm that low γ values indicate trustworthiness, with GPT-4, ChatGPT, and Smaug-72B leading among the tested models.