Large Language Models (LLMs) often fail to accurately represent how uncertain they are about the reliability of their own output. This can have serious consequences in areas such as healthcare, where stakeholder confidence in a system's predictions is critical. The variability of free-form language generation further complicates the issue, since it cannot be fully accounted for during training. Meanwhile, the divide between black-box and white-box uncertainty estimation methods has shifted: black-box methods have grown in popularity, while open-source models have made white-box methods far more accessible.
In trying to solve this issue, researchers have explored a variety of approaches. Some rely on the inherent capacity of LLMs to express a range of possible outcomes, while others use prompting techniques to elicit uncertainty estimates. However, black-box methods have often fallen short of producing useful uncertainties for popular open-source models, making supplemental fine-tuning interventions necessary.
To better understand what interventions are needed for good calibration, researchers from New York University, Abacus AI, and Cambridge University conducted an in-depth study of uncertainty calibration in LLMs. Their proposed solution is to fine-tune for better uncertainties: the resulting estimates are more reliable, are obtained faster, require relatively few additional parameters, and generalize to new question types and tasks. The approach involves teaching language models to recognize what they do not know using a calibration dataset, exploring which parameters to optimize, and determining how much data is required for effective generalization.
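As an illustration of what "relatively few additional parameters" can look like in practice, the sketch below uses LoRA adapters (via the peft library) to train a model to judge whether a proposed answer is correct. The dataset format, prompt template, hyperparameters, and target modules here are assumptions made for the example, not details taken from the paper.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model

# Hypothetical base model; the study evaluated open models such as LLaMA-2, Mistral, and LLaMA-3.
name = "mistralai/Mistral-7B-v0.1"
tok = AutoTokenizer.from_pretrained(name)
base = AutoModelForCausalLM.from_pretrained(name, torch_dtype=torch.bfloat16)

# Low-rank adapters add only a small number of trainable parameters on top of the frozen base model.
lora_cfg = LoraConfig(r=8, lora_alpha=16, target_modules=["q_proj", "v_proj"], task_type="CAUSAL_LM")
model = get_peft_model(base, lora_cfg)
model.print_trainable_parameters()  # typically well under 1% of the base model

# One calibration example: the model learns to say whether a proposed answer is correct.
question, answer, is_correct = "What is the capital of France?", "Lyon", False
prompt = f"Question: {question}\nProposed answer: {answer}\nIs the proposed answer correct? Yes or No:"
target = " Yes" if is_correct else " No"

# Single gradient step on the correctness label (full training would loop over a labeled dataset).
ids = tok(prompt + target, return_tensors="pt")
labels = ids.input_ids.clone()
prompt_len = tok(prompt, return_tensors="pt").input_ids.shape[1]
labels[:, :prompt_len] = -100  # supervise only the label tokens (approximate boundary for this sketch)
loss = model(**ids, labels=labels).loss
loss.backward()
```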
The focus of this approach is on black-box methods that require only a single sample or forward pass (one complete pass of the data through the network). For open-ended response generation, the researchers used perplexity as a length-normalized measure of sequence likelihood. They also investigated prompting methods as an alternative to sequence likelihood, experimenting with formats such as zero-shot classifiers and verbalized confidence statements.
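To make these two options concrete, here is a minimal sketch of both a length-normalized likelihood score and a verbalized confidence prompt, using Hugging Face transformers. The model name, prompt wording, and generation settings are illustrative assumptions, not the paper's exact setup.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "mistralai/Mistral-7B-v0.1"  # any open causal LM would do
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.float16, device_map="auto")

@torch.no_grad()
def sequence_confidence(question: str, answer: str) -> float:
    """Length-normalized likelihood of `answer` given `question` (the inverse of its perplexity)."""
    prompt_ids = tok(question, return_tensors="pt").input_ids.to(model.device)
    answer_ids = tok(answer, add_special_tokens=False, return_tensors="pt").input_ids.to(model.device)
    input_ids = torch.cat([prompt_ids, answer_ids], dim=1)
    logits = model(input_ids).logits
    # Position i predicts token i+1, so shift logits and targets by one.
    log_probs = torch.log_softmax(logits[:, :-1].float(), dim=-1)
    targets = input_ids[:, 1:]
    token_lp = log_probs.gather(-1, targets.unsqueeze(-1)).squeeze(-1)
    answer_lp = token_lp[:, prompt_ids.shape[1] - 1 :]  # keep only the answer tokens
    mean_lp = answer_lp.mean()                          # normalize by answer length
    return torch.exp(mean_lp).item()                    # in (0, 1]; exp(-mean_lp) is the perplexity

@torch.no_grad()
def verbalized_confidence(question: str, answer: str) -> str:
    """Ask the model to state its own confidence in a proposed answer."""
    prompt = (
        f"Question: {question}\nProposed answer: {answer}\n"
        "How confident are you that the proposed answer is correct? "
        "Answer with a single percentage.\nConfidence:"
    )
    ids = tok(prompt, return_tensors="pt").input_ids.to(model.device)
    out = model.generate(ids, max_new_tokens=5, do_sample=False)
    return tok.decode(out[0, ids.shape[1]:], skip_special_tokens=True).strip()
```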
Results from this approach showed a significant improvement over standard baselines. The researchers evaluated the black-box uncertainty estimates produced by open-source models such as LLaMA-2, Mistral, and LLaMA-3 by comparing predicted confidence against actual accuracy. They found that out-of-the-box LLM uncertainties are unreliable, whereas the proposed fine-tuning methods produce uncertainties that are better calibrated and generalize well. Importantly, the fine-tuning process proved surprisingly efficient and did not hinge on model-specific representations.
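Calibration here means that predicted confidence tracks empirical accuracy; a standard way to quantify this is the expected calibration error (ECE), sketched below. The equal-width binning scheme and bin count are conventional choices for illustration, not specifics reported by the authors.

```python
import numpy as np

def expected_calibration_error(confidences, correct, n_bins=10):
    """ECE: for each confidence bin, the gap |accuracy - mean confidence|,
    weighted by the fraction of predictions that fall in the bin."""
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=float)
    bins = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(bins[:-1], bins[1:]):
        mask = (confidences > lo) & (confidences <= hi)
        if mask.any():
            ece += mask.mean() * abs(correct[mask].mean() - confidences[mask].mean())
    return ece

# Example: a model that claims 90% confidence but is right only half the time is poorly calibrated.
print(expected_calibration_error([0.9, 0.9, 0.9, 0.9], [1, 0, 0, 1]))  # ~0.4
```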
The findings of this research suggest that calibrated uncertainties, achieved through fine-tuning, could indeed be robust against distribution shifts. For the full details, see the published paper available online.