Advances in large language model (LLM) technology have expanded its use in clinical and medical settings, where LLMs not only provide medical information and keep track of patient records but also hold consultations with patients. LLMs can generate long-form text suited to answering patient inquiries thoroughly, a setting in which responses must be both correct and instructive.
Ensuring the factual accuracy of LLM-generated responses requires an automated evaluation process. To that end, a team of researchers has developed MedLFQA, a benchmark dataset constructed from existing long-form biomedical question-answering datasets and designed to automatically gauge the factual accuracy of LLM responses.
To further improve the accuracy of LLM-generated responses, the research team has also introduced the OLAPH (Optimizing Large language models’ Answers with Preferences of reducing Hallucination) framework, which combines a series of automated assessments with iterative training to prioritize responses with the highest factuality and evaluation-metric scores. The framework generates multiple response samples for each question, uses predetermined assessment criteria to select the highest-scoring response, and then trains the LLM further on this preferred response. This method reduces the likelihood that the LLM generates false information.
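The sample, score, and select loop described above can be illustrated with a minimal Python sketch. It is not the researchers' implementation: `coverage_score`, `select_preferred`, `build_training_pairs`, and `fake_sampler` are hypothetical names, the substring-based score is a toy stand-in for the automated factuality metrics, and the example strings are made up for illustration. The actual fine-tuning step on the preferred responses is left out.

```python
from typing import Callable, Dict, List


def coverage_score(response: str, reference_statements: List[str]) -> float:
    """Toy factuality proxy (an assumption, not the paper's metric):
    the fraction of reference statements that appear verbatim in the response."""
    if not reference_statements:
        return 0.0
    text = response.lower()
    hits = sum(1 for statement in reference_statements if statement.lower() in text)
    return hits / len(reference_statements)


def select_preferred(candidates: List[str], score_fn: Callable[[str], float]) -> str:
    """Return the highest-scoring candidate response."""
    if not candidates:
        raise ValueError("at least one candidate response is required")
    return max(candidates, key=score_fn)


def build_training_pairs(
    questions: List[str],
    sample_fn: Callable[[str, int], List[str]],
    references: Dict[str, List[str]],
    n_samples: int = 3,
) -> List[Dict[str, str]]:
    """For each question, sample several responses and keep the best one.
    The resulting pairs would feed the next training round (not shown)."""
    pairs = []
    for question in questions:
        candidates = sample_fn(question, n_samples)
        best = select_preferred(
            candidates, lambda r: coverage_score(r, references[question])
        )
        pairs.append({"question": question, "preferred": best})
    return pairs


def fake_sampler(question: str, n: int) -> List[str]:
    """Hypothetical stand-in for sampling n responses from an LLM."""
    pool = [
        "Drink water.",
        "Rest and drink water; see a doctor if fever persists.",
        "Take antibiotics immediately.",
    ]
    return pool[:n]


# Made-up reference statements for a single toy question.
refs = {
    "How do I treat a mild cold?": [
        "rest",
        "drink water",
        "see a doctor if fever persists",
    ]
}
pairs = build_training_pairs(list(refs), fake_sampler, refs)
print(pairs[0]["preferred"])
```

In an iterative setup, the model fine-tuned on these preferred responses would serve as the sampler for the next round, so each round selects from a stronger pool of candidates.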
The results indicated a significant improvement in the factual accuracy of responses from LLMs trained with the OLAPH framework. For instance, a 7-billion-parameter LLM trained with OLAPH produced long-form responses that were almost on par with those of medical professionals.
In summary, the researchers developed MedLFQA to automatically assess the factual accuracy of LLM responses in the biomedical field. They created two distinct statements for evaluating the claims made in long-form responses. They also introduced the OLAPH framework, which uses iterative learning and automated assessment to improve LLM responses.
Finally, the study demonstrated that LLMs trained with the OLAPH framework can produce long-form responses whose factual accuracy is comparable to that of medical experts. This approach could significantly improve the reliability of LLMs in generating accurate medical information, which could prove vital in many medical processes and applications.