The growing adoption of large language models (LLMs) in the biomedical sector for interpretation, summarisation and decision support has led to the development of a new reliability assessment framework, Reliability AssessMent for Biomedical LLM Assistants (RAmBLA). This research, led by Imperial College London and GSK.ai, highlights the critical importance of LLM accuracy and reliability, alongside the challenges posed by the complexity of biomedical information.
RAmBLA addresses the shortcomings of traditional task-specific AI evaluation methods, instead testing LLMs across a series of carefully designed tasks that mimic real-world biomedical workflows, including parsing complex prompts, recalling information, and summarising medical literature. A crucial element of RAmBLA’s evaluation is its focus on reducing hallucinations, instances where an LLM produces plausible but incorrect or unfounded information, a failure mode that poses serious reliability risks in medical applications.
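To make this style of task concrete, the sketch below shows what a recall-over-literature prompt harness might look like. It is a minimal illustration, not the paper’s pipeline: the `query_llm` helper, the prompt template, the example passage and the question are all placeholders.

```python
# Minimal sketch of a recall-over-literature task.
# Assumes a generic chat-completion helper `query_llm(prompt: str) -> str`
# (placeholder, not the RAmBLA implementation).

ABSTRACT = (
    "Example abstract text describing a clinical study of drug X "
    "in patients with condition Y ..."  # placeholder passage
)

QUESTION = "According to the passage, which condition was studied?"  # placeholder

PROMPT_TEMPLATE = (
    "You are a biomedical research assistant.\n"
    "Passage:\n{passage}\n\n"
    "Question: {question}\n"
    "Answer using only information from the passage. "
    "If the passage does not contain the answer, reply 'I don't know'."
)


def build_prompt(passage: str, question: str) -> str:
    """Fill the template with a passage and a question."""
    return PROMPT_TEMPLATE.format(passage=passage, question=question)


def run_task(query_llm) -> str:
    """Send one recall prompt to the model and return its freeform answer."""
    prompt = build_prompt(ABSTRACT, QUESTION)
    return query_llm(prompt)
```

The freeform answer returned by such a task is then scored against a reference, rather than matched exactly, which is where semantic similarity comes in.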
The study’s findings demonstrate the superior performance of larger LLMs across several tasks. Notably, GPT-4 achieved a semantic similarity score of 0.952 on freeform Q&A over biomedical queries. Larger models also excelled at refraining from answering when presented with irrelevant content, achieving a 100% success rate on the ‘I don’t know’ task. Conversely, smaller models such as Llama and Mistral showed weaker performance, hallucinating more often and recalling information less consistently, indicating a need for further refinement.
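One common way to realise such a semantic-similarity criterion is to embed both the model answer and a reference answer and take their cosine similarity. The sketch below does this with the sentence-transformers library and an off-the-shelf embedding model as stand-ins; the embedder, the refusal heuristic and the example strings are assumptions for illustration, not details taken from RAmBLA.

```python
# Illustrative scoring sketch (not the RAmBLA code): embed a model answer
# and a reference answer, then score them by cosine similarity.
from sentence_transformers import SentenceTransformer, util

# Off-the-shelf embedding model, chosen purely for the example.
embedder = SentenceTransformer("all-MiniLM-L6-v2")


def semantic_similarity(answer: str, reference: str) -> float:
    """Cosine similarity between embeddings of a model answer and a reference."""
    emb = embedder.encode([answer, reference], convert_to_tensor=True)
    return float(util.cos_sim(emb[0], emb[1]))


def declined_to_answer(answer: str) -> bool:
    """Crude check for an 'I don't know'-style refusal (heuristic only)."""
    return "i don't know" in answer.lower()


# Example usage with made-up strings:
score = semantic_similarity(
    "The study examined patients with type 2 diabetes.",
    "Type 2 diabetes was the condition investigated.",
)
print(f"semantic similarity: {score:.3f}")
```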
In advancing the use of LLMs as reliable tools for biomedical research and healthcare, the RAmBLA framework therefore serves as a practical guide for evaluating and refining these models. More accurate, evidence-grounded LLMs could help biomedicine make significant strides toward better patient care and new approaches to research. Moreover, RAmBLA’s holistic evaluation method aligns with the increasing integration of AI in healthcare, helping to ensure that these powerful tools are deployed dependably and effectively.