Large language models (LLMs) such as Med-PaLM 2 and GPT-4 have recently shown impressive performance on clinical question-answering (QA) tasks. However, their use is constrained by high costs, ecological unsustainability, and access that is paywalled for researchers. A promising alternative is on-device AI, which runs language models on local hardware. This approach could greatly benefit biomedicine by providing access to medical information in settings where internet connectivity is limited or non-existent.
Two classes of models were examined in a biomedical context: smaller domain-specific models (<3B parameters), such as BioGPT-large and BioMedLM, and larger general-purpose 7B-parameter models, such as LLaMA 2 7B and Mistral 7B. The potential of these models for clinical QA applications remained to be determined.
In a rigorous evaluation conducted by researchers from Stanford University, University College London, and the University of Cambridge, all four models were assessed on clinical QA using MedQA and the MultiMedQA long-form answering tasks. The models were compared objectively, using the same prompt format, training data, and code for each.
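To make the idea of a standardized comparison concrete, here is a minimal sketch, assuming a Hugging Face causal language model, of how multiple-choice accuracy on a MedQA-style benchmark can be computed by likelihood scoring; the model name, prompt format, and example schema are assumptions for illustration, not the authors' exact code.

```python
# Rough sketch of a uniform multiple-choice evaluation loop: each answer
# option is scored by its log-likelihood under the model, and the top-scoring
# option is compared against the gold label. Model name, prompt format, and
# example schema are illustrative assumptions.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

def option_logprob(model, tok, question, option):
    """Sum of the model's log-probabilities over the option's tokens."""
    prompt = f"Question: {question}\nAnswer:"
    prompt_len = tok(prompt, return_tensors="pt").input_ids.shape[1]
    full_ids = tok(prompt + " " + option, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(full_ids).logits
    logprobs = torch.log_softmax(logits[0, :-1], dim=-1)  # position i predicts token i+1
    answer_ids = full_ids[0, prompt_len:]                 # tokens belonging to the option
    answer_logprobs = logprobs[prompt_len - 1:]           # predictions for those tokens
    return answer_logprobs.gather(1, answer_ids.unsqueeze(1)).sum().item()

def accuracy(model, tok, examples):
    # Each example is assumed to look like:
    # {"question": str, "options": {"A": str, ...}, "answer": "A"}
    correct = 0
    for ex in examples:
        scores = {k: option_logprob(model, tok, ex["question"], v)
                  for k, v in ex["options"].items()}
        correct += max(scores, key=scores.get) == ex["answer"]
    return correct / len(examples)

# Usage (illustrative): load a generic causal LM and score a list of items.
tok = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-v0.1")
model = AutoModelForCausalLM.from_pretrained("mistralai/Mistral-7B-v0.1").eval()
```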
The top-performing model, Mistral 7B, was fine-tuned on the MedQA training data pooled with the much larger MedMCQA training set, which supplied a far greater number of examples. This analysis allowed the researchers to explore the capabilities of mid-sized models.
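As a rough illustration of this data-pooling step, the following sketch merges MedQA-style and MedMCQA-style training examples into one instruction-formatted corpus and fine-tunes Mistral 7B with the Hugging Face Trainer; the dataset identifiers, field names, and hyperparameters are assumptions for illustration, not the authors' exact recipe.

```python
# Minimal sketch of the data-pooling idea, assuming Hugging Face datasets and
# transformers. Dataset identifiers, field names, and hyperparameters are
# illustrative assumptions, not the authors' exact setup.
from datasets import load_dataset, concatenate_datasets
from transformers import (AutoModelForCausalLM, AutoTokenizer, Trainer,
                          TrainingArguments, DataCollatorForLanguageModeling)

def medqa_to_text(ex):
    # Assumed MedQA schema: "question", "options" (letter -> text), "answer".
    opts = "\n".join(f"{k}. {v}" for k, v in ex["options"].items())
    return {"text": f"Question: {ex['question']}\n{opts}\nAnswer: {ex['answer']}"}

def medmcqa_to_text(ex):
    # Assumed MedMCQA schema: "question", "opa".."opd", integer label "cop".
    letters, opts = "ABCD", [ex["opa"], ex["opb"], ex["opc"], ex["opd"]]
    lines = "\n".join(f"{l}. {o}" for l, o in zip(letters, opts))
    return {"text": f"Question: {ex['question']}\n{lines}\nAnswer: {letters[ex['cop']]}"}

medqa = load_dataset("bigbio/med_qa", split="train").map(medqa_to_text)
medmcqa = load_dataset("medmcqa", split="train").map(medmcqa_to_text)
keep = lambda ds: ds.remove_columns([c for c in ds.column_names if c != "text"])
train = concatenate_datasets([keep(medqa), keep(medmcqa)]).shuffle(seed=42)

tok = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-v0.1")
tok.pad_token = tok.eos_token
model = AutoModelForCausalLM.from_pretrained("mistralai/Mistral-7B-v0.1")

train_tok = train.map(lambda b: tok(b["text"], truncation=True, max_length=512),
                      batched=True, remove_columns=["text"])

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="mistral7b-medqa-medmcqa",
                           per_device_train_batch_size=1,
                           gradient_accumulation_steps=16,
                           num_train_epochs=3, bf16=True),
    train_dataset=train_tok,
    data_collator=DataCollatorForLanguageModeling(tok, mlm=False),
)
trainer.train()
```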
For the MultiMedQA task, the researchers drew on a diverse array of health-related questions from multiple datasets, of the kind users typically pose to search engines, covering a wide range of consumer health topics.
These evaluations established Mistral 7B as the top performer, affirming its potential for clinical question-answering tasks, followed by BioMedLM and BioGPT-large. Researchers with the necessary computational resources can still achieve satisfactory results with BioGPT-large. Overall, however, the domain-specific models underperformed the larger models trained on general English text. The findings draw attention to the need for review by medical experts and the potential of better-optimized, larger biomedical speciality models.
Credit for this research goes to the teams at Stanford University, University College London, and the University of Cambridge. Reflecting the growing role of AI in healthcare, the work serves as a stepping stone toward applying AI models in real clinical scenarios and calls for further investigation along these lines.