Biomedical Natural Language Processing (NLP) uses machine learning to interpret medical texts, supporting diagnosis, treatment recommendations, and medical information extraction. Ensuring the accuracy of these models is challenging, however, because medical terminology is diverse and highly context-specific.
To address this issue, researchers from MIT, Harvard, and Mass General Brigham, among other institutions, developed RABBITS (Robust Assessment of Biomedical Benchmarks Involving Drug Term Substitutions). This specialized dataset evaluates language model performance by swapping brand and generic drug names, mimicking real-world variability in drug nomenclature.
The RABBITS dataset was built by systematically substituting brand names with their generic counterparts, and vice versa, in existing benchmarks such as MedQA and MedMCQA. The resulting mapping links 2,271 generic drugs to 6,961 brand names. To ensure accuracy and contextual consistency, two physicians reviewed each substitution.
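As a rough illustration of how such a substitution might work, the sketch below applies a whole-word, case-insensitive replacement over benchmark text using a small mapping table. The drug pairs and the function name are hypothetical examples, not the paper's actual pipeline, which covers the full 2,271-to-6,961 mapping and adds physician review on top.

```python
import re

# Hypothetical excerpt of a generic-to-brand mapping; the real RABBITS
# mapping links 2,271 generic drugs to 6,961 brand names.
GENERIC_TO_BRAND = {
    "acetaminophen": "Tylenol",
    "ibuprofen": "Advil",
    "atorvastatin": "Lipitor",
}

def swap_drug_names(text: str, mapping: dict[str, str]) -> str:
    """Replace each drug name with its counterpart, matching whole words
    case-insensitively so substrings of other words are left untouched."""
    pattern = re.compile(
        r"\b(" + "|".join(map(re.escape, mapping)) + r")\b",
        flags=re.IGNORECASE,
    )
    return pattern.sub(lambda m: mapping[m.group(0).lower()], text)

question = "Which adverse effect is most associated with acetaminophen overdose?"
print(swap_drug_names(question, GENERIC_TO_BRAND))
# -> "Which adverse effect is most associated with Tylenol overdose?"
```

Brand-to-generic swaps would use the inverted mapping; in practice the substituted questions still need human review, since a mechanical swap can break context (for example, combination products sold under a single brand name).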
The study found that large language models (LLMs), including popular open-source models, suffered significant performance drops when drug names were substituted. For example, the accuracy of Llama-3-70B fell by 6.9% under generic-to-brand swaps. The researchers attributed this fragility to dataset contamination: pretraining corpora were found to contain substantial amounts of benchmark test data.
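Contamination of this kind is commonly detected by searching pretraining corpora for verbatim spans of benchmark questions. The sketch below is a minimal n-gram overlap heuristic for illustration only; the function name, the choice of n, and the toy data are assumptions, not the study's exact procedure.

```python
def looks_contaminated(question: str, corpus_text: str, n: int = 8) -> bool:
    """Flag a benchmark question as potentially leaked if any n-word
    span of it appears verbatim in the pretraining text. (Illustrative
    heuristic; the study's actual contamination analysis may differ.)"""
    words = question.split()
    spans = (" ".join(words[i:i + n]) for i in range(len(words) - n + 1))
    return any(span in corpus_text for span in spans)

# Toy usage with a made-up corpus snippet:
corpus = "... a patient presents with fever and a rash after starting a new drug ..."
q = "A patient presents with fever and a rash after starting a new drug. What next?"
print(looks_contaminated(q.lower(), corpus.lower()))  # True
```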
To test the models' ability to map brand names to generic names and vice versa, the researchers used multiple-choice questions and found that larger models consistently outperformed their smaller counterparts. Even so, these models still showed substantial performance drops when evaluated on the RABBITS dataset, suggesting that their performance may rely more on memorization than on genuine reasoning and understanding of medical terminology.
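A brand-to-generic mapping probe of this kind can be framed roughly as in the sketch below; the question wording, option layout, and helper names are illustrative assumptions, not the paper's published prompt format.

```python
import random

def make_mapping_question(generic: str, brand: str, distractors: list[str]) -> dict:
    """Build a multiple-choice item asking for the brand name of a generic
    drug. (Illustrative format only, not the paper's exact template.)"""
    options = distractors + [brand]
    random.shuffle(options)
    labels = "ABCD"
    return {
        "question": f"Which brand name corresponds to the generic drug {generic}?",
        "options": {labels[i]: opt for i, opt in enumerate(options)},
        "answer": labels[options.index(brand)],
    }

item = make_mapping_question("atorvastatin", "Lipitor", ["Advil", "Zoloft", "Xanax"])
print(item["question"])
for label, opt in sorted(item["options"].items()):
    print(f"  {label}. {opt}")
print("Correct:", item["answer"])
```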
In conclusion, the study highlights the need for robust NLP systems that can process medical information accurately regardless of differences in terminology. By exposing this fragility, RABBITS offers a valuable tool for measuring and improving language model robustness, paving the way for more reliable healthcare delivery and better patient outcomes.