French researchers have developed ‘DrBenchmark’, the first publicly available benchmark designed to standardize the evaluation of pre-trained masked language models (PLMs) in French, particularly in the biomedical domain. Until now, evaluations lacked standardized protocols and comprehensive datasets, leading to inconsistent results and stalling progress in French natural language processing (NLP) research.
The advent and rapid advancement of NLP and PLMs have created a need for rigorous evaluation tools; however, most existing benchmarks focus primarily on English and Chinese. DrBenchmark fills this gap for the French biomedical community with 20 different tasks, including named-entity recognition, part-of-speech tagging, semantic textual similarity, question answering, and classification. Aggregating these diverse tasks into a single benchmark allows models to be evaluated comprehensively, from multiple perspectives.
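For illustration, an individual task could be loaded in a few lines if the benchmark’s corpora are published on the HuggingFace Hub; the ‘DrBenchmark/QUAERO’ identifier and its ‘emea’ configuration below are assumptions based on the QUAERO NER corpus, not confirmed details of the release:

```python
# Minimal sketch: loading one benchmark task with HuggingFace Datasets.
# The dataset identifier and configuration name are assumptions, not
# confirmed details of the official DrBenchmark release.
from datasets import load_dataset

quaero = load_dataset("DrBenchmark/QUAERO", "emea")  # assumed Hub identifier
print(quaero)              # expected splits: train / validation / test
print(quaero["train"][0])  # one annotated example (e.g., tokens and NER tags)
```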
Importantly, the benchmark provides a reproducible, easily customizable, and automated protocol that promotes fair comparisons between language models. It relies on the HuggingFace Datasets and Transformers libraries for data loading, fine-tuning, and evaluation, ensuring consistency by tuning every model with the same hyperparameters on each task.
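A minimal sketch of what such a fixed-hyperparameter protocol might look like in practice is shown below; the dataset identifier, column names, label count, model identifiers, and all hyperparameter values are illustrative assumptions, not the benchmark’s actual configuration:

```python
# Minimal sketch of a fixed-hyperparameter comparison protocol built on the
# HuggingFace Trainer API. Dataset name, column names ("text"/"label"),
# label count, model identifiers, and hyperparameter values are illustrative
# assumptions, not DrBenchmark's actual configuration.
from datasets import load_dataset
from transformers import (
    AutoModelForSequenceClassification,
    AutoTokenizer,
    Trainer,
    TrainingArguments,
)

dataset = load_dataset("DrBenchmark/some-classification-task")  # hypothetical

# One set of hyperparameters, shared by every model under comparison.
shared_args = TrainingArguments(
    output_dir="./runs",
    learning_rate=5e-5,              # illustrative value
    num_train_epochs=3,              # illustrative value
    per_device_train_batch_size=16,  # illustrative value
    seed=42,                         # fixed seed for reproducibility
)

def fine_tune_and_evaluate(model_name: str) -> dict:
    """Run one model through the identical fine-tuning/evaluation pipeline."""
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForSequenceClassification.from_pretrained(
        model_name, num_labels=2  # label count depends on the task
    )

    def tokenize(batch):
        return tokenizer(batch["text"], truncation=True, max_length=512)

    tokenized = dataset.map(tokenize, batched=True)
    trainer = Trainer(
        model=model,
        args=shared_args,
        train_dataset=tokenized["train"],
        eval_dataset=tokenized["validation"],
        tokenizer=tokenizer,  # enables dynamic padding during batching
    )
    trainer.train()
    return trainer.evaluate()

# The same pipeline is applied to every candidate model, so score differences
# reflect the pre-trained models rather than the tuning setup.
for name in ["camembert-base", "Dr-BERT/DrBERT-7GB"]:
    print(name, fine_tune_and_evaluate(name))
```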
The researchers tested eight state-of-the-art pre-trained masked language models spanning general and biomedical domains: French generalist models, cross-lingual generalist models, French biomedical models, and an English biomedical model. No single model excelled across all tasks, underscoring the importance of domain-specific models for top performance in the biomedical field. Interestingly, some models trained in other languages or outside the biomedical domain remained competitive on specific tasks.
In conclusion, DrBenchmark addresses the long-standing lack of evaluation resources for French biomedical NLP models. A notable finding is that models trained outside a given domain or language can still be competitive, suggesting that cross-domain and cross-lingual transfer deserve further study.