Artificial Intelligence and Machine Learning are rapidly advancing fields, and a crucial aspect of their progress is the evaluation of model performance, particularly since the advent of Large Language Models (LLMs). However, the integrity of these evaluations is often compromised by Questionable Research Practices (QRPs), which can severely inflate published results and mislead both the scientific community and the general public about the true efficacy of ML models.
QRPs typically stem from the pressure to publish in well-regarded venues or to attract funding and users. The complex nature of ML research, spanning pre-training, post-training, and evaluation phases, provides broad scope for QRPs. Such practices commonly fall into three categories: contamination, cherry-picking, and misreporting.
Contamination refers to the use of test-set data during training, evaluation, or in model prompts. Capable models such as LLMs can memorize test data encountered during training, which leads to overly optimistic performance estimates. Contamination can also give a model unfair advantages on benchmarks, leak test data through retrieval systems, and encourage the recycling of contaminated model designs or the tuning of hyperparameters against test results.
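To make the contamination risk concrete, the sketch below shows one simple auditing approach: flagging test documents that share long word n-grams with the training corpus. The 13-gram window and the function names are illustrative assumptions rather than a standard; real contamination audits for LLMs are considerably more involved.

```python
# Minimal sketch of an n-gram overlap check for train/test contamination.
# The 13-gram threshold is an illustrative assumption, not a standard.

def ngrams(text: str, n: int = 13) -> set[tuple[str, ...]]:
    """Return the set of word n-grams in a lowercased text."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def contamination_rate(train_docs: list[str], test_docs: list[str], n: int = 13) -> float:
    """Fraction of test documents sharing at least one n-gram with the training data."""
    train_grams: set[tuple[str, ...]] = set()
    for doc in train_docs:
        train_grams |= ngrams(doc, n)
    flagged = sum(1 for doc in test_docs if ngrams(doc, n) & train_grams)
    return flagged / max(len(test_docs), 1)
```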
Cherry-picking is the act of manipulating experimental conditions to favor a desired outcome. A researcher might run a model many times under varying conditions and then disclose only the most favorable results. This can involve under-tuning baseline models so that a new model appears superior, altering inference parameters after an experiment to boost performance metrics, selecting easier benchmarks or benchmark subsets on which the model is known to perform well, or reporting only the best result after training with multiple random seeds.
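The seed-picking variant is easy to illustrate. The toy sketch below contrasts cherry-picked reporting (the single best seed) with honest reporting (mean and standard deviation across all seeds); the scores are entirely hypothetical.

```python
import statistics

# Hypothetical accuracy scores from five random seeds (illustrative only).
scores = [0.712, 0.698, 0.745, 0.703, 0.691]

# Cherry-picked reporting: only the best seed survives into the paper.
print(f"best seed: {max(scores):.3f}")            # 0.745

# Honest reporting: mean and standard deviation across all seeds.
print(f"mean +/- std: {statistics.mean(scores):.3f} "
      f"+/- {statistics.stdev(scores):.3f}")      # 0.710 +/- 0.021
```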
Misreporting involves making broad claims on the basis of skewed or limited benchmarks. It can include fabricated results, unrelated modules added to assert novelty, ad-hoc patches for specific errors, selective presentation of statistically significant findings, or results reported from a single run without acknowledging variability.
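One antidote to single-run reporting is to quantify variability directly. The sketch below uses a percentile bootstrap over hypothetical per-example scores to put a confidence interval around a claimed improvement; the model names and numbers are assumptions for illustration. If the interval straddles zero, the headline gain may not be meaningful.

```python
import random
import statistics

# Hypothetical per-example correctness (1 = correct) for two models on the
# same 10-example benchmark; names and values are illustrative only.
model_a = [1, 1, 0, 1, 1, 0, 1, 1, 1, 0]
model_b = [1, 0, 0, 1, 1, 0, 1, 0, 1, 0]

def bootstrap_ci(deltas: list[int], iters: int = 10_000, alpha: float = 0.05):
    """Percentile bootstrap CI for the mean paired accuracy difference."""
    means = []
    for _ in range(iters):
        sample = random.choices(deltas, k=len(deltas))  # resample with replacement
        means.append(statistics.mean(sample))
    means.sort()
    lo = means[int(alpha / 2 * iters)]
    hi = means[int((1 - alpha / 2) * iters) - 1]
    return lo, hi

deltas = [a - b for a, b in zip(model_a, model_b)]
lo, hi = bootstrap_ci(deltas)
print(f"mean gain: {statistics.mean(deltas):.2f}, 95% CI: [{lo:.2f}, {hi:.2f}]")
# If the interval includes 0, the claimed improvement may not be significant.
```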
Within this landscape, Irreproducible Research Practices (IRPs) add a further challenge to evaluating ML. IRPs make it difficult for other researchers to replicate, build on, or assess previous work. One example is dataset hiding, where researchers withhold details about their training datasets, often because of the competitive nature of ML research and concerns over unauthorized use.
In conclusion, to preserve the credibility and longevity of ML research and evaluation, ethical practices must be maintained. Temporary gains from QRPs and IRPs erode the field's integrity and reliability over time. As ML models increasingly shape society, it is paramount to establish and adhere to stringent research guidelines. Genuine advances in ML can only be realized through transparency, accountability, and ethical research. It is therefore vital that the community works together to identify and rectify these practices, ensuring that progress in ML rests on a foundation of truthfulness and equity.