Evaluating the performance of Artificial Intelligence (AI) and Machine Learning (ML) models is crucial, especially with the advent of Large Language Models (LLMs). These evaluations assess what models can actually do and underpin the reliable systems built on those capabilities. However, certain practices, termed Questionable Research Practices (QRPs), frequently compromise the authenticity and integrity of these assessments. The driving force behind such practices is often the desire to publish in prominent venues, secure funding, or attract users.
QRPs fall into three main categories: contamination, cherrypicking, and misreporting. Contamination occurs when test-set data leaks into training data, prompts, or the evaluation pipeline itself, compromising the validity of the assessment. Forms of contamination include training on the test set, prompt contamination, retrieval-augmented generation (RAG) contamination, dirty paraphrases and contaminated models, and over-hyping and meta-contamination. A simple check for the first form is sketched below.
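To make test-set leakage concrete, here is a minimal sketch (not from the paper) of a common contamination heuristic: flagging training documents that share long word-level n-grams with benchmark test items. The `flag_contaminated` helper, the corpus contents, and the default n-gram length are illustrative assumptions, not a prescribed method.

```python
# Minimal sketch (not from the paper) of one common contamination heuristic:
# flagging training documents that share long n-grams with benchmark test items.
# The corpus and benchmark contents below are illustrative placeholders.

def ngrams(text: str, n: int = 13) -> set[tuple[str, ...]]:
    """Return the set of word-level n-grams in a text."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def flag_contaminated(train_docs: list[str], test_items: list[str], n: int = 13) -> list[int]:
    """Indices of training documents sharing at least one n-gram with a test item."""
    test_grams = set().union(*(ngrams(t, n) for t in test_items))
    return [i for i, doc in enumerate(train_docs) if ngrams(doc, n) & test_grams]

train_docs = ["a web-scraped document that happens to quote a benchmark question verbatim"]
test_items = ["a benchmark question verbatim"]
print(flag_contaminated(train_docs, test_items, n=3))  # small n only for this toy example
```

In practice, larger n (e.g., 13-grams over a normalized corpus) keeps false positives from common phrases low; fuzzier leaks such as paraphrases require more than exact matching.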
Cherrypicking involves adjusting experimental conditions to support a desired outcome, often yielding suspiciously high performance estimates. These practices include baseline nerfing (under-tuning the baselines a method is compared against), runtime hacking, benchmark hacking, and the golden seed method (sweeping many random seeds and reporting only the best run), illustrated in the sketch after this paragraph.
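The following toy sketch (not from the paper) shows why the golden seed method inflates results: reporting the maximum over many seeds overstates the accuracy a typical user would observe. The training function and its numbers are synthetic stand-ins.

```python
# Illustrative sketch (not from the paper) of the "golden seed" method: sweeping
# random seeds and reporting only the best run. All numbers here are synthetic.
import random

def train_and_eval(seed: int) -> float:
    """Stand-in for a full training run; accuracy varies only with the seed."""
    rng = random.Random(seed)
    return 0.80 + rng.gauss(0, 0.02)  # true mean 80%, seed-to-seed noise

scores = [train_and_eval(seed) for seed in range(50)]
print(f"honest report : {sum(scores) / len(scores):.3f} averaged over {len(scores)} seeds")
print(f"'golden seed' : {max(scores):.3f} (best single seed)")
```

The gap between the two printed numbers is pure selection effect: no modeling improvement occurred, yet the best-of-50 figure looks several points stronger.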
Misreporting covers strategies that present broad generalizations based on restricted or skewed benchmarks. These include adding superfluous modules to claim originality, adapting specific problems to suit the method, selectively presenting statistically significant findings, disregarding variability by reporting results from a single run, and outright lying or making false claims about a model's abilities. The sketch below shows how selective presentation alone can manufacture an apparent improvement.
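As an illustration of selective reporting, here is a small sketch (not from the paper) in which two models of identical true quality are compared on many benchmarks, and only the favorable comparisons are surfaced. The benchmark names and scores are synthetic assumptions.

```python
# Illustrative sketch (not from the paper) of selective reporting: comparing two
# equally good models on many benchmarks and publicizing only the "wins" makes a
# null difference look like an improvement. Benchmarks and scores are synthetic.
import random

rng = random.Random(0)
benchmarks = [f"task_{i}" for i in range(20)]

# Both models have the same true accuracy; observed scores differ only by noise.
results = {b: (0.75 + rng.gauss(0, 0.02), 0.75 + rng.gauss(0, 0.02)) for b in benchmarks}

# Keep only the tasks where "our" model happened to come out ahead.
wins = {b: (ours, base) for b, (ours, base) in results.items() if ours > base}
print(f"model beats baseline on {len(wins)}/{len(benchmarks)} tasks")
print("reported subset:", sorted(wins)[:3])  # only the favorable tasks get shown
```

Reporting all benchmarks, with per-task variance across repeated runs, is the straightforward antidote.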
Besides QRPs, the evaluation landscape of ML research is further complicated by Irreproducible Research Practices (IRPs), which make it hard for subsequent researchers to replicate, build upon, or scrutinize prior work. One example is dataset concealment: withholding data out of competitive or copyright concerns, which undermines transparency in data sharing and hampers the validation and reproduction of findings.
In summary, maintaining the integrity of ML research and evaluation is of utmost importance. Although QRPs and IRPs may bring short-term benefits to companies and researchers, they ultimately undermine the credibility and reliability of the field. As ML models become more prevalent and their societal impact grows, it is crucial to establish and uphold rigorous research standards. The true potential of ML models can only be realized through openness, accountability, and adherence to ethical research practices. The community should work together to recognize and mitigate these practices, ensuring that advances in ML are rooted in honesty and fairness. The researchers behind the paper highlighted in this article discuss these issues in detail; interested readers are encouraged to consult the paper for a comprehensive treatment of the subject.