In the rapidly advancing field of Artificial Intelligence (AI), accurately evaluating model outputs is becoming an increasingly complex task. State-of-the-art systems such as GPT-4 are trained with Reinforcement Learning from Human Feedback (RLHF), which relies on human judgement to guide the training process. However, as AI models grow more intricate, even experts find it challenging to judge the accuracy and quality of their outputs.
As a solution, OpenAI has developed CriticGPT, a tool that helps AI trainers identify errors in ChatGPT’s responses. The model’s main goal is to scrutinise mistakes closely, particularly in code outputs. By addressing the inherent limitations of human review in RLHF, CriticGPT provides a scalable supervision mechanism that enhances the reliability and accuracy of AI systems.
CriticGPT has demonstrated significant effectiveness in refining the evaluation process. In the experiments conducted, human reviewers who evaluated ChatGPT’s code outputs with CriticGPT’s assistance outperformed those working without it 60% of the time. This improvement underscores CriticGPT’s capacity to augment human-AI collaboration, resulting in more thorough and accurate assessments of AI outputs.
Efforts are underway to integrate models like CriticGPT into the RLHF labeling pipeline. This would give AI trainers explicit AI assistance, making it easier to evaluate the outputs of sophisticated AI systems. It is essential progress toward addressing a fundamental issue of RLHF: human trainers struggle to identify subtle errors in the outputs of increasingly capable models.
ChatGPT is powered by models from the GPT-4 series and trained with RLHF to be engaging and informative. AI trainers play a crucial role in this process: they rate several ChatGPT responses against each other to generate the comparison data on which RLHF depends. As ChatGPT becomes more accurate, its remaining errors become subtler and harder to spot, which makes this comparison step, fundamental to RLHF, increasingly difficult.
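To make the comparison step concrete, below is a minimal, hypothetical sketch of how pairwise trainer preferences are commonly turned into a reward-model training signal in RLHF. This is not OpenAI’s actual pipeline; the `reward_model` stub, the example data, and the scoring are placeholders for illustration only.

```python
import math

# Toy stand-in for a learned reward model: in practice this would be a
# neural network scoring (prompt, response) pairs; here it is only a stub.
def reward_model(prompt: str, response: str) -> float:
    return float(len(response) % 7)  # placeholder score, not meaningful

def preference_loss(score_preferred: float, score_rejected: float) -> float:
    """Bradley-Terry style loss: lower when the preferred response
    out-scores the rejected one."""
    return -math.log(1.0 / (1.0 + math.exp(-(score_preferred - score_rejected))))

# Each trainer comparison becomes one training example:
# (prompt, preferred_response, rejected_response).
comparisons = [
    ("Explain recursion.",
     "Recursion is when a function calls itself on a smaller subproblem ...",
     "Recursion is a loop."),
]

for prompt, preferred, rejected in comparisons:
    loss = preference_loss(reward_model(prompt, preferred),
                           reward_model(prompt, rejected))
    print(f"comparison loss: {loss:.3f}")
    # In a real pipeline this loss is backpropagated through the reward
    # model, which then guides policy optimisation (e.g. with PPO).
```

The harder it is for trainers to tell responses apart, the noisier these comparison labels become, which is precisely the bottleneck CriticGPT targets.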
CriticGPT can author detailed critiques that highlight errors in ChatGPT’s responses. By helping AI trainers spot subtle mistakes, it improves the reliability and accuracy of the evaluation process. This progress is significant because it helps keep sophisticated AI models aligned with their intended behaviours and objectives.
The team has documented their main contributions as follows:
– The team has demonstrated the first simple, scalable oversight method that substantially helps humans detect problems in real-world RLHF data.
– The team has found that CriticGPT-generated critiques catch more inserted bugs than critiques written by human contractors drawn from the ChatGPT and CriticGPT training pools.
– The team has found that critic models and human contractors working together produce more comprehensive critiques than contractors working alone, while hallucinating fewer problems than critiques produced by models on their own.
– The team has introduced Force Sampling Beam Search (FSBS), an inference-time sampling and scoring technique that balances the trade-off between catching real bugs and raising false or nit-picking concerns in LLM-generated critiques; a rough sketch of the selection step follows this list.
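As a rough illustration of the idea behind FSBS, the sketch below samples several candidate critiques and selects one by trading a reward-model score against the number of issues the critique flags. The helper functions (`generate_critiques`, `critique_reward`, `count_flagged_issues`) and the `length_modifier` parameter are hypothetical stand-ins for this article, not the paper’s implementation.

```python
from typing import List

def generate_critiques(code: str, n: int) -> List[str]:
    # Placeholder: a real critic would sample n critiques, each forced to
    # quote ("highlight") specific snippets of the code under review.
    return [f"Critique {i}: possible issue near highlighted snippet {i + 1}."
            for i in range(n)]

def critique_reward(critique: str) -> float:
    # Placeholder: a real reward model scores how helpful the critique is.
    return len(critique) / 100.0

def count_flagged_issues(critique: str) -> int:
    # Placeholder: count distinct problems the critique points at.
    return critique.count("issue")

def select_critique(code: str, n_samples: int = 4,
                    length_modifier: float = 0.5) -> str:
    """Pick the candidate maximising reward + length_modifier * issues flagged.
    A larger length_modifier favours longer, more comprehensive critiques
    (risking nitpicks); a smaller one favours precision."""
    candidates = generate_critiques(code, n_samples)
    return max(candidates,
               key=lambda c: critique_reward(c)
               + length_modifier * count_flagged_issues(c))

buggy_snippet = "def add(a, b):\n    return a - b"
print(select_critique(buggy_snippet))
```

Tuning the trade-off parameter at inference time lets trainers choose between more exhaustive critiques and more precise ones without retraining the critic.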
All credit for this research goes to the project’s researchers. Further details can be found in the research paper.