Researchers from Scale AI, the Center for AI Safety, and several leading academic institutions have launched a benchmark for measuring how much dangerous knowledge large language models (LLMs) contain. Alongside it, they present a new technique that lets these models “unlearn” hazardous data, making it harder for bad actors to use AI to mount cyber attacks or develop weaponry.
The measurement is based on the Weapons of Mass Destruction Proxy (WMDP) benchmark, a set of 4,157 multiple-choice questions drawn from biosecurity, cybersecurity, and chemical security. A high score signals that an LLM could assist in acts of violence; a lower score denotes less risk.
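As a rough illustration of how such a benchmark is scored, the sketch below rates a causal language model on WMDP-style multiple-choice items by comparing the log-likelihood it assigns to each answer choice. The prompt template, item schema, and helper names are assumptions for illustration, not the official WMDP evaluation harness.

```python
# Hypothetical sketch of multiple-choice scoring; not the official WMDP harness.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

def choice_loglikelihood(model, tokenizer, question, choice):
    """Log-likelihood the model assigns to a candidate answer given the question."""
    prompt = f"Question: {question}\nAnswer:"
    prompt_ids = tokenizer(prompt, return_tensors="pt").input_ids
    full_ids = tokenizer(prompt + " " + choice, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(full_ids).logits
    # Score only the answer tokens (approximated as everything after the prompt).
    answer_len = full_ids.shape[1] - prompt_ids.shape[1]
    log_probs = torch.log_softmax(logits[0, :-1], dim=-1)
    targets = full_ids[0, 1:]
    return log_probs[-answer_len:].gather(1, targets[-answer_len:, None]).sum().item()

def wmdp_accuracy(model, tokenizer, items):
    """Fraction of items where the highest-likelihood choice is the correct one."""
    correct = 0
    for item in items:  # assumed schema: {"question": str, "choices": [str], "answer": int}
        scores = [choice_loglikelihood(model, tokenizer, item["question"], c)
                  for c in item["choices"]]
        correct += int(scores.index(max(scores)) == item["answer"])
    return correct / len(items)
```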
The researchers also developed an algorithm, Contrastive Unlearn Tuning (CUT), that targets and expunges hazardous knowledge from LLMs while preserving benign information. CUT does this by combining a “forget term,” which degrades the LLM’s competence on dangerous topics, with a “retain term,” which maintains its ability on innocuous subjects.
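The published formulation is more involved, but the PyTorch sketch below conveys the shape of such an objective: a forget term that steers the model’s hidden activations on hazardous text toward a random direction, plus a retain term that anchors its activations on benign text to those of a frozen copy of the original model. The layer index, coefficients, and loss choices are illustrative assumptions, not the authors’ published hyperparameters.

```python
# Simplified sketch of a forget/retain objective in the spirit of CUT.
# Layer choice, coefficients, and loss functions are assumptions.
import torch

def cut_style_loss(updated_model, frozen_model, forget_batch, retain_batch,
                   layer_idx=7, steering_coeff=20.0, alpha=100.0):
    # Forget term: push hidden states on hazardous text toward a random
    # direction, degrading competence on that topic. (In practice this
    # control vector would be sampled once and fixed across training steps.)
    h_forget = updated_model(**forget_batch, output_hidden_states=True) \
        .hidden_states[layer_idx]
    rand_dir = torch.rand(h_forget.shape[-1], device=h_forget.device)
    control = steering_coeff * rand_dir / rand_dir.norm()
    forget_term = torch.nn.functional.mse_loss(h_forget, control.expand_as(h_forget))

    # Retain term: keep hidden states on benign text close to those of the
    # original, frozen model, preserving general ability.
    h_retain = updated_model(**retain_batch, output_hidden_states=True) \
        .hidden_states[layer_idx]
    with torch.no_grad():
        h_retain_ref = frozen_model(**retain_batch, output_hidden_states=True) \
            .hidden_states[layer_idx]
    retain_term = torch.nn.functional.mse_loss(h_retain, h_retain_ref)

    return forget_term + alpha * retain_term
```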
Much of the data in LLM training sets is dual-use, meaning the same information can serve both harmful and benign purposes, which complicates the unlearning process. The WMDP benchmark, however, lets researchers construct “forget” and “retain” datasets that guide the CUT technique.
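As a rough sketch of how the two corpora might be assembled (the file name and the choice of benign corpus below are placeholders, not the official WMDP release):

```python
# Hypothetical pairing of a "forget" corpus (hazardous dual-use text) with a
# "retain" corpus (benign text). Names are placeholders for illustration.
from datasets import load_dataset

# Forget set: text covering the hazardous topics probed by WMDP.
forget_ds = load_dataset("text", data_files="biosecurity_forget_corpus.txt")["train"]

# Retain set: general-purpose text the model should keep performing well on.
retain_ds = load_dataset("wikitext", "wikitext-2-raw-v1", split="train")
```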
A series of tests with WMDP measured how readily the ZEPHYR-7B-BETA model disclosed dangerous information before and after applying CUT. One challenge was distinguishing harmful from helpful knowledge within the same field, and the researchers noted that greater precision in the unlearning process is needed to resolve it.
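A before-and-after comparison along these lines could reuse the hypothetical wmdp_accuracy helper sketched earlier; the unlearned checkpoint path and the wmdp_items list here are placeholders.

```python
# Hypothetical before/after comparison; checkpoint path and items are placeholders.
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("HuggingFaceH4/zephyr-7b-beta")
base = AutoModelForCausalLM.from_pretrained("HuggingFaceH4/zephyr-7b-beta")
unlearned = AutoModelForCausalLM.from_pretrained("./zephyr-7b-beta-cut")  # placeholder path

# wmdp_items: list of benchmark items in the schema assumed above.
print("WMDP accuracy before CUT:", wmdp_accuracy(base, tokenizer, wmdp_items))
print("WMDP accuracy after CUT: ", wmdp_accuracy(unlearned, tokenizer, wmdp_items))
```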
While not perfect, the CUT methodology can remove undesirable and harmful information from both closed and open-source AI models. Despite concerns that openly released models could be retrained to recover dangerous knowledge, the CUT approach is a valuable and resilient addition to the alignment methods currently used to manage what AI models know.
Thanks to these research efforts, AI language models’ capacity to recall risky information should be significantly reduced in the future. The researchers underscore, however, that the unlearning process still needs greater precision before these models can be used with more security.