Language model adaptation is an integral technique in artificial intelligence: it allows large, pre-existing language models to be modified so that they work effectively across a range of languages. Despite their remarkable performance in English, large language models (LLMs) tend to lose much of their capability when applied to less well-resourced languages, which makes dedicated adaptation techniques necessary.
A major challenge in adapting language models to new languages is “catastrophic forgetting.” This phenomenon occurs when a model loses fluency in its original language while learning a new one, which substantially limits its practical use: retaining the fundamental capabilities of the base model is vital for solving tasks in the new language.
Several methods are currently used to counteract catastrophic forgetting, such as extended pretraining and instruction tuning with experience replay. However, these techniques need to be re-evaluated when the exact source data is unknown: because experience replay can then only be approximated, its efficacy is curtailed, and further regularization is needed to uphold the model’s performance in the base language.
To address this issue, researchers from INSAIT, LogicStar.ai, ETH Zurich, the University of Chicago, and Together AI have introduced a new procedure called Branch-and-Merge (BAM). BAM splits the training data into multiple slices, fine-tunes separate copies of the base model on each slice, and then merges the resulting models into a new base model for the next iteration. They applied BAM to models such as MISTRAL-7B and LLAMA-3-8B to adapt them from primarily English to languages such as Bulgarian and German, with positive results.
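The sketch below illustrates this iterative split-train-merge loop. It is a minimal sketch, not the authors' implementation: the helper names (`split_into_slices`, `branch_and_merge`, the `finetune` callback) are hypothetical, and plain parameter averaging is assumed as the merge operator, which is only one of several possible merging strategies.

```python
import copy
import torch

def split_into_slices(data, num_slices):
    """Partition the training examples into roughly equal, disjoint slices."""
    return [data[i::num_slices] for i in range(num_slices)]

def average_weights(models):
    """Merge branch models by averaging their parameters (a simple merge strategy)."""
    merged = copy.deepcopy(models[0])
    merged_state = merged.state_dict()
    for key in merged_state:
        stacked = torch.stack([m.state_dict()[key].float() for m in models], dim=0)
        merged_state[key] = stacked.mean(dim=0).to(merged_state[key].dtype)
    merged.load_state_dict(merged_state)
    return merged

def branch_and_merge(base_model, training_data, num_slices, num_iterations, finetune):
    """Iteratively branch the base model, fine-tune one branch per data slice,
    and merge the branches into the base model for the next iteration.
    `finetune(model, data_slice)` stands in for continued pretraining or
    instruction tuning on a single slice and returns the tuned model."""
    model = base_model
    for _ in range(num_iterations):
        slices = split_into_slices(training_data, num_slices)
        branches = [finetune(copy.deepcopy(model), data_slice) for data_slice in slices]
        model = average_weights(branches)  # merged model becomes the next base
    return model
```

Because each branch only sees a slice of the data, its weight update stays small, and averaging the branches keeps the merged model closer to the original base, which is the intuition behind the reduced forgetting reported below.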
Upon further analysis, the researchers found that BAM significantly reduced forgetting while matching or improving target-domain performance compared to standard continued pretraining and instruction fine-tuning. Remarkably, the BAM-trained LLAMA-3-8B model surpassed its standard counterpart by 10.9% on Bulgarian tasks and 1.3% on English tasks, thanks to the smaller, more efficient weight modifications that BAM induces.
To assess the efficacy of BAM, the researchers evaluated both approximate and minimal experience replay. Their findings showed that approximate experience replay yielded a larger increase in target-domain performance and less forgetting of the source domain than minimal experience replay.
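To make the data-mixing idea concrete, the sketch below builds a training mix in which a fraction of the examples comes from a replay corpus that only approximates the (unknown) original pretraining data. The function name, the `replay_ratio` value, and the sampling scheme are all assumptions for illustration, not the authors' exact setup.

```python
import random

def build_replay_mix(target_data, replay_data, replay_ratio=0.05, seed=0):
    """Combine target-language examples with examples drawn from a replay corpus
    that approximates the source-domain distribution (approximate experience replay).
    `replay_ratio` is the fraction of the final mix taken from the replay corpus."""
    rng = random.Random(seed)
    num_replay = int(len(target_data) * replay_ratio / (1.0 - replay_ratio))
    replay_sample = rng.sample(replay_data, min(num_replay, len(replay_data)))
    mixed = list(target_data) + replay_sample
    rng.shuffle(mixed)
    return mixed
```

Minimal experience replay would correspond to using a very small `replay_ratio` (or none at all), whereas approximate replay mixes in a noticeable share of source-like data to anchor the model's original-language performance.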
Moreover, BAM has proven successful in instruction fine-tuning. Using BAM to mix English fine-tuning data with German or Bulgarian data marginally improved learning in both target languages while noticeably reducing forgetting. In Bulgarian instruction tuning, the BAM-trained models surpassed the standard instruction fine-tuned models, performing 10.8% better on Bulgarian tasks and 1.3% better on English tasks.
In summary, BAM offers an effective solution for mitigating catastrophic forgetting in language model adaptation. The technique preserves the model’s effectiveness in the original language while boosting its performance in the target language. It can therefore benefit practitioners working on multilingual AI applications, providing an efficient way to adapt large language models to different linguistic settings. The research shows that BAM balances learning and forgetting well, making it a valuable method for continued pretraining and instruction tuning in both alphabet-sharing and non-alphabet-sharing languages.