Recent advances in artificial intelligence, particularly in Large Language Models (LLMs), have made these models a vital tool in many applications. However, the high computational cost of training them has limited their accessibility and stifled wider development. Several open-source efforts, including BLOOM, StarCoder, StarCoder-2, Pythia, and OLMo, have attempted to address this issue by broadening access to pretrained LLMs.
Despite this, LLMs still struggle with languages other than English, largely because their training data is predominantly English. This performance gap across languages needs to be addressed, and the development of multilingual models should be promoted. Continual pretraining introduces its own difficulty: catastrophic forgetting, where a model loses previously learned capabilities as it is trained on new data. This is a particular problem for models that must absorb diverse grammatical and lexical structures. In addition, AI development needs to comply with regulations on safety and security, an aspect often overlooked in open-source LLM development, especially for multilingual models.
In response to these issues, researchers have created AURORA-M, a novel open-source, multilingual LLM with 15 billion parameters. The model covers six linguistically diverse settings: English, Finnish, Hindi, Japanese, Vietnamese, and code. It was continually pretrained on an additional 435 billion tokens, bringing its total training volume to 2 trillion tokens and giving it a comprehensive understanding of these languages and of code.
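For readers who want to experiment, the minimal sketch below shows how a model like AURORA-M could be loaded and queried with the Hugging Face transformers library. The repository identifier "aurora-m/aurora-m-base" is an assumption used only for illustration; check the model's HF page for the exact name and hardware requirements.

```python
# Minimal sketch: loading a 15B-parameter causal LM and generating text.
# The model ID "aurora-m/aurora-m-base" is assumed for illustration only;
# consult the official HF page for the actual repository name.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "aurora-m/aurora-m-base"  # assumed identifier
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,  # reduce memory footprint for a 15B model
    device_map="auto",           # spread layers across available GPUs
)

# Example prompt in one of the supported languages (Finnish).
prompt = "Kerro lyhyesti suurista kielimalleista."
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=128, do_sample=False)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```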
Importantly, safety was a fundamental design consideration for AURORA-M, and it is the first multilingual open-source LLM to be aligned with the Biden-Harris Executive Order on Safe, Secure, and Trustworthy Development and Use of Artificial Intelligence. This was achieved by fine-tuning the model on a dataset that addresses key safety concerns, including harm prevention, resistance to cyber-attacks, prevention of illegal activity, privacy protection, and safety controls.
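The exact safety-alignment pipeline is described in the paper and is not reproduced here. As a hedged illustration of the general approach of instruction-style supervised fine-tuning on safety-focused prompt-response pairs, the sketch below uses plain transformers components; the two training examples are invented placeholders, not the authors' red-team data, and the model identifier is again an assumption.

```python
# Illustrative sketch of safety-oriented supervised fine-tuning (SFT).
# The examples below are invented placeholders, NOT the authors' dataset,
# and "aurora-m/aurora-m-base" is an assumed model identifier.
from datasets import Dataset
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

model_id = "aurora-m/aurora-m-base"  # assumed identifier
tokenizer = AutoTokenizer.from_pretrained(model_id)
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token  # ensure padding is possible
model = AutoModelForCausalLM.from_pretrained(model_id)

# Instruction/response pairs touching on safety themes (placeholder examples).
examples = [
    {"text": "Instruction: How do I secure my home Wi-Fi network?\n"
             "Response: Use WPA2 or WPA3 encryption and a strong passphrase."},
    {"text": "Instruction: How can I protect my personal data online?\n"
             "Response: Limit what you share and enable two-factor authentication."},
]
dataset = Dataset.from_list(examples).map(
    lambda ex: tokenizer(ex["text"], truncation=True, max_length=512),
    remove_columns=["text"],
)

trainer = Trainer(
    model=model,
    args=TrainingArguments(
        output_dir="aurora-m-safety-sft",
        per_device_train_batch_size=1,
        num_train_epochs=1,
    ),
    train_dataset=dataset,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
```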
Tests showed that AURORA-M avoided catastrophic forgetting on English and coding tasks and performed competitively on multilingual benchmarks. The safety assessments likewise confirmed that the model behaves in line with the responsible-AI goals it was aligned to.
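The paper reports these results on standard English, code, multilingual, and safety benchmarks. As a rough, hedged illustration of how one might probe for catastrophic forgetting locally, the sketch below compares the perplexity of a base model and a continually pretrained model on a held-out English passage; both model identifiers and the passage are assumptions for illustration, not the authors' evaluation setup.

```python
# Hedged sketch: comparing perplexity of two causal LMs on held-out English
# text as a coarse check for catastrophic forgetting. Model identifiers are
# assumptions for illustration only.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

def perplexity(model_id: str, text: str) -> float:
    """Return the perplexity of the given causal LM on `text`."""
    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForCausalLM.from_pretrained(
        model_id, torch_dtype=torch.bfloat16, device_map="auto"
    )
    inputs = tokenizer(text, return_tensors="pt").to(model.device)
    with torch.no_grad():
        # Feeding the inputs as labels yields the average causal-LM loss.
        loss = model(**inputs, labels=inputs["input_ids"]).loss
    return torch.exp(loss).item()

held_out = "The quick brown fox jumps over the lazy dog. " * 20
for mid in ("base-model/before-continual-pretraining",  # assumed IDs
            "aurora-m/aurora-m-base"):
    print(mid, perplexity(mid, held_out))
```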
In conclusion, AURORA-M marks a significant step toward accessible, multilingual, and safe LLMs, addressing challenges of accessibility, language diversity, continual learning, and legal compliance. Researchers and developers can now build on this pioneering model, though users must remain mindful of the content it generates and assess its potential implications.
The AURORA-M research paper is publicly available, and more information can be found on the project's Hugging Face page. Credit for this work goes to the researchers who developed this promising project.