Introducing Orion-14B: A Newly Developed Open-source Multilingual Large Language Model Trained on 2.5T Tokens Spanning Chinese, English, Japanese, and Korean

The advent of AI has brought large language models (LLMs) into a variety of fields, including dialogue systems, machine translation, and information retrieval. Researchers at OrionStar have created a new LLM named Orion-14B: a 14-billion-parameter model trained on a multilingual corpus of 2.5 trillion tokens covering Chinese, English, Japanese, Korean, and other languages.

Orion-14B is not a single model but a family of models, each with its own specialization. Among them is Orion-14B-Chat-RAG, fine-tuned on custom data for superior performance on retrieval-augmented generation (RAG) tasks, and Orion-14B-Chat-Plugin, designed for agent-related scenarios. The series also includes variants tailored to specific applications, such as long-context and quantized models.
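Since the series is released as open-source checkpoints, the variants can in principle be swapped by changing a repository id. Below is a minimal sketch of loading one variant with Hugging Face transformers; the repo id "OrionStarAI/Orion-14B-Chat" and the use of trust_remote_code are assumptions based on common practice for custom model releases, not details confirmed by this article.

```python
# Minimal sketch: loading an Orion-14B variant with Hugging Face transformers.
# The repo id below is an assumption; swap in the -Chat-RAG or -Chat-Plugin
# variant name as needed.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "OrionStarAI/Orion-14B-Chat"

tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,  # half precision keeps the 14B weights at roughly 28 GB
    device_map="auto",           # shard across available GPUs
    trust_remote_code=True,
)

# Plain causal-LM generation; the exact prompt format expected by the chat
# variants is model-specific and not specified here.
inputs = tokenizer("Hello, how are you?", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```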

These models have proven versatile, with strong results on human-annotated blind tests. Notably, the quantized versions of Orion-14B reduce model size by a significant 70% and improve inference speed by 30%, with a performance loss of less than 1%. In terms of capability, the models have been observed to outperform other models at the 20-billion-parameter scale, especially on Japanese and Korean test sets.
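The roughly 70% size reduction is consistent with quantizing 16-bit weights down to about 4 bits per parameter. A quick back-of-the-envelope check, in which the storage formats are assumptions rather than details from the article:

```python
# Back-of-the-envelope check of the ~70% size-reduction claim.
# Assumes the baseline stores weights in fp16 (2 bytes/param) and the
# quantized model uses roughly 4 bits/param; both are assumptions.
params = 14e9  # 14 billion parameters

fp16_gb = params * 2 / 1e9    # ~28 GB baseline
int4_gb = params * 0.5 / 1e9  # ~7 GB at 4 bits/param

reduction = 1 - int4_gb / fp16_gb
print(f"fp16: {fp16_gb:.0f} GB, int4: {int4_gb:.0f} GB, reduction: {reduction:.0%}")
# -> fp16: 28 GB, int4: 7 GB, reduction: 75%
# Close to the reported ~70% once quantization metadata and any layers
# left unquantized are accounted for.
```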

The dataset used to train the models is primarily text-based and sourced from a wide range of topics and languages, predominantly English and Chinese, which together make up 90% of the dataset. The team also aims to grow the share of Japanese and Korean text to over 5% of the content. Additional text sources span other languages, including Spanish, French, German, and Arabic.

Despite the challenges faced during their creation, the Orion-14B series represents a breakthrough in multilingual LLMs, outperforming other open-source models of comparable size and offering a robust baseline for future LLM research. The researchers are committed to further optimizing the efficiency of these models, which in turn could bolster research in this field.

Those interested in the model or the research paper can follow the project's researchers or join the related online communities for further discussion. Regular updates are also shared through newsletters and a Telegram Channel.
