The Beijing Academy of Artificial Intelligence (BAAI), in collaboration with researchers from the University of Science and Technology of China, has launched BGE M3-Embedding, aiming to address challenges in existing embedding models. The new model introduces three novel properties of text embedding: Multi-Linguality, Multi-Functionality, and Multi-Granularity.
The biggest challenges with existing models such as Contriever, GTR, and E5 are their limited language support, single retrieval functionality, and inability to accommodate long input texts: they are typically trained only on English data and support just one retrieval mode. In contrast, BGE M3-Embedding supports more than 100 languages, handles diverse retrieval functionalities (dense, sparse, and multi-vector retrieval), and processes long inputs of up to 8192 tokens.
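To make the three retrieval modes concrete, here is a minimal sketch of producing all three representations from one model, assuming the FlagEmbedding package and its BGEM3FlagModel interface; parameter names and output shapes may differ across library versions, so treat this as illustrative rather than definitive:

```python
# pip install -U FlagEmbedding
from FlagEmbedding import BGEM3FlagModel

# Load the model; use_fp16 trades a little precision for speed.
model = BGEM3FlagModel("BAAI/bge-m3", use_fp16=True)

sentences = [
    "What is BGE M3-Embedding?",
    "BGE M3-Embedding supports dense, sparse, and multi-vector retrieval.",
]

# A single encode call can return all three kinds of representation.
output = model.encode(
    sentences,
    return_dense=True,         # one vector per text (dense retrieval)
    return_sparse=True,        # per-token lexical weights (sparse retrieval)
    return_colbert_vecs=True,  # per-token vectors (multi-vector retrieval)
)

print(output["dense_vecs"].shape)       # e.g. (2, 1024)
print(output["lexical_weights"][0])     # token -> weight mapping
print(output["colbert_vecs"][0].shape)  # (num_tokens, dim)
```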
M3-Embedding is trained with a novel self-knowledge distillation approach and an optimized batching strategy that accommodates long input lengths, using large-scale, diverse multilingual datasets drawn from various sources. It supports three common retrieval functionalities: dense retrieval, lexical (sparse) retrieval, and multi-vector retrieval. During training, the relevance scores from these different retrieval functionalities are combined into a single teacher signal, so that each functionality learns from the others and the model performs well across retrieval tasks.
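The following is a conceptual sketch of that self-knowledge distillation idea: scores from the three retrieval heads are integrated into a teacher distribution, and each head is then pushed toward it. The weights, temperature, and exact loss form here are illustrative assumptions, not the paper's specification:

```python
import torch
import torch.nn.functional as F

def self_kd_loss(s_dense, s_lex, s_mul, w=(1.0, 0.3, 1.0), temperature=0.05):
    """Illustrative self-knowledge-distillation loss.

    s_dense, s_lex, s_mul: (batch, num_candidates) relevance scores for each
    query against its candidate passages, one tensor per retrieval head.
    """
    # The weighted sum of the three heads' scores acts as the teacher signal.
    s_inter = w[0] * s_dense + w[1] * s_lex + w[2] * s_mul
    teacher = F.softmax(s_inter / temperature, dim=-1).detach()

    # Distill each head toward the integrated teacher distribution.
    loss = torch.tensor(0.0)
    for s in (s_dense, s_lex, s_mul):
        student_log_probs = F.log_softmax(s / temperature, dim=-1)
        loss = loss + F.kl_div(student_log_probs, teacher, reduction="batchmean")
    return loss / 3
```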
The model's performance was evaluated on multilingual text, varying sequence lengths, and narrative QA, using nDCG@10 (normalized discounted cumulative gain at rank 10) as the evaluation metric. M3-Embedding outperformed existing models in more than ten languages while matching their performance in English. It performed comparably to other models on shorter inputs but delivered clearly better results on longer text sequences.
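For readers unfamiliar with the metric, here is a small example of how nDCG@k is computed. This is the standard definition of the metric, not code from the paper:

```python
import math

def ndcg_at_k(relevances, k=10):
    """nDCG@k for one ranked list of graded relevance scores.

    relevances: relevance grades in the order the system ranked the results.
    """
    def dcg(scores):
        # Each result's gain is discounted by the log of its rank position.
        return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(scores[:k]))

    ideal = dcg(sorted(relevances, reverse=True))  # best possible ordering
    return dcg(relevances) / ideal if ideal > 0 else 0.0

# Example: graded relevance of the top results returned by a retriever.
print(round(ndcg_at_k([3, 2, 0, 1, 0], k=10), 4))
```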
In summary, M3-Embedding represents a major step forward in the evolution of text embedding models. It is a versatile solution that supports multiple languages, different retrieval functionalities, and a range of input granularities, and it outperforms baseline methods such as BM25, mDPR, and E5, demonstrating its strength in addressing the issues identified above.
For further details, refer to the research paper and GitHub repository. All credit for this research goes to the project's researchers.