We are excited to share news of an impressive advance in artificial intelligence research by a team from Microsoft Corporation. They present a novel, simple method for obtaining high-quality text embeddings using only synthetic data, achieving remarkable results on highly competitive text embedding benchmarks without using any labeled data.
Natural Language Processing (NLP) tasks make extensive use of text embeddings: vector representations of natural language that encode its semantic information. Tasks such as information retrieval, question answering, semantic textual similarity, bitext mining, and item recommendation all leverage these embeddings. In information retrieval (IR), text embeddings combined with approximate nearest neighbor search efficiently retrieve a small set of candidate documents from a large corpus in the first retrieval stage.
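To make that retrieval step concrete, here is a minimal sketch of embedding-based first-stage retrieval. The `embed` function is a hypothetical stand-in for any sentence-embedding model, and brute-force scoring is used for clarity; a production system would use an approximate nearest neighbor library such as FAISS to search millions of documents sub-linearly.

```python
import numpy as np

# Hypothetical embedding function: any sentence-embedding model that maps
# a list of strings to an (n, d) array of vectors would work here.
def embed(texts: list[str]) -> np.ndarray:
    rng = np.random.default_rng(0)  # stand-in for a real model
    return rng.normal(size=(len(texts), 384))

corpus = ["Text embeddings encode semantics.",
          "RAG retrieves external knowledge.",
          "Mistral is a decoder-only LLM."]

# Index: L2-normalize so the inner product equals cosine similarity.
doc_vecs = embed(corpus)
doc_vecs /= np.linalg.norm(doc_vecs, axis=1, keepdims=True)

# First-stage retrieval: score the query against every document, keep top-k.
query_vec = embed(["What do text embeddings encode?"])[0]
query_vec /= np.linalg.norm(query_vec)
scores = doc_vecs @ query_vec
top_k = np.argsort(-scores)[:2]
print([(corpus[i], float(scores[i])) for i in top_k])
```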
Retrieval Augmented Generation (RAG), an emerging paradigm that allows Large Language Models (LLMs) to access dynamic external knowledge without changing model parameters, likewise relies heavily on embedding-based retrieval. Text embeddings also play a crucial role in attributing the sources of generated text, improving the interpretability and reliability of LLMs.
Prior research has shown that weighted averages of pre-trained word embeddings provide a reliable baseline for gauging semantic similarity. These techniques, however, fail to fully capture the rich contextual information in natural language. With the introduction of pre-trained language models, methods such as Sentence-BERT and SimCSE emerged.
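For context, here is a minimal sketch of that classic baseline: a frequency-weighted average of word vectors, in the spirit of SIF weighting. The word-vector table and word frequencies below are purely illustrative stand-ins for pre-trained resources such as GloVe or word2vec.

```python
import numpy as np

# Toy word-vector table; in practice these would be pre-trained vectors.
rng = np.random.default_rng(0)
word_vecs = {w: rng.normal(size=50) for w in
             "the cat sat on a mat dog slept rug".split()}

# Hypothetical word frequencies for SIF-style weights a / (a + p(w)).
freq = {w: 0.1 for w in word_vecs}
a = 1e-3

def sentence_embedding(sentence: str) -> np.ndarray:
    # Weighted average: rarer words contribute more to the sentence vector.
    words = [w for w in sentence.lower().split() if w in word_vecs]
    weights = np.array([a / (a + freq[w]) for w in words])
    vecs = np.stack([word_vecs[w] for w in words])
    return (weights[:, None] * vecs).sum(axis=0) / weights.sum()

def cosine(u: np.ndarray, v: np.ndarray) -> float:
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

print(cosine(sentence_embedding("the cat sat on a mat"),
             sentence_embedding("a dog slept on the rug")))
```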
These methods fine-tune models like BERT on Natural Language Inference (NLI) datasets to learn text embeddings. State-of-the-art techniques such as E5 and BGE go further, using multi-stage training paradigms that pre-train on weakly-supervised text pairs and then fine-tune on labeled datasets to improve robustness and performance.
In this research, the team from Microsoft Corporation presents a unique and simple method for producing high-quality text embeddings. The approach achieves remarkable results using only synthetic data and fewer than 1,000 training steps. This contrasts with existing methods that rely on multi-stage pre-training over billions of weakly-supervised text pairs followed by fine-tuning on limited labeled datasets. The key difference is that it does not depend on labor-intensive training pipelines and manually gathered datasets, which frequently suffer from limited task diversity and language coverage.
The method uses proprietary Large Language Models to generate diverse synthetic data for text embedding tasks across nearly 100 languages. Instead of complex multi-stage pre-training, it fine-tunes open-source decoder-only LLMs on the generated synthetic data with a standard contrastive loss.
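As a sketch of what such contrastive fine-tuning optimizes, the following implements a standard InfoNCE loss with in-batch negatives in PyTorch. The batch size, embedding dimension, and temperature are illustrative, and the random tensors stand in for the query and passage embeddings a model would actually produce.

```python
import torch
import torch.nn.functional as F

def info_nce_loss(q: torch.Tensor, p: torch.Tensor,
                  temperature: float = 0.02) -> torch.Tensor:
    """Standard InfoNCE contrastive loss with in-batch negatives.

    q: (B, d) query embeddings; p: (B, d) embeddings of their positive
    passages. Every other passage in the batch serves as a negative.
    """
    q = F.normalize(q, dim=-1)
    p = F.normalize(p, dim=-1)
    logits = q @ p.T / temperature          # (B, B) cosine-similarity matrix
    labels = torch.arange(q.size(0))        # positives lie on the diagonal
    return F.cross_entropy(logits, labels)

# Toy usage with random vectors standing in for model outputs.
q = torch.randn(8, 768)
p = torch.randn(8, 768)
print(info_nce_loss(q, p).item())
```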
The team conducted extensive experiments to validate this approach. Without using any labeled data, the model delivers strong results on highly competitive text embedding benchmarks. When further fine-tuned on a mixture of synthetic and labeled data, it sets new state-of-the-art records on the BEIR and MTEB benchmarks, all without requiring large manually labeled datasets.
Proprietary LLMs like GPT-4 were used to produce a diverse range of synthetic data spanning multilingual instructions. By leveraging the strong language understanding capabilities of the Mistral model, the method achieves remarkable performance across nearly all task categories on the competitive MTEB benchmark.
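To illustrate how a decoder-only model like Mistral can serve as an embedder, here is a minimal sketch of last-token pooling with an instruction-prefixed query. The checkpoint name and instruction template are illustrative assumptions (and loading a 7B model requires substantial memory); the core idea is to append an end-of-sequence token and take its final hidden state as the embedding.

```python
import torch
from transformers import AutoModel, AutoTokenizer

# Assumption: any decoder-only checkpoint with the same interface works;
# the paper's recipe fine-tunes Mistral-7B.
model_name = "mistralai/Mistral-7B-v0.1"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name)

# Task instruction prepended to the query (template is illustrative).
text = ("Instruct: Given a web search query, retrieve relevant passages "
        "that answer the query\nQuery: how do text embeddings work")

batch = tokenizer(text + tokenizer.eos_token, return_tensors="pt")
with torch.no_grad():
    hidden = model(**batch).last_hidden_state   # (1, seq_len, d)

# Last-token pooling: the final (EOS) position summarizes the sequence.
embedding = hidden[0, -1]
embedding = embedding / embedding.norm()
print(embedding.shape)
```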
We are thrilled to see this research, which shows that LLMs can significantly improve the quality of text embeddings. The training procedure eliminates the need for intermediate pre-training, making it more streamlined and efficient than existing multi-stage pipelines. We invite you to join us in congratulating the researchers on this remarkable project.