Researchers from Mila, McGill University, ServiceNow Research, and the Facebook CIFAR AI Chair have developed LLM2Vec, a method for transforming pre-trained decoder-only Large Language Models (LLMs) into text encoders. Modern NLP tasks depend heavily on text embedding models, which translate the semantic meaning of text into vector representations. Historically, pre-trained bidirectional encoders and encoder-decoders such as BERT and T5 have been favored for this purpose. A recent trend, however, is the use of decoder-only LLMs: they learn from all input tokens during pre-training, benefit from a wealth of tooling and pre-training recipes, and are highly proficient at following instructions.
However, decoder-only LLMs have been slow to gain traction on text embedding tasks, primarily because of their causal attention mechanism: each token's representation is determined only by the tokens that precede it, which limits the model's ability to capture information from the full input sequence. LLM2Vec was developed to overcome this limitation, as the sketch below makes concrete.
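To illustrate the difference (this is an illustrative sketch, not the authors' code), the snippet below contrasts a causal attention mask with the bidirectional mask that LLM2Vec switches to; the sequence length and random scores are placeholders.

```python
# Illustrative sketch: causal vs. bidirectional attention masks.
import torch

seq_len = 5

# Causal mask: position i may only attend to positions <= i.
causal_mask = torch.tril(torch.ones(seq_len, seq_len, dtype=torch.bool))

# Bidirectional mask: every position may attend to every other position.
bidirectional_mask = torch.ones(seq_len, seq_len, dtype=torch.bool)

scores = torch.randn(seq_len, seq_len)  # stand-in attention scores

# Masked-out (future) positions get -inf before the softmax, so under causal
# attention token i's representation ignores tokens i+1, ..., n.
causal_weights = torch.softmax(scores.masked_fill(~causal_mask, float("-inf")), dim=-1)
bidir_weights = torch.softmax(scores.masked_fill(~bidirectional_mask, float("-inf")), dim=-1)

print(causal_weights[0])  # the first token attends only to itself
print(bidir_weights[0])   # the first token attends to the whole sequence
```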
LLM2Vec proceeds in three steps. First, it enables bidirectional attention, so the model builds each token's representation from the tokens both before and after it. Second, it applies masked next token prediction, in which tokens in the input sequence are masked and the model is trained to predict them, helping it comprehend and encode contextual information. Lastly, it applies unsupervised contrastive learning, contrasting similar and dissimilar instances in the embedding space to produce robust sequence representations; a sketch of this objective follows below.
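The following is a minimal sketch of a SimCSE-style unsupervised contrastive objective of the kind described, assuming the positive pair for each sentence comes from encoding it twice with different dropout masks; the function name, batch size, and temperature are illustrative assumptions, not the authors' exact implementation.

```python
# Sketch of an unsupervised (SimCSE-style) contrastive loss with in-batch negatives.
import torch
import torch.nn.functional as F

def unsup_contrastive_loss(z1: torch.Tensor, z2: torch.Tensor, temperature: float = 0.05):
    """z1, z2: (batch, dim) embeddings of the same sentences from two dropout passes."""
    z1 = F.normalize(z1, dim=-1)
    z2 = F.normalize(z2, dim=-1)
    # Cosine similarity of every sentence in the first pass with every
    # sentence in the second pass; the diagonal holds the positive pairs.
    sim = z1 @ z2.T / temperature
    labels = torch.arange(z1.size(0), device=z1.device)
    # Cross-entropy pulls each positive pair together and pushes apart
    # the in-batch negatives (the other sentences in the batch).
    return F.cross_entropy(sim, labels)

# Usage with random stand-in embeddings:
batch, dim = 8, 768
z1, z2 = torch.randn(batch, dim), torch.randn(batch, dim)
print(unsup_contrastive_loss(z1, z2))
```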
The efficacy of LLM2Vec was tested on three well-known LLMs with parameter counts ranging from 1.3 billion to 7 billion. According to the results, LLM2Vec recorded considerable performance gains over conventional encoder-only models, especially on word-level tasks, and set a new state of the art among unsupervised models on the Massive Text Embedding Benchmark (MTEB). By further combining LLM2Vec with supervised contrastive learning, the team also achieved state-of-the-art results on MTEB among models trained only on publicly available data. These accomplishments show that Large Language Models can serve as universal text encoders in a parameter-efficient way, without costly adaptation or the creation of synthetic data.
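To show what "decoder-only LLM as text encoder" means in practice, here is a hedged sketch of pooling a decoder's hidden states into sentence embeddings using the Hugging Face transformers library; "gpt2" is only a tiny stand-in (the paper adapts 1.3B-7B models), and this omits the LLM2Vec training steps themselves (bidirectional attention, masked next token prediction, contrastive learning).

```python
# Illustrative sketch: using a small decoder-only model as a text encoder
# via mean pooling of its hidden states. Not the authors' released recipe.
import torch
import torch.nn.functional as F
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
tokenizer.pad_token = tokenizer.eos_token  # GPT-2 has no pad token by default
model = AutoModel.from_pretrained("gpt2")

sentences = ["LLM2Vec turns decoders into encoders.",
             "Text embeddings map meaning to vectors."]
batch = tokenizer(sentences, padding=True, return_tensors="pt")

with torch.no_grad():
    hidden = model(**batch).last_hidden_state  # (batch, seq_len, dim)

# Mean-pool over non-padding tokens to get one vector per sentence.
mask = batch["attention_mask"].unsqueeze(-1)   # (batch, seq_len, 1)
embeddings = (hidden * mask).sum(1) / mask.sum(1)

# Cosine similarity between the two sentence embeddings.
print(F.cosine_similarity(embeddings[0], embeddings[1], dim=0))
```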