Researchers from Tsinghua University have developed an approach to improve the performance of smaller language models such as MiniCPM, Phi-2, and Gemma by enhancing their text embeddings. By applying contrastive fine-tuning on a natural language inference (NLI) dataset, the researchers substantially improved text-embedding quality across various benchmarks. MiniCPM in particular showed a 56.33% performance gain, highlighting the potential of this small-scale model for resource-limited applications.
Text embeddings are low-dimensional vector representations of text that capture semantic meaning. They are crucial for tasks like information retrieval, classification, and similarity matching. In contrast to large language models (LLMs) such as GPT-4, LLaMA, and Mistral, smaller models like MiniCPM offer a more efficient alternative with far lower resource demands. However, these models often require targeted optimization to match the performance of their larger counterparts.
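The idea can be illustrated in a few lines of Python. The snippet below is a minimal sketch using the sentence-transformers library with a generic off-the-shelf encoder (not one of the models from the study): semantically related sentences map to vectors with a higher cosine similarity than unrelated ones.

```python
# Minimal illustration of text embeddings with an off-the-shelf encoder.
# The model name here is a generic example, not one of the models studied.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")
sentences = [
    "A cat sits on the mat.",
    "A kitten is resting on a rug.",
    "The stock market fell sharply today.",
]
embeddings = model.encode(sentences)  # one dense vector per sentence

print(util.cos_sim(embeddings[0], embeddings[1]))  # high similarity (related meaning)
print(util.cos_sim(embeddings[0], embeddings[2]))  # low similarity (unrelated topic)
```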
In their study, the researchers used contrastive fine-tuning to enhance text embeddings, encouraging the model to differentiate between similar and dissimilar text pairs and thus produce more accurate, contextually relevant embeddings. They employed low-rank adaptation (LoRA) during fine-tuning to keep the computational cost manageable. The experiments used a processed NLI dataset of 275k samples and evaluated embedding quality by scoring the similarity of sentence pairs across nine benchmarks.
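A rough sketch of how such a setup can be wired together is shown below, using the Hugging Face transformers and peft libraries. The model name, pooling strategy, temperature, and LoRA hyperparameters are illustrative assumptions rather than the exact configuration reported in the paper; the loss treats each entailment sentence as the positive and adds the contradiction sentence as an explicit hard negative alongside in-batch negatives.

```python
# Sketch of contrastive fine-tuning with LoRA on NLI-style triplets.
# Model name, pooling, temperature, and LoRA settings are assumptions
# for illustration, not the exact configuration from the study.
import torch
import torch.nn.functional as F
from transformers import AutoModel, AutoTokenizer
from peft import LoraConfig, get_peft_model

model_name = "openbmb/MiniCPM-2B-sft-bf16"  # placeholder backbone
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token
model = AutoModel.from_pretrained(model_name, trust_remote_code=True)

# Wrap the backbone with low-rank adapters so only a small fraction of
# parameters is updated during fine-tuning (adjust target_modules to the model).
lora_cfg = LoraConfig(r=16, lora_alpha=32, lora_dropout=0.05,
                      target_modules=["q_proj", "v_proj"])
model = get_peft_model(model, lora_cfg)

def embed(texts):
    """Mean-pool the last hidden states into one vector per input text."""
    batch = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
    hidden = model(**batch).last_hidden_state           # (B, T, H)
    mask = batch["attention_mask"].unsqueeze(-1)         # (B, T, 1)
    return (hidden * mask).sum(1) / mask.sum(1)          # (B, H)

def contrastive_loss(anchors, positives, hard_negatives, temperature=0.05):
    """InfoNCE-style loss: each anchor should match its own positive,
    with in-batch positives and one explicit hard negative as distractors."""
    a = F.normalize(embed(anchors), dim=-1)
    p = F.normalize(embed(positives), dim=-1)
    n = F.normalize(embed(hard_negatives), dim=-1)
    # Columns 0..B-1 are similarities to all in-batch positives,
    # the last column is the similarity to the anchor's own hard negative.
    logits = torch.cat([a @ p.T, (a * n).sum(-1, keepdim=True)], dim=1) / temperature
    labels = torch.arange(len(anchors))  # the diagonal entry is the true positive
    return F.cross_entropy(logits, labels)
```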
The results revealed that fine-tuning with LoRA significantly enhanced the performance of the models. Notably, MiniCPM achieved the highest Spearman correlations across all datasets, outperforming models like Gemma and Phi-2. Furthermore, an ablation study showed that MiniCPM benefited greatly from contrastive fine-tuning and from the incorporation of hard-negative penalties.
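The evaluation behind such numbers is STS-style: cosine similarities between sentence-pair embeddings are compared with human-annotated similarity scores via Spearman correlation. The helper below sketches that computation; the embedding function and data it takes are placeholders, not the study's actual evaluation code.

```python
# Sketch of an STS-style evaluation: Spearman correlation between the cosine
# similarities of sentence-pair embeddings and human-annotated gold scores.
import numpy as np
from scipy.stats import spearmanr

def evaluate_sts(embed_fn, pairs, gold_scores):
    """embed_fn maps a list of texts to an (N, H) array of embeddings."""
    left = np.asarray(embed_fn([a for a, _ in pairs]))
    right = np.asarray(embed_fn([b for _, b in pairs]))
    cos = np.sum(left * right, axis=1) / (
        np.linalg.norm(left, axis=1) * np.linalg.norm(right, axis=1))
    return spearmanr(cos, gold_scores).correlation
```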
This research is a step towards making smaller-scale models more effective for resource-limited applications. By enhancing the text embeddings of such models, the researchers showed how they can attain high performance on natural language understanding tasks, offering a scalable, resource-efficient alternative to larger models.
In related news, Arcee AI has released DistillKit, an open-source tool designed to simplify the process of model distillation, aiding in the creation of efficient and high-performing small language models.