
PATH: A Machine Learning Technique for Training Small-Scale (Sub-100M Parameter) Neural Information Retrieval Models with as Few as 10 Gold Relevance Labels

Pretrained language models and their creative applications have driven significant improvements in the quality of information retrieval (IR). However, questions remain about whether training these models truly requires large datasets, especially for languages with scant labeled IR data or for niche domains.

Researchers from the University of Waterloo, Stanford University, and IBM Research AI have introduced a new technique for training neural information retrieval models with minimal data. The method, named PATH (Prompts as Auto-optimized Training Hyperparameters), allows models with under 100 million parameters to be trained using as few as ten gold relevance labels.

The PATH approach builds on the idea of generating synthetic search queries for documents using a language model (LM). Its distinctive feature is that the prompt used to generate these synthetic queries is itself automatically optimized, which improves the quality of the resulting training data. The process begins with a text corpus and a small number of gold relevance labels. An LM then generates candidate search queries relevant to documents in the corpus, and pairing these queries with their passages yields the training data. Optimizing the LM prompt that guides query generation is the crucial step for raising the quality of the synthetic data.
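The loop described above can be sketched in a few lines of Python. This is a minimal, illustrative sketch only: the function names (`lm_generate`, `retrieval_quality`), the word-overlap retriever, and the tiny hand-made gold set are hypothetical stand-ins, not the paper's actual implementation, which uses a real LLM to generate queries and trains a neural retriever on the resulting pairs.

```python
GOLD_LABELS = [  # a handful of (query, relevant_passage) gold pairs
    ("what is photosynthesis", "Photosynthesis converts light into chemical energy."),
    ("capital of france", "Paris is the capital and largest city of France."),
]

CORPUS = [
    "Photosynthesis converts light into chemical energy.",
    "Paris is the capital and largest city of France.",
    "Neural retrieval models map text to dense vectors.",
]

def lm_generate(prompt: str, passage: str) -> str:
    """Placeholder for a real LM call that writes a search query for `passage`.

    Here it just echoes the passage's first few keywords so the sketch runs.
    """
    words = [w.strip(".,").lower() for w in passage.split()]
    k = 3 if "keyword" in prompt else 5
    return " ".join(words[:k])

def retrieval_quality(prompt: str) -> float:
    """Score a candidate prompt: do the queries it produces for the gold
    passages actually retrieve those passages? (Naive word-overlap ranking.)"""
    hits = 0
    for _, gold_passage in GOLD_LABELS:
        query = lm_generate(prompt, gold_passage)
        def overlap(p):
            return len(set(query.split()) &
                       {w.strip(".,").lower() for w in p.split()})
        best = max(CORPUS, key=overlap)
        hits += best == gold_passage
    return hits / len(GOLD_LABELS)

# Auto-optimize the prompt: among candidates, keep the one whose synthetic
# queries best recover the small gold set.
candidate_prompts = [
    "Write a keyword query for this passage.",
    "Write a natural-language question answered by this passage.",
]
best_prompt = max(candidate_prompts, key=retrieval_quality)

# Use the winning prompt to build synthetic (query, passage) training pairs
# for every passage in the corpus; these would then train the small retriever.
train_pairs = [(lm_generate(best_prompt, p), p) for p in CORPUS]
```

The key design point is that the gold labels are spent on selecting the prompt, not on training the retriever directly, so ten labels can steer the generation of a much larger synthetic training set.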

The researchers evaluated the PATH method on the BIRCO benchmark, which contains challenging and uncommon IR tasks. They found that small-scale models trained with minimal labeled data and optimized prompts outperformed larger models trained on extensive datasets. Notably, they surpassed RankZephyr and were competitive with RankLLaMA, 7-billion-parameter models trained on datasets with over 100,000 labels.

This research signals a clear shift in how data is prepared for information retrieval. It shows that small-scale models, given the right data-generation pipeline, can hold their own against far larger models, challenging the notion that retrieval quality depends on massive volumes of labeled data. The study demonstrates the potency of automatic prompt optimization for generating high-quality synthetic datasets, proving that highly effective IR models can be trained with far fewer resources.
