Large Language Models (LLMs) used in Information Retrieval (IR) applications, such as web search or question-answering systems, currently depend on human-crafted prompts for zero-shot relevance ranking, i.e., ordering items by how closely they match the user's query. Manually creating these prompts is time-consuming and subjective. Moreover, manual prompts struggle with the complexity of relevance ranking tasks, such as pairing queries with passages or assessing comprehensive relevance, and consequently yield sub-optimal performance.
Existing solutions, whether manual prompt engineering or automatic prompt engineering techniques, are effective to some extent but fall short on efficiency and scalability. Manual methods do not scale and vary in effectiveness with human expertise, while existing automatic methods are better suited to simpler tasks such as language modeling and classification and fail to handle the complexities of relevance ranking.
To mitigate these limitations, researchers from Rutgers University and the University of Connecticut have developed a method known as APEER (Automatic Prompt Engineering Enhances LLM Re-ranking). APEER minimizes the need for human involvement in prompt creation, making prompt generation automatic and more robust. This is achieved through iterative feedback and preference optimization, which systematically refine and improve the prompts over time.
APEER works by generating an initial prompt and then enhancing it through two key optimization steps. The first, feedback optimization, uses performance feedback to generate a refined version of the current prompt. The second, preference optimization, further improves the prompt by learning from sets of positive and negative prompt examples. APEER has been tested across multiple datasets, including MS MARCO, TREC-DL, and BEIR, demonstrating its effectiveness across diverse IR tasks and LLM architectures.
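To make the two steps concrete, here is a minimal sketch of how such a refine-then-compare loop could be organized in Python. It is an illustration of the general idea under stated assumptions, not the authors' implementation: the `llm` and `evaluate` callables are hypothetical placeholders (an LLM completion function and an nDCG@10 scorer on a development split).

```python
# Hypothetical sketch of an APEER-style refinement loop (not the authors' code).
# Assumes two placeholder callables supplied by the caller:
#   llm(prompt_text) -> text completion
#   evaluate(prompt, dev_set) -> nDCG@10 of the re-ranking prompt on a dev split

def feedback_step(prompt, dev_set, llm, evaluate):
    """Ask the LLM to critique and rewrite the current prompt using its score as feedback."""
    score = evaluate(prompt, dev_set)
    revised = llm(
        f"The following re-ranking prompt scored {score:.4f} nDCG@10.\n"
        f"Prompt:\n{prompt}\n"
        "Explain its weaknesses and output a revised prompt."
    )
    return revised

def preference_step(candidate, positives, negatives, llm):
    """Refine the candidate by contrasting prompts that worked well with ones that did not."""
    return llm(
        "Rewrite the candidate re-ranking prompt so it follows the style of the "
        "positive examples and avoids the flaws of the negative examples.\n"
        f"Positive examples:\n{positives}\n"
        f"Negative examples:\n{negatives}\n"
        f"Candidate:\n{candidate}"
    )

def apeer_like_loop(seed_prompt, dev_set, llm, evaluate, iterations=5):
    """Iteratively apply feedback and preference optimization, keeping the best prompt."""
    best_prompt = seed_prompt
    best_score = evaluate(seed_prompt, dev_set)
    positives, negatives = [best_prompt], []
    for _ in range(iterations):
        refined = feedback_step(best_prompt, dev_set, llm, evaluate)
        refined = preference_step(refined, positives, negatives, llm)
        score = evaluate(refined, dev_set)
        if score > best_score:      # improvements become positive examples
            positives.append(refined)
            best_prompt, best_score = refined, score
        else:                       # regressions become negative examples
            negatives.append(refined)
    return best_prompt
```

The key design point this sketch tries to capture is that the prompt itself, rather than the model weights, is the object being optimized, with retrieval performance on a held-out set acting as the feedback signal.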
The implementation of APEER has demonstrated significant advantages on relevance ranking tasks. Key performance metrics, including nDCG@1, nDCG@5, and nDCG@10, show considerable improvements over currently used manual prompts. For example, across eight BEIR datasets, APEER achieved an average improvement of 5.29 points in nDCG@10 with the LLaMA3 model compared to manual prompts. Furthermore, the prompts produced by APEER transfer better across diverse tasks and LLM architectures and consistently outperform baseline methods across various datasets and models, including GPT-4, LLaMA3, and Qwen2.
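For reference, nDCG@k (normalized Discounted Cumulative Gain at cutoff k), the metric cited above, is defined in one common formulation as

```latex
\mathrm{DCG@}k = \sum_{i=1}^{k} \frac{2^{rel_i} - 1}{\log_2(i + 1)},
\qquad
\mathrm{nDCG@}k = \frac{\mathrm{DCG@}k}{\mathrm{IDCG@}k}
```

where $rel_i$ is the graded relevance label of the item ranked at position $i$ and IDCG@k is the DCG@k of the ideal (perfectly ordered) ranking, so an improvement of 5.29 points corresponds to the ranked list placing relevant passages noticeably closer to the top.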
In summary, APEER is a novel method that automates prompt engineering for LLMs in Information Retrieval, addressing the challenge of reliance on human-created prompts. Through iterative feedback and preference optimization processes, APEER reduces human effort and significantly improves the performance of LLMs across various datasets and models. This innovation represents a major advancement in the field, providing a scalable and efficient solution for optimizing LLM prompts in complex relevance ranking scenarios.