Large Language Models (LLMs) are gaining popularity in the AI community due to their impressive capabilities such as text summarization, question answering and content generation. However, training LLMs involves significant computational cost and time, and typically relies on web-scraped data that is unstructured, noisy and often poorly phrased. Additionally, the scarcity of high-quality data on the internet presents a substantial challenge.
To tackle these issues, researchers from Apple and Carnegie Mellon University have developed Web Rephrase Augmented Pre-training (WRAP). WRAP uses existing LLMs to paraphrase web content into specific styles, and then pre-trains the model on a combination of the original and the rephrased data.
The key benefits of WRAP are its efficiency and performance enhancement. When it’s applied to the naturally noisy C4 dataset, WRAP speeds up pre-training roughly threefold, thus reducing the costs and time associated with LLM training. It also boosts performance when operating within the same computational budget: averaged across different subsets of the large-scale Pile dataset, perplexity improves by over 10%, and zero-shot question-answer accuracy improves by over 2% across 13 different tasks.
WRAP uses a medium-sized LLM to rephrase web documents into various styles. This approach differs from generating entirely new data: it rewrites existing content, so the information and diversity of the web are retained while the quality of the text improves. The synthetic data that WRAP creates has two main advantages. It encompasses a variety of styles, preparing LLMs for a broader range of real-world situations, and the rephrased synthetic data is of higher quality than raw web-scraped data, facilitating more effective model learning.
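The rephrase-and-mix idea can be illustrated with a minimal Python sketch. It is not the authors' implementation: the style instructions in STYLE_PROMPTS, the function names, the one-rephrase-per-document ratio and the stand-in echo_rephraser are all assumptions made for illustration. In practice the rephrase callable would wrap a call to an actual instruction-following LLM.

```python
import random
from typing import Callable, List, Sequence

# Hypothetical style instructions; these are NOT the prompts used by the WRAP authors.
STYLE_PROMPTS = {
    "wikipedia": "Rewrite the following text in clear, encyclopedic prose:",
    "qa": "Convert the following text into a question-and-answer format:",
}


def build_prompt(document: str, style: str) -> str:
    """Combine a style instruction with the document to be rephrased."""
    return f"{STYLE_PROMPTS[style]}\n\n{document}"


def wrap_corpus(
    documents: Sequence[str],
    rephrase: Callable[[str], str],
    styles: Sequence[str],
    seed: int = 0,
) -> List[str]:
    """Return a shuffled mix of each original document and one rephrased copy.

    The 1:1 ratio of original to rephrased text is an assumption of this sketch.
    """
    rng = random.Random(seed)
    mixed: List[str] = []
    for doc in documents:
        mixed.append(doc)  # keep the original web text
        style = rng.choice(list(styles))  # pick one style per document
        mixed.append(rephrase(build_prompt(doc, style)))  # add the synthetic copy
    rng.shuffle(mixed)
    return mixed


def echo_rephraser(prompt: str) -> str:
    """Stand-in for an LLM call: returns the document part of the prompt, tagged."""
    return "[rephrased] " + prompt.split("\n\n", 1)[1]


if __name__ == "__main__":
    corpus = ["cheap flights!!! click here 2 save", "the cat sat on teh mat"]
    for text in wrap_corpus(corpus, echo_rephraser, ["wikipedia", "qa"]):
        print(text)
```

Keeping the original documents in the mix, rather than training on rephrased text alone, mirrors the article's point that WRAP combines both kinds of data.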
In conclusion, WRAP is a significant advancement in LLM pre-training. By utilizing higher-quality, varied-style synthetic data, it not only speeds up training but also enhances overall LLM performance. Given the abundance of low-quality web data and the resource-intensive nature of traditional LLM training approaches, WRAP presents a promising solution.