Data curation, particularly efficient, high-quality curation, is crucial to the performance of large-scale pretraining in vision, language, and multimodal learning. Current approaches often depend on manual curation, which is expensive and difficult to scale. A promising answer to this scalability problem is model-based data curation, which uses features of the model being trained to select high-quality data.
It is well established that the composition of a batch affects how much a model learns from it: learning is amplified when examples are selected jointly rather than independently. Building on this insight, researchers at Google DeepMind have developed JEST (Joint Example Selection), an algorithm that selects the most learnable sub-batches from larger super-batches, accelerating learning and reducing computational overhead while maintaining strong performance.
JEST's core task is to identify the most useful sub-batches within a larger super-batch using model-based scoring functions. It scores candidates by comparing their loss under the learner (the model being trained) with their loss under a pretrained reference model: examples that the learner still finds hard but the reference model does not are considered the most learnable. This strikes a balance between discarding trivial data and over-relying on the reference model when identifying high-quality examples.
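To make the idea concrete, here is a minimal, illustrative sketch of learnability scoring and joint sub-batch selection. The function name, the chunked sampling loop, and the assumption that pairwise contrastive losses for the learner and reference model are available as precomputed (B, B) matrices are simplifications for illustration only, not the authors' implementation.

```python
import numpy as np

def joint_example_selection(learner_pair, ref_pair, n_select, n_chunks=16, seed=0):
    """Illustrative sketch: pick a sub-batch with high joint learnability.

    learner_pair, ref_pair: (B, B) matrices of pairwise contrastive losses for
    the learner and the pretrained reference model over a super-batch of size B
    (assumed precomputed). The learnability of a sub-batch S is the sum of
    learner_pair[i, j] - ref_pair[i, j] over i, j in S: it is high when the
    learner still finds the data hard but the reference model does not.
    """
    rng = np.random.default_rng(seed)
    learnability = learner_pair - ref_pair          # per-pair learnability scores
    B = learnability.shape[0]
    chunk = max(1, n_select // n_chunks)
    selected = []
    while len(selected) < n_select:
        remaining = np.setdiff1d(np.arange(B), selected)
        # Score each candidate by its own learnability plus its interactions with
        # the examples already chosen -- this is what makes the selection *joint*
        # rather than independent per-example filtering.
        own = learnability[remaining, remaining]
        cross = (learnability[np.ix_(remaining, selected)].sum(axis=1)
                 + learnability[np.ix_(selected, remaining)].sum(axis=0)) if selected else 0.0
        scores = own + cross
        probs = np.exp(scores - scores.max())       # softmax sampling over candidates
        probs /= probs.sum()
        pick = rng.choice(remaining, size=min(chunk, n_select - len(selected)),
                          replace=False, p=probs)
        selected.extend(int(i) for i in pick)
    return np.array(selected)
```

As a usage example, with a super-batch of, say, B examples and an 80% filtering ratio, one would call `joint_example_selection(learner_pair, ref_pair, n_select=B // 5)` to keep the most learnable fifth of the data.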
Evaluations demonstrate JEST's effectiveness: it surpasses independent example selection and performs comparably to brute-force methods. It both accelerates training and improves final performance in multimodal learning, with the gains growing as the filtering ratio, the fraction of the super-batch that is discarded, increases.
Moreover, JEST's performance improves further as the quality of the curated reference data increases, and it outperforms prior models on several benchmarks, demonstrating both its effectiveness and its efficiency. A variant, Flexi-JEST, uses multi-resolution training to reduce the computational overhead of the scoring pass while maintaining the training speed-up, making it an even more cost-efficient option.
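The sketch below illustrates the general multi-resolution pattern: score the super-batch cheaply at low image resolution, then run the expensive training step at full resolution only on the selected sub-batch. The function and parameter names (`score_losses`, `train_step`, `score_res`, `train_res`) and the specific resolutions are hypothetical, chosen for illustration rather than taken from the authors' code.

```python
def flexi_jest_step(super_batch, score_losses, train_step, select_fn,
                    n_select, score_res=64, train_res=256):
    """Illustrative multi-resolution step (all names and resolutions hypothetical).

    super_batch : array of raw examples, indexable as super_batch[idx].
    score_losses(batch, res) -> (learner_pair, ref_pair): pairwise contrastive
        losses computed with images processed at resolution `res`.
    train_step(batch, res) -> loss: one optimisation step on the learner.
    select_fn : e.g. joint_example_selection from the sketch above.
    """
    # 1. Cheap scoring pass: low-resolution images keep the overhead of scoring
    #    the much larger super-batch small relative to the training step.
    learner_pair, ref_pair = score_losses(super_batch, score_res)
    idx = select_fn(learner_pair, ref_pair, n_select)

    # 2. Full-resolution training pass on the selected sub-batch only.
    return train_step(super_batch[idx], train_res)
```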
The study also emphasizes the potential of ‘data quality bootstrapping,’ in which a small, well-curated dataset guides learning over a much larger, uncurated one. This suggests that foundation distributions could play a more prominent role in the future of data curation, replacing generic foundation datasets. Such distributions, amplified by JEST, could either be fixed in advance or adjusted dynamically according to learnability. The method's reliance on a curated reference dataset does, however, point to an avenue for future research: inferring the reference dataset directly from downstream tasks.
The research indicates that JEST's approach can significantly accelerate large-scale multimodal learning, with greater power efficiency and fewer training examples. As such, it could reshape how data curation and model training are approached in the future.