Researchers from Meta’s FAIR, INRIA, Université Paris-Saclay, and Google are working on ways to automatically curate high-quality datasets to improve self-supervised learning (SSL). SSL enables models to be trained without human annotations, allowing both data and model size to scale, but its success often hinges on careful data curation. The team proposes a clustering-based technique that applies hierarchical k-means to large data repositories and then draws balanced samples from the resulting clusters.
Self-supervised learning plays a central role in modern machine learning. In natural language processing (NLP), language modeling has progressed from basic neural architectures to complex large-scale models, a progression that depends on high-quality training data. The authors therefore propose automatic curation methods, based on hierarchical k-means clustering, to balance large unlabeled datasets and thereby improve the performance of SSL models.
For effective training, pre-training datasets need to be large, varied, and balanced. Balanced datasets ensure that each concept is roughly equally represented, avoiding biases toward dominant concepts. Hierarchical k-means with resampling pushes the cluster centroids toward a more uniform distribution over the data, maintaining balance across the dataset and promoting better model performance.
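The sketch below illustrates that idea in Python with scikit-learn. It is not the authors' implementation: the `embeddings` array, the cluster counts per level, and the per-cluster sampling budget are illustrative assumptions, and this simplified hierarchy omits the per-level resampling the paper uses to flatten the centroid distribution, applying balanced sampling only once at the end.

```python
# Minimal sketch: hierarchical k-means followed by balanced sampling.
# Assumes `embeddings` is an (N, d) array of pre-computed SSL features;
# cluster counts and sampling budgets are illustrative, not the paper's settings.
import numpy as np
from sklearn.cluster import KMeans

def hierarchical_kmeans(embeddings, levels=(1000, 100)):
    """Cluster the points, then cluster the resulting centroids, level by level.
    Returns each point's cluster id at the coarsest (last) level."""
    point_labels = None
    data = embeddings
    for k in levels:
        km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(data)
        if point_labels is None:
            point_labels = km.labels_                # first level labels the raw points
        else:
            point_labels = km.labels_[point_labels]  # propagate labels up the hierarchy
        data = km.cluster_centers_                   # the next level clusters these centroids
    return point_labels

def balanced_sample(labels, per_cluster=100, seed=0):
    """Draw up to the same number of points from every top-level cluster,
    so frequent concepts no longer dominate the curated dataset."""
    rng = np.random.default_rng(seed)
    picked = []
    for c in np.unique(labels):
        idx = np.flatnonzero(labels == c)
        n = min(per_cluster, len(idx))
        picked.append(rng.choice(idx, size=n, replace=False))
    return np.concatenate(picked)

# Usage (hypothetical): indices of a curated, concept-balanced subset.
# curated_idx = balanced_sample(hierarchical_kmeans(embeddings))
```

Sampling a fixed budget per top-level cluster is what enforces balance: clusters corresponding to rare concepts contribute as many points as clusters corresponding to very common ones, up to their available size.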
The team conducted four experiments to study the proposed algorithm. First, they tested it on simulated data, where it produced a more uniform cluster distribution than competing methods. Second, they curated web-based image data into a dataset of 743 million images; a ViT-L model trained on it showed improved performance across various benchmarks. Third, the algorithm was used to curate text data for training large language models, yielding significant gains on benchmarks. Finally, they curated satellite imagery for tree canopy height prediction, which improved the model’s performance on all evaluated datasets.
The research team introduced an automatic data curation pipeline that builds large, varied, and balanced training datasets for self-supervised feature learning. Using hierarchical k-means clustering and resampling, the method yields a near-uniform distribution of clusters. The pipeline improves learning across web-based images, text data, and satellite imagery, and models trained on the curated datasets were more robust than those trained on raw data or on ImageNet1k. Future work includes addressing dataset quality, dependence on pre-trained features, and scalability.
Automatic dataset creation carries risks such as reinforcing biases and breaching privacy. In this work, the researchers mitigated those risks by blurring human faces and balancing concept representation in the dataset. The approach underlines the importance and potential of data curation in SSL and positions hierarchical k-means as an alternative to manual curation for a range of data-dependent tasks.