Scientists from The Hong Kong University of Science and Technology and the University of Illinois Urbana-Champaign have presented ScaleBiO, a novel bilevel optimization (BO) method that scales to 34B large language models (LLMs) on data reweighting tasks. The method relies on a memory-efficient training technique called LISA and runs on eight A40 GPUs.
BO is attracting attention for its usefulness in machine learning applications such as hyperparameter optimization, meta-learning, and reinforcement learning. However, it remains underexplored for large-scale problems, mainly because the mutual interdependence between the upper- and lower-level problems makes it computationally challenging.
In BO, the solution to the outer problem depends on the solution to the inner one. Classical approaches to handling this dependence fall into two categories, approximate implicit differentiation (AID) and iterative differentiation (ITD), and both are computationally expensive for large-scale problems.
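For orientation, a generic bilevel problem can be written in the standard form (the notation here is illustrative, not taken from the paper):

\[
\min_{x}\; F\!\big(x, y^*(x)\big)
\quad\text{s.t.}\quad
y^*(x) \in \arg\min_{y}\; G(x, y),
\]

where \(F\) is the upper-level (outer) objective and \(G\) the lower-level (inner) objective. The difficulty is that the gradient of \(F\) with respect to \(x\) requires differentiating through \(y^*(x)\): AID approximates this via the implicit function theorem, while ITD unrolls the inner optimization steps, and both involve second-order information that becomes prohibitive at LLM scale.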
The scientists’ work also delves into data reweighting, where the proportion of training data drawn from each source has a sizable impact on LLM performance. After reviewing various methods for reweighting training data, the authors conclude that no existing method guarantees optimal data weights and that no notable experiments have been reported on models larger than 30 billion parameters.
ScaleBiO, however, is the first BO method to scale successfully to LLMs of such large sizes and shows promise in real-world applications. It effectively optimizes learned data weights while offering a convergence guarantee comparable to that of first-order BO methods for smooth, strongly convex objectives.
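To make the data reweighting setting concrete, here is a sketch of the standard bilevel formulation of the task (again, illustrative notation rather than the paper's own): the inner level trains model parameters \(\theta\) on a source-weighted training loss, and the outer level adjusts the per-source weights \(w\) to minimize a validation (reference) loss:

\[
\min_{w \in \Delta}\; \mathcal{L}_{\mathrm{val}}\!\big(\theta^*(w)\big)
\quad\text{s.t.}\quad
\theta^*(w) \in \arg\min_{\theta}\; \sum_{i=1}^{k} w_i\, \mathcal{L}_i(\theta),
\]

where \(\Delta\) is the probability simplex over the \(k\) data sources and \(\mathcal{L}_i\) is the training loss on source \(i\). The learned weights \(w\) then serve as the sampling distribution used in the experiments described below.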
Experiments in data reweighting show that ScaleBiO works efficiently across model sizes and can filter out irrelevant data to select only meaningful samples. On small-scale language models such as GPT-2, experiments demonstrated its effectiveness on synthetic tasks including data denoising, multilingual training, and instruction-following fine-tuning.
The evaluation of ScaleBiO uses 3,000 samples from each source for reweighting; 10,000 samples are then drawn according to the BO-derived final weights to train the model. Applying the learned sampling weights to fine-tune LLaMA-3-8B and LLaMA-3-70B demonstrated ScaleBiO’s effectiveness.
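As a rough illustration of this sampling step (a minimal sketch with made-up source names, pool sizes, and weights, not the authors' code), drawing a fixed budget of examples in proportion to learned per-source weights might look like this:

```python
import random

# Hypothetical per-source sampling weights; in practice these would come
# from the bilevel reweighting procedure.
learned_weights = {"source_a": 0.55, "source_b": 0.30, "source_c": 0.15}

# Hypothetical candidate pools, e.g. 3,000 examples per source.
pools = {name: [f"{name}_example_{i}" for i in range(3000)]
         for name in learned_weights}

def sample_training_set(pools, weights, budget=10_000, seed=0):
    """Draw roughly `budget` examples, allocating the budget to each source
    in proportion to its weight, then sampling with replacement within
    each source's pool."""
    rng = random.Random(seed)
    total = sum(weights.values())
    training_set = []
    for name, pool in pools.items():
        n = round(budget * weights[name] / total)
        training_set.extend(rng.choices(pool, k=n))
    rng.shuffle(training_set)
    return training_set

train_data = sample_training_set(pools, learned_weights)
print(len(train_data))  # ~10,000 examples, mixed according to the weights
```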
While ScaleBiO offers an efficient way to boost model performance, its effectiveness on large-scale pre-training has yet to be tested, a task that would require extensive computational resources. Confirming its success in large-scale fine-tuning settings is therefore a significant first step.