MIT researchers have shown how symmetry in datasets can reduce the amount of data needed to train models. The research, from MIT Ph.D. student Behrooz Tahmasebi and his advisor Stefanie Jegelka, builds on Weyl’s law, a century-old mathematical result originally developed to measure the complexity of spectral information.
While studying differential equations, Tahmasebi realized that Weyl’s law could be carried over to machine learning: by recognizing symmetries in a dataset, a model can be made more efficient and trained with less data.
Their research indicates that exploiting symmetries in datasets simplifies machine learning tasks, reducing the need for vast amounts of training data. The intuition is that if a model knows an image depicts the same object whether it is rotated or mirrored, it does not have to relearn that fact from examples; it can spend its data budget on what actually varies, achieving accurate results from fewer samples.
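To make that intuition concrete, here is a minimal sketch, not taken from the paper, of one standard way to build such invariance into a model: average its predictions over every rotated and mirrored copy of the input, so the output is identical no matter how the image is oriented. All function names below are illustrative.

```python
import numpy as np

def dihedral_orbit(image):
    """All 8 symmetric variants of a square image: the four
    90-degree rotations, each with and without a mirror flip."""
    variants = []
    for k in range(4):
        rotated = np.rot90(image, k)
        variants.extend([rotated, np.fliplr(rotated)])
    return variants

def invariant_predict(model_fn, image):
    """Make any scoring function invariant to rotations and flips
    by averaging its scores over the whole symmetry orbit."""
    return np.mean([model_fn(v) for v in dihedral_orbit(image)], axis=0)

# Toy usage: a non-invariant "model" (a fixed weighted pixel sum)
# becomes invariant after orbit averaging.
if __name__ == "__main__":
    rng = np.random.default_rng(0)
    weights = rng.random((8, 8))
    model = lambda x: float((x * weights).sum())
    img = rng.random((8, 8))
    a = invariant_predict(model, img)
    b = invariant_predict(model, np.rot90(img))  # same image, rotated
    assert np.isclose(a, b)  # identical score either way
```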
The study also examines invariances in Kernel Ridge Regression (KRR), that is, symmetric transformations and other properties of the data that remain unchanged under certain operations. Tahmasebi says this is the first time Weyl’s law has been employed to determine how machine learning can be improved through symmetry.
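As a hedged illustration of what a symmetry-aware kernel looks like (a minimal sketch assuming a simple reflection symmetry, not the paper’s general construction), the base kernel can be averaged over the symmetry group so that the regressor is invariant by construction and never spends samples learning the symmetry itself:

```python
import numpy as np

def rbf(x, y, gamma=1.0):
    """Standard Gaussian (RBF) kernel between two sets of 1-D points."""
    return np.exp(-gamma * (x[:, None] - y[None, :]) ** 2)

def invariant_rbf(x, y, gamma=1.0):
    """Kernel averaged over the reflection group {x -> x, x -> -x}.
    A KRR model built on this kernel can only represent even
    functions, so the symmetry comes for free."""
    return 0.5 * (rbf(x, y, gamma) + rbf(-x, y, gamma))

def krr_fit_predict(kernel, x_train, y_train, x_test, lam=1e-3):
    """Kernel ridge regression: solve (K + lam*I) alpha = y,
    then predict with K_test @ alpha."""
    K = kernel(x_train, x_train)
    alpha = np.linalg.solve(K + lam * np.eye(len(x_train)), y_train)
    return kernel(x_test, x_train) @ alpha

if __name__ == "__main__":
    rng = np.random.default_rng(1)
    x_train = rng.uniform(-3, 3, 20)
    y_train = np.cos(x_train) + 0.05 * rng.normal(size=20)  # even target
    x_test = np.linspace(-3, 3, 200)
    pred_plain = krr_fit_predict(rbf, x_train, y_train, x_test)
    pred_inv = krr_fit_predict(invariant_rbf, x_train, y_train, x_test)
    # With the symmetrized kernel, the same 20 samples effectively
    # cover both halves of the domain.
    print("plain error:    ", np.abs(pred_plain - np.cos(x_test)).mean())
    print("invariant error:", np.abs(pred_inv - np.cos(x_test)).mean())
```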
This development is particularly important in fields such as computational chemistry and cosmology, where high-quality data are scarce. In cosmology, for example, researchers may find only a small amount of useful data amid a vast collection of irrelevant information. In such cases, symmetry becomes a valuable tool.
Soledad Villar, an applied mathematician at Johns Hopkins University, noted the study’s implication that models which conform to the symmetries of the problem can produce more accurate predictions from very few training points.
The researchers identified two types of gains from exploiting symmetries: a linear improvement, which scales with the size of the symmetry group, and an exponential gain, which delivers a disproportionately large benefit when the symmetries span multiple dimensions.
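In rough, schematic form (the precise statements and constants live in the paper), the two regimes can be read off from the standard error rate for kernel regression of an s-times-differentiable function in d dimensions:

```latex
% Baseline rate for kernel regression with no symmetry:
%   error(n) ~ n^{-2s/(2s+d)}
%
% Finite symmetry group G: each sample stands in for |G| symmetric
% copies, so the effective sample size grows linearly with |G|.
\mathrm{error}(n) \;\lesssim\; \bigl(|G| \cdot n\bigr)^{-\frac{2s}{2s+d}}
%
% Continuous (Lie) group of dimension dim(G): the effective dimension
% of the problem itself shrinks, improving the exponent, which is an
% exponentially larger benefit in high dimensions.
\mathrm{error}(n) \;\lesssim\; n^{-\frac{2s}{2s + d - \dim(G)}}
```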
Tahmasebi and Jegelka found that when a model recognizes patterns or symmetries, it can learn as if it had more data: each observed sample effectively stands in for all of its symmetric copies, so the model learns more from less. Symmetries also simplify the learning task itself, since the model can disregard irrelevant variation (such as an object’s position or orientation), which speeds up learning and improves performance.
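One simple way to see this "learning from less" effect in practice (again a hedged sketch, not the paper’s construction) is symmetry-based data augmentation: each labeled image is replaced by its full set of rotated and mirrored copies, multiplying the effective training set without collecting anything new.

```python
import numpy as np

def expand_by_symmetry(images, labels):
    """Augment a labeled dataset with every rotated and mirrored copy
    of each square image (8 variants per sample). One collected
    example then does the statistical work of eight."""
    aug_images, aug_labels = [], []
    for img, label in zip(images, labels):
        for k in range(4):                          # four 90-degree rotations
            rotated = np.rot90(img, k)
            aug_images.append(rotated)
            aug_images.append(np.fliplr(rotated))   # plus each mirror image
            aug_labels.extend([label, label])
    return np.stack(aug_images), np.array(aug_labels)

# Toy usage: 5 random 8x8 "images" become 40 training samples.
if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X, y = rng.random((5, 8, 8)), np.arange(5)
    X_aug, y_aug = expand_by_symmetry(X, y)
    print(X_aug.shape, y_aug.shape)  # (40, 8, 8) (40,)
```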
The potential impact of this research is substantial, particularly in computational chemistry, where these principles could accelerate drug discovery. By recognizing symmetries in molecular structures, machine learning models can predict interactions and properties from fewer data points. The same approach could also help scientists analyze cosmic phenomena, uncovering more insight from limited observations.