Neural networks trained with gradient descent often perform well even when they are overparameterized and randomly initialized. They frequently reach globally optimal solutions that achieve zero training error yet still generalize, a phenomenon referred to as “benign overfitting.” For Rectified Linear Unit (ReLU) networks, however, overfitting is not always benign: solutions that interpolate the data can generalize poorly, and with noisy labels in particular, the best solutions usually do not interpolate.
Research by scientists at UC Santa Barbara, the Technion, and UC San Diego investigates the generalization of two-layer ReLU neural networks in a one-dimensional nonparametric regression setting with noisy labels. They develop a theory indicating that gradient descent with a fixed, typically large learning rate converges only to stable local minima that represent smooth, sparsely linear functions, that is, piecewise-linear fits with few kinks. These solutions do not interpolate the noisy labels, thereby avoiding overfitting and achieving near-optimal mean squared error (MSE) rates.
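To make the setting concrete, here is a minimal sketch (not the authors' code) of the kind of training the theory speaks to: a two-layer ReLU network fit by full-batch gradient descent with a fixed learning rate on one-dimensional data with noisy labels. The width, sample size, learning rate, and target function are illustrative assumptions.

```python
# Minimal sketch (not the authors' code) of the setting: full-batch gradient
# descent with a fixed, deliberately large learning rate on a two-layer ReLU
# network for 1-D regression with noisy labels. Width, sample size, learning
# rate, and the target function are illustrative assumptions.
import torch

torch.manual_seed(0)

n, width, lr, steps = 100, 200, 0.5, 5_000

# Noisy 1-D regression data: y = f*(x) + noise.
x = torch.linspace(-1.0, 1.0, n).unsqueeze(1)
y = torch.sin(3.0 * x) + 0.3 * torch.randn(n, 1)

# Two-layer ReLU network: f(x) = W2 @ relu(W1 x + b1) + b2.
model = torch.nn.Sequential(
    torch.nn.Linear(1, width),
    torch.nn.ReLU(),
    torch.nn.Linear(width, 1),
)
opt = torch.optim.SGD(model.parameters(), lr=lr)  # plain GD, no weight decay

for _ in range(steps):
    loss = torch.mean((model(x) - y) ** 2)  # training MSE on the full batch
    opt.zero_grad()
    loss.backward()
    opt.step()

# Per the paper's theory, a stable solution should NOT drive this to zero:
# it settles above the noise floor instead of interpolating the noise.
print(f"final training MSE: {loss.item():.4f}")
```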
Handling noisy labels traditionally requires explicit regularization, but this work suggests that gradient descent run with a large learning rate induces sparse, smooth functions on its own, with no explicit regularizer. This departs from conventional theories that rely on interpolation and offers insight into achieving near-optimal rates in nonparametric regression without weight decay.
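Why a large step size acts this way can be seen in the textbook one-dimensional quadratic calculation (a standard stability analysis, not specific to this paper): gradient descent can only remain near minima whose curvature is at most 2/η, so a larger η confines it to flatter minima, which the paper then links to smoother learned functions.

```latex
% GD on the quadratic L(\theta) = \tfrac{\lambda}{2}\theta^{2} with step size \eta:
\theta_{t+1} \;=\; \theta_t - \eta\, L'(\theta_t) \;=\; (1 - \eta\lambda)\,\theta_t,
% and the iterates stay bounded iff
|1 - \eta\lambda| \le 1 \iff \lambda \le \frac{2}{\eta}.
```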
The study focuses on two-layer ReLU networks applied to regression problems with noisy labels. Its object of analysis is stable local minima: twice-differentiable minima at which the gradient descent dynamics remain stable for the given learning rate. The study also connects to the “Edge of Stability” regime, observing that once the largest eigenvalue of the Hessian reaches the critical value determined by the learning rate (2/η for step size η), gradient descent cannot stably converge to interpolating solutions in noisy settings.
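This threshold can be probed numerically. Below is a sketch (reusing model, x, y, and lr from the earlier snippet) that estimates the sharpness, i.e. the largest Hessian eigenvalue of the training loss, by power iteration on Hessian-vector products and compares it to 2/η; the iteration count is an arbitrary choice.

```python
# Estimate sharpness = lambda_max(Hessian of the training loss) via power
# iteration on Hessian-vector products, then compare to the Edge-of-Stability
# threshold 2/eta. Reuses `model`, `x`, `y`, `lr` from the sketch above.
import torch

params = [p for p in model.parameters() if p.requires_grad]
dim = sum(p.numel() for p in params)

def hvp(v):
    """Hessian-vector product of the training MSE at the current parameters."""
    loss = torch.mean((model(x) - y) ** 2)
    grads = torch.autograd.grad(loss, params, create_graph=True)
    flat = torch.cat([g.reshape(-1) for g in grads])
    hv = torch.autograd.grad(torch.dot(flat, v), params)
    return torch.cat([h.reshape(-1) for h in hv])

v = torch.randn(dim)
v /= v.norm()
sharpness = 0.0
for _ in range(100):                      # power iteration
    hv = hvp(v)
    sharpness = torch.dot(v, hv).item()   # Rayleigh quotient, since |v| = 1
    v = hv / (hv.norm() + 1e-12)

print(f"sharpness ~ {sharpness:.3f} vs. stability threshold 2/lr = {2 / lr:.3f}")
```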
The primary findings examine stable solutions of gradient descent on ReLU networks from three angles. First, the study characterizes the implicit bias of these stable solutions under large learning rates, showing that they are inherently smoother and simpler. Second, it establishes generalization bounds for these solutions, showing that they avoid overfitting. Finally, the researchers show that gradient descent can achieve near-optimal rates for estimating functions of bounded variation on a given interval.
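The smoothness notion behind these bounds is the first-order total variation of the learned function, TV(f′); for the piecewise-linear functions a ReLU network computes, this is simply the sum of absolute changes in slope. A grid-based estimate takes a few lines (a sketch reusing model from above; the grid resolution is an arbitrary choice).

```python
# Grid-based estimate of TV(f'), the first-order total variation of the
# learned function: for a piecewise-linear ReLU network this is the sum of
# absolute slope changes. Reuses `model` from the sketch above.
import torch

grid = torch.linspace(-1.0, 1.0, 2001).unsqueeze(1)
with torch.no_grad():
    f = model(grid).squeeze(1)

h = grid[1, 0] - grid[0, 0]                       # uniform grid spacing
slopes = (f[1:] - f[:-1]) / h                     # slope on each grid interval
tv1 = torch.abs(slopes[1:] - slopes[:-1]).sum()   # total change in slope
print(f"TV(f') ~ {tv1.item():.3f}  (smaller = smoother, more sparsely linear)")
```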
In summary, the research assesses the generalization of gradient-descent-trained two-layer ReLU networks through the lens of minima stability and the Edge-of-Stability phenomenon. It shows that gradient descent with a large learning rate cannot stably interpolate noisy data, and that local smoothness (flatness) of the training loss at a stable minimum implies a total variation constraint on the learned function, leading to a vanishing generalization gap. Under suitable conditions, these stable gradient descent solutions achieve near-optimal rates for estimating functions of bounded variation. Validation through simulations showed that larger learning rates induce sparsely linear solutions even with noisy labels.
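That simulation finding is easy to reproduce in spirit with the pieces above. The hypothetical mini-experiment below (illustrative hyperparameters, not the paper's protocol) trains identical networks at a small and a large learning rate and counts effective kinks, meaning slope changes above a tolerance; the theory predicts the large-learning-rate fit is closer to sparsely linear.

```python
# Hypothetical mini-experiment: same data and architecture, small vs. large
# learning rate; count effective kinks (slope changes above a tolerance) in
# each fitted function. Hyperparameters and the tolerance are arbitrary.
import torch

def train(lr, n=100, width=200, steps=5_000, seed=0):
    torch.manual_seed(seed)
    x = torch.linspace(-1.0, 1.0, n).unsqueeze(1)
    y = torch.sin(3.0 * x) + 0.3 * torch.randn(n, 1)
    model = torch.nn.Sequential(torch.nn.Linear(1, width), torch.nn.ReLU(),
                                torch.nn.Linear(width, 1))
    opt = torch.optim.SGD(model.parameters(), lr=lr)
    for _ in range(steps):
        loss = torch.mean((model(x) - y) ** 2)
        opt.zero_grad()
        loss.backward()
        opt.step()
    return model

def num_kinks(model, tol=1e-2):
    grid = torch.linspace(-1.0, 1.0, 2001).unsqueeze(1)
    with torch.no_grad():
        f = model(grid).squeeze(1)
    slopes = (f[1:] - f[:-1]) / (grid[1, 0] - grid[0, 0])
    return int((torch.abs(slopes[1:] - slopes[:-1]) > tol).sum())

for lr in (0.01, 0.5):   # small vs. large step size (illustrative values)
    print(f"lr={lr}: ~{num_kinks(train(lr))} kinks")  # expect fewer at large lr
```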