Gradient-based Optimization and Implicit Regularization Over Non-convex Landscapes
Author | : Xiaoxia (Shirley) Wu |
Publisher | : |
Total Pages | : 328 |
Release | : 2020 |
Genre | : |
ISBN | : |
Download Gradient-based Optimization and Implicit Regularization Over Non-convex Landscapes Book in PDF, Epub and Kindle
Large-scale machine learning problems can be reduced to non-convex optimization problems if state-of-the-art models such as deep neural networks are applied. One of the most widely-used algorithms is the first-order iterative gradient-based algorithm, i.e., (stochastic) gradient descent method. Two main challenges arise from understanding the gradient-based algorithm over the non-convex landscapes: the convergence complexity and the algorithm's solutions. This thesis aims to tackle the two challenges by providing a theoretical framework and empirical investigation on three popular gradient-based algorithms, namely, adaptive gradient methods [39], weight normalization [138] and curriculum learning [18]. For convergence, the stepsize or learning rate plays a pivotal role in the iteration complexity. However, it depends crucially on the (generally unknown) Lipschitz smoothness constant and noise level on the stochastic gradient. A popular stepsize auto-tuning method is the adaptive gradient methods such as AdaGrad that update the learning rate on the fly according to the gradients received along the way; Yet, the theoretical guarantees to date for AdaGrad are for online and convex optimization. We bridge this gap by providing theoretical guarantees for the convergence of AdaGrad for smooth, non-convex functions; we show that it converges to a stationary point at the (log(N)/ √N) rate in the stochastic setting and at the optimal (1/N) rate in the batch (non-stochastic) setting. Extensive numerical experiments are provided to corroborate our theory. For the gradient-based algorithm solution, we study weight normalization (WN) methods in the setting of an over-parameterized linear regression problem where WN decouples the weight vector with a scale and a unit vector. We show that this reparametrization has beneficial regularization effects compared to gradient descent on the original objective. WN adaptively regularizes the weights and converges close to the minimum l2 norm solution, even for initializations far from zero. To further understand the stochastic gradient-based algorithm, we study the continuation method -- curriculum learning (CL) -- inspired by cognitive science that humans learn from simple to complex order. CL has proposed ordering examples during training based on their difficulty, while anti-CL proposed the opposite ordering. Both CL and anti-CL have been suggested as improvements to the standard i.i.d. training. We set out to investigate the relative benefits of ordered learning in three settings: standard-time, short-time, and noisy label training. We find that both orders have only marginal benefits for standard benchmark datasets. However, with limited training time budget or noisy data, curriculum, but not anti-curriculum ordering, can improve the performance