Gradient Descent, Stochastic Optimization, and Other Tales

Author: Jun Lu
Publisher: Eliva Press
Release: 2022-07-22
ISBN: 9789994981557



The goal of this book is to dispel the magic behind black-box optimizers and stochastic optimizers, and to build a solid foundation for how and why these techniques work. The manuscript crystallizes this knowledge by deriving, from simple intuitions, the mathematics behind the strategies. The book does not shy away from addressing both the formal and the informal aspects of gradient descent and stochastic optimization, and thereby aims to give readers a deeper understanding of these techniques as well as the when, how, and why of applying them. Gradient descent is one of the most popular algorithms for performing optimization and by far the most common way to optimize machine learning tasks. Its stochastic version has received increasing attention in recent years, particularly for training deep neural networks, where the gradient computed from a single sample or a batch of samples is used to save computational resources and to escape saddle points. In 1951, Robbins and Monro published "A Stochastic Approximation Method", one of the first modern treatments of stochastic optimization, which estimates local gradients with a new batch of samples. Stochastic optimization has since become a core technology in machine learning, largely due to the development of the backpropagation algorithm for fitting neural networks. The sole aim of this book is to give a self-contained introduction to the concepts and mathematical tools of gradient descent and stochastic optimization.
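As a concrete illustration of the minibatch gradient step described above, here is a minimal SGD sketch for a least-squares problem (a toy example, not code from the book; the learning rate, batch size, and epoch count are arbitrary illustrative choices):

```python
import numpy as np

def sgd_least_squares(X, y, lr=0.05, batch_size=8, epochs=200, seed=0):
    """Minibatch SGD for the least-squares loss 0.5*||X_b w - y_b||^2 / b."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(epochs):
        idx = rng.permutation(n)            # reshuffle the data each epoch
        for start in range(0, n, batch_size):
            b = idx[start:start + batch_size]
            # Stochastic gradient from one minibatch of samples.
            grad = X[b].T @ (X[b] @ w - y[b]) / len(b)
            w -= lr * grad
    return w

# Usage: recover a planted weight vector from noiseless data.
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 5))
w_true = np.arange(1.0, 6.0)
w_hat = sgd_least_squares(X, X @ w_true)
```

In this noiseless (interpolation) regime the gradient noise vanishes at the optimum, so plain SGD with a constant step size converges to the planted solution.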

Stochastic Optimization Methods for Modern Machine Learning Problems

Author: Yuejiao Sun
Total Pages: 178
Release: 2021



Optimization has been the workhorse of solving machine learning problems. However, the efficiency of existing methods remains far from satisfactory for the ever-growing demands of modern applications. In this context, the present dissertation focuses on two fundamental classes of machine learning problems: 1) stochastic nested problems, where one subproblem builds upon the solution of others; and 2) stochastic distributed problems, where the subproblems are coupled through shared variables. One key difficulty of solving stochastic nested problems is that the hierarchically coupled structure makes the computation of (stochastic) gradients, the basic element of first-order optimization machinery, prohibitively expensive or even impossible. We develop the first stochastic optimization method that runs in a single-loop manner and achieves the same sample complexity as the stochastic gradient descent method for non-nested problems. One key difficulty of solving stochastic distributed problems is resource intensity, especially when algorithms run on resource-limited devices. In this context, we introduce a class of communication-adaptive stochastic gradient descent (SGD) methods, which adaptively reuse stale gradients and thus save communication. We show that the new algorithms have convergence rates comparable to those of the original SGD and Adam algorithms, yet enjoy impressive empirical performance in reducing the total number of communication rounds.
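The communication-adaptive idea above can be illustrated with a toy sketch (not the dissertation's algorithm; the fixed threshold rule and the quadratic worker objectives are illustrative assumptions): each worker re-uploads its gradient only when it has drifted enough from the last communicated one, and the server otherwise reuses the stale copy.

```python
import numpy as np

def lazy_sgd(grad_fns, w0, lr=0.1, thresh=1e-2, steps=100):
    """SGD where each worker uploads a fresh gradient only when it differs
    enough from its last communicated (stale) gradient; otherwise the
    server reuses the stale copy, saving a communication round."""
    w = w0.copy()
    stale = [g(w) for g in grad_fns]      # initial upload from every worker
    comms = len(grad_fns)
    for _ in range(steps):
        for i, g in enumerate(grad_fns):
            fresh = g(w)
            if np.linalg.norm(fresh - stale[i]) > thresh:
                stale[i] = fresh          # communicate the new gradient
                comms += 1
        w -= lr * np.mean(stale, axis=0)  # server step with possibly stale gradients
    return w, comms

# Usage: two quadratic workers f_i(w) = 0.5*||w - c_i||^2; their average
# objective is minimized at the mean of the targets, [0.5, 0.5].
targets = [np.array([1.0, 0.0]), np.array([0.0, 1.0])]
grad_fns = [lambda w, c=c: w - c for c in targets]
w_hat, comms = lazy_sgd(grad_fns, np.zeros(2))
```

Near the solution the gradients stop changing, so uploads become rare: the iterate still converges while the communication count stays well below the always-communicate budget of steps times workers.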

Optimization Algorithms for Distributed Machine Learning

Author: Gauri Joshi
Publisher: Springer Nature
Total Pages: 137
Release: 2022-11-25
Genre: Computers
ISBN: 303119067X



This book discusses state-of-the-art stochastic optimization algorithms for distributed machine learning and analyzes their convergence speed. The book first introduces stochastic gradient descent (SGD) and its distributed version, synchronous SGD, where the task of computing gradients is divided across several worker nodes. The author discusses several algorithms that improve the scalability and communication efficiency of synchronous SGD, such as asynchronous SGD, local-update SGD, quantized and sparsified SGD, and decentralized SGD. For each of these algorithms, the book analyzes its error-versus-iterations convergence and the runtime spent per iteration. The author shows that each of these strategies to reduce communication or synchronization delays encounters a fundamental trade-off between error and runtime.
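Local-update SGD, one of the communication-saving schemes mentioned above, can be sketched as follows (an illustrative toy, not the book's implementation; the quadratic worker objectives are assumed for the example): each worker takes several local steps on its own data, and only the averaged model is communicated.

```python
import numpy as np

def local_update_sgd(grad_fns, w0, lr=0.1, local_steps=5, rounds=20):
    """Each worker runs `local_steps` gradient steps on its own objective,
    then the server averages the resulting models (one communication round)."""
    w = w0.copy()
    for _ in range(rounds):
        local_models = []
        for g in grad_fns:
            wi = w.copy()
            for _ in range(local_steps):   # local updates, no communication
                wi -= lr * g(wi)
            local_models.append(wi)
        w = np.mean(local_models, axis=0)  # one synchronization round
    return w

# Usage: two quadratic workers f_i(w) = 0.5*||w - c_i||^2; the averaged
# iterate converges to the minimizer of the average objective, [0.5, 0.5].
targets = [np.array([1.0, 0.0]), np.array([0.0, 1.0])]
grad_fns = [lambda w, c=c: w - c for c in targets]
w_hat = local_update_sgd(grad_fns, np.zeros(2))
```

Compared with synchronous SGD, this trades more local computation per round for far fewer communication rounds, the kind of error-versus-runtime trade-off the book analyzes.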

Sentiment Analysis and Deep Learning

Author: Subarna Shakya
Publisher: Springer Nature
Total Pages: 987
Release: 2023-01-01
Genre: Technology & Engineering
ISBN: 9811954437



This book gathers selected papers presented at the International Conference on Sentimental Analysis and Deep Learning (ICSADL 2022), jointly organized by Tribhuvan University, Nepal, and Prince of Songkla University, Thailand, during 16–17 June 2022. The volume discusses state-of-the-art research on incorporating artificial intelligence models, such as deep learning techniques, into intelligent sentiment analysis applications. Emotions and sentiments are emerging as the most important human factors for understanding the prominent user-generated semantics and perceptions in the humongous volume of user-generated data. In this scenario, sentiment analysis emerges as a significant breakthrough technology that can automatically analyze human emotions in data-driven applications. Sentiment analysis makes it possible to mine voluminous unstructured data and deliver real-time analyses that efficiently automate business processes.

Noise-aware Stochastic Optimization

Author: Lukas Balles
Release: 2021



First-order stochastic optimization algorithms like stochastic gradient descent (SGD) are the workhorse of modern machine learning. With their simplicity and low per-iteration cost, they have powered the immense success of deep artificial neural network models. Surprisingly, these stochastic optimization methods are essentially unaware of stochasticity: they neither collect information about the stochastic noise associated with their gradient evaluations, nor have explicit mechanisms to adjust their behavior accordingly. This thesis presents approaches to make stochastic optimization methods noise-aware using estimates of the (co-)variance of stochastic gradients. First, we show how such variance estimates can be used to automatically adapt the minibatch size for SGD, i.e., the number of data points sampled in each iteration. This can replace the usual decreasing step size schedule required for convergence, which is much more challenging to automate. We highlight that both approaches can be viewed through the same lens of reducing the mean squared error of the gradient estimate. Next, we identify an implicit variance adaptation mechanism in the ubiquitous Adam method. In particular, we show that Adam can be seen as a version of sign-SGD with a coordinatewise "damping" based on the stochastic gradient's signal-to-noise ratio. We make this variance adaptation mechanism explicit, formalize it, and transfer it from sign-SGD to SGD. Finally, we critically discuss a family of methods that preconditions stochastic gradient descent updates with the so-called "empirical Fisher" matrix, which is closely related to the stochastic gradient covariance matrix. This preconditioning is usually motivated from information-geometric considerations as an approximation to the Fisher information matrix. We caution against this argument, show that the empirical Fisher approximation has fundamental theoretical flaws, and argue that preconditioning with the empirical Fisher is better understood as a form of variance adaptation.
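The idea of using gradient (co-)variance estimates to grow the minibatch can be sketched with a norm-test-style rule (a simplified illustration, not the thesis's exact criterion; the threshold theta and the least-squares objective are assumptions made for this example):

```python
import numpy as np

def variance_adaptive_sgd(X, y, lr=0.05, b0=8, theta=1.0, steps=300, seed=0):
    """SGD on least squares that grows the minibatch when gradient noise
    dominates the signal: the batch doubles whenever the estimated
    tr(Cov)/b exceeds theta * ||g||^2 (a norm-test-style rule)."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    w, b = np.zeros(d), b0
    for _ in range(steps):
        idx = rng.choice(n, size=b, replace=False)
        resid = X[idx] @ w - y[idx]
        per_sample = X[idx] * resid[:, None]       # per-sample gradients
        g = per_sample.mean(axis=0)
        noise = per_sample.var(axis=0).sum() / b   # estimate of tr(Cov)/b
        if noise > theta * (g @ g):                # noise dominates: enlarge b
            b = min(2 * b, n)
        w -= lr * g
    return w, b

# Usage: noiseless planted regression; the iterate converges while the
# batch size adapts within [b0, n].
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 5))
w_true = np.arange(1.0, 6.0)
w_hat, b_final = variance_adaptive_sgd(X, X @ w_true)
```

Growing the batch and shrinking the step size both reduce the mean squared error of the gradient estimate, which is the unifying lens the thesis emphasizes.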

Distributed Stochastic Optimization in Non-Differentiable and Non-Convex Environments

Author: Stefan Vlaski
Total Pages: 284
Release: 2019



The first part of this dissertation considers distributed learning problems over networked agents. The general objective of distributed adaptation and learning is the solution of global, stochastic optimization problems through localized interactions and without information about the statistical properties of the data. Regularization is a useful technique to encourage or enforce structural properties on the resulting solution, such as sparsity or constraints. A substantial number of regularizers are inherently non-smooth, while many cost functions are differentiable. We propose distributed and adaptive strategies that are able to minimize aggregate sums of objectives. In doing so, we exploit the structure of the individual objectives as sums of differentiable costs and non-differentiable regularizers. The resulting algorithms are adaptive in nature and able to continuously track drifts in the problem; their recursions, however, are subject to persistent perturbations arising from the stochastic nature of the gradient approximations and from disagreement across agents in the network. The presence of non-smooth, and potentially unbounded, regularizers enriches the dynamics of these recursions. We quantify the impact of this interplay and draw implications for steady-state performance as well as algorithm design and present applications in distributed machine learning and image reconstruction. There has also been increasing interest in understanding the behavior of gradient-descent algorithms in non-convex environments. In this work, we consider stochastic cost functions, where exact gradients are replaced by stochastic approximations and the resulting gradient noise persistently seeps into the dynamics of the algorithm. 
We establish that the diffusion learning algorithm continues to yield meaningful estimates in these more challenging, non-convex environments, in the sense that (a) despite the distributed implementation, individual agents cluster in a small region around the weighted network centroid in the mean-fourth sense, and (b) the network centroid inherits many properties of the centralized stochastic gradient descent recursion, including escape from strict saddle points in time inversely proportional to the step size and the return of approximately second-order stationary points in a polynomial number of iterations. In the second part of the dissertation, we consider centralized learning problems over networked feature spaces. Rapidly growing capabilities to observe, collect, and process ever-increasing quantities of information necessitate methods for identifying and exploiting structure in high-dimensional feature spaces. Networks, frequently referred to as graphs in this context, have emerged as a useful tool for modeling interrelations among different parts of a data set. We consider graph signals that evolve dynamically according to a heat-diffusion process and are subject to persistent perturbations. The model is not limited to heat diffusion; it can also describe other processes, such as the evolution of interest over social networks and the movement of people in cities. We develop an online algorithm that learns the underlying graph structure from observations of the signal evolution and derive expressions for its performance. The algorithm is adaptive in nature and able to respond to changes in the graph structure and the perturbation statistics.
Furthermore, in order to incorporate prior structural knowledge and improve classification performance, we propose a BRAIN strategy for learning, which enhances traditional algorithms such as logistic regression and SVM learners with a graphical layer that tracks and learns, in real time, the underlying correlation structure among feature subspaces. In this way, the algorithm is able to identify salient subspaces and their correlations while simultaneously dampening the effect of irrelevant features.
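The composite objectives discussed above, a differentiable cost plus a non-differentiable regularizer, are commonly handled with proximal stochastic gradient steps. Here is a minimal sketch for an l1-regularized least-squares problem (illustrative only, not the dissertation's distributed algorithm; the regularization weight and step size are arbitrary choices):

```python
import numpy as np

def soft_threshold(v, t):
    """Proximal operator of t*||.||_1 (coordinatewise soft thresholding)."""
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def prox_sgd(X, y, lam=0.1, lr=0.05, batch_size=8, epochs=200, seed=0):
    """Stochastic proximal gradient for a smooth minibatch loss plus
    lam*||w||_1: a gradient step on the differentiable part, then the
    prox of the non-differentiable regularizer."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(epochs):
        for _ in range(n // batch_size):
            idx = rng.choice(n, size=batch_size, replace=False)
            grad = X[idx].T @ (X[idx] @ w - y[idx]) / batch_size
            w = soft_threshold(w - lr * grad, lr * lam)
    return w

# Usage: recover a sparse planted vector; the l1 prox zeroes out the
# irrelevant coordinates while the gradient steps fit the signal.
rng = np.random.default_rng(1)
X = rng.normal(size=(300, 5))
w_true = np.array([3.0, 0.0, 0.0, -2.0, 0.0])
w_hat = prox_sgd(X, X @ w_true)
```

The non-smooth penalty enforces the sparsity structure exactly, which a plain gradient step on a smoothed surrogate would not.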

Adaptive Gradient Descent for Convex and Non-convex Stochastic Optimization

Author: Aleksandr Ogaltsov
Release: 2019



In this paper we propose several adaptive gradient methods for stochastic optimization. Our methods are based on an Armijo-type line search, and they simultaneously adapt to the unknown Lipschitz constant of the gradient and to the variance of the stochastic approximation of the gradient. We consider an accelerated gradient descent for convex problems and gradient descent for non-convex problems. In experiments, we demonstrate the superiority of our methods over existing adaptive methods such as AdaGrad and Adam.
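An Armijo-type backtracking line search of the kind this work builds on can be sketched as follows (a minimal deterministic illustration, not the authors' stochastic method; the constants c and the shrink factor are conventional choices):

```python
import numpy as np

def armijo_gd(f, grad, w0, lr0=1.0, c=1e-4, shrink=0.5, steps=100):
    """Gradient descent with an Armijo backtracking line search: start
    from lr0 and halve the step until the sufficient-decrease condition
    f(w - lr*g) <= f(w) - c*lr*||g||^2 holds. The accepted step size
    adapts to the (unknown) local Lipschitz constant of the gradient."""
    w = w0.copy()
    for _ in range(steps):
        g = grad(w)
        lr = lr0
        # Backtrack until the step decreases f by at least c*lr*||g||^2.
        while f(w - lr * g) > f(w) - c * lr * (g @ g):
            lr *= shrink
        w = w - lr * g
    return w

# Usage: a poorly scaled quadratic; backtracking finds stable step sizes
# without knowing the Lipschitz constant (here 10) in advance.
f = lambda w: 0.5 * (w[0] ** 2 + 10.0 * w[1] ** 2)
grad = lambda w: np.array([w[0], 10.0 * w[1]])
w_hat = armijo_gd(f, grad, np.array([1.0, 1.0]))
```

In the stochastic setting the same test must additionally account for gradient-estimate variance, which is the extra ingredient the paper's methods adapt to.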

On Deterministic and Stochastic Optimization Algorithms for Problems with Riemannian Manifold Constraints

Author: Dewei Zhang (Ph. D. in systems engineering)
Release: 2021
Genre: Mathematical optimization



Optimization methods have been extensively studied given their broad applications in areas such as applied mathematics, statistics, engineering, healthcare, business, and finance. In the past two decades, the fast growth of machine learning and artificial intelligence and their increasing applications across industries have posed various optimization challenges related to scalability, uncertainty, or the requirement to satisfy certain constraints. This dissertation mainly looks into optimization problems whose solutions must satisfy certain (possibly nonlinear) constraints, namely Riemannian manifold constraints, or must conform to sparsity structures encoded by directed acyclic graphs. More specifically, the dissertation explores the following research directions. 1) To optimize finite-sum objective functions over Riemannian manifolds, Chapter 2 proposes a stochastic variance-reduced cubic regularized Newton algorithm. The algorithm requires full gradient and Hessian updates at the beginning of each epoch and performs stochastic variance-reduced updates in the iterations within each epoch. Its iteration complexity for obtaining an (ε, √ε)-second-order stationary point, i.e., a point where the Riemannian gradient norm is bounded above by ε and the minimum eigenvalue of the Riemannian Hessian is bounded below by -√ε, is shown to be O(ε^(-3/2)). The dissertation also proposes a computationally more appealing extension that requires only an inexact solution of the cubic regularized Newton subproblem while retaining the same iteration complexity. 2) To optimize nested compositions of two or more expectation-valued functions over Riemannian manifolds, Chapter 3 proposes multi-level stochastic compositional algorithms. For two-level compositional optimization, the dissertation presents a Riemannian Stochastic Compositional Gradient Descent (R-SCGD) method that finds an approximate stationary point, with expected squared Riemannian gradient smaller than ε, in O(ε^(-2)) calls to the stochastic gradient oracle of the outer function and the stochastic function and gradient oracles of the inner function. The R-SCGD algorithm is further generalized to problems with multi-level nested compositional structures, with the same O(ε^(-2)) complexity for first-order stochastic oracles. 3) In many statistical learning problems, the optimal solution is desired to conform to an a priori known sparsity structure represented by a directed acyclic graph. Inducing such structures by means of convex regularizers requires nonsmooth penalty functions that exploit group overlap. Chapter 4 investigates evaluating the proximal operator of the Latent Overlapping Group lasso through an optimization algorithm with parallelizable subproblems, implementing an Alternating Direction Method of Multipliers with a sharing scheme to solve large-scale instances of the underlying problem efficiently. In the absence of strong convexity, global linear convergence of the algorithm is established using error bound theory; in particular, primal and dual error bounds are derived when the nonsmooth component of the objective function does not have a polyhedral epigraph. The theoretical results established in each chapter are numerically verified through carefully designed simulation studies and are also implemented on real applications with real data sets.
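Riemannian optimization of the kind treated in this dissertation can be illustrated with the simplest manifold example, gradient ascent on the unit sphere for the leading eigenvector (a toy sketch, not the dissertation's variance-reduced Newton method; the projection-plus-normalization retraction is one standard choice):

```python
import numpy as np

def riemannian_gd_sphere(A, x0, lr=0.1, steps=500):
    """Riemannian gradient ascent for max x^T A x on the unit sphere:
    project the Euclidean gradient onto the tangent space at x, take a
    step, then retract back onto the manifold by normalization."""
    x = x0 / np.linalg.norm(x0)
    for _ in range(steps):
        egrad = 2 * A @ x                  # Euclidean gradient of x^T A x
        rgrad = egrad - (x @ egrad) * x    # tangent-space projection
        x = x + lr * rgrad                 # step in the tangent direction
        x = x / np.linalg.norm(x)          # retraction onto the sphere
    return x

# Usage: for a symmetric matrix, the maximizer over the sphere is the
# leading eigenvector, and x^T A x converges to the top eigenvalue.
A = np.diag([3.0, 1.0, 0.5])
x_hat = riemannian_gd_sphere(A, np.ones(3))
```

The same project-step-retract template underlies the more sophisticated stochastic and second-order Riemannian methods the dissertation develops.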