Some Probabilistic Understandings of the Effects of Noise in the Stochastic Gradient Descent
Abstract

The remarkable empirical performance of the Stochastic Gradient Descent (SGD) algorithm with constant learning rate has led to effective training of many large-scale machine learning models in modern data science. In this talk, we present several attempts to understand the effects of noise in the SGD algorithm from a probabilistic perspective. Our first attempt introduces a stochastic differential equation to describe the diffusion limit (as the learning rate tends to zero) of the discrete SGD recursions. Based on this diffusion limit, we connect SGD with a noise-injected gradient system. This connection enables us to understand the stochastic dynamics of SGD via delicate probabilistic techniques from stochastic analysis (stochastic calculus), such as its escape from stationary points (including saddle points and local minima) and how its covariance structure implicitly affects regularization properties. Our second attempt re-interprets the stochastic gradient of vanilla SGD as a matrix-vector product of the matrix of gradients and a random noise vector. This leads to a generalization of SGD, namely Multiplicative SGD (M-SGD). We demonstrate, both empirically and theoretically, that M-SGD recovers the generalization performance of vanilla SGD.
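As a rough illustration of the matrix-vector reading of the stochastic gradient mentioned above, the following minimal sketch (not taken from the talk; the quadratic loss, the toy data, and the uniform minibatch sampling are assumptions chosen for illustration) compares a vanilla minibatch SGD step with the equivalent product of the per-example gradient matrix and a random noise vector:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 8, 3                                # toy problem: n examples, d parameters
X = rng.normal(size=(n, d))
y = rng.normal(size=n)
w = np.zeros(d)
lr, batch = 0.1, 4

# Per-example gradients of the squared loss 0.5*(x_i . w - y_i)^2,
# stacked column-wise into a (d, n) "matrix of gradients" G.
G = (X * (X @ w - y)[:, None]).T           # shape (d, n)

# Vanilla SGD: average the gradients over a uniformly sampled minibatch.
idx = rng.choice(n, size=batch, replace=False)
g_sgd = G[:, idx].mean(axis=1)

# Multiplicative view: the same step is G times a random noise vector v
# (1/batch on the sampled indices, 0 elsewhere). Each entry of v has mean
# 1/n, so G @ v equals the full-batch gradient in expectation.
v = np.zeros(n)
v[idx] = 1.0 / batch
g_msgd = G @ v                             # identical to g_sgd

assert np.allclose(g_sgd, g_msgd)
w -= lr * g_msgd                           # one (M-)SGD update
```

Replacing this particular sampling vector by other random vectors with the same mean yields the broader M-SGD family studied in the talk.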

Speaker: Dr Wenqing HU
Date: 3 July 2019
Time: 10:30am - 11:30am

Biography

Dr Wenqing HU received his B.Sc. in Mathematics from Peking University and his Ph.D. in Mathematics from the University of Maryland. He is currently an assistant professor at Missouri S&T, where he works in the area of probability and applied mathematics.