Fractional Underdamped Langevin Dynamics: Retargeting SGD with Momentum under Heavy-Tailed Gradient Noise

Fractional Underdamped Langevin Dynamics: Retargeting SGD with Momentum under Heavy-Tailed Gradient Noise

Abstract

Stochastic gradient descent with momentum (SGDm) is one of the most popular optimization algorithms in deep learning. While there is a rich theory of SGDm for convex problems, the theory is considerably less developed in the context of deep learning where the problem is non-convex and the gradient noise might exhibit a heavy-tailed behavior, as empirically observed in recent studies. In this study, we consider a continuous-time variant of SGDm, known as the underdamped Langevin dynamics (ULD), and investigate its asymptotic properties under heavy-tailed perturbations. Supported by recent studies from statistical physics, we argue both theoretically and empirically that the heavy-tails of such perturbations can result in a bias even when the step-size is small, in the sense that the optima of stationary distribution of the dynamics might not match the optima of the cost function to be optimized. As a remedy, we develop a novel framework, which we coin as fractional ULD (FULD), and prove that FULD targets the so-called Gibbs distribution, whose optima exactly match the optima of the original cost. We observe that the Euler discretization of FULD has noteworthy algorithmic similarities with natural gradient methods and gradient clipping, bringing a new perspective on understanding their role in deep learning. We support our theory with experiments conducted on a synthetic model and neural networks.

120201-2602/20–/–tbdŞimşekli et al. \ShortHeadingsFractional Underdamped Langevin Dynamics:Simsekli et al.  \firstpageno1 \WarningFilter[pdftoc]hyperrefToken not allowed in a PDF string \WarningFilter[xclr]xcolorIncompatible color definition \ActivateWarningFilters[pdftoc] \ActivateWarningFilters[xclr]

\editor

TBD

1 Introduction

Gradient-based optimization algorithms have been the de facto choice in deep learning for solving the optimization problems of the form:

 x⋆=argminx∈Rd{f(x)≜1nn∑i=1f(i)(x)}, (1)

where denotes the non-convex loss function, denotes the loss contributed by an individual data point , denotes the collection of all the parameters of the neural network. Among others, stochastic gradient descent with momentum (SGDm) is one of the most popular algorithms for solving such optimization tasks (see e.g., Sutskever et al. (2013); Smith et al. (2018)), and is based on the following iterative scheme:

 ~vk+1=~γ~vk−~η∇~fk+1(xk),xk+1=xk+~vk+1, (2)

where denotes the iteration number, is the step-size, is the friction, and denotes the velocity (also referred to as momentum). Here, denotes the stochastic gradients defined as follows:

 ∇~fk(x)≜1b∑i∈Ωkf(i)(x), (3)

where denotes a random subset drawn from the set of data points with for all .

When the gradients are computed on all the data points (i.e., ), SGDm becomes deterministic and can be viewed as a discretization of the following continuous-time system (Gao et al., 2018a; Maddison et al., 2018):

 dvt=−(γvt+∇f(xt))dt,dxt=vtdt, (4)

where is still called the velocity. The connection between this system and (2) becomes clearer, if we discretize this system by using the Euler scheme with step-size :

 vk+1=vk−η(γvk+1+∇f(xk)), xk+1=xk+ηvk+1, (5)

and make the change of variables , , and . However, due to the presence of the stochastic gradient noise , the sequence will be a stochastic process and the deterministic system (4) would not be an appropriate proxy.

Understanding the statistical properties of would be of crucial importance as it might reveal the peculiar properties that lie behind the performance of SGDm for learning with neural networks. A popular approach for understanding the dynamics of stochastic optimization algorithms in deep learning is to impose some structure on the noise and relate the process (2) to a stochastic differential equation (SDE) (Mandt et al., 2016; Jastrzebski et al., 2017; Hu et al., 2017; Chaudhari and Soatto, 2018; Zhu et al., 2019; Simsekli et al., 2019). For instance, by assuming that the second-order moments of the stochastic gradient noise are bounded (i.e., for all admissible , ), one might argue that can be approximated by a Gaussian random vector due to the central limit theorem (CLT) (Fischer, 2010). Under this assumption, we might view (2) as a discretization of the following SDE, which is also known as the underdamped or kinetic Langevin dynamics:

 dvt=−(γvt+∇f(xt))dt+√2γ/βdBt dxt=vtdt, (6)

where denotes the -dimensional Brownian motion and is called the inverse temperature variable, measuring the noise intensity along with . It is easy to check that, under very mild assumptions, the solution process admits an invariant distribution whose density is proportional to , where the function is often called the Gaussian kinetic energy (see e.g. (Betancourt et al., 2017)) and the distribution itself is called the Boltzmann-Gibbs measure (Pavliotis, 2014; Gao et al., 2018a; Hérau and Nier, 2004; Dalalyan and Riou-Durand, 2018). We then observe that the marginal distribution in the stationarity has a density proportional to , which indicated that any local minimum of appears as a local maximum of this density. This is a desirable property since it implies that, when the gradient noise has light tails, the process will spend more time near the local minima of . Furthermore, it has been shown that as goes to infinity, the marginal distribution of concentrates around the global optimum . This observation has yielded interesting results for understanding the dynamics of SGDm in the contexts of both sampling and optimization with convex and non-convex potentials (Gao et al., 2018a, b; Zou et al., 2018; Lu et al., 2016).

While the Gaussianity assumption can be accurate in certain settings such as small networks or ResNets (Martin and Mahoney, 2019; Panigrahi et al., 2019), recently it has been empirically demonstrated that in several deep learning setups, the stochastic gradient noise can exhibit a heavy-tailed behavior (Şimşekli et al., 2019; Zhang et al., 2019b). While the Gaussianity assumption would not be appropriate in this case since the conventional CLT would not hold anymore, nevertheless we can invoke the generalized CLT, which states that the asymptotic distribution of will be a symmetric -stable distribution (); a class of distributions that are commonly used in the statistical physics literature as an approximation to heavy-tailed random variables (Sliusarenko et al., 2013; Dubkov et al., 2008). As we will define in more detail in the next section, in the core of , lies the parameter , which determines the heaviness of the tail of the distribution. The tails get heavier as gets smaller, the case reduces to the Gaussian random variables. This is illustrated in Figure 1.

With the assumption of being distributed, the choice of Brownian motion will be no longer appropriate and should be replaced with an -stable Lévy motion, which motivates the following Lévy-driven SDE:

 dxt=vtdt, (7)

where denotes the left limit of and denotes the -stable Lévy process with independent components, which coincides with when . Unfortunately, when , as opposed to its Brownian counterpart, the invariant measures of such SDEs do not admit an analytical form in general; yet, one can still show that the invariant measure cannot be in the form of the Boltzmann-Gibbs measure (Eliazar and Klafter, 2003).

A more striking property of (7) was very recently revealed in a statistical physics study (Capała and Dybiec, 2019), where the authors numerically illustrated that, even when has a single minimum, the invariant measure of (7) can exhibit multiple maxima, none of which coincides with the minimum of . A similar property has been formally proven in the overdamped dynamics with Cauchy noise (i.e., and ) by Sliusarenko et al. (2013). Since the process (7) would spend more time around the modes of its invariant measure (i.e., the high probability region), in an optimization context (i.e., for larger ) the sample paths would concentrate around these modes, which might be arbitrarily distant from the optima of . In other words, the heavy-tails of the gradient noise could result in an undesirable bias, which would be still present even when the step-size is taken to be arbitrarily small. As we will detail in Section 3, informally, this phenomenon stems from the fact that the heavy-tailed noise leads to aggressive updates on , which are then directly transmitted to due to the dynamics. Unless ‘tamed’, these updates create an hurling effect on and drift it away from the modes of the “potential” that is sought to be minimized.

Contributions: In this study, we develop a fractional underdamped Langevin dynamics whose invariant distribution is guaranteed to be in the form of the Boltzmann-Gibbs measure, hence its optima exactly match the optima of . We first prove a general theorem which holds for any kinetic energy function, which is not necessarily the Gaussian kinetic energy. However, it turns out that some components of the dynamics might not admit an analytical form for an arbitrary choice of the kinetic energy. Then we identify two choices of kinetic energies, where all the terms in dynamics can be written in an analytical form or accurately computable. We also analyze the Euler discretization of (14) and identify sufficient conditions for ensuring weak convergence of the ergodic averages computed over the iterates.

We observe that the discretization of the proposed dynamics has interesting algorithmic similarities with natural gradient descent (Amari, 1998) and gradient clipping (Pascanu et al., 2013), which we believe bring further theoretical understanding for their role in deep learning. Finally, we support our theory with experiments conducted on both synthetic settings and neural networks.

2 Technical Background & Related Work

The stable distributions are heavy-tailed distributions that appear as the limiting distribution of the generalized CLT for a sum of i.i.d. random variables with infinite variance (Lévy, 1937). In this paper, we are interested in centered symmetric -stable distribution. A scalar random variable follows a symmetric -stable distribution denoted as if its characteristic function takes the form: , , where and . Here, is known as the tail-index, which determines the tail thickness of the distribution. becomes heavier-tailed as gets smaller. is known as the scale parameter that measures the spread of around . The probability density function of a symmetric -stable distribution, , does not yield closed-form expression in general except for a few special cases. When and , reduces to the Cauchy and the Gaussian distributions, respectively. When , -stable distributions have heavy-tails so that their moments are finite only up to the order in the sense that if and only if , which implies infinite variance.

Lévy motions are stochastic processes with independent and stationary increments. Their successive displacements are random and independent, and statistically identical over different time intervals of the same length, and can be viewed as the continuous-time analogue of random walks. The best known and most important examples are the Poisson process, Brownian motion, the Cauchy process and more generally stable processes. Lévy motions are prototypes of Markov processes and of semimartingales, and concern many aspects of probability theory. We refer to (Bertoin, 1996) for a survey on the theory of Lévy motions.

In general, Lévy motions are heavy-tailed, which make it appropriate to model natural phenomena with possibly large variations, that offer occurs in statistical physics (Eliazar and Klafter, 2003), signal processing (Kuruoglu, 1999), and finance (Mandelbrot, 2013).

We define , a -dimensional symmetric -stable Lévy motion with independent components as follows. Each component of is an independent scalar -stable Lévy process, which is defined as follows: (cf. Figure 1)

• almost surely.

• For any , the increments are independent, .

• The difference and have the same distribution: for .

• has stochastically continuous sample paths, i.e. for any and , as .

When , we obtain a scaled Brownian motion as a special case so that the difference follows a Gaussian distribution and is almost surely continuous. When , due to the stochastic continuity property, symmetric -stable Lévy motions can have have a countable number of discontinuities, which are often known as jumps. The sample paths are continuous from the right and they have left limits, a property known as càdlàg (Duan, 2015).

Recently, Şimşekli (2017) extended the overdamped Langevin dynamics to an SDE driven by , given as:1

 dxt=b(xt−,α)dt+β−1/αdLαt, (8)

where the drift is defined as follows:

 (b(x,α))i=−Dα−2xi(ϕ(x)∂xif(x))/ϕ(x). (9)

Here, and denotes the fractional Riesz derivative (Riesz, 1949):

 Dγu(x):=F−1{|ω|γ(F(u))(ω)}(x), (10)

where denotes the Fourier transform. Briefly, extends usual differentiation to fractional orders and when it coincides (up to a sign difference) with the usual second-order derivative .

The important property of the process (8) is that it admits an invariant distribution whose density is proportional to (Nguyen et al., 2019). It is easy to show that, when , the drift reduces to , hence we recover the classical overdamped dynamics:

 dxt=−∇f(xt)dt+√2/βdBt. (11)

Since the fractional Riesz derivative is costly to compute, Şimşekli (2017) proposed an approximation of based on the alternative definition of given in (Ortigueira, 2006), such that:

 b(x,α)≈−cα∇f(x), (12)

where . This approximation essentially results in replacing with in (11) in a rather straightforward manner. While avoiding the computational issues originated from the Riesz derivatives, as shown in (Nguyen et al., 2019), this approximation can induce an arbitrary bias in a non-convex optimization context. Besides, the stationary distribution of this approximated dynamics was analytically derived in (Sliusarenko et al., 2013) under the choice of and for and . These results show that, in the presence of heavy-tailed perturbations, the drift should be modified, otherwise an inaccurate approximation of the Riesz derivatives can result in an explicit bias, which moves the modes of the distribution away from the modes of .

From a pure Monte Carlo perspective, Ye and Zhu (2018) extended the fractional overdamped dynamics (8) to higher-order dynamics and proposed the so-called fractional Hamiltonian dynamics (FHD), given as follows:

 dxt= Dα−2{ϕ(zt)vt}/ϕ(zt)dt, dvt= −Dα−2{ϕ(zt)∇f(xt)}/ϕ(zt)dt −γDα−2{ϕ(zt)vt}/ϕ(zt)dt+γ1/αdLαt, (13)

where , and . They showed that the invariant measure of the process has a density proportional to , i.e., the Boltzmann-Gibbs measure. Similar to the overdamped case (8), the Riesz derivatives do not admit an analytical form in general. Hence they approximated them by using the same approximation given in (12), which yields the SDE given in (7) (up to a scaling factor). This observation also confirms that the heavy-tailed noise requires an adjustment in the dynamics, otherwise the induced bias might drive the dynamics away from the minima of (Capała and Dybiec, 2019).

3 Fractional Underdamped Langevin Dynamics

In this section, we develop the fractional underdamped Langevin dynamics (FULD), which is expressed by the following SDE:

 dvt=−(γc(vt−,α)+∇f(xt))dt+(γ/β)1/αdLαt, dxt=∇g(vt)dt, (14)

where is the drift function for the velocity and denotes a general notion of kinetic energy. In the next theorem, which is the main theoretical result of this paper, we will identify the relation between these two functions such that the solution process will keep the generalized Boltzmann-Gibbs measure, invariant. All the proofs are given in the appendix. {theorem} Let has the following form:

 (c(v,α))i:=Dα−2vi(ψ(v)∂vig(v))ψ(v),ψ(v):=e−g(v). (15)

The the measure on is an invariant probability measure for the Markov process . One of the main features of FULD is that the fractional Riesz derivatives only appears in the drift , which only depends on . This is highly in contrast with FHD (13), where the Riesz derivatives are taken over both and , which is the source of intractablility. Moreover, FULD enjoys the freedom to choose different kinetic energy functions . In the sequel, we will investigate two options for , such that the drift can be analytically obtained.

3.1 Gaussian kinetic energy

In classical overdamped Langevin dynamics and Hamiltonian dynamics, the default choice of kinetic energy is the Gaussian kinetic energy, which corresponds to taking (Neal, 2010; Livingstone et al., 2019; Dalalyan and Riou-Durand, 2018). With this choice, the fractional dynamics becomes:

 dxt=vtdt. (16)

In the next result, we will show that in this case, the drift admits an analytical solution. {theorem} Let . Then, for any ,

 (c(v,α))i=2α2vi√πΓ(α+12)1F1(2−α2;32;v2i2), (17)

where is the gamma function and is the Kummer confluent hypergeometric function. In particular, when , we have . We observe that the fractional dynamics (16) strictly extends the underdamped Langevin dynamics (6) as .

Let us now investigate the form of the new drift and its implications. In Figure 2, we illustrate for the dimensional case (note that for , each component of still behaves like Figure 2). We observe that due to the hypergeometric function , the drift grows exponentially fast with whenever . Semantically, this means that, in order to be able to compensate the large jumps incurred by , the drift has to react very strongly and hence prevent to take large values. To illustrate this behavior, we provide more visual illustrations in the appendix.

Even though this aggressive behavior of can be beneficial for the continuous-time system, it is unfortunately clear that its Euler-Maruyama discretization will not yield a practical algorithm due to the same behavior. Indeed, we would need the function to be Lipschitz continuous in order to guarantee the algorithmic stability of its discretization (Kloeden and Platen, 2013); however, if we consider the integral form of (cf. (Abramowitz and Stegun, 1972)), we observe that the function

 (c(v,α))i=2α2vi√π⋅Γ(32)Γ(2−α2)∫10ev2i2tt−α2(1−t)α−12dt

is clearly not Lipschitz continuous in . Therefore, we conclude that FULD with the Gaussian kinetic energy is mostly of theoretical interest.

3.2 Alpha-stable kinetic energy

The dynamics with the Gaussian kinetic energy requires a very strong drift mainly because we force the dynamics to make sure that the invariant distribution of to be a Gaussian. Since the Gaussian distribution has light-tails, it cannot tolerate samples with large magnitudes, hence requires a large dissipation to make sure does not take large values.

In order to avoid such an explosive drift that potentially degrades practicality, next we explore heavy-tailed kinetic energies, which would allow the components of to take large values, while still making sure that the drift in (15) admits an analytical form.

In our next result, we show that, when we choose an kinetic energy, such that the tail-index of this kinetic energy matches the one of the driving process , the drift simplifies and becomes the identity function. {theorem} Let be the probability density function of . Choose in (15), where for any . Then,

 (c(v,α))i=vi,1≤i≤d. (18)

This result hints that, perhaps is the natural choice of kinetic energy for the systems driven by .

It now follows from Theorem 3.2 that the FULD with -stable kinetic energy reduces to the following SDE:

 dvt=−γvt−dt−∇f(xt)dt+γ1/αdLαt, (19)

It can be easily verified that for , as , hence, the SDE (19) also reduces to the classical underdamped Langevin dynamics (6).

While this choice of results in an analytically available , unfortunately the function itself admits a closed-form analytical formula only when or , due to the properties of the densities. Nevertheless, as is based on one-dimensional densities, it can be very accurately computed by using the recent methods developed in (Ament and O’Neil, 2018).

We visually inspect the behavior of in Figure 2 for dimension one. We observe that, as soon as , takes a very smooth form. Besides, for small the function behaves like a linear function and when goes to infinity, it vanishes. This behavior can be interpreted as follows: since can take larger values due to the heavy tails of the kinetic energy, in order to be able target the correct distribution, the dynamics compensates the potential bursts in by passing it through the asymptotically vanishing .

3.3 Euler discretization and weak convergence analysis

As visually hinted in Figure 2, the function has strong regularity, which makes (19) to be potentially beneficial for practical implementations. Indeed, it is easy to verify that is Lipschitz continuous for and , and in our next result, we show that this observation is true for any admissible , which is a desired property when discretizing continuous-time dynamics. {proposition} For , the map is Lipschitz continuous, hence is also Lipschitz continuous. Accordingly we consider the following Euler-Maruyama discretization for (19):

 vk+1 =~γkvk−ηk∇f(xk)+(ηkγ/β)1/αsk+1, xk+1 =xk+ηk∇Gα(vk+1), (20)

where , is a random vector whose components are independently distributed, and is a sequence of step-sizes.

In this section, we analyze the weak convergence of the ergodic averages computed by using (20). Given a test function , consider its expectation with respect to the target measure , i.e. with . We will discuss next how this expectation can be approximated through the sample averages

 ¯πK(h):=(1/SK)K∑k=1ηkh(xk), (21)

where is the cumulative sum of the step-size sequence. We note that this notion of convergence is stronger than the convergence of (20) near a local minimum since it requires the convergence of the measure itself, and our analysis can be extended to a global optimization context by using the techniques presented in (Raginsky et al., 2017).

We now present the assumptions that imply our results.

A 1

The step-size sequence is non-increasing and satisfies and .

A 2

Let be a twice continuously differentiable function, satisfying , for some and has a bounded Hessian . Given , there exists , , such that and where is the drift of the process defined in (19).

These are common assumptions ensuring that the SDE is simulated with infinite time-horizon and the process is not explosive (Panloup, 2008; Şimşekli, 2017). We can now establish the weak convergence of (21) and present it as a corollary to Theorem 3, Proposition 3.3, and (Panloup, 2008) (Theorem 2). {corollary} Assume that the gradient is Lipschitz continuous and has linear growth i.e., there exits such that for all . Furthermore, assume that Assumptions 1 and 2 hold for some . If the test function then

 ¯πK(h)→π(h)almost surely as K→∞.

3.4 Connections to existing approaches

We now point out interesting algorithmic connections between (20) and two methods that are commonly used in practice. We first roll back our initial hypothesis that the gradient noise is distributed, i.e., , and modify (20) as follows:

 vk+1 =~γkvk−ηk∇~fk+1(xk), xk+1 =xk+ηk∇Gα(vk+1). (22)

As a special case when , we obtain a stochastic gradient descent-type recursion:

 xk+1=xk−η2k∇Gα(∇~fk+1(xk)). (23)

Let us now consider gradient-clipping, a heuristic approach for eliminating the problem of ‘exploding gradients’, which often appear in training neural networks (Pascanu et al., 2013; Zhang et al., 2019a). Very recently, Zhang et al. (2019b) empirically illustrated that such explosions stem from heavy-tailed gradients and formally proved that gradient clipping indeed improves convergence rates under heavy-tailed perturbations. We notice that, the behavior of (22) is reminiscent of gradient clipping: due to the vanishing behavior of for , as the components of gets larger in magnitude, the update applied on gets smaller. The behavior becomes more prominent in (23). On the other hand, (22) is more aggressive in the sense that the updates can get arbitrarily small as the value of decreases as opposed to being ‘clipped’ with a threshold.

The second connection is with the natural gradient descent algorithm, where the stochastic gradients are pre-conditioned with the inverse Fisher information matrix (FIM) (Amari, 1998). Here FIM is defined as , where the expectation is taken over the data. Notice that when (i.e., Cauchy distribution), we have the following form: . Therefore, we observe that, in (23), can be equivalently written as , where is a diagonal matrix with entries . Therefore, we can see as an estimator of the diagonal part of FIM, as they will be in the same order when is large. Besides, (22) then appears as its momentum extension. However, will be biased mainly due to the fact that FIM is the average of the squared gradients, whereas is based on the square of the average gradients. This connection is rather surprising, since a seemingly unrelated, differential geometric approach turns out to have strong algorithmic similarities with a method that naturally arises when the gradient noise is Cauchy distributed.

4 Numerical Study

In this section, we will illustrate our theory on several experiments which are conducted in both synthetic and real-data settings. We note that, as expected, FULD with Gaussian kinetic energy did not yield a numerically stable discretization due to the explosive behavior of . Hence, in this section, we only focus on FULD with kinetic energy in order to avoid obscuring the results and from now on we will simply refer to FULD with kinetic energy as FULD.

4.1 Synthetic setting

We first consider a one-dimensional synthetic setting, similar to the one considered in (Capała and Dybiec, 2019). We consider a quartic potential function with a quadratic component, . We then simulate the ‘uncorrected dynamics’ (UD) given in (7) and FULD (19) by using the Euler-Maruyama discretization to compare their behavior for different . For , we used the software given in (Ament and O’Neil, 2018) for computing .

Figure 3 illustrates the distribution of the samples generated by simulating the two dynamics. In this setup, we set , , with number of iterations . We observe that, for , FULD very accurately captures the form of the distribution, whereas UD exhibits a visible bias and the shape of its resulting distribution is slightly distorted. Nevertheless, since the perturbations are close to a Gaussian in this case (i.e., is close to ), the difference is not substantial and can be tolerable in an optimization context. However, this behavior becomes much more emphasized when we use a heavier-tailed driving process: when , we observe that the target distribution of UD becomes distant from the Gibbs measure , and more importantly its modes no longer match the minima of ; agreeing with the observations presented in (Capała and Dybiec, 2019). On the other hand, thanks to the correction brought by , FULD still captures the target distribution very accurately, even when the driving force is Cauchy.

On the other hand, in our experiments we observed that, for small values of , UD can quickly become numerically unstable and even diverge for slightly larger step-sizes, whereas this problem never occurred for FULD. This outcome also stems from the fact that UD does not have any mechanism to compensate the potential large updates originating from the heavy-tailed perturbations. To illustrate this observation more clearly, in Figure 4 we illustrate the iterates which were used for producing Figure 3. We observe that, while the iterates of UD are well-behaved for , the magnitude range of the iterates gets quite large when is set to . On the other hand, for both values of , FULD iterates are always kept in a reasonable range, thanks to the clipping-like effect of .

4.2 Neural networks

In our next set of experiments, we evaluate our theory on neural networks. In particular, we apply the iterative scheme given in (22) as an optimization algorithm for training neural networks, and compare its behavior with classical SGDm defined in (2). In this setting, we do not add any explicit noise, all the stochasticity comes from the potentially heavy-tailed stochastic gradients (3).

We consider a fully-connected network for a classification task on the MNIST and CIFAR10 datasets, with different depths (i.e. number of layers) and widths (i.e. number of neurons per layer). For each depth-width pair, we train two neural networks by using SGDm (2) and our modified version (22), and compare their final train/test accuracies and loss values. We use the conventional train-test split of the datasets: for MNIST we have K training and K test samples, and for CIFAR10 these numbers are K and K, respectively. We use the cross entropy loss (also referred to as the ‘negative-log-likelihood’).

We note that the modified scheme (22) reduces to (2) when , since . Hence in this section, we will refer to SGDm as the special case of (22) with . On the other hand, in these experiments, computing becomes impractical for , since the algorithms given in (Ament and O’Neil, 2018) become prohibitively slow with the increased dimension . Hence, we will focus on the analytically available Cauchy case, i.e., , which can be efficiently implemented (cf. its definition in Section 3.4). We expect that, if the stochastic gradient noise can be well-approximated by using a Cauchy distribution, then the modified dynamics should exhibit an improved performance since it would eliminate the potential bias brought by the heavy-tailed noise.

In these experiments, we set , , and run the algorithms for iterations 2. We measure the accuracy and the loss at every 100th iteration and we report the average of the last two measurements. Figure 5 shows the results obtained on the MNIST dataset. We observe that, in most of the cases, setting yields a better performance in terms both training and testing accuracies/losses. This difference becomes more visible when the width is set to : the accuracy difference between the algorithms reaches to . We obtain a similar result on the CIFAR10 dataset, as illustrated in Figure 6. In most of the cases performs better, with the maximum accuracy difference being , implying the gradient noise can be approximated by a Cauchy random variable.

We observed a similar behavior when the width was set to . However, when we set the width to we did not perceive a significant difference in terms of the performance of the algorithms. We suspect that this behavior is due to either the gradient noise is not well-approximated either by a Gaussian or Cauchy. On the other hand, when the width was set to , resulted in a slightly better performance, which would be an indication that the Gaussian approximation is closer. The corresponding figures are provided in the appendix.

5 Conclusion and Future Directions

We considered the continuous-time variant of SGDm, known as the underdamped Langevin dynamics (ULD), and developed theory for the case where the gradient noise can be well-approximated by a heavy-tailed -stable random vector. As opposed to naïvely replacing the driving stochastic force in ULD, which correspondonds to running SGDm with heavy-tailed gradient noise, the dynamics that we developed exactly target the Boltzmann-Gibbs distribution, and hence do not introduce an implicit bias. We further established the weak convergence of the Euler-Maruyama discretization and illustrated interesting connections between the discretized algorithm and existing approaches commonly used in practice. We supported our theory with experiments on a synthetic setting and fully connected neural networks.

Our framework opens up interesting future directions. Our current modeling strategy requires a state-independent, isotropic noise assumption, which would not accurately reflect the reality. While anisotropic noise can be incorporated to our framework by using the approach of Ye and Zhu (2018), state-dependent noise introduces challenging technical difficulties. Similarly, it has been illustrated that the tail-index can depend on the state and different components of the noise can have a different (Şimşekli et al., 2019). Incorporating such state dependencies would be an important direction of future research.

Acknowledgments

We thank Jingzhao Zhang for fruitful discussions. The contribution of Umut Şimşekli to this work is partly supported by the French National Research Agency (ANR) as a part of the FBIMATRIX (ANR-16-CE23-0014) project, and by the industrial chair Data science & Artificial Intelligence from Télécom Paris. Lingjiong Zhu is grateful to the support from Simons Foundation Collaboration Grant. Mert Gürbüzbalaban acknowledges support from the grants NSF DMS-1723085 and NSF CCF-1814888.

Appendix A Proof of Theorem 3

{proof}

Let denote the probability density of . Then it satisfies the fractional Fokker-Planck equation (see Proposition 1 and Section 7 in (Schertzer et al., 2001)):

 ∂tq(x,v,t) =γd∑i=1∂[(c(v,α))iq(x,v,t)]∂vi+d∑i=1∂[∂xif(x)q(x,v,t)]∂vi−γβd∑i=1Dαviq(x,v,t) −d∑i=1∂[(∂vig(v)q(x,v,t)]∂xi

We can compute that

 γd∑i=1∂[(c(v,α))iϕ(x)ψ(v)]∂vi+d∑i=1∂[∂xif(x)ϕ(x)ψ(v)]∂vi−γβd∑i=1Dαviϕ(x)ψ(v)−d∑i=1∂[(∂vig(v)ϕ(x)ψ(v)]∂xi =γβϕ(x)[βd∑i=1∂[(c(v,α))iψ(v)]∂vi−d∑i=1Dαviψ(v)]+d∑i=1∂[∂xif(x)ϕ(x)ψ(v)]∂vi−d∑i=1∂[∂vig(v)ϕ(x)ψ(v)]∂xi.

We can compute that

 βd∑i=1∂[(c(v,α))iψ(v)]∂vi−d∑i=1Dαviψ(v) =−d∑i=1∂2∂v2iDα−2viψ(v)−d∑i=1Dαviψ(v) =d∑i=1D2viDα−2viψ(v)−d∑i=1Dαviψ(v)=0,

where we used the property (Proposition 1 in (Şimşekli, 2017)) and the semi-group property of the Riesz derivative .

We can also compute that

 d∑i=1∂[∂xif(x)ϕ(x)ψ(v)]∂vi−d∑i=1∂[∂vig(v)ϕ(x)ψ(v)]∂xi =d∑i=1∂xif(x)ϕ(x)∂viψ(v)−d∑i=1∂vig(v)ψ(v)∂xiϕ(x) =−βd∑i=1∂f(x)∂xiϕ(x)∂g(v)∂viψ(v)+βd∑i=1∂g(v)∂viψ(v)∂f(x)∂xiϕ(x) =0.

Hence, is an invariant probability measure. The proof is complete.

Appendix B Proof of Theorem 3.1

{proof}

We can compute that

 (c(v,α))i=Dα−2vi(vie−12∥v∥2)e−12∥v∥2=e12v2iDα−2vi(vie−12v2i), (24)

for every .

Recall the definition of Fourier transform and its inverse:

 F{f(x)}(ω)=1√2π∫∞−∞e−ixωf(x)dx,F−1{f(ω)}(x)=1√2π∫∞−∞eixωf(ω)dω. (25)

Notice that the Fourier transform of is itself, i.e. , and moreover, , and therefore,

 F{xe−12x2}(ω)=−iωe−12ω2. (26)

Hence,

 Dα−2x(xe−12x2)=F−1{−iω|ω|α−2e−12ω2}(x)=−i√2π∫∞−∞ω|ω|α−2e−12ω2+iωxdω. (27)

Furthermore, we can compute that

 −i√2π∫∞−∞ω|ω|α−2e−12ω2+iωxdω =−i√2π∫∞0ωα−1e−12ω2+iωxdω+−i√2π∫0−∞ω(−ω)α−2e−12ω2+iωxdω =−i√2π∫∞0ωα−1e−12ω2+iωxdω+i√2π∫∞0ωα−1e−12ω2−iωxdω =√2π∫∞0ωα−1sin(ωx)e−12ω2dω.

By the Taylor expansion of sine function, we get

 √2π∫∞0ωα−1sin(ωx)e−12ω2dω =√2π∫∞0ωα−1∞∑k=0(−1)k(ωx)2k+1(2k+1)!e−12ω2dω =√2π∞∑k=0(−1)kx2k+1(2k+1)!∫∞0ω2k+αe−12ω2dω =√2π∞∑k=0(−1)kx2k+1(2k+1)!22k+α−12Γ(2k+α+12),

where we used the identity , for any given . Moreover, for any given , we have the identity:

 ∞∑k=0(−1)kxk(2k+1)!Γ(k+y)=Γ(y)1F1(y;32;−x4), (28)

where is the Kummer confluent hypergeometric function. Therefore, we conclude that

 √2π∞∑k=0(−1)kx