A Conservation Law Method in Optimization
Abstract
We propose some algorithms to find local minima in nonconvex optimization and to obtain global minima in some degree from the Newton Second Law without friction. With the key observation of the velocity observable and controllable in the motion, the algorithms simulate the Newton Second Law without friction based on symplectic Euler scheme. From the intuitive analysis of analytical solution, we give a theoretical analysis for the highspeed convergence in the algorithm proposed. Finally, we propose the experiments for strongly convex function, nonstrongly convex function and nonconvex function in highdimension.
1 Introduction
Nonconvex optimization is the dominating algorithmic technique behind many stateofart results in machine learning, computer vision, natural language processing and reinforcement learning. Finding a global minimizer of a nonconvex optimization problem is NPhard. Instead, the local search method become increasingly important, which is based on the method from convex optimization problem. Formally, the problem of unconstrained optimization is stated in general terms as that of finding the minimum value that a function attains over Euclidean space, i.e.
Numerous methods and algorithms have been proposed to solve the minimization problem, notably gradient methods, Newton’s methods, trustregion method, ellipsoid method and interiorpoint method [25, 19, 31, 16, 5, 6].
Firstorder optimization algorithms are the most popular algorithms to perform optimization and by far the most common way to optimize neural networks, since the secondorder information obtained is supremely expensive. The simplest and earliest method for minimizing a convex function is the gradient method, i.e.,
(1.1) 
There are two significant improvements of the gradient method to speed up the convergence. One is the momentum method, named as Polyak heavy ball method, first proposed in [24], i.e.,
(1.2) 
Let be the condition number, which is the ratio of the smallest eigenvalue and the largest eigenvalue of Hessian at local minima. The momentum method speed up the local convergence rate from to . The other is the notorious Nesterov’s accelerated gradient method, first proposed in [18] and an improved version [20, 19], i.e.
(1.3) 
where the parameter is set as
The scheme devised by Nesterov does not only own the property of the local convergence for strongly convex function, but also is the global convergence scheme, from to for strongly convex function and from to for nonstrongly convex function.
Although there is the complex algebraic trick in Nesterov’s accelerated gradient method, the three methods above can be considered from continuoustime limits [24, 27, 29, 30] to obtain physical intuition. In other words, the three methods can be regarded as the discrete scheme for solving the ODE. The gradient method (1.1) is correspondent to
(1.4) 
and the momentum method and Nesterov accelerated gradient method are correspondent to
(1.5) 
the difference of which are the setting of the friction parameter . There are two significant intuitive physical meaning in the two ODEs (1.4) and (1.5). The ODE (1.4) is the governing equation for potential flow, a correspondent phenomena of waterfall from the height along the gradient direction. The infinitesimal generalization is correspondent to heat conduction in nature. Hence, the gradient method (1.1) is viewed as the implement in computer or optimization simulating the phenomena in the real nature. The ODE (1.5) is the governing equation for the heavy ball motion with friction. The infinitesimal generalization is correspondent to chord vibration in nature. Hence, the momentum method (1.2) and the Nesterov’s accelerated gradient method (1.3) are viewed as the update version implement in computer or optimization by use of setting the friction force parameter .
Furthermore, we can view the three methods above as the thought for dissipating energy implemented in the computer. The unknown objective function in black box model can be viewed as the potential energy. Hence, the initial energy is from the potential function at any position to the minimization value at the position . The total energy is combined with the kinetic energy and the potential energy. The key observation in this paper is that we find the kinetic energy, or the velocity, is observable and controllable variable in the optimization process. In other words, we can compare the velocities in every step to look for local minimum in the computational process or reset them to zero to arrive to artificially dissipate energy.
Let us introduce firstly the governing motion equation in a conservation force field, that we use in this paper, for comparison as below,
(1.6) 
The concept of phase space, developed in the late 19th century, usually consists of all possible values of position and momentum variables. The governing motion equation in a conservation force field (1.6) can be rewritten as
(1.7) 
In this paper, we implement our discrete strategy with the utility of the observability and controllability of the velocity, or the kinetic energy, as well as artificially dissipating energy for two directions as below,

To look for local minima in nonconvex function or global minima in convex function, the kinetic energy, or the norm of the velocity, is compared with that in the previous step, it will be reset to zero until it becomes larger no longer.

To look for global minima in nonconvex function, an initial larger velocity is implemented at the any initial position . A ball is implemented with (1.7), the local maximum of the kinetic energy is recorded to discern how many local minima exists along the trajectory. Then implementing the strategy above to find the minimum of all the local minima.
For implementing our thought in practice, we utilize the scheme in the numerical method for Hamiltonian system, the symplectic Euler method. We remark that a more accuracy version is the StörmerVerlet method for practice.
1.1 An Analytical Demonstration For Intuition
For a simple 1D function with illconditioned Hessian, with the initial position at . The solution and the function value along the solution for (1.4) are given by
(1.8)  
(1.9) 
The solution and the function value along the solution for (1.5) with the optimal friction parameter are
(1.10)  
(1.11) 
The solution and the function value along the solution for (1.7) are
(1.12)  
(1.13) 
stop at the point that arrive maximum. Combined with (1.9), (1.11) and (1.13) with stop at the point that arrive maximum, the function value approximating are shown as below,
From the analytical solution for local convex quadratic function with maximum eigenvalue and minimum eigenvalue , in general, the step size by for momentum method and Nesterov accelerated gradient method, hence the simple estimate for iterative times is approximately
hence, the iterative times is proportional to the reciprocal of the square root of minimal eigenvalue , which is essentially different from the convergence rate of the gradient method and momentum method.
The rest of the paper is organized as follows. Section 2 summarize relevant existing works. In Section 3, we propose the artificially dissipating energy algorithm, energy conservation algorithm and the combined algorithm based on the symplectic Euler scheme, and remark a secondorder scheme — the StörmerVerlet scheme . In Section 4, we propose the locally theoretical analysis for HighSpeed converegnce. In section 5, we propose the experimental result for the proposed algorithms on strongly convex function, nonstrongly convex function and nonconvex function in highdimension. Section 6 proposes some perspective view for the proposed algorithms and two adventurous ideas based on the evolution of Newton Second Law — fluid and quantum.
2 Related Work
The history of gradient method for convex optimization can be back to the time of Euler and Lagrange. However, since it is relatively cheaper to only calculate for firstorder information, this simplest and earliest method is still active in machine learning and nonconvex optimization, such as the recent work [8, 1, 13, 10]. The natural speedup algorithms are the momentum method first proposed in [24] and Nesterov accelerated gradient method first proposed in [18] and an improved version [20]. A acceleration algorithm similar as Nesterov accelerated gradient method, named as FISTA, is designed to solve composition problems [4]. A related comprehensive work is proposed in [6].
The original momentum method, named as Polyak heavy ball method, is from the view of ODE in [24], which contains extremely rich physical intuitive ideas and mathematical theory. An extremely important work in application on machine learning is the backpropagation learning with momentum [26]. Based on the thought of ODE, a lot of understanding and application on the momentum method and Nesterov accelerated gradient methods have been proposed. In [28], a welldesigned random initialization with momentum parameter algorithm is proposed to train both DNNs and RNNs. A seminal deep insight from ODE to understand the intuition behind Nesterov scheme is proposed in [27]. The understanding for momentum method based on the variation perspective is proposed on [29], and the understanding from Lyaponuv analysis is proposed in [30]. From the stability theorem of ODE, the gradient method always converges to local minima in the sense of almost everywhere is proposed in [13]. Analyzing and designing iterative optimization algorithms built on integral quadratic constraints from robust control theory is proposed in [14].
Actually the “high momentum” phenomenon has been firstly observed in [21] for a restarting adaptive accelerating algorithm, and also the restarting scheme is proposed by [27]. However, both works above utilize restarting scheme for an auxiliary tool to accelerate the algorithm based on friction. With the concept of phase space in mechanics, we observe that the kinetic energy, or velocity, is controllable and utilizable parameter to find the local minima. Without friction term, we can still find the local minima only by the velocity parameter. Based on this view, the algorithm is proposed very easy to practice and propose the theoretical analysis. Meanwhile, the thought can be generalized to nonconvex optimization to detect local minima along the trajectory of the particle.
3 Symplectic Scheme and Algorithms
In this chapter, we utilize the firstorder symplectic Euler scheme from numerically solving Hamiltonian system as below
(3.1) 
to propose the corresponding artifically dissipating energy algorithm to find the global minima for convex function, or local minima in nonconvex function. Then by the observability of the velocity, we propose the energy conservation algorithm for detecting local minima along the trajectory. Finally, we propose a combined algorithm to find better local minima between some local minima.
Remark 3.1.
In all the algorithms below, the symplectic Euler scheme can be taken place by the StörmerVerlet scheme, i.e.
(3.2) 
which works perfectly better than the symplectic scheme even if doubling step size and keep the leftright symmetry of the Hamiltonian system. The StörmerVerlet scheme is the natural discretization for 2ndorder ODE
(3.3) 
which is named as leapfrog scheme in PDEs. We remark that the discrete scheme (3.3) is different from the finite difference approximation by the forward Euler method to analyze the stability of 2nd ODE in [27], since the momentum term is biased.
3.1 The Artifically Dissipating Energy Algorithm
Firstly, the artificially dissipating energy algorithm based on (3.1) is proposed as below.
Remark 3.2.
In the actual algorithm 1, the codes in line and are not need in the while loop in order to speed up the computation.
3.1.1 A Simple Example For Illustration
Here, we use a simple convex quadratic function with illconditioned eigenvalue for illustration as below,
(3.4) 
of which the maximum eigenvalue is and the minimum eigenvalue is . Hence the scale of the step size for (3.4) is
In figure 2, we demonstrate the convergence rate of gradient method, momentum method, Nesterov accelerated gradient method and artifically dissipating energy method with the common step size and , where the optimal friction parameter for momentum method and Nesterov accelerated gradient method with . A further result for comparison with the optimal step size in gradient method , the momentum method , and Nesterov accelerated gradient method with and the artifically disspating energy method with shown in figure 3.
With the illustrative convergence rate, we need to learn the trajectory. Since the trajectories of all the four methods are so narrow in illcondition function in (3.4), we use a relatively goodconditioned function to show it as in figure 4.
A clear fact in figure 4 shows that the gradient correction decrease the oscillation to comparing with momentum method. A more clear observation is that artificially dissipating method owns the same property with the other three method by the law of nature, that is, if the trajectory come into the local minima in one dimension will not leave it very far. However, from figure 2 and figure 3, the more rapid convergence rate from artificially dissipating energy method has been shown.
3.2 Energy Conservation Algorithm For Detecting Local Minima
Here, the energy conservation algorithm based on (3.1) is proposed as below.
Remark 3.3.
3.2.1 The Simple Example For Illustration
Here, we use the nonconvex function for illustration as below,
(3.5) 
which is the 2ndorder smooth function but not 3rdorder smooth. The maximum eigenvalue can be calculated as below
then, the step length is set . We illustrate that the algorithm 2 simulate the trajectory and find the local minima in figure 5.
Another 2D potential function is shown as below,
(3.6) 
which is the smooth function with domain in . The maximum eigenvalue can be calculated as below
then, the step length is set . We illustrate that the algorithm 2 simulate the trajectory and find the local minima in figure 6.
Remark 3.4.
We point out that for the energy conservation algorithm for detecting local minima along the trajectory cannot detect saddle point in the sense of almost everywhere, since the saddle point in original function is also a saddle point for the energy function . The proof process is fully the same in [13].
3.3 Combined Algorithm
Finally, we propose the comprehensive algorithm combining the artificially dissipating energy algorithm (algorithm 1) and the energy conservation algorithm (algorithm 2) to find global minima.
Remark 3.5.
We remark that the combined algorithm (algorithm 3) cannot guarantee to find global minima if the initial position is not ergodic. The tracking local minima is dependent on the trajectory. However, the time of computation and precision based on the proposed algorithm is far better than the large sampled gradient method. Our proposed algorithm first makes the global minima found become possible.
4 An Asymptotic Analysis for The Phenomena of Local HighSpeed Convergence
In this section, we analyze the phenomena of highspeed convergence shown in figure 1, figure 2 and figure 3. Without loss of generality, we use the translate transformation ( is the point of local minima) and into (3.1), shown as below,
(4.1) 
the locally linearized scheme of which is given as below,
(4.2) 
Remark 4.1.
The local linearized analysis is based on the stability theorem in finite dimension, the invariant stable manifold theorem and HartmanGrobman linearized map theorem [11]. The thought is firstly used in [24] to estimate the local convergence of momentum method. And in the paper [13], the thought is used to exclude the possiblity to converegnce to saddle point. However, the two theorems above belong to the qualitative theorem of ODE. Hence, the linearized scheme (4.2) is only an approximate estimate for the original scheme (4.1) locally.
4.1 Some Lemmas For The Linearized Scheme
Let be the positivesemidefinite and symmetric matrix to represent in (4.2).
Lemma 4.2.
The numerical scheme, shown as below
(4.3) 
is equivalent to the linearized symplecticEuler scheme (4.2), where we note that the linear transformation is
(4.4) 
Proof.
∎
Lemma 4.3.
For every matrix in (4.4), there exists the orthogonal transformation such that the matrix is similar as below
(4.5) 
where is matrix with the form
(4.6) 
where is the eigenvalue of the matrix .
Proof.
Let be the diagonal matrix with the eigenvalues of the matrix as below
Since is positive definite and symmetric, there exists orthogonal matrix such that
Let be the permuation matrix satisfying
where is the row index and is the column index. Then, let , we have by conjugation
∎
Lemma 4.4.
For any step size satisfying , the eigenvalues of the matrix are complex with absolute value .
Proof.
For , we have
∎
Let and for for the new coordinate variables as below
(4.8) 
In order to make and located in , we need to shrink to .
Lemma 4.5.
Proof.
With SumProduct identities of trigonometric function, we have
Since , we have , we can obtain that
and with the coordinate transfornation in (4.8), we have
Next, we use SumProduct identities of trigonometric function furthermore,
and with , we have
∎
4.2 The Asymptotic Analysis
Theorem 4.7.
Comparing (4.12) and (4.7), we can obtain that
With the intial value , then the initial value for (4.7) is . In order to make sure the numerical solution, or the iterative solution owns the same behavior as the analytical solution, we need to set .
Remark 4.8.
Here, the behavior is similar as the thought in [13]. The step size make sure the global convergence of gradient method. And the step size make the uniqueness of the trajectory along the gradient method, the thought of which is equivalent of the existencen and uniqueness of the solution for ODE. Actually, the step size owns the property with the solution of ODE, the continouslimit version. A global existence of the solution for gradient system is proved in [23].
For the goodconditioned eigenvalue of the Hessian , every method such as gradient method, momentum method, Nesterov accelerated gradient method and artificially dissipating energy method has the good convergence rate shown by the experiment. However, for our artificially dissipating energy method, since there are trigonometric functions from (4.12), we cannot propose the rigorous mathematic proof for the convergence rate. If everybody can propose a theoretical proof, it is very beautiful. Here, we propose a theoretical approximation for illconditioned case, that is, the direction with small eigenvalue .
Assumption A1.
If the step size for (4.2), for the illconditioned eigenvalue , the coordinate variable can be approximated by the analytical solution as
(4.13) 
Theorem 4.9.
For every illconditioned eigendirection, with every initial condition , if the algorithm 1 is implemented at , then there exist an eigenvalue such that
Proof.
When , then . While for the , we can write in the analytical form,
if there is no , increase with increasing. ∎
For some such that approximating , we have
(4.15)  
where . Hence, with approximating , approximatie with the linear convergence, but the coefficient will also decay with the rate with . With the Laurent expansion for at , i.e.,
the coefficient has the approximating formula
where is an arbitrary large real number in for .
5 Experimental Demonstration
In this section, we implement the artificially dissipating energy algorithm (algorithm 1), energy conservation algorithm (algorithm 2) and the combined algorithm (algorithm 3) into highdimension data for comparison with gradient method, momentum method and Nesterov accelerated gradient method.
5.1 Strongly Convex Function
Here, we investigate the artificially dissipating energy algorithm (algorithm 1) for the strongly convex function for comparison with gradient method, momentum method and Nesterov accelerated gradient method (strongly convex case) by the quadratic function as below.
(5.1) 
where is symmetric and positivedefinite matrix. The two cases are shown as below:

The generative matrix is random positive definite matrix with eigenvalue from to with one defined eigenvalue . The generative vector follows i.i.d. Gaussian distribution with mean and variance .

The generative matrix is the notorious example in Nesterov’s book [19], i.e.,
the eigenvalues of the matrix are
and is the dimension of the matrix . The eigenvector can be solved by the second Chebyshev’s polynomial. We implement and is zero vector. Hence, the smallest eigenvalue is approximating
5.2 NonStrongly Convex Function
Here, we investigate the artificially dissipating energy algorithm (algorithm 1) for the nonstrongly convex function for comparison with gradient method, Nesterov accelerated gradient method (nonstrongly convex case) by the logsumexp function as below.
(5.2) 
where is the matrix with , the column vector of and is the vector with component . is the parameter. We show the experiment in (5.2): the matrix and the vector are set by the entry following i.i.d Gaussian distribution for the paramter and .
5.3 Nonconvex Function
For the nonconvex function, we exploit classical test function, known as artificial landscape, to evaluate characteristics of optimization algorithms from general performance and precision. In this paper, we show our algorithms implementing on the StyblinskiTang function and Shekel function, which is recorded in the virtual library of simulation experiments^{1}^{1}1https://www.sfu.ca/ ssurjano/index.html. Firstly, we investigate StyblinskiTang function, i.e.
(5.3) 
to demonstrate the general performance of the algorithm 2 to track the number of local minima and then find the local minima by algorithm 3.
To the essential 1D nonconvex StyblinskiTang function of high dimension, we implement the algorithm 3 to obtain the precision of the global minima as below.
Local_min1  Local_min2  Local_min3  Local_min4  
Initial Position  (5,5,…)  (5,5,…)  (5,5,…)  (5,5,…) 
Position  (2.7486,2.7486,…)  (2.9035,2.9035,…)  (2.7486,2.9035,…)  (2.9035,2.7486,…) 
Function Value  250.2945  391.6617  320.9781  320.9781 
The global minima calculated at the position is shown on the Table 1. And the real global minima at is .
Furthermore, we demonstrate the numerical experiment from StyblinskiTang function to more complex Shekel function
(5.4) 
where
and

Case , the global minima at is .

From the position , the experimental result with the step length and the iterative times is shown as below

From the position , the experimental result with the step length and the iterative times is shown as below
