
Design of generalized fractional order gradient descent method

Yiheng Wei, Yu Kang, Weidi Yin, and Yong Wang All the authors are with the Department of Automation, University of Science and Technology of China. Y. Wang is the corresponding author. e-mail: neudawei@ustc.edu.cn; kangduyu@ustc.edu.cn; yinwd@mail.ustc.edu.cn; yongwang@ustc.edu.cn.
Abstract

This paper focuses on the convergence problem of the emerging fractional order gradient descent method and proposes three solutions to overcome it. In fact, the general fractional gradient method cannot converge to the real extreme point of the target function, which critically hampers its application. Because of the long memory characteristics of the fractional derivative, the fixed memory principle is a natural first choice. Apart from the truncation of the memory length, two new methods are developed to reach convergence: one is the truncation of the infinite series, and the other is the modification of the constant fractional order. Finally, six illustrative examples are presented to demonstrate the effectiveness and practicability of the proposed methods.

Fractional gradient, convergence design, truncation, fixed memory step, variable fractional order.

I Introduction

The gradient descent method is prevalently used in many research fields, such as optimization [1, 2], machine learning [3, 4] and image denoising [5, 6]. Through plenty of studies and experiments, it is known that the gradient method is one of the most effective and efficient ways to find the optimal solution of optimization problems. Nowadays, one of the key points of the gradient method is how to further improve its performance [7]. As an important branch of mathematics, fractional calculus is believed to be a good tool for improving the traditional gradient descent method, mainly because of its special long memory characteristics and nonlocality [8, 9].

Some remarkable progress in studies of the fractional gradient method not only reveals some interesting properties, but also gives practical suggestions for future research. In [10], the authors proposed a fractional gradient method that uses the Caputo nabla difference, with an order no greater than one, as the iterative order instead of the first order difference. Though this method ensures convergence, the convergence speed becomes slow. Interestingly, a similar idea can be found in [11], where the Riemann–Liouville fractional derivative was initially used as a substitute for the first order gradient. With the adoption of fractional calculus, this newly developed method manifests distinct properties. For example, its iterative search process can easily pass over local extreme points. However, one cannot guarantee that the extreme point can be found using the method in [11] even if the algorithm is indeed convergent. Additionally, it is difficult to calculate the needed fractional derivative online.

This shortcoming has been partially overcome in [12]. Despite some minor errors in the calculation procedure, the developed method has been successfully applied to speech enhancement [12] and channel equalization [13]. After that, by approximating the exact fractional derivative, a new fractional gradient method was proposed by us [14]. Subsequently, this method was used in system identification [15]. Research indicated that the introduction of the fractional gradient method enhanced the performance of the classical algorithm. Additionally, the promising fractional gradient method has been widely used in many applications, such as the least mean square algorithm [16], back propagation neural networks [17], recommender systems [18], etc.

It is worth pointing out that although the fractional gradient method has been used successfully, the related research is still in its infancy and deserves further investigation. The immediate problem is the convergence problem. On one hand, [11] has found that the fractional extreme value is not equal to the real extreme value of the target function, which makes the fractional gradient method lose practicability. However, the main reason for this nonconvergence is unclear, and effective solutions for realizing convergence are still desirable. On the other hand, if the update equation is not searching in the right direction toward the real extreme point of the cost function, the convergence speed cannot be fast enough. Therefore, for considerable performance, convergence alone cannot meet the requirements, and the convergence speed should also be taken into account.

Although there are many problems and much work to be completed on this issue, we have reason to believe that subsequent studies can break new ground in the future. Inspired by the discussions above, the objectives of this paper are: i) investigating the extreme point and extreme value of the fractional gradient method; ii) designing solutions to the convergence problem; and iii) considering and improving the convergence speed.

The remainder of the paper is organized as follows. Section II is devoted to mathematical preparation and problem formulation. Solutions to the convergence problem of the fractional gradient method are introduced in Section III. Section IV shows some numerical examples to verify the proposed methods. Finally, the conclusion is presented in Section V.

II Preliminaries

This section presents a brief introduction to the mathematical background of fractional calculus and the fatal flaw of fractional order gradient descent method.

II-A Fractional calculus

The commonly used definitions of fractional calculus are Grünwald–Letnikov, Riemann–Liouville and Caputo [19]. Because the first two definitions are identical under certain conditions and Grünwald–Letnikov definition is usually used in numerical calculation, only Riemann–Liouville and Caputo definitions are considered in this study.

The Riemann–Liouville derivative and Caputo derivative of order $\alpha$, $n-1<\alpha<n$, $n\in\mathbb{N}_+$, are expressed as

$${}^{\mathrm{RL}}_{t_0}\mathscr{D}^{\alpha}_{t}f(t)=\frac{1}{\Gamma(n-\alpha)}\frac{\mathrm{d}^{n}}{\mathrm{d}t^{n}}\int_{t_0}^{t}\frac{f(\tau)}{(t-\tau)^{\alpha-n+1}}\,\mathrm{d}\tau, \qquad (1)$$

$${}^{\mathrm{C}}_{t_0}\mathscr{D}^{\alpha}_{t}f(t)=\frac{1}{\Gamma(n-\alpha)}\int_{t_0}^{t}\frac{f^{(n)}(\tau)}{(t-\tau)^{\alpha-n+1}}\,\mathrm{d}\tau, \qquad (2)$$

respectively, where $t_0$ is the lower terminal and $\Gamma(\cdot)$ is the Gamma function. Notably, the fractional derivatives in (1) and (2) are actually special integrals, and they manifest the long memory characteristics rather than only the local property of the signal $f(t)$.

If the function $f(t)$ can be expanded as a Taylor series, the fractional derivatives can be rewritten as follows

$${}^{\mathrm{RL}}_{t_0}\mathscr{D}^{\alpha}_{t}f(t)=\sum_{i=0}^{+\infty}\binom{\alpha}{i}\frac{(t-t_0)^{i-\alpha}}{\Gamma(i+1-\alpha)}f^{(i)}(t), \qquad (3)$$

$${}^{\mathrm{C}}_{t_0}\mathscr{D}^{\alpha}_{t}f(t)=\sum_{i=n}^{+\infty}\binom{\alpha-n}{i-n}\frac{(t-t_0)^{i-\alpha}}{\Gamma(i+1-\alpha)}f^{(i)}(t), \qquad (4)$$

where $n-1<\alpha<n$, $n\in\mathbb{N}_+$, and $\binom{\cdot}{\cdot}$ is the generalized binomial coefficient. From the two formulas, it can be directly concluded that the fractional derivative, whether in the Riemann–Liouville or the Caputo definition, consists of various integer order derivatives.

In general, the fractional derivative can be regarded as the natural generalization of the conventional derivative. However, it is worth mentioning that the fractional derivative has the special nonlocality, which will play a pivotal role in fractional gradient descent method.
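As a quick numerical illustration of the series representation above, the short Python sketch below evaluates the Caputo derivative of $f(t)=e^t$ (lower terminal $t_0=0$, order $0<\alpha<1$) through partial sums of its Taylor-series form; the choice of function, order and truncation lengths is ours, purely for illustration.

```python
import math

def caputo_exp_series(t, alpha, terms):
    """Partial sum of the Taylor-series form of the Caputo derivative of exp(t)
    with lower terminal t0 = 0 and 0 < alpha < 1:
        D^alpha exp(t) = sum_{i>=1} t^(i - alpha) / Gamma(i + 1 - alpha)."""
    return sum(t ** (i - alpha) / math.gamma(i + 1 - alpha)
               for i in range(1, terms + 1))

t, alpha = 1.0, 0.5
for terms in (2, 5, 10, 20):
    print(terms, caputo_exp_series(t, alpha, terms))
# The partial sums settle after a handful of terms, which is what makes the
# finite-sum approximation mentioned later in Remark 2 workable in practice.
```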

II-B Problem statement

The well-known gradient descent method is a first-order iterative optimization algorithm for finding the minimum of a function. To this end, one typically takes steps proportional to the negative of the gradient (or approximate gradient) of the function at the current point. For example, the variable $x$ is updated by the following law

$$x_{k+1}=x_k-\mu f^{(1)}(x_k), \qquad (5)$$

where $x_k$ is the current position, $x_{k+1}$ is the next position, $\mu>0$ is the learning rate and $f^{(1)}(x_k)$ is the first-order gradient of $f(\cdot)$ at $x_k$. When the classical gradient is replaced by the fractional one, it follows that

$$x_{k+1}=x_k-\mu\,{}^{\mathrm{C}}_{t_0}\mathscr{D}^{\alpha}_{x_k}f(x). \qquad (6)$$

Due to the property of fractional calculus, namely, ${}^{\mathrm{C}}_{t_0}\mathscr{D}^{\alpha}_{t}f(t)=f^{(1)}(t)$ for $\alpha=1$, (6) degenerates into (5) exactly when $\alpha=1$.

For , its exact extreme point is . Its first-order derivative is , while for , its fractional order derivative satisfies

(7)
(8)

Consider the following three cases

and set , , and . On this basis, the simulation results are shown in Fig. 1.

Fig. 1: Search process of gradient descent method.

It is clearly observed that all the three cases can realize the convergence within steps, while only case 1 is able to converge to the exact extreme point. If the algorithm is convergent for case 2, the convergent point satisfies . With the exception of , one obtains

(9)

which matches the green dot dash line well.

Similarly, when case 3 is convergent, the final value can be calculated by , which leads to

(10)

This just coincides with the blue dotted line.

To sum up, fractional extreme points are different from the actual extreme point, since the fractional derivative of depends on the variable and is also connected with the order and the initial instant . It is not easy to ensure and at the actual extreme point. Without loss of generality, consider , , , and then similar conclusions can be drawn. If and , then case 2 may not converge. Even if it is convergent, it will never converge to . Besides, case 3 will never converge to either. This kind of convergence problem will inevitably cause, to some extent, performance deterioration when the fractional gradient method is used in practical applications. Therefore, it is necessary to deal with the convergence problem mentioned above.
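To make the phenomenon above easy to reproduce, the following sketch runs the plain fractional update (6) on an assumed quadratic $f(x)=(x-5)^2$ with a fixed lower terminal $t_0=0$; the target function, order, learning rate and initial point are illustrative assumptions rather than the settings used in Fig. 1. For this quadratic the Caputo derivative has a simple closed form, and the iteration settles at $5(2-\alpha)$ instead of the true minimizer $5$, while rerunning with $\alpha=1$ recovers the classical method and the exact extreme point.

```python
import math

def caputo_grad_quadratic(x, alpha, t0=0.0, a=5.0):
    """Closed-form Caputo derivative of f(x) = (x - a)^2 taken from the lower
    terminal t0, valid for 0 < alpha <= 1 and x > t0:
        2*(t0 - a)/Gamma(2 - alpha)*(x - t0)^(1 - alpha)
      + 2/Gamma(3 - alpha)*(x - t0)^(2 - alpha)."""
    if alpha == 1.0:
        return 2.0 * (x - a)                    # classical first order gradient
    d = x - t0
    return (2.0 * (t0 - a) / math.gamma(2 - alpha) * d ** (1 - alpha)
            + 2.0 / math.gamma(3 - alpha) * d ** (2 - alpha))

def plain_fractional_descent(alpha, x0=8.0, mu=0.1, iters=200):
    """Plain fractional update with a constant lower terminal t0 = 0."""
    x = x0
    for _ in range(iters):
        x = x - mu * caputo_grad_quadratic(x, alpha)
    return x

print(plain_fractional_descent(0.7))   # ~6.5 = 5*(2 - 0.7), not the true minimizer 5
print(plain_fractional_descent(1.0))   # ~5.0, i.e. the classical gradient method
```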

III Main Results

To solve the convergence problem, three viable solutions are proposed by fully considering the nonlocality of fractional derivative. Additionally, the convergence and implementation are described in detail.

III-A Fixed memory step

From the discussion in the previous section, it can be found that the gradient descent method cannot converge to the real extreme point once the order extends to the non-integer case. Considering the short memory characteristics of the classical derivative and the long memory characteristics of the fractional derivative, an intuitive idea is to replace the constant initial instant $t_0$ with the varying initial instant $x_{k-K}$. The resulting method follows immediately

$$x_{k+1}=x_k-\mu\,{}^{\mathrm{C}}_{x_{k-K}}\mathscr{D}^{\alpha}_{x_k}f(x), \qquad (11)$$

where $K\in\mathbb{N}_+$ is the fixed memory step.

This design is inspired by our previous work in [20, 14]. When the algorithm is convergent, it converges to the actual extreme point. Now, let us prove that converges to by contradiction. Suppose that converges to a point different from and ; then . Therefore, for any sufficiently small positive scalar , there exists a sufficiently large number such that for any .

By combining formulas (3) and (11), one has the following inequality

(12)

where and .

Actually, one can always find an satisfying . From the assumption on , it becomes . Hence, can be derived, which contradicts the fact . This completes the proof of convergence.

The main idea of this method is called the fixed memory principle. The fractional derivative in (11) can be calculated with the help of (4). Then the corresponding fractional gradient method can be expressed as

$$x_{k+1}=x_k-\mu\sum_{i=n}^{+\infty}\binom{\alpha-n}{i-n}\frac{(x_k-x_{k-K})^{i-\alpha}}{\Gamma(i+1-\alpha)}\,f^{(i)}(x_k). \qquad (13)$$

To facilitate the understanding, the proposed algorithm is briefly introduced in Algorithm 1.
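A minimal sketch of the fixed memory idea is given below for the same assumed quadratic $f(x)=(x-5)^2$: at every iteration the lower terminal of the Caputo derivative is moved to the iterate $K$ steps back, and $|x_k-x_{k-K}|$ is used in the fractional powers to avoid complex values (cf. Remark 3). The quadratic, the warm-start strategy and all parameter values are assumptions for illustration and do not reproduce (11)-(13) literally.

```python
import math

def fixed_memory_descent(alpha=0.7, mu=0.05, x0=8.0, K=1, iters=1000):
    """Sketch of the fixed memory principle on the assumed quadratic
    f(x) = (x - 5)^2: the lower terminal of the Caputo derivative is moved to
    the iterate K steps back, and |x_k - x_{k-K}| is used in the fractional
    powers to keep the update real-valued. For a quadratic the series contains
    only two terms, so no truncation error is involved."""
    fp = lambda x: 2.0 * (x - 5.0)        # first derivative of the quadratic
    fpp = 2.0                             # second derivative of the quadratic
    history = [x0]
    x = x0
    for k in range(iters):
        if k < K:
            x_new = x - mu * fp(x)        # warm start with classical steps
        else:
            c = history[k - K]            # moving lower terminal x_{k-K}
            d = abs(x - c)
            grad = (fp(c) / math.gamma(2 - alpha) * d ** (1 - alpha)
                    + fpp / math.gamma(3 - alpha) * d ** (2 - alpha))
            x_new = x - mu * grad
        history.append(x_new)
        x = x_new
    return x

# Unlike the plain fractional update with a fixed lower terminal, the iterate
# now settles at the true minimizer 5 (larger K may call for a smaller mu).
print(fixed_memory_descent())   # ~5.0
```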

III-B Higher order truncation

Recalling the classical gradient descent method, it can be obtained that for any positive , if it is small enough, one has . Moreover, it becomes

(14)

When the dominant factor emerges, the iteration completes, which confirms the fact that the first-order derivative is equal to at the extreme point. From this point of view, only the relevant term is retained and the other terms are omitted, resulting in a new fractional gradient descent method.

(15)

To avoid the appearance of a complex number, the update law in (15) can be rewritten as

(16)

To prevent the emergence of a denominator of , i.e., , a small nonnegative number is introduced to modify the update law further as

(17)

If is a convex function with a unique extreme point and it has a Lipschitz continuous gradient with the Lipschitz constant , can guarantee the convergence, where .

Assuming that is the exact extreme point, one obtains and then

(18)

where . Then the convergence condition of the sequence becomes

(19)

From these hypotheses, one has and . Then, the desired learning rate appears.

Similarly, a brief description of the algorithm is shown in Algorithm 2.
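The sketch below illustrates the kind of truncated update discussed in this subsection, assuming that the retained term is the first order one, that the lower terminal is taken as the previous iterate, and that $\epsilon$ is the small regularizing constant mentioned above; these are common choices in the fractional gradient literature rather than a literal transcription of (15)-(17).

```python
import math

def truncated_fractional_descent(grad, alpha=0.7, mu=0.1, x0=8.0,
                                 eps=1e-8, iters=500):
    """Sketch of a truncated fractional update: only the first order term of
    the series is kept, so the fractional part acts as a positive step modifier,
        x_{k+1} = x_k - mu * grad(x_k) / Gamma(2 - alpha)
                           * (|x_k - x_{k-1}| + eps)^(1 - alpha)."""
    x_prev, x = x0, x0 - mu * grad(x0)          # one classical step to start
    for _ in range(iters):
        step = (abs(x - x_prev) + eps) ** (1 - alpha) / math.gamma(2 - alpha)
        x_prev, x = x, x - mu * grad(x) * step
    return x

# Assumed quadratic f(x) = (x - 5)^2: the iteration can only stall where
# grad(x) = 0, so it recovers the true minimizer.
print(truncated_fractional_descent(lambda x: 2.0 * (x - 5.0)))   # ~5.0
```

Because the fractional factor is merely a positive scalar, the iteration can only stall where the first order gradient vanishes, which matches the intuition behind the convergence argument above.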

III-C Variable fractional order

It is well known that the traditional gradient method can converge to the exact extreme point. For this reason, adjusting the order at each step is an alternative method. If the target function satisfies and , one can design the variable fractional order as follows

(20)
(21)
(22)

where the loss function and the constant . Besides, it is noticed that

(23)
(24)

At the beginning of learning, is a relatively large value and then , which results in quick learning. Subsequently, gets close to gradually and then , which leads to accurate learning. In the end, is expected.

However, the order is constructed under the assumption on . If the minimum value of is nonzero or even negative, the designed orders in (20)-(22) will no longer work. In this case, the function is redefined as

(25)

and thereby the designed orders are revived. At this point, the corresponding method can be expressed as

(26)

The description of the variable fractional order gradient descent method is given in Algorithm 3.
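To illustrate the variable order idea, the sketch below drives the order toward one as the loss shrinks; the schedule used here, the assumed quadratic target and all parameter values are illustrative choices only and are not the paper's designs (20)-(22).

```python
import math

def variable_order_descent(alpha0=0.7, mu=0.1, x0=8.0, c=1.0, iters=300):
    """Sketch of a variable order scheme on the assumed quadratic
    f(x) = (x - 5)^2 with fixed lower terminal t0 = 0. The order is pushed
    toward 1 as the loss f(x_k) shrinks, so the update behaves fractionally at
    first and like the classical gradient method near the extreme point. The
    schedule alpha_k = 1 - (1 - alpha0) * L_k / (L_k + c), with L_k = f(x_k),
    is an illustrative choice, not the paper's (20)-(22)."""
    f = lambda x: (x - 5.0) ** 2
    x = x0
    for _ in range(iters):
        loss = f(x)
        alpha_k = 1.0 - (1.0 - alpha0) * loss / (loss + c)
        # closed-form Caputo derivative of the quadratic with lower terminal 0
        grad = (2.0 * (0.0 - 5.0) / math.gamma(2 - alpha_k) * x ** (1 - alpha_k)
                + 2.0 / math.gamma(3 - alpha_k) * x ** (2 - alpha_k))
        x = x - mu * grad
    return x

print(variable_order_descent())   # ~5.0: the true extreme point is recovered
```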

Remark 1

In this paper, three solutions are developed to solve the nonconvergence problem of the fractional order gradient descent method. The first method benefits from the long memory characteristics of the fractional derivative, and the fixed memory principle is adopted there. The second method weakens the nonlocality of the fractional derivative by keeping only the first order term. The third method changes the order with the loss function at each step. This work surely reinvigorates the existing method, which will make it promising, potent and practical.

Remark 2

In general, it is difficult or even impossible to obtain the analytic form of the fractional derivative for arbitrary functions. With regard to this work, the equivalent representation in formulas (3)-(4) plays a critical role. In particular, to avoid the implementation difficulties brought by the infinite series, the approximate finite sum is an alternative. Furthermore, the following formulas (27)-(28) could also be equally adopted in the fractional gradient method.

$${}^{\mathrm{RL}}_{t_0}\mathscr{D}^{\alpha}_{t}f(t)=\sum_{i=0}^{+\infty}\frac{f^{(i)}(t_0)}{\Gamma(i+1-\alpha)}(t-t_0)^{i-\alpha}, \qquad (27)$$

$${}^{\mathrm{C}}_{t_0}\mathscr{D}^{\alpha}_{t}f(t)=\sum_{i=n}^{+\infty}\frac{f^{(i)}(t_0)}{\Gamma(i+1-\alpha)}(t-t_0)^{i-\alpha}. \qquad (28)$$
Remark 3

To convey the main contributions of this work clearly, some points are listed here.

  1. In Algorithm 1, is generally selected, when is used. When it is selectively replaced by or , the range of can extend to .

  2. In Algorithm 2, when combined with the fixed memory principle, the leading term in (17), i.e., , could be modified by , and then the range of its application is enlarged to .

  3. In Algorithm 3, could also find its substitute as or .

  4. Note that (20)-(22) are not the unique forms of the order and any valid forms are suitable here.

  5. Only Caputo definition is considered in constructing the solutions, while the Riemann–Liouville case can still be handled similarly.

  6. Although this paper only focuses on the scalar fractional gradient method, the multivariate case can also be established with a similar treatment; a coordinate-wise sketch is given after this remark.

It is intended that in-depth studies in these directions will be undertaken as future course of work.
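As a companion to item 6 of Remark 3, the following sketch applies the truncated update coordinate-wise to a two-variable convex quadratic; the target function, parameters and the per-coordinate treatment of the fractional factor are illustrative assumptions.

```python
import math

def multivariate_truncated_descent(alpha=0.7, mu=0.1, iters=500, eps=1e-8):
    """Coordinate-wise truncated fractional update on the assumed quadratic
    f(x, y) = (x - 3)^2 + 2*(y + 1)^2; each coordinate carries its own
    |difference| factor, mirroring the scalar rule."""
    grad = lambda p: (2.0 * (p[0] - 3.0), 4.0 * (p[1] + 1.0))
    p_prev = (8.0, 4.0)
    p = tuple(v - mu * g for v, g in zip(p_prev, grad(p_prev)))  # classical first step
    for _ in range(iters):
        g = grad(p)
        p_next = tuple(
            v - mu * gv * (abs(v - vp) + eps) ** (1 - alpha) / math.gamma(2 - alpha)
            for v, vp, gv in zip(p, p_prev, g))
        p_prev, p = p, p_next
    return p

print(multivariate_truncated_descent())   # approximately (3.0, -1.0)
```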


IV Simulation Study

In this section, several examples are provided to explicitly demonstrate the validity of the proposed solutions. Examples 1-3 aim at verifying the convergence design. Examples 4 and 5 consider a target function with a nonzero minimum value and the notable Rosenbrock function, respectively. Example 6 gives an application regarding the LMS algorithm.

Example 1

Recalling and setting the fractional order , the learning rate , and the initial point , the results using Algorithm 1 with , and are given in Fig. 2. It is clearly seen that in every case the expected convergence can be reached within steps. Additionally, the speed of convergence becomes more rapid as decreases.

Fig. 2: Algorithm 1 with different .

When the parameter is set as , the learning rate varies from to and then the simulation with Algorithm 1 is conducted once again. Fig. 3 indicates that with the increase of the learning rate, the convergence gets faster and the overshoot emerges gradually.

Fig. 3: Algorithm 1 with different .
Example 2

Let us continue to consider the target function . Setting , , , and , Algorithm 2 is adopted to search for the minimum value point. It can be clearly observed from Fig. 4 that the proposed method is effective and that the convergence accelerates as the order is reduced.

Fig. 4: Algorithm 2 with different .

Similarly, provided and , the related simulation is performed and the results are shown in Fig. 5. This picture suggests that the algorithm with different initial points converges simultaneously.

Fig. 5: Algorithm 2 with different .
Example 3

Look into the target function once again. The following three cases are considered

and the other parameters are set as , and . The corresponding results are depicted in Fig. 6. It is shown that all the designed orders achieve convergence as expected, while it remains valuable work to design a suitable order for better performance.

Fig. 6: Algorithm 3 with different .

In the previous simulation, the initial instant is randomly selected and different from . To test the influence of , a series of values are configured for one after another. Besides, the aforementioned case 3 is still applied here and the results are shown in Fig. 7. They illustrate that as increases, the search process gets slower, while all of the cases converge as expected.

Fig. 7: Algorithm 3 with different .
Example 4

Consider the target function proposed in [11], i.e., . The real extreme point of is and the extreme value is nonzero. Setting , , , , , , and , the results using the three proposed methods are given in Figs. 8-10.

Fig. 8: Variation of .
Fig. 9: Variation of .
Fig. 10: Contour plot.

As revealed in Fig. 8 and Fig. 9, the proposed algorithms converge simultaneously in the and coordinate directions, and the actual extreme points are reached. To give a more intuitive understanding, the convergence trajectories are displayed in Fig. 10. The iterative search process of proceeds faster than that of , since the coefficient about is larger under the same condition. Extra simulation indicates that when the learning rates are set separately as and instead of , the convergence on will speed up and the overshoot on will disappear with properly defined , .

Notably, the variable fractional orders are designed individually, namely,

(29)

where and . Actually, the order can also be designed uniformly

(30)

where and is the weighting factor.

The solid cyan line manifests that the general fractional order gradient descent method cannot converge to the exact extreme point, which coincides with the claim in [11].

Example 5

Consider the famous Rosenbrock function [21], i.e., . The real extreme point of is . Choose the parameters as , , , , , and . Because the minimum value of the target function is , the formula (22) is adopted to calculate a common . The learning rates are selected separately for the three algorithms, i.e., , and . On this basis, three proposed methods are implemented numerically and the simulation curves are recorded in Figs. 11-13.

Fig. 11: Variation of .
Fig. 12: Variation of .
Fig. 13: Contour plot.

The Rosenbrock function is a non-convex benchmark function, which is often used as a performance test problem for optimization algorithms [21]. Its extreme point lies in a long, narrow, parabolic valley, which is difficult to reach. Fig. 11 and Fig. 12 show the variation of and for the Rosenbrock case, respectively. The general fractional case is not given in the figures because it again converges to a point different from the exact extreme point. The three proposed methods are shown in the figures, and it turns out that Algorithm 2 and Algorithm 3 converge to the extreme point at around the th iteration. Algorithm 1 approximates the extreme point while small deviations still exist. More specifically, defining the error as

(31)

then it can be calculated that and . Though Algorithm 1 performs poorly in the later period, it is clearly ahead of the other algorithms within the first iterations and it is the first one to enter the error band. According to Fig. 13, all the curves come together soon after leaving the starting point. Although Algorithm 1 does not converge to the extreme point in the previous two figures, it converges in the right direction. It is plain that the red dotted line in Fig. 13, which denotes the result of Algorithm 1, would eventually reach the extreme point .

Example 6

Consider the third order transversal filtering problem shown in Fig. 14. The optimal tap weight is . The known input and unknown noise are given in Fig. 15. Select the parameters , , , , , and . are designed according to (22) with . Then, all three proposed methods can be used to estimate the parameters of the filter, and the simulation results are given in Fig. 16.

Fig. 14: The block diagram of the transversal filter.
Fig. 15: Known input and unknown noise.
Fig. 16: Responses of the tap weight .

As can be seen from Fig. 16, all of them accomplish the parameter estimation successfully. It is no exaggeration to say that the three methods have considerable convergence speed. Fig. 16 also demonstrates that the proposed solutions resolve the convergence problem of the fractional gradient descent method and that the resulting methods are implementable in practical cases.
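For readers who want to reproduce the flavour of this example without the original data, the sketch below identifies the taps of a three-tap transversal filter by scaling the classical LMS correction with the truncated fractional factor; the filter coefficients, input statistics, noise level and step sizes are all assumptions and do not correspond to Figs. 14-16.

```python
import math
import random

def fractional_lms_demo(alpha=0.9, mu=0.05, eps=1e-8, samples=5000, seed=0):
    """Fractional-LMS-style tap weight estimation (illustrative setup only):
    the truncated fractional factor scales the classical LMS correction for
    each tap individually."""
    random.seed(seed)
    w_true = [0.5, -0.3, 0.8]                    # assumed optimal tap weights
    w, w_prev = [0.0, 0.0, 0.0], [0.0, 0.0, 0.0]
    u = [0.0, 0.0, 0.0]                          # regressor: last three inputs
    for _ in range(samples):
        u = [random.gauss(0.0, 1.0)] + u[:2]
        d = sum(wt * ui for wt, ui in zip(w_true, u)) + random.gauss(0.0, 0.01)
        e = d - sum(wi * ui for wi, ui in zip(w, u))
        w_next = [wi + mu * e * ui
                  * (abs(wi - wpi) + eps) ** (1 - alpha) / math.gamma(2 - alpha)
                  for wi, wpi, ui in zip(w, w_prev, u)]
        w_prev, w = w, w_next
    return w

print(fractional_lms_demo())   # close to the assumed taps [0.5, -0.3, 0.8]
```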

V Conclusions

In this paper, the convergence problem of the fractional order gradient descent method has been tentatively investigated. It turns out that the general fractional gradient method cannot converge to the actual extreme point. By exploiting the natural properties of the fractional derivative, three individual solutions are proposed in detail, including the fixed memory step, the higher order truncation, and the variable fractional order. Both theoretical analysis and simulation study indicate that all the designed methods can achieve true convergence quickly. It is believed that this work is beneficial for solving pertinent optimization problems with fractional order methods. The following issues would be topics of future research.

  1. Design and analyze new convergence design solutions.

  2. Extend the results to the fractional Lipschitz condition.

  3. Consider nonsmooth or nonconvex target functions.

References

  • [1] J. S. Zeng and W. T. Yin, “On nonconvex decentralized gradient descent,” IEEE Transactions on Signal Processing, vol. 66, no. 11, pp. 2834–2848, 2018.
  • [2] H. Q. Li, S. Liu, Y. C. Soh, and L. H. Xie, “Event-triggered communication and data rate constraint for distributed optimization of multiagent systems,” IEEE Transactions on Systems, Man, and Cybernetics: Systems, vol. 48, no. 11, pp. 1908–1919, 2018.
  • [3] Y. LeCun, Y. Bengio, and G. Hinton, “Deep learning,” Nature, vol. 521, pp. 436–444, 2015.
  • [4] Z. J. Li, H. Z. Xiao, C. G. Yang, and Y. W. Zhao, “Model predictive control of nonholonomic chained systems using general projection neural networks optimization,” IEEE Transactions on Systems, Man, and Cybernetics: Systems, vol. 45, no. 10, pp. 1313–1321, 2015.
  • [5] A. Beck and M. Teboulle, “Fast gradient-based algorithms for constrained total variation image denoising and deblurring problems.” IEEE Transactions on Image Processing, vol. 18, no. 11, pp. 2419–2434, 2009.
  • [6] Y. F. Pu, J. L. Zhou, and X. Yuan, “Fractional differential mask: a fractional differential-based approach for multiscale texture enhancement,” IEEE Transactions on Image Processing, vol. 19, no. 2, pp. 491–511, 2010.
  • [7] S. Boyd and L. Vandenberghe, Convex Optimization.   Cambridge: Cambridge University Press, 2004.
  • [8] P. Liu, Z. G. Zeng, and J. Wang, “Multiple Mittag–Leffler stability of fractional-order recurrent neural networks,” IEEE Transactions on Systems, Man, and Cybernetics: Systems, vol. 47, no. 8, pp. 2279–2288, 2017.
  • [9] W. D. Yin, Y. H. Wei, T. Y. Liu, and Y. Wang, “A novel orthogonalized fractional order filtered-x normalized least mean squares algorithm for feedforward vibration rejection,” Mechanical Systems and Signal Processing, vol. 119, pp. 138–154, 2019.
  • [10] Y. Tan, Z. Q. He, and B. Y. Tian, “A novel generalization of modified LMS algorithm to fractional order,” IEEE Signal Processing Letters, vol. 22, no. 9, pp. 1244–1248, 2015.
  • [11] Y. F. Pu, J. L. Zhou, Y. Zhang, N. Zhang, G. Huang, and P. Siarry, “Fractional extreme value adaptive training method: fractional steepest descent approach,” IEEE Transactions on Neural Networks and Learning Systems, vol. 26, no. 4, pp. 653–662, 2015.
  • [12] C. C. Tseng and S. L. Lee, “Designs of fractional derivative constrained 1-D and 2-D FIR filters in the complex domain,” Signal Processing, vol. 95, pp. 111–125, 2014.
  • [13] S. M. Shah, R. Samar, N. M. Khan, and M. A. Z. Raja, “Design of fractional-order variants of complex LMS and NLMS algorithms for adaptive channel equalization,” Nonlinear Dynamics, vol. 88, no. 2, pp. 839–858, 2017.
  • [14] Y. Q. Chen, Q. Gao, Y. H. Wei, and Y. Wang, “Study on fractional order gradient methods,” Applied Mathematics and Computation, vol. 314, pp. 310–321, 2017.
  • [15] S. S. Cheng, Y. H. Wei, D. Sheng, Y. Q. Chen, and Y. Wang, “Identification for Hammerstein nonlinear ARMAX systems based on multi-innovation fractional order stochastic gradient,” Signal Processing, vol. 142, pp. 1–10, 2017.
  • [16] S. Zubair, N. I. Chaudhary, Z. A. Khan, and W. Wang, “Momentum fractional LMS for power signal parameter estimation,” Signal Processing, vol. 142, pp. 441–449, 2018.
  • [17] J. Wang, Y. Q. Wen, Y. D. Gou, Z. Y. Ye, and H. Chen, “Fractional-order gradient descent learning of BP neural networks with Caputo derivative,” Neural Networks, vol. 89, pp. 19–30, 2017.
  • [18] Z. A. Khan, N. I. Chaudhary, and S. Zubair, “Fractional stochastic gradient descent for recommender systems,” Electronic Markets, pp. 1–11, 2018.
  • [19] I. Podlubny, Fractional Differential Equations: an Introduction to Fractional Derivatives, Fractional Differential Equations, to Methods of Their Solution and Some of Their Applications.   San Diego: Academic Press, 1998.
  • [20] Y. H. Wei, Y. Q. Chen, S. S. Cheng, and Y. Wang, “A note on short memory principle of fractional calculus,” Fractional Calculus and Applied Analysis, vol. 20, no. 6, pp. 1382–1404, 2017.
  • [21] H. Rosenbrock, “An automatic method for finding the greatest or least value of a function,” The Computer Journal, vol. 3, no. 3, pp. 175–184, 1960.