Accelerating the DC algorithm for smooth functions

Accelerating the DC algorithm for smooth functions

Francisco J. Aragón Artacho Department of Mathematics, University of Alicante, Spain. E-mail: francisco.aragon@ua.es    Ronan M.T. Fleming Systems Biochemistry Group, Luxembourg Centre for Systems Biomedicine, University of Luxembourg, Campus Belval, 4362 Esch-sur-Alzette, Luxembourg. E-mail: ronan.mt.fleming@gmail.com    Phan T. Vuong Systems Biochemistry Group, Luxembourg Centre for Systems Biomedicine, University of Luxembourg, Campus Belval, 4362 Esch-sur-Alzette, Luxembourg. E-mail: vuongphantu@gmail.com
13th July 2019
Abstract

We introduce two new algorithms to minimise smooth difference of convex (DC) functions that accelerate the convergence of the classical DC algorithm (DCA). We prove that the point computed by DCA can be used to define a descent direction for the objective function evaluated at this point. Our algorithms are based on a combination of DCA together with a line search step that uses this descent direction. Convergence of the algorithms is proved and the rate of convergence is analysed under the Łojasiewicz property of the objective function. We apply our algorithms to a class of smooth DC programs arising in the study of biochemical reaction networks, where the objective function is real analytic and thus satisfies the Łojasiewicz property. Numerical tests on various biochemical models clearly show that our algorithms outperforms DCA, being on average more than four times faster in both computational time and the number of iterations. Numerical experiments show that the algorithms are globally convergent to a non-equilibrium steady state of various biochemical networks, with only chemically consistent restrictions on the network topology.

plus 1fil

1 Introduction

Many problems arising in science and engineering applications require the development of algorithms to minimise a nonconvex function. If a nonconvex function admits a decomposition, this may be exploited to tailor specialised optimisation algorithms. Our main focus is the following optimisation problem

(1)

where are continuously differentiable convex functions and

(2)

In our case, as we shall see in Section 4, this problem arises in the study of biochemical reaction networks. In general, is a nonconvex function. The function in problem (1) belongs to two important classes of functions: the class of functions that can be decomposed as a sum of a convex function and a differentiable function (composite functions) and the class of functions that are representable as difference of convex functions (DC functions).

In 1981, M. Fukushima and H. Mine [7, 17] introduced two algorithms to minimise a composite function. In both algorithms, the main idea is to linearly approximate the differentiable part of the composite function at the current point and then minimise the resulting convex function to find a new point. The difference between the new and current points provides a descent direction with respect to the composite function, when it is evaluated at the current point. The next iteration is then obtained through a line search procedure along this descent direction. Algorithms for minimising composite functions have been extensively investigated and found applications to many problems such as: inverse covariance estimate, logistic regression, sparse least squares and feasibility problems, see e.g. [9, 14, 15, 19] and the references quoted therein.

In 1986, T. Pham Dinh and S. El Bernoussi [24] introduced an algorithm to minimise DC functions. In its simplified form, the Difference of Convex functions Algorithm (DCA) linearly approximates the concave part of the objective function ( in (1)) about the current point and then minimises the resulting convex approximation to the DC function to find the next iteration, without recourse to a line search. The main idea is similar to Fukushima–Mine approach but was extended to the non-differentiable case. This algorithm has been extensively studied by H.A. Le Thi, T. Pham Dinh and their collaborators, see e.g. [12, 13, 23, 10]. DCA has been successfully applied in many fields, such as machine learning, financial optimisation, supply chain management and telecommunication [6, 22, 10]. Nowadays, DC programming plays an important role in nonconvex programming and DCA is commonly used because of its key advantages: simplicity, inexpensiveness and efficiency [10]. Some results related to the convergence rate for special classes of DC programs have been also established [11, 13].

In this paper we introduce two new algorithms to find stationary points of DC programs, called Boosted Difference of Convex function Algorithms (BDCA), which accelerate DCA with a line search using an Armijo type rule. The first algorithm directly uses a backtracking technique, while the second uses a quadratic interpolation of the objective function together with backtracking. Our algorithms are based on both DCA and the proximal point algorithm approach of Fukushima–Mine. First, we compute the point generated by DCA. Then, we use this point to define the search direction. This search direction coincides with the one employed by Fukushima–Mine in [7]. The key difference between their method and ours is the starting point used for the line search: in our algorithms we use the point generated by DCA, instead of using the previous iteration. This scheme works thanks to the fact that the defined search direction is not only a descent direction for the objective function at the previous iteration, as observed by Fukushima–Mine, but is also a descent direction at the point generated by DCA. Unfortunately, as shown in Remark 3.1, this scheme cannot be extended in general for nonsmooth functions, as the defined search direction might be an ascent direction at the point generated by DCA.

Moreover, it is important to notice that the iterations of Fukushima–Mine and BDCA never coincide, as the largest step size taken in their algorithm is equal to one (which gives the DCA iteration). In fact, for smooth functions, the iterations of Fukushima–Mine usually coincide with the ones generated by DCA, as the step size equal to one is normally accepted by their Armijo rule.

We should point out that DCA is a descent method without line search. This is something that is usually claimed to be advantageous in the large-scale setting. Our purpose here is the opposite: we show that a line search can increase the performance even for high-dimensional problems.

Further, we analyse the rate of convergence under the Łojasiewicz property [16] of the objective function. It should be mentioned that the Łojasiewicz property is recently playing an important role for proving the convergence of optimisation algorithms for analytic cost functions, see e.g. [1, 3, 4, 13].

We have performed numerical experiments in functions arising in the study of biochemical reaction networks. We show that the problem of finding a steady state of these networks, which plays a crucial role in the modelling of biochemical reaction systems, can be reformulated as a minimisation problem involving DC functions. In fact, this is the main motivation and starting point of our work: when one applies DCA to find a steady state of these systems, the rate of convergence is usually quite slow. As these problems commonly involve hundreds of variables (even thousands in the most complex systems, as Recon 2111Recon 2 is the most comprehensive representation of human metabolism that is applicable to computational modelling [25]. This biochemical network model involves more than four thousand molecular species and seven thousand reversible elementary reactions.), the speed of convergence becomes crucial. In our numerical tests we have compared BDCA and DCA for finding a steady state in various biochemical network models of different size. On average, DCA needed five times more iterations than BDCA to achieve the same accuracy, and what is more relevant, our implementation of BDCA was more than four times faster than DCA to achieve the same accuracy. Thus, we prove both theoretically and numerically that BDCA results more advantageous than DCA. Luckily, the objective function arising in these biochemical reaction networks is real analytic, a class of functions which is known to satisfy the Łojasiewicz property [16]. Therefore, the above mentioned convergence analysis results can be applied in this setting.

The rest of this paper is organised as follows. In Section 2, we recall some preliminary facts used throughout the paper and we present the main optimisation problem. Section 3 describes our main results, where the new algorithms (BDCA) and their convergence analysis for solving DC programs are established. A DC program arising in biochemical reaction network problems is introduced in Section 4. Numerical results comparing BDCA and DCA on various biochemical network models are reported in Section 5. Finally, conclusions are stated in the last section.

2 Preliminaries

Throughout this paper, the inner product of two vectors is denoted by , while denotes the induced norm, defined by . The nonnegative orthant in  is denoted by  and denotes the closed ball of center and radius . The gradient of a differentiable function at some point is denoted by .

Recall that a function is said to be convex if

Further, is called strongly convex with modulus if

or, equivalently, when is convex. The function is said to be coercive if   whenever

On the other hand, a function is said to be monotone when

Further, is called strongly monotone with modulus when

The function is called Lipschitz continuous if there is some constant such that

is called locally Lipschitz continuous if for every in , there exists a neighbourhood of such that restricted to is Lipschitz continuous.

We have the following well-known result.

Proposition 2.1.

Let be a differentiable function. Then f is (strongly) convex if and only if  is (strongly) monotone.

To establish our convergence results, we will make use of the Łojasiewicz property, defined next.

Definition 2.1.

Let be a differentiable function.

  1. The function is said to have the Łojasiewicz property if for any critical point , there exist constants and such that

    (3)

    where we adopt the convention . The constant is called Łojasiewicz exponent of at

  2. The function is said to be real analytic if for every ,  may be represented by a convergent power series in some neighbourhood of .

Proposition 2.2.

[16] Every real analytic function satisfies the Łojasiewicz property with exponent .

Problem (1) can be easily transformed into an equivalent problem involving strongly convex functions. Indeed, choose any and consider the functions and . Then and are strongly convex functions with modulus and , for all . In this way, we obtain the equivalent problem

(4)

The key step to solve  with DCA is to approximate the concave part of the objective function  by its affine majorisation and then minimise the resulting convex function. The algorithm proceeds as follows.

ALGORITHM 1: (DCA, [12]) Let be any initial point and set . Solve the strongly convex optimisation problem to obtain the unique solution . If then STOP and RETURN , otherwise set , set , and go to Step 2.

In [7] Fukushima and Mine adapted their original algorithm reported in [17] by adding a proximal term to the objective of the convex optimisation subproblem. As a result they obtain an optimisation subproblem that is identical to the one in Step 2 of DCA, when one transforms (1) into (4) by adding to each convex function. In contrast to DCA, Fukushima–Mine algorithm [7] also includes a line search along the direction to find the smallest nonnegative integer such that the Armijo type rule

(5)

is satisfied, where and . Thus, when  satisfies (5), i.e. when

one has  and the iterations of both algorithms coincide. As we shall see in Proposition 3.1, this is guaranteed to happen if .

3 Boosted DC Algorithms

Let us introduce our first algorithm to solve , which we call a Boosted DC Algorithm with Backtracking. The algorithm is a combination of Algorithm 1 and the algorithm of Fukushima–Mine [7].

ALGORITHM 2: (BDCA-Backtracking) Fix , and . Let be any initial point and set . Solve the strongly convex minimisation problem to obtain the unique solution . Set . If , STOP and RETURN . Otherwise, go to Step 4. Set . WHILE DO . Set . If then STOP and RETURN , otherwise set , and go to Step 2.

The next proposition shows that the solution of , which coincides with the DCA subproblem in Algorithm 1, provides a decrease in the value of the objective function. For the sake of completeness, we include its short proof.

Proposition 3.1.

For all , it holds that

(6)
Proof.

Since is the unique solution of the strongly convex problem , we have

(7)

which implies

On the other hand, the strong convexity of implies

Adding the two previous inequalities, we have

which implies (6). ∎

If , the iterations of BDCA-Backtracking coincide with those of DCA, since the latter sets . Next we show that is a descent direction for at . Thus, one can achieve a larger decrease in the value of by moving along this direction. This simple fact, which permits an improvement in the performance of DCA, constitutes the key idea of our algorithms.

Proposition 3.2.

For all , we have

(8)

that is, is a descent direction for at .

Proof.

The function is strongly convex with constant . This implies that is strongly monotone with constant ; whence,

Further, since is the unique solution of the strongly convex problem , we have

which implies,

and completes the proof. ∎

Remark 3.1.

In general, Proposition 3.2 does not remain valid when is not differentiable. In fact, the direction might be an ascent direction, in which case Step 4 in Algorithm 4 could become an infinite loop. For instance, consider and for . If , one has

whose unique solution is . Then, the one-sided directional derivative of at in the direction is given by

Thus, is an ascent direction for at (actually, is the global minimum of ).

As a corollary, we deduce that the backtracking Step 4 of Algorithm 2 terminates finitely when .

Corollary 3.1.

Suppose that . Then, for all there is some such that

(9)
Proof.

If there is nothing to prove. Otherwise, by the mean value theorem, there is some  such that

As  is continuous at , there is some  such that

Since , then for all , we deduce

and the proof is complete. ∎

Remark 3.2.

Notice that . Therefore, Algorithm 2 uses the same direction as the Fukushima–Mine algorithm [7], where for some and some nonnegative integer . The iterations would be the same if . Nevertheless, as , the step size chosen in the Fukushima–Mine algorithm [7] is always less than or equal to zero, while in Algorithm 2, only step sizes are explored. Moreover, observe that the Armijo type rule (5), as used in [7], searches for an such that , whereas Algorithm 2 searches for a such that . We know from (6) and (9) that

thus, Algorithm 2 results in a larger decrease in the value of at each iteration than DCA, which sets and . Therefore, a faster convergence of Algorithm 2 compared with DCA is expected, see Figure 1 and Figure 3.

The following convergence results were inspired by [3], which in turn were adapted from the original ideas of Łojasiewicz; see also [5, Section 3.2].

Proposition 3.3.

For any , either Algorithm 2 returns a stationary point of  or it generates an infinite sequence such that the following holds.

  1. is monotonically decreasing and convergent to some .

  2. Any limit point of is a stationary point of . If in addition, is coercive then there exits a subsequence of which converges to a stationary point of .

  3. and .

Proof.

Because of (7), if Algorithm 2 stops at Step 3 and returns , then must be a stationary point of . Otherwise, by Proposition 3.1 and Step 4 of Algorithm 2, we have

(10)

Hence, as the sequence is monotonically decreasing and bounded from below by (2), it converges to some , which proves (i). Consequently, we have

Thus, by (10), one has

Let be any limit point of , and let be a subsequence of converging to . Since , one has

Taking the limit as in (7), as and are continuous, we have .

If is coercive, since the sequence is convergent, then the sequence is bounded. This implies that there exits a subsequence of converging to , a stationary point of , which proves (ii).

To prove (iii), observe that (10) implies that

(11)

Summing this inequality from  to , we obtain

(12)

whence, taking the limit when

so we have . Since

we obtain

and the proof is complete. ∎

We will employ the following useful lemma to obtain bounds on the rate of convergence of the sequences generated by Algorithm 2. This result appears within the proof of [3, Theorem 2] for specific values of  and . See also [13, Theorem 3.3], or very recently, [15, Theorem 3].

Lemma 3.1.

Let  be a sequence in  and let  be some positive constants. Suppose that and that the sequence satisfies

(13)

Then

  1. if , the sequence  converges to  in a finite number of steps;

  2. if , the sequence  converges linearly to  with rate ;

  3. if , there exists  such that

Proof.

If , then (13) implies

and (i) follows.

Assume that . Since , we have that  for all  large enough. Thus, by (13), we have

Therefore, ; i.e., converges linearly to  with rate .

Suppose now that . If  for some , then (13) implies . Then the sequence converges to zero in a finite number of steps, and thus (iii) trivially holds. Hence, we will assume that  and that (13) holds for all , for some positive integer . Consider the decreasing function defined by . By (13), for , we have

As , this implies that

for all . Thus, summing for from to , we have

which gives, for all ,

Therefore, there is some  such that

which completes the proof. ∎

Theorem 3.1.

Suppose that is locally Lipschitz continuous and satisfies the Łojasiewicz property with exponent . For any , consider the sequence generated by Algorithm 2. If the sequence has a cluster point , then the whole sequence converges to  which is a stationary point of . Moreover, denoting , the following estimations hold:

  1. if then the sequences  and converge in a finite number of steps to  and , respectively;

  2. if then the sequences  and converge linearly to  and , respectively;

  3. if then there exist some positive constants and such that

    for all large .

Proof.

By Proposition 3.3, we have . If is a cluster point of , then there exists a subsequence of that converges to . By continuity of , we have that

Hence, is finite and has the same value  at every cluster point of . If for some , then for any , since the sequence is decreasing. Therefore, for all and Algorithm 2 terminates after a finite number of steps. From now on, we assume that for all .

As  satisfies the Łojasiewicz property, there exist and such that

(14)

Further, as is locally Lipschitz around , there are some constants and such that

(15)

Let . Since and , we can find an index large enough such that

(16)

By Proposition 3.3(iii), we know that . Then, taking a larger if needed, we can assure that

We now prove that, for all , whenever it holds

(17)

Indeed, consider the concave function defined as . Then, we have

Substituting in this inequality by and by and using (14) and then (11), one has

(18)

On the other hand, since and

using (15), we obtain

(19)

Combining (18) and (19), we obtain (17).

From (17), as , we deduce

(20)

for all such that

We prove by induction that for all . Indeed, from (16) the claim holds for . We suppose that it also holds for , with . Then (20) is valid for . Therefore

where the last inequality follows from (16).

Adding (20) from to one has

(21)

Taking the limit as , we can conclude that

(22)

This means that is a Cauchy sequence. Therefore, since  is a cluster point of , the whole sequence converges to . By Proposition 3.3, must be a stationary point of .

For , it follows from (14), (15) and (11) that

(23)

where . By applying Lemma 3.1 with , and , statements (i)-(iii) regarding the sequence  easily follow from (23).

We know that is finite by (22). Notice that  by the triangle inequality. Therefore, the rate of convergence of  to can be deduced from the convergence rate of to 0. Adding (20) from  to  with , we have

where . Then by (14) and (15), we get

Hence, taking , for all  we have

By applying Lemma 3.1 with  and , we see that the statements in (i)-(iii) regarding the sequence  hold. ∎

Example 3.1.

Consider the function . The iteration given by DCA (Algorithm 1) satisfies

that is, . On the other hand, the iteration defined by Algorithm 2 is

If , we have , while . For any , we have . The optimal step size is attained at with , which is the global minimiser of .

Figure 1: Plot of for