Self-concordant analysis of Frank-Wolfe algorithms

Abstract

Projection-free optimization via different variants of the Frank-Wolfe (FW) method has become one of the cornerstones of optimization for machine learning, since in many cases the linear minimization oracle is much cheaper to implement than projections, and some sparsity needs to be preserved. In a number of applications, e.g. Poisson inverse problems or quantum state tomography, the loss is given by a self-concordant (SC) function having unbounded curvature, implying absence of theoretical guarantees for the existing FW methods. We use the theory of SC functions to provide a new adaptive step size for FW methods and prove a global convergence rate $\mathcal{O}(1/k)$, $k$ being the iteration counter. If the problem can be represented by a local linear minimization oracle, we are the first to propose a FW method with linear convergence rate, assuming neither strong convexity nor a Lipschitz continuous gradient.

1 Introduction

Statistical analysis using (generalized) self-concordant (SC) functions as loss functions is gaining increasing attention in the machine learning community [Bac10, Owe13, OstBac18, MarBacOstRu19]. This class of loss functions makes it possible to obtain faster statistical rates akin to least-squares. At the same time, the minimization of the empirical risk in this setting is a challenging optimization problem in high dimensions, because for large-scale problems first-order methods are usually the methods of choice. Yet, self-concordant functions do not satisfy the cornerstone assumptions of first-order methods (FOMs): their gradient is not guaranteed to be Lipschitz continuous, and they are not strongly convex in general. Further, the canonical machine learning problem involves sparsity constraints that usually admit a Linear minimization oracle (LMO) [Jag13], meaning that for a reasonable computational price one can minimize a linear function over the feasible set. Examples include $\ell_1$-regularization, spectrahedron constraints and ordered weighted $\ell_1$-regularization [FigZen14]. In such settings an attractive method of choice is the Frank-Wolfe (FW) method [FraWol56], also known as Conditional Gradient (CG). The modern convergence analysis of FW, starting with [Jag13], relied on a bounded curvature constant of the objective function $f$. A sufficient condition for this is that the objective function has a Lipschitz continuous gradient over the domain of interest. SC functions have unbounded curvature and are also not strongly convex, in general. In fact, to the best of our knowledge, there is no existing convergence guarantee for FW when minimizing a SC function. Given the plethora of examples of SC functions and the importance of FW in machine learning, our aim is to clarify the convergence issue of FW for this important class of optimization problems.

Our Contributions. In this paper we demonstrate that FW indeed works when minimizing a SC function over a compact polyhedron. Key to this is the exploitation of the inherent problem structure. However, since the standard arguments for FW are not applicable, we have to develop some new ideas. Section 3 constructs new adaptive variants of the FW method, whose main innovation is new step-size policies ensuring global convergence and standard sublinear convergence rates. The derivation of these step-size schemes relies entirely on basic properties of SC functions, and is key to showing the potential for acceleration of standard FW. Section 4 gives the first linear convergence results for FW for the problems of interest in this paper, under the assumption that we have access to a strong Local Linear minimization oracle (LLOO), first constructed in [GarHaz16].

Previous work. Despite its great scalability properties, FW is plagued by slow convergence rates, in general. FraWol56 showed an $\mathcal{O}(1/k)$ convergence rate of the function values to the optimal value. Later on, LevPol66 generalized this result to arbitrary compact convex sets, and CanCul68 proved that this rate is actually tight (see also Lan13). However, a standard result in convex optimization states that projected gradient descent converges linearly to the minimizer of a strongly convex and Lipschitz smooth (i.e. well-conditioned) function. Subsequently, an important line of research emerged with the aim to accelerate FW. GueMar86 obtained linear convergence rates for well-conditioned problems under the a-priori assumption that the solution lies in the relative interior of the feasible set, with a rate of convergence explicitly depending on the distance of the solution from the boundary (see also EpeFreu00, BecTeb04). If no a-priori information on the location of the solution is available, there are essentially two known twists of vanilla FW to boost the convergence rates. One twist is to modify the search directions. Probably the most prominent modification along these lines is the use of "away-steps", but alternative strategies are possible [JagLac15, BecSht17, FreGriMaz17]. The alternative twist is to use a tighter definition of the LMO. This last strategy has been used in HarJudNem15, GarHaz16. In particular the latter paper is fundamental to parts of our analysis. None of the existing references studied the important case of SC minimization, let alone established linear convergence rates under reasonable hypotheses. Recently, there have been some attempts to relax the typical assumptions on the objective function. Nes18CG requires the gradient of the objective function to be Hölder continuous. Implicitly, the objective is assumed to be smooth on the entire feasible set, an assumption we do not make, and which is also not satisfied in important applications (e.g. in the Poisson inverse problem below, where the objective blows up at the boundary of its domain). Specialized to solving a quadratic Poisson inverse problem in phase retrieval, Odor16 provided a globally convergent FW method. They studied this problem via estimate sequence techniques [Nes83] in order to match the proof technique of Nes18CG. Consequently, the proof is fairly involved, and largely ignores the SC structure of the problem. However, as we show in this work, the SC structure is key to obtain (i) a unified analysis of FW for SC-minimization, paralleling the recent advances made in a proximal setting by [CevKyrTra15, SunTran18]; (ii) accelerated rates under a stronger LMO in the spirit of GarHaz16. All the algorithms we develop in this paper are characterized by new explicit and adaptive step-size policies which guarantee global convergence and rate estimates. We also provide some preliminary numerical results for our method in the context of logistic regression and show that our newly proposed step sizes provide substantial benefits over existing ones.

After the first version of our paper appeared on arXiv on February 11, 2020, a concurrent paper [LiuCevTra20] proposed a Frank-Wolfe-type method for self-concordant functions involving a quadratic model of the original objective. The main differences are that the algorithm in [LiuCevTra20] uses the Hessian of the objective, and thus can be seen as a kind of Newton method, and that their rate is slightly different from ours.

2 Preliminaries

We consider the minimization problem

(P) $\min_{x\in\mathcal{X}} f(x),$

where $\mathcal{X}$ is a compact convex set in $\mathbb{R}^n$ with nonempty interior. We assume that $\mathcal{X}\cap\operatorname{dom} f\neq\emptyset$ and that the feasible set is represented by a LMO returning at point $x$ the target vector

(2.1) $s(x) \in \operatorname*{argmin}_{s\in\mathcal{X}} \langle\nabla f(x), s\rangle.$

In case of multiple solutions of this LP we assume that ties are broken by some arbitrary mechanism. The function $f$ is assumed to be SC [NesIPM94] in the following sense. Let $f:\mathbb{R}^n\to(-\infty,+\infty]$ be a closed convex function with open domain $\operatorname{dom} f$. For a fixed $x\in\operatorname{dom} f$ and direction $u\in\mathbb{R}^n$, define $\varphi(t) := f(x+tu)$ for $t\in\{t\in\mathbb{R} : x+tu\in\operatorname{dom} f\}$.

Definition 2.1.

A proper closed convex function $f$ with open domain $\operatorname{dom} f$ is said to be self-concordant (SC) with parameter $M_f\ge0$ (i.e. $f\in\mathcal{F}_{M_f}$) if

(2.2) $|\varphi'''(t)| \le M_f\, \varphi''(t)^{3/2}.$

If $M_f=2$ we call $f$ standard self-concordant.
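
As an illustration of Definition 2.1, the following Python sketch numerically checks the defining inequality (2.2) for the univariate log-barrier $\varphi(t)=-\ln t$, for which (2.2) holds with equality for $M_f=2$; the script is purely illustrative.

```python
import numpy as np

# phi(t) = -ln(t):  phi''(t) = 1/t^2,  phi'''(t) = -2/t^3,
# hence |phi'''(t)| = 2 * phi''(t)^(3/2), i.e. (2.2) with M_f = 2.
def phi_dd(t):
    return 1.0 / t**2

def phi_ddd(t):
    return -2.0 / t**3

for t in [0.01, 0.1, 1.0, 10.0]:
    lhs = abs(phi_ddd(t))
    rhs = 2.0 * phi_dd(t) ** 1.5
    assert np.isclose(lhs, rhs)
    print(f"t={t:6.2f}   |phi'''| = {lhs:.4e}   2*(phi'')^(3/2) = {rhs:.4e}")
```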

2.1 Examples

Poisson Inverse Problem

As a stylized application of (P), consider the low-light imaging problem in signal processing, where the imaging data is collected by counting photons hitting a detector over time. In this setting, we wish to accurately reconstruct an image in low light, which leads to noisy measurements due to a low photon count. A standard assumption in such a setting is that the data-generating process of the observations follows a Poisson distribution

$\mathbb{P}(y\,|\,Ax) = \prod_{i=1}^m \frac{\langle a_i, x\rangle^{y_i}}{y_i!}\, e^{-\langle a_i, x\rangle},$

where $x\in\mathbb{R}^n$ is the true image, $A\in\mathbb{R}^{m\times n}$ is a linear operator that projects the scene onto the observations, $a_i$ is the $i$-th row of $A$, and $y\in\mathbb{N}^m$ is the vector of observed photon counts. The maximum likelihood formulation of the Poisson inverse problem [HarMarWil11] under sparsity constraints leads to the objective function

$\min_x\Big\{ f(x) := \sum_{i=1}^m \langle a_i, x\rangle - \sum_{i=1}^m y_i\ln(\langle a_i, x\rangle) \;:\; \|x\|_1 \le M \Big\}.$

Set $\varphi_i(t) := -y_i\ln(t)$ for $t>0$, so that $\varphi_i$ is self-concordant with parameter $2/\sqrt{y_i}$. Then, the function $f$ is self-concordant with domain $\{x\in\mathbb{R}^n : \langle a_i, x\rangle>0,\ i=1,\dots,m\}$ and parameter $M_f = \max_{1\le i\le m} 2/\sqrt{y_i}$ (SunTran18, Prop. 1).
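
For concreteness, a minimal Python sketch of this objective and its gradient; the array names A and y and the toy data below are illustrative assumptions, not part of the problem specification.

```python
import numpy as np

def poisson_loss(x, A, y):
    """f(x) = sum_i <a_i, x> - sum_i y_i * ln(<a_i, x>); assumes A @ x > 0 (x in dom f)."""
    Ax = A @ x
    return np.sum(Ax) - y @ np.log(Ax)

def poisson_grad(x, A, y):
    """grad f(x) = A^T (1 - y / (A x))."""
    Ax = A @ x
    return A.T @ (1.0 - y / Ax)

# toy data (illustrative only)
rng = np.random.default_rng(0)
A = rng.uniform(0.1, 1.0, size=(50, 10))
y = rng.poisson(A @ rng.uniform(0.1, 1.0, size=10)).astype(float)
x0 = np.full(10, 0.5)
print(poisson_loss(x0, A, y), np.linalg.norm(poisson_grad(x0, A, y)))
```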

Learning Gaussian Markov random fields.

We consider learning a Gaussian graphical random field of $p$ nodes/variables from a data set $\{z_1,\dots,z_N\}$ [SpeKli86, CevDinKyr13Graph]. Each random vector $z_j$ is an i.i.d. realization from a $p$-dimensional Gaussian distribution with mean $\mu$ and covariance matrix $\Sigma$. Let $\Theta = \Sigma^{-1}$ be the precision matrix. To satisfy conditional dependencies between the random variables, $\Theta$ must have a zero entry $\Theta_{ij}$ if $i$ and $j$ are not connected in the underlying dependency graph. To learn the graphical model via an $\ell_1$-regularization framework from an empirical estimator $\hat\Sigma$, we minimize the loss function

$f(x) := -\ln\det\big(\operatorname{mat}(x)\big) + \operatorname{tr}\big(\hat\Sigma\operatorname{mat}(x)\big) + \rho\|x\|_1$

for $x\in\mathbb{R}^{p(p+1)/2}$ and $\rho>0$, where $\hat\Sigma$ is the empirical covariance estimate and $\operatorname{mat}(x)$ is the symmetric matrix constructed from the $p(p+1)/2$-dimensional vector $x$. It can be seen that $f\in\mathcal{F}_2$. Treating the penalty function as a constraint, we naturally arrive at the SC optimization problem

$\min_{x\in\mathcal{X}}\; -\ln\det\big(\operatorname{mat}(x)\big) + \operatorname{tr}\big(\hat\Sigma\operatorname{mat}(x)\big),$

where $\mathcal{X} := \{x\in\mathbb{R}^{p(p+1)/2} : \|x\|_1\le R\}$ for some $R>0$.
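
A minimal Python sketch of the log-det loss and its gradient, written directly in the matrix variable $\mathbf{X}=\operatorname{mat}(x)$ for clarity; the name Sigma_hat and the toy data are illustrative assumptions.

```python
import numpy as np

def gmrf_loss(X, Sigma_hat):
    """f(X) = -ln det(X) + tr(Sigma_hat @ X), for X symmetric positive definite."""
    sign, logdet = np.linalg.slogdet(X)
    assert sign > 0, "X must be positive definite (i.e. lie in dom f)"
    return -logdet + np.trace(Sigma_hat @ X)

def gmrf_grad(X, Sigma_hat):
    """grad f(X) = Sigma_hat - X^{-1}."""
    return Sigma_hat - np.linalg.inv(X)

p = 5
rng = np.random.default_rng(1)
Sigma_hat = np.cov(rng.normal(size=(100, p)), rowvar=False)  # empirical covariance
print(gmrf_loss(np.eye(p), Sigma_hat))
```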

Logistic Regression.

We consider the following regularized logistic regression problem

(2.3) $\min_{x\in\mathbb{R}^n} f(x) := \frac{1}{m}\sum_{i=1}^m \phi\big(y_i(\langle a_i, x\rangle + \mu)\big) + \frac{\gamma}{2}\|x\|_2^2,$

where $\phi(t) := \ln(1+e^{-t})$ is the logistic loss, $\mu\in\mathbb{R}$ is a given intercept, $y_i\in\{-1,1\}$ and $a_i\in\mathbb{R}^n$ are given as input data for $i=1,\dots,m$, and $\gamma>0$ is a given regularization parameter. According to Prop. 5 in SunTran18, this function can be regarded as a self-concordant function whose parameter $M_f$ depends on the data $(a_i)_{i=1}^m$ and on $\gamma$. As in the previous example, we can add an $\ell_1$ regularization to promote sparsity in the problem formulation via an appropriate definition of the constraints.
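
A numerically stable Python sketch of the objective (2.3) and its gradient; the names A (with rows $a_i^\top$), y, mu and gamma mirror the notation above and are assumptions of this illustration.

```python
import numpy as np

def logistic_loss(x, A, y, mu=0.0, gamma=1e-3):
    """f(x) = (1/m) sum_i ln(1 + exp(-y_i(<a_i,x> + mu))) + (gamma/2)||x||^2."""
    margins = y * (A @ x + mu)
    # ln(1 + e^{-t}) evaluated stably as logaddexp(0, -t)
    return np.mean(np.logaddexp(0.0, -margins)) + 0.5 * gamma * (x @ x)

def logistic_grad(x, A, y, mu=0.0, gamma=1e-3):
    margins = y * (A @ x + mu)
    w = -y / (1.0 + np.exp(margins))  # derivative of ln(1+e^{-t}) at the margins
    return (A.T @ w) / A.shape[0] + gamma * x
```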

Portfolio Optimization. In this problem there are $n$ assets with returns $r_t\in\mathbb{R}^n_+$ in period $t$ of the investment horizon $\{1,\dots,T\}$. The goal is to maximize the log-utility of the investor by choosing the weights of the assets in the portfolio. Our task is thus to design a portfolio $x$ in the unit simplex $\Delta_n$ solving the problem

$\min_{x\in\Delta_n} f(x) := -\sum_{t=1}^T \ln(\langle r_t, x\rangle).$

Note that this problem can be cast into an online optimization model [AroHaz06]. It can be seen that $f\in\mathcal{F}_2$. We remark that this problem also appears in the classical universal prediction problem in information theory and online learning [MerFed98].
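
Both ingredients of this example are immediate to code: the negative log-utility and the LMO over the unit simplex, whose solution is always a coordinate vertex. A minimal sketch, where the return matrix R with rows $r_t$ is an assumed input:

```python
import numpy as np

def portfolio_loss(x, R):
    """f(x) = -sum_t ln(<r_t, x>), for a T x n matrix R of per-period returns."""
    return -np.sum(np.log(R @ x))

def simplex_lmo(grad):
    """LMO over the unit simplex: a linear function is minimized at a vertex e_i."""
    s = np.zeros_like(grad)
    s[np.argmin(grad)] = 1.0
    return s
```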

2.2 The limits of standard FW

In this section we explain why the standard analysis of FW does not work for SC functions. For this, we briefly recapitulate the main steps in the modern analysis due to Jag13. The typical merit function employed in FW algorithms is the dual gap function

(2.4) $\operatorname{Gap}(x) := \max_{s\in\mathcal{X}} \langle\nabla f(x), x - s\rangle.$

It is easy to see that $\operatorname{Gap}(x)\ge0$ for all $x\in\mathcal{X}$, with equality if and only if $x$ is a solution to (P).

  Input: $x^0\in\mathcal{X}\cap\operatorname{dom} f$ initial state, step-size policy (either $\alpha_k = \frac{2}{k+2}$, or via line search); tolerance level $\varepsilon>0$
  for $k=0,1,2,\dots$ do
      if $\operatorname{Gap}(x^k)>\varepsilon$ then
          Obtain $s^k = s(x^k)$
          Set $x^{k+1} = x^k + \alpha_k(s^k - x^k)$
      end if
  end for
Algorithm 1 Standard Frank-Wolfe method
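
For reference, a minimal Python sketch of Algorithm 1 with the agnostic step size $\alpha_k = 2/(k+2)$, instantiated with an $\ell_1$-ball LMO as in the sparsity-constrained examples of Section 2.1; the oracle names grad_f and lmo are ours.

```python
import numpy as np

def l1_ball_lmo(grad, radius=1.0):
    """LMO over {x : ||x||_1 <= radius}: the best vertex is +/- radius * e_i."""
    i = np.argmax(np.abs(grad))
    s = np.zeros_like(grad)
    s[i] = -radius * np.sign(grad[i])
    return s

def standard_fw(grad_f, x0, lmo, max_iter=10_000, tol=1e-6):
    x = x0.astype(float).copy()
    for k in range(max_iter):
        g = grad_f(x)
        s = lmo(g)
        gap = g @ (x - s)          # dual gap (2.4), attained at the LMO answer
        if gap <= tol:
            break
        x += (2.0 / (k + 2)) * (s - x)
    return x

# usage: x_hat = standard_fw(grad_f, x0, lambda g: l1_ball_lmo(g, 1.0))
```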

The convergence analysis of Algorithm 1 relies on the curvature constant [Jag13]

$C_f := \sup_{x,s\in\mathcal{X},\,\alpha\in[0,1]} \frac{2}{\alpha^2}\, D_f\big(x+\alpha(s-x),\, x\big),$

where we define the Bregman divergence of the smooth convex function $f$ as

$D_f(y,x) := f(y) - f(x) - \langle\nabla f(x), y-x\rangle$

for all $x,y\in\operatorname{dom} f$. Under the assumption that $C_f<\infty$, Jag13 proved a sublinear $\mathcal{O}(1/k)$ rate of convergence in terms of the dual gap function $\operatorname{Gap}$. Unfortunately, minimizing a self-concordant function over a compact set does not necessarily give us a finite curvature constant, as the following simple example illustrates.

Example 2.1.

Consider the function $f(x) = -\ln(x_1) - \ln(x_2)$, where $x=(x_1,x_2)$ satisfies $x_1,x_2>0$. This function is the standard self-concordant barrier for the positive orthant (the log-barrier) and its Bregman divergence is easily calculated as

$D_f(y,x) = \sum_{i=1}^2\Big(\frac{y_i}{x_i} - 1 - \ln\frac{y_i}{x_i}\Big),$

where $y=(y_1,y_2)$, $y_1,y_2>0$. Letting $x_i\downarrow0$ with $y_i$ fixed, the divergence blows up, and we see that $C_f=\infty$. Neither the function $f$ nor its gradient is Lipschitz continuous over the set of interest.

The above example is quite generic for the class of optimization problems of interest in this work, since the logarithm is the prime example of a self-concordant function. It is thus clear that the standard analysis based on finite curvature estimates (or, as a particular case, a Lipschitz continuous gradient) cannot be used to analyze FW applied to (P).
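
The blow-up in Example 2.1 is easy to observe numerically; the following sketch evaluates the Bregman divergence of the two-dimensional log-barrier as the reference point approaches the boundary of the orthant (toy values are ours).

```python
import numpy as np

def bregman_logbarrier(y, x):
    """D_f(y, x) for f(x) = -sum_i ln(x_i)."""
    r = y / x
    return np.sum(r - 1.0 - np.log(r))

y = np.array([0.5, 0.5])
for eps in [1e-1, 1e-3, 1e-6]:
    x = np.array([eps, 1.0 - eps])
    print(f"x_1 = {eps:.0e}   D_f(y, x) = {bregman_logbarrier(y, x):.3e}")
# D_f(y, x) grows without bound as x_1 -> 0, hence C_f = +infinity.
```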

2.3 Background on self-concordant functions

If $\nabla^2 f(x)$ is nonsingular for every $x\in\operatorname{dom} f$, then we say that $f$ is a non-degenerate SC function. From Nes18, Thm. 5.1.6, we know that if $\operatorname{dom} f$ contains no straight lines, then every function $f\in\mathcal{F}_{M_f}$ is non-degenerate. Since our interest will be to minimize self-concordant functions over a compact set, all the SC functions we will consider are automatically non-degenerate. Define the local norm of $u\in\mathbb{R}^n$ at $x\in\operatorname{dom} f$ as $\|u\|_x := \langle\nabla^2 f(x)u, u\rangle^{1/2}$. The dual norm at $x$ is defined as $\|u\|_x^* := \langle[\nabla^2 f(x)]^{-1}u, u\rangle^{1/2}$. Given $x,y\in\operatorname{dom} f$, we define the distance-like function

$d(x,y) := \frac{M_f}{2}\|y-x\|_x.$

Let us further define

$\omega(t) := t - \ln(1+t) \quad (t>-1), \qquad \omega_*(t) := -t - \ln(1-t) \quad (t<1).$

It is not hard to observe that $\omega(t)\ge0$ and $\omega_*(s)\ge0$ for all $t>-1$ and for every $s<1$; moreover, $\omega$ and $\omega_*$ are increasing and strictly convex in $[0,\infty)$ and $[0,1)$, respectively. For all $x,y\in\operatorname{dom} f$, it is known (Thms. 5.1.8 and 5.1.9 in Nes18) that

(2.5) $f(y) \ge f(x) + \langle\nabla f(x), y-x\rangle + \frac{4}{M_f^2}\,\omega\big(d(x,y)\big),$
(2.6) $f(y) \le f(x) + \langle\nabla f(x), y-x\rangle + \frac{4}{M_f^2}\,\omega_*\big(d(x,y)\big),$

where in the latter inequality we assume $d(x,y)<1$.
An attractive feature of SC functions, in particular from the point of view of FW algorithms [Jag13], is that self-concordance is invariant under affine transformations and re-parametrizations of the domain. More precisely:

Lemma 2.2 (Nes18, Thm. 5.1.2).

Let $f\in\mathcal{F}_{M_f}$ and $\mathcal{A}:\mathbb{R}^p\to\mathbb{R}^n$ a linear operator. Then $f\circ\mathcal{A}\in\mathcal{F}_{M_f}$.

When we apply FW to the minimization of a function $f\in\mathcal{F}_{M_f}$, the search direction at position $x$ is determined by the target state $s(x)$ defined in (2.1). If $\mathcal{A}$ is a surjective linear re-parametrization of the domain $\mathcal{X}$, then the new optimization problem $\min_{y\in\mathcal{A}^{-1}(\mathcal{X})} f(\mathcal{A}y)$ is still within the frame of problem (P). Furthermore, the updates produced by FW are not affected by this re-parametrization, since the LMO commutes with the re-parametrization.

We have the following existence result for solutions of problem (P).

Proposition 2.3.

Suppose there exists $x^0\in\mathcal{X}\cap\operatorname{dom} f$. Then (P) admits a unique solution.

The proof is in Section A.1. We note that there can be at most one solution to problem (P), because $\nabla^2 f(x)\succ0$ on $\operatorname{dom} f$. Henceforth, we will assume that (P) admits a solution $x^*$.

3 FW works for self-concordant functions

In this section we derive two new adaptive step-size policies, both of which are equipped with global theoretical convergence guarantees, and both display similar (sublinear) convergence rates. Variant 1 of the adaptive FW is based on the local sufficient decrease property (2.6), and its derivation shares some interesting similarities with recent developments of proximal methods for SC minimization [CevKyrTra15, SunTran18]. Variant 2 exploits local regularity properties of SC functions. Indeed, its entire construction is based on the important insight that a descent method applied to SC functions behaves as if we were minimizing a strongly convex and Lipschitz-smooth function. Such an implicit globalization property has been exploited in, for instance, CevKyrTra15, Lu17. This observation allows us to estimate the function value decrease via a quadratic model, and the step size is obtained by optimizing the per-iteration decrease. Hence, Variant 2 of Algorithm 2 appears as standard FW, with the important difference that it does not rely on any global Lipschitz-like information. However, it still requires knowledge of a localized Lipschitz constant, which depends on the initial condition $x^0$. In many practical applications this constant can be estimated, and good guesses of the initial condition might be very beneficial to the performance of the algorithm. It should be highlighted that both step-size variants are given in closed form, which makes the algorithm free of any globalization strategies like line search. Both adaptive variants are summarized compactly in Algorithm 2.

  Input: $x^0\in\mathcal{X}\cap\operatorname{dom} f$ initial state, SC parameter $M_f>0$, local Lipschitz estimate $L_{\nabla f}$ (for Variant 2); tolerance level $\varepsilon>0$
  for $k=0,1,2,\dots$ do
      if $\operatorname{Gap}(x^k)>\varepsilon$ then
          Obtain $s^k = s(x^k)$
          Obtain $\operatorname{Gap}_k = \operatorname{Gap}(x^k)$
          if Variant 1: then
              Set $e_k = \frac{M_f}{2}\|s^k-x^k\|_{x^k}$ and $\alpha_k = \min\Big\{1,\; \frac{\operatorname{Gap}_k}{e_k\big(\operatorname{Gap}_k + \frac{4}{M_f^2}e_k\big)}\Big\}$
              Set $x^{k+1} = x^k + \alpha_k(s^k - x^k)$
          end if
          if Variant 2: then
              Compute $V_k = \|s^k - x^k\|_2^2$
              Set $\alpha_k = \min\Big\{1,\; \frac{\operatorname{Gap}_k}{L_{\nabla f}\, V_k}\Big\}$
              Set $x^{k+1} = x^k + \alpha_k(s^k - x^k)$
          end if
      end if
  end for
Algorithm 2 Adaptive Frank-Wolfe method for SC functions
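
A minimal Python sketch of Variant 1, using the step size in the form derived in Proposition 3.2 below; grad_f, hess_f and lmo are assumed user-supplied oracles, and the full Hessian is formed only for clarity (cf. Remark 3.1).

```python
import numpy as np

def adaptive_fw_sc(grad_f, hess_f, x0, lmo, M_f, max_iter=10_000, tol=1e-8):
    """Variant 1 of Algorithm 2: Frank-Wolfe with the SC-adaptive step size."""
    x = x0.astype(float).copy()
    for _ in range(max_iter):
        g = grad_f(x)
        s = lmo(g)
        gap = g @ (x - s)
        if gap <= tol:
            break
        d = s - x
        e = 0.5 * M_f * np.sqrt(d @ hess_f(x) @ d)   # e(x) = (M_f/2) ||s(x)-x||_x
        t = gap / (e * (gap + 4.0 * e / M_f**2))     # maximizer of eta_x, cf. (3.2)
        x += min(1.0, t) * d                         # damped step keeps x in dom f
    return x
```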

3.1 Analysis of Variant 1

We need a preliminary error bound.

Lemma 3.1.

For all $x\in\mathcal{X}\cap\operatorname{dom} f$ we have:

$f(x) - f(x^*) \ge \frac{4}{M_f^2}\,\omega\big(d(x^*,x)\big).$

For $x\in\mathcal{X}\cap\operatorname{dom} f$, define the target vector $s(x)$ as in (2.1), and $\operatorname{Gap}(x)$ as in (2.4). Moreover, for all $x\in\mathcal{X}\cap\operatorname{dom} f$, let us define

(3.1) $e(x) := \frac{M_f}{2}\|s(x) - x\|_x.$

Given $x\in\mathcal{X}\cap\operatorname{dom} f$ and $t\in[0,1]$, set $x^+ := x + t(s(x)-x)$. Assume that $e(x)>0$. By construction,

$d(x,x^+) = t\,e(x)$, and $x^+$ lies in the unit Dikin ellipsoid around $x$ iff $t\,e(x)<1$. Choosing $t<1/e(x)$, we conclude from (2.6)

$f(x^+) \le f(x) - t\operatorname{Gap}(x) + \frac{4}{M_f^2}\,\omega_*\big(t\,e(x)\big).$

This is the basic descent inequality which allows us to derive an explicit formula for the step-size $\alpha_k$.

Proposition 3.2.

For $t\in[0, 1/e(x))$, define

$\eta_x(t) := t\operatorname{Gap}(x) - \frac{4}{M_f^2}\,\omega_*\big(t\,e(x)\big).$

This function is concave and uniquely maximized at the value

(3.2) $t(x) := \frac{\operatorname{Gap}(x)}{e(x)\big(\operatorname{Gap}(x) + \frac{4}{M_f^2}e(x)\big)}.$

If $\alpha(x) := \min\{1, t(x)\}$ is used as a step-size in Variant 1 of Algorithm 2, and we define $x^+ := x + \alpha(x)(s(x)-x)$, then

(3.3) $f(x^+) \le f(x) - \eta_x\big(\alpha(x)\big).$

Moreover, this step-size rule is almost optimal in a worst-case analytic sense.

We now construct the step-size sequence by setting $\alpha_k := \alpha(x^k)$ for all $k\ge0$. The damping parameter must be included in order to ensure that the iterates stay within a unit Dikin ellipsoid around the current iterate, which guarantees that the algorithm remains in $\operatorname{dom} f$ (cf. [Nes18], Thm. 5.1.5). Convexity of $\mathcal{X}$ and $\alpha_k\in(0,1]$ actually guarantees that $x^{k+1}\in\mathcal{X}\cap\operatorname{dom} f$ when this step-size policy is used. Thus, at each iteration, we reduce the objective function value by at least the quantity $\Delta_k := \eta_{x^k}(\alpha_k)$, so that $f(x^{k+1}) \le f(x^k) - \Delta_k$.

Proposition 3.3.

The following assertions hold for Variant 1 of Algorithm 2:

  • (a) $(f(x^k))_{k\ge0}$ is non-increasing;

  • (b) $\sum_{k\ge0}\Delta_k < +\infty$, and hence the sequence $(\Delta_k)_{k\ge0}$ converges to 0;

  • (c) For all $K\in\mathbb{N}$ we have $\min_{0\le k\le K}\Delta_k \le \frac{f(x^0)-f(x^*)}{K+1}$.

Proposition 3.3(a) immediately implies $f(x^k)\le f(x^0)$, or equivalently,

$x^k \in \mathcal{S}(x^0) := \{x\in\mathcal{X}\cap\operatorname{dom} f : f(x)\le f(x^0)\} \quad\text{for all } k\ge0.$

For all $x\in\mathcal{S}(x^0)$, it follows from Lemma 3.1 that $\omega\big(d(x^*,x)\big) \le \frac{M_f^2}{4}\big(f(x^0)-f(x^*)\big)$. Consequently,

$d(x^*,x) \le \omega^{-1}\Big(\frac{M_f^2}{4}\big(f(x^0)-f(x^*)\big)\Big) \quad\text{for all } x\in\mathcal{S}(x^0).$

Since $\omega$ is continuous and increasing, $\mathcal{S}(x^0)$ is a closed, bounded and convex set. Accordingly, defining by $L_{\nabla f}$ and $\sigma_f$ the largest, respectively smallest, eigenvalue of the Hessian $\nabla^2 f(x)$ over $\mathcal{S}(x^0)$, the numbers

$L_{\nabla f} := \max_{x\in\mathcal{S}(x^0)}\lambda_{\max}\big(\nabla^2 f(x)\big), \qquad \sigma_f := \min_{x\in\mathcal{S}(x^0)}\lambda_{\min}\big(\nabla^2 f(x)\big)$

are well defined and finite. In particular, for all $x\in\mathcal{S}(x^0)$ we have a restricted strong convexity of the function $f$, in the sense of the next lemma.

Lemma 3.4.

For all $x,y\in\mathcal{S}(x^0)$ we have

(3.4) $f(y) \ge f(x) + \langle\nabla f(x), y-x\rangle + \frac{\sigma_f}{2}\|y-x\|_2^2.$

In terms of these quantities, we can bound the sequence $(e_k)_{k\ge0}$, defined as $e_k := e(x^k)$, as

(3.5) $\frac{M_f}{2}\sqrt{\sigma_f}\,\|s^k-x^k\|_2 \;\le\; e_k \;\le\; \frac{M_f}{2}\sqrt{L_{\nabla f}}\,\|s^k-x^k\|_2,$

where $s^k := s(x^k)$.

Remark 3.1.

Algorithm 2 requires the distance measure $e(x^k)$, which is defined in terms of the local norm. Computing this local norm requires forming the Hessian $\nabla^2 f(x^k)$, which might be costly in large-scale applications. Thanks to (3.5), we can replace the local norm by the Euclidean norm, provided that the parameters $\sigma_f, L_{\nabla f}$ are available, or can be estimated at reasonable cost [Lu17].
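
Alternatively, if Hessian-vector products are available (or can be approximated), the local norm itself is cheap and the Hessian never needs to be formed explicitly; a finite-difference sketch, with a heuristic step size h:

```python
import numpy as np

def local_norm(grad_f, x, u, h=1e-6):
    """||u||_x = sqrt(u^T (nabla^2 f(x)) u) via a central-difference Hessian-vector product."""
    Hu = (grad_f(x + h * u) - grad_f(x - h * u)) / (2.0 * h)
    return np.sqrt(max(u @ Hu, 0.0))
```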

In order to derive convergence rates, we need to lower bound the per-iteration decrease in the objective function. A detailed analysis, given in the supplementary materials, reveals an explicit lower bound on the per-iteration decrease $\Delta_k$ which relates it to the gap function $\operatorname{Gap}(x^k)$.

Lemma 3.5.

For all $k\ge0$ we have

(3.6) $\Delta_k \ge \min\big\{a\operatorname{Gap}(x^k),\; b\operatorname{Gap}(x^k)^2\big\},$

where $a, b>0$ are explicit constants, depending only on $M_f$, $L_{\nabla f}$ and the diameter of $\mathcal{X}$, whose precise values are given in the supplementary material.

Define the approximation error $h_k := f(x^k) - f(x^*)$. By convexity, we have $h_k \le \operatorname{Gap}(x^k)$. Therefore, the lower bound for $\Delta_k$ in Lemma 3.5 can be estimated by the more symmetric quantity $\min\{a h_k, b h_k^2\}$, which implies

(3.7) $h_{k+1} \le h_k - \min\{a h_k,\; b h_k^2\}.$

Given this recursion, we can identify two phases characterizing the process $(h_k)_{k\ge0}$. In Phase I, the approximation error is at least $a/b$, and in Phase II the approximation error falls below this value. The cut-off value $a/b$ determines the nature of the recursion (3.7), and yields immediately an estimate for the iteration complexity of Variant 1 of Algorithm 2.

Theorem 3.6.

For all $\varepsilon>0$, define the stopping time

(3.8) $N_\varepsilon(x^0) := \inf\{k\ge0 : h_k \le \varepsilon\}.$

Then,

(3.9) $N_\varepsilon(x^0) = O\Big(\frac{1}{a}\ln\frac{b\,h_0}{a} + \frac{1}{b\varepsilon}\Big),$

i.e. Variant 1 needs at most $O(1/\varepsilon)$ iterations to reach an $\varepsilon$-approximate solution.

The proof is in Section B.1.2.

3.2 Analysis of Variant 2

For the analysis of Variant 2 of Algorithm 2 we first define

(3.10) $\gamma_x := \max\{t\in[0,1] : x + t(s(x)-x)\in\mathcal{S}(x^0)\}.$

This is seen to be the largest step size guaranteeing feasibility and decrease of the objective function value. Hence, for all $t\in[0,\gamma_x]$, we have $x+t(s(x)-x)\in\mathcal{S}(x^0)$, and in particular $x+t(s(x)-x)\in\mathcal{X}\cap\operatorname{dom} f$. To continue with the analysis, we need the next estimate for the local Lipschitz smoothness of the function $f$ on the level set $\mathcal{S}(x^0)$.

Lemma 3.7.

Assume that $x^k\in\mathcal{S}(x^0)$ for all $k\ge0$. For all $x,y\in\mathcal{S}(x^0)$, it holds true that $\lambda_{\max}\big(\nabla^2 f(x)\big) \le L_{\nabla f}$, and

$\|\nabla f(y) - \nabla f(x)\|_2 \le L_{\nabla f}\,\|y-x\|_2.$

Using this estimate, we can provide a localized version of the descent lemma, which reads as

$f(y) \le f(x) + \langle\nabla f(x), y-x\rangle + \frac{L_{\nabla f}}{2}\|y-x\|_2^2$

for all $x,y\in\mathcal{S}(x^0)$. Departing from this localized estimate, we update the current position $x = x^k$, conceptually as follows: First, given the current iterate $x$, we retrieve $s(x)$ and $\operatorname{Gap}(x)$. For $t\in[0,\gamma_x]$, this gives sufficient decrease in the form of

$f\big(x+t(s(x)-x)\big) - f(x) \le -t\operatorname{Gap}(x) + \frac{t^2 L_{\nabla f}}{2}\|s(x)-x\|_2^2.$

Forgetting about the constraint $t\le\gamma_x$ for the moment, the above quadratic majorization is optimized in an unconstrained way at the point

(3.11) $\bar{t}(x) := \frac{\operatorname{Gap}(x)}{L_{\nabla f}\,\|s(x)-x\|_2^2}.$

A simple argument, outlined in Section B.2 in the supplementary material, shows that we can actually choose the step-size $\alpha(x) := \min\{1, \bar{t}(x)\}$, so that Variant 2 of Algorithm 2 is well-posed. We would like to emphasize that in the effective implementation of this algorithm, the quantity $\gamma_x$ actually need not be computed. As for Variant 1, this variant is defined via a compact closed-form representation, requiring only knowledge of the local Lipschitz constant $L_{\nabla f}$.
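
In code, one step of Variant 2 is a one-liner once the localized constant of Lemma 3.7 is available; a sketch, where L_loc stands for an estimate of $L_{\nabla f}$ and is an assumed input:

```python
def fw_variant2_step(x, g, s, L_loc):
    """One update of Variant 2: quadratic-model step size, clipped to [0, 1]."""
    d = s - x
    gap = g @ (x - s)
    t = gap / (L_loc * (d @ d))   # unconstrained quadratic-model minimizer, cf. (3.11)
    return x + min(1.0, t) * d
```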

Theorem 3.8.

Let $(x^k)_{k\ge0}$ be generated by Variant 2 of Algorithm 2. Then, in terms of the approximation error $h_k := f(x^k) - f(x^*)$, we have $h_k = O(1/k)$.

The proof is in Section B.2.

4 Inducing linear convergence

In this section we use the construction of a local linear minimization oracle given in [GarHaz16] to deduce an accelerated, LLOO-based version of the FW Algorithm 2. In that reference, Garber and Hazan give an explicit construction of such an oracle for any polytope of the form

(4.1) $\mathcal{X} = \{x\in\mathbb{R}^n : Bx \le b\},$

where $B\in\mathbb{R}^{m\times n}$ and $b\in\mathbb{R}^m$. Accordingly, we assume in this section that we are minimizing a SC function $f$ over a compact polytope of the form (4.1). Let $D := \operatorname{diam}(\mathcal{X})$.

Definition 4.1 ([GarHaz16], Def. 2.5).

A procedure $\mathcal{A}(x,r,c)$, where $x\in\mathcal{X}$, $r>0$ and $c\in\mathbb{R}^n$, is a LLOO with parameter $\rho\ge1$ for the polytope $\mathcal{X}$ if $\mathcal{A}(x,r,c)$ returns a point $s = \mathcal{A}(x,r,c)\in\mathcal{X}$ such that

(4.2) $\langle c, y\rangle \ge \langle c, s\rangle \quad\text{for all } y\in\mathcal{X}\cap\mathcal{B}(x,r),$
(4.3) $\|x - s\|_2 \le \rho\, r,$

where $\mathcal{B}(x,r)$ denotes the Euclidean ball of radius $r$ around $x$.

Assuming the availability of a procedure $\mathcal{A}(x,r,c)$ for any point $x\in\mathcal{X}$, radius $r>0$ and direction $c\in\mathbb{R}^n$, we run Algorithm 3, whose pseudo-code reads as follows:

  Input: LLOO procedure $\mathcal{A}(\cdot,\cdot,\cdot)$ with parameter $\rho\ge1$ for polytope $\mathcal{X}$,
$x^0\in\mathcal{X}\cap\operatorname{dom} f$, and let the initial radius $r_0$ and the step sizes $(\alpha_k)_{k\ge0}$ be as specified in the analysis (Section C)
  for $k=0,1,2,\dots$ do
      if $\operatorname{Gap}(x^k)>\varepsilon$ then
          Obtain $\nabla f(x^k)$ and the current radius $r_k$
          Update $r_{k+1}$ according to the contraction schedule of the analysis
          Obtain $s^k = \mathcal{A}\big(x^k, r_k, \nabla f(x^k)\big)$
          Set $\alpha_k$ as specified in the analysis
          Set $x^{k+1} = x^k + \alpha_k(s^k - x^k)$
      end if
  end for
Algorithm 3 LLOO-based convex optimization

Our analysis of Algorithm 3 departs from eq. (2.6), saying that for $d(x^k, x^{k+1})<1$ we have

$f(x^{k+1}) \le f(x^k) + \langle\nabla f(x^k), x^{k+1}-x^k\rangle + \frac{4}{M_f^2}\,\omega_*\big(d(x^k,x^{k+1})\big).$

To exploit the power of the LLOO, we need to control the right-hand side of the previous display by carefully bounding the radius of a single step made by the algorithm. Via a delicate induction argument, based on estimates for SC functions and GarHaz16, we are able to demonstrate linear convergence of the thus defined scheme.

Theorem 4.2.

Let $(x^k)_{k\ge0}$ be generated by Algorithm 3. Then for all $k\ge0$ we have $x^k\in\mathcal{X}\cap\operatorname{dom} f$ and

(4.4) $f(x^k) - f(x^*) \le \big(f(x^0) - f(x^*)\big)\exp(-\theta k),$

for an explicit constant $\theta>0$ depending on $M_f$, $\sigma_f$, $L_{\nabla f}$, $\rho$ and the diameter of $\mathcal{X}$.

The proof is in Section C.3.

The constants affecting the rates of the algorithm are related to problem (P) via the objective function and the diameter of the set $\mathcal{X}$. The local Lipschitz-like constant $L_{\nabla f}$ is just the largest eigenvalue of the Hessian $\nabla^2 f(x)$ over the level set $\mathcal{S}(x^0)$, and thus relatively easy to estimate via iterative power methods (typically 10 iterations suffice). The proof of Theorem 4.2 establishes that Algorithm 3 is a descent method, thus $f(x^{k+1})\le f(x^k)$. Hence, $x^k\in\mathcal{S}(x^0)$ for each $k\ge0$. If we define accordingly the global condition-number-like constant $Q := L_{\nabla f}/\sigma_f$, we get a more simplified, less precise, convergence rate of the form

$f(x^k) - f(x^*) \le \big(f(x^0)-f(x^*)\big)\exp\big(-\Theta(k/(\rho^2 Q))\big),$

which is a rate bound similar to GarHaz16. The algorithm also does not need precise information about the value of $f(x^0)-f(x^*)$, but rather only an upper bound. Such an upper bound can be obtained by a single call of the LMO (cf. [GarHaz16]). The most difficult quantity involved in the estimate (4.4) is the convexity parameter $\sigma_f$. This quantity is also needed in the proximal-based approaches of Lu17 and in the DISCO algorithm [LinZha15], as well as in the analysis of [GarHaz16]. It would be very attractive to get rid of this global parameter in the implementation of Algorithm 3, and we leave this difficult question for future research.
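
The power-method estimation of the largest Hessian eigenvalue mentioned above can be sketched as follows; hess_matvec is an assumed oracle returning $\nabla^2 f(x)\,v$ at the point of interest.

```python
import numpy as np

def hessian_lmax(hess_matvec, n, iters=10, seed=0):
    """Estimate lambda_max(nabla^2 f(x)) by power iteration (about 10 steps suffice)."""
    rng = np.random.default_rng(seed)
    v = rng.normal(size=n)
    v /= np.linalg.norm(v)
    for _ in range(iters):
        w = hess_matvec(v)
        v = w / np.linalg.norm(w)
    return v @ hess_matvec(v)     # Rayleigh quotient at the final iterate
```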

5 Numerical Experiments


Figure 1: Percentage of data sets in which each relative error value was achieved within 10,000 iterations. Adaptive is Variant 1 of Algorithm 2. Lipschitz is Variant 2 of Algorithm 2. Line Search is Algorithm 1 with exact line search.

We report some preliminary numerical experiments for logistic regression. As described in Section 2.1, we move the $\ell_1$-regularization term into the constraint, and add a variable that upper bounds the regularized term. Note that in this example the gradient of the objective is Lipschitz continuous, with an upper bound on the Lipschitz constant proportional to $\|A\|_2^2$, where $A$ is the matrix whose rows are given by $a_i^\top$. We implemented Algorithm 2 in both of its variants and compared it to the standard FW (Algorithm 1). All methods were implemented in MATLAB R2019b, on a PowerEdge R740xd server with 2 Intel Xeon Gold 6254 3.1GHz processors, each with 18 cores, and a total of 384GB RAM. Each method was allowed to run on only one core.

(a) Average ratio between the number of iterations and the minimal number of iterations needed to achieve a given relative error value.

(b) Average ratio between the time and the minimal time needed to achieve a given relative error value.
Figure 2: Adaptive is Variant 1 of Algorithm 2. Lipschitz is Variant 2 of Algorithm 2. Line Search is Algorithm 1 with exact line search.

We test the three algorithms on binary classification datasets downloaded from LIBSVM at https://www.csie.ntu.edu.tw/~cjlin/libsvm/. We ran the three algorithms on 17 datasets (a1a-a9a, w1a-w8a), and the average performance over all these data sets is presented in Figures 1-2(b). In all these plots the method "adaptive" corresponds to Variant 1 of Algorithm 2, while "Lipschitz" represents Variant 2 of the same algorithm. "Line Search" is Algorithm 1 with exact line search. We estimate $f(x^*)$ by the maximal lower bound on the function value achieved by any of the algorithms, and compute the relative error of the iterates with respect to this estimate. Figure 1 reports the relative error the three algorithms achieved on all 17 datasets in less than 10,000 iterations. Note that the adaptive method has the highest percentage of reaching all values of relative error. Figures 2(a) and 2(b) show the average ratio between the number of iterations/time taken to reach a certain relative error and the minimal number of iterations/time in which any method reached this error. We see that indeed the adaptive method (Variant 1), which uses local properties of the function, is faster than Lipschitz (Variant 2) in both number of iterations and time. Moreover, the adaptive method takes more iterations than line search to achieve low accuracy, but fewer iterations to achieve higher accuracy, and is much less computationally expensive. Overall, this is strong support for the superiority of the adaptive variant. In Figure 3 we show the convergence for the dataset a4a, where we see the performance of each method with respect to both iterations and time, and the benefit of the adaptive choice of step size for higher accuracy.

(a) Iterations vs. Relative Error

(b) Time vs. Relative Error
Figure 3: Convergence for dataset a4a.

6 Conclusion

CG is a much appraised FOM for minimizing smooth convex functions over convex compact sets. The main merit of this method is the relative simplicity of its implementation and its projection-free iterations. This yields great scalability properties, making it a very attractive method for large-scale optimization. The price to pay for this iteration-simplicity is, in general, slow convergence rates, and the standard analysis requires bounded curvature estimates over the polytope. In this work, we show that for SC functions, which are neither strongly convex nor of bounded curvature, we can obtain a novel step-size policy which, when coupled with local linear minimization oracles, features linear convergence rates. To the best of our knowledge, this is the first unified analysis with provable global convergence guarantees for the class of functions under consideration here, despite the fact that self-concordance is a very frequently encountered structure in machine learning.

In future work, we will generalize our algorithms to handle generalized self-concordant functions, as recently defined in SunTran18. With this generalization we will be able to handle challenging optimization problems in scientific computing, such as the matrix balancing problem, which is known not to be well-conditioned.

Acknowledgments

The authors would like to thank Professor Quoc Tran-Dinh for sharing MATLAB codes with us. M. Staudigl gratefully acknowledges support from the COST Action CA16228 "European Network for Game Theory".

Supplementary Material

Appendix A Proof of Proposition 2.3

For all $x,y\in\operatorname{dom} f$ we know that

$f(y) \ge f(x) + \langle\nabla f(x), y-x\rangle + \frac{4}{M_f^2}\,\omega\big(d(x,y)\big).$

Define the level set $\mathcal{S}(x^0) := \{x\in\mathcal{X}\cap\operatorname{dom} f : f(x)\le f(x^0)\}$ and pick $x\in\mathcal{S}(x^0)$. For such a point, we get

$\frac{4}{M_f^2}\,\omega\big(d(x^0,x)\big) \le f(x) - f(x^0) - \langle\nabla f(x^0), x-x^0\rangle \le \|\nabla f(x^0)\|_2\,\|x-x^0\|_2.$

Consider the function $\omega(t)$ for $t\ge0$. It is increasing, with $\omega(t)/t\to1$ as $t\to\infty$, and so the displayed inequality shows that $d(x^0,x)$ remains bounded over $\mathcal{S}(x^0)$. Since $f$ is closed and $\mathcal{X}$ is compact, it follows that $\mathcal{S}(x^0)$ is closed and bounded. By the Weierstrass theorem, existence of a solution follows (see e.g. [Ber99]). If $x^*$ is a solution, we know that

$f(x) \ge f(x^*) + \frac{4}{M_f^2}\,\omega\big(d(x^*,x)\big) \quad\text{for all } x\in\mathcal{X}\cap\operatorname{dom} f.$

Hence, if $\bar{x}$ would be any alternative solution, we would immediately conclude that $\omega\big(d(x^*,\bar{x})\big)=0$, i.e. $\bar{x}=x^*$.

Appendix B Proofs of Section 3

B.1 Proofs for Variant 1 of Algorithm 2

Proof of Lemma 3.1.

If $x\in\mathcal{X}\cap\operatorname{dom} f$, then eq. (2.5) shows

$f(x) \ge f(x^*) + \langle\nabla f(x^*), x-x^*\rangle + \frac{4}{M_f^2}\,\omega\big(d(x^*,x)\big) \ge f(x^*) + \frac{4}{M_f^2}\,\omega\big(d(x^*,x)\big),$

where the last inequality uses the first-order optimality condition $\langle\nabla f(x^*), x-x^*\rangle\ge0$.

Proof of Proposition 3.2.

For $t\in[0, 1/e(x))$, define

(B.1) $\eta_x(t) := t\operatorname{Gap}(x) - \frac{4}{M_f^2}\,\omega_*\big(t\,e(x)\big).$

We easily compute $\eta_x'(t) = \operatorname{Gap}(x) - \frac{4}{M_f^2}\cdot\frac{t\,e(x)^2}{1-t\,e(x)}$. Hence, the function $\eta_x$ is concave and uniquely maximized at

(B.2) $t(x) = \frac{\operatorname{Gap}(x)}{e(x)\big(\operatorname{Gap}(x) + \frac{4}{M_f^2}e(x)\big)}.$

Furthermore, one can easily check that $t(x)e(x)<1$, and $\eta_x\big(t(x)\big) = \frac{4}{M_f^2}\,\omega\Big(\frac{M_f^2\operatorname{Gap}(x)}{4e(x)}\Big)$, whenever $t(x)\le1$. Hence, it follows that

(B.3) $f(x^+) \le f(x) - \eta_x\big(\alpha(x)\big).$

Proof of Lemma 3.4.

Lemma 3.1 gives . Observe that for all