Frank-Wolfe (FW) methods have gained significant interest in the machine learning community due to their ability to efficiently solve large problems that admit a sparse structure (e.g. sparse vectors and low-rank matrices). However, the performance of the existing FW method hinges on the quality of the linear approximation. This typically restricts FW to smooth functions for which the approximation quality, indicated by a global curvature measure, is reasonably good.
In this paper, we propose a modified FW algorithm amenable to nonsmooth functions by optimizing for approximation quality over all affine approximations given a neighborhood of interest. We analyze theoretical properties of the proposed algorithm and demonstrate that it overcomes many issues associated with existing methods in the context of nonsmooth low-rank matrix estimation.
Nonsmooth Frank-Wolfe using Uniform Affine Approximations
Edward Cheung Yuying Li
University of Waterloo University of Waterloo
We are interested in solving problems of the form,
$$\min_{X \in \mathbb{R}^{m \times n}} f(X) \quad \text{s.t.} \quad \|X\|_{tr} \le \delta,$$
where $\|X\|_{tr}$ denotes the trace norm of $X$, i.e., the sum of the singular values of $X$. This problem is well studied in the case where $f$ is a smooth convex function. For example, in matrix completion, many efficient algorithms have been proposed, including Frank-Wolfe [Jaggi, 2011], active set methods [Hsieh and Olsen, 2014], and proximal methods [Parikh et al., 2014]. In [Cheung and Li, 2017], it has been shown that FW can be made to scale competitively by maintaining a low-rank intermediate solution at each iteration and requiring only simple projection-free updates, avoiding the full SVD required by proximal methods or projected gradient methods.
Recently, there has also been interest in solving the trace norm constrained problem where the objective function is not differentiable, e.g.,
$$\min_{X \in \mathbb{R}^{m \times n}} f(X) + \lambda \|X\|_1 \quad \text{s.t.} \quad \|X\|_{tr} \le \delta, \qquad (1)$$
where $f$ is an empirical loss function. Problem (1) has been found useful [Richard et al., 2012] for sparse covariance estimation and graph link prediction, for which solutions are expected to exhibit simultaneously sparse and low-rank structure. For problem (1), proximal methods will likely fail to scale due to the full SVD required at each iteration. In addition, the active set method in [Hsieh and Olsen, 2014] utilizes second order information, making it unclear how to develop a scalable algorithm when the function is not differentiable, let alone twice differentiable. Since differentiability is not explicitly required for Frank-Wolfe, as shown in [Jaggi, 2011, White, 1993], it appears to be a promising candidate to yield an efficient solver for nonsmooth trace norm constrained problems.
However, nondifferentiability in the objective function often leads to an unbounded curvature constant and regions in which local linear approximations are poor substitutes for the true objective. Moreover, it becomes unclear how to define the linear approximation appropriately since choosing an arbitrary subgradient often leads to inadequate local approximations.
We propose a variant of the Frank-Wolfe algorithm to address nonsmooth objectives with a focus on low-rank matrix estimation problems. We replace the traditional linear minimization problem based on a Taylor approximation by a Chebyshev uniform affine approximation. This modification allows for a well-defined linear optimization problem even when the objective is nonsmooth or has unbounded curvature.
We demonstrate that, for matrix estimation problems, the uniform affine approximation can be found in a closed form, contrasting with existing approximate subgradient methods which may require solving a costly optimization problem at each iteration. Moreover, we show that our proposed Chebyshev affine approximations correspond to a sequence of smooth functions which converge uniformly to the original objective, yielding a smoothing schedule parameterized only by the stepsize. We demonstrate experimentally that this carefully selected linear minimization leads to significant improvement over a variety of matrix estimation problems, such as sparse covariance estimation, graph link prediction, and $\ell_1$-loss matrix completion.
2.1 FW for Nonsmooth Functions
The FW algorithm is a first-order method for solving $\min_{x \in \mathcal{D}} f(x)$, where $f$ is a convex function and $\mathcal{D}$ is a convex and compact set [Frank and Wolfe, 1956]. The algorithm is motivated by replacing the objective function $f$ with its first-order Taylor expansion and solving the first-order surrogate on the domain $\mathcal{D}$. Formally, the Frank-Wolfe algorithm solves the following linear minimization problem at iteration $k$,
$$s^{(k)} = \arg\min_{s \in \mathcal{D}} \; f(x^{(k)}) + \langle s - x^{(k)}, \nabla f(x^{(k)}) \rangle.$$
The next iterate $x^{(k+1)}$ is then given by a convex combination of $x^{(k)}$ and $s^{(k)}$, which guarantees that the resulting iterate remains feasible, assuming the initial $x^{(0)} \in \mathcal{D}$.
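The update described above can be sketched in a few lines. This is a minimal illustration rather than the setup used in the paper: the probability simplex as the domain, the quadratic toy objective, the step size 2/(k+2), and the names `frank_wolfe_simplex`, `b`, and `x_hat` are all assumptions made for the example.

```python
import numpy as np

def frank_wolfe_simplex(grad, x0, num_iters=200):
    """Frank-Wolfe over the probability simplex with step size 2/(k+2)."""
    x = np.asarray(x0, dtype=float).copy()
    for k in range(num_iters):
        g = grad(x)
        # Linear minimization over the simplex: the minimizer is the vertex
        # e_i whose gradient entry is smallest.
        s = np.zeros_like(x)
        s[np.argmin(g)] = 1.0
        gamma = 2.0 / (k + 2.0)
        # A convex combination of x and s keeps the iterate feasible.
        x = (1.0 - gamma) * x + gamma * s
    return x

# Toy smooth objective f(x) = 0.5 * ||x - b||^2 with b inside the simplex.
b = np.array([0.2, 0.5, 0.3])
x_hat = frank_wolfe_simplex(lambda x: x - b, x0=np.array([1.0, 0.0, 0.0]))
print(x_hat, 0.5 * np.sum((x_hat - b) ** 2))
```

No projection is needed at any point: feasibility follows from the convex combination alone, which is the property the trace norm setting relies on.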
For smooth convex functions, the Frank-Wolfe algorithm is known to converge at a rate of $O(1/k)$. The convergence analysis relies on the concept of the curvature constant [Clarkson, 2010, Jaggi, 2011], which measures the quality of the linear approximation.
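Let $f$ be a convex and differentiable function $f: \mathbb{R}^n \to \mathbb{R}$, and let $\mathcal{D}$ be a convex and compact subset of $\mathbb{R}^n$. Then, the curvature constant $C_f$ is defined as
$$C_f := \sup_{\substack{x, s \in \mathcal{D},\ \gamma \in (0,1] \\ y = x + \gamma (s - x)}} \frac{2}{\gamma^2} \left( f(y) - f(x) - \langle y - x, \nabla f(x) \rangle \right).$$
Since $f$ is convex, $C_f \ge 0$. The curvature constant is the maximum relative deviation of the linear approximation from $f$ on the domain. When the value of $C_f$ is large, it suggests that there are regions in $\mathcal{D}$ where the local linear approximation is poor. It can be seen in [Clarkson, 2010, Jaggi, 2011] that the curvature constant is closely related to the Lipschitz constant of the gradient.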
Suppose is convex but nondifferentiable. We may consider redefining the curvature constant by replacing with a substitute . However, even for simple functions, we show that curvature derived from any linear approximation in this fashion will be unbounded.
Let and let be some convex and compact set that contains an open ball around the origin. Assume and . Then for any and any , we have,
Note that there always exists such that . It follows that,
This example shows that, for , no matter what surrogate function (in particular, any subgradient) is chosen for , the curvature constant is unbounded. This seems to indicate that FW may not be suited for minimizing the objective function in (1). However, the norm function is a max of linear functions and the linear approximation is exact in a local neighborhood except at the points of nondifferentiability. We propose a variant of Frank-Wolfe to exploit the problem structure.
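A one-dimensional illustration of this blow-up, under assumptions chosen for the example (it is not necessarily the construction used in the text): take f(x) = |x| on D = [-1, 1], fix x = 0, and let g be any slope used in place of the gradient. Choosing the vertex s in {-1, 1} with g*s <= 0 and y = gamma*s gives f(y) - f(x) - g*(y - x) = gamma*(1 - g*s) >= gamma, so the curvature ratio is at least 2/gamma and diverges as gamma tends to 0. The function name `curvature_ratio` is hypothetical.

```python
def curvature_ratio(g, gamma):
    """(2/gamma^2) * (f(y) - f(x) - g*(y - x)) for f = |.|, x = 0, y = gamma*s."""
    # Pick the vertex of D = [-1, 1] for which g * s <= 0.
    s = -1.0 if g > 0 else 1.0
    y = gamma * s
    return (2.0 / gamma**2) * (abs(y) - g * y)

for g in (-1.0, 0.0, 0.3, 1.0):
    ratios = [curvature_ratio(g, gamma) for gamma in (1e-1, 1e-2, 1e-3)]
    # Each ratio is at least 2/gamma, whatever slope g is chosen.
    print(g, ratios)
```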
2.2 Existing Work
The earliest work we have found on applying FW to nonsmooth function minimization is [White, 1993], in which the nonsmoothness is handled by using an approximate subdifferential. Recently, [Ravi et al., 2017] further extended this idea and proposed a generalized curvature constant to guide the determination of better linear subproblems. Formally, this generalized curvature constant is given below,
We first note that, since is not necessarily a subgradient of at , the difference can become negative. Thus, taking the supremum over these values may not measure the maximum deviation of the linear approximation. In particular, it does not measure deviations from overestimating linear approximations. Furthermore, since is minimized in the inner optimization, the choice of linear approximation now depends on both and , making it no longer straightforward to define the FW subproblem using this notion of curvature.
In [Ravi et al., 2017], the linear minimization is instead replaced by finding a point which satisfies
Since and are jointly considered, we are not aware of an efficient way to utilize this algorithm for problems such as (1).
for the Smoothed Composite Conditional Gradient (SCCG) algorithm in [Pierucci et al., 2014], and
for the Hybrid Conditional Gradient with Smoothing (HCGS) algorithm in [Argyriou et al., 2014], where
is the soft-thresholding operator. In both cases, the objective is smoothed by parameters and respectively, with a clear tradeoff between approximation quality and convergence rate. As we show in the experimental results, this compromise is very evident in practice.
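For reference, the soft-thresholding operator in its standard elementwise form can be sketched as follows. The function name `soft_threshold` and the threshold argument `lam` are hypothetical, and the way SCCG and HCGS plug the operator into their subproblems is not reproduced here.

```python
import numpy as np

def soft_threshold(x, lam):
    """Elementwise soft-thresholding: shrink each entry toward zero by lam."""
    x = np.asarray(x, dtype=float)
    # Entries with magnitude at most lam are set to zero; the rest move by lam.
    return np.sign(x) * np.maximum(np.abs(x) - lam, 0.0)

print(soft_threshold([-3.0, -0.5, 0.0, 0.5, 3.0], 1.0))
```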
2.3 Achieving a better linear approximation
The previous work on FW for nonsmooth minimization (1) shares a common idea, i.e., finding a meaningful way to define an appropriate linear optimization subproblem. In this paper, we motivate and propose a more direct choice of the linear approximation for FW. Instead of choosing an approximate subgradient to minimize some curvature constant [Ravi et al., 2017], we directly minimize the approximation error over all possible affine functions for a specified neighborhood.
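Given some $x_0 \in \mathcal{D}$ and $\tau > 0$, the uniform affine approximation to a function $f$ over a convex and compact set $\mathcal{D}$ is defined as $\ell(x) = b + \langle m, x - x_0 \rangle$ where,
$$(m, b) = \arg\min_{m,\, b} \; \max_{y \in B_\infty(x_0, \tau)} \left| f(y) - b - \langle m, y - x_0 \rangle \right|,$$
and $B_\infty(x_0, \tau)$ is the closed infinity norm ball of radius $\tau$ around $x_0$.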
The above uniform affine approximation can be viewed as choosing an optimal affine approximation in a given neighborhood of . The optimality is defined in the uniform sense, by minimizing the maximum absolute deviation from the original objective .
This motivates a natural variant of FW where, at each iteration, the linear subproblem using a subgradient is replaced with the uniform affine approximation. In particular, we can view the FW iterates as,
for some . Thus, and . For FW, this implies that we can restrict our attention to a neighborhood around the current iterate where the neighborhood has radius .
Specifically, we observe that in FW there exist stepsize schedules, e.g., , which are independent of the current iterate and guarantee convergence. Our proposed approach is to assume that such a stepsize schedule is specified a priori and to use the uniform affine approximation for the FW subproblems.
It is worth noting that the proposed uniform affine approximation improves upon the existing methods in the following ways.
- The linear optimization subproblem is defined in a meaningful way by choosing the best affine approximation in the neighborhood of interest rather than an arbitrarily smoothed objective.
- The method minimizes the absolute deviation, accounting for overestimators, unlike the generalized curvature constant.
- By considering all affine approximations, we no longer require the objective to have a bounded curvature constant or even to be differentiable in the analysis, extending applicability to a broader class of problems than the standard FW algorithm.
3 Frank-Wolfe With Uniform Approximations
We propose a Frank-Wolfe algorithm with Uniform Approximation (FWUA) which follows the original Frank-Wolfe algorithm with a few minor changes. We assume a step-size schedule , which is independent of the current iterate , is specified a priori. In the linear subproblem, we replace a linear approximation using a subgradient by a uniform affine approximation.
3.1 Chebyshev Approximations
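We first consider the Chebyshev characterization for the best polynomial of degree not exceeding $n$, denoted as $p_n^*$, that approximates $f$ on an interval $[a, b]$ in the uniform sense,
$$p_n^* = \arg\min_{p \in \mathcal{P}_n} \max_{x \in [a, b]} |f(x) - p(x)|,$$
where $\mathcal{P}_n$ is the set of polynomials of degree at most $n$.
Theorem 3.1 (Chebyshev Equioscillation Theorem).
Let $f$ be a continuous function from $[a, b]$ to $\mathbb{R}$ and let $\mathcal{P}_n$ be the set of polynomials of degree less than or equal to $n$. Then
$$p_n^* = \arg\min_{p \in \mathcal{P}_n} \max_{x \in [a, b]} |f(x) - p(x)|$$
if and only if there exist $n + 2$ points $x_0, \dots, x_{n+1}$ such that $a \le x_0 < x_1 < \dots < x_{n+1} \le b$ and
$$f(x_i) - p_n^*(x_i) = \sigma (-1)^i \max_{x \in [a, b]} |f(x) - p_n^*(x)|, \quad i = 0, \dots, n+1,$$
where $\sigma \in \{-1, +1\}$ is fixed.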
Although the equioscillation theorem only applies to a function of one variable, under the separability Assumption 3.2 below, we can construct the best uniform affine approximation by determining the best affine approximation on an interval for each component function.
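A small worked example may help fix ideas for the affine case (degree one, so three alternation points are needed). The choice of $x^2$ on $[0, 1]$ and the symbols $p$ and $e$ are assumptions made for illustration only.

```latex
f(x) = x^2 \ \text{on } [0,1], \qquad p(x) = x - \tfrac{1}{8}, \qquad
e(x) = f(x) - p(x) = x^2 - x + \tfrac{1}{8},
\\[4pt]
e(0) = \tfrac{1}{8}, \qquad e\!\left(\tfrac{1}{2}\right) = -\tfrac{1}{8}, \qquad e(1) = \tfrac{1}{8},
\qquad \max_{x \in [0,1]} |e(x)| = \tfrac{1}{8}.
```

The error attains its maximum magnitude with alternating signs at the three points $0 < 1/2 < 1$, so by the equioscillation theorem $p$ is the best uniform affine approximation to $x^2$ on $[0, 1]$. Its slope equals the secant slope $(f(1) - f(0))/(1 - 0) = 1$.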
Assume that $f$ can be separated into a sum of component functions, i.e., $f(x) = \sum_i f_i(x_i)$.
In addition, when and under the separability assumption on , we can construct a closed form solution, when , to the uniform affine approximation problem.
Suppose is a continuous function that satisfies Assumption 3.2. For a given and ,
Assuming additionally that is convex, we can further characterize the uniform affine approximation.
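For a convex function of one variable, the construction used in the appendix proof (the secant through the interval endpoints, a parallel tangent, and the line midway between them) can be sketched numerically. This is an illustrative sketch only: the grid search for the tangent point is a shortcut standing in for the closed form, and the names `uniform_affine_1d`, `grid`, `m`, and `c` are hypothetical.

```python
import numpy as np

def uniform_affine_1d(f, x0, tau, grid=20001):
    """Best uniform affine fit m*x + c to a convex f on [x0 - tau, x0 + tau].

    Returns (slope m, intercept c, maximum absolute error).
    """
    a, b = x0 - tau, x0 + tau
    # Slope of the secant through the endpoints.
    m = (f(b) - f(a)) / (b - a)
    xs = np.linspace(a, b, grid)
    # f minus the secant is <= 0 for convex f; its minimum locates the
    # tangent line that is parallel to the secant.
    gap = f(xs) - (f(a) + m * (xs - a))
    shift = 0.5 * gap.min()
    # The fitted line lies midway between the secant and that tangent.
    c = f(a) - m * a + shift
    return m, c, -shift

print(uniform_affine_1d(np.abs, 0.0, 1.0))                # slope 0, intercept about 0.5
print(uniform_affine_1d(lambda x: x**2, 0.5, 0.5))        # slope 1, intercept about -0.125
```

For a separable objective the same routine would be applied to each component on its own interval, which is what makes a per-entry closed form plausible for matrix problems.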
4 FWUA and Convergence
Using the uniform affine approximation, we propose a FW variant with Uniform Approximations (FWUA), which is described in Algorithm 1. The function update_tau will be described in full in Section 4.1, where a specific update rule for will be required to guarantee convergence.
To establish convergence, we make the following assumptions.
Assume that $f$ satisfies Assumption 3.2. In addition, each component function $f_i$ has the following properties:
- $f_i$ is a Lipschitz continuous function from $\mathbb{R}$ to $\mathbb{R}$ with Lipschitz constant $L_i$.
- $f_i$ is nondifferentiable on at most a finite set of points.
- If $f_i$ is differentiable at $x$, then it is also twice differentiable at $x$.
While the above set of assumptions appears restrictive, our main goal in this work is to efficiently solve (1) in the context of trace-norm constrained matrix estimation problems which have a combination of $\ell_1$ and $\ell_2$ loss/regularization, for which these assumptions are satisfied. Although it may be possible to carry out the analysis with less restrictive assumptions, we choose to make these assumptions since they lead to simpler characterizations of the uniform affine approximations and highlight their roles in convergence for the problems of interest. We begin with the following definition.
The uniform slope function can be viewed as a surrogate for the gradient which does not rely on differentiability of . To establish convergence, we begin with the following theorem.
Then the following statements hold:
is convex in ,
is Lipschitz continuous with respect to the -norm with the Lipschitz constant , where is the maximum Lipschitz constant of all ,
The difference between and is uniformly bounded, i.e.,
Theorem 4.3 states that the sequence of uniform affine approximations generated by the FWUA algorithm corresponds to a sequence of smooth functions that converges uniformly to the original objective if the stepsizes converge to zero. Since each function in the sequence of smooth approximations has a Lipschitz continuous gradient, we can leverage standard Frank-Wolfe convergence arguments even if the original function is not smooth, while maintaining an upper bound on the approximation quality of the solution. In particular, given a sequence of neighborhood sizes , we consider the sequence of smooth approximations given by
Since is differentiable with a -Lipschitz gradient, it can be shown that the curvature constant for is bounded,
as seen in [Jaggi, 2011].
To make these concepts concrete, consider and is the trace norm ball of radius . The expression for is,
where is the indicator function. Figure 1 illustrates the component functions and their detailed derivation can be found in the supplementary material.
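As a hedged worked example for a single absolute-value component: the appendix characterizes the uniform slope of a convex component as the slope of the secant over the interval of radius $\tau$. Evaluating that secant slope for $|x|$ and integrating it gives a Huber-type function. The symbols $m$ and $h_\tau$ and the choice of integration constant are notational assumptions and may differ from the expression stated in the text.

```latex
m(x, \tau) = \frac{|x + \tau| - |x - \tau|}{2\tau}
= \begin{cases} -1, & x \le -\tau, \\ x / \tau, & |x| < \tau, \\ 1, & x \ge \tau, \end{cases}
\qquad
h_\tau(x) = \int_0^x m(t, \tau)\, dt
= \begin{cases} \dfrac{x^2}{2\tau}, & |x| \le \tau, \\[6pt] |x| - \dfrac{\tau}{2}, & |x| > \tau. \end{cases}
```

With this normalization, $0 \le |x| - h_\tau(x) \le \tau/2$ for every $x$, and $h_\tau'$ is Lipschitz with constant $1/\tau$. Both facts are in line with the qualitative picture above: the approximation error shrinks with $\tau$, while the gradient Lipschitz constant grows as $\tau \to 0$.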
When , its uniform affine approximation has an attractive property that . In general, this property may not hold. However from the uniform error bound, there always exists some constant such that , and the function still satisfies all the consequences of Theorem 4.3, except the error bound is at most doubled, i.e., . Thus, we redefine the sequence of approximations as follows.
The difficulty with allowing the function to become arbitrarily close to is that the Lipschitz constant for can grow arbitrarily large given a nonsmooth . However, if only an -accurate solution is required, one can always stop refining the approximation at some iteration since there exists an explicit upper bound on the approximation error. We show next that for any , FWUA can find an -accurate solution at a convergence rate.
A simple updating rule for to reflect the neighborhood of interest is to set ,
From Theorem 4.5, an -accurate solution is guaranteed if, at some suitable iteration , refining the size of neighborhood is terminated. Using the uniform error bound from Theorem 4.3 and a stepsize of , an expression for is provided below,
From this, it follows that . We observe that the required , indicated above, is typically much larger than what is needed in practice to achieve the desired accuracy; the derivation for here is more of theoretical, rather than practical, interest.
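The overall scheme can be illustrated on a toy vector problem, minimizing $\|x - b\|_1$ over an $\ell_1$ ball. This is a sketch under stated assumptions, not Algorithm 1: the $\ell_1$ ball stands in for the trace norm ball, the step size 2/(k+2) and the radius rule `tau = gamma * 2 * delta` (an upper bound on how far any entry can move in one step) are assumed, and the floor `tau_min` mimics terminating the refinement once the desired accuracy is reached. The name `fwua_l1` is hypothetical.

```python
import numpy as np

def fwua_l1(b, delta=1.0, num_iters=20000, tau_min=1e-2):
    """Toy FW with uniform slopes for min ||x - b||_1 s.t. ||x||_1 <= delta."""
    x = np.zeros_like(b, dtype=float)
    for k in range(num_iters):
        gamma = 2.0 / (k + 2.0)
        # Assumed radius rule: one step moves each entry by at most
        # gamma * 2 * delta; refinement stops once tau reaches tau_min.
        tau = max(gamma * 2.0 * delta, tau_min)
        # Secant (uniform) slope of |t - b_i| over [x_i - tau, x_i + tau].
        g = np.clip((x - b) / tau, -1.0, 1.0)
        # Linear minimization over the l1 ball: a signed, scaled unit vector.
        i = np.argmax(np.abs(g))
        s = np.zeros_like(x)
        s[i] = -delta * np.sign(g[i])
        x = (1.0 - gamma) * x + gamma * s
    return x

b = np.array([0.5, -0.2])
x_hat = fwua_l1(b)
print(x_hat, np.abs(x_hat - b).sum())
```

Once `tau` is frozen at `tau_min`, the iteration is ordinary Frank-Wolfe on a fixed smooth surrogate, so the standard $O(1/k)$ argument applies, with the surrogate differing from the true objective by at most a multiple of `tau_min`.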
We note that the quantity defines a much larger neighborhood than necessary. In implementation, we consider a neighborhood defined by the previous Frank-Wolfe steps, , and consider the update,
This takes the maximum deviation over the past five iterations as an approximation for the neighborhood size; we found that this better reflects the deviations at each iteration.
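One way such a windowed rule might be implemented is sketched below. The exact update formula is not reproduced here, so this is an assumption-laden illustration: the deviation is taken to be the infinity norm of the difference between consecutive iterates (matching the infinity norm ball used for the neighborhood), and the class name `RecentStepRadius` and its `window` argument are hypothetical.

```python
from collections import deque
import numpy as np

class RecentStepRadius:
    """Track the largest infinity-norm step over the most recent iterations."""

    def __init__(self, window=5):
        # A bounded deque drops the oldest step once the window is full.
        self.steps = deque(maxlen=window)

    def update(self, x_prev, x_next):
        diff = np.asarray(x_next, dtype=float) - np.asarray(x_prev, dtype=float)
        self.steps.append(float(np.max(np.abs(diff))))
        return max(self.steps)

radius = RecentStepRadius(window=5)
x = np.zeros(3)
taus = []
for step in (1.0, 0.5, 0.4, 0.3, 0.2, 0.1):
    x_new = x + step * np.array([1.0, -0.5, 0.0])
    taus.append(radius.update(x, x_new))
    x = x_new
print(taus)
```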
5 Experimental Results
5.1 Sparse and Low-Rank Structure
To highlight benefits of the proposed FWUA, we first compare it against other state-of-the-art solvers for the following problem,
In [Richard et al., 2012], the above formulation has been shown to yield empirical improvements for the sparse covariance matrix estimation and graph link prediction.
For each problem instance, the same value is used by all three methods and this value is tuned, by searching over a grid of parameter values, to yield the best test performance for GenFB. The bound for the trace norm, used in all FW variants, is then set to the trace norm of the solution given by GenFB. For SCCG, the smoothing parameter is additionally tuned to yield the smallest average objective value. HCGS sets its smoothing parameter as suggested by the authors. Additionally, we compare the limiting behavior for SCCG where . This corresponds to a specific subgradient, denoted as SCCG (SG). The parameters used are provided in the supplementary material.
5.1.1 Sparse Covariance Estimation
We replicate the synthetic experiments described in [Richard et al., 2012], where the goal is to recover a block diagonal matrix. We consider square matrices where using MATLAB notation. The true underlying matrix is generated with 5 blocks, where the entries are i.i.d. and uniformly sampled from . Gaussian noise, is then added with . The loss function is where is the observed data matrix.
In Figure 2, we plot the objective value vs. iteration for . The full set of results can be found in the supplementary material, but the patterns are similar throughout. We remark that since the GenFB algorithm is a regularized algorithm, the intermediate iterates are not feasible for the constrained problem used for Frank-Wolfe, so only the performance of the solution at convergence should be compared.
5.1.2 Graph Link Prediction
Next we consider predicting links in a noisy social graph. The input data is a matrix corresponding to an undirected graph, where the entry indicates that user and are friends and otherwise. We consider the Facebook dataset from [Leskovec and Krevl, 2014] which consists of a graph with 4,039 nodes and 88,234 edges, and assume 50% of the entries are given. The goal is to recover the remaining edges in the graph. Additionally, a proportion denoted as , of the labels in the observed set are flipped, removing or adding labels to the graph. We report the AUC performance measure of the link prediction on the remaining entries of the graph as well as the average CPU time over 5 random initializations summarized in Table 1. In (1), the loss function is , where is the observed graph, is the set of observed indices, and projects the loss onto .
For both applications, the results agree with our initial intuition that FWUA can improve the performance of the Frank-Wolfe variants while scaling much better than the GenFB algorithm. We observe in the covariance plots in Figure 3 that the sparsity patterns for HCGS and SCCG are much noisier than those of FWUA, and in Table 1 that the AUC values for the SCCG and HCGS methods are lower than those of GenFB and FWUA.
An interesting phenomenon we have observed for SCCG is that refining typically adds a large delay for many iterations before any progress is made, as highlighted in Figure 4. We hypothesize this delay occurs since the step sizes in the earlier iterations are too large for the level of refinement in the model.
We note that this hypothesis also agrees with what is observed in the graph link prediction example for both SCCG and HCGS. We observe that the most substantial improvements of FWUA over SCCG and HCGS occur when and , which in this case correspond to when and are largest respectively. Since the FW steps are scaled by , we conjecture that when either or is too large, the problem becomes too nonsmooth for the corresponding stepsize. This gives further support to FWUA, which solves for the optimal affine approximation given the step size.
5.2 $\ell_1$ Loss Matrix Completion
The last experiment we consider is matrix completion with an $\ell_1$ loss function on the MovieLens datasets. Here, we consider the objective function below
which is proposed in [Cambier and Absil, 2016] for robustness to outliers. Here the regularization penalizes entries in the complement of , potentially preventing overfitting.
We compare with the Robust Low-Rank Matrix Completion (RLRMC) algorithm proposed in [Cambier and Absil, 2016], which solves a nonconvex fixed-rank problem by smoothing the $\ell_1$ term. For additional comparison, we also report the results using the Frank-Wolfe algorithm on the smooth squared Frobenius norm loss function to validate the importance of the $\ell_1$ loss. To address the issue of scalability of the Frank-Wolfe algorithms, we utilize the Rank-Drop variant (RDFW) proposed in [Cheung and Li, 2017] in both the smooth Frank-Wolfe and FWUA algorithms.
The rank for RLRMC is tuned over a grid search of rank and the parameters are tuned using 50% of the data for training and 25% each for validation and testing. We report the RMSE of the converged solutions. We found that for the FWUA algorithm, the impact of is unimportant. The trace norm constraints for the RDFW and FWUA methods are independently tuned.
We observe that FWUA performs slightly better than RLRMC in RMSE, but RLRMC is much faster, which is not surprising since RLRMC is a nonconvex fixed-rank model. Since the performance of RLRMC is very sensitive to both rank and , extensive parameter tuning is required. Thus, FWUA can be an attractive alternative when little is known about the underlying data and a good estimate of the true rank is not available a priori. We also note that the out-of-sample RMSE performances of FWUA and RDFW are very similar. It appears that the use of the $\ell_1$ loss, instead of the $\ell_2$ loss, does not make any significant difference in RMSE for this particular dataset.
We propose a variant of the Frank-Wolfe algorithm for a nonsmooth objective, by replacing the first order Taylor approximation with the Chebyshev uniform affine approximation in the FW subproblem. We show that for nonsmooth matrix estimation problems, this uniform approximation is easy to compute and allows for convergence analysis without assuming a bounded curvature constant. Experimentally we demonstrate that the FWUA algorithm can improve both speed and classification performance in a variety of sparse and low-rank learning tasks, while providing a viable convex alternative for $\ell_1$ loss matrix completion when little is known about the underlying data.
- [Argyriou et al., 2014] Argyriou, A., Signoretto, M., and Suykens, J. A. (2014). Hybrid conditional gradient-smoothing algorithms with applications to sparse and low rank regularization. In Regularization, Optimization, Kernels, and Support Vector Machines, pages 53–82. Chapman and Hall/CRC.
- [Cambier and Absil, 2016] Cambier, L. and Absil, P.-A. (2016). Robust low-rank matrix completion by Riemannian optimization. SIAM Journal on Scientific Computing, 38(5):S440–S460.
- [Cheung and Li, 2017] Cheung, E. and Li, Y. (2017). Projection free rank-drop steps. In Proceedings of the Twenty-Sixth International Joint Conference on Artificial Intelligence, IJCAI-17, pages 1539–1545.
- [Clarkson, 2010] Clarkson, K. L. (2010). Coresets, sparse greedy approximation, and the Frank-Wolfe algorithm. ACM Transactions on Algorithms (TALG), 6(4):63.
- [Frank and Wolfe, 1956] Frank, M. and Wolfe, P. (1956). An algorithm for quadratic programming. Naval Research Logistics (NRL), 3(1-2):95–110.
- [Hsieh and Olsen, 2014] Hsieh, C.-J. and Olsen, P. (2014). Nuclear norm minimization via active subspace selection. In International Conference on Machine Learning, pages 575–583.
- [Jaggi, 2011] Jaggi, M. (2011). Sparse Convex Optimization Methods for Machine Learning. PhD thesis, ETH Zurich.
- [Leskovec and Krevl, 2014] Leskovec, J. and Krevl, A. (2014). SNAP Datasets: Stanford large network dataset collection. http://snap.stanford.edu/data.
- [Parikh et al., 2014] Parikh, N., Boyd, S., et al. (2014). Proximal algorithms. Foundations and Trends® in Optimization, 1(3):127–239.
- [Pierucci et al., 2014] Pierucci, F., Harchaoui, Z., and Malick, J. (2014). A smoothing approach for composite conditional gradient with nonsmooth loss.
- [Ravi et al., 2017] Ravi, S. N., Collins, M. D., and Singh, V. (2017). A deterministic nonsmooth Frank Wolfe algorithm with coreset guarantees. arXiv preprint arXiv:1708.06714.
- [Richard et al., 2012] Richard, E., Savalle, P.-A., and Vayatis, N. (2012). Estimation of simultaneously sparse and low rank matrices. arXiv preprint arXiv:1206.6474.
- [Wegge, 1974] Wegge, L. L. (1974). Mean value theorem for convex functions. Journal of Mathematical Economics, 1(2):207–208.
- [White, 1993] White, D. (1993). Extension of the Frank-Wolfe algorithm to concave nondifferentiable objective functions. Journal of Optimization Theory and Applications, 78(2):283–301.
Proof of Theorem 3.3
Proof of Theorem 3.4
We begin with the formal statement of the convex mean value theorem.
Lemma .1 (Convex Mean Value Theorem [Wegge, 1974]).
If is a closed proper convex function from for in a convex set , then and implies that there exists , and a vector , where , such that .
The affine function defines the line that connects to . Since is convex, on .
From Lemma .1, there exists such that
The function is the line tangent to at and is parallel to . Since is convex, on .
By construction, is a line parallel and equidistant to the lines and . Thus, it is easy to verify that
satisfying the equioscillation property. Thus, is the minimax affine approximation to on . ∎
Proof of Theorem 4.3
Before we establish Theorem 4.3, we require the following lemmas.
Let be a function that satisfies Assumption 3.2 with convex and let be the corresponding uniform slope function for . Then
for some where is a subgradient of .
From Theorem 3.4, the slope function has the form,
which is simply the slope of the secant line of from to . Thus, the desired result follows immediately from the convex mean value theorem. ∎
Using intermediate value theorem for integrals, there exists such that
From Theorem 3.4
By the Lagrange remainder theorem, we have
If is not twice differentiable, from Corollary .2, we have that for some . Since and are just specific subgradients evaluated on the interval , we have
and the result follows. ∎
For notational simplicity, we will drop the dependency on for and .
We establish that is convex by showing that each is convex. Since is a differentiable function of one variable, is convex if and only if is nondecreasing in . We have that for any ,
Since is convex, we have that the slope of any secant,
is nondecreasing in either or .
Thus, the fact that is nondecreasing follows immediately, since
Note we can expand the maximum as follows,