Biconvex Landscape In SDP-Related Learning
Many machine learning problems can be reduced to learning a low-rank positive semidefinite matrix (denoted as ), which encounters semidefinite program (SDP). Existing SDP solvers are often expensive for large-scale learning. To avoid directly solving SDP, some works convert SDP into a nonconvex program by factorizing as . However, this would bring higher-order nonlinearity, resulting in scarcity of structure in subsequent optimization. In this paper, we propose a novel surrogate for SDP-related learning, in which the structure of subproblem is exploited. More specifically, we surrogate unconstrained SDP by a biconvex problem, through factorizing as and using a Courant penalty to penalize the difference of and , in which the resultant subproblems are convex. Furthermore, we provide a theoretical bound for the associated penalty parameter under the assumption that the objective function is Lipschitz-smooth, such that the proposed surrogate will solve the original SDP when the penalty parameter is larger than this bound. Experiments on two SDP-related machine learning applications demonstrate that the proposed algorithm is as accurate as the state-of-the-art, but is faster on large-scale learning.
In this paper, we consider the minimization of a smooth convex function over symmetric positive semidefinite (PSD) matrices, that is
where is differentiable convex, and is the cone of symmetric PSD matrices, in which denotes the Löwner partial order. Many machine learning problems can be reduced as (1). Prominent examples include adaptive clustering [\citeauthoryearRoyer2017], sparse PCA [\citeauthoryearD’aspremont et al.2007], connected subgraph detection [\citeauthoryearAksoylar et al.2017], nonlinear dimensionality reduction [\citeauthoryearWeinberger et al.2004], kernel learning [\citeauthoryearZhuang et al.2011], distance metric learning [\citeauthoryearXing et al.2002], multitask learning [\citeauthoryearObozinski et al.2009], streaming model [\citeauthoryearTropp et al.2017], matrix completion [\citeauthoryearSrebro et al.2005], and inference in graphical model [\citeauthoryearErdogdu et al.2017], etc.
Generally speaking, PSD constraint is the most challenging aspect of solving (1). In order to overcome this difficulty, some works [\citeauthoryearBurer and Monteiro2003, \citeauthoryearJourneé et al.2010, \citeauthoryearLi and Tang2016] proposed to factorize as , and surrogate (1) by a nonconvex program as
To solve problem (2), we refer to the existing results in [\citeauthoryearBurer and Monteiro2003, \citeauthoryearJourneé et al.2010, \citeauthoryearLi and Tang2016], and then from the perspectives therein we can summarize the connections between (1) and (2) as follows:
The above results show that a local minimizer of (2) will solve (1) if we set (i.e., is rank deficient, refer to [\citeauthoryearLi and Tang2016] for more details). The recent work [\citeauthoryearGe et al.2017] further claimed that: for (2), all local minimizers are also globally optimal, and no high-order saddle points exist. Nevertheless, the factorized formulation (2) would bring higher-order nonlinearity, and then result in scarcity of structure in subsequent optimization.
Different from (2), in this paper we propose a novel surrogate of (1), in which the subproblem structure is convex. Specifically, we convert SDP (1) into a biconvex problem using Courant penalty [\citeauthoryearNocedal and Wright2006], in which the resultant subproblems are convex. We will show that the proposed surrogate contributes to faster computation to optimize (1) especially when is a quadratic function.
Empirical results on two applications demonstrate that our proposal can bring computational gain, which is substantial without loss of accuracy.
The interior point method [\citeauthoryearNesterov and Nemirovski1994] can produce highly accurate solutions for SDP problem. However, it is not scalable because at least time and space are needed in each iteration. Moreover, the interior point method is difficult to utilize additional information about the problem structure, such as that the target matrix is low-rank. To exploit low-rank structure, a popular approach for solving SDP is the Frank-Wolfe (FW) algorithm [\citeauthoryearJaggi2011], which is based on sparse greedy approximation. For smooth problems, the complexity of FW is arithmetic operations [\citeauthoryearJaggi2011], that is: iterations are demanded to converge to an -accurate solution, and in each iteration, arithmetic operations are needed, where is the number of nonzero entries in gradient. When the gradient of objective is dense, FW can still be expensive.
Alternatively, as in (1) is symmetric and positive semidefinite, it can be factorized as . Thus, problem (1) can then be rewritten as a nonconvex program as (2) [\citeauthoryearBurer and Monteiro2003]. It is known that a local minimizer of (2) corresponds to a minimizer of (1) if is rank-deficient [\citeauthoryearBurer and Monteiro2003, \citeauthoryearJourneé et al.2010, \citeauthoryearLi and Tang2016]. For linear objective and linear constraints, the nonconvex program (2) has been solved with L-BFGS [\citeauthoryearBurer and Monteiro2003, \citeauthoryearNocedal and Wright2006]. However, the convergence properties are unclear when solving a general nonconvex program by L-BFGS. Block-cyclic coordinate minimization has been used in [\citeauthoryearHu et al.2011] to solve a special nonconvex program of SDP, but a closed-form solution is preferred in each block coordinate update, which might be overly restrictive.
The transpose of vector or matrix is denoted by the superscript . The identity matrix is denoted by with appropriate size. For a matrix , is its trace, is its Frobenius norm, and unrolls into a column vector. For two matrices , , and is their Kronecker product. For a function , is the derivative w.r.t. variable , and is -smooth and -strongly convex if for any , there exists a Lipschitz constant such that . If is a function w.r.t. and , then and are functions w.r.t. with constant and w.r.t. with constant respectively.
For current large-scale optimization in machine learning, the first-order optimization methods, such as (accelerated/stochastic) gradient descent, are popular because of their computational simplicity. In the class of these methods, the most expensive operation usually lies in repeatedly searching a stepsize for objective descent at each iteration, which involves computation of the objective many times. If a good stepsize is possible in an analytical and simple form, the computation complexity would be decreased greatly.
However, a simple stepsize is impossible in (2) even if is a simple quadratic function. This is because that when in (1) is quadratic w.r.t. , the objective in (2) rises into a quartic function w.r.t. , such that the stepsize search in (2) involves some complex computations including constructing the coefficients of a quartic polynomial, solving a cubic equation and comparing the resulted different solutions [\citeauthoryearBurer and Choi2006]. Again, this needs compute the objective many times.
The above difficulties motivate us to propose biconvex formulation (3) instead of (2). In each subproblem of (3), the stepsize search is usually easier than in (2). For example, when is a quadratic function, the subobjective or in (3) is still quadratic w.r.t. or respectively. Hence, the optimal stepsize to descend objective w.r.t or is analytical, simple and unique for (3) ((16) is an example).
Instead of factorizing as , we factorize as , and penalize the difference between and , which can be formulated as
where is a penalty parameter and is the symmetrization of in terms of and , namely is satisfied. For example, can be defined as
and we will see (in proof of Theorem 1) that the symmetry of is key to bound . Note that the penalty term in (3) is the classic quadratic (or Courant) penalty for the constraint . In general, problem (3) approaches problem (2) only when goes to infinity. However, we will show (in Theorem 1) that can be bounded if and are Lipschitz-smooth, such that when is larger than that bound, (3) is equivalent to (2) in that a stationary point of (3) provides as a stationary point of (2). Hence, any local minimizer of (3) will produce optimal solution of (1) if is set as large as enough. Moreover, it will be shown that the objective in (3) is biconvex, and thus we can choose a convex optimization algorithm to decrease the objective function w.r.t. and alternately. Consequently, this opens a door to surrogate an unconstrained SDP by biconvex optimization.
Our main contributions can be summarized as follows:
We propose a biconvex penalty formulation (3) to surrogate unconstrained SDP. This opens a door to solve many SDP-based learning problems using biconvex optimization.
We first give the definition of biconvexity as follows
is biconvex if for any , and any () such that , we have . Namely, is convex in with fixed and convex in with fixed respectively.
We assume that in (3) is coercive and satisfies the following assumption.
is -smooth and -strongly convex, and is -smooth and -strongly convex.
The Theoretical Bound
If is a stationary point of (3), then when .
Now we focus on solving (3) under the assumption that the penalty parameter satisfies Theorem 1. Depending on the convexity of subproblems, we can solve (3) in alternate convex search. Specifically, we can solve the convex subproblem w.r.t. with fixed and the convex subproblem w.r.t. with fixed alternately, that is
In fact, it makes sense to solve (5) and (6) only inexactly. An example is multiconvex optimization [\citeauthoryearXu and Yin2013], in which an alternating proximal gradient descent was proposed to solve multiconvex objective.
We first give a brief introduction to accelerated proximal gradient (APG) descent method, and then fuse it into our algorithm. A well-known APG method is FISTA [\citeauthoryearBeck and Teboulle2009], which was proposed to solve convex composite objective as like as
where is differentiable loss and is nondifferentiable regularizer. When (i.e., ’proximal’ is absent, as used in our case later), FISTA is degenerated into computing the sequence via the iteration
where is stepsize for descent.
We can apply (7) (as a degenerated FISTA) to (5) and (6). Since is convex and differentiable, regarding it as we can inexactly solve (5) by using (7) as the process of inner iterations. Analogously, we can inexactly solve (6) by using (7) as inner procedure. The smaller the number of inner iterations, the faster the interaction between the updates of and each other. Specially, if we set as the number of inner iterations to inexactly solve (5) and (6) respectively, optimization in them can be realized as Algorithm 1 (except steps , and ), that is our alternating accelerated gradient descent (AAGD) algorithm, where steps and inexactly solve (5) and steps and inexactly solve (6).
From another perspective, Algorithm 1 (except steps and and ) can be regarded as a degenerated realization (with a slight difference in rectified ) of alternating proximal gradient method in multiconvex optimization [\citeauthoryearXu and Yin2013]. Hence, the convergence of Algorithm 1 is solid following [\citeauthoryearXu and Yin2013].
NPKL: nonparametric kernel learning
Given patterns, let be the must-link set containing pairs that should belong to the same class, and be the cannot-link set containing pairs that should not belong to the same class. We use one of formulations in [\citeauthoryearZhuang et al.2011], which learns a kernel matrix using the following SDP problem
where is the target kernel matrix, is the graph Laplacian matrix of the data, , with if or , and 0 if , and is a tradeoff parameter. The first term of objective in (8) measures the difference between and , and the second term encourages smoothness on the data manifold by aligning with .
Let , we have
and is denoted as the identity matrix with appropriate size, is the -th column of , is a constant.
Clearly, is symmetric in terms of and with and . Plugging into (3), we have
So, the subproblem about (corresponding to (5)) is
where , .
Estimating the bound of
From (9), the Hessian of in terms of variable is
So the Lipschitz smooth constant (in Assumption 1) can be estimated as the maximal eigenvalue of , that is
where is the spectral norm, and the last inequality uses , and the last equality uses . In practice, can be estimated dynamically, such that at -th iteration, can be approximated as
From the symmetry of , similar to (13), can be approximated as
Computing the optimal stepsizes
CMVU:colored maximum variance unfolding
The colored maximum variance unfolding (CMVU) is a âcoloredâ variants of maximum variance unfolding [\citeauthoryearWeinberger et al.2004], subjected to class labels information. The slack CMVU [\citeauthoryearSong et al.2008] was proposed as a SDP as
where denotes the square Euclidean distance between the -th and -th objects, denotes the set of neighbor pairs, whose distances are to be preserved in the unfolding space, is a kernel matrix of the labels, centers the data and the labels in the unfolding space, and controls the tradeoff between dependence maximization and distance preservation. Let , (17) can be reformulated as
The iterations in Algorithms 1 are inexpensive. It mainly costs operations for estimating penalty parameter in and since is a constant needed to be computed only once. As in Proposition 2, it is still operations for calculating the optimal stepsizes. The space complexity is scalable because only space is needed indeed.
In this part, we perform experiments on two SDP-related applications in the previous section: NPKL, CMVU and SD.
All these algorithms are implemented in Matlab. The stopping condition of FW is reached when the relative change in objective value is smaller than , and then the stopping conditions of the others are reached when their objective values are not larger than the final objective value of FW or the number of iterations exceeds . All experiments are run on a PC with a 3.07GHz CPU and 24GB RAM.
The (rank of the solution) is automatically chosen by the FW algorithm, and is then used by all the other algorithms except SQLP. How to estimate the rank in (2) and (3) is beyond the scope of this paper. Interested readers can refer to [\citeauthoryearJourneé et al.2010, \citeauthoryearBurer and Monteiro2003] for details.
Results on NPKL
Experiments are performed on the adult data sets
As in [\citeauthoryearZhuang et al.2011], the learned kernel matrix is then used for kernelized -means clustering with the number of clusters equals to the number of classes. We set and repeat times clustering on each data set with random start point and random pair constraints , , and then the average results ( standard deviations) are reported.
|number of samples||1,605||2,265||3,185||4,781||6,414||11,220||16,100||22,696||32,561|
|number of must-link pairs||1204||2,269||2,389||3,586||4,811||8,415||12,075||17,022||24,421|
|number of cannot-link pairs||1204||2,269||2,389||3,586||4,811||8,415||12,075||17,022||24,421|
|a1a||98.40 0.48||98.39 0.48||98.38 0.48||98.38 0.53||98.42 0.50|
|a2a||98.07 0.39||98.07 0.39||98.07 0.38||98.07 0.44||98.13 0.47|
|a3a||97.70 0.43||97.69 0.43||97.69 0.42||97.72 0.45||97.710.41|
|a4a||97.58 0.31||97.59 0.29||97.58 0.30||97.60 0.23||97.63 0.36|
|a5a||97.63 0.24||97.62 0.24||97.63 0.24||97.58 0.27||–|
|a6a||97.64 0.17||97.64 0.18||97.64 0.18||97.64 0.20||–|
|a7a||97.67 0.13||97.66 0.14||97.66 0.13||–||–|
|a8a||97.64 0.11||97.64 0.12||97.64 0.11||–||–|
|a9a||97.55 0.13||97.55 0.13||97.55 0.13||–||–|
|a1a||59.84 20.07||52.43 39.44||37.62 31.14||131.29 11.49||339.0115.11|
|a2a||102.94 28.08||78.32 44.18||53.15 30.46||237.75 16.34||858.3426.95|
|a3a||219.08 47.97||93.47 56.90||83.10 61.77||464.53 34.75||2049.20 105.34|
|a4a||318.17 75.12||111.64 44.97||104.29 41.90||1008.17 49.09||7342.42467.26|
|a5a||402.82 94.79||126.36 73.75||106.57 41.87||1806.99 107.91||–|
|a6a||747.39 151.68||224.87 144.58||165.63 46.01||4961.67 389.22||–|
|a7a||1030.07 226.39||275.92 88.91||207.83 60.28||–||–|
|a8a||1518.96 270.53||396.67 195.11||290.96 67.80||–||–|
|a9a||2134.26 544.91||529.23 131.37||389.74 91.00||–||–|
As in [\citeauthoryearRand1971], clustering accuracy is measured by the Rand index , where is the number of pattern pairs belonging to the same class and are placed in the same cluster by -means, is the number of pattern pairs belonging to different classes and are placed in different clusters, and the denominator is the total number of all different pairs. The higher the Rand index, the better the accuracy.
Results on the clustering accuracy and running CPU time are shown in Tables 3 and 3, respectively. As can be seen, all algorithms obtain almost the same accuracy, and AAGD is the fastest. Figure 1 shows convergence of the objective on the smallest a1a data set and the largest a9a data set. It can be observed that, to arrive the final objective-value (a broken mauve line), AAGD converges significantly much faster than the others. The optimal stepsizes exist as simple and closed forms (i.e., as (16)) for AAGD, while this is unavailable for AGD, which is one of reasons that AAGD is faster than AGD.
Figure 2 shows the progress of with iterations. It converges to zero clearly, implying that the penalty term will vanish after convergence.
Results on CMVU
Two benchmark data sets
Figure 3 shows the convergences of the objective on the USPS Digits and Newsgroups 20 data sets respectively. It can be observed that, to arrive the final objective-value (a broken mauve line), AAGD converges significantly much faster than the others. On these two data sets, the run time of SQLP exceeds seconds, so we don’t plot them.
Many machine learning problems can be reduced to SDP formulation, which would be expensive for large-scale learning. In this paper, we take the scalable learning of SDP problem into consideration. More specifically, we reformulate unconstrained SDP as a biconvex problem with the addition of a Courant penalty, which can be easily optimized using alternating accelerated gradient descent algorithm. Furthermore, we show that when the penalty parameter is larger than our theoretical bound, any local minimizer of the biconvex surrogate will provide a global minimizer of the original SDP problem if the condition of rank deficiency is satisfied. Experiments on two SDP-related problems: nonparametric kernel learning, colored maximum variance unfolding and spectral decomposition, demonstrate that the proposed algorithm is both accurate and scalable.
For the future, we would generalize the biconvex surrogate to more complex circumstances.
- Downloaded from http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/binary.html.
- Downloaded from https://www.cc.gatech.edu/~lsong/code.html..
- Cem Aksoylar, Lorenzo Orecchia, and Venkatesh Saligrama. Connected subgraph detection with mirror descent on SDPs. In Proceedings of the 34th International Conference on Machine Learning, pages 51–59, 2017.
- A. Beck and M. Teboulle. A fast iterative shrinkage-thresholding algorithm for linear inverse problems. SIAM Journal on Imaging Sciences, 2(1):183–202, 2009.
- S. Burer and C. Choi. Computational enhancements in low-rank semidefinite programming. Optimization Methods and Software, 21(3):493–512, 2006.
- S. Burer and R.D.C. Monteiro. A nonlinear programming algorithm for solving semidefinite programs via low-rank factorization. Mathematical Programming (Series B), 95:329–357, 2003.
- A. D’aspremont, El L. Ghaoui, M. I. Jordan, and G. R. G. Lanckriet. A direct formulation for sparse PCA using semidefinite programming. SIAM Review, 49(3):434–48, 2007.
- Murat A. Erdogdu, Yash Deshpande, and Andrea Montanari. Inference in graphical models via semidefinite programming hierarchies. In Advances in Neural Information Processing Systems 30, pages 416–424. 2017.
- Rong Ge, Chi Jin, and Yi Zheng. No spurious local minima in nonconvex low rank problems: A unified geometric analysis. In Proceedings of the 34th International Conference on Machine Learning, pages 1233–1242, 2017.
- Saeed Ghadimi and Guanghui Lan. Accelerated gradient methods for nonconvex nonlinear and stochastic programming. Mathematical Programming, 156(1-2):59–99, 2016.
- J. Gorski, F. Pfeuffer, and K. Klamroth. Biconvex sets and optimization with biconvex functions: A survey and extensions. Mathematical Methods of Operations Research, 66(3):373–407, 2007.
- En-Liang Hu, Bo Wang, and Song-Can Chen. BCDNPKL: Scalable non-parametric kernel learning using block coordinate descent. In Proceedings of the 28th International Conference on Machine Learning, pages 209–216, Bellevue, WA, USA, June 2011.
- M. Jaggi. Convex optimization without projection steps. Preprint arXiv:1108.1170, 2011.
- Michel Journeé, F. Bach, Pierre-Antoine Absil, and Rodolphe Sepulchre. Low-rank optimization on the cone of positive semidefinite matrices. SIAM Journal on Optimization, pages 2327–2351, 2010.
- Soeren Laue. A hybrid algorithm for convex semidefinite optimization. In John Langford and Joelle Pineau, editors, Proceedings of the 29th International Conference on Machine Learning (ICML-12), ICML ’12, pages 177–184, New York, NY, USA, July 2012. Omnipress.
- Huan Li and Zhouchen Lin. Accelerated proximal gradient methods for nonconvex programming. In In Advances in Neural Information Processing Systems (NIPS), pages 379–387, 2015.
- Qiuwei Li and Gongguo Tang. The nonconvex geometry of low-rank matrix optimizations with general objective functions. CoRR, abs/1611.03060, 2016.
- Xin Liu, Zaiwen Wen, and Yin Zhang. An efficient gauss-newton algorithm for symmetric low-rank product matrix approximations. SIAM Journal on Optimization, 25(3):1571–1608, 2015.
- Y. Nesterov and A. Nemirovski. Interior-Point Polynomial Algorithms in Convex Programming. SIAM, 1994.
- J. Nocedal and S.J. Wright. Numerical Optimization. Springer, 2006.
- G. Obozinski, B. Taskar, and M. I. Jordan. Joint covariate selection and joint subspace selection for multiple classification problems. Statistics and Computing, 20(2):231–252, 2009.
- Michael L. Overton and Robert S. Womersley. On the sum of the largest eigenvalues of a symmetric matrix. SIAM Journal on Matrix Analysis and Applications, 13(1):41–45, 1992.
- W. M. Rand. Objective criteria for the evaluation of clustering methods. Journal of the American Statistical Association, 66:846–850, 1971.
- Martin Royer. Adaptive clustering through semidefinite programming. In Advances in Neural Information Processing Systems 30, pages 1793–1801. 2017.
- L. Song, A.J. Smola, K.M. Borgwardt, and A. Gretton. Colored maximum variance unfolding. In Advances in Neural Information Processing Systems 20, 2008.
- N. Srebro, J.D.M. Rennie, and T.S. Jaakola. Maximum-margin matrix factorization. In Advances in Neural Information Processing Systems 17, pages 1329–1336, 2005.
- Joel A Tropp, Alp Yurtsever, Madeleine Udell, and Volkan Cevher. Fixed-rank approximation of a positive-semidefinite matrix from streaming data. In Advances in Neural Information Processing Systems 30, pages 1225–1234. 2017.
- Kilian Q. Weinberger, Fei Sha, and Lawrence K. Saul. Learning a kernel matrix for nonlinear dimensionality reduction. In Proceedings of the Twenty-first International Conference on Machine Learning, pages 839–846, 2004.
- Zaiwen Wen, Chao Yang, Xin Liu, and Yin Zhang. Trace-penalty minimization for large-scale eigenspace computation. J. Sci. Comput., 66(3):1175–1203, 2016.
- X.-M. Wu, A. M.-C. So, Z. Li, and S.-Y.R. Li. Fast graph Laplacian regularized kernel learning via semidefinite-quadratic-linear programming. In Advances in Neural Information Processing Systems 22, pages 1964–1972, 2009.
- E.P. Xing, A.Y. Ng, M.I. Jordan, and S.J. Russell. Distance metric learning with application to clustering with side-information. In Advances in Neural Information Processing Systems, pages 505–512, 2002.
- Y. Xu and W. Yin. A block coordinate descent method for regularized multiconvex optimization with applications to nonnegative tensor factorization and completion. SIAM Journal on Imaging Sciences, 6(3):1758–1789, 2013.
- J. Zhuang, I.W. Tsang, and S.C.H. Hoi. A family of simple non-parametric kernel learning algorithms. Journal of Machine Learning Research, 12:1313–1347, 2011.