Biconvex Landscape In SDPRelated Learning
Abstract
Many machine learning problems can be reduced to learning a lowrank positive semidefinite matrix (denoted as ), which encounters semidefinite program (SDP). Existing SDP solvers are often expensive for largescale learning. To avoid directly solving SDP, some works convert SDP into a nonconvex program by factorizing as . However, this would bring higherorder nonlinearity, resulting in scarcity of structure in subsequent optimization. In this paper, we propose a novel surrogate for SDPrelated learning, in which the structure of subproblem is exploited. More specifically, we surrogate unconstrained SDP by a biconvex problem, through factorizing as and using a Courant penalty to penalize the difference of and , in which the resultant subproblems are convex. Furthermore, we provide a theoretical bound for the associated penalty parameter under the assumption that the objective function is Lipschitzsmooth, such that the proposed surrogate will solve the original SDP when the penalty parameter is larger than this bound. Experiments on two SDPrelated machine learning applications demonstrate that the proposed algorithm is as accurate as the stateoftheart, but is faster on largescale learning.
Introduction
In this paper, we consider the minimization of a smooth convex function over symmetric positive semidefinite (PSD) matrices, that is
(1) 
where is differentiable convex, and is the cone of symmetric PSD matrices, in which denotes the Löwner partial order. Many machine learning problems can be reduced as (1). Prominent examples include adaptive clustering [\citeauthoryearRoyer2017], sparse PCA [\citeauthoryearD’aspremont et al.2007], connected subgraph detection [\citeauthoryearAksoylar et al.2017], nonlinear dimensionality reduction [\citeauthoryearWeinberger et al.2004], kernel learning [\citeauthoryearZhuang et al.2011], distance metric learning [\citeauthoryearXing et al.2002], multitask learning [\citeauthoryearObozinski et al.2009], streaming model [\citeauthoryearTropp et al.2017], matrix completion [\citeauthoryearSrebro et al.2005], and inference in graphical model [\citeauthoryearErdogdu et al.2017], etc.
Generally speaking, PSD constraint is the most challenging aspect of solving (1). In order to overcome this difficulty, some works [\citeauthoryearBurer and Monteiro2003, \citeauthoryearJourneé et al.2010, \citeauthoryearLi and Tang2016] proposed to factorize as , and surrogate (1) by a nonconvex program as
(2) 
where is a real matrix and if it is lowrank. Compared with (1), problem (2) does not have PSD constraint, but the objective becomes a nonconvex function.
To solve problem (2), we refer to the existing results in [\citeauthoryearBurer and Monteiro2003, \citeauthoryearJourneé et al.2010, \citeauthoryearLi and Tang2016], and then from the perspectives therein we can summarize the connections between (1) and (2) as follows:
The above results show that a local minimizer of (2) will solve (1) if we set (i.e., is rank deficient, refer to [\citeauthoryearLi and Tang2016] for more details). The recent work [\citeauthoryearGe et al.2017] further claimed that: for (2), all local minimizers are also globally optimal, and no highorder saddle points exist. Nevertheless, the factorized formulation (2) would bring higherorder nonlinearity, and then result in scarcity of structure in subsequent optimization.
Different from (2), in this paper we propose a novel surrogate of (1), in which the subproblem structure is convex. Specifically, we convert SDP (1) into a biconvex problem using Courant penalty [\citeauthoryearNocedal and Wright2006], in which the resultant subproblems are convex. We will show that the proposed surrogate contributes to faster computation to optimize (1) especially when is a quadratic function.
Empirical results on two applications demonstrate that our proposal can bring computational gain, which is substantial without loss of accuracy.
Related Work
The interior point method [\citeauthoryearNesterov and Nemirovski1994] can produce highly accurate solutions for SDP problem. However, it is not scalable because at least time and space are needed in each iteration. Moreover, the interior point method is difficult to utilize additional information about the problem structure, such as that the target matrix is lowrank. To exploit lowrank structure, a popular approach for solving SDP is the FrankWolfe (FW) algorithm [\citeauthoryearJaggi2011], which is based on sparse greedy approximation. For smooth problems, the complexity of FW is arithmetic operations [\citeauthoryearJaggi2011], that is: iterations are demanded to converge to an accurate solution, and in each iteration, arithmetic operations are needed, where is the number of nonzero entries in gradient. When the gradient of objective is dense, FW can still be expensive.
Alternatively, as in (1) is symmetric and positive semidefinite, it can be factorized as . Thus, problem (1) can then be rewritten as a nonconvex program as (2) [\citeauthoryearBurer and Monteiro2003]. It is known that a local minimizer of (2) corresponds to a minimizer of (1) if is rankdeficient [\citeauthoryearBurer and Monteiro2003, \citeauthoryearJourneé et al.2010, \citeauthoryearLi and Tang2016]. For linear objective and linear constraints, the nonconvex program (2) has been solved with LBFGS [\citeauthoryearBurer and Monteiro2003, \citeauthoryearNocedal and Wright2006]. However, the convergence properties are unclear when solving a general nonconvex program by LBFGS. Blockcyclic coordinate minimization has been used in [\citeauthoryearHu et al.2011] to solve a special nonconvex program of SDP, but a closedform solution is preferred in each block coordinate update, which might be overly restrictive.
Notation
The transpose of vector or matrix is denoted by the superscript . The identity matrix is denoted by with appropriate size. For a matrix , is its trace, is its Frobenius norm, and unrolls into a column vector. For two matrices , , and is their Kronecker product. For a function , is the derivative w.r.t. variable , and is smooth and strongly convex if for any , there exists a Lipschitz constant such that . If is a function w.r.t. and , then and are functions w.r.t. with constant and w.r.t. with constant respectively.
Problem Formulation
Motivation
For current largescale optimization in machine learning, the firstorder optimization methods, such as (accelerated/stochastic) gradient descent, are popular because of their computational simplicity. In the class of these methods, the most expensive operation usually lies in repeatedly searching a stepsize for objective descent at each iteration, which involves computation of the objective many times. If a good stepsize is possible in an analytical and simple form, the computation complexity would be decreased greatly.
However, a simple stepsize is impossible in (2) even if is a simple quadratic function. This is because that when in (1) is quadratic w.r.t. , the objective in (2) rises into a quartic function w.r.t. , such that the stepsize search in (2) involves some complex computations including constructing the coefficients of a quartic polynomial, solving a cubic equation and comparing the resulted different solutions [\citeauthoryearBurer and Choi2006]. Again, this needs compute the objective many times.
The above difficulties motivate us to propose biconvex formulation (3) instead of (2). In each subproblem of (3), the stepsize search is usually easier than in (2). For example, when is a quadratic function, the subobjective or in (3) is still quadratic w.r.t. or respectively. Hence, the optimal stepsize to descend objective w.r.t or is analytical, simple and unique for (3) ((16) is an example).
Problem Formulation
Instead of factorizing as , we factorize as , and penalize the difference between and , which can be formulated as
(3) 
where is a penalty parameter and is the symmetrization of in terms of and , namely is satisfied. For example, can be defined as
(4) 
and we will see (in proof of Theorem 1) that the symmetry of is key to bound . Note that the penalty term in (3) is the classic quadratic (or Courant) penalty for the constraint . In general, problem (3) approaches problem (2) only when goes to infinity. However, we will show (in Theorem 1) that can be bounded if and are Lipschitzsmooth, such that when is larger than that bound, (3) is equivalent to (2) in that a stationary point of (3) provides as a stationary point of (2). Hence, any local minimizer of (3) will produce optimal solution of (1) if is set as large as enough. Moreover, it will be shown that the objective in (3) is biconvex, and thus we can choose a convex optimization algorithm to decrease the objective function w.r.t. and alternately. Consequently, this opens a door to surrogate an unconstrained SDP by biconvex optimization.
Contributions
Our main contributions can be summarized as follows:

We propose a biconvex penalty formulation (3) to surrogate unconstrained SDP. This opens a door to solve many SDPbased learning problems using biconvex optimization.
Proposed Algorithm
Biconvex Penalty
We first give the definition of biconvexity as follows
Definition 1.
is biconvex if for any , and any () such that , we have . Namely, is convex in with fixed and convex in with fixed respectively.
The following Proposition shows that in (4) and the objective in (3) are biconvex if in (1) is convex.
We assume that in (3) is coercive and satisfies the following assumption.
Assumption 1.
is smooth and strongly convex, and is smooth and strongly convex.
The Theoretical Bound
In the following Theorem, we provide a theoretical bound of penalty parameter to connect (3) to (2), and thus to (1).
Theorem 1.
If is a stationary point of (3), then when .
Corollary 1.
Remark 1.
Following Corollary 1, a local minimizer of (3) provides as a local minimizer of (2) if . Furthermore, the produced will solve (1) if the condition of rank deficiency (i.e., ) is satisfied [\citeauthoryearLi and Tang2016].
Our Algorithm
Now we focus on solving (3) under the assumption that the penalty parameter satisfies Theorem 1. Depending on the convexity of subproblems, we can solve (3) in alternate convex search. Specifically, we can solve the convex subproblem w.r.t. with fixed and the convex subproblem w.r.t. with fixed alternately, that is
(5)  
(6) 
In fact, it makes sense to solve (5) and (6) only inexactly. An example is multiconvex optimization [\citeauthoryearXu and Yin2013], in which an alternating proximal gradient descent was proposed to solve multiconvex objective.
We first give a brief introduction to accelerated proximal gradient (APG) descent method, and then fuse it into our algorithm. A wellknown APG method is FISTA [\citeauthoryearBeck and Teboulle2009], which was proposed to solve convex composite objective as like as
where is differentiable loss and is nondifferentiable regularizer. When (i.e., ’proximal’ is absent, as used in our case later), FISTA is degenerated into computing the sequence via the iteration
(7) 
where is stepsize for descent.
We can apply (7) (as a degenerated FISTA) to (5) and (6). Since is convex and differentiable, regarding it as we can inexactly solve (5) by using (7) as the process of inner iterations. Analogously, we can inexactly solve (6) by using (7) as inner procedure. The smaller the number of inner iterations, the faster the interaction between the updates of and each other. Specially, if we set as the number of inner iterations to inexactly solve (5) and (6) respectively, optimization in them can be realized as Algorithm 1 (except steps , and ), that is our alternating accelerated gradient descent (AAGD) algorithm, where steps and inexactly solve (5) and steps and inexactly solve (6).
From another perspective, Algorithm 1 (except steps and and ) can be regarded as a degenerated realization (with a slight difference in rectified ) of alternating proximal gradient method in multiconvex optimization [\citeauthoryearXu and Yin2013]. Hence, the convergence of Algorithm 1 is solid following [\citeauthoryearXu and Yin2013].
Applications
NPKL: nonparametric kernel learning
Given patterns, let be the mustlink set containing pairs that should belong to the same class, and be the cannotlink set containing pairs that should not belong to the same class. We use one of formulations in [\citeauthoryearZhuang et al.2011], which learns a kernel matrix using the following SDP problem
(8) 
where is the target kernel matrix, is the graph Laplacian matrix of the data, , with if or , and 0 if , and is a tradeoff parameter. The first term of objective in (8) measures the difference between and , and the second term encourages smoothness on the data manifold by aligning with .
Biconvex reformulation
Let , we have
(9)  
where
and is denoted as the identity matrix with appropriate size, is the th column of , is a constant.
Estimating the bound of
From (9), the Hessian of in terms of variable is
So the Lipschitz smooth constant (in Assumption 1) can be estimated as the maximal eigenvalue of , that is
(12)  
where is the spectral norm, and the last inequality uses , and the last equality uses . In practice, can be estimated dynamically, such that at th iteration, can be approximated as
(13) 
From the symmetry of , similar to (13), can be approximated as
(14) 
Thus, in Algorithm 1 we can dynamically set the penalty parameter by Theorem 1 as
(15) 
Additionally, instead of (13) and (14), we can obtain Lipschitz smooth constants and by line search or backtracking.
Computing the optimal stepsizes
CMVU:colored maximum variance unfolding
The colored maximum variance unfolding (CMVU) is a âcoloredâ variants of maximum variance unfolding [\citeauthoryearWeinberger et al.2004], subjected to class labels information. The slack CMVU [\citeauthoryearSong et al.2008] was proposed as a SDP as
(17) 
where denotes the square Euclidean distance between the th and th objects, denotes the set of neighbor pairs, whose distances are to be preserved in the unfolding space, is a kernel matrix of the labels, centers the data and the labels in the unfolding space, and controls the tradeoff between dependence maximization and distance preservation. Let , (17) can be reformulated as
Let , it is easy to know that satisfies the symmetry. Let and viewing as in (8), the subproblems of (17) w.r.t. and can be derived and they are similar to (10) and (11) respectively.
Complexity Analysis
The iterations in Algorithms 1 are inexpensive. It mainly costs operations for estimating penalty parameter in and since is a constant needed to be computed only once. As in Proposition 2, it is still operations for calculating the optimal stepsizes. The space complexity is scalable because only space is needed indeed.
Experiments
In this part, we perform experiments on two SDPrelated applications in the previous section: NPKL, CMVU and SD.
The proposed algorithm, AAGD in Algorithm 1, will be compared with the following stateoftheart baselines in solving SDP in (1).

FW: accelerated FrankWolf algorithm [\citeauthoryearLaue2012] to solve (1);

AGD: transform (1) to (2), and solve (2) using the accelerated gradient descent [\citeauthoryearGhadimi and Lan2016, \citeauthoryearLi and Lin2015]. The stepsize search is followed as in [\citeauthoryearBurer and Choi2006].

LBFGS: transform (1) to (2), and solve (2) using limited memory BroydenFletcherGoldfarbShanno method (as similar as done in [\citeauthoryearBurer and Monteiro2003]).

SQLP: semidefinitequadraticlinear program [\citeauthoryearWu et al.2009], an interior point method, is specified to quadratic SDP. Naturally, SQLP would be faster than a general purpose SDPsolver to solve a quadratic SDP problem like (8) or (17).
All these algorithms are implemented in Matlab. The stopping condition of FW is reached when the relative change in objective value is smaller than , and then the stopping conditions of the others are reached when their objective values are not larger than the final objective value of FW or the number of iterations exceeds . All experiments are run on a PC with a 3.07GHz CPU and 24GB RAM.
The (rank of the solution) is automatically chosen by the FW algorithm, and is then used by all the other algorithms except SQLP. How to estimate the rank in (2) and (3) is beyond the scope of this paper. Interested readers can refer to [\citeauthoryearJourneé et al.2010, \citeauthoryearBurer and Monteiro2003] for details.
Results on NPKL
Experiments are performed on the adult data sets
As in [\citeauthoryearZhuang et al.2011], the learned kernel matrix is then used for kernelized means clustering with the number of clusters equals to the number of classes. We set and repeat times clustering on each data set with random start point and random pair constraints , , and then the average results ( standard deviations) are reported.
a1a  a2a  a3a  a4a  a5a  a6a  a7a  a8a  a9a  

number of samples  1,605  2,265  3,185  4,781  6,414  11,220  16,100  22,696  32,561 
number of mustlink pairs  1204  2,269  2,389  3,586  4,811  8,415  12,075  17,022  24,421 
number of cannotlink pairs  1204  2,269  2,389  3,586  4,811  8,415  12,075  17,022  24,421 
FW  AGD  AAGD  LBFGS  SQLP  

a1a  98.40 0.48  98.39 0.48  98.38 0.48  98.38 0.53  98.42 0.50 
a2a  98.07 0.39  98.07 0.39  98.07 0.38  98.07 0.44  98.13 0.47 
a3a  97.70 0.43  97.69 0.43  97.69 0.42  97.72 0.45  97.710.41 
a4a  97.58 0.31  97.59 0.29  97.58 0.30  97.60 0.23  97.63 0.36 
a5a  97.63 0.24  97.62 0.24  97.63 0.24  97.58 0.27  – 
a6a  97.64 0.17  97.64 0.18  97.64 0.18  97.64 0.20  – 
a7a  97.67 0.13  97.66 0.14  97.66 0.13  –  – 
a8a  97.64 0.11  97.64 0.12  97.64 0.11  –  – 
a9a  97.55 0.13  97.55 0.13  97.55 0.13  –  – 
FW  AGD  AAGD  LBFGS  SQLP  

a1a  59.84 20.07  52.43 39.44  37.62 31.14  131.29 11.49  339.0115.11 
a2a  102.94 28.08  78.32 44.18  53.15 30.46  237.75 16.34  858.3426.95 
a3a  219.08 47.97  93.47 56.90  83.10 61.77  464.53 34.75  2049.20 105.34 
a4a  318.17 75.12  111.64 44.97  104.29 41.90  1008.17 49.09  7342.42467.26 
a5a  402.82 94.79  126.36 73.75  106.57 41.87  1806.99 107.91  – 
a6a  747.39 151.68  224.87 144.58  165.63 46.01  4961.67 389.22  – 
a7a  1030.07 226.39  275.92 88.91  207.83 60.28  –  – 
a8a  1518.96 270.53  396.67 195.11  290.96 67.80  –  – 
a9a  2134.26 544.91  529.23 131.37  389.74 91.00  –  – 
As in [\citeauthoryearRand1971], clustering accuracy is measured by the Rand index , where is the number of pattern pairs belonging to the same class and are placed in the same cluster by means, is the number of pattern pairs belonging to different classes and are placed in different clusters, and the denominator is the total number of all different pairs. The higher the Rand index, the better the accuracy.
Results on the clustering accuracy and running CPU time are shown in Tables 3 and 3, respectively. As can be seen, all algorithms obtain almost the same accuracy, and AAGD is the fastest. Figure 1 shows convergence of the objective on the smallest a1a data set and the largest a9a data set. It can be observed that, to arrive the final objectivevalue (a broken mauve line), AAGD converges significantly much faster than the others. The optimal stepsizes exist as simple and closed forms (i.e., as (16)) for AAGD, while this is unavailable for AGD, which is one of reasons that AAGD is faster than AGD.
Figure 2 shows the progress of with iterations. It converges to zero clearly, implying that the penalty term will vanish after convergence.
Results on CMVU
Two benchmark data sets
Figure 3 shows the convergences of the objective on the USPS Digits and Newsgroups 20 data sets respectively. It can be observed that, to arrive the final objectivevalue (a broken mauve line), AAGD converges significantly much faster than the others. On these two data sets, the run time of SQLP exceeds seconds, so we don’t plot them.
Conclusion
Many machine learning problems can be reduced to SDP formulation, which would be expensive for largescale learning. In this paper, we take the scalable learning of SDP problem into consideration. More specifically, we reformulate unconstrained SDP as a biconvex problem with the addition of a Courant penalty, which can be easily optimized using alternating accelerated gradient descent algorithm. Furthermore, we show that when the penalty parameter is larger than our theoretical bound, any local minimizer of the biconvex surrogate will provide a global minimizer of the original SDP problem if the condition of rank deficiency is satisfied. Experiments on two SDPrelated problems: nonparametric kernel learning, colored maximum variance unfolding and spectral decomposition, demonstrate that the proposed algorithm is both accurate and scalable.
For the future, we would generalize the biconvex surrogate to more complex circumstances.
Footnotes
 Downloaded from http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/binary.html.
 Downloaded from https://www.cc.gatech.edu/~lsong/code.html..
References
 Cem Aksoylar, Lorenzo Orecchia, and Venkatesh Saligrama. Connected subgraph detection with mirror descent on SDPs. In Proceedings of the 34th International Conference on Machine Learning, pages 51–59, 2017.
 A. Beck and M. Teboulle. A fast iterative shrinkagethresholding algorithm for linear inverse problems. SIAM Journal on Imaging Sciences, 2(1):183–202, 2009.
 S. Burer and C. Choi. Computational enhancements in lowrank semidefinite programming. Optimization Methods and Software, 21(3):493–512, 2006.
 S. Burer and R.D.C. Monteiro. A nonlinear programming algorithm for solving semidefinite programs via lowrank factorization. Mathematical Programming (Series B), 95:329–357, 2003.
 A. D’aspremont, El L. Ghaoui, M. I. Jordan, and G. R. G. Lanckriet. A direct formulation for sparse PCA using semidefinite programming. SIAM Review, 49(3):434–48, 2007.
 Murat A. Erdogdu, Yash Deshpande, and Andrea Montanari. Inference in graphical models via semidefinite programming hierarchies. In Advances in Neural Information Processing Systems 30, pages 416–424. 2017.
 Rong Ge, Chi Jin, and Yi Zheng. No spurious local minima in nonconvex low rank problems: A unified geometric analysis. In Proceedings of the 34th International Conference on Machine Learning, pages 1233–1242, 2017.
 Saeed Ghadimi and Guanghui Lan. Accelerated gradient methods for nonconvex nonlinear and stochastic programming. Mathematical Programming, 156(12):59–99, 2016.
 J. Gorski, F. Pfeuffer, and K. Klamroth. Biconvex sets and optimization with biconvex functions: A survey and extensions. Mathematical Methods of Operations Research, 66(3):373–407, 2007.
 EnLiang Hu, Bo Wang, and SongCan Chen. BCDNPKL: Scalable nonparametric kernel learning using block coordinate descent. In Proceedings of the 28th International Conference on Machine Learning, pages 209–216, Bellevue, WA, USA, June 2011.
 M. Jaggi. Convex optimization without projection steps. Preprint arXiv:1108.1170, 2011.
 Michel Journeé, F. Bach, PierreAntoine Absil, and Rodolphe Sepulchre. Lowrank optimization on the cone of positive semidefinite matrices. SIAM Journal on Optimization, pages 2327–2351, 2010.
 Soeren Laue. A hybrid algorithm for convex semidefinite optimization. In John Langford and Joelle Pineau, editors, Proceedings of the 29th International Conference on Machine Learning (ICML12), ICML ’12, pages 177–184, New York, NY, USA, July 2012. Omnipress.
 Huan Li and Zhouchen Lin. Accelerated proximal gradient methods for nonconvex programming. In In Advances in Neural Information Processing Systems (NIPS), pages 379–387, 2015.
 Qiuwei Li and Gongguo Tang. The nonconvex geometry of lowrank matrix optimizations with general objective functions. CoRR, abs/1611.03060, 2016.
 Xin Liu, Zaiwen Wen, and Yin Zhang. An efficient gaussnewton algorithm for symmetric lowrank product matrix approximations. SIAM Journal on Optimization, 25(3):1571–1608, 2015.
 Y. Nesterov and A. Nemirovski. InteriorPoint Polynomial Algorithms in Convex Programming. SIAM, 1994.
 J. Nocedal and S.J. Wright. Numerical Optimization. Springer, 2006.
 G. Obozinski, B. Taskar, and M. I. Jordan. Joint covariate selection and joint subspace selection for multiple classification problems. Statistics and Computing, 20(2):231–252, 2009.
 Michael L. Overton and Robert S. Womersley. On the sum of the largest eigenvalues of a symmetric matrix. SIAM Journal on Matrix Analysis and Applications, 13(1):41–45, 1992.
 W. M. Rand. Objective criteria for the evaluation of clustering methods. Journal of the American Statistical Association, 66:846–850, 1971.
 Martin Royer. Adaptive clustering through semidefinite programming. In Advances in Neural Information Processing Systems 30, pages 1793–1801. 2017.
 L. Song, A.J. Smola, K.M. Borgwardt, and A. Gretton. Colored maximum variance unfolding. In Advances in Neural Information Processing Systems 20, 2008.
 N. Srebro, J.D.M. Rennie, and T.S. Jaakola. Maximummargin matrix factorization. In Advances in Neural Information Processing Systems 17, pages 1329–1336, 2005.
 Joel A Tropp, Alp Yurtsever, Madeleine Udell, and Volkan Cevher. Fixedrank approximation of a positivesemidefinite matrix from streaming data. In Advances in Neural Information Processing Systems 30, pages 1225–1234. 2017.
 Kilian Q. Weinberger, Fei Sha, and Lawrence K. Saul. Learning a kernel matrix for nonlinear dimensionality reduction. In Proceedings of the Twentyfirst International Conference on Machine Learning, pages 839–846, 2004.
 Zaiwen Wen, Chao Yang, Xin Liu, and Yin Zhang. Tracepenalty minimization for largescale eigenspace computation. J. Sci. Comput., 66(3):1175–1203, 2016.
 X.M. Wu, A. M.C. So, Z. Li, and S.Y.R. Li. Fast graph Laplacian regularized kernel learning via semidefinitequadraticlinear programming. In Advances in Neural Information Processing Systems 22, pages 1964–1972, 2009.
 E.P. Xing, A.Y. Ng, M.I. Jordan, and S.J. Russell. Distance metric learning with application to clustering with sideinformation. In Advances in Neural Information Processing Systems, pages 505–512, 2002.
 Y. Xu and W. Yin. A block coordinate descent method for regularized multiconvex optimization with applications to nonnegative tensor factorization and completion. SIAM Journal on Imaging Sciences, 6(3):1758–1789, 2013.
 J. Zhuang, I.W. Tsang, and S.C.H. Hoi. A family of simple nonparametric kernel learning algorithms. Journal of Machine Learning Research, 12:1313–1347, 2011.