Stochastic Non-convex Ordinal Embedding with
Stabilized Barzilai-Borwein Step Size
Abstract
Learning representations from relative similarity comparisons, often called ordinal embedding, has gained increasing attention in recent years. Most existing methods are batch methods designed mainly on top of convex optimization, e.g., the projected gradient descent method. However, they are generally time-consuming because singular value decomposition (SVD) is commonly required during the update, especially when the data size is very large. To overcome this challenge, we propose a stochastic algorithm called SVRG-SBB, which has the following features: (a) it is SVD-free via dropping convexity, with good scalability through the use of a stochastic algorithm, i.e., stochastic variance reduced gradient (SVRG); and (b) it adopts an adaptive step size via a new stabilized Barzilai-Borwein (SBB) method, introduced because the original version designed for convex problems may fail for the considered stochastic non-convex optimization problem. Moreover, we show that the proposed algorithm converges to a stationary point at a rate of $\mathcal{O}(1/T)$ in our setting, where $T$ is the total number of iterations. Numerous simulations and real-world data experiments are conducted to show the effectiveness of the proposed algorithm in comparison with state-of-the-art methods; in particular, it attains much lower computational cost with good prediction performance.
Ke Ma, Jinshan Zeng, Jiechao Xiong, Qianqian Xu, Xiaochun Cao, Wei Liu, Yuan Yao State Key Laboratory of Information Security, Institute of Information Engineering, Chinese Academy of Sciences School of Cyber Security, University of Chinese Academy of Sciences School of Computer Information Engineering, Jiangxi Normal University Department of Mathematics, Hong Kong University of Science and Technology, Tencent AI Lab {make, xuqianqian, caoxiaochun}@iie.ac.cn, jsh.zeng@gmail.com jcxiong@tencent.com, wliu@ee.columbia.edu, yuany@ust.hk
Copyright © 2018, Association for the Advancement of Artificial Intelligence (www.aaai.org). All rights reserved.
Introduction
Ordinal embedding aims to learn a representation of data objects as points in a low-dimensional space, such that the distances among these points agree with a set of relative similarity comparisons as well as possible. Relative comparisons are often collected from workers who are asked to answer the following question:
“Is the similarity between objects $i$ and $j$ larger than the similarity between $l$ and $k$?”
The feedback of individuals provides us with a set of quadruplets, i.e., $\{(i,j,l,k)\}$, which can be treated as the supervision for ordinal embedding. Without prior knowledge, the relative similarity comparisons always involve all objects, and the number of potential quadruplets can be $\mathcal{O}(n^4)$.
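As an illustration, quadruplet supervision can be simulated from known coordinates. The helper below (function and argument names are ours, not the paper's notation) samples index quadruplets and orients each one so that the closer pair comes first:

```python
import numpy as np

def quadruplet_comparisons(X, num_samples, rng=None):
    """Sample quadruplets (i, j, l, k) meaning: the pair (i, j) is more
    similar (closer in Euclidean distance) than the pair (l, k)."""
    rng = np.random.default_rng(rng)
    n = X.shape[0]
    quads = []
    while len(quads) < num_samples:
        i, j, l, k = rng.choice(n, size=4, replace=False)
        d_ij = np.sum((X[i] - X[j]) ** 2)
        d_lk = np.sum((X[l] - X[k]) ** 2)
        # orient the quadruplet so the closer pair is listed first
        if d_ij < d_lk:
            quads.append((i, j, l, k))
        else:
            quads.append((l, k, i, j))
    return quads
```

Each returned quadruplet encodes one answer to the comparison question above.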
The ordinal embedding problem was first studied by (?; ?; ?; ?) in the psychometric society. In recent years, it has drawn a lot of attention in machine learning (?; ?; ?; ?), statistical ranking (?; ?; ?), artificial intelligence (?; ?), information retrieval (?), and computer vision (?; ?), among other areas.
One of the typical methods for the ordinal embedding problem is the well-known Generalized Non-Metric Multidimensional Scaling (GNMDS) (?), which aims at finding a low-rank Gram (or kernel) matrix in Euclidean space such that the pairwise distances between the embeddings of objects in a Reproducing Kernel Hilbert Space (RKHS) satisfy the relative similarity comparisons. As GNMDS uses the hinge loss to model the relative similarity between objects, it neglects the information provided by satisfied constraints when finding the underlying structure in the low-dimensional space. To alleviate this issue, Crowd Kernel Learning (CKL) was proposed by (?), employing a scale-invariant loss function. However, the objective function used in CKL only considers the constraints that are strongly violated. Later, (?) proposed Stochastic Triplet Embedding (STE), which jointly penalizes the violated constraints and rewards the satisfied constraints via the logistic loss function. Note that these three typical methods are based on convex formulations, and they employ the projected gradient descent method together with singular value decomposition (SVD) to obtain the embedding. However, the huge number of comparisons and the computational complexity of SVD significantly inhibit their use in large-scale and online applications. Structure Preserving Embedding (SPE) (?) and Local Ordinal Embedding (LOE) (?) embed unweighted nearest neighbor graphs into Euclidean spaces with convex and non-convex objective functions, respectively. A nearest neighbor adjacency matrix can be transformed into ordinal constraints, but such a matrix is not standard equipment in scenarios that involve relative comparisons. With this limitation, SPE and LOE are not suitable for ordinal embedding via quadruplet or triplet comparisons.
In contrast to the kernel-learning or convex formulations of ordinal embedding, the aforementioned methods have analogous non-convex counterparts. The non-convex formulations directly obtain the embedding instead of the Gram matrix. Batch gradient descent is not suitable for solving these large-scale ordinal embedding problems because of the expense of computing full gradients in each iteration. Stochastic gradient descent (SGD) is a common technique in this situation, as it exploits stochastic gradients for fast computation per iteration. In (?), an $\mathcal{O}(1/\sqrt{T})$ convergence rate of SGD for stochastic non-convex optimization problems was established in the sense of convergence to a stationary point, where $T$ is the total number of iterations. As SGD suffers from slow convergence due to the inherent variance of the stochastic gradients, the stochastic variance reduced gradient (SVRG) method was proposed in (?) to accelerate SGD. For strongly convex functions, the linear convergence of SVRG with Option II was established in (?), and later the linear convergence of SVRG with Option I, as well as of SVRG incorporating the Barzilai-Borwein (BB) step size, was shown in (?). In the non-convex case, convergence rates of SVRG in the sense of convergence to a stationary point were shown in (?; ?) under certain conditions.
Although the BB step size has been incorporated into SVRG and its effectiveness shown in (?) for the strongly convex case, it might not work when applied to some stochastic non-convex optimization problems. Indeed, in our later simulations, we found that the absolute value of the original BB step size is unstable when applied to the stochastic non-convex ordinal embedding problem studied in this paper (see Figure 1(a)): it varies dramatically with respect to the epoch number. This phenomenon arises mainly because, without strong convexity, the denominator of the BB step size can be very close to 0, so the BB step size blows up. This motivates us to investigate new stable and adaptive step size strategies for SVRG when applied to the stochastic non-convex ordinal embedding problem.
In this paper, we introduce a new adaptive step size strategy called the stabilized BB (SBB) step size, obtained by adding another positive term to the absolute value of the denominator of the original BB step size to overcome its instability, and then propose a new stochastic algorithm called SVRG-SBB, which incorporates the SBB step size to quickly solve the considered non-convex ordinal embedding model. In summary, our main contributions are as follows:

- We propose a non-convex framework for the ordinal embedding problem by considering the optimization problem with respect to the original embedding variable rather than its Gram matrix. By exploiting this idea, we get rid of the positive semidefinite (PSD) constraint on the Gram matrix; thus, our proposed algorithm is SVD-free and has good scalability.

- The introduced SBB step size can overcome the instability of the original BB step size in settings where the latter does not work. More importantly, the proposed SVRG-SBB algorithm outperforms most of the state-of-the-art methods in our numerous simulations and real-world data experiments, in the sense that SVRG-SBB often has better generalization performance and significantly reduced computational cost.

- We establish an $\mathcal{O}(1/T)$ convergence rate of SVRG-SBB in the sense of convergence to a stationary point, where $T$ is the total number of iterations. This convergence result is comparable with the best existing convergence results in the literature.
Stochastic Ordinal Embedding
A. Problem Description
There is a set of $n$ objects $\{o_1, \ldots, o_n\}$ in an abstract space $\mathcal{O}$. We assume that a certain but unknown dissimilarity function $\xi : \mathcal{O} \times \mathcal{O} \to \mathbb{R}$ assigns the dissimilarity value $\xi_{ij}$ to a pair of objects $(o_i, o_j)$. With the dissimilarity function $\xi$, we can define the ordinal constraints from a set $\mathcal{C}$ of quadruplets, where
$$\mathcal{C} \subseteq \{(i,j,l,k) : \xi_{ij} < \xi_{lk}\},$$
and $(i,j,l,k)$ ranges over quadruplets of indices in $\{1,\ldots,n\}$. Our goal is to obtain the representations $X = [x_1, \ldots, x_n]^\top \in \mathbb{R}^{n \times p}$ of the objects in the Euclidean space $\mathbb{R}^p$, where $p$ is the desired embedding dimension. The embedding $X$ should preserve the ordinal constraints in $\mathcal{C}$ as much as possible, which means
$$\xi_{ij} < \xi_{lk} \;\Longleftrightarrow\; d_{ij}^2(X) < d_{lk}^2(X), \quad \forall (i,j,l,k) \in \mathcal{C},$$
where $d_{ij}^2(X) = \|x_i - x_j\|^2$ is the squared Euclidean distance between $x_i$ and $x_j$, and $x_i$ is the $i$-th row of $X$.
Let $D = [d_{ij}^2(X)] \in \mathbb{R}^{n \times n}$ be the squared distance matrix of $X$. There are some existing methods for recovering $X$ given ordinal constraints on the distance matrix $D$. It is known that $D$ can be determined by the Gram matrix $G = XX^\top$ as
$$D = \mathrm{diag}(G)\,\mathbf{1}^\top - 2G + \mathbf{1}\,\mathrm{diag}(G)^\top,$$
where $\mathrm{diag}(G)$ is the column vector composed of the diagonal of $G$ and $\mathbf{1} = (1, \ldots, 1)^\top$. As $\mathrm{rank}(G) = \mathrm{rank}(X)$ and it always holds that $\mathrm{rank}(X) \le \min(n, p)$, these methods (?; ?; ?) can be generalized as a semidefinite program (SDP) with a low-rank constraint,
$$\min_{G \succeq 0} \; L(G) + \lambda\,\mathrm{tr}(G), \qquad (1)$$
where $L(G)$ is a convex loss function of $G$ that penalizes violations of the ordinal constraints in $\mathcal{C}$, $\lambda > 0$ is a regularization parameter, and $\mathrm{tr}(G)$ is the trace of the matrix $G$. To obtain the embedding $X$, projected gradient descent is performed. The basic idea of the projected gradient descent method is as follows: a batch gradient descent step with all ordinal constraints is first used to update the Gram matrix,
$$G_{t+\frac{1}{2}} = G_t - \gamma_t \nabla L(G_t),$$
where $t$ denotes the current iteration and $\gamma_t$ is the step size; then $G_{t+\frac{1}{2}}$ is projected onto the positive semidefinite (PSD) cone to obtain $G_{t+1}$; and finally, once the iterates converge, the embedding $X$ is obtained by projecting the learned Gram matrix onto the subspace spanned by its $p$ largest eigenvectors via SVD.
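For concreteness, the two eigendecomposition-based steps above (the PSD projection and the final extraction of the embedding) can be sketched as follows; this is a minimal NumPy sketch with our own helper names, not the paper's implementation:

```python
import numpy as np

def project_psd(G):
    """Project a symmetric matrix onto the PSD cone by clipping negative
    eigenvalues; this full eigendecomposition is the costly SVD-like step."""
    G = (G + G.T) / 2.0                      # symmetrize against round-off
    w, V = np.linalg.eigh(G)
    w = np.clip(w, 0.0, None)                # drop negative eigenvalues
    return (V * w) @ V.T

def embedding_from_gram(G, p):
    """Recover an n x p embedding from a (PSD) Gram matrix via its
    top-p eigenpairs, so that X @ X.T approximates G."""
    w, V = np.linalg.eigh(G)
    idx = np.argsort(w)[::-1][:p]            # indices of p largest eigenvalues
    w_top = np.clip(w[idx], 0.0, None)
    return V[:, idx] * np.sqrt(w_top)
```

Both helpers cost $\mathcal{O}(n^3)$ per call, which is exactly the bottleneck discussed next.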
B. Stochastic Nonconvex Ordinal Embedding
Although the SDP (1) is a convex optimization problem, this approach has some disadvantages: (i) the projection onto the PSD cone, which is performed by an expensive SVD due to the absence of any prior knowledge about the structure of $G$, is the computational bottleneck of the optimization; and (ii) the desired dimension of the embedding is $p$, and we hope the learned Gram matrix satisfies $\mathrm{rank}(G) = p$. If $\mathrm{rank}(G) \gg p$, the degrees of freedom of $G$ are much larger than those of $X$, which leads to overfitting. Even though the learned $G$ is a global optimal solution of (1), the subspace spanned by its $p$ largest eigenvectors will then produce a less accurate embedding. One can tune the regularization parameter $\lambda$ to force $G$ to be low-rank, with cross-validation being the most common technique, but this requires extra computational cost. In summary, the projection and the parameter tuning render gradient descent methods computationally prohibitive for learning the embedding from ordinal information. To overcome these challenges, we exploit non-convex and stochastic optimization techniques for the ordinal embedding problem.
To avoid projecting the Gram matrix onto the PSD cone and tuning the parameter $\lambda$, we directly optimize over $X$ and propose the unconstrained problem of learning the embedding,
$$\min_{X \in \mathbb{R}^{n \times p}} \; F(X) := \frac{1}{|\mathcal{C}|} \sum_{c \in \mathcal{C}} f_c(X), \qquad (2)$$
where $f_c(X)$ evaluates how well the embedding $X$ satisfies a single ordinal constraint $c = (i,j,l,k) \in \mathcal{C}$.
The loss function $f_c$ can be chosen as the hinge loss (?),
$$f_c(X) = \max\big\{0,\; 1 + d_{ij}^2(X) - d_{lk}^2(X)\big\}, \qquad (3)$$
the scale-invariant loss (?),
$$f_c(X) = -\log \frac{d_{lk}^2(X) + \delta}{d_{ij}^2(X) + d_{lk}^2(X) + 2\delta}, \qquad (4)$$
where $\delta > 0$ is a scalar that overcomes the problem of degeneracy and preserves numerical stability, the logistic loss (?),
$$f_c(X) = -\log \frac{\exp(-d_{ij}^2(X))}{\exp(-d_{ij}^2(X)) + \exp(-d_{lk}^2(X))}, \qquad (5)$$
and the loss obtained by replacing the Gaussian kernel in (5) by the Student-$t$ kernel with degree $\alpha$ (?),
$$f_c(X) = -\log \frac{\big(1 + d_{ij}^2(X)/\alpha\big)^{-\frac{\alpha+1}{2}}}{\big(1 + d_{ij}^2(X)/\alpha\big)^{-\frac{\alpha+1}{2}} + \big(1 + d_{lk}^2(X)/\alpha\big)^{-\frac{\alpha+1}{2}}}. \qquad (6)$$
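The four losses can be written compactly in code. The sketch below scores a single quadruplet $c = (i,j,l,k)$; the hyperparameter names `delta` and `alpha` are illustrative defaults of ours, and the exact functional forms should be checked against the cited papers:

```python
import numpy as np

def sqdist(X, a, b):
    """Squared Euclidean distance between rows a and b of the embedding X."""
    diff = X[a] - X[b]
    return float(diff @ diff)

# Each loss scores one quadruplet c = (i, j, l, k): the pair (i, j)
# should be closer than the pair (l, k).
def hinge_loss(X, c):                         # GNMDS-style hinge, cf. (3)
    i, j, l, k = c
    return max(0.0, 1.0 + sqdist(X, i, j) - sqdist(X, l, k))

def ckl_loss(X, c, delta=0.1):                # CKL scale-invariant loss, cf. (4)
    i, j, l, k = c
    d_ij, d_lk = sqdist(X, i, j), sqdist(X, l, k)
    return -np.log((d_lk + delta) / (d_ij + d_lk + 2 * delta))

def ste_loss(X, c):                           # STE logistic loss, cf. (5)
    i, j, l, k = c
    d_ij, d_lk = sqdist(X, i, j), sqdist(X, l, k)
    return -np.log(np.exp(-d_ij) / (np.exp(-d_ij) + np.exp(-d_lk)))

def tste_loss(X, c, alpha=1.0):               # Student-t kernel loss, cf. (6)
    i, j, l, k = c
    kij = (1 + sqdist(X, i, j) / alpha) ** (-(alpha + 1) / 2)
    klk = (1 + sqdist(X, l, k) / alpha) ** (-(alpha + 1) / 2)
    return -np.log(kij / (kij + klk))
```

All four losses are small when the constraint is satisfied with a margin and grow as it is violated, which is the behavior the stochastic solver below exploits.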
Since (2) is an unconstrained optimization problem, both the SVD and the parameter $\lambda$ are avoided. Moreover, instead of batch methods like gradient descent, we use a fast stochastic gradient descent algorithm, namely SVRG, to solve the non-convex problem (2).
SVRG with Stabilized BB Step Size
A. Motivation
One open issue in stochastic optimization is how to choose an appropriate step size for SVRG in practice. The common practice is either to use a constant step size found by trial, a diminishing step size to enforce convergence, or a step size tuned empirically, which can be time-consuming. Recently, (?) proposed to use the Barzilai-Borwein (BB) method to automatically compute step sizes in SVRG for strongly convex objective functions:
$$\eta_s = \frac{1}{m} \cdot \frac{\|\tilde{x}_s - \tilde{x}_{s-1}\|^2}{(\tilde{x}_s - \tilde{x}_{s-1})^\top (\tilde{g}_s - \tilde{g}_{s-1})}, \qquad (7)$$
where $\tilde{x}_s$ is the $s$-th iterate of the outer loop of SVRG, $\tilde{g}_s = \nabla F(\tilde{x}_s)$, and $m$ is the number of inner iterations. However, if the objective function is non-convex, the denominator of (7) might be close to 0 or even negative, which makes the BB method fail. For example, Figure 1(a) shows the instability, observed in our simulations, of the absolute value of the original BB step size (called SBB₀ henceforth) on a non-convex problem. Due to this issue, the original BB step size might not be suitable for the non-convex ordinal embedding problem.
B. Stabilized BB step size
An intuitive way to overcome this flaw of the BB step size is to add another positive term to the absolute value of the denominator of the original BB step size, which leads to our stabilized Barzilai-Borwein (SBB) step size:
$$\eta_s = \frac{1}{m} \cdot \frac{\|\tilde{x}_s - \tilde{x}_{s-1}\|^2}{\big|(\tilde{x}_s - \tilde{x}_{s-1})^\top (\tilde{g}_s - \tilde{g}_{s-1})\big| + \epsilon \|\tilde{x}_s - \tilde{x}_{s-1}\|^2}, \qquad \epsilon \ge 0. \qquad (8)$$
Using the SBB step size, the SVRG-SBB algorithm is presented in Algorithm 1.
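The structure of the algorithm can be sketched as follows. This is a minimal Python sketch of SVRG with the SBB step size of (8); the function signature, the default values, and the Option-I-style outer update are our illustrative choices, not the paper's exact settings:

```python
import numpy as np

def svrg_sbb(grad_i, x0, n, m=None, eta0=0.01, eps=0.1, epochs=10, rng=None):
    """SVRG with the stabilized BB step size of Eq. (8).
    grad_i(x, i) returns the gradient of the i-th component function."""
    rng = np.random.default_rng(rng)
    m = m or n                                  # inner-loop length
    x_tilde = x0.copy()
    eta, prev_x, prev_g = eta0, None, None
    for s in range(epochs):
        # full gradient at the current snapshot
        g_tilde = np.mean([grad_i(x_tilde, i) for i in range(n)], axis=0)
        if prev_x is not None:                  # SBB step size, Eq. (8)
            dx, dg = x_tilde - prev_x, g_tilde - prev_g
            denom = abs(dx @ dg) + eps * (dx @ dx)
            if denom > 0:
                eta = (dx @ dx) / (m * denom)
        prev_x, prev_g = x_tilde.copy(), g_tilde.copy()
        x = x_tilde.copy()
        for _ in range(m):                      # variance-reduced inner loop
            i = rng.integers(n)
            v = grad_i(x, i) - grad_i(x_tilde, i) + g_tilde
            x -= eta * v
        x_tilde = x                             # Option-I-style snapshot update
    return x_tilde
```

The first epoch uses the initial step size `eta0`; from the second epoch on, the step size adapts automatically via (8).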
Actually, as shown by our later theorem (Theorem 1), if the Hessian of the objective function is non-singular and the magnitudes of its eigenvalues are lower bounded by some positive constant, then we can take $\epsilon = 0$; in this case, we call the resulting step size SBB₀ henceforth. Even if we have no information about the Hessian of the objective function in practice, the SBB step size with an $\epsilon > 0$ is simply a more conservative version of the SBB₀ step size.
From (8), if the gradient of $F$ is Lipschitz continuous with constant $L$, then the SBB step size can be bounded as follows:
$$\frac{1}{m(L + \epsilon)} \le \eta_s \le \frac{1}{m\epsilon}, \qquad (9)$$
where the lower bound follows from the Lipschitz continuity of $\nabla F$, and the upper bound is derived directly from the specific form of (8). If, in addition, the Hessian is non-singular and the magnitudes of its eigenvalues have a lower bound $\mu > 0$, then the bound on the SBB step size becomes
$$\frac{1}{m(L + \epsilon)} \le \eta_s \le \frac{1}{m(\mu + \epsilon)}. \qquad (10)$$
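For completeness, the bounds in (9) follow from (8) in two lines; this is a sketch using the Cauchy-Schwarz inequality and the $L$-Lipschitz continuity of the gradient, with $\Delta x_s = \tilde{x}_s - \tilde{x}_{s-1}$ and $\Delta g_s = \tilde{g}_s - \tilde{g}_{s-1}$:

```latex
% Upper bound: drop the nonnegative inner-product term in the denominator.
\eta_s
  = \frac{\|\Delta x_s\|^2}{m\big(|\langle \Delta x_s, \Delta g_s\rangle|
      + \epsilon\|\Delta x_s\|^2\big)}
  \le \frac{\|\Delta x_s\|^2}{m\,\epsilon \|\Delta x_s\|^2}
  = \frac{1}{m\epsilon}.
% Lower bound: Cauchy--Schwarz and Lipschitz continuity give
% |\langle \Delta x_s, \Delta g_s\rangle| \le \|\Delta x_s\|\,\|\Delta g_s\|
%   \le L\|\Delta x_s\|^2, hence
\eta_s
  \ge \frac{\|\Delta x_s\|^2}{m\big(L\|\Delta x_s\|^2
      + \epsilon\|\Delta x_s\|^2\big)}
  = \frac{1}{m(L+\epsilon)}.
```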
As shown in Figure 1(b), the SBB step size with a positive $\epsilon$ remains stable in cases where the SBB₀ step size is unstable and varies dramatically. Moreover, the SBB step size usually changes significantly only during the first several epochs and then quickly becomes very stable. This is mainly because an epoch of SVRG-SBB contains many iterations, so the algorithm may already be close to a stationary point after only one epoch; starting from the second epoch, the SBB step sizes are then very close to the inverse of the sum of the local curvature of the objective function and the parameter $\epsilon$.
C. Convergence Results
In this subsection, we establish the convergence rate of SVRG-SBB, as stated in the following theorem.
Theorem 1.
Let $\{\tilde{x}_s\}$ be a sequence generated by Algorithm 1. Suppose that $F$ is smooth, $\nabla F$ is Lipschitz continuous with Lipschitz constant $L$, and $F$ is bounded from below. For any $\epsilon > 0$, if
(11) 
then for the output $x_{\mathrm{out}}$ of Algorithm 1, we have
$$\mathbb{E}\big[\|\nabla F(x_{\mathrm{out}})\|^2\big] \le \frac{F(\tilde{x}_0) - F(x^*)}{T\,\gamma_*}, \qquad (12)$$
where $x^*$ is an optimal solution of (2), $T$ is the total number of iterations, $\gamma_*$ is some positive constant satisfying condition (11), and $\{\eta_s\}$ are the SBB step sizes specified in (8).
Theorem 1 is an adaptation of (?, Theorem 2), noting that the SBB step size specified in (8) satisfies (9). The proof of this theorem is presented in the supplementary material. Theorem 1 gives a non-asymptotic rate of convergence of Algorithm 1 in the sense of convergence to a stationary point. Similar convergence rates of SVRG under different settings have also been shown in (?; ?).
Note that the Lipschitz differentiability of the objective function is crucial for establishing the convergence rate of SVRG-SBB in Theorem 1. In the following, we give a lemma showing that some of the aforementioned objective functions, namely (4), (5), and (6), for the ordinal embedding problem are Lipschitz differentiable. Proof details are provided in the supplementary material.
Experiments
In this section, we conduct a series of simulations and real-world data experiments to demonstrate the effectiveness of the proposed algorithm. Three models, GNMDS, STE, and TSTE, are taken into consideration.
A. Simulations
GNMDS  

min  mean  max  std  
cvx  1.6140  1.7881  1.9390  0.0844 
ncvx Batch  4.5070  4.9372  5.2910  0.1857 
ncvx SGD  5.3070  5.5882  5.8970  0.1216 
ncvx SVRG  2.3020  2.8919  3.5280  0.2911 
ncvx SVRG-SBB₀  0.3500  0.4347  0.5340  0.0367 
ncvx SVRG-SBBε  0.3570  0.4858  0.6070  0.0621 
STE  
min  mean  max  std  
cvx  3.6740  3.9442  4.1870  0.1709 
ncvx Batch  2.4610  2.6326  2.9110  0.1162 
ncvx SGD  1.5000  2.1312  2.7190  0.2740 
ncvx SVRG  1.9930  2.4068  2.8350  0.1935 
ncvx SVRG-SBB₀  0.5000  0.6052  0.6980  0.0660 
ncvx SVRG-SBBε  0.4510  0.5773  0.6780  0.0515 
TSTE  
min  mean  max  std  
ncvx Batch  3.9380  4.1815  4.4790  0.1146 
ncvx SGD  6.0410  8.2870  9.4770  0.6863 
ncvx SVRG  1.6090  1.9250  2.3470  0.1807 
ncvx SVRG-SBB₀  0.4580  0.7906  1.2480  0.1969 
ncvx SVRG-SBBε  0.3800  0.4726  0.5470  0.0420 
We start with a small-scale synthetic experiment to show how the methods perform in an idealized setting that provides sufficient ordinal information in the noiseless case.
Settings. The synthesized dataset consists of points drawn independently from a multivariate Gaussian distribution $\mathcal{N}(0, I)$, where $I$ is the identity matrix. The possible similarity triplet comparisons are generated based on the Euclidean distances between the points. As (?) proved that the Gram matrix can be recovered from a sufficient number of triplets, we randomly choose a subset of triplets as the training set and keep the rest as the test set. The regularization parameter and step size settings for the convex formulations follow the default settings of the STE/TSTE implementation^{1}^{1}1http://homepage.tudelft.nl/19j49/ste/Stochastic_Triplet_Embedding.html, so we do not choose the step size by line search or the halving heuristic for the convex formulations. The embedding dimension is simply set equal to the ground-truth dimension, because results for different embedding dimensions have already been discussed in the original papers on GNMDS, STE, and TSTE.
Evaluation Metrics. The metrics used to evaluate the various algorithms are the generalization error and the running time. As the embedding learned from a partial set of triplet comparisons should generalize to unknown triplets, the percentage of held-out triplets that are satisfied by the embedding is used as the main quality metric. The running time of an algorithm is measured while its training error remains above a given threshold.
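The generalization error above is simply the fraction of violated held-out comparisons. A minimal sketch, with our own helper name, for quadruplet-style comparisons:

```python
import numpy as np

def generalization_error(X, heldout_quads):
    """Fraction of held-out comparisons (i, j, l, k) violated by the
    embedding X, i.e. where pair (i, j) is NOT closer than pair (l, k)."""
    violated = 0
    for i, j, l, k in heldout_quads:
        d_ij = np.sum((X[i] - X[j]) ** 2)
        d_lk = np.sum((X[l] - X[k]) ** 2)
        violated += d_ij >= d_lk
    return violated / len(heldout_quads)
```

The satisfied percentage reported in the tables is one minus this quantity.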
Competitors. We evaluate both the convex and non-convex formulations of the three objective functions (i.e., GNMDS, STE, and TSTE). We set two baselines: (i) the convex objective functions, whose results are denoted by “cvx”, and (ii) the non-convex objective functions solved by batch gradient descent, denoted by “ncvx batch”. We compare the performance of SVRG-SBB with SGD, SVRG with a fixed step size (called SVRG for short henceforth), and the batch gradient descent methods.
As SVRG and its variant SVRG-SBB evaluate multiple (sub)gradients in each epoch, the batch and SGD solutions are evaluated with the same number of (sub)gradient computations as SVRG. In Figure 2, the x-axis is the computational cost, measured as the number of gradient evaluations divided by the total number of triplet constraints. The generalization error is reported over 50 trials with different random initializations. For each epoch, the median generalization error over the 50 trials is plotted with a [0.25, 0.75] confidence interval. The experimental results are shown in Figure 2 and Table 1.
Results.
From Figure 2, the following phenomena can be observed.
First, SVRG-SBB₀ is unstable during the first several epochs for all three models and later becomes very stable. The eventual performance of SVRG-SBB₀ and that of SVRG-SBBε are almost the same in all three cases.
Second, compared to the batch methods, all the stochastic methods, including SGD, SVRG, and SVRG-SBB (with $\epsilon = 0$ or $\epsilon > 0$), converge fast during the first several epochs and quickly reach admissible results with relatively small generalization error. This is one of our main motivations for using stochastic methods. In particular, for all three models, SVRG-SBB outperforms all the other methods in the sense that it not only converges fastest but also achieves almost the best generalization error. Moreover, the advantage of SVRG-SBB in terms of CPU time can also be observed from Table 1. Specifically, the speedup of SVRG-SBB over SVRG is about 4x for all three models.
Table 1 shows the running time achieved by SGD, SVRG-SBB, and batch gradient descent for the convex and non-convex objective functions. All computation is done using MATLAB R2016b on a desktop PC running 64-bit Windows with an Intel Xeon E3-1226 v3 CPU and DDR3 memory. It is easy to see that, for all objective functions, SVRG-SBB gains a speedup compared to the other methods. Besides, we note that the convex methods can be effective when $n$ is small, as the projection operator is then not the bottleneck of the convex algorithms.
MAP  Precision@40  MAP  Precision@40  MAP  Precision@40  
GNMDS  
cvx  0.2691  0.3840  0.2512  0.3686  0.2701  0.3883 
ncvx Batch  0.3357  0.4492  0.3791  0.4914  0.3835  0.4925 
ncvx SGD  0.3245  0.4379  0.3635  0.4772  0.3819  0.4931 
ncvx SVRG  0.3348  0.4490  0.3872  0.4974  0.3870  0.4965 
ncvx SVRG-SBB₀  0.3941  0.5040  0.3700  0.4836  0.3550  0.4689 
ncvx SVRG-SBBε  0.3363  0.4500  0.3887  0.4981  0.3873  0.4987 
STE  
cvx  0.2114  0.3275  0.1776  0.2889  0.1989  0.3190 
ncvx Batch  0.2340  0.3525  0.2252  0.3380  0.2297  0.3423 
ncvx SGD  0.3369  0.4491  0.2951  0.4125  0.2390  0.3488 
ncvx SVRG  0.3817  0.4927  0.3654  0.4804  0.3245  0.4395 
ncvx SVRG-SBB₀  0.3968  0.5059  0.3958  0.5054  0.3895  0.5002 
ncvx SVRG-SBBε  0.3940  0.5036  0.3921  0.5012  0.3896  0.4992 
TSTE  
ncvx Batch  0.2268  0.3470  0.2069  0.3201  0.2275  0.3447 
ncvx SGD  0.2602  0.3778  0.2279  0.3415  0.2402  0.3514 
ncvx SVRG  0.3481  0.4617  0.3160  0.4332  0.2493  0.3656 
ncvx SVRG-SBB₀  0.3900  0.4980  0.3917  0.5018  0.3914  0.5007 
ncvx SVRG-SBBε  0.3625  0.4719  0.3845  0.4936  0.3897  0.5013 
B. Image Retrieval on SUN397
Here, we apply the proposed SVRG-SBB algorithm to a real-world dataset, SUN397, which is commonly used for image retrieval tasks.
In this experiment, we wish to see how the learned representation characterizes the “relevance” of images within the same category and the “discrimination” between different image categories. Hence, we use the image representation obtained by ordinal embedding for image retrieval.
Settings. We evaluate the effectiveness of the proposed stochastic non-convex ordinal embedding method for visual search on the SUN397 dataset, which consists of around 108K images from 397 scene categories. Each image is represented by a feature vector extracted by principal component analysis (PCA) from Deep Convolutional Activation Features (?). We form the training set by randomly sampling images from a subset of categories, with the same number of images in each category. Only the training set is used for learning an ordinal embedding and a non-linear mapping from the original feature space to the embedding space. The non-linear mapping is used to predict the embedding of images that do not participate in the relative similarity comparisons; we obtain it via Regularized Least Squares with a radial basis function kernel. The test set consists of images randomly chosen from the same categories. We use the ground-truth category labels of the training images to generate triplet comparisons without any error, as in (?): if two images $x_i$ and $x_j$ are from the same category and image $x_k$ is from another category, the similarity between $x_i$ and $x_j$ is deemed larger than the similarity between $x_i$ and $x_k$, which is indicated by a triplet $(i, j, k)$. Errors are then synthesized to simulate human error in crowdsourcing: we randomly sample a fraction of the triplets and exchange the positions of $x_j$ and $x_k$ in each sampled triplet $(i, j, k)$.
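The label-based triplet generation with synthetic noise described above can be sketched as follows; this is an illustrative helper of ours, and the paper's exact sampling scheme may differ:

```python
import numpy as np

def triplets_from_labels(labels, num, noise=0.0, rng=None):
    """Generate triplets (i, j, k): i and j share a class label, k does not,
    so the pair (i, j) should be more similar than (i, k). A `noise`
    fraction of triplets has j and k swapped to mimic crowdsourcing errors."""
    rng = np.random.default_rng(rng)
    labels = np.asarray(labels)
    n = len(labels)
    triplets = []
    while len(triplets) < num:
        i = rng.integers(n)
        same = np.flatnonzero((labels == labels[i]) & (np.arange(n) != i))
        diff = np.flatnonzero(labels != labels[i])
        if len(same) == 0 or len(diff) == 0:
            continue                        # no valid partner for this anchor
        j, k = rng.choice(same), rng.choice(diff)
        if rng.random() < noise:
            j, k = k, j                     # simulated annotation error
        triplets.append((i, j, k))
    return triplets
```

Setting `noise` to the desired error ratio reproduces the corrupted supervision used in the experiments.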
Evaluation Metrics. To measure the effectiveness of the various ordinal embedding methods for visual search, we consider three evaluation metrics: precision at top-$K$ positions (Precision@K), recall at top-$K$ positions (Recall@K), and Mean Average Precision (MAP). Given the mapped features of the test images and a query image belonging to class $c$, we sort the other images according to the distances between their embeddings and the query's embedding in ascending order. True positives ($\mathrm{TP}$) are images correctly labeled as positives, i.e., images belonging to class $c$ that appear within the top $K$ of the ranked list. False positives ($\mathrm{FP}$) are negative examples incorrectly labeled as positives, i.e., images not belonging to class $c$ that appear within the top $K$. True negatives ($\mathrm{TN}$) are negatives correctly labeled as negatives, i.e., images not belonging to class $c$ that appear after the top $K$. Finally, false negatives ($\mathrm{FN}$) are positive examples incorrectly labeled as negatives, i.e., images belonging to class $c$ that appear after the top $K$. Precision@K and Recall@K are then defined as
$$\mathrm{Precision@}K = \frac{\mathrm{TP}}{\mathrm{TP} + \mathrm{FP}} \quad \text{and} \quad \mathrm{Recall@}K = \frac{\mathrm{TP}}{\mathrm{TP} + \mathrm{FN}}.$$
Precision and recall are single-valued metrics based on the whole ranking list of images determined by the Euclidean distances among the embeddings. It is desirable to also consider the order in which images from the same category are retrieved. By computing precision and recall at every position in the ranked sequence of images for a query, one can plot a precision-recall curve, showing precision $p(r)$ as a function of recall $r$. Average Precision (AP) computes the average value of $p(r)$ over the interval from $r = 0$ to $r = 1$, $\mathrm{AP} = \int_0^1 p(r)\,\mathrm{d}r$, which is the area under the precision-recall curve. This integral can be replaced by a finite sum over every position $K$ in the ranked sequence: $\mathrm{AP} = \sum_K p(K)\,\Delta r(K)$, where $\Delta r(K)$ is the change in recall from position $K-1$ to $K$. The MAP used in this paper is the mean of AP over all queries.
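These retrieval metrics can be computed directly from the ranked class labels of the retrieved images; a minimal sketch, with our own helper names:

```python
import numpy as np

def precision_recall_at_k(ranked_labels, query_label, K):
    """ranked_labels: class labels of retrieved items, nearest first."""
    rel = np.asarray(ranked_labels) == query_label
    tp = int(rel[:K].sum())
    return tp / K, tp / int(rel.sum())   # Precision@K, Recall@K

def average_precision(ranked_labels, query_label):
    """AP as the finite sum  sum_K p(K) * delta_r(K)  over ranks K."""
    rel = (np.asarray(ranked_labels) == query_label).astype(float)
    total = rel.sum()
    if total == 0:
        return 0.0
    # precision at every rank K = 1..len(rel)
    precisions = np.cumsum(rel) / (np.arange(len(rel)) + 1)
    # delta_r(K) = rel[K] / total, so the sum reduces to this average
    return float((precisions * rel).sum() / total)
```

MAP is then the mean of `average_precision` over all test queries.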
Results. The experiment results are shown in Table 2 and Figure 3.
As shown in Table 2 and Figure 3, with the error ratio varying across the considered levels, we observe that the non-convex SVRG-SBB consistently achieves superior Precision@K, Recall@K, and MAP results compared with the other methods at the same number of gradient evaluations. The results for GNMDS illustrate that SVRG-SBB is more suitable for non-convex objective functions than the other methods. Therefore, SVRG-SBB has very promising potential in practice, because it generates appropriate step sizes automatically while running, and the results are robust. Moreover, under our setting with small noise, all the ordinal embedding methods achieve reasonable image retrieval results. This illustrates that high-quality relative similarity comparisons can be used to learn meaningful representations of massive data, thereby making it easier to extract useful information in other applications.
Conclusions
In this paper, we propose a stochastic non-convex framework for the ordinal embedding problem, together with a novel stochastic gradient descent algorithm, SVRG-SBB, for solving it. SVRG-SBB is a variant of the SVRG method incorporating the so-called stabilized BB (SBB) step size, a new, stable, and adaptive step size introduced in this paper. The main idea of the SBB step size is to add a positive term to the absolute value of the denominator of the original BB step size, so that SVRG-SBB overcomes the instability of the original BB step size on such non-convex problems. A series of simulations and real-world data experiments demonstrate the effectiveness of the proposed SVRG-SBB for the ordinal embedding problem; notably, SVRG-SBB outperforms most of the state-of-the-art methods in terms of both generalization error and computational cost. We also establish the convergence rate of SVRG-SBB in terms of convergence to a stationary point, which is comparable to the best existing convergence results for SVRG in the literature.
References
 [Agarwal et al. 2007] Agarwal, S.; Wills, J.; Cayton, L.; Lanckriet, G. R.; Kriegman, D. J.; and Belongie, S. 2007. Generalized non-metric multidimensional scaling. International Conference on Artificial Intelligence and Statistics 11–18.
 [Allen-Zhu and Hazan 2016] Allen-Zhu, Z., and Hazan, E. 2016. Variance reduction for faster non-convex optimization. International Conference on Machine Learning 699–707.
 [Arias-Castro 2017] Arias-Castro, E. 2017. Some theory for ordinal embedding. Bernoulli 23(3):1663–1693.
 [Ghadimi and Lan 2013] Ghadimi, S., and Lan, G. 2013. Stochastic first and zerothorder methods for nonconvex stochastic programming. SIAM Journal on Optimization 23(4):2341–2368.
 [Gong et al. 2014] Gong, Y.; Wang, L.; Guo, R.; and Lazebnik, S. 2014. Multiscale orderless pooling of deep convolutional activation features. European Conference on Computer Vision 392–407.
 [Heikinheimo and Ukkonen 2013] Heikinheimo, H., and Ukkonen, A. 2013. The crowdmedian algorithm. AAAI Conference on Human Computation and Crowdsourcing 69–77.
 [Jain, Jamieson, and Nowak 2016] Jain, L.; Jamieson, K. G.; and Nowak, R. 2016. Finite sample prediction and recovery bounds for ordinal embedding. Annual Conference on Neural Information Processing Systems 2711–2719.
 [Jamieson and Nowak 2011a] Jamieson, K., and Nowak, R. 2011a. Active ranking using pairwise comparisons. Annual Conference on Neural Information Processing Systems 2240–2248.
 [Jamieson and Nowak 2011b] Jamieson, K., and Nowak, R. 2011b. Lowdimensional embedding using adaptively selected ordinal data. Annual Allerton Conference on Communication, Control, and Computing 1077–1084.
 [Johnson and Zhang 2013] Johnson, R., and Zhang, T. 2013. Accelerating stochastic gradient descent using predictive variance reduction. Annual Conference on Neural Information Processing Systems 315–323.
 [Kleindessner and Luxburg 2014] Kleindessner, M., and Luxburg, U. 2014. Uniqueness of ordinal embedding. Conference on Learning Theory 40–67.
 [Kruskal 1964a] Kruskal, J. B. 1964a. Multidimensional scaling by optimizing goodness of fit to a nonmetric hypothesis. Psychometrika 29(1):1–27.
 [Kruskal 1964b] Kruskal, J. B. 1964b. Nonmetric multidimensional scaling: A numerical method. Psychometrika 29(2):115–129.
 [Lan et al. 2012] Lan, Y.; Guo, J.; Cheng, X.; and Liu, T.-Y. 2012. Statistical consistency of ranking methods in a rank-differentiable probability space. Annual Conference on Neural Information Processing Systems (NIPS) 1241–1249.
 [McFee and Lanckriet 2011] McFee, B., and Lanckriet, G. 2011. Learning multimodal similarity. Journal of Machine Learning Research (JMLR) 12:491–523.
 [Reddi et al. 2016] Reddi, S. J.; Hefny, A.; Sra, S.; Poczos, B.; and Smola, A. 2016. Stochastic variance reduction for nonconvex optimization. International Conference on Machine Learning 314–323.
 [Shaw and Jebara 2009] Shaw, B., and Jebara, T. 2009. Structure preserving embedding. International Conference on Machine Learning 937–944.
 [Shepard 1962a] Shepard, R. N. 1962a. The analysis of proximities: Multidimensional scaling with an unknown distance function. i. Psychometrika 27(2):125–140.
 [Shepard 1962b] Shepard, R. N. 1962b. The analysis of proximities: Multidimensional scaling with an unknown distance function. ii. Psychometrika 27(3):219–246.
 [Song et al. 2015] Song, D.; Liu, W.; Ji, R.; Meyer, D. A.; and Smith, J. R. 2015. Top rank supervised binary coding for visual search. IEEE International Conference on Computer Vision 1922–1930.
 [Tamuz et al. 2011] Tamuz, O.; Liu, C.; Shamir, O.; Kalai, A.; and Belongie, S. 2011. Adaptively learning the crowd kernel. International Conference on Machine Learning 673–680.
 [Tan et al. 2016] Tan, C.; Ma, S.; Dai, Y.-H.; and Qian, Y. 2016. Barzilai-Borwein step size for stochastic gradient descent. Annual Conference on Neural Information Processing Systems 685–693.
 [Terada and Luxburg 2014] Terada, Y., and Luxburg, U. 2014. Local ordinal embedding. International Conference on Machine Learning 847–855.
 [van der Maaten and Weinberger 2012] van der Maaten, L., and Weinberger, K. 2012. Stochastic triplet embedding. IEEE International Workshop on Machine Learning for Signal Processing 1–6.
 [Wah et al. 2014] Wah, C.; Horn, G. V.; Branson, S.; Maji, S.; Perona, P.; and Belongie, S. 2014. Similarity comparisons for interactive fine-grained categorization. IEEE Conference on Computer Vision and Pattern Recognition 859–866.
 [Wilber et al. 2015] Wilber, M.; Kwak, I.; Kriegman, D.; and Belongie, S. 2015. Learning concept embeddings with combined human-machine expertise. IEEE International Conference on Computer Vision 981–989.
 [Wilber, Kwak, and Belongie 2014] Wilber, M.; Kwak, S.; and Belongie, S. 2014. Cost-effective HITs for relative similarity comparisons. AAAI Conference on Human Computation and Crowdsourcing 227–233.
Supplementary Materials for
"Stochastic Nonconvex Ordinal Embedding with Stabilized Barzilai-Borwein Step Size"
Proof of Lemma 1
Proof.
Let
(13) 
where , and
(14) 
where is the identity matrix.
The first-order gradient of CKL is
(15)  
and the second-order Hessian matrix of CKL is
(16)  
As is bounded, , and is bounded, any element of is bounded, and has bounded eigenvalues. By the definition of Lipschitz continuity, the loss function of CKL (4) has a Lipschitz continuous gradient with bounded .
(17) 
(18)  
The first-order gradient of STE (5) is
(19) 
and the second-order Hessian matrix of STE (5) is
(20)  
All elements of are bounded, so the eigenvalues of are bounded. By the definition of Lipschitz continuity, (5) has a Lipschitz continuous gradient with bounded .
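Since the CKL and STE displays are omitted in this extraction, a numerical sanity check can still illustrate the claim. The sketch below assumes the standard STE triplet probability p = exp(-d_ij^2) / (exp(-d_ij^2) + exp(-d_ik^2)) from van der Maaten and Weinberger (2012) (an assumption, as the paper's eq. (5) is not reproduced here) and estimates the Lipschitz ratio of the numerical gradient over random pairs of embeddings; on a bounded region the ratio stays bounded, consistent with the lemma.

```python
import numpy as np

def ste_loss(X, triplets):
    # X: (n, d) embedding; a triplet (i, j, k) means "i is closer to j than to k".
    # -log p with p = exp(-d_ij^2) / (exp(-d_ij^2) + exp(-d_ik^2)),
    # written in log-sum-exp form for numerical stability.
    loss = 0.0
    for i, j, k in triplets:
        dij = np.sum((X[i] - X[j]) ** 2)
        dik = np.sum((X[i] - X[k]) ** 2)
        loss += np.logaddexp(0.0, dij - dik)
    return loss / len(triplets)

def num_grad(f, X, h=1e-6):
    # central finite-difference gradient, entry by entry
    g = np.zeros_like(X)
    for idx in np.ndindex(X.shape):
        E = np.zeros_like(X)
        E[idx] = h
        g[idx] = (f(X + E) - f(X - E)) / (2 * h)
    return g

rng = np.random.default_rng(0)
X = rng.standard_normal((5, 2))
triplets = [(0, 1, 2), (1, 2, 3), (3, 4, 0)]
f = lambda Z: ste_loss(Z, triplets)

# empirical Lipschitz ratio ||grad f(X) - grad f(Y)|| / ||X - Y||
ratios = []
for _ in range(20):
    Y = X + 0.1 * rng.standard_normal(X.shape)
    ratios.append(np.linalg.norm(num_grad(f, X) - num_grad(f, Y))
                  / np.linalg.norm(X - Y))
print(max(ratios))
```

The observed maximum ratio remains finite over the sampled region, which is all the lemma asserts: a bounded Lipschitz constant on bounded embeddings.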
The first-order gradient of GNMDS (3) is
(21) 
and the second-order Hessian matrix of GNMDS (3) is
(22) 
If for all , is continuous on and the Hessian matrix has bounded eigenvalues, so GNMDS has a Lipschitz continuous gradient on some quadruple set as . As the special case in which is satisfied is exceedingly rare, we split the embedding into pieces and employ SGD and SVRG to optimize the objective function of GNMDS (3). The empirical results are shown in the Experiment section.
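The non-smoothness can be seen directly on the hinge shape. The form max(0, delta + d_ij^2 - d_ik^2) is the common GNMDS choice (the exact display (3) is omitted in this extraction, so this form is an assumption); its derivative jumps at the kink, which is why a Lipschitz gradient only holds away from the set where the hinge argument vanishes:

```python
# Hedged sketch: a GNMDS-style hinge max(0, delta + t), with
# t = d_ij^2 - d_ik^2, has a derivative that jumps at t = -delta.
def hinge(t, delta=1.0):
    return max(0.0, delta + t)

def hinge_grad(t, delta=1.0):
    # one-sided (sub)gradient: 1 on the active side, 0 otherwise
    return 1.0 if delta + t > 0 else 0.0

# the gradient is discontinuous across the kink at t = -delta
left = hinge_grad(-1.0 - 1e-9)   # inactive side
right = hinge_grad(-1.0 + 1e-9)  # active side
```

Because the kink set has measure zero for generic embeddings, stochastic (sub)gradient updates remain well defined almost everywhere, which is the rationale for applying SGD and SVRG to GNMDS despite the missing global smoothness.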
Note that , and the first-order gradient of TSTE is (17); the second-order Hessian matrix of TSTE is (18). The boundedness of the eigenvalues of the Hessian implies that the TSTE loss function (6) has a Lipschitz continuous gradient with bounded .
We focus on a special case of quadruple comparisons with and in the Experiment section. To verify the Lipschitz continuity of the gradients of the ordinal embedding objective functions with as , we introduce the matrix as
(23) 
By the chain rule, we have
(24)  
where . As is a constant matrix and is bounded, all elements of the Hessian matrix are bounded, so the eigenvalues of are also bounded. The ordinal embedding functions of CKL, STE, and TSTE with triplet comparisons have Lipschitz continuous gradients with bounded . ∎
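Since Algorithm 1 itself is not reproduced in this supplement, the following Python sketch illustrates its assumed structure: standard SVRG inner/outer loops (variance-reduced updates around a snapshot) with a stabilized BB step size of the form eta_s = ||dx||^2 / (m (|<dx, dg>| + eps ||dx||^2)), modeled on the SVRG-BB step of Tan et al. (2016) with an added eps-term. The exact SBB formula, constants, and the helper name `svrg_sbb` here are assumptions for illustration, not the paper's display.

```python
import numpy as np

def svrg_sbb(grad_i, x0, n, m=50, epochs=30, eps=1e-4, eta0=0.01, seed=0):
    """Sketch of SVRG with a stabilized Barzilai-Borwein (SBB) step size.

    grad_i(x, i): stochastic gradient of the i-th component function at x.
    The SBB formula below is an assumed form (BB step from snapshot
    iterates/full gradients, divided by m, with an eps-stabilized
    denominator), not the paper's exact display.
    """
    rng = np.random.default_rng(seed)
    x_tilde, eta = x0.astype(float).copy(), eta0
    x_prev = full_prev = None
    for _ in range(epochs):
        # full gradient at the snapshot point
        full = np.mean([grad_i(x_tilde, i) for i in range(n)], axis=0)
        if full_prev is not None:
            dx, dg = x_tilde - x_prev, full - full_prev
            if (dx @ dx) > 0:
                # stabilized BB step size (assumed form)
                eta = (dx @ dx) / (m * (abs(dx @ dg) + eps * (dx @ dx)))
        x_prev, full_prev = x_tilde.copy(), full
        x = x_tilde.copy()
        for _ in range(m):
            i = rng.integers(n)
            # variance-reduced stochastic gradient
            v = grad_i(x, i) - grad_i(x_tilde, i) + full
            x -= eta * v
        x_tilde = x
    return x_tilde

# toy usage: least-squares components f_i(x) = 0.5 * (a_i @ x - b_i)^2
rng = np.random.default_rng(1)
A = rng.standard_normal((20, 3))
x_star = np.array([1.0, -2.0, 0.5])
b = A @ x_star
x_hat = svrg_sbb(lambda x, i: A[i] * (A[i] @ x - b[i]), np.zeros(3), n=20)
```

On this strongly convex toy problem the eps-term is negligible and the step size behaves like the plain SVRG-BB step; the stabilization matters in the nonconvex setting, where the curvature estimate <dx, dg> can be near zero or negative.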
Proof of Theorem 1
In this section, we prove our main theorem, i.e., Theorem 1. Before doing this, we need some lemmas.
Lemma 2.
Given a positive integer , for any , the following holds
Proof.
Note that
where the last inequality holds for for any . Thus, we get the first inequality. Let for any . It is easy to check that for any . Thus we get the second inequality. ∎
In the following, we provide a key lemma that shows the convergence behavior of the inner loop of the proposed algorithm. Before presenting it, we define several constants and sequences. For any , let
(25)  
(26) 
Let be a nonnegative sequence satisfying and for ,
It is obvious that is monotonically decreasing as increases, and
Then we define
(27) 
for any and .
With the help of the sequences defined above, we present a key lemma as follows.
Lemma 3 (Convergence behavior of inner loop).
Let be the sequence generated by Algorithm 1 at the -th inner loop, . If the following conditions hold
(28)  
(29) 
then
where ,
Proof.
This lemma is a special case of (?, Lemma 1) with some specific choices of . Thus, we only need to check that the defined is positive for any . To do this, we first check that
(30) 
under the condition (29) of this lemma.