# Stochastic Non-convex Ordinal Embedding with Stabilized Barzilai-Borwein Step Size

Ke Ma, Jinshan Zeng, Jiechao Xiong, Qianqian Xu, Xiaochun Cao, Wei Liu, Yuan Yao
State Key Laboratory of Information Security, Institute of Information Engineering, Chinese Academy of Sciences
School of Cyber Security, University of Chinese Academy of Sciences
School of Computer Information Engineering, Jiangxi Normal University
Department of Mathematics, Hong Kong University of Science and Technology, Tencent AI Lab
{make, xuqianqian, caoxiaochun}@iie.ac.cn, jsh.zeng@gmail.com
jcxiong@tencent.com, wliu@ee.columbia.edu, yuany@ust.hk
The corresponding authors.
###### Abstract

Learning representations from relative similarity comparisons, often called ordinal embedding, has gained increasing attention in recent years. Most existing methods are batch methods designed mainly based on convex optimization, e.g., the projected gradient descent method. However, they are generally time-consuming because singular value decomposition (SVD) is commonly adopted during the update, especially when the data size is very large. To overcome this challenge, we propose a stochastic algorithm called SVRG-SBB, which has the following features: (a) it is SVD-free via dropping convexity, with good scalability by the use of a stochastic algorithm, i.e., stochastic variance reduced gradient (SVRG); and (b) it adopts an adaptive step size via a new stabilized Barzilai-Borwein (SBB) method, as the original version designed for convex problems might fail for the considered stochastic non-convex optimization problem. Moreover, we show that the proposed algorithm converges to a stationary point at a rate of $O(1/T)$ in our setting, where $T$ is the total number of iterations. Numerous simulations and real-world data experiments are conducted to show the effectiveness of the proposed algorithm via comparison with the state-of-the-art methods; in particular, it achieves much lower computational cost with good prediction performance.

## Introduction

Ordinal embedding aims to learn representations of data objects as points in a low-dimensional space such that the distances among these points agree with a set of relative similarity comparisons as well as possible. Relative comparisons are often collected from workers who are asked to answer the following question:

“Is the similarity between objects $o_i$ and $o_j$ larger than the similarity between $o_l$ and $o_k$?”

The feedback of individuals provides us a set of quadruplets, i.e., $\{(i,j,l,k)\}$, which can be treated as the supervision for ordinal embedding. Without prior knowledge, the relative similarity comparisons always involve all objects, and the number of potential quadruplets could be $O(n^4)$.

The ordinal embedding problem was first studied by (Shepard 1962a; 1962b; Kruskal 1964a; 1964b) in the psychometric society. In recent years, it has drawn a lot of attention in machine learning, statistical ranking, artificial intelligence, information retrieval, and computer vision, etc.

One of the typical methods for the ordinal embedding problem is the well-known Generalized Non-Metric Multidimensional Scaling (GNMDS) (Agarwal et al. 2007), which aims at finding a low-rank Gram (or kernel) matrix in Euclidean space such that the pairwise distances between the embeddings of objects in a Reproducing Kernel Hilbert Space (RKHS) satisfy the relative similarity comparisons. As GNMDS uses the hinge loss to model the relative similarity between objects, it neglects the information provided by satisfied constraints when finding the underlying structure in the low-dimensional space. To alleviate this issue, Crowd Kernel Learning (CKL) was proposed by (Tamuz et al. 2011) via employing a scale-invariant loss function. However, the objective function used in CKL only considers the constraints that are strongly violated. Later, (van der Maaten and Weinberger 2012) proposed Stochastic Triplet Embedding (STE), which jointly penalizes the violated constraints and rewards the satisfied constraints via the logistic loss function. Note that the aforementioned three typical methods are based on convex formulations, and employ the projected gradient descent method together with singular value decomposition (SVD) to obtain the embedding. However, the huge number of comparisons and the computational complexity of SVD significantly inhibit their use in large-scale and online applications. Structure Preserving Embedding (SPE) (Shaw and Jebara 2009) and Local Ordinal Embedding (LOE) (Terada and Luxburg 2014) embed unweighted nearest neighbor graphs into Euclidean spaces with convex and non-convex objective functions, respectively. A nearest neighbor adjacency matrix can be transformed into ordinal constraints, but such a matrix is not readily available in scenarios that only involve relative comparisons. With this limitation, SPE and LOE are not suitable for ordinal embedding via quadruplet or triplet comparisons.

In contrast to the kernel-learning or convex formulations of ordinal embedding, the aforementioned methods have analogous non-convex counterparts. The non-convex formulations directly learn the embedding instead of the Gram matrix. Batch gradient descent is not suitable for solving these large-scale ordinal embedding problems because of the expense of full gradients in each iteration. Stochastic gradient descent (SGD) is a common technique in this situation, as it exploits stochastic gradients for cheap computation per iteration. In (Ghadimi and Lan 2013), an $O(1/\sqrt{T})$ convergence rate of SGD for stochastic non-convex optimization problems was established, in the sense of convergence to a stationary point, where $T$ is the total number of iterations. As SGD suffers from slow convergence due to the inherent variance of the stochastic gradients, the stochastic variance reduced gradient (SVRG) method was proposed in (Johnson and Zhang 2013) to accelerate SGD. For strongly convex functions, the linear convergence of SVRG with Option-II was established in (Johnson and Zhang 2013), and later the linear convergence rates of SVRG with Option-I and of SVRG incorporating the Barzilai-Borwein (BB) step size were shown in (Tan et al. 2016). In the non-convex case, the convergence rates of SVRG in the sense of convergence to a stationary point were shown in (Reddi et al. 2016; Allen-Zhu and Hazan 2016) under certain conditions.

Although the BB step size has been incorporated into SVRG and its effectiveness has been shown in (Tan et al. 2016) for the strongly convex case, it might not work when applied to some stochastic non-convex optimization problems. Actually, in our later simulations, we found that the absolute value of the original BB step size is unstable when applied to the stochastic non-convex ordinal embedding problem studied in this paper (see Figure 1(a)): it varies dramatically with respect to the epoch number. This phenomenon occurs mainly because, without strong convexity, the denominator of the BB step size might be very close to 0, and thus the BB step size blows up. This motivates us to investigate new stable and adaptive step size strategies for SVRG when applied to the stochastic non-convex ordinal embedding problem.

In this paper, we introduce a new adaptive step size strategy called the stabilized BB (SBB) step size, obtained by adding another positive term to the absolute value of the denominator of the original BB step size to overcome its instability, and then propose a new stochastic algorithm called SVRG-SBB, which incorporates the SBB step size for fast solving of the considered non-convex ordinal embedding model. In summary, our main contributions are as follows:

• We propose a non-convex framework for the ordinal embedding problem via considering the optimization problem with respect to the original embedding variable but not its Gram matrix. By exploiting this idea, we get rid of the positive semi-definite (PSD) constraint on the Gram matrix, and thus, our proposed algorithm is SVD-free and has good scalability.

• The introduced SBB step size can overcome the instability of the original BB step size when the latter does not work. More importantly, the proposed SVRG-SBB algorithm outperforms most of the state-of-the-art methods, as shown by numerous simulations and real-world data experiments, in the sense that SVRG-SBB often has better generalization performance and significantly reduces the computational cost.

• We establish an $O(1/T)$ convergence rate of SVRG-SBB in the sense of convergence to a stationary point, where $T$ is the total number of iterations. Such a convergence result is comparable with the best existing convergence results in the literature.

## Stochastic Ordinal Embedding

### A. Problem Description

There is a set of $n$ objects $\{o_1, o_2, \dots, o_n\}$ in an abstract space $\mathcal{O}$. We assume that a certain but unknown dissimilarity function $\xi$ assigns the dissimilarity value $\xi_{ij}$ to each pair of objects $(o_i, o_j)$. With the dissimilarity function $\xi$, the ordinal constraints come from a set $\mathcal{P}$, where

$$\mathcal{P} = \left\{(i,j,l,k) \;\middle|\; \text{there exist } o_i, o_j, o_l, o_k \text{ satisfying } \xi_{ij} < \xi_{lk}\right\}$$

and the indices range over $\{1,\dots,n\}$. Our goal is to obtain the representations $\{x_1,\dots,x_n\} \subset \mathbb{R}^d$ of the objects in a Euclidean space, where $d$ is the desired embedding dimension. The embedding $X = [x_1,\dots,x_n]^T \in \mathbb{R}^{n\times d}$ should preserve the ordinal constraints in $\mathcal{P}$ as much as possible, which means

$$(i,j,l,k) \in \mathcal{P} \;\Leftrightarrow\; \xi_{ij} < \xi_{lk} \;\Leftrightarrow\; d^2_{ij}(X) < d^2_{lk}(X),$$

where $d^2_{ij}(X) = \|x_i - x_j\|^2$ is the squared Euclidean distance between $x_i$ and $x_j$, and $x_i$ is the $i$-th row of $X$.
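As an illustration, the construction of $\mathcal{P}$ from a known embedding can be sketched as follows; the function name and the uniform sampling scheme are our own illustrative choices, not part of the paper:

```python
import numpy as np

def sample_ordinal_constraints(X, num_samples, seed=None):
    """Sample quadruplets (i, j, l, k) with d_ij^2(X) < d_lk^2(X).

    Illustrative helper: quadruplets are drawn uniformly at random and
    oriented so that the first pair is the closer one.
    """
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    quads = []
    while len(quads) < num_samples:
        i, j, l, k = rng.choice(n, size=4, replace=False)
        d_ij = np.sum((X[i] - X[j]) ** 2)
        d_lk = np.sum((X[l] - X[k]) ** 2)
        if d_ij < d_lk:
            quads.append((i, j, l, k))
        elif d_lk < d_ij:
            quads.append((l, k, i, j))
        # exact ties are simply skipped
    return quads
```

In practice the comparisons come from crowd workers rather than from a known $X$; this sketch only makes the definition of $\mathcal{P}$ concrete.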

Let $D = [d^2_{ij}(X)] \in \mathbb{R}^{n\times n}$ be the squared distance matrix of $X$. There are some existing methods for recovering $X$ given ordinal constraints on the distance matrix $D$. It is known that $D$ can be determined by the Gram matrix $G = XX^T$ via

$$D = \mathrm{diag}(G)\cdot \mathbf{1}^T - 2G + \mathbf{1}\cdot \mathrm{diag}(G)^T,$$

where $\mathrm{diag}(G)$ is the column vector composed of the diagonal elements of $G$ and $\mathbf{1} = [1,\dots,1]^T \in \mathbb{R}^n$. As $G = XX^T$, it always holds that $G \succeq 0$, so these methods (Agarwal et al. 2007; Tamuz et al. 2011; van der Maaten and Weinberger 2012) can be generalized as a semidefinite programming (SDP) problem with a low-rank constraint,
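The identity above can be checked numerically; the following is a small sanity-check sketch, not code from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((6, 2))          # 6 points in R^2
G = X @ X.T                              # Gram matrix G = X X^T
ones = np.ones((6, 1))
g = np.diag(G).reshape(-1, 1)            # column vector diag(G)
D = g @ ones.T - 2 * G + ones @ g.T      # D = diag(G) 1^T - 2 G + 1 diag(G)^T
# pairwise squared Euclidean distances, computed directly
D_ref = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=2)
print(np.allclose(D, D_ref))             # → True
```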

$$\min_{G\in\mathbb{R}^{n\times n}} \; L(G) + \lambda\cdot \mathrm{tr}(G) \quad \text{s.t.} \quad G \succeq 0, \qquad (1)$$

where $L(G) = \sum_{p\in\mathcal{P}} l_p(G)$ is a convex function of $G$ whose components $l_p$ satisfy

$$l_p(G):\begin{cases} > 0, & d^2_{ij}(X) > d^2_{lk}(X), \\ \le 0, & \text{otherwise},\end{cases}$$

and $\mathrm{tr}(G)$ is the trace of the matrix $G$. To obtain the embedding $X$, projected gradient descent is performed. The basic idea of the projected gradient descent method is as follows: a batch gradient descent step over all $p \in \mathcal{P}$ is first used to update the Gram matrix,

$$G'_t = G_{t-1} - \eta_t\left(\nabla L(G_{t-1}) + \lambda I\right),$$

where $t$ denotes the current iteration and $\eta_t$ is the step size; then $G'_t$ is projected onto the positive semi-definite (PSD) cone to obtain $G_t$; finally, once the iterates converge, the embedding $X$ is obtained by projecting onto the subspace spanned by the $d$ largest eigenvectors of the learned Gram matrix via SVD.
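A minimal sketch of these two projection steps (PSD projection and eigenvector-based embedding recovery), using an eigendecomposition in place of a full SVD since the matrices are symmetric; the function names are our own:

```python
import numpy as np

def project_psd(G):
    """Project a symmetric matrix onto the PSD cone by zeroing negative eigenvalues."""
    w, V = np.linalg.eigh((G + G.T) / 2)        # symmetrize for numerical safety
    return (V * np.clip(w, 0.0, None)) @ V.T

def embedding_from_gram(G, d):
    """Recover an n x d embedding from the top-d eigenpairs of a PSD Gram matrix."""
    w, V = np.linalg.eigh((G + G.T) / 2)
    top = np.argsort(w)[::-1][:d]               # indices of the d largest eigenvalues
    return V[:, top] * np.sqrt(np.clip(w[top], 0.0, None))
```

For a rank-$d$ PSD matrix $G$, `embedding_from_gram(G, d)` returns an $X$ with $XX^T = G$, unique up to rotation.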

### B. Stochastic Non-convex Ordinal Embedding

Although the SDP (1) is a convex optimization problem, this approach has some disadvantages: (i) the projection onto the PSD cone, which is performed by an expensive SVD due to the absence of any prior knowledge on the structure of $G$, is the computational bottleneck of the optimization; and (ii) the desired dimension of the embedding is $d$, and we hope the learned Gram matrix satisfies $\mathrm{rank}(G) = d$. If $\mathrm{rank}(G) \gg d$, the degrees of freedom of $G$ are much larger than those of $X$, which leads to over-fitting. In that case, even though the learned $G$ is a global optimal solution of (1), the subspace spanned by its $d$ largest eigenvectors will produce a less accurate embedding. We can tune the regularization parameter $\lambda$ to force $G$ to be low-rank, with cross-validation being the most widely used technique, but this incurs extra computational cost. In summary, the projection and the parameter tuning render gradient descent methods computationally prohibitive for learning the embedding with ordinal information $\mathcal{P}$. To overcome these challenges, we exploit non-convex and stochastic optimization techniques for the ordinal embedding problem.

To avoid projecting the Gram matrix onto the PSD cone and tuning the parameter $\lambda$, we directly optimize over $X$ and propose the unconstrained optimization problem of learning the embedding,

$$\min_{X\in\mathbb{R}^{n\times d}} \; F(X) := \frac{1}{|\mathcal{P}|}\sum_{p\in\mathcal{P}} f_p(X), \qquad (2)$$

where $f_p(X)$ evaluates the violation of the ordinal constraint $p = (i,j,l,k)$ through

$$\triangle_p = d^2_{ij}(X) - d^2_{lk}(X), \quad p = (i,j,l,k),$$

and

$$f_p(X):\begin{cases} \le 0, & \triangle_p \le 0, \\ > 0, & \text{otherwise}.\end{cases}$$

The loss function $f_p$ can be chosen as the hinge loss (Agarwal et al. 2007)

$$f_p(X) = \max\{0,\, 1 + \triangle_p\}, \qquad (3)$$

the scale-invariant loss (Tamuz et al. 2011)

$$f_p(X) = -\log\frac{d^2_{lk}(X)+\delta}{d^2_{ij}(X)+d^2_{lk}(X)+2\delta}, \qquad (4)$$

where $\delta > 0$ is a scalar which overcomes the problem of degeneracy and preserves numerical stability; the logistic loss (van der Maaten and Weinberger 2012)

$$f_p(X) = -\log\frac{1}{1+\exp(\triangle_p)}, \qquad (5)$$

and, replacing the Gaussian kernel in (5) by the Student-$t$ kernel with degree $\alpha$, the t-STE loss (van der Maaten and Weinberger 2012)

$$f_p(X) = -\log\frac{\left(1+\frac{d^2_{ij}(X)}{\alpha}\right)^{-\frac{\alpha+1}{2}}}{\left(1+\frac{d^2_{ij}(X)}{\alpha}\right)^{-\frac{\alpha+1}{2}} + \left(1+\frac{d^2_{lk}(X)}{\alpha}\right)^{-\frac{\alpha+1}{2}}}. \qquad (6)$$

Since (2) is an unconstrained optimization problem, SVD and the parameter $\lambda$ are obviously avoided. Moreover, instead of batch methods like gradient descent, we use a fast stochastic gradient algorithm like SVRG to solve the non-convex problem (2).
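The four losses can be sketched for a single quadruplet $p=(i,j,l,k)$ as follows; the helper names are ours, and the CKL and t-STE forms follow our reading of (4) and (6), so sign conventions may differ from the authors' implementation:

```python
import numpy as np

def sq_dist(X, a, b):
    """Squared Euclidean distance between rows a and b of the embedding X."""
    return float(np.sum((X[a] - X[b]) ** 2))

def hinge_loss(X, p):                    # GNMDS, eq. (3)
    i, j, l, k = p
    return max(0.0, 1.0 + sq_dist(X, i, j) - sq_dist(X, l, k))

def ckl_loss(X, p, delta=0.1):           # CKL, our reading of eq. (4)
    i, j, l, k = p
    d_ij, d_lk = sq_dist(X, i, j), sq_dist(X, l, k)
    return float(-np.log((d_lk + delta) / (d_ij + d_lk + 2 * delta)))

def ste_loss(X, p):                      # STE, eq. (5)
    i, j, l, k = p
    return float(np.log1p(np.exp(sq_dist(X, i, j) - sq_dist(X, l, k))))

def tste_loss(X, p, alpha=1.0):          # t-STE, our reading of eq. (6)
    i, j, l, k = p
    k_ij = (1 + sq_dist(X, i, j) / alpha) ** (-(alpha + 1) / 2)
    k_lk = (1 + sq_dist(X, l, k) / alpha) ** (-(alpha + 1) / 2)
    return float(-np.log(k_ij / (k_ij + k_lk)))
```

Each loss is small when the constraint $d^2_{ij}(X) < d^2_{lk}(X)$ is satisfied and grows as it is violated.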

## SVRG with Stabilized BB Step Size

### A. Motivation

One open issue in stochastic optimization is how to choose an appropriate step size for SVRG in practice. The common practice is either to use a constant step size found by trial and error, a diminishing step size to enforce convergence, or a step size tuned empirically, which can be time consuming. Recently, (Tan et al. 2016) proposed to use the Barzilai-Borwein (BB) method to automatically compute step sizes in SVRG for strongly convex objective functions:

$$\eta_s = \frac{1}{m}\cdot\frac{\left\|\tilde{X}^s-\tilde{X}^{s-1}\right\|_F^2}{\mathrm{vec}\left(\tilde{X}^s-\tilde{X}^{s-1}\right)^T\mathrm{vec}\left(g_s-g_{s-1}\right)}, \qquad (7)$$

where $\tilde{X}^s$ is the $s$-th iterate of the outer loop of SVRG, $g_s = \nabla F(\tilde{X}^s)$, and $m$ is the number of inner iterations. However, if the objective function is non-convex, the denominator of (7) might be close to 0 or even negative, which makes the BB method fail. For example, Figure 1(a) shows the instability, observed in our simulations, of the absolute value of the original BB step size (called SBB$_0$ henceforth) on non-convex problems. Due to this issue, the original BB step size might not be suitable for the non-convex ordinal embedding problem.

### B. Stabilized BB step size

An intuitive way to overcome this flaw of the BB step size is to add another positive term to the absolute value of its denominator, which leads to our stabilized Barzilai-Borwein (SBB) step size:

$$\eta_s = \frac{1}{m}\cdot\frac{\left\|\tilde{X}^s-\tilde{X}^{s-1}\right\|_F^2}{\left|\mathrm{vec}\left(\tilde{X}^s-\tilde{X}^{s-1}\right)^T\mathrm{vec}\left(g_s-g_{s-1}\right)\right| + \epsilon\left\|\tilde{X}^s-\tilde{X}^{s-1}\right\|_F^2}, \quad \text{for some } \epsilon>0. \qquad (8)$$

By the use of SBB step size, the SVRG-SBB algorithm is presented in Algorithm 1.
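Since Algorithm 1 itself is not reproduced in this text, the following is a minimal sketch of SVRG (Option I) with the SBB step size (8); the interface (a per-component gradient oracle) and the default parameters are our own illustrative choices:

```python
import numpy as np

def svrg_sbb(grad_i, n_samples, x0, n_epochs=20, m=None, eta0=0.01, eps=0.5, seed=0):
    """SVRG with the stabilized BB (SBB) step size of eq. (8) -- a sketch.

    grad_i(x, i): gradient of the i-th component function f_i at x.
    eta0 is only used before two outer iterates are available.
    """
    rng = np.random.default_rng(seed)
    m = m or 2 * n_samples
    x_tilde = np.asarray(x0, dtype=float).copy()
    eta, x_prev, g_prev = eta0, None, None
    for _ in range(n_epochs):
        # full gradient at the snapshot point
        g = np.mean([grad_i(x_tilde, i) for i in range(n_samples)], axis=0)
        if x_prev is not None:
            dx = (x_tilde - x_prev).ravel()
            dg = (g - g_prev).ravel()
            if dx @ dx > 0:
                # SBB step size, eq. (8): |.| plus the eps-term in the denominator
                eta = (dx @ dx) / (m * (abs(dx @ dg) + eps * (dx @ dx)))
        x_prev, g_prev = x_tilde.copy(), g.copy()
        x = x_tilde.copy()
        for _ in range(m):
            i = int(rng.integers(n_samples))
            # variance-reduced stochastic gradient step
            x -= eta * (grad_i(x, i) - grad_i(x_tilde, i) + g)
        x_tilde = x  # Option I: take the last inner iterate
    return x_tilde
```

On a toy least-squares problem this converges to the minimizer without any manual step size tuning, which is the point of the SBB rule.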

Actually, as shown by our later theorem (i.e., Theorem 1), if the Hessian of the objective function is nonsingular and the magnitudes of its eigenvalues are lower bounded by some positive constant $\mu$, then we can take $\epsilon = 0$; in this case, we call the resulting step size SBB$_0$ henceforth. Even if we have no information on the Hessian of the objective function in practice, the SBB step size with an $\epsilon > 0$ is just a more conservative version of the SBB$_0$ step size.

From (8), if the gradient of $F$ is Lipschitz continuous with constant $L$, then the SBB step size can be bounded as follows:

$$\frac{1}{m(L+\epsilon)} \le \eta_s \le \frac{1}{m\epsilon}, \qquad (9)$$

where the lower bound follows from the $L$-Lipschitz continuity of $\nabla F$, and the upper bound is derived directly from the specific form of (8). If further $\nabla^2 F$ is nonsingular and the magnitudes of its eigenvalues have a lower bound $\mu > 0$, then the bound on the SBB$_0$ step size becomes

$$\frac{1}{mL} \le \eta_s \le \frac{1}{m\mu}. \qquad (10)$$

As shown in Figure 1(b), the SBB step size with a positive $\epsilon$ is more stable than the SBB$_0$ step size when the latter is unstable and varies dramatically. Moreover, the SBB step size usually changes significantly only during the initial several epochs, and then quickly becomes very stable. This is mainly because there are many iterations in one epoch of SVRG-SBB, and thus the algorithm might already be close to a stationary point after only one epoch; starting from the second epoch, the SBB step sizes might be very close to the inverse of the sum of the local curvature of the objective function and the parameter $\epsilon$.

### C. Convergence Results

In this subsection, we establish the convergence rate of SVRG-SBB as shown in the following theorem.

###### Theorem 1.

Let $\{\tilde{X}^s\}$ be the sequence generated by Algorithm 1. Suppose that $F$ is smooth, $\nabla F$ is Lipschitz continuous with Lipschitz constant $L$, and $F$ is bounded from below. For any $\epsilon > 0$, if

$$m > \max\left\{2L^2\left(1+\frac{2L}{\epsilon}\right),\; 1+\sqrt{1+\frac{8L^2}{\epsilon^2}}\right\}\cdot\epsilon^{-1}, \qquad (11)$$

then for the output $X_{\mathrm{out}}$ of Algorithm 1, we have

$$\mathbb{E}\left[\|\nabla F(X_{\mathrm{out}})\|^2\right] \le \frac{F(\tilde{X}^0) - F(X^*)}{T\cdot\gamma_S}, \qquad (12)$$

where $X^*$ is an optimal solution of (2), $T$ is the total number of iterations, and $\gamma_S$ is a positive constant satisfying

$$\gamma_S \ge \min_{0\le s\le S-1}\left\{\eta_s\left[\frac{1}{2} - \eta_s\left(1+4(m-1)L^3\eta_s^2\right)\right]\right\},$$

where $\{\eta_s\}$ are the SBB step sizes specified in (8).

If further the Hessian $\nabla^2 F(X)$ exists and $\mu > 0$ is a lower bound of the magnitudes of the eigenvalues of $\nabla^2 F(X)$ for any bounded $X$, then the convergence rate (12) still holds for SVRG-SBB with $\epsilon$ replaced by $\epsilon + \mu$. In addition, in this case we can take $\epsilon = 0$, and (12) still holds for SVRG-SBB$_0$ with $\epsilon$ replaced by $\mu$.

Theorem 1 is an adaptation of (Reddi et al. 2016, Theorem 2), noting that the SBB step size specified in (8) satisfies (9). The proof of this theorem is presented in the supplementary material. Theorem 1 gives a non-asymptotic $O(1/T)$ rate of convergence of Algorithm 1 in the sense of convergence to a stationary point. Similar convergence rates of SVRG under different settings have also been shown in (Reddi et al. 2016; Allen-Zhu and Hazan 2016).

Note that the Lipschitz differentiability of the objective function is crucial for establishing the convergence rate of SVRG-SBB in Theorem 1. In the following, we give a lemma showing that a part of the aforementioned objective functions, namely (4), (5) and (6), are Lipschitz differentiable. Proof details are provided in the supplementary material.

###### Lemma 1.

The ordinal embedding loss functions (4), (5) and (6) are Lipschitz differentiable for any bounded variable $X$.

## Experiments

In this section, we conduct a series of simulations and real-world data experiments to demonstrate the effectiveness of the proposed algorithm. Three models, GNMDS, STE and TSTE, are taken into consideration.

### A. Simulations

We start with a small-scale synthetic experiment to show how the methods perform in an idealized setting that provides sufficient ordinal information in the noiseless case.

Settings. The synthesized dataset consists of $n$ points drawn from a Gaussian distribution in $\mathbb{R}^d$ whose covariance is a multiple of the identity matrix. The possible similarity triple comparisons are generated based on the Euclidean distances between the points. As (Jain, Jamieson, and Nowak 2016) proved that the Gram matrix can be recovered from $O(dn\log n)$ triplets, we randomly choose a subset of triplets as the training set and use the rest as the test set. The regularization parameter and step size settings for the convex formulations follow the default settings of the STE/TSTE implementation, so we do not choose the step size by line search or the halving heuristic for the convex formulations. The embedding dimension is selected to be equal to $d$ without variation, because the effect of different embedding dimensions has been discussed in the original papers of GNMDS, STE and TSTE.

Evaluation Metrics. The metrics used in the evaluation of the various algorithms are the generalization error and the running time. As the embedding learned from a partial set of triple comparisons should generalize to unknown triplets, the percentage of held-out triplets satisfied by the embedding is used as the main metric for evaluating quality. The running time is the duration of an algorithm until its training error drops below a given threshold.

Competitors. We evaluate both convex and non-convex formulations of the three objective functions (i.e., GNMDS, STE and TSTE). We set two baselines: (i) the convex objective functions, whose results are denoted as “convex”, and (ii) the non-convex objective functions solved by batch gradient descent, denoted as “ncvx batch”. We compare the performance of SVRG-SBB with SGD, SVRG with a fixed step size (called SVRG for short henceforth), and the batch gradient descent methods. As SVRG and its variant SVRG-SBB require several (sub)gradient evaluations per epoch, the batch and SGD solutions are evaluated with the same number of (sub)gradient evaluations as SVRG. In Figure 2, the $x$-axis is the computational cost, measured by the number of gradient evaluations divided by the total number of triple-wise constraints $|\mathcal{P}|$. The generalization error is the result of 50 trials with different initial values of $X$. For each epoch, the median generalization error over the 50 trials, with the [0.25, 0.75] confidence interval, is plotted. The experimental results are shown in Figure 2 and Table 1.

Results. From Figure 2, the following phenomena can be observed. First, SVRG-SBB$_0$ can be unstable during the initial several epochs for all three models, but later becomes very stable. The eventual performance of SVRG-SBB$_0$ and that of SVRG-SBB with $\epsilon>0$ are almost the same in all three cases. Second, compared to the batch methods, all the stochastic methods, including SGD, SVRG and SVRG-SBB (with $\epsilon = 0$ or $\epsilon > 0$), converge fast during the initial several epochs and quickly reach admissible results with relatively small generalization error. This is one of our main motivations for using stochastic methods. In particular, for all three models, SVRG-SBB outperforms all the other methods in the sense that it not only converges fastest but also achieves almost the best generalization error. Moreover, the advantage of SVRG-SBB in terms of CPU time can also be observed from Table 1. Specifically, the speedup of SVRG-SBB over SVRG is about 4 times for all three models.

Table 1 shows the computational cost of SGD, SVRG-SBB and batch gradient descent for the convex and non-convex objective functions. All computation is done using MATLAB R2016b, on a desktop PC running Windows, with an Intel Xeon E3-1226 v3 CPU and DDR3 memory. It is easy to see that for all objective functions, SVRG-SBB gains a speed-up compared to the other methods. Besides, we notice that the convex methods can be effective when $n$ is small, as the projection operator is then not the bottleneck of the convex algorithms.

### B. Image Retrieval on SUN397

Here, we apply the proposed SVRG-SBB algorithm to a real-world dataset, SUN397, which is commonly used for the image retrieval task. In this experiment, we wish to see how the learned representation characterizes the “relevance” of images within the same category and the “discrimination” between different image categories. Hence, we use the image representation obtained by ordinal embedding for image retrieval.

Settings. We evaluate the effectiveness of the proposed stochastic non-convex ordinal embedding method for visual search on the SUN397 dataset, which consists of around 108K images from 397 scene categories. Each image is represented by a low-dimensional feature vector extracted by principal component analysis (PCA) from Deep Convolutional Activation Features (Gong et al. 2014). We form the training set by randomly sampling images from a subset of categories, with the same number of images in each category. Only the training set is used for learning an ordinal embedding and a nonlinear mapping from the original feature space to the embedding space. The nonlinear mapping is used to predict the embeddings of images which do not participate in the relative similarity comparisons; we obtain it with Regularized Least Squares and a radial basis function (RBF) kernel. The test set consists of images randomly chosen from the same categories. We use the ground-truth category labels of the training images to generate triple-wise comparisons without any error: if two images $x_i$ and $x_j$ are from the same category and image $x_k$ is from another category, the similarity between $x_i$ and $x_j$ is larger than the similarity between $x_i$ and $x_k$, which is indicated by a triplet $(i,j,k)$. Errors are then synthesized to simulate human error in crowd-sourcing: we randomly sample a fraction of the triplets and exchange the positions of $x_j$ and $x_k$ in each selected triplet $(i,j,k)$.

Evaluation Metrics. To measure the effectiveness of the various ordinal embedding methods for visual search, we consider three evaluation metrics: precision at top-$K$ positions (Precision@K), recall at top-$K$ positions (Recall@K), and Mean Average Precision (MAP). Given the mapped features of the test images and a query image $q$ belonging to class $c$, we sort the other images according to the distances between their embeddings and that of $q$ in ascending order. True positives ($TP$) are images correctly labeled as positives, i.e., images belonging to $c$ that are listed within the top $K$. False positives ($FP$) refer to negative examples incorrectly labeled as positives, i.e., images not belonging to $c$ that are listed within the top $K$. True negatives ($TN$) correspond to negatives correctly labeled as negatives, i.e., images not belonging to $c$ that are listed after the top $K$. Finally, false negatives ($FN$) refer to positive examples incorrectly labeled as negatives, i.e., images belonging to $c$ that are listed after the top $K$. Precision@K and Recall@K are then defined as $\mathrm{Precision@K} = \frac{TP}{TP+FP} = \frac{TP}{K}$ and $\mathrm{Recall@K} = \frac{TP}{TP+FN}$. Precision and recall are single-valued metrics based on the whole ranking list determined by the Euclidean distances among the embeddings. It is desirable to also consider the order in which the images from the same category are retrieved. By computing precision and recall at every position in the ranked sequence for query $q$, one can plot a precision-recall curve, plotting precision $p(r)$ as a function of recall $r$. Average Precision (AP) computes the average value of $p(r)$ over the interval from $r=0$ to $r=1$, i.e., $\mathrm{AP} = \int_0^1 p(r)\,dr$, which is the area under the precision-recall curve. This integral can be replaced with a finite sum over every position in the ranked sequence: $\mathrm{AP} = \sum_K p(K)\,\triangle r(K)$, where $\triangle r(K)$ is the change in recall from position $K-1$ to $K$. The MAP used in this paper is the mean of AP over all queries.
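These metrics can be sketched as follows; the helper names are our own, and AP is computed via the finite-sum form, where recall increases by $1/(\text{number of relevant items})$ exactly at each hit:

```python
import numpy as np

def precision_recall_at_k(ranked_labels, query_label, K):
    """Precision@K and Recall@K for one query, given labels sorted by ascending distance."""
    rel = np.asarray(ranked_labels) == query_label
    tp = int(rel[:K].sum())                 # relevant items within the top K
    n_rel = int(rel.sum())                  # TP + FN: all relevant items
    return tp / K, (tp / n_rel if n_rel else 0.0)

def average_precision(ranked_labels, query_label):
    """AP = sum_K p(K) * delta r(K); delta r(K) = 1/n_rel at each hit, 0 elsewhere."""
    rel = np.asarray(ranked_labels) == query_label
    hits = np.flatnonzero(rel)              # 0-based ranks of the relevant items
    if hits.size == 0:
        return 0.0
    return float(np.mean([(i + 1) / (pos + 1) for i, pos in enumerate(hits)]))
```

MAP is then simply the mean of `average_precision` over all query images.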

Results. The experimental results are shown in Table 2 and Figure 3. As shown in Table 2 and Figure 3, with $K$ varying, we observe that the non-convex formulation solved by SVRG-SBB consistently achieves superior Precision@K, Recall@K and MAP results compared against the other methods with the same number of gradient evaluations. The results on GNMDS illustrate that SVRG-SBB is more suitable for non-convex objective functions than the other methods. Therefore, SVRG-SBB has very promising potential in practice, because it generates appropriate step sizes automatically while running the algorithm, and the result is robust. Moreover, under our setting with small noise, all the ordinal embedding methods achieve reasonable results for image retrieval. This illustrates that high-quality relative similarity comparisons can be used to learn meaningful representations of massive data, thereby making it easier to extract useful information in other applications.

## Conclusions

In this paper, we propose a stochastic non-convex framework for the ordinal embedding problem, together with a novel stochastic gradient descent algorithm called SVRG-SBB for solving it. SVRG-SBB is a variant of the SVRG method incorporating the so-called stabilized BB (SBB) step size, a new, stable and adaptive step size introduced in this paper. The main idea of the SBB step size is to add another positive term to the absolute value of the denominator of the original BB step size, so that SVRG-SBB can overcome the instability of the original BB step size when applied to such non-convex problems. A series of simulations and real-world data experiments demonstrate the effectiveness of the proposed SVRG-SBB for the ordinal embedding problem; notably, SVRG-SBB outperforms most of the state-of-the-art methods in terms of both generalization error and computational cost. We also establish an $O(1/T)$ convergence rate of SVRG-SBB in terms of convergence to a stationary point, which is comparable to the best existing convergence results for SVRG in the literature.

## References

• [Agarwal et al. 2007] Agarwal, S.; Wills, J.; Cayton, L.; Lanckriet, G. R.; Kriegman, D. J.; and Belongie, S. 2007. Generalized non-metric multidimensional scaling. International Conference on Artificial Intelligence and Statistics 11–18.
• [Allen-Zhu and Hazan 2016] Allen-Zhu, Z., and Hazan, E. 2016. Variance reduction for faster non-convex optimization. International Conference on Machine Learning 699–707.
• [Arias-Castro 2017] Arias-Castro, E. 2017. Some theory for ordinal embedding. Bernoulli 23(3):1663–1693.
• [Ghadimi and Lan 2013] Ghadimi, S., and Lan, G. 2013. Stochastic first- and zeroth-order methods for nonconvex stochastic programming. SIAM Journal on Optimization 23(4):2341–2368.
• [Gong et al. 2014] Gong, Y.; Wang, L.; Guo, R.; and Lazebnik, S. 2014. Multi-scale orderless pooling of deep convolutional activation features. European Conference on Computer Vision 392–407.
• [Heikinheimo and Ukkonen 2013] Heikinheimo, H., and Ukkonen, A. 2013. The crowd-median algorithm. AAAI Conference on Human Computation and Crowdsourcing 69–77.
• [Jain, Jamieson, and Nowak 2016] Jain, L.; Jamieson, K. G.; and Nowak, R. 2016. Finite sample prediction and recovery bounds for ordinal embedding. Annual Conference on Neural Information Processing Systems 2711–2719.
• [Jamieson and Nowak 2011a] Jamieson, K., and Nowak, R. 2011a. Active ranking using pairwise comparisons. Annual Conference on Neural Information Processing Systems 2240–2248.
• [Jamieson and Nowak 2011b] Jamieson, K., and Nowak, R. 2011b. Low-dimensional embedding using adaptively selected ordinal data. Annual Allerton Conference on Communication, Control, and Computing 1077–1084.
• [Johnson and Zhang 2013] Johnson, R., and Zhang, T. 2013. Accelerating stochastic gradient descent using predictive variance reduction. Annual Conference on Neural Information Processing Systems 315–323.
• [Kleindessner and Luxburg 2014] Kleindessner, M., and Luxburg, U. 2014. Uniqueness of ordinal embedding. Conference on Learning Theory 40–67.
• [Kruskal 1964a] Kruskal, J. B. 1964a. Multidimensional scaling by optimizing goodness of fit to a nonmetric hypothesis. Psychometrika 29(1):1–27.
• [Kruskal 1964b] Kruskal, J. B. 1964b. Nonmetric multidimensional scaling: A numerical method. Psychometrika 29(2):115–129.
• [Lan et al. 2012] Lan, Y.; Guo, J.; Cheng, X.; and Liu, T.-Y. 2012. Statistical consistency of ranking methods in a rank-differentiable probability space. Annual Conference on Neural Information Processing Systems (NIPS) 1241–1249.
• [McFee and Lanckriet 2011] McFee, B., and Lanckriet, G. 2011. Learning multi-modal similarity. Journal of Machine Learning Research (JMLR) 12:491–523.
• [Reddi et al. 2016] Reddi, S. J.; Hefny, A.; Sra, S.; Poczos, B.; and Smola, A. 2016. Stochastic variance reduction for nonconvex optimization. International Conference on Machine Learning 314–323.
• [Shaw and Jebara 2009] Shaw, B., and Jebara, T. 2009. Structure preserving embedding. International Conference on Machine Learning 937–944.
• [Shepard 1962a] Shepard, R. N. 1962a. The analysis of proximities: Multidimensional scaling with an unknown distance function. i. Psychometrika 27(2):125–140.
• [Shepard 1962b] Shepard, R. N. 1962b. The analysis of proximities: Multidimensional scaling with an unknown distance function. ii. Psychometrika 27(3):219–246.
• [Song et al. 2015] Song, D.; Liu, W.; Ji, R.; Meyer, D. A.; and Smith, J. R. 2015. Top rank supervised binary coding for visual search. IEEE International Conference on Computer Vision 1922–1930.
• [Tamuz et al. 2011] Tamuz, O.; Liu, C.; Shamir, O.; Kalai, A.; and Belongie, S. 2011. Adaptively learning the crowd kernel. International Conference on Machine Learning 673–680.
• [Tan et al. 2016] Tan, C.; Ma, S.; Dai, Y.-H.; and Qian, Y. 2016. Barzilai-Borwein step size for stochastic gradient descent. Annual Conference on Neural Information Processing Systems 685–693.
• [Terada and Luxburg 2014] Terada, Y., and Luxburg, U. 2014. Local ordinal embedding. International Conference on Machine Learning 847–855.
• [van der Maaten and Weinberger 2012] van der Maaten, L., and Weinberger, K. 2012. Stochastic triplet embedding. IEEE International Workshop on Machine Learning for Signal Processing 1–6.
• [Wah et al. 2014] Wah, C.; Horn, G. V.; Branson, S.; Maji, S.; Perona, P.; and Belongie, S. 2014. Similarity comparisons for interactive fine-grained categorization. IEEE Conference on Computer Vision and Pattern Recognition 859–866.
• [Wilber et al. 2015] Wilber, M.; Kwak, I.; Kriegman, D.; and Belongie, S. 2015. Learning concept embeddings with combined human-machine expertise. IEEE International Conference on Computer Vision 981–989.
• [Wilber, Kwak, and Belongie 2014] Wilber, M.; Kwak, S.; and Belongie, S. 2014. Cost-effective hits for relative similarity comparisons. AAAI Conference on Human Computation and Crowdsourcing 227–233.

## Proof of Lemma 1

###### Proof.

Let

$$X_p=\begin{pmatrix}X_1\\ X_2\end{pmatrix}=\begin{pmatrix}x_i\\ x_j\\ x_l\\ x_k\end{pmatrix}, \tag{13}$$

where $X_1=\begin{pmatrix}x_i\\ x_j\end{pmatrix}$, $X_2=\begin{pmatrix}x_l\\ x_k\end{pmatrix}$, and

$$M=\begin{pmatrix}M_1&O\\ O&M_2\end{pmatrix}=\begin{pmatrix}I&-I&O&O\\ -I&I&O&O\\ O&O&-I&I\\ O&O&I&-I\end{pmatrix}, \tag{14}$$

where $I$ is the identity matrix and $O$ is the zero matrix.

The first-order gradient of CKL is

$$\nabla f^{ckl}_p(X)=\frac{2MX_p}{d_{ij}^2(X)+d_{lk}^2(X)+2\delta}-\frac{2\begin{pmatrix}O\\ M_2X_2\end{pmatrix}}{d_{lk}^2(X)+\delta}, \tag{15}$$

and the second-order Hessian matrix of CKL is

$$\begin{aligned}\nabla^2 f^{ckl}_p(X)=\;&\frac{2M}{d_{ij}^2(X)+d_{lk}^2(X)+2\delta}-\frac{4MX_pX_p^TM}{[d_{ij}^2(X)+d_{lk}^2(X)+2\delta]^2}\\ &-\frac{2\begin{pmatrix}O&O\\ O&M_2\end{pmatrix}}{d_{lk}^2(X)+\delta}+\frac{4\begin{pmatrix}O&O\\ O&M_2X_2X_2^TM_2\end{pmatrix}}{[d_{lk}^2(X)+\delta]^2}.\end{aligned} \tag{16}$$

As $X$ is bounded, $d_{ij}^2(X)$ and $d_{lk}^2(X)$ are bounded. So any element of $\nabla^2 f^{ckl}_p(X)$ is bounded, and $\nabla^2 f^{ckl}_p(X)$ has bounded eigenvalues. By the definition of Lipschitz continuity, the loss function of CKL (4) has Lipschitz continuous gradient with a bounded Lipschitz constant.
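As a numerical sanity check of this boundedness argument, the gradient of the CKL quadruple loss can be compared against finite differences. The sketch below is our own, assuming the common CKL form $-\log\big[(d_{lk}^2+\delta)/(d_{ij}^2+d_{lk}^2+2\delta)\big]$ and arbitrary points and $\delta$; the per-point gradient blocks are written out directly:

```python
import numpy as np

rng = np.random.default_rng(0)
delta = 0.5  # arbitrary scale parameter for this illustration
xi, xj, xl, xk = rng.standard_normal((4, 3))  # four points in R^3

def ckl_loss(xi, xj, xl, xk):
    dij2 = np.sum((xi - xj) ** 2)
    dlk2 = np.sum((xl - xk) ** 2)
    # -log of the CKL probability (d_lk^2 + delta) / (d_ij^2 + d_lk^2 + 2 delta)
    return np.log(dij2 + dlk2 + 2 * delta) - np.log(dlk2 + delta)

def ckl_grad(xi, xj, xl, xk):
    dij2 = np.sum((xi - xj) ** 2)
    dlk2 = np.sum((xl - xk) ** 2)
    S = dij2 + dlk2 + 2 * delta
    gi = 2 * (xi - xj) / S                                # block for x_i (x_j is -gi)
    gl = 2 * (xl - xk) / S - 2 * (xl - xk) / (dlk2 + delta)  # block for x_l (x_k is -gl)
    return np.concatenate([gi, -gi, gl, -gl])

# central finite differences over all 12 coordinates
eps = 1e-6
x = np.concatenate([xi, xj, xl, xk])
num = np.zeros_like(x)
for a in range(x.size):
    e = np.zeros_like(x); e[a] = eps
    num[a] = (ckl_loss(*np.split(x + e, 4)) - ckl_loss(*np.split(x - e, 4))) / (2 * eps)

assert np.allclose(num, ckl_grad(xi, xj, xl, xk), atol=1e-5)
print("CKL analytic gradient matches finite differences")
```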

The first-order gradient of STE (5) is

$$\nabla f^{ste}_p(X)=\frac{2\exp\!\big(d_{ij}^2(X)-d_{lk}^2(X)\big)}{1+\exp\!\big(d_{ij}^2(X)-d_{lk}^2(X)\big)}MX_p \tag{19}$$

and the second-order Hessian matrix of STE (5) is

$$\begin{aligned}\nabla^2 f^{ste}_p(X)=\;&\frac{4\exp\!\big(d_{ij}^2(X)-d_{lk}^2(X)\big)}{\big[1+\exp\!\big(d_{ij}^2(X)-d_{lk}^2(X)\big)\big]^2}MX_pX_p^TM\\ &+\frac{2\exp\!\big(d_{ij}^2(X)-d_{lk}^2(X)\big)}{1+\exp\!\big(d_{ij}^2(X)-d_{lk}^2(X)\big)}M.\end{aligned} \tag{20}$$

All elements of $\nabla^2 f^{ste}_p(X)$ are bounded, so the eigenvalues of $\nabla^2 f^{ste}_p(X)$ are bounded. By the definition of Lipschitz continuity, (5) has Lipschitz continuous gradient with a bounded Lipschitz constant.
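The Lipschitz claim can also be probed empirically: on a bounded set, the ratio $\|\nabla f(x)-\nabla f(y)\|/\|x-y\|$ stays bounded. A minimal numpy sketch of ours, using the STE loss $\log\big(1+\exp(d_{ij}^2-d_{lk}^2)\big)$ and random probe points in a box:

```python
import numpy as np

rng = np.random.default_rng(1)

def ste_grad(x):
    # x: flat vector of four points (x_i, x_j, x_l, x_k) in R^3
    xi, xj, xl, xk = np.split(x, 4)
    dij2 = np.sum((xi - xj) ** 2)
    dlk2 = np.sum((xl - xk) ** 2)
    s = 1.0 / (1.0 + np.exp(-(dij2 - dlk2)))  # logistic factor in (0, 1)
    gi = 2 * s * (xi - xj)   # block for x_i (x_j gets -gi)
    gl = -2 * s * (xl - xk)  # block for x_l (x_k gets -gl)
    return np.concatenate([gi, -gi, gl, -gl])

# probe the Lipschitz ratio ||grad(x) - grad(y)|| / ||x - y|| on the box [-1, 1]^12
ratios = []
for _ in range(2000):
    x, y = rng.uniform(-1, 1, (2, 12))
    ratios.append(np.linalg.norm(ste_grad(x) - ste_grad(y)) / np.linalg.norm(x - y))
print("largest observed Lipschitz ratio:", max(ratios))
assert max(ratios) < 100  # finite: the gradient is Lipschitz on the bounded set
```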

The first-order gradient of GNMDS (3) is

$$\nabla f^{gnmds}_p(X)=\begin{cases}0,&\text{if } d_{ij}^2(X)+1-d_{lk}^2(X)<0,\\ 2MX_p,&\text{otherwise},\end{cases} \tag{21}$$

and the second-order Hessian matrix of GNMDS (3) is

$$\nabla^2 f^{gnmds}_p(X)=\begin{cases}0,&\text{if } d_{ij}^2(X)+1-d_{lk}^2(X)<0,\\ 2M,&\text{otherwise}.\end{cases} \tag{22}$$

If $d_{ij}^2(X)+1-d_{lk}^2(X)\neq 0$ for all quadruples $p$, then $\nabla f^{gnmds}_p(X)$ is continuous and the Hessian matrix has bounded eigenvalues. So GNMDS has Lipschitz continuous gradient on any quadruple set in which $d_{ij}^2(X)+1-d_{lk}^2(X)\neq 0$. As the special case $d_{ij}^2(X)+1-d_{lk}^2(X)=0$ is exceedingly rare in practice, we treat the objective piecewise and employ SGD and SVRG to optimize the objective function of GNMDS (3). The empirical results are shown in the Experiment section.
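The kink at the boundary can be seen numerically: moving $x_k$ across $d_{ij}^2(X)+1-d_{lk}^2(X)=0$ makes the one-sided slopes of the hinge jump from a nonzero value to zero. A small illustration of ours, assuming the GNMDS quadruple loss is the hinge $\max\big(0,\,d_{ij}^2+1-d_{lk}^2\big)$ consistent with (21), with arbitrarily chosen points:

```python
import numpy as np

def gnmds_loss(xi, xj, xl, xk):
    dij2 = np.sum((xi - xj) ** 2)
    dlk2 = np.sum((xl - xk) ** 2)
    return max(0.0, dij2 + 1.0 - dlk2)  # hinge on the margin

# move x_k radially so that d_lk^2 crosses d_ij^2 + 1: the hinge switches branch
xi, xj, xl = np.zeros(2), np.array([1.0, 0.0]), np.zeros(2)
direction = np.array([0.0, 1.0])
eps = 1e-4
r = np.sqrt(2.0)  # at ||x_l - x_k|| = sqrt(2), d_ij^2 + 1 - d_lk^2 = 0
slope_in = (gnmds_loss(xi, xj, xl, (r - eps) * direction)
            - gnmds_loss(xi, xj, xl, (r - 2 * eps) * direction)) / eps
slope_out = (gnmds_loss(xi, xj, xl, (r + 2 * eps) * direction)
             - gnmds_loss(xi, xj, xl, (r + eps) * direction)) / eps
print(slope_in, slope_out)  # nonzero inside the margin, zero outside
assert abs(slope_out) < 1e-8 and abs(slope_in) > 1.0
```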

Note that the first-order gradient of TSTE is given in (17) and the second-order Hessian matrix of TSTE in (18). The boundedness of the eigenvalues of the Hessian implies that the TSTE loss function (6) has Lipschitz continuous gradient with a bounded Lipschitz constant.

We focus on a special case of quadruple comparisons, the triple-wise comparisons $(i,j,k)$ obtained by setting $l=i$, in the Experiment section. To verify the Lipschitz continuous gradient of the ordinal embedding objective functions with triple-wise comparisons, we introduce the matrix $A$ as

$$A=\begin{pmatrix}I&O&I&O\\ O&I&O&O\\ O&O&O&I\end{pmatrix}. \tag{23}$$

By the chain rule, we have

$$\nabla f_{ijk}(X)=A\nabla f_{ijlk}(X),\qquad \nabla^2 f_{ijk}(X)=A\nabla^2 f_{ijlk}(X)A^T, \tag{24}$$

where $f_{ijlk}$ is the quadruple loss evaluated at $x_l=x_i$. As $A$ is a constant matrix and $\nabla^2 f_{ijlk}(X)$ is bounded, all elements of the Hessian matrix $\nabla^2 f_{ijk}(X)$ are bounded. So the eigenvalues of $\nabla^2 f_{ijk}(X)$ are also bounded. The ordinal embedding functions of CKL, STE and TSTE with triple-wise comparisons have Lipschitz continuous gradient with a bounded Lipschitz constant. ∎
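The chain rule (24) can be checked numerically: with $l=i$, multiplying the quadruple gradient by the selection matrix $A$ reproduces the finite-difference gradient of the triple-wise loss. A sketch of ours using the STE loss:

```python
import numpy as np

d = 3
rng = np.random.default_rng(2)
xi, xj, xk = rng.standard_normal((3, d))

def ste_quad_grad(xi, xj, xl, xk):
    # gradient of log(1 + exp(d_ij^2 - d_lk^2)) w.r.t. (x_i, x_j, x_l, x_k)
    dij2 = np.sum((xi - xj) ** 2)
    dlk2 = np.sum((xl - xk) ** 2)
    s = 1.0 / (1.0 + np.exp(-(dij2 - dlk2)))
    gi, gl = 2 * s * (xi - xj), -2 * s * (xl - xk)
    return np.concatenate([gi, -gi, gl, -gl])

# selection matrix A: the triple-wise variables (x_i, x_j, x_k) collect the
# quadruple gradient blocks of x_i, x_j, x_l (= x_i) and x_k
I, O = np.eye(d), np.zeros((d, d))
A = np.block([[I, O, I, O],
              [O, I, O, O],
              [O, O, O, I]])

g_quad = ste_quad_grad(xi, xj, xi, xk)  # quadruple gradient with x_l = x_i
g_tri = A @ g_quad                      # chain rule: gradient of f_ijk

# finite-difference gradient of the triple-wise loss f_ijk directly
def f_tri(v):
    a, b, c = np.split(v, 3)
    dij2 = np.sum((a - b) ** 2)
    dik2 = np.sum((a - c) ** 2)
    return np.log1p(np.exp(dij2 - dik2))

x = np.concatenate([xi, xj, xk])
eps = 1e-6
num = np.array([(f_tri(x + eps * e) - f_tri(x - eps * e)) / (2 * eps)
                for e in np.eye(3 * d)])
assert np.allclose(num, g_tri, atol=1e-5)
print("A maps the quadruple gradient to the triple-wise gradient")
```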

## Proof of Theorem 1

In this section, we prove our main theorem, i.e., Theorem 1. Before doing this, we need some lemmas.

###### Lemma 2.

Given some positive integer $l$, then for any $x$ satisfying $0<lx\le 1$, the following holds:

$$(1+x)^l\le e^{lx}\le 1+2lx.$$
###### Proof.

Note that

$$(1+x)^l=e^{l\cdot\ln(1+x)}\le e^{lx},$$

where the inequality holds since $\ln(1+x)\le x$ for any $x>-1$. Thus, we get the first inequality. Let $g(t):=e^t-1-2t$ for $t\in[0,1]$. It is easy to check that $g(t)\le 0$ for any $t\in[0,1]$. Thus we get the second inequality. ∎
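A quick numeric spot-check of the two inequalities on a grid with $0<lx\le 1$:

```python
import math

# spot-check (1+x)^l <= exp(l*x) <= 1 + 2*l*x for several l, with l*x in (0, 1]
for l in (1, 2, 5, 50):
    for k in range(1, 101):
        x = k / (100 * l)  # so that l*x sweeps (0, 1]
        assert (1 + x) ** l <= math.exp(l * x) <= 1 + 2 * l * x, (l, x)
print("both inequalities hold on the sampled grid")
```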

In the following, we provide a key lemma that shows the convergence behavior of the inner loop of the proposed algorithm. Before presenting it, we define several constants and sequences. For any $s\ge 0$, let

$$\beta_s:=4(m-1)L^3\eta_s^2, \tag{25}$$
$$\rho_s:=1+2\eta_s^2L^2[2(m-1)\eta_sL+1]. \tag{26}$$

Let $\{c^{s+1}_t\}_{t=1}^{m}$ be a nonnegative sequence satisfying $c^{s+1}_m=0$ and, for $t=m-1,\dots,1$,

$$c^{s+1}_t=c^{s+1}_{t+1}(1+\eta_s\beta_s+2\eta_s^2L^2)+\eta_s^2L^3.$$

It is obvious that $c^{s+1}_t$ is monotonically decreasing as $t$ increases, and

$$c^{s+1}_1=\frac{\eta_sL^3\big(\rho_s^{m-1}-1\big)}{\beta_s+2\eta_sL^2}.$$

Then we define

$$\Gamma^{s+1}_t:=\eta_s\left[1-\frac{c^{s+1}_{t+1}}{\beta_s}-\eta_s\big(1+2c^{s+1}_{t+1}\big)\right], \tag{27}$$

for any $s\ge 0$ and $t=0,1,\dots,m-1$.
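The backward recursion for $c^{s+1}_t$, its closed form, and the positivity of $\Gamma^{s+1}_t$ under conditions (28) and (29) can be checked numerically. In the sketch below, the values of $L$, $m$, and $\eta_s$ are placeholders of ours, not from the paper:

```python
import numpy as np

L, m, eta = 1.0, 10, 0.02  # illustrative placeholder parameters

beta = 4 * (m - 1) * L ** 3 * eta ** 2
rho = 1 + 2 * eta ** 2 * L ** 2 * (2 * (m - 1) * eta * L + 1)

c = np.zeros(m + 1)  # c[m] = 0, indices 1..m used
for t in range(m - 1, 0, -1):
    c[t] = c[t + 1] * (1 + eta * beta + 2 * eta ** 2 * L ** 2) + eta ** 2 * L ** 3

# closed form for c_1 as a geometric sum, matching the recursion
c1_closed = eta * L ** 3 * (rho ** (m - 1) - 1) / (beta + 2 * eta * L ** 2)
assert np.isclose(c[1], c1_closed)

# conditions (28) and (29) hold for this step size ...
assert eta + 4 * (m - 1) * eta ** 3 * L ** 3 < 0.5
assert (m - 1) * eta ** 2 * L ** 2 * (2 * (m - 1) * eta * L + 1) < 0.5

# ... hence c_t < beta/2 and every Gamma_t is positive
gammas = [eta * (1 - c[t + 1] / beta - eta * (1 + 2 * c[t + 1]))
          for t in range(m)]
assert all(c[t] < beta / 2 for t in range(1, m + 1)) and min(gammas) > 0
print("c_1 =", c[1], "min Gamma_t =", min(gammas))
```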

With the help of the sequences defined above, we present a key lemma as follows.

###### Lemma 3 (Convergence behavior of inner loop).

Let $\{X^{s+1}_t\}_{t=0}^{m}$ be the sequence generated by Algorithm 1 at the $(s+1)$-th inner loop, $s\ge 0$. If the following conditions hold

$$\eta_s+4(m-1)\eta_s^3L^3<1/2, \tag{28}$$
$$(m-1)\eta_s^2L^2[2(m-1)\eta_sL+1]<1/2, \tag{29}$$

then

$$\mathbb{E}\big[\|\nabla F(X^{s+1}_t)\|^2\big]\le\frac{R^{s+1}_t-R^{s+1}_{t+1}}{\Gamma^{s+1}_t},$$

where $R^{s+1}_t:=\mathbb{E}\big[F(X^{s+1}_t)+c^{s+1}_t\|X^{s+1}_t-\tilde{X}^s\|^2\big]$, $t=0,1,\dots,m-1$.

###### Proof.

This lemma is a special case of (Reddi et al. 2016, Lemma 1) with some specific choices of the parameters. Thus, we only need to check that $\Gamma^{s+1}_t$ defined in (27) is positive for any $t$. In order to do this, we first check that

$$c^{s+1}_t<\frac{\beta_s}{2},\quad t=1,\dots,m \tag{30}$$

under the condition (29) of this lemma.

Since $c^{s+1}_t$ is monotonically decreasing in $t$, showing (30) is equivalent to showing $c^{s+1}_1<\beta_s/2$, i.e.,

$$\eta_sL^3\big(\rho_s^{m-1}-1\big)<\frac{1}{2}\beta_s\big(\beta_s+2\eta_sL^2\big). \tag{31}$$

By Lemma 2, if (29) holds, then

$$\rho_s^{m-1}<1+4(m-1)\eta_s^2L^2[2(m-1)\eta_sL+1].$$

It implies

$$\begin{aligned}\eta_sL^3\big(\rho_s^{m-1}-1\big)&<4(m-1)\eta_s^3L^5[2(m-1)\eta_sL+1]\\ &=2(m-1)\eta_s^2L^3\big[4(m-1)\eta_s^2L^3+2\eta_sL^2\big]\\ &=\frac{1}{2}\beta_s\big(\beta_s+2\eta_sL^2\big).\end{aligned}$$
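The second step in this chain is pure algebra in $\beta_s=4(m-1)L^3\eta_s^2$; a random numeric spot-check of the identity:

```python
import random

# spot-check: 4 (m-1) eta^3 L^5 [2 (m-1) eta L + 1] = (1/2) beta (beta + 2 eta L^2)
# with beta = 4 (m-1) L^3 eta^2, over random parameter draws
random.seed(0)
for _ in range(1000):
    m = random.randint(2, 100)
    eta = random.uniform(1e-4, 1.0)
    L = random.uniform(1e-2, 10.0)
    beta = 4 * (m - 1) * L ** 3 * eta ** 2
    lhs = 4 * (m - 1) * eta ** 3 * L ** 5 * (2 * (m - 1) * eta * L + 1)
    rhs = 0.5 * beta * (beta + 2 * eta * L ** 2)
    assert abs(lhs - rhs) <= 1e-9 * max(1.0, abs(lhs))
print("identity verified on 1000 random draws")
```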

Thus, (30) holds.

Next, we prove $\Gamma^{s+1}_t>0$ for $t=0,1,\dots,m-1$. By the definition of $\Gamma^{s+1}_t$ in (27), it holds that

$$\Gamma^{s+1}_t\ge\eta_s\left[1-\frac{c^{s+1}_1}{\beta_s}-\eta_s\big(1+2c^{s+1}_1\big)\right]>\eta_s\left[\frac{1}{2}-\eta_s(1+\beta_s)\right],$$

where the last inequality holds by (30). Thus, if (28) holds, then obviously $\Gamma^{s+1}_t>0$ for any $t=0,1,\dots,m-1$, which completes the proof. ∎