Finite Sample Prediction and Recovery Bounds for Ordinal Embedding

Lalit Jain, University of Wisconsin-Madison
Kevin Jamieson, University of California, Berkeley
Robert Nowak, University of Wisconsin-Madison
Abstract

The goal of ordinal embedding is to represent $n$ items as points in a low-dimensional Euclidean space given a set of constraints in the form of distance comparisons like “item $i$ is closer to item $j$ than item $k$”. Ordinal constraints like this often come from human judgments. To account for errors and variation in judgments, we consider the noisy situation in which the given constraints are independently corrupted by reversing the correct constraint with some probability. This paper makes several new contributions to this problem. First, we derive prediction error bounds for ordinal embedding with noise by exploiting the fact that the rank of a distance matrix of $n$ points in $\mathbb{R}^d$ is at most $d+2$. These bounds characterize how well a learned embedding predicts new comparative judgments. Second, we investigate the special case of a known noise model and study the Maximum Likelihood estimator. Third, knowledge of the noise model enables us to relate prediction errors to embedding accuracy. This relationship is highly non-trivial since we show that the linear map corresponding to distance comparisons is non-invertible, but there exists a nonlinear map that is invertible. Fourth, two new algorithms for ordinal embedding are proposed and evaluated in experiments.

1 Ordinal Embedding

Ordinal embedding, also known as non-metric multidimensional scaling, aims to represent $n$ items as points in $\mathbb{R}^d$ so that the distances between items agree as well as possible with a given set of ordinal comparisons such as “item $i$ is closer to item $j$ than to item $k$”. This is a classic problem that is often used to visualize perceptual similarities [1, 2]. Recently, several authors have proposed a variety of new algorithms for learning a metric embedding from ordinal information [3, 4, 5, 6, 7]. There has also been some theoretical progress towards characterizing the consistency of ordinal embedding methods. For example, it has been shown that the correct embedding can be learned in the limit as the number of items grows [8, 9, 10]. However, a major shortcoming of all prior work is the lack of generalization and embedding error bounds for problems involving a finite number of items and observations. This paper addresses this problem, developing error bounds for ordinal embedding algorithms with known observation models. The bounds explicitly show the dependence on the number of ordinal comparisons relative to the number of items and the dimension of the embedding. In addition, we propose two new algorithms for recovering embeddings: the first is based on unbiasing a nuclear norm constrained optimization and the second is based on projected gradient descent onto the space of rank-$d$ matrices, commonly referred to as hard thresholding. Both methods match state-of-the-art performance while being simpler to implement and faster to converge.

1.1 Ordinal Embedding from Noisy Data

Consider $n$ points $x_1, x_2, \dots, x_n \in \mathbb{R}^d$. Let $X = [x_1, \dots, x_n]^T \in \mathbb{R}^{n \times d}$ denote the matrix whose rows are these points. The Euclidean distance matrix $D^*$ is defined to have elements $D^*_{ij} = \|x_i - x_j\|^2$. Ordinal embedding is the problem of recovering the points given ordinal constraints on distances. This paper focuses on “triplet” constraints of the form $D^*_{ij} < D^*_{ik}$, where $i$, $j$, and $k$ are distinct. Furthermore, we only observe noisy indications of these constraints, as follows. Each triplet $t = (i,j,k)$ has an associated probability $p_t$ satisfying

$$p_t > \tfrac{1}{2} \;\iff\; D^*_{ij} < D^*_{ik}.$$

Let $\mathcal{S}$ denote a collection of triplets drawn independently and uniformly at random, and for each $t \in \mathcal{S}$ we observe an independent random variable $y_t \in \{-1, +1\}$ with $y_t = 1$ with probability $p_t$, and $y_t = -1$ otherwise. The goal is to recover the embedding from these data. Exact recovery of $D^*$ from such data requires a known link between the $p_t$ and $D^*$. To this end, our main focus is the following problem.

Ordinal Embedding from Noisy Data: Consider $n$ points $x_1, \dots, x_n$ in $d$-dimensional Euclidean space. Let $\mathcal{S}$ denote a collection of triplets, and for each $t = (i,j,k) \in \mathcal{S}$ observe an independent random variable $y_t \in \{-1,+1\}$ with $\mathbb{P}(y_t = 1) = p_t = f\big(D^*_{ik} - D^*_{ij}\big)$, where the link function $f$ is known. Estimate $D^*$ from $\mathcal{S}$, $\{y_t\}_{t \in \mathcal{S}}$, and $f$.

For example, if $f$ is the logistic function $f(x) = \big(1 + e^{-x}\big)^{-1}$, then for triplet $t = (i,j,k)$

$$p_t \;=\; f\big(D^*_{ik} - D^*_{ij}\big) \;=\; \frac{1}{1 + \exp\!\big(-(D^*_{ik} - D^*_{ij})\big)}. \qquad (1)$$

We stress, however, that we only require the existence of a link function for exact recovery of $D^*$. Indeed, if one just wishes to predict the answers to unobserved triplets, then the results of Section 2 hold for arbitrary probabilities $p_t$. We note that a problem known as one-bit matrix completion has similarly used link functions to recover structure [11]. However, while that work made direct measurements of the entries of the matrix, in this work we observe linear projections of the Gram matrix, and as we will see later, the collection of these linear operators has a non-trivial kernel, creating non-trivial challenges.
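As a concrete illustration of this observation model, the following sketch (hypothetical function and variable names, assuming the logistic link of (1) and the $y_t \in \{-1,+1\}$ convention above) simulates noisy triplet observations from a set of latent points.

```python
import numpy as np

def simulate_triplets(X, num_triplets, seed=0):
    """Sample triplets t = (i, j, k) uniformly at random and draw noisy labels y_t
    under a logistic link. X is an (n, d) array of latent points (an assumption
    made purely for illustration)."""
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    D = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)  # squared distance matrix
    triplets, labels = [], []
    for _ in range(num_triplets):
        i, j, k = rng.choice(n, size=3, replace=False)
        margin = D[i, k] - D[i, j]           # positive when i is closer to j than to k
        p = 1.0 / (1.0 + np.exp(-margin))    # logistic link: P(y_t = +1)
        labels.append(1 if rng.random() < p else -1)
        triplets.append((i, j, k))
    return triplets, np.array(labels)
```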

1.2 Organization of Paper

This paper takes the following approach to ordinal embedding. First, we derive prediction error bounds for ordinal embedding with noise in Section 2. These bounds exploit the fact that the rank of a distance matrix of $n$ points in $\mathbb{R}^d$ is at most $d+2$. Then we consider the special case of a known observation model and the Maximum Likelihood estimator in Section 3. The link function enables us to relate prediction errors to error bounds on estimates of the distance differences $D^*_{ik} - D^*_{ij}$. In Section 4, we study the non-trivial problem of recovering $D^*$, and thus the embedding, from these differences. Lastly, in Section 5, two new algorithms for ordinal embedding are proposed and experimentally evaluated.

1.3 Notation and Assumptions

We will use $D^*$ and $G^*$ to denote the distance and Gram matrices of the latent embedding, and $D$ and $G$ to denote an arbitrary distance matrix and its corresponding Gram matrix. The observations $y_t$ carry information about $D^*$, but distance matrices are invariant to rotation and translation, and therefore it may only be possible to recover the points up to a rigid transformation. Therefore, we make the following assumption throughout the paper.

Assumption 1.

To eliminate the translational ambiguity, assume the points are centered at the origin (i.e., $\sum_{i=1}^n x_i = 0$).

Define the centering matrix $V := I - \frac{1}{n}\mathbf{1}\mathbf{1}^T$. Note that under the above assumption, $VX = X$. Note that $D^*$ is determined by the Gram matrix $G^* = XX^T$. In addition, $X$ can be determined from $G^*$ up to a unitary transformation. Furthermore, we will assume that the Gram matrix is “centered” so that $VG^*V = G^*$. Centering is equivalent to assuming the underlying points are centered at the origin (e.g., note that $\mathbf{1}^T G^* \mathbf{1} = \|\sum_{i=1}^n x_i\|^2$). It will be convenient in the paper to work with both the distance and Gram matrix representations, and the following identities will be useful to keep in mind. For any distance matrix $D$ and its centered Gram matrix $G$,

$$D \;=\; \mathrm{diag}(G)\,\mathbf{1}^T + \mathbf{1}\,\mathrm{diag}(G)^T - 2\,G, \qquad (2)$$
$$G \;=\; -\tfrac{1}{2}\, V D V, \qquad (3)$$

where $\mathrm{diag}(G)$ is the column vector composed of the diagonal of $G$. In particular, this establishes a bijection between centered Gram matrices and distance matrices. We refer the reader to [12] for an insightful and thorough treatment of the properties of distance matrices. We also define the set of all unique triplets

$$\mathcal{T} \;:=\; \big\{\, (i,j,k) : i, j, k \in \{1,\dots,n\} \text{ distinct},\ j < k \,\big\}.$$
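As a quick illustration of identities (2) and (3), the following sketch (hypothetical helper names) converts between a centered Gram matrix and its distance matrix and verifies the bijection numerically.

```python
import numpy as np

def gram_to_distance(G):
    """Identity (2): D_ij = G_ii + G_jj - 2 G_ij."""
    g = np.diag(G)
    return g[:, None] + g[None, :] - 2 * G

def distance_to_gram(D):
    """Identity (3): G = -1/2 V D V with centering matrix V = I - (1/n) 11^T."""
    n = D.shape[0]
    V = np.eye(n) - np.ones((n, n)) / n
    return -0.5 * V @ D @ V

rng = np.random.default_rng(0)
X = rng.standard_normal((10, 3))
X -= X.mean(axis=0)          # center the points (Assumption 1)
G = X @ X.T                  # centered Gram matrix
assert np.allclose(distance_to_gram(gram_to_distance(G)), G)
```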

Assumption 2.

The observed triplets in $\mathcal{S}$ are drawn independently and uniformly from $\mathcal{T}$.

2 Prediction Error Bounds

For $t = (i,j,k) \in \mathcal{T}$ with $i$, $j$, $k$ distinct and $j < k$, we define $L_t$ to be the linear operator satisfying $L_t(G^*) = \|x_i - x_k\|^2 - \|x_i - x_j\|^2$ for all $t \in \mathcal{T}$. In general, for any Gram matrix $G$,

$$L_t(G) \;=\; G_{kk} - G_{jj} + 2G_{ij} - 2G_{ik}.$$

We can naturally view $L_t$ as a linear operator on $\mathcal{S}^n_+$, the space of $n \times n$ symmetric positive semidefinite matrices. We can also represent $L_t$ as a symmetric matrix that is zero everywhere except on the $3 \times 3$ submatrix corresponding to the indices $(i, j, k)$, which has the form

$$\begin{bmatrix} 0 & 1 & -1 \\ 1 & -1 & 0 \\ -1 & 0 & 1 \end{bmatrix},$$

and then

$$L_t(G) \;=\; \langle L_t, G \rangle,$$

where $\langle A, B \rangle := \mathrm{trace}(A^T B)$ for any compatible matrices $A$ and $B$. Ordering the elements of $\mathcal{T}$ lexicographically, we arrange all the $L_t(G)$ together to define the $|\mathcal{T}|$-dimensional vector

$$\mathcal{L}(G) \;:=\; \big(L_t(G)\big)_{t \in \mathcal{T}} \in \mathbb{R}^{|\mathcal{T}|}. \qquad (4)$$

Let $\ell$ denote a loss function. For example, we can consider the 0-1 loss $\mathbf{1}\{y_t L_t(G) \le 0\}$, the hinge loss $\max\{0,\, 1 - y_t L_t(G)\}$, or the logistic loss

$$\ell\big(y_t L_t(G)\big) \;=\; \log\big(1 + \exp(-y_t L_t(G))\big). \qquad (5)$$
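For concreteness, the sketch below (hypothetical function names) evaluates $L_t(G)$ and the logistic loss of (5) for a single labelled triplet; averaging such terms over a sample $\mathcal{S}$ yields the empirical risk defined in (6) below.

```python
import numpy as np

def L_t(G, t):
    """The linear functional L_t(G) = D_ik - D_ij = G_kk - G_jj + 2 G_ij - 2 G_ik
    for a triplet t = (i, j, k)."""
    i, j, k = t
    return G[k, k] - G[j, j] + 2 * G[i, j] - 2 * G[i, k]

def logistic_loss(y, G, t):
    """Logistic loss (5) for one labelled triplet, log(1 + exp(-y * L_t(G))),
    written with logaddexp for numerical stability."""
    return np.logaddexp(0.0, -y * L_t(G, t))
```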

Taking the expectation of the loss with respect to both the uniformly random selection of the triplet $t$ and the observation $y_t$, we have the risk of $G$,

$$R(G) \;:=\; \mathbb{E}_{t,\,y_t}\Big[\ell\big(y_t L_t(G)\big)\Big].$$

Given a set of observations $\{y_t\}_{t \in \mathcal{S}}$ under the model defined in the problem statement, the empirical risk is

$$\widehat{R}_{\mathcal{S}}(G) \;=\; \frac{1}{|\mathcal{S}|}\sum_{t\in\mathcal{S}} \ell\big(y_t L_t(G)\big), \qquad (6)$$

which is an unbiased estimator of the true risk: $\mathbb{E}\big[\widehat{R}_{\mathcal{S}}(G)\big] = R(G)$. For any matrix $G$, let $\|G\|_*$ denote its nuclear norm and $\|G\|_\infty := \max_{i,j}|G_{ij}|$. Define the constraint set

$$\mathcal{G}_{\lambda,\gamma} \;:=\; \big\{\, G \in \mathbb{R}^{n\times n} : G \succeq 0,\ \|G\|_* \le \lambda,\ \|G\|_\infty \le \gamma \,\big\}. \qquad (7)$$

We estimate $G^*$ by solving the optimization

$$\widehat{G} \;=\; \arg\min_{G \in \mathcal{G}_{\lambda,\gamma}} \widehat{R}_{\mathcal{S}}(G). \qquad (8)$$

Since $G$ is positive semidefinite, its diagonal entries bound the magnitudes of its off-diagonal entries, so an infinity norm constraint on the diagonal guarantees that the points corresponding to $G$ live inside a bounded ball. The constraint $\|G\|_\infty \le \gamma$ in (7) plays two roles: 1) if our loss function is Lipschitz, large magnitude values of $L_t(G)$ can lead to large deviations of $\widehat{R}_{\mathcal{S}}(G)$ from $R(G)$; bounding $\|G\|_\infty$ bounds $\max_t |L_t(G)|$. 2) Later we will define the loss $\ell$ in terms of the link function $f$, and as the magnitude of $L_t(G)$ increases the magnitude of the derivative of the link function typically becomes very small, making it difficult to “invert”; bounding $\|G\|_\infty$ tends to keep $L_t(G)$ within an invertible regime of $f$.
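As an implementation note, the projection onto a constraint set like (7) (ignoring the $\|G\|_\infty \le \gamma$ constraint, as the experiments in Section 5 also do) can be computed spectrally by projecting the eigenvalues of the current iterate onto the nonnegative orthant intersected with a scaled simplex. The sketch below shows one standard way to do this (hypothetical function names; not necessarily the implementation used in this paper).

```python
import numpy as np

def project_simplex(v, radius):
    """Euclidean projection of v onto {x >= 0, sum(x) = radius}."""
    u = np.sort(v)[::-1]
    css = np.cumsum(u)
    rho = np.nonzero(u - (css - radius) / (np.arange(len(v)) + 1) > 0)[0][-1]
    theta = (css[rho] - radius) / (rho + 1.0)
    return np.maximum(v - theta, 0.0)

def project_psd_nuclear_ball(A, radius):
    """Frobenius-norm projection of a symmetric matrix A onto
    {G : G PSD, ||G||_* <= radius}, computed on the eigenvalues of A."""
    mu, U = np.linalg.eigh(A)
    mu_proj = np.maximum(mu, 0.0)
    if mu_proj.sum() > radius:
        mu_proj = project_simplex(mu, radius)
    return (U * mu_proj) @ U.T
```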

Theorem 1.

Fix $\lambda, \gamma > 0$ and assume $G^* \in \mathcal{G}_{\lambda,\gamma}$. Let $\widehat{G}$ be a solution to the program (8). If the loss function $\ell$ is $L$-Lipschitz (or bounded), then with probability at least $1-\delta$,

Proof.

The proof follows from standard statistical learning theory techniques; see for instance [13]. By the bounded difference inequality, with probability at least $1-\delta$,

where we used the facts that each $L_t$ has six non-zero entries of magnitude $1$ and that $\|G\|_\infty \le \gamma$ for every $G \in \mathcal{G}_{\lambda,\gamma}$, so that $|L_t(G)| \le 6\gamma$.

Using standard symmetrization and contraction lemmas, we can introduce i.i.d. Rademacher random variables $\epsilon_t$ for all $t \in \mathcal{S}$ so that

The right hand side is just the Rademacher complexity of the constraint set $\mathcal{G}_{\lambda,\gamma}$. By definition,

where $\mathrm{conv}(S)$ denotes the convex hull of a set $S$. Since the Rademacher complexity of a set is the same as the Rademacher complexity of its closed convex hull,

which we recognize is just $\lambda$ times the expected operator norm of $\frac{1}{|\mathcal{S}|}\sum_{t\in\mathcal{S}}\epsilon_t L_t$. By [14, 6.6.1] we can bound this operator norm in terms of the variance of $\sum_{t\in\mathcal{S}}\epsilon_t L_t$ and the maximal eigenvalue of each $L_t$. These are computed in Lemma 1, given in the supplemental materials. Combining these results gives the stated bound. ∎

We remark that if $G$ is a rank-$d$ matrix then its nuclear norm can be bounded in terms of $d$ and $\|G\|_\infty$, so if $G^*$ is low rank we really only need a bound on the infinity norm of our constraint set. Under the assumption that $G^*$ is rank $d$ with $\|G^*\|_\infty \le \gamma$, and with $\lambda$ set accordingly, Theorem 1 implies a prediction error bound holding with probability at least $1-\delta$ which says that $|\mathcal{S}|$ must scale like $nd\log n$; this is consistent with known finite sample bounds [5].

3 Maximum Likelihood Estimation

We now turn our attention to recovering metric information about $G^*$. Let $\mathcal{S}$ be a collection of triplets sampled uniformly at random with replacement, and let $f$ be a known link function governing the observations, as in the problem statement. Any link function $f$ induces a natural loss function $\ell$, namely, the negative log-likelihood of a solution $G$ given an observation $y_t$, defined as

$$\ell\big(y_t L_t(G)\big) \;=\; \mathbf{1}\{y_t = 1\}\,\log\frac{1}{f\big(L_t(G)\big)} \;+\; \mathbf{1}\{y_t = -1\}\,\log\frac{1}{1 - f\big(L_t(G)\big)}.$$

For example, the logistic link function of (1) induces the logistic loss of (5). Recalling that $p_t = f\big(L_t(G^*)\big)$, we have

$$R(G) \;=\; \mathbb{E}_t\Big[H(p_t) + KL\big(p_t,\, f(L_t(G))\big)\Big],$$

where $H(p)$ and $KL(p,q)$ are the entropy and KL divergence of Bernoulli RVs with means $p$ and $q$. Recall that $\gamma$ controls the magnitude of $L_t(G)$, so for the moment assume this is small. Then a Taylor expansion of the KL divergence in its second argument around $p_t = f\big(L_t(G^*)\big)$, followed by a Taylor expansion of $f$, shows that $KL\big(p_t, f(L_t(G))\big)$ is approximately proportional to $\big(L_t(G) - L_t(G^*)\big)^2$.

Thus, recalling the definition of $\mathcal{L}(G)$ from (4), we conclude that if $R(G)$ is close to $R(G^*)$ then one would expect $\|\mathcal{L}(G) - \mathcal{L}(G^*)\|^2$ to be small. Moreover, since $\widehat{R}_{\mathcal{S}}(G)$ is an unbiased estimator of $R(G)$, one expects the minimizer of (8) to approximate $G^*$ in this sense. The next theorem, combined with Theorem 1, formalizes this observation; its proof is found in the appendix.
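To make the connection between the logistic link (1) and the logistic loss (5) concrete: under the conventions above, $\mathbb{P}(y_t = 1) = f\big(L_t(G^*)\big)$ with $f(x) = (1+e^{-x})^{-1}$, so $f(-x) = 1 - f(x)$ and the negative log-likelihood of a single observation under a candidate $G$ is exactly the logistic loss,

$$-\log \mathbb{P}(y_t \mid G) \;=\; -\log f\big(y_t L_t(G)\big) \;=\; \log\big(1 + \exp(-y_t L_t(G))\big) \;=\; \ell\big(y_t L_t(G)\big).$$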

Theorem 2.

Let $C_f := \inf_x |f'(x)|$, where $f'$ denotes the derivative of $f$ and the infimum is taken over the range of values that $L_t(G)$ can attain over the constraint set. Then for any $G \in \mathcal{G}_{\lambda,\gamma}$,

Note that if $f$ is the logistic link function of (1), then it is straightforward to show that $f'(x) = f(x)\big(1 - f(x)\big)$ for any $x$, so it suffices to take $C_f$ to be this quantity evaluated at the largest magnitude that $L_t(G)$ can attain over the constraint set.

4 Maximum Likelihood Embedding

In this section, let $\widehat{G}$ be the maximum likelihood estimator, i.e., a solution to the optimization (8) with the (Lipschitz) negative log-likelihood loss function for fixed $\lambda$ and $\gamma$. We have shown that the maximum likelihood estimator allows us to bound $\|\mathcal{L}(\widehat{G}) - \mathcal{L}(G^*)\|^2$. In this section, we discuss how this bound can lead us to recover an embedding. For the analysis in this section, it will be convenient to work with distance matrices in addition to Gram matrices. Analogous to the operators $L_t$ defined above, we define operators $\Delta_t$ for $t = (i,j,k) \in \mathcal{T}$ satisfying

$$\Delta_t(D) \;=\; D_{ik} - D_{ij}.$$

We will view the $\Delta_t$ as linear operators on the space $\mathcal{S}^n_h$ of symmetric hollow matrices (symmetric matrices with zero diagonal), which includes distance matrices as special cases. As with the $L_t$, we can arrange all the $\Delta_t(D)$ together, ordering the elements of $\mathcal{T}$ lexicographically, to define the $|\mathcal{T}|$-dimensional vector

$$\Delta(D) \;:=\; \big(\Delta_t(D)\big)_{t\in\mathcal{T}} \in \mathbb{R}^{|\mathcal{T}|}.$$

We will use the fact that $\Delta(D) = \mathcal{L}(G)$ heavily, where $D$ and $G$ are a corresponding distance matrix and centered Gram matrix pair. Because each $\Delta_t(D)$ is a difference of matrix entries, $\Delta$ has a non-trivial kernel. However, it is easy to see that $D$ can be recovered given $\Delta(D)$ and any one off-diagonal element of $D$, so the kernel is one-dimensional. Also, the kernel is easy to identify by example. Consider the regular simplex on $n$ vertices. The distances between all vertices are equal and the distance matrix can easily be seen to be a multiple of $\mathbf{1}\mathbf{1}^T - I$. Thus $\Delta(D) = 0$ in this case. This gives us the following simple result.

Lemma 2.

Let $\mathcal{S}^n_h$ denote the space of $n \times n$ symmetric hollow matrices, which includes all distance matrices. For any $n$, the set of linear functionals $\{\Delta_t\}_{t\in\mathcal{T}}$ spans an $\big(\binom{n}{2}-1\big)$-dimensional subspace of $\mathcal{S}^n_h$, and the one-dimensional kernel of $\Delta$ is given by the span of $\mathbf{1}\mathbf{1}^T - I$.

So we see that the operator $\Delta$ is not invertible on $\mathcal{S}^n_h$. Define $J := \mathbf{1}\mathbf{1}^T - I$. For any distance matrix $D$, let $D_C$, the centered distance matrix, be the component of $D$ orthogonal to the kernel of $\Delta$ (i.e., $\langle D_C, J \rangle = 0$). Then we have the orthogonal decomposition

$$D \;=\; D_C + \sigma J,$$

where $\sigma = \frac{\langle D, J\rangle}{\langle J, J\rangle}$. Since $G$ is assumed to be centered, the value of $\sigma$ has a simple interpretation:

$$\sigma \;=\; \frac{1}{n(n-1)}\sum_{i\ne j} D_{ij} \;=\; \frac{2}{n-1}\,\mathrm{trace}(G) \;=\; \frac{2}{n-1}\,\|G\|_*, \qquad (9)$$

the average of the squared distances or, alternatively, a scaled version of the nuclear norm of $G$. Let $\widehat{D}$ and $\widehat{D}_C$ be the distance and centered distance matrices corresponding to the maximum likelihood estimate $\widehat{G}$.

Now, following the notation of Sections 2 and 3, assume that there is a true Gram matrix $G^*$ with distance matrix $D^*$ and a link function $f$ as in Section 3, and assume that we have observed a set of triplets $\mathcal{S}$. Though $\Delta$ is not invertible on all of $\mathcal{S}^n_h$, it is invertible on the subspace orthogonal to the kernel, namely the orthogonal complement of the span of $J$. So if $\mathcal{L}(\widehat{G})$ is close to $\mathcal{L}(G^*)$, we expect $\widehat{D}_C$ to be close to $D^*_C$. The next theorem quantifies this.

Theorem 3.

Consider the setting of Theorems 1 and 2 and let $\widehat{D}_C$ and $D^*_C$ be defined as above. Then

Proof.

By combining Theorem 2 with the prediction error bound obtained in Theorem 1, we see that

Next we employ the following restricted isometry property of $\Delta$ on the subspace orthogonal to its kernel, whose proof is in the supplementary materials.

Lemma 3.

Let $D_1$ and $D_2$ be two different distance matrices of $n$ points in $\mathbb{R}^{d_1}$ and $\mathbb{R}^{d_2}$, respectively. Let $D_{1,C}$ and $D_{2,C}$ be the components of $D_1$ and $D_2$ orthogonal to $J$. Then

The result then follows. ∎

This implies that by collecting enough samples, we can recover the centered distance matrix $D^*_C$. By applying the discussion following Theorem 1 when $G^*$ is rank $d$, we can make this upper bound explicit. However, it is still not clear that this is enough to recover $D^*$ or $G^*$.

Remarkably, despite the component of $D^*$ along the kernel of $\Delta$ being unobservable, we show next that it can be recovered.

Though $\Delta$ is not invertible on $\mathcal{S}^n_h$ in general, we can provide a heuristic argument for why we might still expect it to be invertible on the (non-linear) space of low rank distance matrices. Since a distance matrix of $n$ points in $\mathbb{R}^d$ has rank at most $d+2$, the space of all such distance matrices has on the order of $nd$ degrees of freedom. Now $\Delta$ is an operator of rank $\binom{n}{2} - 1$, and since $\binom{n}{2} - 1 \gg nd$ for $n \gg d$, it is not unreasonable to hope that the entries of $\Delta(D)$ provide enough information to parametrize the set of low rank Euclidean distance matrices. In fact, as the next theorem will show, the intuition that $\Delta$ is invertible on the set of low rank distance matrices is true. Perhaps even more surprisingly, merely knowing the centered distance matrix $D_C$ uniquely determines $D$.

Theorem 4.

Let $D$ be a distance matrix of $n$ points in $\mathbb{R}^d$, let $D_C$ be the component of $D$ orthogonal to the kernel of $\Delta$, and let $\lambda_2(D_C)$ denote the second largest eigenvalue of $D_C$. If $n \ge d+3$, then

$$\sigma \;=\; \lambda_2(D_C).$$

This shows that $\sigma$ is uniquely determined as a function of $D_C$. Therefore, since $D = D_C + \sigma J$ and because $D_C$ is orthogonal to the kernel of $\Delta$, the distance matrix $D$ can be recovered from $\Delta(D)$, even though the linear operator $\Delta$ is non-invertible.
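This recovery is easy to check numerically; the sketch below (hypothetical variable names) projects a distance matrix onto the orthogonal complement of the kernel of $\Delta$ and then recovers the missing component from the second largest eigenvalue, as described by Theorem 4.

```python
import numpy as np

rng = np.random.default_rng(1)
n, d = 10, 2
X = rng.standard_normal((n, d))
X -= X.mean(axis=0)
D = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)   # squared distance matrix

J = np.ones((n, n)) - np.eye(n)        # spans the kernel of the comparison operator
sigma = (D * J).sum() / (n * (n - 1))  # average squared distance, as in (9)
D_C = D - sigma * J                    # centered distance matrix

# Theorem 4: sigma can be read off from D_C alone as its second largest eigenvalue,
# so D = D_C + lambda_2(D_C) * J even though Delta itself is non-invertible.
lam2 = np.sort(np.linalg.eigvalsh(D_C))[-2]
assert np.isclose(lam2, sigma)
assert np.allclose(D_C + lam2 * J, D)
```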

We now provide a proof of Theorem 4. In preparing this paper for publication we became aware of an alternative proof of this theorem [15]. We nevertheless present our independently derived proof for completeness.

Proof.

To prove Theorem 4 we first state two simple lemmas, which are proved in the supplementary material.

Lemma 4.

Let $D$ be a Euclidean distance matrix on $n$ points. Then $D$ is negative semidefinite on the subspace

$$\mathbf{1}^\perp \;:=\; \{x \in \mathbb{R}^n : \mathbf{1}^T x = 0\}.$$

Furthermore, if $D \ne 0$ then $\mathrm{null}(D) \subseteq \mathbf{1}^\perp$.

Lemma 5.

If $D$ is an $n \times n$ distance matrix of rank $r \ge 1$, then $D$ has a single positive eigenvalue, $n - r$ eigenvalues equal to $0$, and $r - 1$ negative eigenvalues.

We will use the following notation. For any symmetric matrix $A$, let $\lambda_i(A)$ denote its $i$th largest eigenvalue. We will show that for $n \ge d+3$ and any nonzero distance matrix $D$ of $n$ points in $\mathbb{R}^d$ with $\sigma$ as above,

$$\lambda_2\big(D - \sigma\mathbf{1}\mathbf{1}^T\big) \;=\; 0.$$

Since $D_C = \big(D - \sigma\mathbf{1}\mathbf{1}^T\big) + \sigma I$, this proves the theorem.

Note that $\lambda_i(A + cI) = \lambda_i(A) + c$ for any symmetric matrix $A$ and arbitrary $c \in \mathbb{R}$. So it suffices to show that $\lambda_2\big(D - \sigma\mathbf{1}\mathbf{1}^T\big) = 0$.

Let the rank of $D$ be $r$, where $r \le d+2 < n$, and consider an eigendecomposition $D = U\Lambda U^T$ where $U$ is unitary. If $u$ is an eigenvector of $D$ with eigenvalue $0$, then

$$\big(D - \sigma\mathbf{1}\mathbf{1}^T\big)u \;=\; Du - \sigma\mathbf{1}\big(\mathbf{1}^T u\big) \;=\; 0,$$

since $\mathbf{1}^T u = 0$ by Lemma 4.

So $u$ is an eigenvector of $D - \sigma\mathbf{1}\mathbf{1}^T$ with the same eigenvalue $0$. Denoting $M := D - \sigma\mathbf{1}\mathbf{1}^T$, we see that $Mu = 0$ for all $u \in \mathrm{null}(D)$, so $M$ has at least $n - r$ zero eigenvalues.

By a result on interlaced eigenvalues [16, Cor 4.3.9],

$$\lambda_{i+1}(D) \;\le\; \lambda_i(M) \;\le\; \lambda_i(D), \qquad i = 1, \dots, n-1.$$

From the above inequality and the fact that $\lambda_2(D) = 0$ (by Lemma 5, since $r < n$), it is clear that $\lambda_2(M) \le 0$. It is also clear that, since each eigenvalue of $M$ is bounded above by an eigenvalue of $D$, $M$ has at least $r - 1$ negative eigenvalues. Hence it remains to show that $\lambda_2(M) \ge 0$.

If $\lambda_2(M) < 0$, then $M$ would have at most one nonnegative eigenvalue, while from the above $M$ has at least $n - r \ge 1$ zero eigenvalues; combining this with Lemma 4 and the eigenvalue structure of $D$ given by Lemma 5 leads to a contradiction. Hence we can conclude that $\lambda_2(M) = 0$, which completes the proof. ∎

The previous theorem along with Theorem 3 guarantees that we can recover $D^*$ as we increase the number of triplets sampled. We summarize this in our final theorem, which follows directly from Theorems 3 and 4.

Theorem 5.

Assume $n \ge d+3$ and consider the setting of Theorems 1 and 2. As $|\mathcal{S}| \to \infty$, the maximum likelihood estimator

$$\widehat{G} \;=\; \arg\min_{G \in \mathcal{G}_{\lambda,\gamma}} \widehat{R}_{\mathcal{S}}(G)$$

converges to $G^*$.

5 Experimental Study

This section empirically studies the properties of the estimators suggested by our theory. It is not an attempt to perform an exhaustive empirical evaluation of different embedding techniques; for that see [17, 4, 6, 3]. In what follows, each of the $n$ points is generated i.i.d. from a Gaussian distribution whose variance is scaled with the dimension $d$; this scaling is motivated by an observation about the typical magnitude of $D^*_{ik} - D^*_{ij}$ for any triplet $t$. We perform multiple independent trials to see the effect of different random samples of triplets.

We report the prediction error on a holdout set of triplets and the error in Frobenius norm of the estimated Gram matrix.
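Concretely, the two reported quantities can be computed as follows (a sketch with hypothetical names; the Frobenius error is reported unnormalized here, which is one natural choice).

```python
import numpy as np

def prediction_error(G_hat, holdout_triplets, holdout_labels):
    """Fraction of held-out triplets whose label disagrees with sign(L_t(G_hat))."""
    errors = 0
    for (i, j, k), y in zip(holdout_triplets, holdout_labels):
        margin = G_hat[k, k] - G_hat[j, j] + 2 * G_hat[i, j] - 2 * G_hat[i, k]
        errors += int(np.sign(margin) != y)
    return errors / len(holdout_labels)

def gram_error(G_hat, G_star):
    """Frobenius-norm error of the recovered Gram matrix."""
    return np.linalg.norm(G_hat - G_star, "fro")
```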

Figure 1: Prediction error and Gram matrix recovery error for $n$ points, shown for two embedding dimensions $d$ (left and right panels).

Three algorithms are considered. For each, the domain of the objective variable is the space of symmetric positive semidefinite matrices. None of the methods impose the constraint $\|G\|_\infty \le \gamma$ (as done in our theoretical analysis), since this was used to simplify the analysis and does not have a significant impact in practice. Rank-$d$ Projected Gradient Descent (PGD) performs gradient descent on the objective $\widehat{R}_{\mathcal{S}}(G)$ with line search, projecting onto the subspace spanned by the top $d$ eigenvectors at each step (i.e., setting the bottom $n-d$ eigenvalues to $0$). Nuclear Norm PGD performs gradient descent on $\widehat{R}_{\mathcal{S}}(G)$, projecting onto the nuclear norm ball with radius $\|G^*\|_*$, where $G^*$ is the Gram matrix of the latent embedding. The nuclear norm projection can have the undesirable effect of shrinking the non-zero eigenvalues toward the origin. To compensate for this potential bias, we also employ Nuclear Norm PGD Debiased, which takes the (biased) output of Nuclear Norm PGD, decomposes it into $U\Sigma U^T$ where the columns of $U$ are the top $d$ eigenvectors, and outputs $U\widehat{\Sigma}U^T$ where $\widehat{\Sigma}$ re-optimizes the empirical risk over the eigenvalues with $U$ held fixed. This last algorithm is motivated by the observation that heuristics like $\ell_1$ or nuclear norm minimization are good at identifying the true support or basis of a signal, but output biased magnitudes [18]. Rank-$d$ PGD and Nuclear Norm PGD Debiased are novel ordinal embedding algorithms.
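A minimal sketch of the Rank-$d$ PGD update is given below (hypothetical function names; a fixed step size stands in for the line search described above, and the logistic loss is assumed). It illustrates the technique rather than reproducing the exact implementation used for the experiments.

```python
import numpy as np

def project_rank_d(G, d):
    """Keep the top d eigenvalues (clipped at zero to stay PSD) and zero out the rest."""
    vals, vecs = np.linalg.eigh(G)
    keep = np.argsort(vals)[::-1][:d]
    top = np.maximum(vals[keep], 0.0)
    return (vecs[:, keep] * top) @ vecs[:, keep].T

def rank_d_pgd(triplets, labels, n, d, steps=500, lr=1.0):
    """Projected gradient descent on the empirical logistic risk (6)."""
    G = np.zeros((n, n))
    for _ in range(steps):
        grad = np.zeros((n, n))
        for (i, j, k), y in zip(triplets, labels):
            margin = G[k, k] - G[j, j] + 2 * G[i, j] - 2 * G[i, k]   # L_t(G)
            w = -y / (1.0 + np.exp(y * margin))   # derivative of log(1 + exp(-y m)) in m
            # gradient contribution w * L_t, with L_t in its symmetric matrix form
            grad[k, k] += w; grad[j, j] -= w
            grad[i, j] += w; grad[j, i] += w
            grad[i, k] -= w; grad[k, i] -= w
        G = project_rank_d(G - lr * grad / len(labels), d)
    return G
```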

Figure 1 presents how the algorithms behave in the two settings shown on the left and right, respectively. We observe that the unbiased nuclear norm solution behaves near-identically to the rank-$d$ solution in this case, and remark that this was observed in all of our experiments (see the supplementary materials for a variety of values of $n$, $d$, and scalings of the points). A popular technique for recovering rank-$d$ embeddings is to perform (stochastic) gradient descent on the empirical risk with the objective variable taken as the embedding $X \in \mathbb{R}^{n \times d}$ itself [17, 4, 6]. In all of our experiments this method produced Gram matrices that were nearly identical to those produced by our Rank-$d$ PGD method, but Rank-$d$ PGD was an order of magnitude faster in our implementation. Also, in light of our isometry theorem, we can show that the Hessian of the objective is nearly a scaled identity, which leads us to hypothesize that a globally optimal linear convergence result for this non-convex optimization may be possible using the techniques of [19, 20]. Finally, we note that previous literature has reported that nuclear norm optimizations like Nuclear Norm PGD tend to produce less accurate embeddings than those of non-convex methods [4, 6]. We see from the results that Nuclear Norm PGD Debiased appears to close the performance gap between the convex and non-convex solutions.

6 Future work

For any fixed set of $n$ points in $\mathbb{R}^d$ and randomly selected distance comparison queries, our results show that the component of $G^*$ (the Gram matrix associated with the points) in the span of all possible distance comparisons can be accurately recovered from on the order of $nd\log n$ queries, and we conjecture these bounds to be tight. Moreover, we proved the existence of an estimator $\widehat{G}$ such that as the number of queries grows, we have $\widehat{G} \to G^*$. A focus of our ongoing work is characterizing finite sample bounds for the rate at which $\widehat{G} \to G^*$. One way of approaching such a result is showing

for the tightest possible constants. By inspecting our proofs, we can provide satisfying values for these constants; however, they differ substantially and hence we do not believe them to be tight. Empirically, we observe that the ratio of the two quantities appears to be independent of $n$.

References

  • [1] Roger N Shepard. The analysis of proximities: Multidimensional scaling with an unknown distance function. i. Psychometrika, 27(2):125–140, 1962.
  • [2] Joseph B Kruskal. Nonmetric multidimensional scaling: a numerical method. Psychometrika, 29(2):115–129, 1964.
  • [3] Sameer Agarwal, Josh Wills, Lawrence Cayton, Gert Lanckriet, David J Kriegman, and Serge Belongie. Generalized non-metric multidimensional scaling. In International Conference on Artificial Intelligence and Statistics, pages 11–18, 2007.
  • [4] Omer Tamuz, Ce Liu, Ohad Shamir, Adam Kalai, and Serge J Belongie. Adaptively learning the crowd kernel. In Proceedings of the 28th International Conference on Machine Learning (ICML-11), pages 673–680, 2011.
  • [5] Kevin G Jamieson and Robert D Nowak. Low-dimensional embedding using adaptively selected ordinal data. In Communication, Control, and Computing (Allerton), 2011 49th Annual Allerton Conference on, pages 1077–1084. IEEE, 2011.
  • [6] Laurens Van Der Maaten and Kilian Weinberger. Stochastic triplet embedding. In Machine Learning for Signal Processing (MLSP), 2012 IEEE International Workshop on, pages 1–6. IEEE, 2012.
  • [7] Brian McFee and Gert Lanckriet. Learning multi-modal similarity. The Journal of Machine Learning Research, 12:491–523, 2011.
  • [8] Matthäus Kleindessner and Ulrike von Luxburg. Uniqueness of ordinal embedding. In COLT, pages 40–67, 2014.
  • [9] Yoshikazu Terada and Ulrike V Luxburg. Local ordinal embedding. In Proceedings of the 31st International Conference on Machine Learning (ICML-14), pages 847–855, 2014.
  • [10] Ery Arias-Castro. Some theory for ordinal embedding. arXiv preprint arXiv:1501.02861, 2015.
  • [11] Mark A Davenport, Yaniv Plan, Ewout van den Berg, and Mary Wootters. 1-bit matrix completion. Information and Inference, page iau006, 2014.
  • [12] Jon Dattorro. Convex Optimization & Euclidean Distance Geometry. Meboo Publishing USA, 2011.
  • [13] Stéphane Boucheron, Olivier Bousquet, and Gábor Lugosi. Theory of classification: A survey of some recent advances. ESAIM: probability and statistics, 9:323–375, 2005.
  • [14] Joel A. Tropp. An introduction to matrix concentration inequalities, 2015.
  • [15] Pablo Tarazaga and Juan E. Gallardo. Euclidean distance matrices: new characterization and boundary properties. Linear and Multilinear Algebra, 57(7):651–658, 2009.
  • [16] Roger A Horn and Charles R Johnson. Matrix analysis. Cambridge university press, 2012.
  • [17] Kevin G Jamieson, Lalit Jain, Chris Fernandez, Nicholas J Glattard, and Rob Nowak. Next: A system for real-world development, evaluation, and application of active learning. In Advances in Neural Information Processing Systems, pages 2638–2646, 2015.
  • [18] Nikhil Rao, Parikshit Shah, and Stephen Wright. Conditional gradient with enhancement and truncation for atomic norm regularization. In NIPS workshop on Greedy Algorithms, 2013.
  • [19] Samet Oymak, Benjamin Recht, and Mahdi Soltanolkotabi. Sharp time–data tradeoffs for linear inverse problems. arXiv preprint arXiv:1507.04793, 2015.
  • [20] Jie Shen and Ping Li. A tight bound of hard thresholding. arXiv preprint arXiv:1605.01656, 2016.

Appendix A Supplementary Materials for “Finite Sample Error Bounds for Ordinal Embedding”

A.1 Proof of Lemma 1

Lemma 1.

For all ,

in addition if

Proof.

Note that for all Thus by the Cayley-Hamilton theorem, is the largest eigenvalue of A computation shows that the submatrix of corresponding to is

and every other element of is zero. Summing over the then gives,

This matrix can be rewritten as The eigenvalues of are with multiplicity and with multiplicity Hence the largest eigenvalue of is

A.2 Proof of Theorem 2

Proof.

For let . Then and By taking a Taylor series around ,

Now applying this to and gives

where the last line comes from applying Taylor’s theorem to ,