Feature Learning by Multidimensional Scaling and its Applications in Object Recognition
Abstract
We present the MDS feature learning framework, in which multidimensional scaling (MDS) is applied to high-level pairwise image distances to learn fixed-length vector representations of images. Which aspects of the images are captured by the learned features, which we call MDS features, depends entirely on the kind of image distance measurement employed. With properly selected semantics-sensitive image distances, the MDS features provide rich semantic information about the images that is not captured by other feature extraction techniques. In this work, we introduce the iterated Levenberg-Marquardt algorithm for solving MDS, and study MDS feature learning with the IMage Euclidean Distance (IMED) and the Spatial Pyramid Matching (SPM) distance. We present experiments on both synthetic data and real images, namely the publicly accessible UIUC car image dataset. The MDS features based on the SPM distance achieve exceptional performance on the car recognition task.
I Introduction
To represent an image by a fixed-length feature vector, there are generally two groups of approaches, often referred to as hand-designed features and feature learning, respectively. In this section, we briefly review several commonly used methods from each group, and relate the proposed MDS feature learning to these existing methods.
I-A Hand-Designed Features
Most hand-designed features, sometimes called hand-crafted features, focus on capturing the color, texture and gradient information in an image. Generally, these features have a closed form and can be computed without looking at other images. Some popular yet simple hand-designed image features include the color histogram, wavelet transform coefficients [13], the scale-invariant feature transform (SIFT) [26], color-SIFT [44], speeded up robust features (SURF) [2], the histogram of oriented gradients (HOG) [11] and local binary patterns (LBP) [29]. To represent an image with one fixed-length feature vector, there are generally three ways: (1) These features can be computed for the entire image, but the resulting feature vector will fail to embed the spatial relationship between different objects or different locations in the image. (2) The image can first be uniformly divided into blocks; these features can then be computed for each block and concatenated into one long feature vector. (3) Further, the division of the image does not have to be uniform, but can be arbitrary: we can place random rectangular or circular masks onto the image, compute features for each mask (or "patch"), then concatenate. To do this, the division must be consistent for all images.
The divide-and-concatenate methods result in very large feature vectors. Given a large dataset, PCA can be used to reduce the dimensionality.
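The divide-and-concatenate scheme of way (2) above can be sketched as follows. This is an illustrative sketch only: the helper `block_histogram_features` and the use of plain intensity histograms (rather than HOG, LBP, or another descriptor) are our own choices for the example.

```python
import numpy as np

def block_histogram_features(img, grid=(4, 4), bins=8):
    """Divide a gray-level image into a uniform grid of blocks, compute a
    descriptor per block, and concatenate into one fixed-length vector.
    Plain intensity histograms stand in for real descriptors (HOG, LBP, ...)."""
    h, w = img.shape
    bh, bw = h // grid[0], w // grid[1]
    feats = []
    for i in range(grid[0]):
        for j in range(grid[1]):
            block = img[i * bh:(i + 1) * bh, j * bw:(j + 1) * bw]
            hist, _ = np.histogram(block, bins=bins, range=(0, 256))
            feats.append(hist / hist.sum())   # normalize each block histogram
    return np.concatenate(feats)

rng = np.random.default_rng(0)
img = rng.integers(0, 256, size=(40, 40))     # a synthetic 40x40 gray image
f = block_histogram_features(img)             # 4*4 blocks * 8 bins = 128-d
```

The same routine applied to every image in a dataset yields vectors of identical length, on which PCA can then reduce the dimensionality.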
I-B Feature Learning
Feature learning has often been used as a synonym for deep learning, especially in recent years, and often refers to techniques such as sparse coding [30, 24], autoencoders [46], convolutional neural networks [23], restricted Boltzmann machines [19], and deep Boltzmann machines [38]. However, we believe this interpretation of feature learning is imprecise. Feature learning should be defined more generally, as the opposite of hand-designed features: it should refer to any technique that learns a fixed-length vector representation of each image in the dataset by utilizing the pattern distribution of the entire dataset, or by optimizing a target function defined on the entire dataset. Any technique that can generate a feature representation of an image without looking at the entire dataset does not fall into this category.
We further categorize existing feature learning methods into two subgroups: feature learning with raw intensities, and feature learning with hand-designed features. The proposed MDS feature learning falls into a third, new subgroup: feature learning with image distance measurement.
I-B1 Feature Learning with Raw Intensities
This subgroup of methods treats the feature learning problem as a dimensionality reduction problem, where the original high-dimensional data are the image intensities, either gray-level or RGB values. Efforts on data dimensionality reduction have a long history [45], dating from the early work on PCA [31] and its nonlinear form, kernel PCA [28], to the recent work on sparse coding and deep learning [30, 19, 24, 23, 46, 38]. In all these methods, a high-dimensional data point, such as an image, is represented by a low-dimensional vector, each entry of which describes one salient varying pattern of the images in the training set.
Assume we have a dataset $X = \{\mathbf{x}_1, \dots, \mathbf{x}_n\}$, where each $\mathbf{x}_i \in \mathbb{R}^D$ ($i = 1, \dots, n$) is one data point. We briefly review several dimensionality reduction methods below.

PCA linearly projects a vector $\mathbf{x}_i$ to $\mathbf{y}_i = W^\top \mathbf{x}_i$, where the columns of $W$ are obtained by performing eigenvector decomposition on the covariance matrix of the data.
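A minimal sketch of this projection with NumPy; the synthetic dataset, its shape, and the choice of three retained components are our own assumptions for the example:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))       # 200 data points in R^10
Xc = X - X.mean(axis=0)              # center the data
C = Xc.T @ Xc / len(X)               # covariance matrix
evals, evecs = np.linalg.eigh(C)     # eigenvalues in ascending order
W = evecs[:, ::-1][:, :3]            # top-3 eigenvectors as columns of W
Y = Xc @ W                           # low-dimensional projections y_i = W^T x_i
```

Because the columns of `W` are orthogonal eigenvectors of the covariance matrix, the projected coordinates are uncorrelated across dimensions.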

Kernel PCA first constructs an $n \times n$ kernel matrix $K$, where each entry of this matrix is obtained by evaluating the kernel function on two data points:

$K_{ij} = k(\mathbf{x}_i, \mathbf{x}_j)$   (1)

Then the Gram matrix $G$ is constructed as

$G = K - \mathbf{1}_n K - K \mathbf{1}_n + \mathbf{1}_n K \mathbf{1}_n$   (2)

where $\mathbf{1}_n$ is the $n \times n$ matrix with all elements equal to $1/n$. Next, the eigenvector decomposition problem $G \boldsymbol{\alpha} = \lambda \boldsymbol{\alpha}$ is solved ($\boldsymbol{\alpha}$ is the eigenvector and $\lambda$ is the eigenvalue), and the $k$-th entry of the projected vector is computed by

$y_{i,k} = \sum_{j=1}^{n} \alpha_{k,j} \, k(\mathbf{x}_j, \mathbf{x}_i)$   (3)
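These steps can be sketched as follows; the Gaussian kernel, its bandwidth, and the two retained components are our own choices for the example:

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(100, 5))                     # 100 data points in R^5
# Kernel matrix K_ij = k(x_i, x_j) with a Gaussian kernel (bandwidth 1, arbitrary)
sq = ((X[:, None] - X[None]) ** 2).sum(-1)
K = np.exp(-sq / 2.0)
# Double-centered Gram matrix
n = len(X)
One = np.full((n, n), 1.0 / n)
G = K - One @ K - K @ One + One @ K @ One
# Eigenvector decomposition of G, keeping the top 2 components
evals, evecs = np.linalg.eigh(G)
idx = np.argsort(evals)[::-1][:2]
alphas = evecs[:, idx] / np.sqrt(evals[idx])      # scale the eigenvectors
# Projections of the training points themselves
Y = G @ alphas
```

Each column of `Y` is an eigenvector of `G` scaled by the square root of its eigenvalue, so the variance captured per component equals the corresponding eigenvalue.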
Autoencoders first normalize all the $\mathbf{x}_i$'s to $[0, 1]^D$, and map them to $\mathbf{y}_i = s(W_1 \mathbf{x}_i + \mathbf{b}_1)$, where $s(\cdot)$ is a sigmoid function. A reconstruction is computed by $\hat{\mathbf{x}}_i = s(W_2 \mathbf{y}_i + \mathbf{b}_2)$. The weight matrices $W_1$ and $W_2$ and the bias vectors $\mathbf{b}_1$ and $\mathbf{b}_2$ are obtained by minimizing the average reconstruction error, which can be defined as either the traditional squared error or the cross-entropy.
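The encode/decode pass and the squared reconstruction error can be sketched as follows; the random weights stand in for trained ones, and the training loop itself (e.g. by gradient descent) is omitted:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(2)
X = rng.random((50, 20))                          # inputs normalized to [0, 1]
d = 5                                             # code length
W1 = rng.normal(scale=0.1, size=(20, d)); b1 = np.zeros(d)
W2 = rng.normal(scale=0.1, size=(d, 20)); b2 = np.zeros(20)

Y = sigmoid(X @ W1 + b1)                          # encode: y = s(W1 x + b1)
Xr = sigmoid(Y @ W2 + b2)                         # decode: x' = s(W2 y + b2)
err = np.mean((X - Xr) ** 2)                      # average squared reconstruction error
```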
In PCA and kernel PCA, different entries of $\mathbf{y}_i$ correspond to eigenvectors of different importance, while in an autoencoder they are equally important.
These techniques have been shown to be effective on problems such as face recognition [43, 3] and even concept recognition [22]. However, most of these methods require all input data to have exactly the same size. If the input is an image, the image has to be cropped and resized to be consistent with the other images in the dataset. Cropping the image means loss of information, and resizing the image changes its aspect ratio, which results in distorted object shapes.
I-B2 Feature Learning with Hand-Designed Features
One popular method in this subgroup is the bag-of-visual-words (BOV) method [40, 10, 17]. This method first divides the image into local patches or segments it into distinct regions, and then extracts hand-designed features for each patch/region. Rather than being concatenated, these feature vectors form an unordered set, also referred to as a "bag". By performing clustering on the union of these unordered sets over all training images, a visual vocabulary is established. The set of feature vectors previously extracted from each image can then be transformed into a "word-frequency" histogram by simply counting which cluster (visual word) is assigned to each patch/region. The "word-frequency" histogram can optionally be normalized to generate the final fixed-length vector.
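A minimal sketch of the BOV pipeline on synthetic descriptors; the tiny k-means routine, the vocabulary size of 20 words, and the 16-d descriptors are our own choices for the example:

```python
import numpy as np

def kmeans(X, k, iters=10, seed=0):
    """A few Lloyd iterations, enough to illustrate vocabulary building."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(iters):
        labels = np.argmin(((X[:, None] - centers) ** 2).sum(-1), axis=1)
        for j in range(k):
            if np.any(labels == j):
                centers[j] = X[labels == j].mean(axis=0)
    return centers

def bov_histogram(descriptors, vocabulary):
    """Assign each local descriptor to its nearest visual word and count."""
    labels = np.argmin(((descriptors[:, None] - vocabulary) ** 2).sum(-1), axis=1)
    hist = np.bincount(labels, minlength=len(vocabulary)).astype(float)
    return hist / hist.sum()                      # normalized word-frequency histogram

rng = np.random.default_rng(3)
all_desc = rng.normal(size=(500, 16))             # pooled descriptors, training set
vocab = kmeans(all_desc, k=20)                    # visual vocabulary of 20 words
img_desc = rng.normal(size=(60, 16))              # descriptors from one image
h = bov_histogram(img_desc, vocab)
```

Note that `h` has the same length for every image, regardless of how many patches or regions the image produced.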
One extension of BOV is the Fisher Vector (FV) method [32, 33]. Rather than simply counting the word frequencies, which can be viewed as zero-order statistics, FV encodes higher-order statistics (up to the second order) about the distribution of the local descriptors assigned to each visual word. Another extension is Spatial Pyramid Matching [21], which gives different weights to features at different image division levels, and defines an image similarity measurement using the pyramid matching kernel [18].
II Method
In this section, we first review the basics of MDS and its existing solutions, and then introduce our own solution, the iterated Levenberg-Marquardt algorithm (ILMA). Next, we discuss and compare some popular image distance measurement techniques in the recent literature.
II-A Multidimensional Scaling: Problem Definition
As a statistical technique for the analysis of data similarity or dissimilarity, multidimensional scaling (MDS) has been widely applied to areas such as information visualization [5] and surface flattening [39, 15]. Here we briefly review the basic concepts and definitions of MDS. For convenience, we will use the word "image" instead of "data" or "object" in what follows, but we keep in mind that MDS is a general-purpose technique.
Suppose we have a set of images $S = \{I_1, \dots, I_n\}$, and there is a distance measurement $d(I_i, I_j)$ defined between each pair of images $I_i$ and $I_j$. Note that $d$ is only a measurement of image dissimilarity, not necessarily a metric on the set $S$ in the strict sense, since the subadditivity (triangle inequality) does not necessarily hold. Multidimensional scaling is the problem of representing each image $I_i$ by a point (vector) $\mathbf{x}_i$ in a low-dimensional space $\mathbb{R}^d$, such that the inter-point Euclidean distances in some sense approximate the distances between the corresponding images [20]. In Section II-C we will discuss how to define the image distance/dissimilarity measurement. Here we focus on the mathematical definitions related to MDS.
For a pair of images $I_i$ and $I_j$, let their low-dimensional ($d$-dimensional) representations be $\mathbf{x}_i$ and $\mathbf{x}_j$. The representation error is defined as $e_{ij} = \|\mathbf{x}_i - \mathbf{x}_j\| - d(I_i, I_j)$, where $\|\cdot\|$ denotes the $\ell_2$ norm. The raw stress is defined as the sum-of-squares of the representation errors:

$\sigma_r = \sum_{i<j} e_{ij}^2 = \sum_{i<j} \left( \|\mathbf{x}_i - \mathbf{x}_j\| - d(I_i, I_j) \right)^2$   (4)

while the normalized stress (also known as Stress-1) is defined as

$\sigma_n = \sqrt{ \sum_{i<j} \left( \|\mathbf{x}_i - \mathbf{x}_j\| - d(I_i, I_j) \right)^2 \Big/ \sum_{i<j} d(I_i, I_j)^2 }$   (5)
MDS models require the inter-point Euclidean distances to be "as equal as possible" to the image distances. Thus we can minimize either the raw stress or the normalized stress. We compactly represent the image distances by an $n \times n$ symmetric matrix $D$ with all diagonal values equal to $0$, where $D_{ij} = d(I_i, I_j)$, and represent the low-dimensional vectors by an $n \times d$ matrix $X = [\mathbf{x}_1, \dots, \mathbf{x}_n]^\top$. Using the raw stress as the loss function, the MDS problem can be stated as:

$\min_{X} \; \sum_{i<j} \left( \|\mathbf{x}_i - \mathbf{x}_j\| - D_{ij} \right)^2$   (6)
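The two stress measures translate directly into code. A small sketch (the function names are ours; the loop form favors clarity over speed):

```python
import numpy as np

def raw_stress(X, D):
    """Eq. (4): sum over i<j of (||x_i - x_j|| - d_ij)^2."""
    n = len(X)
    s = 0.0
    for i in range(n):
        for j in range(i + 1, n):
            e = np.linalg.norm(X[i] - X[j]) - D[i, j]
            s += e * e
    return s

def normalized_stress(X, D):
    """Eq. (5), Stress-1: sqrt(raw stress / sum over i<j of d_ij^2)."""
    n = len(X)
    denom = sum(D[i, j] ** 2 for i in range(n) for j in range(i + 1, n))
    return np.sqrt(raw_stress(X, D) / denom)

rng = np.random.default_rng(4)
P = rng.normal(size=(6, 2))                        # a perfect 2-d configuration
D = np.linalg.norm(P[:, None] - P[None], axis=-1)  # its own distance matrix
```

A configuration that reproduces its own distance matrix has zero stress, and stress is invariant to translating all points together, since only pairwise differences enter Eq. (4).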
II-B Solutions for Multidimensional Scaling
There are many existing methods for solving Eq. (6), such as Kruskal's iterative steepest descent approach [20] and de Leeuw's iterative majorization algorithm (SMACOF) [14]. In 2002, Williams demonstrated the connection between kernel PCA and metric MDS [50]; thus metric MDS problems can also be solved via kernel PCA.
In our work, we introduce an iterative least squares solution to the MDS optimization problem. We note that in Eq. (6), the raw stress is minimized with respect to $X$, which has $n \times d$ entries in total. Thus, when $n$ is large, this nonlinear optimization problem becomes computationally intractable if we attempt to solve for all entries in one step. Inspired by the iterated conditional modes (ICM) method [4], which was developed to solve Markov random fields (MRF), we introduce the two-stage iterated Levenberg-Marquardt algorithm (ILMA). The basic idea of this algorithm is to repeatedly minimize the raw stress with respect to one $\mathbf{x}_i$ while holding all other $\mathbf{x}_j$'s fixed. For this purpose, we maintain a constraining set of the indices of the $\mathbf{x}_j$'s to be held fixed. In the initialization stage, the indices of all images are selected into the constraining set in a random order. In the adjustment stage, we repeatedly adjust all $\mathbf{x}_i$'s in a randomly permuted order. By doing so, each time we only need to minimize the raw stress with respect to $d$ variables, instead of $n \times d$, which greatly reduces the complexity of the problem. The subproblem can be viewed as a least squares problem, and can be solved by the standard Levenberg-Marquardt algorithm [25, 27]. Since the total raw stress is monotonically non-increasing over time, the convergence of the adjustment is guaranteed. The details of the two-stage algorithm are given in Algorithm 1. We will refer to the low-dimensional vectors $\mathbf{x}_i$ as MDS features or MDS codes in what follows.
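The per-point adjustment idea can be sketched as follows. This is an illustrative reconstruction, not the paper's implementation: the two-stage initialization of Algorithm 1 is replaced by a random starting configuration, and SciPy's Levenberg-Marquardt solver (`scipy.optimize.least_squares` with `method='lm'`) stands in for the authors' solver.

```python
import numpy as np
from scipy.optimize import least_squares

def ilma(D, d=2, sweeps=5, seed=0):
    """Per-point Levenberg-Marquardt sweeps for Eq. (6): repeatedly re-fit one
    d-dimensional point against all others held fixed, so each subproblem has
    only d variables instead of n*d."""
    rng = np.random.default_rng(seed)
    n = D.shape[0]
    X = rng.normal(size=(n, d))

    def residuals(xi, i):
        others = np.delete(np.arange(n), i)
        return np.linalg.norm(X[others] - xi, axis=1) - D[i, others]

    for _ in range(sweeps):
        for i in rng.permutation(n):              # adjust points in random order
            X[i] = least_squares(residuals, X[i], args=(i,), method='lm').x
    return X

rng = np.random.default_rng(7)
P = rng.normal(size=(10, 2))                      # ground-truth 2-d points
D = np.linalg.norm(P[:, None] - P[None], axis=-1) # a perfectly embeddable D
X = ilma(D)
```

Each inner solve can only decrease the sum of squared residuals it touches, which is what makes the total raw stress non-increasing across sweeps.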
One advantage of our method is that we provide a unified framework for both MDS model training and new data encoding. In MDS model training, pairwise image distances are measured within the training set $S$, and Algorithm 1 is applied to encode each training image $I_i$ to its MDS code $\mathbf{x}_i$. Now given a new image $I'$, we measure the distances $d(I', I_i)$ from this image to all training images, and find its MDS code $\mathbf{x}'$ by:

$\mathbf{x}' = \arg\min_{\mathbf{x}} \sum_{i=1}^{n} \left( \|\mathbf{x} - \mathbf{x}_i\| - d(I', I_i) \right)^2$   (7)

which can be directly solved as a least squares problem using the standard Levenberg-Marquardt algorithm. We follow this practice for the training and testing of MDS models in the experiment in Section III-B.
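The new-image encoding step can be sketched in the same spirit; the helper `encode_new`, the centroid starting point, and the synthetic codes are our own choices, with SciPy's Levenberg-Marquardt solver standing in again:

```python
import numpy as np
from scipy.optimize import least_squares

def encode_new(dists, X_train):
    """Eq. (7): find the code x whose Euclidean distances to the training
    codes X_train best match the measured distances to the new image."""
    def residuals(x):
        return np.linalg.norm(X_train - x, axis=1) - dists
    x0 = X_train.mean(axis=0)                     # start from the codes' centroid
    return least_squares(residuals, x0, method='lm').x

rng = np.random.default_rng(0)
X_train = rng.normal(size=(30, 2))                # MDS codes of 30 training images
true = rng.normal(size=2)                         # a held-out point as the "new image"
dists = np.linalg.norm(X_train - true, axis=1)    # its exact distances to the codes
x_new = encode_new(dists, X_train)
```

Since the training codes stay fixed, encoding a new image costs one small least squares solve over $d$ variables, independent of the size of the training set beyond evaluating the residuals.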
II-C Image Distance Measurement
The measurement of the similarity or dissimilarity between two images is of essential significance in content-based image retrieval [12, 41]. There are some very simple forms of image distance, such as the traditional Euclidean distance on raw image intensities, and the earth mover's distance (EMD) on image color histograms [37]. Here, we briefly describe two popular image distance measurement methods: the IMage Euclidean Distance (IMED) [48] and the Spatial Pyramid Matching (SPM) distance [21]. These distances will be evaluated in our experiment on real images in Section III-B.
II-C1 IMED
The IMED is a generalized form of the traditional Euclidean distance on raw image intensities. Given two gray-level images $A$ and $B$ of the same size $M \times N$, the traditional Euclidean distance is defined as the square root of the sum-of-squares of the intensity differences at each corresponding image location:

$d_E(A, B) = \sqrt{ \sum_{i=1}^{M} \sum_{j=1}^{N} \left( A(i,j) - B(i,j) \right)^2 }$   (8)

where $A(i,j)$ denotes the intensity at row $i$ and column $j$ in image $A$. In contrast, IMED also accounts for the intensity differences at different locations, but assigns a weight to each such pair, which is a function of the Euclidean distance between the two pixel locations:

$d_{IMED}(A, B) = \sqrt{ \sum_{k=1}^{MN} \sum_{l=1}^{MN} g_{kl} (a_k - b_k)(a_l - b_l) }$   (9)

where $a_k$ and $b_k$ are the vectorized intensities of $A$ and $B$,

$g_{kl} = f\left( \| P_k - P_l \| \right)$   (10)

with $P_k$ the 2-d location of the $k$-th pixel,
and $f(\cdot)$ is a continuous monotonically decreasing function, usually the Gaussian function. An interesting observation by Wang et al. [48] is that the IMED (9) between two images is equivalent to the traditional Euclidean distance (8) between blurred versions of the two images. The blur operation is called the standardizing transform (ST) by the authors.
Although IMED has shown promising performance in some recognition experiments in [48], we can see that it is still a low-level image distance measurement, based on the raw intensities, without embedding any semantic information. Another disadvantage of IMED is that it is only defined on images of the same size. We will apply MDS on IMED distances in the experiment in Section III-B, where we use the Gaussian function for $f$ in Eq. (10), and we call this method IMED-MDS.
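A direct (unoptimized) evaluation of the weighted form can be sketched on tiny images; the Gaussian choice for $f$, the bandwidth value, and the $1/(2\pi\sigma^2)$ normalization are our own assumptions for the example:

```python
import numpy as np

def imed(A, B, sigma=1.0):
    """Direct evaluation of an IMED-style distance on two equal-size
    gray-level images. Feasible only for tiny images, since the weight
    matrix G has (M*N)^2 entries."""
    h, w = A.shape
    ii, jj = np.mgrid[0:h, 0:w]
    P = np.stack([ii.ravel(), jj.ravel()], axis=1).astype(float)  # pixel locations
    sq = ((P[:, None] - P[None]) ** 2).sum(-1)     # squared location distances
    G = np.exp(-sq / (2 * sigma ** 2)) / (2 * np.pi * sigma ** 2)
    diff = (A - B).ravel().astype(float)
    return np.sqrt(diff @ G @ diff)                # G is positive semidefinite

A = np.zeros((6, 6))
B = np.zeros((6, 6)); B[2, 2] = 1.0                # single bright pixel
B_shift = np.zeros((6, 6)); B_shift[2, 3] = 1.0    # the same pixel shifted by one
```

Unlike the plain Euclidean distance, which treats the bright pixel and its one-pixel shift as entirely different, the cross-location weights make the two patterns much closer under this distance.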
II-C2 SPM Distance
Spatial pyramid matching (SPM) [21] is based on Grauman and Darrell's work on the pyramid match kernel [18], which measures the similarity of two sets of feature vectors by partitioning the feature space at different levels and taking a sum of weighted histogram intersection functions. Lazebnik et al.'s spatial pyramid matching is an "orthogonal" approach: it performs pyramid matching in the 2-d image space, and uses k-means for clustering in the feature space (edge points and SIFT features). With a visual vocabulary of size $M$ (the number of clusters) and $L$ partition levels, spatial pyramid vectors of dimensionality $M \sum_{l=0}^{L} 4^l$ are generated, and spatial pyramid matching similarities $s(I_i, I_j)$ between images $I_i$ and $I_j$ are measured. The authors of [21] recommend the parameter setting $M = 200$ and $L = 2$.
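The weighted histogram intersection at the core of pyramid matching can be sketched as follows; the two-level toy histograms and the halving weights toward the coarser level are illustrative choices only ([18, 21] derive the exact per-level weights):

```python
import numpy as np

def histogram_intersection(h1, h2):
    """Similarity of two histograms: sum of element-wise minima."""
    return np.minimum(h1, h2).sum()

def pyramid_match(hists1, hists2, weights):
    """Weighted sum of per-level histogram intersections. Each element of
    hists1/hists2 is the flattened count histogram of one pyramid level;
    weights is the matching per-level weight sequence."""
    return sum(w * histogram_intersection(a, b)
               for w, a, b in zip(weights, hists1, hists2))

# Two images, two levels: a fine level with 4 spatial cells and a coarse one.
img1 = [np.array([3.0, 1.0, 0.0, 2.0]), np.array([6.0])]
img2 = [np.array([1.0, 1.0, 2.0, 2.0]), np.array([6.0])]
s = pyramid_match(img1, img2, weights=[0.5, 0.25])
```

Matches found at finer levels carry larger weights, so two images whose visual words co-occur in the same spatial cells score higher than two with the same global word counts in different cells.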
The similarity value lies in $[0, 1]$, where 1 means most similar and 0 least similar. There are many ways to define an image distance from the similarities, such as:
(11)  
(12) 
where $\epsilon$ is a small constant, whose value in (12) we fix for our experiment in Section III-B.
Unlike IMED, the SPM distance is based on hand-designed features such as SIFT and edge points, instead of raw intensities. It models the spatial co-occurrence of different feature clusters, and is thus more semantics-sensitive. Moreover, the SPM distance does not require the images to be of the same size. We will apply MDS on the two SPM distances defined by Eq. (11) and Eq. (12), and we call these methods SPM1-MDS and SPM2-MDS, respectively.
III Experiments
We present two experiments. The first one, on synthetic data, evaluates the running time performance of different MDS algorithms, and compares different initialization strategies for our iterated Levenberg-Marquardt algorithm. The second one is a real-image object recognition task, in which we compare MDS features with PCA features and kernel PCA features. In the second experiment, we use the UIUC car dataset (http://cogcomp.cs.illinois.edu/Data/Car/), and follow a five-fold cross validation to report the classification precision and recall at different feature dimensions.
III-A Synthetic Data Experiment
In this experiment, we use MDS for curved surface flattening [39] on the manually created Swiss roll data, which was introduced in [36] and is known to be challenging due to its highly nonlinear and non-Euclidean structure [7]. The Swiss roll surface contains points in $\mathbb{R}^3$, as shown in Fig. 1. We measure the pairwise inter-point geodesic distances to construct a distance matrix, and re-embed the Swiss roll surface into $\mathbb{R}^2$ by applying MDS to the geodesic distance matrix.
III-A1 Running Time
First, we evaluate the running time of the proposed iterated Levenberg-Marquardt algorithm and compare it with Bronstein's implementation of the SMACOF algorithm and its variants, including SMACOF with reduced rank extrapolation (RRE) and SMACOF with multigrid [7, 35, 34, 6]. The results are given in Fig. 2, where each number is averaged over 20 independent repeated experiments, and the running time is reported on a Mac Pro with two 2.4 GHz Quad-Core Intel Xeon CPUs. From Fig. 2, we can see that our ILMA is an efficient solution, which runs faster and converges to a smaller raw stress value than the other methods. The surfaces unrolled by ILMA at different iterations are shown in Fig. 4.
III-A2 Initialization Strategies
Further, we study some modifications to Algorithm 1. The original algorithm uses a random-order strategy in the initialization stage, but we can modify it to:

Largest-distance-first strategy: For Algorithm 1, in line 2 we choose the largest non-diagonal entry in $D$ instead of a random one; in line 7, we find the $i$ and $j$ that maximize $D_{ij}$ rather than taking a random $i$.

Smallest-distance-first strategy: For Algorithm 1, in line 2 we choose the smallest non-diagonal entry in $D$; in line 7, we find the $i$ and $j$ that minimize $D_{ij}$.
If we assume that the data to be encoded comprise several clusters, then an intuitive interpretation of the largest-distance-first strategy is that representatives of each cluster are encoded first, and they are expected to be scattered in the multidimensional space; similarly, the smallest-distance-first strategy encodes all data in one cluster first, and then moves to the nearest cluster.
We applied the three initialization strategies to the MDS problem on the Swiss roll geodesic distance matrix, and it turns out that the random-order strategy converges faster than the other two, as shown in Fig. 3. Again, each number in this plot is averaged over 20 independent repeated experiments.
III-B Car Recognition Experiment
Now we compare the performance of MDS features with the most standard and popular dimensionality reduction algorithms, PCA [31] and kernel PCA [28], applied on raw pixel intensities. We use the UIUC car image dataset [1], which contains 550 car and 500 non-car gray-level images of equal size (Fig. 5). All car images are side-view images, but the car can face either direction and can be partly occluded. We divide the total of 1050 images into five subsets, each containing 110 car images and 100 non-car images, and each time we use four subsets as the training set and one as the testing set. We use the following methods to generate fixed-length feature vectors for the images:

PCA We represent each gray-level image by a 4000-d vector, and perform standard PCA on such vectors of the training set to obtain eigenvectors and low-dimensional representations of the training images. We then use the eigenvectors to obtain the low-dimensional representations of the testing images.

kPCA Gaussian Similar to the above method, but we use Gaussian kernel PCA instead of standard PCA. We follow the automatic parameter selection strategy in [49] to determine the kernel parameter $\sigma$.

kPCA poly Similar to the above two methods, but we use third-order polynomial kernel PCA instead of standard PCA.

IMED-MDS We measure the pairwise IMED distances within the training set, and apply Algorithm 1 and Eq. (7) to compute the MDS codes of the training and testing images, respectively.

SPM1-MDS Similar to the above method, but we use the SPM1 distance (11) instead of IMED, where the SPM parameters are $M = 200$ and $L = 2$.

SPM2-MDS Similar to the above method, but we use the SPM2 distance (12).

pyramid PCA Instead of computing MDS features from SPM distances, we can also directly perform PCA on the obtained $M \sum_{l=0}^{L} 4^l$-dimensional spatial pyramid vectors without measuring similarities. In our experiment, we set $M = 200$ and $L = 2$, so the spatial pyramid vectors are 4200-d. Evaluating this method allows us to observe whether MDS on the SPM distance measurement captures semantics beyond the spatial pyramids themselves.
After we have obtained the fixed-length features of all images, we use the features of the training images to learn a binary RBF kernel SVM [9, 8], and use it to classify the features of the testing images. Each dimension of the feature vector is normalized to zero mean and unit standard deviation. The parameter of the radial basis function is set according to the feature vector length. The experiment is repeated for feature vector lengths from 1 to 20. We show the precision, recall and accuracy in Fig. 6. We also provide the feature scatter plots of the different methods for feature length 2 in Fig. 7.
In Fig. 6, we can observe that the IMED-MDS method performs slightly, but not significantly, better than directly applying PCA or kernel PCA on raw gray-level intensities, and the superiority of IMED-MDS is more obvious when the feature dimension is low. The spatial pyramid based methods perform much better than the other methods. In particular, SPM1-MDS and SPM2-MDS outperform all other methods, including pyramid PCA, at all feature dimensions. While the precision and recall of the PCA, kernel PCA and IMED-MDS methods saturate at relatively low values, the precision and recall of SPM1-MDS and SPM2-MDS saturate at much higher values. At low feature dimensions, the accuracies of PCA and kernel PCA are very low, but SPM1-MDS and SPM2-MDS perform almost as well as they do at very high dimensions.
In Fig. 7, we can also see that SPM1-MDS and SPM2-MDS separate car and non-car images with very clear class boundary curves in the 2-d feature space.
IV Conclusions and Future Work
In this paper, we have presented a feature learning framework that combines multidimensional scaling with image distance measurement, and compared it with a number of popular existing feature extraction techniques. To the best of our knowledge, we are the first to explore MDS on image distances such as the IMage Euclidean Distance (IMED) and the Spatial Pyramid Matching (SPM) distance.
We have introduced a unified framework for both MDS model training and new data encoding based on the standard Levenberg-Marquardt algorithm. Our two-stage iterated Levenberg-Marquardt algorithm for MDS model training is an efficient solution, and has shown good running time performance compared with other off-the-shelf implementations (Fig. 2).
In the car recognition experiment, we have demonstrated the power of MDS features. MDS features learned from SPM distances achieve the best classification performance at all feature dimensions. The good performance of MDS features is attributable to the semantics-sensitive image distance, since it captures very different information about the images than traditional feature extraction techniques. MDS further embeds such information into a low-dimensional feature space, which also captures the inner structure of the entire dataset. The MDS embedding is a necessary step, since in Fig. 6 we can see that the performance of MDS features learned from SPM distances is significantly better than simply running PCA on the spatial pyramid vectors.
Our ongoing work on this method explores these directions:

We are studying more image distance measurements, such as the Integrated Region Matching (IRM) distance, which was originally designed for semantics-sensitive image retrieval systems [47]. The performance of MDS codes learned from such distances can be evaluated and compared with the SPM-MDS methods in this paper.

In Eq. (7), rather than using the entire training set, we can also use only a subset of the training images to encode new data. It would be interesting to see how the performance varies with different subset selection strategies and different subset sizes.

Currently the two-stage iterated Levenberg-Marquardt algorithm is implemented in MATLAB (code available at https://sites.google.com/site/mdsfeature/). We are also re-coding it in C/C++ with the lmfit library (http://joachimwuttke.de/lmfit/), which will be more computationally efficient.
References
 [1] (2004) Learning to detect objects in images via a sparse, part-based representation. IEEE Transactions on Pattern Analysis and Machine Intelligence 26 (11), pp. 1475–1490.
 [2] (2008) Speeded-up robust features (SURF). Computer Vision and Image Understanding 110 (3), pp. 346–359.
 [3] (1997) Eigenfaces vs. Fisherfaces: recognition using class specific linear projection. IEEE Transactions on Pattern Analysis and Machine Intelligence 19 (7), pp. 711–720.
 [4] (1986) On the statistical analysis of dirty pictures. Journal of the Royal Statistical Society, Series B (Methodological) 48 (3), pp. 259–302.
 [5] (2005) Modern Multidimensional Scaling: Theory and Applications (Springer Series in Statistics). 2nd edition, Springer.
 [6] (2008) Numerical Geometry of Non-Rigid Shapes. Springer.
 [7] (2006) Multigrid multidimensional scaling. Numerical Linear Algebra with Applications 13 (2-3), pp. 149–171.
 [8] (2011) LIBSVM: a library for support vector machines. ACM Transactions on Intelligent Systems and Technology 2, pp. 27:1–27:27.
 [9] (1995) Support-vector networks. Machine Learning 20, pp. 273–297.
 [10] (2004) Visual categorization with bags of keypoints. In Workshop on Statistical Learning in Computer Vision, ECCV, Vol. 1, pp. 22.
 [11] (2005) Histograms of oriented gradients for human detection. In IEEE Computer Society Conference on Computer Vision and Pattern Recognition, Vol. 1, pp. 886–893.
 [12] (2008) Image retrieval: ideas, influences, and trends of the new age. ACM Computing Surveys (CSUR) 40 (2), pp. 5.
 [13] (1992) Ten Lectures on Wavelets. Vol. 61, SIAM.
 [14] (1977) Applications of convex analysis to multidimensional scaling. In Recent Developments in Statistics, J. R. Barra, F. Brodeau, G. Romier and B. V. Cutsem (Eds.), pp. 133–146.
 [15] (2003) On bending invariant signatures for surfaces. IEEE Transactions on Pattern Analysis and Machine Intelligence 25 (10), pp. 1285–1295.
 [16] (2004) Learning generative visual models from few training examples: an incremental Bayesian approach tested on 101 object categories. In Computer Vision and Pattern Recognition Workshop, 2004, pp. 178–178.
 [17] (2005) A Bayesian hierarchical model for learning natural scene categories. In IEEE Computer Society Conference on Computer Vision and Pattern Recognition, 2005, Vol. 2, pp. 524–531.
 [18] (2005) The pyramid match kernel: discriminative classification with sets of image features. In Tenth IEEE International Conference on Computer Vision, 2005, Vol. 2, pp. 1458–1465.
 [19] (2006) Reducing the dimensionality of data with neural networks. Science 313 (5786), pp. 504–507.
 [20] (1964) Multidimensional scaling by optimizing goodness of fit to a nonmetric hypothesis. Psychometrika 29, pp. 1–27.
 [21] (2006) Beyond bags of features: spatial pyramid matching for recognizing natural scene categories. In IEEE Computer Society Conference on Computer Vision and Pattern Recognition, 2006, Vol. 2, pp. 2169–2178.
 [22] (2012) Building high-level features using large scale unsupervised learning. In International Conference on Machine Learning.
 [23] (1998) Gradient-based learning applied to document recognition. Proceedings of the IEEE 86 (11), pp. 2278–2324.
 [24] (2007) Efficient sparse coding algorithms. In Advances in Neural Information Processing Systems 19, pp. 801–808.
 [25] (1944) A method for the solution of certain non-linear problems in least squares. Quarterly of Applied Mathematics 2, pp. 164–168.
 [26] (2004) Distinctive image features from scale-invariant keypoints. International Journal of Computer Vision 60 (2), pp. 91–110.
 [27] (1963) An algorithm for least-squares estimation of nonlinear parameters. Journal of the Society for Industrial and Applied Mathematics 11 (2), pp. 431–441.
 [28] (1999) Kernel PCA and de-noising in feature spaces. In Advances in Neural Information Processing Systems 11, pp. 536–542.
 [29] (1996) A comparative study of texture measures with classification based on featured distributions. Pattern Recognition 29 (1), pp. 51–59.
 [30] (1996) Emergence of simple-cell receptive field properties by learning a sparse code for natural images. Nature 381, pp. 607–609.
 [31] (1901) On lines and planes of closest fit to systems of points in space. Philosophical Magazine 2, pp. 559–572.
 [32] (2007) Fisher kernels on visual vocabularies for image categorization. In IEEE Conference on Computer Vision and Pattern Recognition, 2007, pp. 1–8.
 [33] (2010) Improving the Fisher kernel for large-scale image classification. In Computer Vision – ECCV 2010, pp. 143–156.
 [34] (2008) Topologically constrained isometric embedding. In Human Motion, pp. 243–262.
 [35] (2008) Fast multidimensional scaling using vector extrapolation. SIAM Journal on Scientific Computing 2.
 [36] (2000) Nonlinear dimensionality reduction by locally linear embedding. Science 290 (5500), pp. 2323–2326.
 [37] (2000) The earth mover's distance as a metric for image retrieval. International Journal of Computer Vision 40 (2), pp. 99–121.
 [38] (2009) Deep Boltzmann machines. In Proceedings of the International Conference on Artificial Intelligence and Statistics, Vol. 5, pp. 448–455.
 [39] (1989) A numerical solution to the generalized mapmaker's problem: flattening nonconvex polyhedral surfaces. IEEE Transactions on Pattern Analysis and Machine Intelligence 11 (9), pp. 1005–1008.
 [40] (2003) Video Google: a text retrieval approach to object matching in videos. In Ninth IEEE International Conference on Computer Vision, 2003, pp. 1470–1477.
 [41] (2000) Content-based image retrieval at the end of the early years. IEEE Transactions on Pattern Analysis and Machine Intelligence 22 (12), pp. 1349–1380.
 [42] (2006) Asymmetric bagging and random subspace for support vector machines-based relevance feedback in image retrieval. IEEE Transactions on Pattern Analysis and Machine Intelligence 28 (7), pp. 1088–1099.
 [43] (1991) Eigenfaces for recognition. Journal of Cognitive Neuroscience 3 (1), pp. 71–86.
 [44] (2010) Evaluating color descriptors for object and scene recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence 32 (9), pp. 1582–1596.
 [45] (2009) Dimensionality reduction: a comparative review. Technical report, Tilburg University.
 [46] (2008) Extracting and composing robust features with denoising autoencoders. In Proceedings of the 25th International Conference on Machine Learning, pp. 1096–1103.
 [47] (2001) SIMPLIcity: semantics-sensitive integrated matching for picture libraries. IEEE Transactions on Pattern Analysis and Machine Intelligence 23 (9), pp. 947–963.
 [48] (2005) On the Euclidean distance of images. IEEE Transactions on Pattern Analysis and Machine Intelligence 27 (8), pp. 1334–1339.
 [49] (2012) Kernel principal component analysis and its applications in face recognition and active shape models. arXiv preprint arXiv:1207.3538.
 [50] (2002) On a connection between kernel PCA and metric multidimensional scaling. Machine Learning 46 (1), pp. 11–19.
Quan Wang Quan Wang is currently working towards his Ph.D. degree in Computer and Systems Engineering in the Department of Electrical, Computer, and Systems Engineering at Rensselaer Polytechnic Institute. He received a B.Eng. degree in Automation from Tsinghua University, Beijing, China in 2010. He worked as a research intern at Siemens Corporate Research, Princeton, NJ, and IBM Almaden Research Center, San Jose, CA, in 2012 and 2013, respectively. His research interests include feature learning, medical image analysis, object tracking, content-based image retrieval and photographic composition.
Kim L. Boyer Dr. Kim L. Boyer is currently Head of the Department of Electrical, Computer, and Systems Engineering at Rensselaer Polytechnic Institute. He received the BSEE (with distinction), MSEE, and Ph.D. degrees, all in electrical engineering, from Purdue University in 1976, 1977, and 1986, respectively. From 1977 through 1981 he was with Bell Laboratories, Holmdel, NJ; from 1981 through 1983 he was with Comsat Laboratories, Clarksburg, MD. From 1986–2007 he was with the Department of Electrical and Computer Engineering, The Ohio State University. He is a Fellow of the IEEE, a Fellow of IAPR, a former IEEE Computer Society Distinguished Speaker, and currently the IAPR President. Dr. Boyer is also a National Academies Jefferson Science Fellow at the US Department of State, spending 2006–2007 as Senior Science Advisor to the Bureau of Western Hemisphere Affairs. He retains his Fellowship as a consultant on science and technology policy for the State Department. 