Image Similarity Using Sparse Representation and Compression Distance
Abstract
A new line of research uses compression methods to measure the similarity between signals. Two signals are considered similar if one can be compressed significantly when the information of the other is known. The existing compression-based similarity methods, although successful in the discrete one-dimensional domain, do not work well in the context of images. This paper proposes a sparse representation-based approach to encode the information content of an image using information from the other image, and uses the compactness (sparsity) of the representation as a measure of its compressibility (how much the image can be compressed) with respect to the other image. The sparser the representation of an image, the better it can be compressed and the more similar it is to the other image. The efficacy of the proposed measure is demonstrated through the high accuracies achieved in image clustering, retrieval and classification.
Image similarity, Compression, Kolmogorov complexity, Overcomplete dictionary, Sparse representation
I Introduction
Measuring the similarity between a pair of images is of critical importance to many multimedia information processing systems involving retrieval, enhancement, copy detection, quality assessment, clustering and classification. Given the long history of image similarity evaluation, the volume of literature on this topic is large and diverse. Widely used similarity measures such as the Euclidean distance, the Mean Squared Error and other norm-based measures work well in specific cases, but they are often criticized for not corresponding well with our visual perception of similarity [1]. Another popular approach to describe the visual content of images is to extract a set of meaningful features. The similarity between two images is then computed in terms of the similarity between their features. However, the success of this approach is limited by the availability, selection and extraction of a good set of meaningful features, demanding specific knowledge of the application and the data.
Recently, there has been an interest in developing image similarity measures using compression methods [2, 3, 4, 5, 6]. In this approach, two signals are considered similar if one can be compressed significantly when the information of the other is provided. The advantages of these methods are that they are parameter-free (the only choice the user has to make is which compression algorithm to use) and generic (they assume no prior knowledge of the application, and can be applied, without modification, to a variety of problems).
The compression-based similarity methods rely on a new mathematical theory of similarity which is in turn based on the idea of the Kolmogorov complexity [2, 3]. The Kolmogorov complexity (also known as the algorithmic entropy) is a theoretical measure of the randomness of given data and is, in general, a non-computable quantity. In practice, it is often approximated by the length of the compressed data. Intuitively, the more the given data can be compressed, the lower its complexity.
The compression-based similarity measures have been shown to be highly effective in clustering and classifying discrete, unidimensional data such as text and protein sequences [2, 3]; but their successful application in the context of real-valued, higher-dimensional data like images is scarce. To measure the similarity between two signals effectively, the compressor being employed needs to satisfy certain properties so as to be a normal compressor [3]. However, most state-of-the-art compressors for images (such as JPEG, JPEG2000) are not normal, and the normal compressors (compressors of the Lempel-Ziv family) do not work well on images [5]. Existing methods [7, 8, 9] transform images into strings in order to take advantage of the normal data compressors, and thus lose the important spatial information. Another serious obstacle lies in evaluating and approximating the conditional compression (a quantity that measures how much given data can be compressed w.r.t. other data), which is the key component in every compression-based similarity measure.
In this paper, we propose a sparse representation-based approach to encode the information content of an image, and use the compactness (sparsity) of the representation of the image as a measure of its compressibility, i.e. how much the image can be compressed. The sparser the representation of an image, the better it can be compressed.
In order to design a similarity measure that correlates well with human perception, we learn a set of basis elements (collectively called a dictionary) from the images. This approach allows us to build a cortex-like representation of an image. In 1996, Olshausen and Field showed that basis elements that resemble the properties of the receptive fields of simple cells in the primary visual cortex can be learnt from input images [10]. The keys to building such a cortex-like dictionary are: (i) a sparsity prior, i.e. an assumption that it is possible to describe the input image using a small number of basis elements, and (ii) overcompleteness, i.e. the number of basis elements in the dictionary is greater than the dimension of the vector space spanned by the input vectors. Given a pair of images, our method learns a dictionary for each image and computes how sparsely one image can be approximated, to a required precision, using the dictionary extracted from the other.
The rest of the paper is organized as follows. Section II briefly describes the related work on compression-based distances, Section III proposes the sparse representation-based distance measure, and Section IV presents experimental results. Section V concludes the article with possible directions for future work.
II Previous Work
The work of Kolmogorov and others [11, 12, 13] on how to measure data complexity has been influential in many areas of knowledge, across multiple disciplines. The notion of complexity of a string is related to its randomness. For example, the binary string is considered more complex compared to the string , because the latter contains a regularity (repeating pattern) and therefore is less random. Kolmogorov complexity formalizes this concept.
Given a finite object, such as a binary string $x$, its Kolmogorov complexity $K(x)$ is defined as the length of the shortest program that can effectively produce $x$ on a universal computer, such as a Turing machine [14]. The Kolmogorov complexity, however, is non-computable in general. In practice, it is often approximated by the length or the file size of the compressed data. Intuitively, the more the given data can be compressed, the lower its complexity.
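As a rough illustration of this approximation, the sketch below uses the length of a zlib-compressed byte string as a stand-in for the Kolmogorov complexity. The choice of zlib, the helper name `compressed_length` and the two example strings are assumptions made here for illustration; they are not taken from the paper.

```python
import os
import zlib


def compressed_length(data: bytes) -> int:
    """Length in bytes of the zlib-compressed data: a crude stand-in for K(x)."""
    return len(zlib.compress(data, 9))


# A highly regular string (one repeated pattern) versus random bytes.
regular = b"01" * 512           # 1024 bytes with an obvious repeating pattern
random_like = os.urandom(1024)  # 1024 bytes with no exploitable regularity

print(compressed_length(regular), compressed_length(random_like))
```

The regular string compresses to far fewer bytes than the random one, which mirrors the intuition that regularity means lower complexity.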
II-A Compression-based distance measures
Recently, Kolmogorov's theory of complexity has been used to address the problem of similarity measurement [2, 3]. Given two signals $x$ and $y$, a distance metric known as the Normalized Information Distance (NID) is developed using $K(x)$ and the conditional Kolmogorov complexity $K(x|y)$. $K(x|y)$ is defined as the length of the shortest program used by a universal computer to generate $x$ when $y$ is known.
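Due to the non-computable nature of the Kolmogorov complexity, a practical analog of the NID metric is proposed based on standard compression methods. This is called the Normalized Compression Distance (NCD). Intuitively, NCD considers $x$ and $y$ to be similar if one can be significantly compressed when the information of the other is provided. Writing $C(x)$ for the compressed length of $x$ and $C(x|y)$ for the conditional compression of $x$ given $y$, it is defined as follows:
$$\mathrm{NCD}(x, y) = \frac{\max\{C(x|y),\, C(y|x)\}}{\max\{C(x),\, C(y)\}} \tag{1}$$
The conditional compression $C(x|y)$ is approximated as follows:
$$C(x|y) = C(xy) - C(y) \tag{2}$$
where $C(xy)$ denotes the compressed length of the concatenation of $x$ and $y$.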
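The following minimal sketch shows how an NCD of this form can be computed with an off-the-shelf compressor: the larger of the two conditional compression terms is divided by the larger of the two compressed lengths, and each conditional term is approximated from the concatenation as in (2). It assumes zlib as the compressor and uses made-up byte strings; the function names are hypothetical and the snippet is not the implementation used in [3].

```python
import random
import zlib


def C(data: bytes) -> int:
    """Compressed length in bytes (zlib is an illustrative choice of compressor)."""
    return len(zlib.compress(data, 9))


def conditional_C(x: bytes, y: bytes) -> int:
    """Approximate C(x|y) by C(xy) - C(y), following (2)."""
    return C(x + y) - C(y)


def ncd(x: bytes, y: bytes) -> float:
    """Normalized Compression Distance built from the two conditional terms."""
    return max(conditional_C(x, y), conditional_C(y, x)) / max(C(x), C(y))


text_a = b"the quick brown fox jumps over the lazy dog " * 20
text_b = b"the quick brown cat jumps over the lazy dog " * 20
rng = random.Random(0)
noise = bytes(rng.getrandbits(8) for _ in range(len(text_a)))

print(ncd(text_a, text_b), ncd(text_a, noise))
```

The two near-identical texts yield a much smaller distance than the text paired with pseudo-random bytes.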
The NCD metric has been shown to be effective in clustering mitochondrial genomes, languages and music [3]. Following the success of NCD, different versions of compression-based distance measures have been proposed; for example, a Compression-based Dissimilarity Measure (CDM) is proposed in the context of parameter-free data mining and is shown to be useful for anomaly detection, clustering and classification of text, DNA and time-series data [15]. CDM is defined as
$$\mathrm{CDM}(x, y) = \frac{C(xy)}{C(x) + C(y)} \tag{3}$$
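For comparison, CDM (the compressed length of the concatenation divided by the sum of the individual compressed lengths) can be sketched in a few lines. As before, zlib and the example inputs are assumptions made for illustration only.

```python
import random
import zlib


def C(data: bytes) -> int:
    """Compressed length in bytes using zlib (an illustrative compressor)."""
    return len(zlib.compress(data, 9))


def cdm(x: bytes, y: bytes) -> float:
    """Compression-based Dissimilarity Measure: C(xy) / (C(x) + C(y))."""
    return C(x + y) / (C(x) + C(y))


pattern = b"abcabcabc" * 50
unrelated = bytes(random.Random(1).getrandbits(8) for _ in range(len(pattern)))

print(cdm(pattern, pattern), cdm(pattern, unrelated))
```

Smaller values indicate that the two inputs share more structure; no conditional term has to be evaluated.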
Other applications of compression-based distances include symbolic music clustering [16] and plagiarism detection [17]. The idea of compression, independent of NCD, has also been used to design a pattern representation scheme for automatic categorization of music, voice, genome, etc. [4]; but this method requires encoding the media data input into text.
II-B Compression-based distances for images
In the context of images, however, successful application of the compression-based distance measures is scarce. We identify two major reasons behind that.

The success of the compression-based distances heavily depends on the availability of a normal compressor. A compressor is normal only if it satisfies certain conditions such as idempotency, monotonicity, symmetry, etc. (please refer to [3] for details). The problem is that most state-of-the-art image compressors (such as JPEG, JPEG2000) are not normal, and the normal compressors (such as the compressors of the Lempel-Ziv family) do not work well on images [5].

Another serious obstacle lies in evaluating and approximating the conditional complexity terms such as $C(x|y)$ in NCD. These terms are the key components in a compression-based measure. The existing compression-based methods (whether or not they involve images) either approximate the conditional compression $C(x|y)$ with $C(xy) - C(y)$ or use a simplified definition so as not to include any conditional term (as in (3)). Direct evaluation of $C(x|y)$ is usually bypassed, mainly to retain the simplicity of the compression-based measures, since evaluating $C(x|y)$ accurately requires delving into the complicated standards and algorithms of data or image compression. This also makes the compression-based methods difficult to improve upon.
Clearly, the straightforward extension of the methods that work perfectly well on discrete, one-dimensional data has not been very promising in the context of images. In the pursuit of alternatives, a new image encoder based on the finite context model has been proposed, with preliminary results reported on a face database [5]. Another recent approach, namely the CK1 method, uses the MPEG-1 video compressor to measure image similarity [6]. This method takes advantage of the temporal redundancy reduction step in video compression, which performs inter-frame block matching. In this approach, a two-frame video consisting of the images to be compared is created. One frame is compressed with reference to the other frame using a standard video compressor. The compressed file size of the video is used to approximate the closeness between the pair of images. This method has been shown to be useful in texture classification.
III The Proposed Approach
A natural way of measuring the similarity between two given images is to quantify how well either image can be represented using the information of the other. The more similar the images, the better the representation of one image in terms of the other. Our method formalizes this intuitive idea of similarity using a sparse representation-based approach.
III-A Sparsity as a measure of data complexity
It is well known that sparsity of representation plays a key role in achieving good compression. For example, the superiority of JPEG2000 is mainly attributed to the ability of the wavelet transform to represent an image more sparsely than the DCT used in JPEG. Intuitively, the sparser the representation of a signal, the fewer the components needed to capture the signal's information content and the better it can be compressed.
Sparsity thus can be seen as a direct measure of the randomness or complexity of the data. A natural image usually exhibits many repeated structures which can be discovered through its decomposition over a set of properly chosen basis functions. Due to the presence of redundancy, only a few basis functions are required to capture the significant information content of such images, resulting in a sparse representation. In the case where such structures are rare (e.g. in random Gaussian noise), there is no way to represent the data using a small number of basis elements. This indicates that as the complexity of a signal increases, more and more components are needed to represent the signal with a desired accuracy i.e. its sparsity decreases in the transform domain. This inherent connection between sparsity and data complexity is exploited in our proposed distance measure.
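The link between structure and sparsity can be checked numerically. The sketch below counts how many Fourier coefficients are needed to hold 99% of a signal's energy, once for a sum of two sinusoids and once for Gaussian noise. The Fourier basis, the 99% threshold and the test signals are assumptions chosen for illustration; the paper itself uses learned dictionaries rather than a fixed basis.

```python
import numpy as np


def coefficients_needed(signal, energy_fraction=0.99):
    """Number of largest-magnitude Fourier coefficients holding the given energy share."""
    power = np.sort(np.abs(np.fft.fft(signal)) ** 2)[::-1]  # descending power
    cumulative = np.cumsum(power) / power.sum()
    # First position at which the cumulative share reaches the target.
    return int(np.searchsorted(cumulative, energy_fraction) + 1)


t = np.arange(256)
structured = np.sin(2 * np.pi * 4 * t / 256) + 0.5 * np.sin(2 * np.pi * 9 * t / 256)
noise = np.random.default_rng(0).standard_normal(256)

print(coefficients_needed(structured), coefficients_needed(noise))
```

The structured signal is captured by a handful of coefficients, whereas the noise needs most of the basis, i.e. its representation is far less sparse.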
III-B Sparse representation-based distance measure
The basic idea in sparse signal analysis is to represent a signal by a linear combination of a small number of basis functions. Consider a signal $\mathbf{x} \in \mathbb{R}^{n}$ represented as a linear combination of basis functions or atoms,
$$\mathbf{x} = \mathbf{D}\boldsymbol{\alpha} = \sum_{i=1}^{K} \alpha_i \mathbf{d}_i \tag{4}$$
where $\mathbf{D} \in \mathbb{R}^{n \times K}$ is the dictionary and its columns $\mathbf{d}_i$ are the basis functions or atoms. If the values of the majority of components in $\boldsymbol{\alpha}$ are $0$ (or close to $0$), we say that $\mathbf{x}$ has a sparse representation w.r.t. $\mathbf{D}$. For orthogonal bases like Fourier, $\mathbf{D}$ is a square matrix i.e. $K = n$. For those cases where the number of basis vectors is greater than the dimensionality of the input signal i.e. where $K > n$, $\mathbf{D}$ is said to be overcomplete. An overcomplete dictionary offers greater flexibility in representing the essential structures in a signal, which in turn leads to higher sparsity in the transform domain. Such a representation also has advantages such as robustness to additive noise and occlusion [18].
III-B1 Learning the dictionaries
Let us consider an image $X$. A set of $m$ random, possibly overlapping patches (each of dimension $\sqrt{n} \times \sqrt{n}$) is extracted from $X$. Every patch is converted to a vector of length $n$ and the patches are then concatenated to form a matrix $\mathbf{X} \in \mathbb{R}^{n \times m}$. In order to build a perceptually meaningful model for $X$, we intend to learn an overcomplete dictionary $\mathbf{D}_x \in \mathbb{R}^{n \times K}$ that has $K$ atoms ($K > n$) using the local patches in $\mathbf{X}$ as input. However, greater difficulties arise with a set of overcomplete bases. An overcomplete dictionary matrix creates an underdetermined system of linear equations having an infinite number of solutions. Knowing that natural signals are sparsely representable, in such cases we seek the sparsest solution, i.e. we want the coefficient vector $\boldsymbol{\alpha}$ to contain as few nonzero elements as possible.
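Our objective is to learn $\mathbf{D}_x$ such that each patch (column) $\mathbf{x}_i$ of $\mathbf{X}$ can be closely approximated as a linear superposition of a small number of atoms in $\mathbf{D}_x$. This is achieved by solving the following sparse optimization problem:
$$\min_{\mathbf{D}_x,\, \mathbf{A}_x} \sum_{i=1}^{m} \|\boldsymbol{\alpha}_i\|_p \quad \text{s.t.} \quad \|\mathbf{x}_i - \mathbf{D}_x \boldsymbol{\alpha}_i\|_2^2 \leq \epsilon, \quad i = 1, \dots, m \tag{5}$$
where the vector $\boldsymbol{\alpha}_i$ is the sparse representation of the patch $\mathbf{x}_i$. The sparse representation of $\mathbf{X}$ w.r.t. $\mathbf{D}_x$ is denoted as the matrix $\mathbf{A}_x = [\boldsymbol{\alpha}_1, \dots, \boldsymbol{\alpha}_m]$. The value of $p$ is typically $0$ or $1$ and $\epsilon$ denotes the reconstruction error controlled by the user.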
Note that, with $p = 0$ (the $\ell_0$ seminorm that counts the number of nonzero elements in a vector), problem (5) becomes nonconvex, and solving it exactly is NP-hard. An approximate solution is found instead, using either greedy algorithms [19] or convex relaxation [20]. The convex relaxation methods use $p = 1$ (the $\ell_1$ norm) to transform (5) into a convex problem.
We employ a fast dictionary learning algorithm called the K-SVD algorithm [21], which provides an approximate solution to (5) for the $p = 0$ case. It performs two steps at every iteration: (i) sparse coding and (ii) dictionary update. In the first step, the dictionary $\mathbf{D}_x$ is fixed and $\mathbf{A}_x$ is computed by a greedy algorithm called Orthogonal Matching Pursuit (OMP) [19]. Next, the atoms of $\mathbf{D}_x$ are updated sequentially, allowing the relevant coefficients in $\mathbf{A}_x$ to change as well. For the details of this algorithm, please refer to the original K-SVD paper [21].
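To make the sparse coding step concrete, here is a minimal NumPy sketch of error-constrained OMP for a single patch with a fixed dictionary. It covers only the sparse coding half of an iteration (the K-SVD dictionary update is not shown), and the function name `omp`, the stopping rule on the squared residual and the random test dictionary are assumptions made here, not the authors' implementation.

```python
import numpy as np


def omp(D, x, eps, max_atoms=None):
    """Greedy OMP: add atoms until the squared residual ||x - D a||_2^2 <= eps."""
    n, K = D.shape
    max_atoms = n if max_atoms is None else max_atoms
    support = []                 # indices of the selected atoms
    coeffs = np.zeros(0)
    residual = x.astype(float).copy()
    while residual @ residual > eps and len(support) < max_atoms:
        # Pick the atom most correlated with the current residual.
        k = int(np.argmax(np.abs(D.T @ residual)))
        if k in support:         # no new atom helps; stop
            break
        support.append(k)
        # Least-squares fit of x on the selected atoms (the "orthogonal" step).
        coeffs, *_ = np.linalg.lstsq(D[:, support], x, rcond=None)
        residual = x - D[:, support] @ coeffs
    alpha = np.zeros(K)
    if support:
        alpha[support] = coeffs
    return alpha


rng = np.random.default_rng(0)
D = rng.standard_normal((16, 32))
D /= np.linalg.norm(D, axis=0)   # unit-norm atoms, overcomplete (K = 32 > n = 16)
alpha_true = np.zeros(32)
alpha_true[[3, 17]] = [1.5, -2.0]
x = D @ alpha_true
alpha = omp(D, x, eps=1e-10)
print(np.count_nonzero(alpha))
```

The number of nonzero entries returned for each patch is exactly the quantity that the complexity measures in the next subsection average.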
III-B2 Sparse representation-based complexity functions
We define two quantities that measure the compressibility (how much an image can be compressed) of an image $X$ by (i) using its own dictionary, $\mathbf{D}_x$, and (ii) using the dictionary extracted from the other image, $\mathbf{D}_y$. We name these terms the Sparse Complexity and the Relative Sparse Complexity, respectively.
Definition 1. Given an image $X$, its Sparse Complexity $S_X$ is defined as the sparsity of $\mathbf{A}_x$ averaged over the number of columns in $\mathbf{A}_x$ i.e.
$$S_X = \frac{1}{m} \sum_{i=1}^{m} \|\boldsymbol{\alpha}_i\|_p \tag{6}$$
where $\boldsymbol{\alpha}_i$ is the $i$-th of the $m$ columns of $\mathbf{A}_x$. Therefore, for $p = 0$, $S_X$ is the average number of nonzero coefficients required to reconstruct a column of $\mathbf{X}$ using $\mathbf{D}_x$, up to a required precision $\epsilon$. A smaller value of $S_X$ indicates higher compressibility (lower complexity) of $X$.
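For the count-of-nonzeros case, the definition reduces to a one-line average over the columns of the coefficient matrix. The sketch below assumes the coefficients are already available as a NumPy array with one column per patch; the function name and the optional tolerance `tol` are illustrative additions.

```python
import numpy as np


def sparse_complexity(A, tol=0.0):
    """Average number of nonzero coefficients per column of A (count-based case of (6))."""
    A = np.asarray(A, dtype=float)
    if A.size == 0:              # empty input: complexity defined as 0
        return 0.0
    nonzeros_per_column = np.sum(np.abs(A) > tol, axis=0)
    return float(np.mean(nonzeros_per_column))


A = np.array([[1.0, 0.0, 0.0],
              [0.0, 0.0, 2.0],
              [3.0, 0.0, 0.0]])
print(sparse_complexity(A))
```

Because the measure is an average over columns, duplicating every column of `A` leaves the value unchanged.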
Properties of $S_X$:

$S_X > 0$ for nonempty $X$, and is equal to $0$ otherwise.

Considering that $X$ is represented by $\mathbf{A}_x$ and hence the concatenation $XX$ is represented by $[\mathbf{A}_x \; \mathbf{A}_x]$, we have
$S_{XX} = S_X$.
This property (idempotency) follows from the averaging operation and indicates that the sparse complexity function can compress the duplicate entries.
Given another image $Y$, the compression-based measures attempt to approximate how much the image $X$ can be compressed when additional information about $Y$ is available. As discussed before, this conditional quantity, $C(X|Y)$, is difficult to approximate, and that limits the success of these measures. We hence define a slightly different complexity term that measures how much information about $X$ is contained in $Y$. We name this term the Relative Sparse Complexity, $S_{X|Y}$.
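Let $\mathbf{D}_y$ be the dictionary pertaining to the image $Y$, learnt in the same manner as $\mathbf{D}_x$ (refer to (5)). The image $X$ can be approximated in terms of the dictionary of $Y$ as follows:
$$\min_{\mathbf{A}_{x|y}} \sum_{i=1}^{m} \|\boldsymbol{\alpha}_i^{x|y}\|_p \quad \text{s.t.} \quad \|\mathbf{x}_i - \mathbf{D}_y \boldsymbol{\alpha}_i^{x|y}\|_2^2 \leq \epsilon, \quad i = 1, \dots, m \tag{7}$$
where $\mathbf{A}_{x|y}$ is the sparse representation of $\mathbf{X}$ w.r.t. $\mathbf{D}_y$ and $\boldsymbol{\alpha}_i^{x|y}$ is the sparse representation of the patch $\mathbf{x}_i$ w.r.t. $\mathbf{D}_y$.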
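Definition 2. Given two images $X$ and $Y$, the Relative Sparse Complexity $S_{X|Y}$ is defined as the sparsity of $\mathbf{A}_{x|y}$ averaged over the number of columns in $\mathbf{A}_{x|y}$.
$$S_{X|Y} = \frac{1}{m} \sum_{i=1}^{m} \|\boldsymbol{\alpha}_i^{x|y}\|_p \tag{8}$$
where $\boldsymbol{\alpha}_i^{x|y}$ is the $i$-th of the $m$ columns of $\mathbf{A}_{x|y}$. Therefore, for $p = 0$, $S_{X|Y}$ becomes the average number of nonzero coefficients required to reconstruct a column of $\mathbf{X}$ using $\mathbf{D}_y$, up to a required precision $\epsilon$. A smaller value of $S_{X|Y}$ indicates that $X$ is efficiently represented by the information extracted from $Y$, i.e. $X$ and $Y$ have higher similarity.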
Properties of $S_{X|Y}$:

$S_{X|Y} > 0$ for nonempty $X$, and $0$ otherwise.

(symmetry)

$S_{X|Y} \geq S_X$ for $X \neq Y$. This is because, in general, $X$ is expected to be more efficiently (sparsely) approximated using $\mathbf{D}_x$, the dictionary trained on itself, than using $\mathbf{D}_y$, a dictionary trained on a different image.
III-B3 The distance measure
Based on the two terms defined above, a sparse representation-based distance measure is defined as follows:
$$d(X, Y) = \frac{S_{X|Y} + S_{Y|X}}{S_X + S_Y} - 1 \tag{9}$$
The proposed form of $d(X, Y)$ is very similar to that of the compression-based CK1 distance measure [6]. From the property of the relative sparse complexity we have $S_{X|Y} \geq S_X$ and $S_{Y|X} \geq S_Y$.
Hence, $d(X, Y) \geq 0$.
Intuitively, $d(X, Y)$ measures how efficient, on average, it is to approximate one image using the information of the other, extracted in the form of a dictionary of its dominant local structures. The smaller the value of $d(X, Y)$, the higher the similarity between the two images.
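Once the four complexity terms are available, the distance is simple arithmetic: the two cross-coded complexities are summed, divided by the sum of the two self-coded complexities, and 1 is subtracted. The sketch below assumes that form; the function name and the example values are hypothetical and serve only to show the calculation.

```python
def sparse_distance(s_x, s_y, s_x_given_y, s_y_given_x):
    """Distance from self-coded (s_x, s_y) and cross-coded complexities."""
    if s_x + s_y == 0:
        raise ValueError("both images are empty; the distance is undefined")
    return (s_x_given_y + s_y_given_x) / (s_x + s_y) - 1.0


# Hypothetical averages: each image needs more atoms when coded with the
# other image's dictionary than with its own.
d_xy = sparse_distance(4.0, 5.0, 7.0, 8.0)
d_yx = sparse_distance(5.0, 4.0, 8.0, 7.0)
d_same = sparse_distance(4.0, 4.0, 4.0, 4.0)
print(d_xy, d_yx, d_same)
```

Swapping the roles of the two images leaves the value unchanged, and identical complexities give a distance of zero.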
Properties of $d(X, Y)$:

Nonnegativity: $d(X, Y)$ is always nonnegative; the lowest value of $d(X, Y)$ is $0$ when $X = Y$.

Symmetry: Clearly, $d(X, Y)$ is symmetric i.e. $d(X, Y) = d(Y, X)$. Symmetry is an important property for a similarity or dissimilarity measure because many algorithms (e.g. spectral clustering) rely on this property.
Note that we have used $p = 0$ to compute the complexity functions because our dictionary learning method uses greedy approximation. If $\ell_1$ optimization is used to learn the dictionaries, it would be better to use $p = 1$ for the definitions.
IV Experimental Validation
[Fig. 1: (a) original image, (b) contrast change, (c) luminance change, (d) white noise, (e) lossy JPEG, (f) unrelated image. Each panel is annotated with its PSNR and VIF scores and the proposed distance.]
In order to establish the generality of the proposed distance measure, we perform experiments on a variety of applications. We first perform experiments to evaluate the compatibility of the proposed measure with the human perception of similarity. This is followed by clustering, retrieval and classification experiments involving larger datasets. The datasets that we choose contain real-world images from different domains like biology, biometrics, medicine and natural textures.
IV-A Implementation Details
Practically, there are four parameters to be set: the patch size ($\sqrt{n} \times \sqrt{n}$), the number of patches to be extracted from each image ($m$), the number of dictionary elements ($K$) and the reconstruction error ($\epsilon$). Unfortunately, there are no theoretical guidelines to determine the values of these parameters, so we rely on previous work and empirical methods. We have used the same parameter values for all experiments, unless mentioned otherwise. Below, we describe how the parameter values are chosen for this particular work.
Patch size ($\sqrt{n} \times \sqrt{n}$) and automatic scale selection: The patch size determines the spatial scale at which an image is analyzed. For simplicity and speed, we analyze each image at a single scale, but use a simple technique to automatically select the (sub)optimal scale. A 2D Laplacian of Gaussian (LoG) filter is applied to each image to detect the local maxima points (keypoints) at four different scales. The scale at which the maximum number of keypoints is detected is chosen as the (sub)optimal scale for that image. The image is downsampled accordingly and a set of patches is extracted. For example, if the scale is found to be , the image is downsampled by a factor of and then patches of size $\sqrt{n} \times \sqrt{n}$ i.e. $8 \times 8$ are extracted. This particular patch size is chosen in order to be consistent with most of the compression-based algorithms (e.g. JPEG) which process $8 \times 8$ blocks. The automatic scale selection is performed on all images for all datasets except for the VTT Wood dataset due to the small dimensions () of the original images.
Number of patches (): In order to train a dictionary, a large number of patches need to be extracted. The color images are first converted to grayscale to achieve color invariance. It is also important that the randomly extracted patches contain important structural information of the image and do not come from the homogeneous regions of the image only. This is accomplished by selecting the patches whose energy levels are above an empirically set threshold. A collection of such patches are extracted from every image and is used to train its corresponding dictionary. The input patches for dictionary learning have zero mean and unit standard deviation which account for luminance and contrast invariance.
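The patch preparation described above can be sketched as follows. The energy criterion is not specified in the text, so this sketch takes "energy" to mean the variance of the patch after mean removal; that definition, the threshold value, the function name and the default arguments are all assumptions made for illustration.

```python
import numpy as np


def extract_patches(image, patch=8, count=100, min_energy=1e-3, seed=0,
                    max_tries=10000):
    """Sample random patches, drop near-homogeneous ones, normalize the rest.

    Returns an (patch*patch, count) matrix whose columns have zero mean and
    unit standard deviation.
    """
    rng = np.random.default_rng(seed)
    h, w = image.shape
    columns = []
    tries = 0
    while len(columns) < count and tries < max_tries:
        tries += 1
        r = int(rng.integers(0, h - patch + 1))
        c = int(rng.integers(0, w - patch + 1))
        p = image[r:r + patch, c:c + patch].astype(float).ravel()
        p = p - p.mean()                     # zero mean: luminance invariance
        if p @ p / p.size < min_energy:      # assumed energy test (variance)
            continue
        columns.append(p / p.std())          # unit std: contrast invariance
    if not columns:
        return np.empty((patch * patch, 0))
    return np.column_stack(columns)


image = np.random.default_rng(1).random((32, 32))   # stand-in grayscale image
X = extract_patches(image, patch=8, count=50)
print(X.shape)
```

Each column of the returned matrix corresponds to one vectorized patch and can be fed directly to a dictionary learning routine.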
Overcompleteness ($K$): Since we intend to learn an overcomplete dictionary, we must have $K > n$. The ratio $K/n$ is called the overcompleteness factor. It has been shown that for a small overcompleteness factor, sparse representation is stable in the presence of noise [24]. Thus we set /, where .
Reconstruction error (): We used which means that the input vector is reconstructed with at least accuracy. Note that a lower reconstruction error can produce a better dictionary, but requires more computation and more importantly, may cause overfitting.
IV-B Correlation with Human Perception
It is important that the distance measure between images correlates with human perception. We begin by measuring the similarities between a reference image (Fig. 1(a)) and its distorted versions (Fig. 1(b)-(e)) as well as a completely unrelated image (Fig. 1(f)). We also compare our results with PSNR and the well-known Visual Information Fidelity (VIF) similarity measure [25], for which values closer to zero indicate lower similarity. Figure 1 shows that the proposed measure correlates well with human perception and with VIF.
Next, we perform a simple clustering task where it is possible to evaluate the results manually. The Heraldic Shields dataset [6] (see Fig. 2) contains images (of various sizes) which are to be clustered into pairs. All possible pairwise distances are computed using the proposed distance measure. Hierarchical clustering is performed using the average linkage method. The clustering result shown in Fig. 2 demonstrates that our measure has discovered all basic pairs of shields, and corresponds well with human intuition.
[Fig. 2: panels (a) and (b), showing the Heraldic Shields dataset and its hierarchical clustering result.]
IV-C Clustering facial images
In this segment, we move towards more difficult clustering problems involving two larger benchmark datasets:
AT&T face [26]: This dataset contains $400$ facial images of $40$ individuals in $10$ poses. These images (dimension: $92 \times 112$) are taken at different times with varying illumination, facial expressions and details.
Yale face [27]: This dataset has $165$ grayscale facial images of $15$ individuals. There are $11$ images per subject, one per different condition: center light, with glasses, happy, left light, no glasses, normal, right light, sad, sleepy, surprised, and wink.
For each dataset, an $N \times N$ similarity matrix is computed using (9), where $N$ is the number of elements in the dataset. This similarity matrix serves as the input to a standard spectral clustering algorithm [28]. The accuracy of the clustering results is measured using the Hungarian algorithm [29]. We compare our results with the compression-based state-of-the-art CK1 distance measure [6] using the code provided by the authors. Due to the initialization process in spectral clustering, the accuracy varies slightly at each run. Figure 3 reports the mean clustering accuracies along with the standard deviations as computed over runs for the two databases under consideration. The proposed measure outperforms CK1 on the AT&T face dataset by and its performance is lower than CK1 on the Yale dataset.
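Clustering accuracy requires matching arbitrary cluster labels to ground-truth classes before counting hits. The sketch below finds the best one-to-one matching by exhaustive search over label permutations, which gives the same optimum as the Hungarian algorithm but is practical only for a small number of clusters; it is a stand-in for illustration, not the evaluation code used in the experiments, and the function name and example labels are hypothetical.

```python
from itertools import permutations


def clustering_accuracy(true_labels, pred_labels):
    """Fraction of samples correctly labeled under the best cluster-to-class map."""
    classes = sorted(set(true_labels))
    clusters = sorted(set(pred_labels))
    if len(clusters) > len(classes):
        raise ValueError("more clusters than classes")
    best_hits = 0
    # Try every one-to-one assignment of clusters to distinct classes.
    for assigned in permutations(classes, len(clusters)):
        mapping = dict(zip(clusters, assigned))
        hits = sum(mapping[p] == t for t, p in zip(true_labels, pred_labels))
        best_hits = max(best_hits, hits)
    return best_hits / len(true_labels)


truth = [0, 0, 1, 1, 2, 2]
predicted = [2, 2, 0, 0, 1, 0]   # same grouping up to relabeling, one mistake
print(clustering_accuracy(truth, predicted))
```

For datasets with tens of classes, a polynomial-time assignment solver would replace the permutation loop.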
IV-D Texture retrieval
An image retrieval system, when provided with a query image, returns images from a large dataset that are perceptually similar to the query. We perform standard retrieval experiments on the following benchmark texture dataset.
Brodatz texture dataset [30]: This is a benchmark dataset that contains a variety of natural textures like grass and cloth. There are different texture classes. Each original texture image is divided into subimages to create the samples for that class.
For each query, the distances between the query and the remaining images in the dataset are computed, and the first nearest images are retrieved. The performance of a retrieval system is often measured in terms of precision and recall. Precision is defined as the ratio of the number of correctly retrieved images to the total number of images retrieved. Recall is defined as the ratio of the number of correctly retrieved images to the number of images available for the query class. Both precision and recall are expressed as percentages. Our retrieval results are compared with those obtained using the CK1 method in Fig. 4, where our method clearly outperforms CK1.
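The two retrieval measures can be written down directly from their definitions. In the sketch below, the function name, the label strings and the counts are made up for illustration.

```python
def precision_recall(retrieved_labels, query_label, n_relevant):
    """Precision and recall (both in %) for one query.

    retrieved_labels: class labels of the retrieved images.
    n_relevant: number of images of the query class available in the dataset.
    """
    correct = sum(1 for label in retrieved_labels if label == query_label)
    precision = 100.0 * correct / len(retrieved_labels)
    recall = 100.0 * correct / n_relevant
    return precision, recall


# Four images retrieved for a "grass" query; eight "grass" images exist in total.
precision, recall = precision_recall(["grass", "grass", "cloth", "grass"],
                                     "grass", n_relevant=8)
print(precision, recall)
```

Retrieving more images can only raise recall, while precision typically falls as less similar images enter the result list.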
IV-E Texture classification
Supervised classification experiments are performed on a diverse collection of texture datasets drawn from the sources across various disciplines such as biology, medicine, forensics, etc.
UIUCTex [31]: This dataset features texture classes with samples each.
KTH Tips [32]: This dataset consists of textures of different materials. The images vary in illumination, pose and scale.
Camouflage [6]: This dataset consists of images of varieties of modern US military camouflage. The images are created by photographing military T-shirts at random orientations.
Nematodes [6]: Nematodes are worm-like animals with great commercial and medical importance. Their species are often very difficult to distinguish from each other. This dataset contains images of different species of nematodes.
Tire tracks [6]: This is a collection of tire imprints left on a paper. It has imprints of different tires at varying directions.
Spiders [6]: This is a collection of images of Australasian ground spiders of the family Trochanteriidae. This family has high intra- and inter-class variation.
VTT Wood [6]: This dataset contains images of types of wood defects (such as dry knot and small knot, etc.). The task is to label an image as either defective or sound.
The classification results for the above datasets using the proposed method and CK1 are presented in Table I. We test both methods using a leave-one-out scheme in a 1-Nearest Neighbor framework. Our method demonstrates much better or comparable accuracy for all the datasets.
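The evaluation protocol can be sketched from a precomputed pairwise distance matrix: each image is held out in turn and assigned the label of its nearest remaining neighbor. The function name and the toy distance matrix below are illustrative assumptions.

```python
def loo_1nn_accuracy(dist, labels):
    """Leave-one-out 1-NN accuracy from a square pairwise distance matrix."""
    n = len(labels)
    correct = 0
    for i in range(n):
        # Nearest neighbor of sample i among all other samples.
        nearest = min((k for k in range(n) if k != i), key=lambda k: dist[i][k])
        correct += labels[nearest] == labels[i]
    return correct / n


# Toy example: four samples at positions 0, 1, 10 and 11 on a line.
points = [0.0, 1.0, 10.0, 11.0]
dist = [[abs(a - b) for b in points] for a in points]
labels = ["a", "a", "b", "b"]
print(loo_1nn_accuracy(dist, labels))
```

Since only the distance matrix is needed, the same routine works with any distance measure, which makes it convenient for side-by-side comparisons.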
Dataset  Classes  Proposed (%)  CK1 [6] (%) 

Brodatz  
UIUCTex  
KTH Tips  
Camouflage  
Nematodes  
Tire tracks  
Spiders  
VTT wood 
IV-F Discussion
Most compression-based methods use an off-the-shelf compressor (data, image or video compressor) and treat the compressor as a black box. This makes it difficult to understand which part of the compression algorithm actually estimates the complexity of the data or measures the similarity. Consequently, the compression-based methods are difficult to improve upon, unless one wants to delve into the details of the compression algorithms. The proposed method takes a rather direct approach towards the approximation of complexity, and it is easier to understand and improve. Our method can be easily extended to measure the similarity between any type of signal, including audio, video and other types of images such as medical images.
The proposed method requires learning a dictionary for each image. The dictionary learning process takes only a few seconds; for example, with the above-mentioned parameter values, a MATLAB implementation takes secs to learn a dictionary per image (including the patch extraction process) on a standard PC (Intel quad @2.67GHz). This is as fast as any standard feature extraction process. However, our method is still slower than the compression-based CK1 measure. This can be explained by the fact that the areas of dictionary learning and sparse representation are still developing. In other words, unlike the standard compression algorithms, the existing algorithms for learning dictionaries or sparse representations are not yet fully optimized for speed or memory.
We have used a greedy algorithm (OMP) to solve the sparse optimization problems in this work, primarily for speed and simplicity. Better results may be achieved using $\ell_1$-regularized algorithms, but at a higher computational cost. The proposed method is also not parameter-free; it requires a few parameters to be set by the user.
V Conclusion
The main contribution of this work is the introduction of a sparse representation-based approach for computing a generic image similarity measure. The proposed measure has been shown to be successful in classifying, retrieving and clustering a variety of images, as it performs consistently on par with or better than the state of the art. Nevertheless, the present work is not closed and we hope that it will stimulate interest in the areas of compression or Kolmogorov complexity-based similarity measurement using sparse representation.
A very recent work has also addressed the problem of similarity measurement using sparse representation of image features [33]. However, it addresses the problem from a different perspective and does not have any connection with the compression-based or Kolmogorov complexity-based approaches.
In this work, we have not focused on speeding up the classification, retrieval or the clustering processes since our objective has been to first demonstrate the usefulness and generality of the new distance measure. Future research will focus on using the measure more efficiently to classify and cluster larger datasets. This will require exploiting sophisticated machine learning techniques. Applications can also be extended to problems such as copy detection and data mining.
References
 [1] B. Girod, “What’s wrong with mean-squared-error?” Digital Images and Human Vision, 1993.
 [2] M. Li, X. Chen, X. Li, B. Ma, and P. Vitanyi, “The similarity metric,” IEEE Trans. Information Theory, vol. 50, no. 12, pp. 3250 – 3264, Dec 2004.
 [3] R. Cilibrasi and P. Vitanyi, “Clustering by compression,” IEEE Trans. Information Theory, vol. 51, no. 4, pp. 1523 – 1545, Apr 2005.
 [4] T. Watanabe, K. Sugawara, and H. Sugihara, “A new pattern representation scheme using data compression,” Pattern Analysis and Machine Intelligence, IEEE Transactions on, vol. 24, no. 5, pp. 579 –590, may 2002.
 [5] A. Pinho and P. Ferreira, “Image similarity using the normalized compression distance based on finite context models,” in Image Processing (ICIP), 2011 18th IEEE International Conference on, sept. 2011, pp. 1993 –1996.
 [6] B. J. L. Campana and E. J. Keogh, “A compression-based distance measure for texture,” Statistical Analysis and Data Mining, vol. 3, no. 6, 2010.
 [7] A. Macedonas, D. Besiris, G. Economou, and S. Fotopoulos, “Dictionary based color image retrieval,” J. Visual Comm. and Image Representation, vol. 19, no. 7, pp. 464 – 470, 2008.
 [8] M. Li and Y. Zhu, “Image classification via lz78 based string kernel: A comparative study,” in Advances in Knowledge Discovery and Data Mining, 2006, vol. 3918, pp. 704–712.
 [9] D. Cerra and M. Datcu, “A model conditioned data compression based similarity measure,” in Proc. DCC, Mar 2008, p. 509.
 [10] B. A. Olshausen and D. J. Field, “Emergence of simple-cell receptive field properties by learning a sparse code for natural images,” Nature, vol. 381, pp. 607–609, 1996.
 [11] A. Kolmogorov, “Three approaches to the quantitative definition of ‘information’,” Problems of Information Transmission, vol. 1, no. 1, pp. 1–7, 1965.
 [12] R. J. Solomonoff, “A formal theory of inductive inference. part i,” Information and control, vol. 7, no. 1, pp. 1–22, 1964.
 [13] G. J. Chaitin, “On the length of programs for computing finite binary sequences,” Journal of the ACM (JACM), vol. 13, no. 4, pp. 547–569, 1966.
 [14] M. Li and P. M. Vitányi, An Introduction to Kolmogorov Complexity and Its Applications, 2nd ed. Springer, 1997.
 [15] E. Keogh, S. Lonardi, and C. A. Ratanamahatana, “Towards parameter-free data mining,” in Proc. ACM SIGKDD, 2004, pp. 206–215.
 [16] R. Cilibrasi, P. M. B. Vitányi, and R. de Wolf, “Algorithmic clustering of music based on string compression,” Computer Music Journal, vol. 28, no. 4, pp. 49–67, 2003.
 [17] X. Chen, B. Francia, M. Li, B. McKinnon, and A. Seker, “Shared information and program plagiarism detection,” IEEE Trans. Information Theory, vol. 50, no. 7, pp. 1545 – 1551, july 2004.
 [18] M. S. Lewicki and T. J. Sejnowski, “Learning overcomplete representations,” Neural Computation, vol. 12, no. 2, pp. 337–365, 2000.
 [19] Y. Pati, R. Rezaiifar, and P. Krishnaprasad, “Orthogonal matching pursuit: recursive function approximation with applications to wavelet decomposition,” in Proc. Asilomar Signals, Systems and Computers, 1993.
 [20] S. S. Chen, D. L. Donoho, and M. A. Saunders, “Atomic decomposition by basis pursuit,” SIAM J. Scientific Computing, vol. 20, pp. 33–61, 1998.
 [21] M. Aharon, M. Elad, and A. Bruckstein, “K-SVD: An algorithm for designing overcomplete dictionaries for sparse representation,” IEEE Trans. Signal Processing, vol. 54, pp. 4311–4322, 2006.
 [22] A. Tversky, “Features of similarity,” Psychological Review, vol. 84(4), pp. 327–352, 1977.
 [23] Y. Rubner, C. Tomasi, and L. J. Guibas, “The earth mover’s distance as a metric for image retrieval,” Int. J. of Computer Vision, vol. 40, pp. 99–121, 2000.
 [24] B. Wohlberg, “Noise sensitivity of sparse signal representations: reconstruction error bounds for the inverse problem,” IEEE Trans. Signal Processing, vol. 51, no. 12, pp. 3053 – 3060, 2003.
 [25] H. Sheikh and A. Bovik, “Image information and visual quality,” IEEE Tran. Image Processing, vol. 15, no. 2, pp. 430 –444, feb. 2006.
 [26] [Online]. Available: http://www.cl.cam.ac.uk/research/dtg/attarchive/facedatabase.html
 [27] [Online]. Available: http://cvc.yale.edu/projects/yalefaces/yalefaces.html
 [28] A. Y. Ng, M. I. Jordan, and Y. Weiss, “On spectral clustering: Analysis and an algorithm,” in Proc. NIPS. MIT Press, 2001, pp. 849–856.
 [29] C. H. Papadimitriou and K. Steiglitz, Combinatorial Optimization: Algorithms and Complexity. Dover Publications, 1998.
 [30] [Online]. Available: http://www.ux.uis.no/~tranden/brodatz.html
 [31] [Online]. Available: http://www-cvr.ai.uiuc.edu/ponce_grp/data/
 [32] [Online]. Available: http://www.nada.kth.se/cvap/databases/kth-tips/download.html
 [33] L.-W. Kang, C.-Y. Hsu, H.-W. Chen, C.-S. Lu, C.-Y. Lin, and S.-C. Pei, “Feature-based sparse representation for image similarity assessment,” IEEE Trans. Multimedia, vol. 13, no. 5, pp. 1019–1030, 2011.