Information Theoretic Learning with Infinitely Divisible Kernels
In this paper, we develop a framework for information theoretic learning based on infinitely divisible matrices. We formulate an entropy-like functional on positive definite matrices based on Renyi’s axiomatic definition of entropy and examine some key properties of this functional that lead to the concept of infinite divisibility. The proposed formulation avoids the plug in estimation of density and brings along the representation power of reproducing kernel Hilbert spaces. As an application example, we derive a supervised metric learning algorithm using a matrix based analogue to conditional entropy achieving results comparable with the state of the art.
Luis G. Sanchez Giraldo Dept. of Electrical and Computer Engineering University of Florida Gainesville, Florida, USA email@example.com Jose C. Principe Dept. of Electrical and Computer Engineering University of Florida Gainesville, Florida, USA firstname.lastname@example.org
Information theoretic quantities are descriptors of the distributions of the data that go beyond second-order statistics. The expressive richness of quantities such as entropy or mutual information has been shown to be very useful for machine learning problems where optimality based on linear and Gaussian assumptions no longer holds. Nevertheless, operational quantities in information theory are based on the probability laws underlying the data generation process, which are rarely known in the statistical learning setting, where the only information available comes from the sample. Therefore, the use of information theoretic quantities as descriptors of data requires the development of suitable estimators. In [JPrincipe], the use of Renyi’s definition of entropy along with Parzen density estimation is proposed as the main tool for information theoretic learning (ITL). The optimality criterion is expressed in terms of quantities such as Renyi’s entropy, divergences based on the Cauchy-Schwarz inequality, and quadratic mutual information, among others. Part of the research effort in this context has pointed out connections to reproducing kernel Hilbert spaces [JWXu08]. Here, we show that these connections are not only valuable from a theoretical point of view, but they can also be exploited to derive novel information theoretic quantities with suitable estimators from data.
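Positive definite kernels have been employed in machine learning as a representational tool allowing algorithms that are based on inner products to be expressed in a rather generic way (the so-called “kernel trick”). Algorithms that exploit this property are commonly known as kernel methods. Let $\mathcal{X}$ be a nonempty set. A function $\kappa: \mathcal{X} \times \mathcal{X} \to \mathbb{R}$ is called a positive definite kernel if for any finite set $\{x_i\}_{i=1}^{n} \subseteq \mathcal{X}$ and any set of coefficients $\{\alpha_i\}_{i=1}^{n} \subset \mathbb{R}$, it follows that $\sum_{i,j=1}^{n} \alpha_i \alpha_j \kappa(x_i, x_j) \geq 0$ (the kernel is strictly positive definite when, for distinct points, the inequality is strict if at least one $\alpha_i \neq 0$). In this case, there exists an implicit mapping $\phi: \mathcal{X} \to \mathcal{H}$ that maps any element $x \in \mathcal{X}$ to an element $\phi(x)$ in a Hilbert space $\mathcal{H}$, such that $\kappa(x, y) = \langle \phi(x), \phi(y) \rangle$. The above map provides an implicit representation of the objects of interest that belong to the set $\mathcal{X}$. The generality of this representation has been exploited in many practical applications, even for data that do not come in standard vector representation [JShaweTaylor]. This is possible as long as a kernel function is available.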
More recently, it has been noticed that kernel induced maps are also useful beyond the above kernel trick in a rather interesting fashion. Namely, kernels can be utilized to compute higher-order statistics of the data in a nonparametric setting. Some examples exploring this idea are: kernel independent component analysis [FBach02], the work on measures of dependence and independence using Hilbert-Schmidt norms [AGretton05], and the quadratic measures of independence proposed in [SSeth11]. It is not surprising, yet important to mention, that a similar observation has also been reached from the work on ITL, since one of the original motivations in using information theoretic quantities is to go beyond second-order statistics. The work we introduce in this paper goes along these lines. The twist is that rather than defining an estimator of a conventional information theoretic quantity such as Shannon entropy, we propose a quantity built from the data that satisfies axiomatic properties similar to those of well-established definitions such as Renyi’s definition of entropy.
The main contribution of this work is to show that the Gram matrix obtained from evaluating a positive definite kernel on samples can be used to define a quantity based on the data with properties similar to those of an entropy without assuming that the probability density is being estimated. Therefore, we look at the axiomatic treatment of entropy and adapt it to the Gram matrices describing the data. In this sense, we think about entropy as a measure inversely related to the amount of statistical regularities (structure) directly from the data that can be applied as the optimality criterion in a learning algorithm.
As an application example, we derive a supervised metric learning algorithm that uses conditional entropy as the cost function. This is the second contribution of this paper, and the empirical results show that the proposed method is competitive with current approaches.
The main body of the paper is organized in two parts. First, we introduce the proposed matrix-based entropy measure using the spectral theorem along with a set of axiomatic properties that our quantity must satisfy. Then, the notion of joint entropy is developed based on Hadamard products. We look at some basic inequalities of information and how they translate to the setting of positive definite matrices, which finally allow us to define an analogue to conditional entropies. In the development of these ideas, we find that the concept of infinitely divisible kernels arises and becomes key to our purposes. We revisit some of the theory on infinitely divisible matrices to show how it links to the proposed information theoretic framework. In the last part, we introduce an information theoretic supervised metric learning algorithm. We show how the proposed analogue to conditional entropy is a suitable cost function leading naturally to a gradient descent procedure. Finally, we provide some conclusions and future directions.
2 Positive Definite Matrices, and Renyi’s Entropy Axioms
Let us start with an informal observation that motivated our matrix based entropy. In [JPrincipe], the use of Renyi’s entropy is proposed as an alternative to the more commonly adopted definition of entropy given by Shannon. In particular, it was found that Renyi’s second-order entropy provides an amenable quantity for practical purposes. An empirical plug in estimator of Renyi’s second-order entropy based on the Parzen density estimator can be obtained as follows:
where . Note that since is a positive definite kernel, there exists a mapping to a RKHS such that ; and the argument of the in (1), called the information potential, can be interpreted in this space as a norm:
with the limiting case given by . Thus, we can think of this estimator as a statistic computed on the representation space provided by the positive definite kernel . Now, let us look at the case where is the Gaussian kernel; if we construct the Gram matrix with elements , it is easy to verify that the estimator of Renyi’s second-order entropy based on (1) corresponds to:
where takes care of the normalization factor of the Parzen window. As we can see, the information potential estimator can be related to the norm of the Gram matrix defined as . From the above informal argument two important questions arise. First, it seems natural to ask whether other functionals on Gram matrices allow information theoretic interpretations that can be further utilized as objective functions in ITL. Second, even though was originally derived from a convolution of Parzen windows, is there anything about the implicit representation that allows us to interpret (2) in information theoretic terms?
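The link between the information potential and the norm of the Gram matrix can be checked numerically. The sketch below is only an illustration under stated assumptions: it uses an unnormalized Gaussian kernel $\exp(-\|x-y\|^2/(2\sigma^2))$, so the Parzen normalization constants are left out, and the helper names `gaussian_gram` and `information_potential` are hypothetical. It verifies that the squared Frobenius norm $\mathrm{tr}(KK)$ of a Gaussian Gram matrix of width $\sigma$ equals the sum of the entries of the Gram matrix of width $\sigma/\sqrt{2}$, which is the quantity a pairwise (information potential) estimator would average.

```python
import numpy as np

def gaussian_gram(X, sigma):
    """Unnormalized Gaussian Gram matrix K_ij = exp(-||x_i - x_j||^2 / (2 sigma^2))."""
    sq_norms = np.sum(X ** 2, axis=1)
    sq_dists = sq_norms[:, None] + sq_norms[None, :] - 2.0 * X @ X.T
    sq_dists = np.maximum(sq_dists, 0.0)  # clip tiny negatives caused by round-off
    return np.exp(-sq_dists / (2.0 * sigma ** 2))

def information_potential(K):
    """Mean of all Gram entries, i.e. the squared norm of the mean feature vector."""
    return K.mean()

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
n = X.shape[0]
sigma = 1.5

K = gaussian_gram(X, sigma)
frob_sq = np.trace(K @ K)                        # tr(KK) = sum_ij K_ij^2
K_narrow = gaussian_gram(X, sigma / np.sqrt(2))  # squaring a Gaussian halves its variance

# Second-order entropy estimates (up to the omitted normalization constant)
renyi2_pairwise = -np.log(information_potential(K_narrow))
renyi2_gram = -np.log(frob_sq / n ** 2)
```

Both estimates coincide because squaring each entry of a Gaussian Gram matrix yields another Gaussian Gram matrix with a narrower bandwidth.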
2.1 Renyi’s Axioms for Gram matrices
Real Hermitian matrices are considered generalizations of real numbers. It is possible to define a partial ordering on this set by using positive definite matrices, which are a generalization of the positive real numbers. Let $M_n$ be the set of all $n \times n$ real matrices; for two Hermitian matrices $A, B \in M_n$, we say $A \geq B$ if $A - B$ is positive definite. Likewise, $A > B$ means that $A - B$ is strictly positive definite.
The following spectral decomposition theorem [RHorn_Topics_in_Matrix_Analysis] relates to the functional calculus on matrices and provides a reasonable way to extend continuous scalar-valued functions to Hermitian matrices.
Let be a given set and let where denotes the spectrum of . If is a continuous scalar-valued function on , then the primary matrix function
is continuous on , where , , and is unitary.
Equipped with the above result, we can define matrix functions such as for , which will be used in defining the following matrix-based analogue to Renyi’s -entropy. The functional will then be applied to Gram matrices constructed by pairwise evaluation of a positive definite kernel on the data samples.
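A primary matrix function of a symmetric matrix can be sketched directly from its eigendecomposition. The snippet below is a minimal illustration, not the authors' implementation; the function name `primary_matrix_function` is hypothetical, and eigenvalues are clipped at zero as a guard against round-off when taking powers.

```python
import numpy as np

def primary_matrix_function(A, f):
    """Apply a scalar function f to a symmetric matrix via A = U diag(lam) U^T."""
    lam, U = np.linalg.eigh(A)
    # U * f(lam) scales column j of U by f(lam_j), i.e. U @ diag(f(lam))
    return (U * f(lam)) @ U.T

rng = np.random.default_rng(0)
G = rng.normal(size=(5, 5))
A = G @ G.T
A = A / np.trace(A)  # positive semidefinite with unit trace

alpha = 2.0
A_alpha = primary_matrix_function(A, lambda lam: np.clip(lam, 0.0, None) ** alpha)
A_sqrt = primary_matrix_function(A, lambda lam: np.sqrt(np.clip(lam, 0.0, None)))
```

For integer powers the result agrees with repeated matrix multiplication, while fractional powers such as the square root are only available through the spectral route.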
Consider the set of positive definite matrices for which . It is clear that this set is closed under finite convex combinations.
Let and and also . The functional
satisfies the following set of conditions:
for any orthonormal matrix
is a continuous function for .
If ; then for the strictly monotonic and continuous function for and , we have that:
The proof of (i) easily follows from Theorem 2.1. Take now is also a unitary matrix and thus = the trace functional is invariant under unitary transformations. For (ii), the proof reduces to the continuity of . For (iii), a simple calculation yields . Now, for property (iv), notice that if , then, . Since and we can write , from which and thus (iv) is proved. Finally, (v) notice that for any integer power of we have: since . Under extra conditions such as the argument in the proof of Theorem 2.1 can be extended to this case. Since the eigen-spaces for the non-null eigenvalues of and are orthogonal we can simultaneously diagonalize and with the orthonormal matrix , that is and where and are diagonal matrices containing the eigenvalues of and respectively. Since , then . Under the extra condition , we have that yielding the desired result for (v).
Notice also that if the rank of , , the entropy for any .
It is also true that,
As we can see (5) satisfies some properties attributed to entropy. Nevertheless, such a characterization may not fully endow all unit-trace positive definite matrices with an information theoretic interpretation. Which descriptors are suitable in representing joint-spaces? What properties should be satisfied by the matrices in order to be applied to concepts that link them to random variables such as conditioning? In what follows, we address these points by developing notions of joint entropy and conditional entropy, for which, additional properties must be fulfilled. Recall that the notions of joint and conditional entropy are not only important for the above reasons, but they also provide the means to propose objective functions for learning that are based on information theoretic quantities.
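Some of the properties discussed above can be checked numerically. The sketch below assumes that the functional in (5) takes the Renyi-like form $S_\alpha(A) = \frac{1}{1-\alpha}\log_2 \mathrm{tr}(A^\alpha)$, evaluated through the eigenvalues of $A$; this form and the name `matrix_renyi_entropy` are assumptions made for illustration. The eigenvalues are clipped and renormalized as a guard against round-off.

```python
import numpy as np

def matrix_renyi_entropy(A, alpha):
    """Assumed form: S_alpha(A) = log2(tr(A^alpha)) / (1 - alpha), for alpha > 0, alpha != 1."""
    if alpha <= 0 or alpha == 1:
        raise ValueError("alpha must be positive and different from 1")
    lam = np.clip(np.linalg.eigvalsh(A), 0.0, None)
    lam = lam / lam.sum()  # enforce unit trace up to round-off
    return np.log2(np.sum(lam ** alpha)) / (1.0 - alpha)

rng = np.random.default_rng(0)
n, alpha = 8, 2.0

# Maximal case: the scaled identity has entropy log2(n)
uniform_entropy = matrix_renyi_entropy(np.eye(n) / n, alpha)

# Rank-one case: the entropy vanishes
v = rng.normal(size=n)
rank_one_entropy = matrix_renyi_entropy(np.outer(v, v) / (v @ v), alpha)

# Invariance under an orthonormal change of basis
G = rng.normal(size=(n, n))
A = G @ G.T
A = A / np.trace(A)
Q, _ = np.linalg.qr(rng.normal(size=(n, n)))
entropy_A = matrix_renyi_entropy(A, alpha)
entropy_rotated = matrix_renyi_entropy(Q @ A @ Q.T, alpha)
```

Under this assumed form, any unit-trace positive semidefinite matrix of size $n$ has an entropy between the rank-one value of zero and the scaled-identity value of $\log_2 n$.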
2.2 Hadamard Products and the Notion of Joint Entropy
Positive definite kernels are also useful in integrating multiple modalities. Using the product kernel, we can readily define the notion of joint entropy. Consider a sequence of sample pairs where and . Assume we have positive definite kernels defined on and defined on . The product kernel is a positive definite kernel on . As we can see, the Hadamard product arises as a joint representation in our matrix-based entropy. Consider two matrices and in with unit trace, for which there exists some relation between the elements and for all and . The joint entropy can be defined as:
It is then important to verify that the definition of joint entropy (8) satisfies a basic intuition about uncertainty: the joint entropy should never be smaller than any of the individual entropies of the variables that compose it. The following proposition verifies this intuition for a subset of the unit-trace positive definite matrices.
Let and be two positive definite matrices with trace with nonnegative entries, and for . Then, the following inequality holds:
2.3 Conditional Entropy as a Difference Between Entropies
The conditional entropy of given , which can be understood as the uncertainty about that remains after knowing the joint distribution of and , can be obtained from a difference between two entropies. In Shannon’s definition of conditional entropy, can be expressed as . The properties of this definition have recently been studied in the case of Renyi’s entropies [ATeixeira12], and in the matrix case this definition yields:
for positive semidefinite matrices and with nonnegative entries and unit trace, such that for all . The above quantity is nonnegative and upper bounded by . Certainly, normalization is an important property of the matrices involved in the above results. If and are normalized to have unit trace, then for it is true that the Hadamard product of
is also normalized. However, it is not always true that the resulting matrix (11) is positive definite. This product can be thought of as a weighted geometric average for which the resulting matrix will give more emphasis to either one of the matrices. However, if and satisfy a property called infinite divisibility, the product is guaranteed to be positive definite (by this, we also mean positive semidefinite).
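The joint and conditional quantities can be illustrated with Gram matrices scaled to have diagonal entries $1/n$. The following sketch rests on assumptions: the entropy is taken to be $S_\alpha(A) = \frac{1}{1-\alpha}\log_2 \mathrm{tr}(A^\alpha)$, the joint entropy is computed from the trace-normalized Hadamard product, and the conditional entropy is their difference. All function names are hypothetical.

```python
import numpy as np

def matrix_renyi_entropy(A, alpha):
    """Assumed form: S_alpha(A) = log2(tr(A^alpha)) / (1 - alpha)."""
    lam = np.clip(np.linalg.eigvalsh(A), 0.0, None)
    lam = lam / lam.sum()
    return np.log2(np.sum(lam ** alpha)) / (1.0 - alpha)

def normalized_gaussian_gram(X, sigma=1.0):
    """Gaussian Gram matrix divided by n, so that the trace is 1 and the diagonal is 1/n."""
    sq = np.sum(X ** 2, axis=1)
    d2 = np.maximum(sq[:, None] + sq[None, :] - 2.0 * X @ X.T, 0.0)
    return np.exp(-d2 / (2.0 * sigma ** 2)) / X.shape[0]

def joint_entropy(A, B, alpha):
    """Entropy of the Hadamard product, renormalized to unit trace."""
    H = A * B  # elementwise (Hadamard) product
    return matrix_renyi_entropy(H / np.trace(H), alpha)

def conditional_entropy(A, B, alpha):
    """S(A|B) as the joint entropy minus the entropy of B."""
    return joint_entropy(A, B, alpha) - matrix_renyi_entropy(B, alpha)

rng = np.random.default_rng(1)
X = rng.normal(size=(40, 2))
Y = X[:, :1] + 0.5 * rng.normal(size=(40, 1))  # Y depends on the first coordinate of X

A = normalized_gaussian_gram(X)
B = normalized_gaussian_gram(Y)
alpha = 2.0

s_a = matrix_renyi_entropy(A, alpha)
s_b = matrix_renyi_entropy(B, alpha)
s_ab = joint_entropy(A, B, alpha)
s_a_given_b = conditional_entropy(A, B, alpha)
```

With these normalizations the joint entropy is never smaller than either marginal entropy, so the conditional entropy stays nonnegative.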
3 Infinitely Divisible Functions
The theory of infinitely divisible matrices presented below is not new, but it is included because it provides a basic understanding of the role of infinitely divisible kernels in computing the above information theoretic quantities from data. To avoid confusion, let us describe the key points to bear in mind before we move to the mathematical description. Infinitely divisible kernels and negative definite functions are tied together through the exponential and logarithm functions. Both functions provide Hilbert space representations of the data. We can think of the RKHS of the infinitely divisible kernel as a representation to compute the higher-order descriptors of the data. On the other hand, the Hilbertian metric can be the representation space for which we want to compute the higher-order statistics. Normalization, as we show below, is not only important in satisfying the conditions for the information theoretic quantities already defined, but it also shows that many possible representational choices are equivalent.
3.1 Negative Definite Functions and Hilbertian Metrics
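Let $(\mathcal{X}, d)$ be a separable metric space. A necessary and sufficient condition for $(\mathcal{X}, d)$ to be embeddable in a Hilbert space is that for any set $\{x_i\}_{i=0}^{n} \subset \mathcal{X}$ of $n+1$ points, $\sum_{i,j=1}^{n} \alpha_i \alpha_j \left(d^2(x_0, x_i) + d^2(x_0, x_j) - d^2(x_i, x_j)\right) \geq 0$ for any $\alpha \in \mathbb{R}^{n}$. This condition is equivalent to $\sum_{i,j=0}^{n} \alpha_i \alpha_j d^2(x_i, x_j) \leq 0$ for any $\alpha \in \mathbb{R}^{n+1}$ such that $\sum_{i=0}^{n} \alpha_i = 0$. This condition is known as negative definiteness. Interestingly, the above condition implies that $\exp(-r\, d^2(x_i, x_j))$ is positive definite in $\mathcal{X}$ for all $r > 0$ [ISchoenberg38]. Indeed, matrices derived from functions satisfying the above property form a special class of matrices known as infinitely divisible.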
3.2 Infinitely Divisible Matrices
According to the Schur product theorem, $A \geq 0$ implies $A^{\circ n} = A \circ A \circ \cdots \circ A \geq 0$ for any positive integer $n$. Does the above hold if we take fractional powers of $A$? In other words, is the matrix $A^{\circ \frac{1}{m}} \geq 0$ for any positive integer $m$? This question leads to the concept of infinitely divisible matrices [RBhatia06, RHorn69]. A nonnegative matrix $A$ is said to be infinitely divisible if $A^{\circ r} \geq 0$ for every nonnegative $r$, where $A^{\circ r}$ denotes the entrywise power $[A_{ij}^{r}]$. Infinitely divisible matrices are intimately related to negative definiteness, as we can see from the following proposition.
If $A$ is infinitely divisible, then the matrix $B_{ij} = -\log A_{ij}$ is negative definite.
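Infinite divisibility and its link to negative definiteness can be explored numerically. The sketch below is an illustration under assumptions: it uses a Gaussian Gram matrix as the infinitely divisible example, checks entrywise (Hadamard) powers through their smallest eigenvalue, and evaluates the quadratic form of the entrywise negative logarithm on coefficients that sum to zero. The small 3x3 counterexample is a standard textbook-style matrix chosen here for illustration, not taken from the paper.

```python
import numpy as np

def min_eigenvalue(A):
    """Smallest eigenvalue of a symmetric matrix."""
    return np.linalg.eigvalsh(A).min()

rng = np.random.default_rng(0)
X = rng.normal(size=(30, 2))
sq = np.sum(X ** 2, axis=1)
sq_dists = np.maximum(sq[:, None] + sq[None, :] - 2.0 * X @ X.T, 0.0)
K = np.exp(-sq_dists / 2.0)  # Gaussian Gram matrix with strictly positive entries

# Entrywise powers K_ij^r stay positive semidefinite for every tested r > 0
powers = [0.1, 0.5, 1.0, 2.5]
min_eigs = {r: min_eigenvalue(K ** r) for r in powers}

# The entrywise -log(K) has a nonpositive quadratic form on zero-sum coefficients
B = -np.log(K)
c = rng.normal(size=X.shape[0])
c = c - c.mean()  # coefficients that sum to zero
quad_form = c @ B @ c

# A positive semidefinite matrix with nonnegative entries that is NOT infinitely divisible
C = np.array([[1.0, 1.0, 0.0],
              [1.0, 2.0, 1.0],
              [0.0, 1.0, 1.0]])
min_eig_C = min_eigenvalue(C)
min_eig_C_sqrt = min_eigenvalue(np.sqrt(C))  # entrywise square root
```

The counterexample shows why positive definiteness alone does not guarantee that fractional Hadamard powers remain positive definite.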
From this fact it is possible to relate infinitely divisible matrices with isometric embeddings into Hilbert spaces. If we construct the matrix
using the matrix from proposition 3.1. There exists a Hilbert space and a mapping such that
Moreover, notice that if is positive definite is negative definite and is infinitely divisible. In a similar way, we can construct a matrix,
with the same property (13). This relation between (12) and (14) suggests a normalization of infinitely divisible matrices with non-zero diagonal elements that can be formalized in the following theorem.
Let be a nonempty set, and let and be two metrics on it, such that for any set , , for any , and , is true for . Consider the matrices and their normalizations , defined as:
Then, if for any finite set , there exist isometrically isomorphic Hilbert spaces and , that contain the Hilbert space embeddings of the metric spaces , . Moreover, are infinitely divisible.
Figure 1 summarizes the relation between spaces that are considered in the proposed framework. The object space can be directly mapped into using an infinitely divisible kernel , or it can be mapped to a Hilbert space , if a negative definite function , is employed as the distance function. The spaces and are related by the and functions.
4 Application to Metric Learning
4.1 Adaptation Using the Matrix-Based Entropy
By definition, the matrix entropy functional (5) falls into the family of matrix functions known as spectral functions. These functions depend only on the eigenvalues of the matrix, hence their name [SFriedland81]. Using Theorem 1.1 from [ALewis96a], it is straightforward to obtain the derivative of (5) at as
where . It is important to note that this decomposition can be used to our advantage. Instead of computing the full set of eigenvectors and eigenvalues of , we can approximate the gradient of by using only a few leading eigenvalues. It is easy to see that this approximation will be optimal in the Frobenius norm .
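The derivative of a spectral function can be sanity-checked against finite differences. The sketch below assumes the functional has the form $S_\alpha(A) = \frac{1}{1-\alpha}\log_2 \mathrm{tr}(A^\alpha)$, for which the chain rule gives a gradient proportional to $U \Lambda^{\alpha-1} U^{T}$ with $A = U \Lambda U^{T}$; the explicit $\ln 2$ factor comes from the base-2 logarithm. The function names are hypothetical and this is not presented as the paper's exact expression.

```python
import numpy as np

def matrix_renyi_entropy(A, alpha):
    """Assumed form: S_alpha(A) = log2(tr(A^alpha)) / (1 - alpha), without renormalization."""
    lam = np.linalg.eigvalsh(A)
    return np.log2(np.sum(lam ** alpha)) / (1.0 - alpha)

def entropy_gradient(A, alpha):
    """Gradient of the assumed functional with respect to a symmetric matrix A."""
    lam, U = np.linalg.eigh(A)
    scale = alpha / ((1.0 - alpha) * np.log(2.0) * np.sum(lam ** alpha))
    return scale * (U * lam ** (alpha - 1.0)) @ U.T

rng = np.random.default_rng(2)
G = rng.normal(size=(6, 6))
A = G @ G.T + 0.1 * np.eye(6)
A = A / np.trace(A)        # strictly positive definite, unit trace

E = rng.normal(size=(6, 6))
E = 0.5 * (E + E.T)        # symmetric perturbation direction

alpha, h = 2.0, 1e-6
grad = entropy_gradient(A, alpha)
analytic = np.sum(grad * E)  # directional derivative <grad, E>
numeric = (matrix_renyi_entropy(A + h * E, alpha)
           - matrix_renyi_entropy(A - h * E, alpha)) / (2.0 * h)
```

Keeping only the leading eigenpairs in `entropy_gradient` would give the low-rank approximation mentioned in the text, at the cost of an approximate normalization term.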
4.2 Metric Learning Using Conditional Entropy
Here, we apply the proposed matrix framework to the problem of supervised metric learning. This problem can be formulated as follows. Given a set of points , we seek a positive semidefinite matrix , that parametrizes a Mahalanobis distance between samples as . Our goal is to find a parametrization matrix such that the conditional entropy of the labels given the projected samples with and , is minimized. This can be posed as the following optimization problem:
where the trace constraint prevents the solution from growing unbounded. We can translate this problem to our matrix-based framework in the following way. Let be the matrix representing the projected samples
and be the matrix of class co-occurrences where if and zero otherwise. The conditional entropy can be computed as , and its gradient at , which can be derived based on (24), is given by:
Finally, we can use (18) to search for iteratively.
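The behavior of the conditional entropy cost can be illustrated on a toy problem. The sketch below makes several assumptions that are not confirmed by the text: the Gram matrix of the projected samples is Gaussian and scaled by $1/n$, the co-occurrence matrix has entries $1/n$ for samples sharing a label, and the entropy has the form $\frac{1}{1-\alpha}\log_2 \mathrm{tr}(A^\alpha)$. It only evaluates the cost for two fixed projections and does not implement the gradient search itself; all names are hypothetical.

```python
import numpy as np

def matrix_renyi_entropy(A, alpha):
    """Assumed form: S_alpha(A) = log2(tr(A^alpha)) / (1 - alpha)."""
    lam = np.clip(np.linalg.eigvalsh(A), 0.0, None)
    lam = lam / lam.sum()
    return np.log2(np.sum(lam ** alpha)) / (1.0 - alpha)

def projected_gram(X, W, sigma=1.0):
    """Assumed K_ij = (1/n) exp(-||W^T x_i - W^T x_j||^2 / (2 sigma^2))."""
    Z = X @ W
    sq = np.sum(Z ** 2, axis=1)
    d2 = np.maximum(sq[:, None] + sq[None, :] - 2.0 * Z @ Z.T, 0.0)
    return np.exp(-d2 / (2.0 * sigma ** 2)) / X.shape[0]

def label_cooccurrence(labels):
    """L_ij = 1/n when samples i and j share a label, zero otherwise."""
    labels = np.asarray(labels)
    return (labels[:, None] == labels[None, :]).astype(float) / labels.size

def conditional_entropy_cost(X, labels, W, sigma=1.0, alpha=1.01):
    """Entropy of the labels given the projected samples: S(n K o L) - S(K)."""
    K = projected_gram(X, W, sigma)
    joint = K * label_cooccurrence(labels)
    return (matrix_renyi_entropy(joint / np.trace(joint), alpha)
            - matrix_renyi_entropy(K, alpha))

# Toy data: classes separated along the first axis, pure noise along the second
rng = np.random.default_rng(3)
n_per_class = 40
labels = np.repeat([0, 1], n_per_class)
first = np.where(labels == 0, -3.0, 3.0) + 0.5 * rng.normal(size=labels.size)
second = rng.normal(size=labels.size)
X = np.column_stack([first, second])

W_informative = np.array([[1.0], [0.0]])    # keeps the discriminative axis
W_uninformative = np.array([[0.0], [1.0]])  # keeps the noise axis

cost_informative = conditional_entropy_cost(X, labels, W_informative)
cost_uninformative = conditional_entropy_cost(X, labels, W_uninformative)
```

A projection that preserves class structure leaves little uncertainty about the labels, so its cost is lower than that of a projection onto the noise direction; a descent procedure would move the projection toward the former.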
UCI Data: To evaluate the results, we use the same experimental setup proposed in [JDavis07]. We compare 5 different approaches to supervised metric learning based on the classification error obtained from two-fold cross-validation using a -nearest neighbor classifier. The reported errors are average errors from 10 runs on the two folds for each algorithm; in our case the parameters are , and . The feature vectors were centered and scaled to have unit variance. Figure 2(a) shows the results of the proposed approach, conditional entropy metric learning (CEML), together with information theoretic metric learning (ITML) proposed in [JDavis07], neighborhood component analysis (NCA) from [JGoldberger04], the maximally collapsing metric learning (MCML) method from [AGloberson05], the large margin nearest neighbor (LMNN) method found in [KWeinberger05], and, as baselines, the inverse covariance and Euclidean distances. The results for the Soybean dataset are not reported since there is more than one possible data set in the UCI repository under that name.
The errors obtained by the metric learning algorithm using the proposed matrix-based entropy framework are consistently among the best performing methods included in the comparison.
Choice of order : Even though the choice of the entropy order above appears to be arbitrary, there is a motivation for choosing close to . The reason is that the higher the entropy order, the more prone the algorithm is to find unimodal solutions. This can be advantageous if prior knowledge or strong assumptions on the class distributions are taken into consideration. In our experiments, we opted for a lower entropy order to give the algorithm more flexibility in finding a good solution. To show this phenomenon experimentally, we generated a two-dimensional dataset containing points from two classes. Along one direction the classes are very well separated, but the distribution has multiple modalities. Along the orthogonal direction, the classes are not fully separable, but their distributions are unimodal. Figure 3 shows a sample with points drawn from both classes. As we can see, projecting the data onto the horizontal axis provides better separability at the cost of a more complex decision boundary.
We ran our metric learning algorithm 60 times for different values of and recorded the direction of the resulting one-dimensional feature extractor. Table 1 shows the number of times a particular direction was picked by our algorithm for different entropy orders. It can be seen that for larger values of , the algorithm selected the vertical direction more often.
UMist Faces: We also ran the algorithm on the UMist dataset. This data set consists of grayscale faces (8 bit [0-255]) of 20 different people. The total number of images is 575 and the size of each image is 112x92 pixels, for a total of 10304 dimensions. Pixel values were normalized by dividing by and removing the mean. Figure 2(b) shows the images projected into . It is remarkable how a linear projection can separate the faces, and it can also be seen from the Gram matrix that it tries to approximate the co-occurrence matrix .
In this paper, we presented a data-driven framework for information theoretic learning based on infinitely divisible matrices. We defined estimators of entropy-like quantities that can be computed from the Gram matrices obtained by evaluating infinitely divisible kernels on pairs of samples. The proposed quantities do not assume that the density of the data has been estimated, which can be advantageous in many scenarios where even defining a density is not feasible. We discussed some key properties of the proposed quantities and showed how they can be applied to define useful analogues to quantities such as conditional entropy. Based on the proposed framework, we introduced a supervised metric learning algorithm with results that are competitive with the state of the art. Nevertheless, we believe that many interesting formulations of learning problems based on the proposed framework are yet to be found. It is also important to highlight that the connection between the RKHS provided by the infinitely divisible kernel and the Hilbertian metrics associated with the negative definite functions opens an interesting avenue to investigate formulations of information theoretic learning algorithms on both spaces, and the implications of choosing one or the other.
Appendix A Additional results and proofs
To prove (9), we need to introduce the concept of majorization and some results pertaining to the ordering that arises from this definition. The proposition is replicated in this appendix for the sake of self-containment.
(Majorization): Let and be two nonnegative vectors in such that . We say , majorizes , if their respective ordered sequences and denoted by and , satisfy:
It can be shown that if then for some doubly stochastic matrix [RBhatia]. It is also easy to verify that if and then for . The majorization order is important because it can be associated with the definition of Schur-concave (convex) functions. A real valued function on is called Schur-convex if implies and Schur-concave if .
The function ( denotes the dimensional simplex), defined as,
is Schur-concave for .
Notice that Schur-concavity (Schur-convexity) should not be confused with concavity (convexity) of a function in the usual sense. Now, we are ready to state the inequality for Hadamard products.
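Majorization and Schur-concavity can be illustrated with a small numerical example. The sketch below is an illustration under assumptions: the entropy on the simplex is taken to be Renyi's $\frac{1}{1-\alpha}\log_2 \sum_i p_i^{\alpha}$, the doubly stochastic matrix is an arbitrary hand-picked example, and the helper names are hypothetical. Mixing a probability vector with a doubly stochastic matrix produces a vector that it majorizes, and the entropy does not decrease.

```python
import numpy as np

def majorizes(p, q, tol=1e-12):
    """True if p majorizes q: equal totals and dominating partial sums of sorted entries."""
    p_sorted = np.sort(p)[::-1]
    q_sorted = np.sort(q)[::-1]
    if not np.isclose(p_sorted.sum(), q_sorted.sum()):
        return False
    return bool(np.all(np.cumsum(p_sorted) >= np.cumsum(q_sorted) - tol))

def renyi_entropy(p, alpha):
    """Renyi entropy of order alpha (alpha > 0, alpha != 1) in bits."""
    return np.log2(np.sum(p ** alpha)) / (1.0 - alpha)

p = np.array([0.7, 0.2, 0.1])

# A doubly stochastic matrix: every row and every column sums to one
D = np.array([[0.6, 0.3, 0.1],
              [0.3, 0.4, 0.3],
              [0.1, 0.3, 0.6]])
q = D @ p  # q is majorized by p

orders = [0.5, 2.0, 5.0]
entropies_p = [renyi_entropy(p, a) for a in orders]
entropies_q = [renyi_entropy(q, a) for a in orders]
```

The more spread-out vector `q` has the larger entropy for every tested order, which is the behavior expected of a Schur-concave function.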
Let and be two positive definite matrices with trace with nonnegative entries, and for . Then, the following inequality holds:
In proving (9), we will use the fact that preserves the majorization order (inversely) of nonnegative sequences on the -dimensional simplex. First look at the identity
In particular, if is an orthonormal basis for , . If we let be the eigenvectors of ordered according to their respective eigenvalues in decreasing order, then,
where and are the eigenvectors of ordered according to their respective eigenvalues in decreasing order. The inequality (23) is equivalent to saying that , that is, the sequence of eigenvalues of is majorized by the sequence of eigenvalues of , which implies (9) by Lemma A.1.
A beautiful observation from Theorem 3.1 is that, according to equation (10), the proposed normalization procedure for infinitely divisible matrices can be thought of as finding the maximum entropy matrix among all matrices for which the Hilbert space embeddings are isometrically isomorphic.
A.1 Derivatives of Spectral Functions
Let denote the vector space of real Hermitian matrices of size endowed with inner product ; and let denote the set of unitary matrices. A real valued function defined on a subset of is unitarily invariant if for any . Associated with each spectral function there is a symmetric function on . By symmetric we mean that for any permutation matrix . Let denote the vector of ordered eigenvalues of ; then, a spectral function is of the form for a symmetric. We are interested in the differentiation of the composition at (here, $\circ$ denotes composition rather than the Hadamard product). The following result [ALewis96a] allows us to differentiate a spectral function at
Let the set be open and symmetric, that is, for any and any permutation matrix , . Suppose that is symmetric. Then, the spectral function is differentiable at a matrix if and only if is differentiable at the vector . In this case, the gradient of at is
for any unitary matrix satisfying .