An Information-Theoretic Approach to Unsupervised Feature Selection for High-Dimensional Data
Abstract
In this paper, we propose an information-theoretic approach to designing functional representations that extract the hidden common structure shared by a set of random variables. The main idea is to measure the common information between the random variables by Watanabe's total correlation, and then find the hidden attributes of these random variables such that the common information is reduced the most given these attributes. We show that these attributes can be characterized by an exponential family specified by the eigendecomposition of some pairwise joint distribution matrix. Then, we adopt the log-likelihood functions for estimating these attributes as the desired functional representations of the random variables, and show that such representations are informative for describing the common structure. Moreover, we design both a multivariate alternating conditional expectation (MACE) algorithm to compute the proposed functional representations for discrete data, and a novel neural network training approach for continuous or high-dimensional data. Furthermore, we show that our approach has deep connections to existing techniques, such as the Hirschfeld-Gebelein-Rényi (HGR) maximal correlation, linear principal component analysis (PCA), and consistent functional maps, which establishes insightful connections between information theory and machine learning. Finally, the performance of our algorithms is validated by numerical simulations.
I. Introduction
Given a set of discrete random variables $X_1, \ldots, X_k$ with an (unknown) joint distribution, and a sequence of observed sample vectors generated i.i.d. from this joint distribution, our goal in this paper is to efficiently and effectively extract the hidden common information structure (or simply the common structure) shared by these random variables from the observed sample vectors. This is a typical unsupervised learning problem, and such common structures can be useful in many machine learning scenarios. As a motivating example, in the MNIST digit recognition problem [1], we often divide the images into overlapping sub-images, as in Fig. 1, and then train feature functions on the sub-images for learning the digits. In this problem, we can view each sub-image as a random variable $X_i$, and the training images as the observed data vectors. Since these sub-images are constructed from the written digits, the digit is the key common information shared by these sub-images. Therefore, effectively mining the information of the shared structure among these random variables can be helpful for recognizing the digits.
In addition, the concept of extracting the common structure shared by multiple random variables or objects has also appeared, or been implicitly posed, in several disciplines. For instance, linear principal component analysis (PCA) [2], the most widely adopted unsupervised learning technique, can be viewed as resolving a principal direction that conveys the most common randomness among the different dimensions of the data vectors. In addition, the consistent functional map network [3, 4, 5], a recently proposed effective approach in computer vision, takes each variable as a shape, and aims to find the shared components among different shapes. The main issue behind these problems is how to design good low-dimensional functions of the random variables, such that these functional representations are effective at revealing the common structure among the random variables. This can also be viewed as an unsupervised dimension reduction problem with a particular focus on extracting the common information of random variables. In this paper, our goal is to apply ideas from information theory to design good algorithms for finding such useful functional representations.
Our approach can be delineated in the following steps. First, we want to identify a targeted random variable $U$ embedded in the random variables $X_1, \ldots, X_k$, such that $U$ contains much information about the common structure they share. For this purpose, we apply Watanabe's total correlation (or simply the total correlation [6]) to measure the amount of information shared by multiple random variables, and then find the optimal embedded variable $U$ such that the reduction of the total correlation given the knowledge of $U$ is maximized. To extract effective low-dimensional features, we restrict the information volume of $U$ about $X_1, \ldots, X_k$ to be small, so that we can concentrate on the most "learnable" part of the information about the common structure in the data. We show that in this small information rate regime, the optimal embedded $U$ can be characterized by an exponential family induced by the largest eigenvector of a pairwise joint distribution matrix. Then, we apply the log-likelihood function for estimating $U$ from $X_1, \ldots, X_k$ in this exponential family as the functional representation for extracting the common structure. Since $U$ is informative about the common structure and the log-likelihood function is a sufficient statistic of the observed data vectors about the target $U$, such a functional representation is effective for extracting the common structure shared by these random variables. In addition, we extend this approach to searching for a sequence of mutually independent random variables, such that the reduction of the total correlation is maximized. It turns out that the log-likelihood functions for estimating these variables precisely correspond to the top eigenvectors of the pairwise joint distribution matrix, which establishes a decomposition of the common information between multiple random variables into principal modes of the pairwise joint distribution matrix.
Moreover, we demonstrate that these functional representations can be directly computed from the observed data vectors by a multivariate alternating conditional expectation (MACE) algorithm, which generalizes the traditional ACE algorithm [7] to more than two random variables. This offers an efficient and reliable way to compute useful functional representations from discrete data variables. Furthermore, for high-dimensional or continuous data variables, whose conditional expectations can hardly be estimated accurately from limited data samples, we show that the functional representations can be computed by neural networks that optimize a pairwise correlation loss. This offers a novel neural network training architecture for jointly analyzing multi-modal data.
Finally, we show in Section IV that our approach shares deep connections with, and can be viewed as a generalization of, several existing techniques, including the Hirschfeld-Gebelein-Rényi (HGR) maximal correlation [8], linear PCA, and consistent functional map networks. This combines knowledge from different domains, and offers a unified understanding across information theory, statistics, and machine learning. We would also like to mention that the idea of studying the tradeoff between the total correlation and the common information rate was also employed in [9], [10] for Gaussian vectors in caching problems, while our work investigates this tradeoff for general discrete random variables. Moreover, the correlation explanation (CorEx) introduced by [11] also applied the total correlation as the information criterion for unsupervised learning. In particular, the authors in [11] solved an optimization problem by restricting the cardinality of the attribute, and a rather complicated iterative algorithm was derived. On the other hand, in this paper we restrict the information volume contained in the attribute, which is a more natural constraint in information theory, and we obtain clean analytical solutions that can be computed by simple and efficient algorithms.
In the rest of this paper, we introduce the details of our information-theoretic approach for extracting the common structure via functional representations, and present the resulting algorithm designs, as well as their applications to practical problems.
II. The Information-Theoretic Approach
Given discrete random variables $X_1, \ldots, X_k$ with finite ranges and a joint distribution, we model the common structure shared by these random variables as a high-dimensional latent variable, given which the random variables $X_1, \ldots, X_k$ are conditionally independent, as depicted in Fig. 2. Our goal is to learn this common structure from i.i.d. sample vectors generated from the joint distribution. Since the correlation between the latent variable and the $X_i$'s is generally complicated, it is difficult to directly identify and learn the structure of the latent variable without labels or assumptions on the generating models of the $X_i$'s, as in unsupervised learning scenarios. Therefore, instead of identifying the high-dimensional latent variable, we focus on learning a low-dimensional random variable $U$ that contains much of the common information shared between the $X_i$'s, which can be viewed as an informative attribute of the common structure.
To identify such a variable, we apply the total correlation [6]^{1} to measure the amount of common information shared between multiple random variables. (^{1}Specifically, for random variables $X_1, \ldots, X_k$, the total correlation is defined as the Kullback-Leibler (KL) divergence $C(X_1, \ldots, X_k) \triangleq D(P_{X_1 \cdots X_k} \| P_{X_1} \cdots P_{X_k})$ between the joint distribution and the product of the marginal distributions.) Then, the amount of information that an attribute $U$ contains about the common structure shared by the random variables $X_1, \ldots, X_k$ is measured by the reduction of the total correlation given the knowledge of $U$, defined as

$$\Delta C(U) \triangleq C(X_1, \ldots, X_k) - C(X_1, \ldots, X_k \mid U). \qquad (1)$$
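The total correlation in the footnote can be checked numerically. The sketch below computes it for a small joint pmf; the function name and the toy three-bit distribution are our own illustrative choices, not the paper's.

```python
# Sketch: total correlation C(X1,...,Xk) as the KL divergence between the
# joint distribution and the product of its marginals (footnote 1).
import numpy as np

def total_correlation(joint):
    """KL( joint || product of marginals ) in nats; joint is a k-dim pmf array."""
    k = joint.ndim
    prod = np.ones_like(joint)
    for axis in range(k):
        marg = joint.sum(axis=tuple(a for a in range(k) if a != axis))
        shape = [1] * k
        shape[axis] = -1
        prod = prod * marg.reshape(shape)
    mask = joint > 0
    return float(np.sum(joint[mask] * np.log(joint[mask] / prod[mask])))

# Toy example: X1 = X2 = X3 = one fair bit (fully dependent).
joint = np.zeros((2, 2, 2))
joint[0, 0, 0] = joint[1, 1, 1] = 0.5
print(total_correlation(joint))  # 2 * log(2), i.e., sum_i H(Xi) - H(X1,X2,X3)
```

As a sanity check, an independent joint distribution gives total correlation zero, while fully dependent fair bits give $(k-1)\log 2$.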
Our goal is to identify the targeted random variable $U$ under the information rate constraint^{2} $I(U; X_1, \ldots, X_k) \le \delta$, for some given $\delta > 0$, such that the reduction of the total correlation is maximized. (^{2}Note that $I(U; X_1, \ldots, X_k)$ measures the amount of information of $U$ about the whole of $X_1, \ldots, X_k$, while the reduction of the total correlation measures the amount of information only about the common structure. The constraint allows us to focus on low-dimensional attributes, for which we typically choose $\delta$ to be small.) This can be formulated as the optimization problem:

$$\max_{P_{U \mid X_1 \cdots X_k}} \; C(X_1, \ldots, X_k) - C(X_1, \ldots, X_k \mid U) \qquad (2a)$$
$$\text{subject to: } \; I(U; X_1, \ldots, X_k) \le \delta \qquad (2b)$$
In particular, we would like to focus on the low-rate regime, which assumes $\delta$ to be small. This allows us to concentrate on the most representative low-dimensional attributes for describing the common structure. In addition, we impose the extra constraint that
(3) 
for some finite constant irrelevant to $\delta$, which is natural for many machine learning problems. While the optimization problem (2) in general has no analytical solution, in the regime of small $\delta$ it can be solved by a local information geometric approach, in which the optimal solutions are specified by the eigendecomposition of a pairwise joint distribution matrix.
II-A. The Local Information Geometry
To delineate our approach and results, we define the matrix $\mathbf{B}$ from the pairwise joint distributions as
(4) 
where the diagonal blocks are identity matrices, and, for all $i \neq j$, the $(i,j)$-th off-diagonal block is a matrix whose entries are defined from the pairwise joint distribution of $(X_i, X_j)$ and the corresponding marginal distributions.
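The block structure of (4) can be sketched in code. The entry formula is missing from the extracted text, so the construction below ASSUMES the normalization $\tilde{B}_{ij}(x_i, x_j) = P_{X_i X_j}(x_i, x_j) / \sqrt{P_{X_i}(x_i) P_{X_j}(x_j)}$, a common choice consistent with Lemma 1's claims (e.g., the square-root-marginal vector being the top eigenvector); treat it as our assumption, not the paper's definition.

```python
# Sketch of the block matrix B in (4): identity blocks on the diagonal and
# normalized pairwise-joint blocks off the diagonal (normalization ASSUMED).
import numpy as np

def build_B(pairwise, marginals):
    """pairwise[(i, j)]: joint pmf of (Xi, Xj) as a 2-D array, for i < j.
    marginals[i]: pmf of Xi. Returns the symmetric block matrix B."""
    k = len(marginals)
    sizes = [len(m) for m in marginals]
    offs = np.concatenate(([0], np.cumsum(sizes)))
    B = np.eye(int(offs[-1]))
    for i in range(k):
        for j in range(i + 1, k):
            blk = pairwise[(i, j)] / np.sqrt(np.outer(marginals[i], marginals[j]))
            B[offs[i]:offs[i+1], offs[j]:offs[j+1]] = blk
            B[offs[j]:offs[j+1], offs[i]:offs[i+1]] = blk.T
    return B

# Toy check: two fair bits with X2 = X1.
m = [np.array([0.5, 0.5])] * 2
pw = {(0, 1): np.diag([0.5, 0.5])}
B = build_B(pw, m)
```

Under this assumed normalization, the vector of square-rooted marginals is an eigenvector of $\mathbf{B}$ with eigenvalue $k$, matching the "largest eigenvalue" statement in Lemma 1.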
The eigendecomposition of the matrix has the following properties.
Lemma 1.
Let the eigenvalues and eigenvectors of the matrix be and , respectively, where is the dimensionality of . In addition, let be the dimensional vector such that , then

is a positive semidefinite matrix, i.e., .

The largest eigenvalue with the corresponding eigenvector .

The second largest eigenvalue .

The last eigenvalues , and the subspace of the corresponding eigenvectors is spanned by the vectors , such that the scalars ’s satisfy .

For each , if we partition the corresponding eigenvector into dimensional subvectors , such that
(5) then is orthogonal to , for all .
Proof.
See Appendix A. ∎
It will also be convenient to define the matrix :
(6) 
Then from Lemma 1, the eigenvalues of are , with the corresponding eigenvectors . Moreover, we define a collection of functions as
(7) 
where the subvectors are as defined in (5). Then, it follows from Lemma 1 and (7) that these are zero-mean functions. In addition, these functions induce an exponential family of joint distributions for $X_1, \ldots, X_k$.
Definition 1.
Let be the set of functions with zero mean and unit variance. Then, an exponential family on is defined as
where is the normalizing factor.
Note that this also defines a family of random variables $U$ embedded in $X_1, \ldots, X_k$, corresponding to the collection of distributions in the family. It turns out that this exponential family characterizes the optimal solution of (2) in the regime of small $\delta$, as demonstrated by the following theorem.
Theorem 1.
Proof.
See Appendix B. ∎
From Theorem 1, the family of random variables embedded in $X_1, \ldots, X_k$ defined by this exponential family is the set of attributes that contain the largest amount of information about the common structure shared by $X_1, \ldots, X_k$. To extract such information from data, we consider the log-likelihood function for estimating $U$ from $X_1, \ldots, X_k$:
(9) 
Although the log-likelihood functions for different distributions in the exponential family may have different magnitudes, all of them are proportional to the same functional representation of the data vectors. This can be interpreted as the one-dimensional subspace of the functional space that is the most informative about the shared structure. This is similar to what linear PCA [2] aims to achieve in the space of data, whereas we search for the optimal subspace of a general functional space. Later on, we will show that our result is indeed a nonlinear generalization of linear PCA.
In addition, note that the relevant eigenvector is the second largest eigenvector of $\mathbf{B}$, which maximizes the quadratic form over all unit vectors orthogonal to the largest eigenvector. This implies that the functions defined in (7) form the optimal solution of the joint correlation optimization problem:
subject to:  
Therefore, the functional representation essentially searches for a one-dimensional subspace of the functional space of each $X_i$, such that the joint correlation between these subspaces is maximized. As a consequence, these subspaces and the corresponding functional representations convey much information about the common structure shared among the random variables.
II-B. The Informative Multi-Dimensional Attributes
In addition to the largest eigenvectors, the remaining eigenvectors of $\mathbf{B}$ lead to further functional representations, which correspond to informative multi-dimensional attributes of the common structure. To show this, we consider the optimization problem for a multi-dimensional attribute:
(10) 
where the objective is as defined in (1), and the maximization is over all joint distributions such that the constituent attribute variables satisfy: 1) an information rate constraint; 2) a common rate for each attribute, for some constant independent of the attributes; 3) the attributes are mutually independent; 4) $X_1, \ldots, X_k$ are conditionally independent given the attributes. To solve the optimization problem (10), we define the following exponential family for multi-dimensional attributes.
Definition 2.
Let be the set of functions with zero mean and unit variance, for . Then, an exponential family on is defined as
Then, the exponential family characterizes the optimal solutions of (10).
Theorem 2.
Proof.
See Appendix C. ∎
Note that, from Definition 2 and Theorem 2, when the number of attributes exceeds the number of informative modes, the optimal solution is to design the top attributes to follow the distributions in the exponential family, and to let the remaining attributes be independent of $X_1, \ldots, X_k$. This implies that only the top attributes can effectively reduce the total correlation, which leads to an intrinsic criterion for choosing the dimensionality of the attributes.
Moreover, from Definition 2, the log-likelihood functions for the optimal attributes correspond to the functional representations defined via (7). This generalizes (9), providing informative multi-dimensional representations of the common structure shared by $X_1, \ldots, X_k$. Furthermore, it is shown in Appendix D that the functions defined in (7) form the optimal solution of the following optimization problem:
(12a)  
(12b)  
(12c) 
where the matrix in (12c) is an identity matrix of the appropriate dimension. Therefore, for each $i$, the functional representations form a functional subspace of $X_i$, such that the joint correlation between these subspaces for different $i$'s is maximized.
Example 1 (Common Bit Pattern Extraction).
Suppose that we have mutually independent random bits, and each random variable $X_i$ consists of a subset of these bits specified by an index set. Then, our information-theoretic approach essentially extracts the bit patterns that appear most often among the random variables $X_1, \ldots, X_k$. To show this, we define the counting function as the number of index sets that include a given pattern, i.e.,
(13) 
where $\mathbb{1}$ denotes the indicator function. In addition, we order the subsets so that their counts are decreasing. Then, it is shown in Appendix E that the eigenvalues of the corresponding matrix $\mathbf{B}$ are
(14) 
where the first term is the dimensionality. Therefore, each eigenvalue of the matrix essentially counts the number of times the corresponding bit pattern appears among the random variables $X_1, \ldots, X_k$, and the largest eigenvalue indicates the most frequently appearing bit pattern. Moreover, the corresponding functions as defined in (7) are
(15) 
Thus, the corresponding optimal functional representation of each variable (cf. Section II-B) is
which depends only on the bits indexed by . For instance, if , and , and , then for all subsets of , the values for the function as defined in (13) are
Therefore, the corresponding eigenvalues of are
Moreover, the corresponding ’s satisfy
and
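The counting function in (13) can be sketched directly. The concrete index sets of the example are missing from the extracted text, so the sets below are hypothetical, and we ASSUME the pattern count $d(S) = |\{ i : S \subseteq S_i \}|$, i.e., the number of index sets that contain the whole pattern $S$.

```python
# Sketch of the pattern-counting function in Example 1 (assumed form):
# d(S) counts how many of the index sets S_1, ..., S_k contain the subset S.
from itertools import combinations

def d(S, index_sets):
    """Number of index sets that include every bit in the pattern S."""
    return sum(1 for Si in index_sets if S <= Si)

index_sets = [{0, 1}, {1, 2}, {1}]          # hypothetical S_1, S_2, S_3
bits = {0, 1, 2}
counts = {frozenset(S): d(set(S), index_sets)
          for r in range(1, len(bits) + 1)
          for S in combinations(sorted(bits), r)}
# Bit 1 lies in all three sets, so the singleton pattern {1} attains the
# maximal count, i.e., it would give the largest eigenvalue in (14).
```

Ordering the patterns by decreasing count then mirrors the ordering of the eigenvalues described in the example.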
III. The Algorithm to Compute the Functional Representations from Data
While our information-theoretic approach provides guidance for finding informative functional representations, it remains to derive algorithms that compute these functions from the observed data vectors. Intuitively, one could first estimate the empirical pairwise distributions from the data samples, and then construct the matrix $\mathbf{B}$ and solve its eigendecomposition. However, this is often infeasible in practice because: (1) there may not be enough samples to estimate the joint distributions accurately, and (2) the dimensionality of $\mathbf{B}$ may be extremely high, especially in big data applications, so that the singular value decomposition (SVD) cannot be computed directly.
III-A. The Multivariate Alternating Conditional Expectation (MACE) Algorithm
Alternatively, it is well-known that the eigenvectors of a matrix can be efficiently computed by the power method [12]. The power method iteratively applies the matrix to an initial vector and, if all the eigenvalues are nonnegative, converges at an exponential rate to the eigenvector corresponding to the largest eigenvalue. To apply the power method for computing the second largest eigenvector of $\mathbf{B}$, we choose an initial vector orthogonal to the largest eigenvector. Since $\mathbf{B}$ is positive semidefinite, the power iteration will then converge to the second largest eigenvector, provided the initial vector is not orthogonal to it. The algorithm then iteratively computes the matrix multiplication, or equivalently
(16) 
for all $i$. Note that, as shown in [13], the step (16) is equivalent to a conditional expectation operation on the functions:
(17) 
Therefore, the power method can be converted into an algorithm based on the alternating conditional expectation (ACE) algorithm [7], as shown in Algorithm 1, which computes the optimal functional representation derived in Section II. Note that the orthogonality of the initial vector to the largest eigenvector translates into the zero-mean initialization of the functions in the algorithm.
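The update (16)-(17) can be sketched directly on empirical samples. This is an illustrative reimplementation, not the paper's Algorithm 1 verbatim: the function name, the joint normalization step, and the fixed iteration count are our own choices.

```python
# Sketch of a MACE-style power iteration: each f_i is replaced by the
# empirical conditional expectation, given Xi, of the sum of the other
# functions, then re-centered and jointly normalized.
import numpy as np

def mace(samples, cards, iters=100, seed=0):
    """samples: (n, k) int array of observations; cards[i]: cardinality of Xi.
    Returns a list of zero-mean score vectors f_i over the range of each Xi."""
    rng = np.random.default_rng(seed)
    n, k = samples.shape
    f = [rng.standard_normal(c) for c in cards]
    f = [fi - fi.mean() for fi in f]                     # zero-mean init
    for _ in range(iters):
        scores = np.sum([f[j][samples[:, j]] for j in range(k)], axis=0)
        new_f = []
        for i in range(k):
            others = scores - f[i][samples[:, i]]        # sum over j != i
            fi = np.zeros(cards[i])
            for x in range(cards[i]):
                sel = samples[:, i] == x
                fi[x] = others[sel].mean() if sel.any() else 0.0
            # re-center under the empirical marginal of Xi
            fi -= fi @ np.bincount(samples[:, i], minlength=cards[i]) / n
            new_f.append(fi)
        norm = np.sqrt(sum((fi[samples[:, i]] ** 2).mean()
                           for i, fi in enumerate(new_f)))
        f = [fi / norm for fi in new_f]
    return f
```

On fully correlated variables, the learned scores for different $X_i$ become perfectly aligned, which is the behavior the analysis predicts for a shared common structure.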
III-B. Finding Functional Representations from the Eigendecomposition
Algorithm 1 can be further extended to compute the top eigenvectors and the corresponding functional representations. Since each new eigenvector must be orthogonal to all previously computed ones, each subsequent functional representation can be computed by the same power method as the first, but with extra orthogonality constraints to maintain orthogonality to the previously computed representations. Therefore, each representation can be computed by the power method as in Algorithm 1, with an extra Gram-Schmidt step to guarantee orthogonality, as illustrated in Algorithm 2. Note that the computational complexities of Algorithm 1 and Algorithm 2 are both linear in the size of the dataset, which is often much more efficient than the singular value decomposition of the matrix $\mathbf{B}$.
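The deflation idea behind Algorithm 2 can be sketched at the matrix level: a plain power method with a Gram-Schmidt step against the eigenvectors already found. The function below is a generic sketch (names and iteration count are ours), demonstrated on a toy positive semidefinite matrix rather than an actual $\mathbf{B}$.

```python
# Sketch: deflated power method. After each multiplication, the iterate is
# re-orthogonalized against previously found eigenvectors, so the iteration
# converges to the next largest eigenvector of a PSD matrix.
import numpy as np

def top_eigvecs(B, m, iters=500, seed=0):
    rng = np.random.default_rng(seed)
    vecs = []
    for _ in range(m):
        v = rng.standard_normal(B.shape[0])
        for _ in range(iters):
            v = B @ v
            for u in vecs:                  # Gram-Schmidt step
                v -= (u @ v) * u
            v /= np.linalg.norm(v)
        vecs.append(v)
    return np.array(vecs)

# Toy check on a PSD matrix with known spectrum {3, 2, 1}.
B = np.diag([3.0, 2.0, 1.0])
V = top_eigvecs(B, 2)
```

The per-iteration cost is one matrix-vector product plus the orthogonalization, consistent with the linear-in-data complexity claimed for Algorithm 2 when the product is realized by conditional expectations over samples.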
III-C. Generating Informative Functional Representations for High-Dimensional Data
While Algorithm 2 generally requires fewer training samples than estimating the joint distribution and the matrix $\mathbf{B}$, obtaining an acceptable estimate of the conditional expectation step (17) still requires a number of training samples comparable to the cardinality of each random variable. This is often impractical for high-dimensional or continuous random variables. In such cases, we propose generating the informative functional representations with deep neural networks. The key idea is that, by the Eckart-Young-Mirsky theorem [14], the top eigenvectors of $\mathbf{B}$ can be computed from the low-rank approximation problem:
(18) 
where the columns of the optimizing matrix are the top eigenvectors of $\mathbf{B}$. The unconstrained optimization problem (18) leads to a training loss for generating informative functions with neural networks.
Proposition 1.
Let be matrices, for , such that , and define dimensional functions , , as , where denotes the th row of the matrix . Then, it follows that
(19) 
where
and is defined as, for all ,
where denotes the trace of its matrix argument.
Proof.
See Appendix F. ∎
Note that this quantity coincides with the H-score [15] when the means of the functions are zero; hence, we term it the multivariate H-score (MH-score). Then, from (19), the optimization problem (18) is equivalent to the functional optimization problem
(20) 
for solving for the informative functional representations of the common structure. The optimization problem (20) leads to a neural network training strategy. Specifically, given the training samples, we design one neural network per variable, where the $i$-th network takes $X_i$ as input and generates its representation. The weights of these networks are then trained jointly to minimize the negative MH-score as the loss function. Finally, the informative functional representations are generated by the trained neural networks that attempt to optimize (20), as illustrated in Fig. 3.
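A training-loss sketch follows. The exact expression of Proposition 1 is garbled in the extracted text, so the form below ASSUMES a pairwise sum of H-scores, $\sum_{i<j} \big( \mathrm{tr}\,\mathrm{cov}(f_i, f_j) - \tfrac{1}{2} \mathrm{tr}\big(\mathrm{cov}(f_i)\,\mathrm{cov}(f_j)\big) \big)$, which reduces to the bivariate H-score of [15] when $k = 2$; it is a plausible stand-in, not the paper's verified formula.

```python
# Sketch of a multivariate H-score (MH-score) loss over the outputs of the
# k feature networks; training would minimize the NEGATIVE of this score.
import numpy as np

def mh_score(features):
    """features: list of (n, d) arrays, the d-dim outputs of the k networks."""
    centered = [f - f.mean(axis=0, keepdims=True) for f in features]
    score = 0.0
    for i in range(len(centered)):
        for j in range(i + 1, len(centered)):
            fi, fj = centered[i], centered[j]
            n = fi.shape[0]
            cross = (fi * fj).sum() / n              # trace of cross-covariance
            score += cross - 0.5 * np.trace((fi.T @ fi / n) @ (fj.T @ fj / n))
    return score
```

Identical unit-variance features across two networks score $+0.5$ per pair, while uncorrelated unit-variance features score $-0.5$, so maximizing the score rewards aligned, correlated representations.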
IV. Connections to Existing Techniques
In this section, we demonstrate the relationship between our functional representations and the Hirschfeld-Gebelein-Rényi (HGR) maximal correlation [8], linear PCA [2], and the consistent functional map [3]. This demonstrates the deep connections between our approach and existing techniques, while offering novel information-theoretic interpretations of machine learning algorithms.
IV-A. The HGR Maximal Correlation
The HGR maximal correlation is a variational generalization of the wellknown Pearson correlation coefficient, and was originally introduced as a normalized measure of the dependence between two random variables [8].
Definition 3 (Maximal Correlation).
For jointly distributed random variables $X$ and $Y$ with discrete ranges, the maximal correlation between $X$ and $Y$ is defined as
$$\rho(X; Y) \triangleq \max_{f, g} \mathbb{E}[f(X) g(Y)],$$
where the maximum is taken over zero-mean and unit-variance functions $f$ and $g$.
The HGR maximal correlation has been shown to be useful not only as a statistical measure, but also in designing machine learning algorithms for regression problems [16], [13], [17]. To draw the connection, note that in the bivariate case $k = 2$, the functions derived in Section II are precisely the maximal correlation functions of the two random variables. In addition, our functional representations for the general case essentially define a generalized version of the maximal correlation [cf. (12)].
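The bivariate specialization can be verified numerically via a standard fact: for discrete $X$ and $Y$, the HGR maximal correlation equals the second largest singular value of the matrix $Q(x, y) = P_{XY}(x, y)/\sqrt{P_X(x) P_Y(y)}$ (the largest singular value, equal to 1, corresponds to the constant functions). The sketch below uses this as a sanity check; the function name is ours.

```python
# Sketch: HGR maximal correlation of two discrete variables via the SVD of
# the normalized pairwise joint distribution matrix.
import numpy as np

def hgr_maximal_correlation(joint):
    px = joint.sum(axis=1)
    py = joint.sum(axis=0)
    Q = joint / np.sqrt(np.outer(px, py))
    s = np.linalg.svd(Q, compute_uv=False)
    return s[1]  # s[0] = 1 corresponds to the constant functions

# For a binary symmetric pair with crossover eps, the value is |1 - 2*eps|.
eps = 0.1
joint = np.array([[(1 - eps) / 2, eps / 2], [eps / 2, (1 - eps) / 2]])
print(hgr_maximal_correlation(joint))  # 0.8
```

For an independent pair the matrix $Q$ has rank one, so the maximal correlation is zero, matching the normalization property of the HGR measure.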
Definition 4.
The generalized maximal correlation for jointly distributed random variables $X_1, \ldots, X_k$ with discrete ranges is defined as
(21) 
for the functions , with the constraints , for all .
It is easy to verify that the generalized maximal correlation is nonnegative, and equals zero if and only if $X_1, \ldots, X_k$ are pairwise independent.
Note that there are other generalizations of the maximal correlation to multiple random variables. For example, the network maximal correlation (NMC) proposed in [18] defines a correlation measure in the same way as (21), but with a slightly different constraint. In addition, [19] proposed a maximally correlated principal component analysis, which considers the SVD of a matrix similar to $\mathbf{B}$. However, our approach and results offer an information-theoretic justification for generalizing the maximal correlation as a means of extracting the common structure shared among random variables, and also provide guidance for algorithm design.
IV-B. Linear PCA
It turns out that the functional representation derived in Section II is a nonlinear generalization of linear PCA [2]. To see this, consider a sequence of data vectors for which the sample mean and variance of each dimension are zero and one, respectively. Then, linear PCA aims to find the principal vector with unit norm such that the average squared projection of the data onto it is maximized; or equivalently, to maximize
(22) 
subject to the unit-norm constraint on the principal vector.
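The linear PCA objective (22) can be checked in a few lines: with standardized data, the unit-norm vector maximizing the average squared projection is the top eigenvector of the sample correlation matrix. The sketch below uses synthetic data of our own making.

```python
# Sketch of the linear PCA objective (22): the principal vector is the
# top eigenvector of the sample covariance of standardized data.
import numpy as np

def principal_vector(X):
    """X: (n, d) array with zero-mean, unit-variance columns."""
    n = X.shape[0]
    cov = X.T @ X / n
    w, V = np.linalg.eigh(cov)       # eigenvalues in ascending order
    return V[:, -1]

rng = np.random.default_rng(0)
z = rng.standard_normal(1000)
X = np.stack([z, z + 0.1 * rng.standard_normal(1000)], axis=1)
X = (X - X.mean(axis=0)) / X.std(axis=0)
w = principal_vector(X)              # close to (1, 1)/sqrt(2) for this data
```

In the functional-representation view of Section II, this unit-norm linear function of the data is replaced by a unit-variance function in a general functional space, which is the sense in which the proposed approach generalizes linear PCA nonlinearly.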