An Information-theoretic Approach to Unsupervised Feature Selection for High-Dimensional Data

# An Information-theoretic Approach to Unsupervised Feature Selection for High-Dimensional Data

Shao-Lun Huang,  Xiangxiang Xu,  and Lizhong Zheng,  S.-L. Huang is with the Data Science and Information Technology Research Center, Tsinghua-Berkeley Shenzhen Institute, Shenzhen 518055, China (e-mail: shaolun.huang@sz.tsinghua.edu.cn).X. Xu is with the Department of Electronic Engineering, Tsinghua University, Beijing 100084, China (e-mail: xuxx14@mails.tsinghua.edu.cn).L. Zheng is with the Department of Electrical Engineering and Computer Science, Massachusetts Institute of Technology, Cambridge, MA 02139, USA (e-mail: lizhong@mit.edu).
###### Abstract

In this paper, we propose an information-theoretic approach to design the functional representations to extract the hidden common structure shared by a set of random variables. The main idea is to measure the common information between the random variables by Watanabe’s total correlation, and then find the hidden attributes of these random variables such that the common information is reduced the most given these attributes. We show that these attributes can be characterized by an exponential family specified by the eigen-decomposition of some pairwise joint distribution matrix. Then, we adopt the log-likelihood functions for estimating these attributes as the desired functional representations of the random variables, and show that such representations are informative to describe the common structure. Moreover, we design both the multivariate alternating conditional expectation (MACE) algorithm to compute the proposed functional representations for discrete data, and a novel neural network training approach for continuous or high-dimensional data. Furthermore, we show that our approach has deep connections to existing techniques, such as Hirschfeld-Gebelein-Rényi (HGR) maximal correlation, linear principal component analysis (PCA), and consistent functional map, which establishes insightful connections between information theory and machine learning. Finally, the performances of our algorithms are validated by numerical simulations.

## I Introduction

Given a set of discrete random variables with the (unknown) joint distribution , and a sequence of observed sample vectors i.i.d. generated from this joint distribution, for , our goal in this paper is to efficiently and effectively extract the hidden common information structure (or simply called common structure) shared by these random variables from the observed sample vectors. This is a typical unsupervised learning problem, and such common structures can be useful in many machine learning scenarios. As a motivating example, in the MNIST digits recognition problem [1], we often divide the images into overlapping sub-images, such as in Fig. 1, and then train feature functions on the sub-images for learning the digits. In this problem, we can view each sub-image as a random variable , and the training images as the observed data vectors. Since these sub-images are constructed from the written digits, the digit is the key common information shared by these sub-images. Therefore, effectively mining the information of the shared structure among these random variables can be helpful for recognizing the digits.

In addition, the concept of extracting common structure shared by multiple random variables or objects has also appeared or implicitly posted in several disciplines. For instance, linear principal component analysis (PCA) [2], the most widely adopted unsupervised learning technique, can be viewed as resolving a principal direction that conveys the most common randomness among different dimensions of data vectors. In addition, the consistent functional map network [3, 4, 5], a recently proposed effective approach in computer vision, takes each as a shape, and aims to find the shared components among different shapes. The main issue behind these problems is: how to design good low-dimensional functions of the random variables , such that these functional representations are effective to reveal the common structure among these random variables. This can also be viewed as the unsupervised dimension reduction problem with the particular focus on extracting the common information of random variables. In this paper, our goal is to apply the ideas from information theory to design good algorithms for finding such useful functional representations.

Our approach can be delineated in the following steps. Firstly, we want to identify the targeted random variable embedded in the random variables with some joint distribution , such that contains much information about the common structure shared by . For this purpose, we apply the Watanabe’s total correlation (or simply called the total correlation [6]) to measure the amount of information shared by multiple random variables, and then find the optimal embedded such that the reduction of the total correlation given the knowledge of is maximized. To extract the effective low-dimensional features, we restrict the information volume of about to be small, so that we can concentrate on the most “learnable” part of information about the common structures from the data. We show that in this small information rate regime of , the optimal embedded can be characterized by an exponential family induced by the largest eigenvector of a pairwise joint distribution matrix. Then, we apply the log-likelihood function of estimating from in this exponential family as the functional representation for extracting the common structure. Since is informative about the common structure and the log-likelihood function is the sufficient statistic of the observed data vectors about the target , such a functional representation is effective to extract the common structure shared by these random variables. In addition, we extend this approach to searching for a sequence of mutually independent random variables , such that the reduction of the total correlation is maximized. It turns out that the log-likelihood functions for estimating precisely correspond to the top eigenvectors of the pairwise joint distribution matrix, which establishes a decomposition of the common information between multiple random variables to principal modes of the pairwise joint distribution matrix.

Moreover, we demonstrate that these functional representations can be directly computed from the observed data vectors by a multivariate alternating conditional expectation (MACE) algorithm, which generalizes the traditional ACE algorithm [7] to more than two random variables. This offers an efficient and reliable way to compute useful functional representations from discrete data variables. Furthermore, for high-dimensional or continuous data variables such that the conditional expectations are hardly accurately estimated from the limited data samples, we show that the functional representations can be computed through neural networks by optimizing a pairwise correlation loss. This offers a novel neural network training architecture for jointly analyzing multi-modal data.

Finally, we show in Section IV that our approach shares deep connections and can be viewed as generalizations to several existing techniques, including the Hirschfeld-Gebelein-Rényi (HGR) maximal correlation [8], linear PCA, and consistent functional map network. This combines the knowledge from different domains, and offers a unified understanding for disciplines in information theory, statistics, and machine learning. We would also like to mention that the idea of studying the tradeoff between the total correlation and the common information rate was also employed in [9][10] for Gaussian vectors in caching problems, while our works investigate this tradeoff for general discrete random variables. Moreover, the correlation explanation (CorEx) introduced by [11] also applied the total correlation as the information criterion to unsupervised learning. In particular, the authors in [11] solved an optimization problem by restricting the cardinality of , and a rather complicated iterative algorithm was derived. On the other hand, in this paper we restrict the information volume contained in , which is a more natural constraint in information theory, and we obtain clean analytical solutions that can be computed by simple and efficient algorithms.

In the rest of this paper, we introduce the details of our information theoretic approach for extracting the common structure via functional representations, and present the resulted algorithm design, as well as their applications to practical problems.

## Ii The Information Theoretic Approach

Given discrete random variables with the ranges and joint distribution , we model the common structure shared by these random variables as a high-dimensional latent variable , in which the random variables are conditionally independent given , i.e., , as depicted in Fig. 2. Our goal is to learn this common structure from i.i.d. sample vectors generated from . Since the correlation between and ’s is generally complicated, it is difficult to directly identify and learn the structures of without the labels and assumptions on the generating models of ’s as in the unsupervised learning scenarios. Therefore, instead of identifying the high-dimensional latent variable , we focus on learning the low-dimensional random variable that contains much common information shared between ’s, which can be viewed as the informative attribute for the common structure.

To identify such variable, we apply the total correlation111Specifically, for random variables , the total correlation is defined as the Kullback-Leibler (K-L) divergence between the joint distribution and the product of the marginal distributions. [6] to measure the amount of common information shared between multiple random variables. Then, the amount of information that an attribute contains about the common structure shared by the random variables ’s is measured by the reduction of the total correlation given the knowledge of , defined as

 L(Xd|U)≜D(PXd∥PX1⋯PXd)−D(PXd∥PX1⋯PXd|U). (1)

Our goal is to identify the targeted random variable with the information rate constraint222Note that measures the amount of information of about the whole , while measure the amount of information only about the common structure. The constraint allows us to focus on low-dimensional attribute of , in which we typically choose to be small. , for some given , such that the reduction of the total correlation is maximized. This can be formulated as the optimization problem:

 maxPUXd L(Xd|U), (2a) subject to: I(U;Xd)≤δ. (2b)

In particular, we would like to focus on the low-rate regime of , which assumes to be small. This allows us to concentrate on the most representative low-dimensional attribute for describing the common structure. In addition, we make an extra constraint that

 minuPU(u)>γ, (3)

for some finite irrelevant to , which is natural for many machine learning problems. While the optimization problem (2) in general has no analytical solution, in the regime of small , it can be solved by a local information geometric approach, in which the optimal solutions can be specified by the eigen-decomposition of some pairwise joint distribution matrix.

### Ii-a The Local Information Geometry

To delineate our approach and results, we define the matrix from the pairwise joint distributions as

 B=⎡⎢ ⎢ ⎢ ⎢ ⎢ ⎢⎣I(1)B12⋯B1dB21I(2)⋯B2d⋮⋮⋱⋮Bd1Bd2⋯I(d)⎤⎥ ⎥ ⎥ ⎥ ⎥ ⎥⎦ (4)

where are identity matrices, for all , and are -dimensional matrices with the entry in the -th row and -th column defined as, for all ,

 Bij(xi;xj)=PXiXj(xi,xj)√PXi(xi)√PXj(xj).

The eigen-decomposition of the matrix has the following properties.

###### Lemma 1.

Let the eigenvalues and eigenvectors of the matrix be and , respectively, where is the dimensionality of . In addition, let be the -dimensional vector such that , then

1. is a positive semidefinite matrix, i.e., .

2. The largest eigenvalue with the corresponding eigenvector .

3. The second largest eigenvalue .

4. The last eigenvalues , and the subspace of the corresponding eigenvectors is spanned by the vectors , such that the scalars ’s satisfy .

5. For each , if we partition the corresponding eigenvector into -dimensional subvectors , such that

 ψ(ℓ)=⎡⎢ ⎢ ⎢⎣ψ(ℓ)1⋮ψ(ℓ)d⎤⎥ ⎥ ⎥⎦, (5)

then is orthogonal to , for all .

###### Proof.

See Appendix A. ∎

It will also be convenient to define the matrix :

 ~B≜B−d⋅ψ(0)(ψ(0))T. (6)

Then from Lemma 1, the eigenvalues of are , with the corresponding eigenvectors . Moreover, we define a collection of functions as

 f(ℓ)i(xi)=ψ(ℓ)i(xi)√PXi(xi),for all i,ℓ, (7)

where is the -th subvector of as defined in (5). Then, it follows from Lemma 1 and (7) that ’s are zero-mean functions and . In addition, these functions induce an exponential family of joint distributions for .

###### Definition 1.

Let be the set of functions with zero mean and unit variance. Then, an exponential family on is defined as

 P(δ)exp={ 1ZPU(u)PXd(xd)⋅exp(√2δh(u)√λ(1)d∑i=1f(1)i(xi)):h∈H},

where is the normalizing factor.

Note that this also defines a family of random variables embedded in corresponding to the collection of distributions in . It turns out that this exponential family characterizes the optimal solution of (2) in the regime of small , which is demonstrated as follows.

###### Theorem 1.

The optimal value of (2a) is

 maxPUXdL(Xd|U)=δ(λ(1)−1)+o(δ), (8)

which is attainable by the distributions in . Moreover, for any distribution achieving (8), there exists a distribution , such that for all ,

###### Proof.

See Appendix B. ∎

From Theorem 1, the family of random variables embedded in defined by are the set of attributes that contain the most amount of information about the common structure shared by . To extract such information from data, we consider the log-likelihood function to estimate from :

 logPXd|U=u(xd)PXd(xd)=√2δh(u)√λ(1)d∑i=1f(1)i(xi)+o(√δ). (9)

Although the log-likelihood functions for different in the exponential family may have different magnitudes due to , all of them are proportional to the functional representation of the data vectors. This can be interpreted as the 1-dimensional subspace of the functional space of that is the most informative about the shared structure. This is similar to what linear PCA [2] aims to achieve in the space of data, while we are searching for the optimal subspace of the general functional space. Later on we will show that our result is indeed a nonlinear generalization of linear PCA.

In addition, note that is the second largest eigenvector of , which maximizes over all unit vectors orthogonal to . This implies that the functions defined from (7) form the optimal solution of the joint correlation optimization problem:

 maxfi:Xi↦R, i=1,…,d E⎡⎣∑i≠jfi(Xi)fj(Xj)⎤⎦ subject to: E[fi(Xi)]=0,i=1,…,d, E[d∑i=1f2i(Xi)]=1,i=1,…,d.

Therefore, the functional representation essentially searches for a 1-dimensional subspace for each functional space of , such that the joint correlation between these subspaces is maximized. As a consequence, these subspaces and the corresponding functional representations convey much information about the common structure shared among these random variables.

### Ii-B The Informative k-dimensional Attributes

In addition to the largest eigenvector , the rest eigenvectors of essentially lead to functional representations, which correspond to informative -dimensional attributes for the common structure. To show that, we consider the optimization problem for -dimensional attribute :

 maxPUkXd L(Xd|Uk), (10)

where is as defined in (1), and the maximization is over all joint distributions such that the constituent variables with ranges , for , satisfy: 1) ; 2) for all and some constant independent of ; 3) are mutually independent variables; 4) are conditionally independent variables given . To solve the optimization problem (10), we define the following exponential family for -dimensional attributes.

###### Definition 2.

Let be the set of functions with zero mean and unit variance, for . Then, an exponential family on is defined as

 P(δ)exp,k ={1Zk[k∏j=1PUi(ui)]PXd(xd)⋅exp(√2δk0∑ℓ=1hℓ(uℓ)k0∑j=1qjℓ√λ(j)d∑i=1f(j)i(xi)) :hℓ∈Hℓ,Q=[qij]k0×k0,QTQ=Ik0k∑j=1},

where is as defined in (7), is the normalizing factor, and

 k∗≜max{i:λ(i)>1}.

Then, the exponential family characterizes the optimal solutions of (10).

###### Theorem 2.

The optimal value of (10) is

 maxPUkXd L(Xd|Uk)=δ(k0∑ℓ=1λ(ℓ)−k0)+o(δ), (11)

which is attainable by the distributions in . Moreover, for any distribution achieving (11), there exists a distribution , such that for all ,

 ∣∣PUkXd(uk,xd)−^PUkXd(uk,xd)∣∣=o(√δ).
###### Proof.

See Appendix C. ∎

Note that from Definition 2 and Theorem 2, when , the optimal solution is to design to follow the distributions in , and let the last attributes be independent of . This implies that only the top attributes can effectively reduce the total conditional correlation, which leads to an intrinsic criterion for designing the dimensionality of the attributes.

Moreover from Definition 2, the log-likelihood functions for the optimal attributes correspond to the functional representations , for . This generalizes (9) for providing the informative -dimensional representations about the common structure shared by . Furthermore, it is shown in Appendix D that the functions as defined in (7) form the optimal solution of the following optimization problem:

 maxf––i:Xi↦Rk, i=1,…,d E⎡⎣∑i≠jf––Ti(Xi)f––j(Xj)⎤⎦ (12a) subject to: E[f––i(Xi)]=0–, for all i, (12b) E[d∑i=1f––i(Xi)f––Ti(Xi)]=Ik, (12c)

where is the -dimensional identity matrix. Therefore for , the functional representations , form the -dimensional functional subspace of , such that the joint correlation between these subspaces for different ’s is maximized.

###### Example 1 (Common Bits Patterns Extraction).

Suppose that are mutually independent bits, and each random variable is a subset of these random bits, where denotes the index set. Then, our information theoretic approach essentially extracts the bit patterns that appear the most among the random variables . To show that, we define as the number of sets that include , i.e.,

 w(I)≜d∑i=11{I⊂Ii}, (13)

where is the indicator function. In addition, we denote as the subsets of with the decreasing order . Then, it is shown in Appendix E that the eigenvalues for the corresponding matrix are

 λ(ℓ)=w(Jℓ),ℓ=0,…,m−1. (14)

where is the dimensionality of . Therefore, the eigenvalue of the matrix essentially counts the number of times the corresponding bits pattern appears among the random variables , and the largest eigenvalue indicates the most appeared bits pattern. Moreover for , the corresponding functions () as defined in (7) are

 f(ℓ)i(Xi)=⎧⎪⎨⎪⎩1√w(Jℓ)∏j∈Jℓbj,if Jℓ⊂Ii,0,otherwise. (15)

Thus, the -th optimal functional representation of (cf. Section II-B) is

 d∑i=1f(ℓ)i(Xi)=√w(Jℓ)∏j∈Jℓbj,

which depends only on the bits indexed by . For instance, if , and , and , then for all subsets of , the values for the function as defined in (13) are

 w(∅)=3,w({1})=w({2})=w({3})=2, w({1,2})=w({2,3})=w({3,1})=1,w({1,2,3})=0.

Therefore, the corresponding eigenvalues of are

 λ(0)=3,λ(1)=λ(2)=λ(3)=2,λ(4)=λ(5)=λ(6)=1,λ(7)=0.

Moreover, the corresponding ’s satisfy

 3∑i=1f(ℓ)i(Xi)=√2bℓ,ℓ=1,2,3,

and

 3∑i=1f(ℓ)i(Xi)=⎧⎨⎩b1b2,ℓ=4,b2b3,ℓ=5,b3b1,ℓ=6.

## Iii The Algorithm to Compute the Functional Representation from Data

While our information theoretic approach provides a guidance for searching informative functional representations, it remains to derive the algorithm to compute these functions from observed data vectors. Intuitively, one can first estimate the empirical distribution between from the data samples, and then construct the matrix to solve the eigen-decomposition. However, this is often not feasible in practice due to: (1) there may not be enough number of samples to estimate the joint distribution accurately, (2) the dimensionality of may be extremely high especially for big data applications, so that the singular value decomposition (SVD) can not be computed directly.

### Iii-a The Multivariate Alternating Conditional Expectation (MACE) Algorithm

Alternatively, it is well-known that eigenvectors of a matrix can be efficiently computed by the power method [12]. The power method iteratively multiplies the matrix to an initial vector, and if all the eigenvalues are nonnegative, it converges to the eigenvector with respect to the largest eigenvalue with an exponential convergence rate. To apply the power method for computing the second largest eigenvector of , we choose an initial vector , such that is orthogonal to , for all . This forces to be orthogonal to , and since is positive semidefinite, the power iteration will converge to the second largest eigenvector if is not orthogonal to . Then, the algorithm iteratively computes the matrix multiplication , or equivalently

 ψi←ψi+∑j≠iBijψj, (16)

for all . Note that if we write , then as shown in [13], the step (16) is equivalent to a conditional expectation operation on functions:

 fi(Xi)←fi(Xi)+E⎡⎣∑j≠ifj(Xj)∣∣ ∣∣Xi⎤⎦, (17)

Therefore, the power method can be transferred to an algorithm based on the alternating conditional expectation (ACE) [7] algorithm as shown in Algorithm 1, which computes the optimal functional representation derived in Section II. Note that the choice of to be orthogonal to is transferred to the zero-mean choice of functions in the initialization step of the algorithm.

### Iii-B Finding k Functional Representations from Eigen-decomposition

The Algorithm 1 can be further extended to compute the top eigenvectors , and the corresponding functional representations. To design the algorithm for computing these functions, we denote the -th functional representation as , where is as defined in (7). Then, since is orthogonal to , for , the -th functional representation can be computed by the power method similar to the first functional representation , but with extra orthogonality constraints

 ⟨→f(ℓ),→f(k)⟩≜d∑i=1E[f(ℓ)i(Xi)f(k)i(Xi)]=0, for ℓ≤k−1

to maintain the orthogonality to the first functional representations. Therefore, can be computed by the power method as in Algorithm 1 with the extra step of Gram-Schmidt procedure to guarantee the orthogonality, which is illustrated in Algorithm 2. Note that the computation complexities of Algorithm 1 and Algorithm 2 are both linear to the size of the dataset, which is often much more efficient than the singular value decomposition of the matrix .

### Iii-C Generating Informative Functional Representations for High-Dimensional Data

While the Algorithm 2 generally requires less training samples than estimating the joint distribution and the matrix , in order to obtain an acceptable estimation for the conditional expectation step (17), it is still necessary to acquire training samples in the size comparable to the cardinality of the random variable . This is often difficult for high-dimensional or continuous random variables in practice. In such cases, we propose a neural network based approach to generate the informative functional representations by deep neural networks. The key idea is to note that by Eckart-Young-Mirsky theorem [14], the top eigenvectors of can be computed from the low-rank approximation problem:

 Ψ∗=minΨ∈Rm×k∥∥~B−ΨΨT∥∥2F (18)

where the columns of are the top eigenvectors of . The unconstrained optimization problem (18) leads to a training loss for generating informative functions by neural networks.

###### Proposition 1.

Let be matrices, for , such that , and define -dimensional functions , , as , where denotes the -th row of the matrix . Then, it follows that

 ∥∥~B−ΨΨT∥∥2F=∥∥~B∥∥2F−2H(f––1(X1),…,f––d(Xd)), (19)

where

 H(f––1(X1),…,f––d(Xd))≜d∑i=1d∑j=1H(f––i(Xi),f––j(Xj)),

and is defined as, for all ,

 H(f––i(Xi),f––j(Xj)) ≜E[f––Ti(Xi)f––j(Xj)]−(E[f––i(Xi)])TE[f––j(Xj)] −12tr{E[f––i(Xi)f––Ti(Xi)]E[f––j(Xj)f––Tj(Xj)]},

where denotes the trace of its matrix argument.

###### Proof.

See Appendix F. ∎

Note that coincides with the H-score [15] when the means of the functions are zero, hence we term the multivariate H-score (MH-score). Then from (19), the optimization problem (18) is equivalent to the functional optimization problem

 maxfi:Xi↦Rk, i=1,…,dH(f––1(X1),…,f––d(Xd)), (20)

for solving the informative functional representations for the common structure. The optimization problem (20) leads to a neural network training strategy. Specifically, given the training samples of , we design neural networks, where the -th neural network takes as the input and generates the representations . Then, the weights of these neural networks are trained to minimize the negative MH-score as the loss function. Finally, the informative functional representations are generated by the trained neural networks that attempts to optimize (20), as illustrated in Fig. 3.

## Iv Connections to Existing Techniques

In this section, we demonstrate the relationship between our functional representations and the Hirschfeld-Gebelein-Rényi (HGR) maximal correlation [8], linear PCA [2], and the consistent functional map [3]. This demonstrates the deep connections between our approach and existing techniques, while offering novel information theoretic interpretations to machine learning algorithms.

### Iv-a The HGR maximal correlation

The HGR maximal correlation is a variational generalization of the well-known Pearson correlation coefficient, and was originally introduced as a normalized measure of the dependence between two random variables [8].

###### Definition 3 (Maximal Correlation).

For jointly distributed random variables and , with discete ranges and respectively, the maximal correlation between and is defined as:

 ρ(X;Y)≜maxf:X↦R,g:Y↦RE[f(X)g(Y)]

where the maximum is taken over zero-mean and unit-variance functions and .

The HGR maximal correlation has been shown useful not only as a statistical measurement, but also in designing machine learning algorithms for regression problems [16][13][17]. To draw the connection, note that in the bivariate case , the functions derived in Section II are precisely the maximal correlation functions for two random variables. In addition, our functional representation for general cases essentially defines a generalized version of the maximal correlation [cf. (12)].

###### Definition 4.

The generalized maximal correlation for jointly distributed random variables with discrete ranges , for , is defined as

 ρ∗(X1,⋯,Xd)≜max1d−1E⎡⎣∑i≠jfi(Xi)fj(Xj)⎤⎦ (21)

for the functions , with the constraints , for all .

It is easy to verify that , and if and only if are pairwise independent.

Note that there are some other generalizations to maximal correlations to multiple random variables. For example, the network maximal correlation (NMC) proposed in [18] defined a correlation measurement in the same way as (21) but with a slightly different constraint: In addition, [19] proposed a maximally correlated principal component analysis, which considered the SVD of a matrix similar to . However, our approach and results essentially offer the information theoretic justification of generalizing the maximal correlation as extracting common structures shared among random variables, and also provide the guidance to algorithm designs.

### Iv-B Linear PCA

It turns out that the functional representation derived in Section II is a nonlinear generalization to the linear PCA [2]. To see that, consider a sequence of data vectors , for , where the sample mean and variance for each dimension are zero and one, respectively, i.e., , and , for all . Then, the linear PCA aims to find the principle vector with unit norm such that is maximized; or equivalently, to maximize

 1nn∑ℓ=1∑i≠j(wix(ℓ)i)(wjx(ℓ)j)=E⎡⎣∑i≠j(wiXi)⋅(wjXj)⎤⎦ (22)

subject to the constraint

 1=d∑i=1w2i=d∑i=1E[(wiXi)