
On Universal Features for High-Dimensional
Learning and Inference

Shao-Lun Huang, Anuran Makur, Gregory W. Wornell, and Lizhong Zheng
Manuscript received September 2019. This work was supported in part by NSF under Grant Nos. CCF-1717610 and CCF-1816209. This work was presented in part at the Int. Symp. Inform. Theory (ISIT-2017) [1], Aachen, Germany, June 2017, the Inform. Theory and Appl. Workshop (ITA-2018), Feb. 2018 [2], and at the Inform. Theory Workshop (ITW-2018), Guangzhou, China, Nov. 2018 [3]. S.-L. Huang is with the Data Science and Information Technology Research Center, Tsinghua-Berkeley Shenzhen Institute, Shenzhen, China (Email: shaolun.huang@sz.tsinghua.edu.cn). A. Makur, G. W. Wornell, and L. Zheng are with the Department of Electrical Engineering and Computer Science, Massachusetts Institute of Technology, Cambridge, MA 02139 (Email: {a_makur, lizhong, gww}@mit.edu).
Abstract

We consider the problem of identifying universal low-dimensional features from high-dimensional data for inference tasks in settings involving learning. For such problems, we introduce natural notions of universality and show a local equivalence among them. Our analysis is naturally expressed via information geometry, and is both conceptually and computationally useful. The development reveals the complementary roles of the singular value decomposition, Hirschfeld-Gebelein-Rényi maximal correlation, the canonical correlation and principal component analyses of Hotelling and Pearson, Tishby’s information bottleneck, Wyner’s common information, Ky Fan k-norms, and Breiman and Friedman’s alternating conditional expectations algorithm. We further illustrate how this framework facilitates understanding and optimizing aspects of learning systems, including multinomial logistic (softmax) regression and the associated neural network architecture, matrix factorization methods for collaborative filtering and other applications, rank-constrained multivariate linear regression, and forms of semi-supervised learning.

Index Terms: machine learning, statistical inference, sufficient statistics, information geometry, logistic regression, neural networks, information bottleneck, maximal correlation, alternating conditional expectations algorithm, canonical correlation analysis, principal component analysis, common information, Ky Fan k-norm, matrix factorization, collaborative filtering, reduced-rank linear regression.

I Introduction

In many contemporary and emerging applications of machine learning and statistical inference, the phenomena of interest are characterized by variables defined over large alphabets. Familiar examples, among many others, include the relationship between individual consumers and products that may be of interest to them, and the relationship between images and text in a visual search setting. In such scenarios, not only are the data high-dimensional, but the collection of possible inference tasks is also large. At the same time, training data available to learn the underlying relationships is often quite limited relative to its dimensionality.

From this perspective, for a given level of training data, there is a need to understand which inference tasks can be most effectively carried out, and, in turn, what features of the data are most relevant to them. As we develop in this paper, a natural framework for addressing such questions rather broadly can be traced back to the work of Hirschfeld [4].

As we will develop, the problem can be equivalently expressed as one of “universal” feature extraction, and we show that diverse notions of such universality lead to precisely the same features. Our development emphasizes an information theoretic treatment of the associated questions, and in particular we adopt a convenient “local” information geometric analysis that provides useful insight. In turn, as we describe, the interpretation of such features in terms of a suitable singular value decomposition (SVD) facilitates their computation.

An outline of the paper, and summary of its key contributions, is as follows:

Section II

As a foundation, and focusing on finite alphabets, we describe the modal decomposition of bivariate distributions into constituent features that arises out of Hirschfeld’s analysis, developing it in terms of the SVD of a particular matrix characterization—termed the canonical dependence matrix (CDM)—of the distribution and the associated conditional expectation operator.

Section III

We describe the variational characterization of the modal decomposition in terms of standard SVD analysis, as further developed by Gebelein and Rényi, from which we obtain the resulting Hirschfeld-Gebelein-Rényi (HGR) maximal correlation as the Ky Fan k-norm of the CDM. Via this analysis, the features defining the modal decomposition are obtained via an optimization.

Section IV

As a further foundation, we describe the local geometric analysis on the probability simplex that is associated with χ²-divergence. In the resulting Euclidean information space, distributions are represented as information vectors, and features as feature vectors, and we develop an equivalence between them via log-likelihoods. Via this geometry, we develop a suitable notion of weakly dependent variables for which we obtain a decomposition of mutual information, and through which we interpret truncated modal decompositions as “information efficient.” Additionally, we characterize the error exponents in local decision making in terms of (mismatched) feature projections.

Section V

Via the local analysis, we develop several different characterizations of universal features, all of which coincide with the features that arise in the modal decomposition of the joint distribution. As an initial observation, we note that the features characterize a locally exponential family for the conditional distributions. For the remaining characterizations, we introduce latent local attribute variables. In particular: Section V-C obtains the modal decomposition features as the solution to a game between a system designer and nature, in which the system designer must choose features to detect attributes that nature chooses at random after these features are fixed; Section V-D obtains the same features as the solution to a cooperative game in which the system designer and nature seek the most detectable attributes and locally sufficient statistics for their detection; Section V-E obtains the same features as the solution to a local symmetric version of Tishby’s information bottleneck problem that seeks mutual information maximizing attributes and the associated locally sufficient statistics; and Section V-F shows that superpositions of these same features arise as locally sufficient statistics in the solution to a local version of Wyner’s common information problem, which, using variational analysis, we show specializes to the nuclear (trace) norm of the CDM. In turn, Section V-G develops the Markov structure relating the resulting common information variable to the attributes optimizing the information bottleneck.

Section VI

We discuss the estimation of universal features from training data, starting from the statistical interpretation of the orthogonal iteration method for computing an SVD as the alternating conditional expectations (ACE) algorithm of Breiman and Friedman. We include an analysis of the sample complexity of feature recovery, which supports the empirical observation that in practice the dominant modes can typically be recovered with comparatively little training data.

Section VII

We use the context of collaborative filtering to develop matrix factorization perspectives associated with the modal decomposition. In particular, we formulate the problem of collaborative filtering as one of Bayesian attribute matching, and find that the optimum such filtering is achieved using a truncated modal decomposition, which corresponds to the optimum low-rank approximation of the empirical CDM and differs from some other commonly used factorizations.

Section VIII

We analyze a local version of multinomial logistic regression; specifically, under weak dependence we show that softmax weights correspond to (normalized) conditional expectations, and that the resulting discriminative model matches, to first order, that of a Gaussian mixture without any Gaussian assumptions in the analysis. We further show that the optimizing features are, again, those of the modal decomposition, in which case the associated softmax weights are proportional to the “dual” features of the decomposition. Our analysis additionally quantifies the performance limits in this regime in terms of the associated singular values. As we discuss, this analysis implies a relationship between the ACE algorithm and methods used to train at least some classes of neural networks.

Section IX

We provide a treatment for Gaussian variables that parallels the preceding one for finite alphabets. To start, we construct the modal decomposition of covariance via the SVD of the canonical correlation matrix (CCM), and obtain the familiar formulation of Hotelling’s canonical correlation analysis (CCA) via the corresponding variational characterization. We further define a local Gaussian geometry, the associated notion of weakly correlated variables, and construct a local modal decomposition of joint distributions of such variables in terms of the CCA features, which are linear. Via Gaussian attribute models, we then show these CCA features arise in the solution to universal feature problem formulations. Section IX-H shows they arise in the solution of an attribute estimation game in which nature chooses the attribute at random after the system designer chooses the linear features from which it will be estimated using a minimum mean-square error (MMSE) criterion, and Section IX-I shows they arise in the solution of the corresponding cooperative MMSE attribute estimation game; these analyses are global. Section IX-J shows that the CCA features arise in the solution to the local symmetric version of Tishby’s Gaussian information bottleneck problem, and Section IX-K describes how superpositions of CCA features arise in the solution to the (global) Gaussian version of Wyner’s common information problem; locally, this common information is given by the nuclear norm of the CCM. Section IX-L describes the Markov relationships between the dominant attributes in the solution to the information bottleneck and the common information variable. Section IX-M interprets the features arising out of Pearson’s principal component analysis (PCA) as a special case of the preceding analyses in which the underlying variables are simultaneously diagonalizable, and Section IX-N discusses the estimation of CCA features, interpreting the associated SVD computation as a version of the ACE algorithm in which the features are linearly constrained. Section IX-O develops Gaussian attribute matching, and interprets the resulting procedure as one of optimum rank-constrained linear estimation, and Section IX-P develops a form of rank-constrained linear regression as the counterpart to softmax regression, distinguishing it from classical formulations.

Section X

We provide a limited discussion of the application of universal feature analysis to problems beyond the realm of fully supervised learning. Section X-A describes the problem of “indirect” learning in which to carry out clustering on data, relationships to secondary data are exploited to define an appropriate measure of distance. We show, in particular, that our softmax analysis implies a natural procedure in which Gaussian mixture modeling is applied to the dominant features obtained from the modal decomposition with respect to the secondary data. By contrast, Section X-B discusses the problem of partially-supervised learning in which features are learned in an unsupervised manner, and labeled data is used only to obtain the classifier based on the resulting features. As an illustration of the use of universal features in this setting, an application to handwritten digit recognition using the MNIST database is described in which the relevant features are obtained via the common information between subblocks of MNIST images. A simple implementation achieves an error probability of 3.02%, close to that of a 3-layer neural net (with 300+100 hidden nodes), which yields an error probability of 3.05%.

Finally, Section XI contains some concluding remarks.

II The Modal Decomposition of Joint Distributions

Let X and Y denote random variables over finite alphabets 𝒳 and 𝒴, respectively, with joint distribution P_{X,Y}. Without loss of generality we assume throughout that the marginals satisfy P_X(x) > 0 and P_Y(y) > 0 for all x ∈ 𝒳 and y ∈ 𝒴, since otherwise the associated symbols may be removed from their respective alphabets. Accordingly, we let relint(𝒫^{𝒳×𝒴}) denote the set of all such distributions.

For an arbitrary feature111The literature sometimes refers to these as embeddings, referring to functions of embeddings as features. However, our treatment does not require this distinction. , let be the feature induced by through conditional expectation with respect to , i.e.,

(1)

Then we can express (1) in the form

i.e.,

(2)

where we have defined

(3)

and

(4a)
(4b)

Clearly and in (4) are equivalent representations for and respectively. But in (3) is also an equivalent representation for , as we will verify shortly. Moreover, (2) expresses that has an interpretation as a conditional expectation operator, and thus is equivalent to .

Next consider an arbitrary feature , and let be the feature induced by through conditional expectation with respect to , i.e.

(5)

Then using the notation (3) and that analogous to (4), i.e.,

(6a)
(6b)

we can express (5) in the form

(7)

where is the adjoint of . Likewise is an equivalent representation for and, in turn, .

It is convenient to represent as a matrix. Specifically, we let denote the matrix whose th entry is , i.e.,

(8)

where denotes a diagonal matrix whose th diagonal entry is , where denotes a diagonal matrix whose th diagonal entry is , and where denotes the matrix whose th entry is . In [5], is referred to as the divergence transfer matrix (DTM) associated with .222The work of [5], building on [6], focuses on a communication network setting. Subsequently, [7] recognizes connections to learning that motivate aspects of, e.g., the present paper.

Although for convenience we will generally restrict our attention to the case in which the marginals P_X and P_Y are strictly positive, note that extending the DTM definition to arbitrary nonnegative marginals is straightforward. In particular, it suffices to make the column of the DTM corresponding to each x with P_X(x) = 0 all zeros, and, similarly, the row corresponding to each y with P_Y(y) = 0 all zeros, i.e., (3) is extended via

all , such that (9)

Useful alternate forms of and are [cf. (3)]

from which we obtain the alternate matrix representations

(10)
(11)

where denotes the left (column) stochastic transition probability matrix whose th entry is , and where, similarly, denotes the left (column) stochastic transition probability matrix whose th entry is .
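To make the construction concrete, the following minimal NumPy sketch (not from the paper; the joint pmf and all variable names are illustrative) builds the DTM from a joint pmf, under our reading of (3) and (8), namely that the (y, x) entry of the DTM is P_{X,Y}(x,y) / (sqrt(P_X(x)) sqrt(P_Y(y))), and verifies the equivalent transition-matrix form corresponding to (10).

```python
import numpy as np

# Hypothetical joint pmf P_{X,Y} with |X| = 3 and |Y| = 4; rows are indexed by y,
# columns by x.  This example is illustrative and not taken from the paper.
P_XY = np.array([[0.10, 0.05, 0.05],
                 [0.05, 0.15, 0.05],
                 [0.10, 0.05, 0.10],
                 [0.05, 0.05, 0.20]])
assert np.isclose(P_XY.sum(), 1.0)

P_X = P_XY.sum(axis=0)   # marginal of X (column sums)
P_Y = P_XY.sum(axis=1)   # marginal of Y (row sums)

# DTM: entry (y, x) is P_{X,Y}(x, y) / (sqrt(P_X(x)) * sqrt(P_Y(y))), cf. (3) and (8).
B_tilde = np.diag(1.0 / np.sqrt(P_Y)) @ P_XY @ np.diag(1.0 / np.sqrt(P_X))

# Equivalent form via the column-stochastic transition matrix P_{Y|X}, cf. (10).
P_Y_given_X = P_XY / P_X     # column x holds the conditional pmf P_{Y|X}(. | x)
B_tilde_alt = np.diag(1.0 / np.sqrt(P_Y)) @ P_Y_given_X @ np.diag(np.sqrt(P_X))
assert np.allclose(B_tilde, B_tilde_alt)
```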

The SVD of takes the form

(12a)
with
(12b)
where denotes the th singular value and where and are the corresponding left and right singular vectors, and where by convention we order the singular values according to
(12c)

The following proposition establishes that (and thus ) is a contractive operator, a proof of which is provided in Appendix B-A.

Proposition 1

For defined via (8) we have

(13)

where denotes the spectral (i.e., operator) norm of its matrix argument.333The spectral norm of an arbitrary matrix is where denotes the th singular value of . Moreover, in (12), the left and right singular vectors and associated with singular value

(14a)
have elements
(14b)

It follows immediately from the second part of Proposition 1 that the DTM is an equivalent representation for P_{X,Y}. Indeed, given the DTM, we can compute the singular vectors associated with its unit singular value, from which we obtain P_X and P_Y via (14b). In turn, using these marginals together with the DTM, whose entries are given by (3), yields P_{X,Y}. We provide a more complete characterization of the class of DTMs in Appendix B-B. In so doing, we extend the equivalence result above, establishing the continuity of the bijective mapping between the DTM and P_{X,Y}.
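Continuing the same illustrative example, the sketch below numerically checks Proposition 1 as we read it, along with the equivalence argument above: the largest singular value of the DTM is 1, the corresponding singular vectors have entries sqrt(P_Y(y)) and sqrt(P_X(x)), and the joint pmf is then recovered from the DTM alone.

```python
U, s, Vt = np.linalg.svd(B_tilde)
assert np.isclose(s[0], 1.0)                  # spectral norm is 1, cf. (13)-(14a)

# The dominant singular vector pair is +/- (sqrt(P_Y), sqrt(P_X)); fix the signs
# so that the entries are positive, cf. (14b).
psi0_Y = np.sign(U[0, 0]) * U[:, 0]
psi0_X = np.sign(Vt[0, 0]) * Vt[0, :]
assert np.allclose(psi0_Y, np.sqrt(P_Y))
assert np.allclose(psi0_X, np.sqrt(P_X))

# Recover P_{X,Y} from the DTM and the recovered marginals, cf. (3).
assert np.allclose(np.diag(psi0_Y) @ B_tilde @ np.diag(psi0_X), P_XY)
```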

The SVD (12) provides a key expansion of the joint distribution . In particular, we have the following result.

Proposition 2

Let and denote finite alphabets. Then for any , there exist features and , for , such that

(15)

where are as defined in (12), and where444We use the Kronecker notation

(16a)
(16b)
(16c)
(16d)

Moreover, and are related to the singular vectors in (12) according to

(17a)
(17b)

where and are the th and th entries of and , respectively.

Proof:

It suffices to note that

(18)
(19)
(20)
(21)

where to obtain (18) we have used (3), to obtain (19) we have used (12a) with (14), and where to obtain (20) we have made the choices (17), which we note satisfy the constraints (16). In particular, (16a) follows from the fact that and are orthogonal, for , and, likewise, (16b) follows from the fact that and are orthogonal, for . Finally, (16c) and (16d) follow from the remaining orthogonality relations among the and , respectively. \qed

The expansion (15) in Proposition 2 effectively forms the basis of what is sometimes referred to as “correspondence analysis,” which was originated by Hirschfeld [4] and sought to extend the applicability of the methods of Pearson [8, 9]. Such analysis was later independently developed and further extended by Lancaster [10, 11], and yet another independent development began with the work of Gebelein [12], upon which the work of Rényi [13] was based.555The associated analysis was expressed in terms of eigenvalue decompositions of instead of the SVD of , since the latter was not widely-used at the time. This analysis was reinvented again and further developed in [14, 15], which established the correspondence analysis terminology, and further interpreted in [16, 17].666This terminology includes that of “inertia,” which is also adopted in a variety of works, an example of which is [18], which refers to “principal inertial components.” Subsequent developments appear in [19, 20], and more recent expositions and developments include [21, 22], and the practical guide [23].
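The modal decomposition itself is equally easy to verify numerically. The sketch below continues the running example and assumes our reading of (15)-(17): the features are f_i(x) = psi_i^X(x)/sqrt(P_X(x)) and g_i(y) = psi_i^Y(y)/sqrt(P_Y(y)), and the joint pmf expands as P_{X,Y}(x,y) = P_X(x) P_Y(y) (1 + sum_i sigma_i f_i(x) g_i(y)).

```python
# Modal-decomposition features obtained from the SVD of the DTM, cf. (17).
K = min(len(P_X), len(P_Y))
f = np.array([Vt[i, :] / np.sqrt(P_X) for i in range(1, K)])   # f_1, ..., f_{K-1}
g = np.array([U[:, i] / np.sqrt(P_Y) for i in range(1, K)])    # g_1, ..., g_{K-1}
sigma = s[1:K]

# Normalization constraints, cf. (16): zero mean and unit second moment under the marginals.
for i in range(K - 1):
    assert np.isclose(f[i] @ P_X, 0.0) and np.isclose((f[i] ** 2) @ P_X, 1.0)
    assert np.isclose(g[i] @ P_Y, 0.0) and np.isclose((g[i] ** 2) @ P_Y, 1.0)

# The expansion (15): P_{X,Y}(x, y) = P_X(x) P_Y(y) (1 + sum_i sigma_i f_i(x) g_i(y)).
expansion = np.outer(P_Y, P_X) * (1.0 + sum(sigma[i] * np.outer(g[i], f[i])
                                            for i in range(K - 1)))
assert np.allclose(expansion, P_XY)
```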

The features (17) in (15) can be interpreted as suitably normalized sufficient statistics for inferences involving and . Indeed, since

(22a)
(22b)

it follows that777Throughout, we use the convenient sequence notation .

is a sufficient statistic for inferences about based on , i.e., we have the Markov structure

Analogously,

is a sufficient statistic for inferences about based on , i.e., we have the Markov structure

Combining these results, we have

(23)

Note that Proposition 2 has further consequences that are a direct result of its connection to the SVD of the DTM. In particular, since the left and right singular vectors are related according to

(24a)
(24b)

it follows from (17) that the and are related according to

(25a)
(25b)

for . Moreover, in turn, we obtain, for ,

(26)

The Canonical Dependence Matrix

In our development, it is convenient to remove the zeroth mode from the DTM. We do this by defining the matrix whose th entry is

(27)

where in the last equality we have expressed its SVD in terms of that for the DTM, from which we see that the CDM has singular values

where we have defined the zero singular value as a notational convenience. Note that we can interpret the CDM as the conditional expectation operator restricted to the (sub)space of zero-mean features, which produces a corresponding zero-mean feature. We refer to , which we can equivalently write in the form888As first used in Appendix B-A, we use to denote a vector of all ones (with dimension implied by context).

(28)
(29)

as the canonical dependence matrix (CDM). Some additional perspectives on this representation of the conditional expectation operator—and thus the particular choice of SVD—are provided in Appendix B-C.

It is worth emphasizing that restricting attention to features of X and Y that are zero-mean is without loss of generality, as there is an invertible mapping between any set of features and their zero-mean counterparts. As a result, we will generally impose this constraint.
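As a further sanity check on the running example, the following sketch forms the CDM by subtracting the zeroth (unit-singular-value) mode from the DTM, which is how we read (27)-(28), and confirms that its singular values are those of the DTM with the unit singular value replaced by zero.

```python
# Canonical dependence matrix (CDM): the DTM with its zeroth mode removed, cf. (28).
B = B_tilde - np.outer(np.sqrt(P_Y), np.sqrt(P_X))

s_cdm = np.linalg.svd(B, compute_uv=False)
assert np.allclose(s_cdm[:K - 1], s[1:K])     # inherits sigma_1 >= sigma_2 >= ...
assert np.isclose(s_cdm[K - 1], 0.0)          # the removed mode now contributes a zero
```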

III Variational Characterization of the Modal Decomposition

The feature functions , , in Proposition 2 can be equivalently obtained from a variational characterization, which establishes the key connection to the correlation maximization problem considered (in turn) by Hirschfeld [4], Gebelein [12], and Rényi [13], as we now develop.

III-A Variational Characterizations of the SVD

We begin by summarizing some classical variational results on the SVD that will be useful in our analysis. First, we have the following lemma (see, e.g., [24, Corollary 4.3.39, p. 248]).

Lemma 3

Given an arbitrary matrix and any , we have999We use to denote the identity matrix of appropriate dimension.

(30)

where denotes the Frobenius norm of its matrix argument,101010Specifically, the Frobenius norm of an arbitrary matrix is where denotes the th singular value of , and where denotes the trace of its matrix argument. and where denote the (ordered) singular values of . Moreover, the maximum in (30) is achieved by

(31)

with denoting the right singular vector of corresponding to , for .

Second, the following lemma, essentially due to von Neumann (see, e.g., [25] and [24, Theorem 7.4.1.1]), will also be useful in our analysis, and can be obtained using Lemma 3 in conjunction with the Cauchy-Schwarz inequality.

Lemma 4

Given an arbitrary matrix , we have

(32)

with denoting the (ordered) singular values of . Moreover, the maximum in (32) is achieved by

(33)

with and denoting the left and right singular vectors, respectively, of corresponding to , for .
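A quick numerical illustration of Lemma 4 on an arbitrary random matrix (a sketch for intuition only, not part of the development): the trace objective never exceeds the sum of the top k singular values over matrices with orthonormal columns, and attains it at the dominant singular vectors.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((5, 4))
Ua, sa, Vat = np.linalg.svd(A)
k = 2
top_k_sum = sa[:k].sum()

# Equality at the dominant left/right singular vectors, cf. (33).
assert np.isclose(np.trace(Ua[:, :k].T @ A @ Vat[:k, :].T), top_k_sum)

# Matrices with orthonormal columns (here generated via QR) never exceed the bound, cf. (32).
for _ in range(100):
    U_rand, _ = np.linalg.qr(rng.standard_normal((5, k)))
    V_rand, _ = np.linalg.qr(rng.standard_normal((4, k)))
    assert np.trace(U_rand.T @ A @ V_rand) <= top_k_sum + 1e-9
```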

III-B Maximal Correlation Features

We now have the following result, which relates the modal decomposition and correlation maximization, and reveals the role of the Ky Fan k-norms (as defined in, e.g., [24, Section 7.4.8]) in the analysis.

Proposition 5

For any , the dominant features (17) in Proposition 2, i.e.,

(34)

are obtained via111111We use to denote the Euclidean norm, i.e., for any and .

(35a)
where
(35b)
and
(35c)
(35d)

Moreover, the resulting maximal correlation is

(36)

which we note is the Ky Fan k-norm of the CDM.121212We use to denote the Ky Fan k-norm of its argument, i.e., for , (37) with denoting its singular values.

The quantity (36) is often referred to as the Hirschfeld-Gebelein-Rényi (HGR) maximal correlation associated with the distribution (particularly in the special case k = 1).

Proof:

First, note that the constraints (35c) and (35d) express (16) in Proposition 2. Next, to facilitate our development, we define [cf. (4)]

(38a)
(38b)

for We refer to and as the feature vectors associated with the feature functions and , respectively, and we further use and to denote column vectors whose th and th entries are and , respectively. Then

(39a)
with
(39b)

where the last equality in (39b) follows from the mean constraints in (35c) and (35d), which imply, for ,

In turn, from (39) we have

(40)

where

(41a)
(41b)

Moreover, from the covariance constraints in (35c) and (35d) we have

(42)

Hence, applying Lemma 4 we immediately obtain that (40) is maximized subject to (42) by the feature vectors

(43a)
(43b)

with

(44a)
(44b)

whence and as given by (17), for . The final statement of the proposition follows immediately from the properties of the SVD; specifically,

\qed
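On the running numerical example of Section II (reusing the joint pmf P_XY, the modal features f, g, and the CDM singular values s_cdm computed in the earlier sketches), the sketch below checks the conclusion of Proposition 5 as we read it: the modal features achieve an aggregate correlation equal to the Ky Fan k-norm of the CDM, and k = 1 recovers the HGR maximal correlation as the spectral norm of the CDM.

```python
# Sum of correlations E[f_i(X) g_i(Y)] of the top k modal feature pairs, computed
# directly from the joint pmf, versus the Ky Fan k-norm of the CDM, cf. (36)-(37).
k = 2
correlation = sum((P_XY * np.outer(g[i], f[i])).sum() for i in range(k))
assert np.isclose(correlation, s_cdm[:k].sum())

# k = 1: the HGR maximal correlation equals the largest singular value of the CDM.
assert np.isclose((P_XY * np.outer(g[0], f[0])).sum(), s_cdm[0])
```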

Iv Local Information Geometry

Further interpretation of the features and arising out of the modal decomposition of Section II benefits from developing the underlying inner product space. More specifically, a local analysis of information geometry leads to key information-theoretic interpretations of (17) as universal features. Accordingly, we begin with a foundation for such analysis.

Iv-a Basic Concepts, Terminology, and Notation

Let denote the space of distributions on some finite alphabet , where , and let denote the relative interior of , i.e., the subset of strictly positive distributions.

Definition 6 (-Neighborhood)

For a given , the -neighborhood of a reference distribution is the set of distributions in a (Neyman) χ²-divergence [26] ball of radius about , i.e.,

(45a)
where for and ,
(45b)

In the sequel, we assume that all the distributions of interest, including all empirical distributions that may be observed, lie in such an ε-neighborhood of the prescribed reference distribution. While we do not restrict ε to be small, most of our information-theoretic insights arise from the asymptotics corresponding to ε → 0.

An equivalent representation for a distribution is in terms of its information vector

(46)

which we note satisfies

(47)

with denoting the usual Euclidean norm.131313Specifically, for defined on , We will sometimes find it convenient to express as a -dimensional column vector , according to some arbitrarily chosen but fixed ordering of the elements of .

Hence, we can equivalently interpret the -dimensional neighborhood as the set of distributions whose corresponding information vectors lie in the unit Euclidean ball about the origin. Note that since

(48)

the -dimensional vector space subset

(49)

with denoting the usual Euclidean inner product,141414Specifically, for and defined on , characterizes all the possible information vectors: if and only if , for all sufficiently small. It is convenient to refer to as information space. When the relevant reference distribution is clear from context we will generally omit it from our notation, and simply use to refer to this space.
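The following self-contained sketch (with an arbitrary reference distribution, nearby distribution, and ε, all chosen purely for illustration) implements these local quantities as we read (45)-(49): the information vector of P with respect to P0 has entries (P(x) - P0(x)) / (ε sqrt(P0(x))), its squared Euclidean norm equals χ²(P, P0)/ε², and it is orthogonal to the vector with entries sqrt(P0(x)).

```python
import numpy as np

P0 = np.array([0.3, 0.4, 0.2, 0.1])        # reference distribution (illustrative)
P = np.array([0.32, 0.38, 0.21, 0.09])     # a nearby distribution (illustrative)
eps = 0.1

def chi_squared(P, Q):
    """Neyman chi-squared divergence: sum_x (P(x) - Q(x))^2 / Q(x), cf. (45b)."""
    return np.sum((P - Q) ** 2 / Q)

phi = (P - P0) / (eps * np.sqrt(P0))       # information vector of P w.r.t. P0, cf. (46)

assert np.isclose(phi @ phi, chi_squared(P, P0) / eps ** 2)   # cf. (47)
assert phi @ phi <= 1.0                    # P lies in the eps-neighborhood of P0
assert np.isclose(phi @ np.sqrt(P0), 0.0)  # orthogonality defining information space, cf. (48)-(49)
```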

For a feature function , we let

(50)

denote its associated feature vector. As with information vectors, we will sometimes find it convenient to express as a -dimensional column vector , according to the chosen ordering of the elements of . Moreover, there is an effective equivalence of feature vectors and information vectors, which the following proposition establishes. A proof is provided in Appendix D-A.

Proposition 7

Let be an arbitrary reference distribution, and a positive constant. Then for any distribution ,

(51)

is a feature function satisfying

(52)

and has as its feature vector the information vector of , i.e.,

(53)

Conversely, for any feature function such that (52) holds,

(54)

is a valid distribution for all sufficiently small, and has as its information vector the feature vector of , i.e.,

(55)

The following corollary of Proposition 7, specific to the case of (relative) log-likelihood feature functions, is also useful in our analysis. A proof is provided in Appendix D-B.

Corollary 8

Let be an arbitrary reference distribution and a positive constant. Then for any distribution with associated information vector , the feature vector associated with the relative log-likelihood feature function151515Throughout, all logarithms are base e, i.e., natural.

(56)

satisfies161616Note that the term has zero mean with respect to , consistent with .

(57)

Conversely, every feature function satisfying can be interpreted to first order as a (relative) log-likelihood, i.e., can be expressed in the form

(58)

as for some

A consequence of Proposition 7 is that we do not need to distinguish between feature vectors and information vectors in the underlying inner product space. Indeed, note that when without loss of generality we normalize a feature so that both (52) and

are satisfied, then we have , where is the feature vector associated with , as defined in (50).
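Continuing the previous sketch, the following checks our reading of Proposition 7 and Corollary 8: the feature function with values (P(x) - P0(x)) / (ε P0(x)) has zero mean under P0 and its feature vector is the information vector of P; conversely, P0(x)(1 + ε f(x)) recovers P; and ε f agrees with the relative log-likelihood log(P(x)/P0(x)) to first order.

```python
# Proposition 7: the feature function induced by P (our reading of (51)).
f_feat = (P - P0) / (eps * P0)
assert np.isclose(f_feat @ P0, 0.0)             # zero mean under P0, cf. (52)
assert np.allclose(np.sqrt(P0) * f_feat, phi)   # its feature vector is the information vector, cf. (53)

# Conversely, a zero-mean feature induces a valid nearby distribution, cf. (54)-(55).
P_eps = P0 * (1.0 + eps * f_feat)
assert np.allclose(P_eps, P)

# Corollary 8: log(P(x)/P0(x)) = eps * f(x) + O(eps^2) as eps -> 0.
assert np.max(np.abs(np.log(P / P0) - eps * f_feat)) < 0.01
```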

The following lemma, verified in Appendix D-C, interprets inner products between feature vectors and information vectors.

Lemma 9

For any , let be a feature function satisfying (52) with associated feature vector . Then for any and with associated information vector