A Theory of Usable Information under Computational Constraints

Abstract

We propose a new framework for reasoning about information in complex systems. Our foundation is based on a variational extension of Shannon's information theory that takes into account the modeling power and computational constraints of the observer. The resulting predictive $\mathcal{V}$-information encompasses mutual information and other notions of informativeness such as the coefficient of determination. Unlike Shannon's mutual information and in violation of the data processing inequality, $\mathcal{V}$-information can be created through computation. This is consistent with deep neural networks extracting hierarchies of progressively more informative features in representation learning. Additionally, we show that by incorporating computational constraints, $\mathcal{V}$-information can be reliably estimated from data even in high dimensions with PAC-style guarantees. Empirically, we demonstrate that predictive $\mathcal{V}$-information is more effective than mutual information for structure learning and fair representation learning.


1 Introduction

Extracting actionable information from noisy, possibly redundant, and high-dimensional data sources is a key computational and statistical challenge at the core of AI and machine learning. Information theory, which lies at the foundation of AI and machine learning, provides a conceptual framework to characterize information in a mathematically rigorous sense (Shannon and Weaver, 1948; Cover and Thomas, 1991). However, important computational aspects are not considered in information theory. To illustrate this, consider a dataset of encrypted messages intercepted from an opponent. According to information theory, these encrypted messages have high mutual information with the opponent's plans. Indeed, with infinite computation, the messages can be decrypted and the plans revealed. Modern cryptography originated from this observation by Shannon that perfect secrecy is (essentially) impossible if the adversary is computationally unbounded (Shannon and Weaver, 1948). This motivated cryptographers to consider restricted classes of adversaries that have access to limited computational resources (Pass and Shelat, 2010). More generally, it is known that information-theoretic quantities can be expressed in terms of betting games (Cover and Thomas, 1991). For example, the (conditional) entropy of a random variable $X$ is directly related to how predictable $X$ is in a certain betting game, where an agent is rewarded for correct guesses. Yet, the standard definition unrealistically assumes agents are computationally unbounded, i.e., they can employ arbitrarily complex prediction schemes.

Leveraging modern ideas from variational inference and learning (Ranganath et al., 2013; Kingma and Welling, 2013; LeCun et al., 2015), we propose an alternative formulation based on realistic computational constraints that is in many ways closer to our intuitive notion of information, which we term predictive $\mathcal{V}$-information. Without constraints, predictive $\mathcal{V}$-information specializes to classic mutual information. Under natural restrictions, $\mathcal{V}$-information specializes to other well-known notions of predictiveness, such as the coefficient of determination ($R^2$). A consequence of this new formulation is that computation can "create usable information" (e.g., by decrypting the intercepted messages), invalidating the famous data processing inequality. This generalizes the idea that clever feature extraction enables prediction with extremely simple (e.g., linear) classifiers, a key notion in modern representation and deep learning (LeCun et al., 2015).

As an additional benefit, we show that predictive $\mathcal{V}$-information can be estimated with statistical guarantees using the Probably Approximately Correct framework (Valiant, 1984). This is in sharp contrast with Shannon information, which is well known to be difficult to estimate for high-dimensional or continuous random variables (Battiti, 1994). Theoretically, we show that the statistical guarantees for estimating $\mathcal{V}$-information translate into statistical guarantees for a variant of the Chow-Liu algorithm for structure learning. In practice, when the observer employs deep neural networks as the prediction scheme, $\mathcal{V}$-information outperforms methods that approximate Shannon information in various applications, including Chow-Liu tree construction in high dimensions and gene regulatory network inference.

2 Definitions and Notations

To formally define the predictive $\mathcal{V}$-information, we begin with a formal model of a computationally bounded agent trying to predict the outcome of a real-valued random variable $Y$; the agent is either provided another real-valued random variable $X$ as side information, or provided no side information (denoted $\varnothing$). We use $\mathcal{X}$ and $\mathcal{Y}$ to denote the sample spaces of $X$ and $Y$ respectively (while assuming they are separable), and use $\mathcal{P}(\mathcal{Y})$ to denote the set of all probability measures over the Borel algebra on $\mathcal{Y}$ ($\mathcal{P}(\mathcal{X})$ is defined similarly for $\mathcal{X}$).

Definition 1 (Predictive Family).
Let $\Omega = \{f : \mathcal{X} \cup \{\varnothing\} \to \mathcal{P}(\mathcal{Y})\}$. We say that $\mathcal{V} \subseteq \Omega$ is a predictive family if it satisfies

$\forall f \in \mathcal{V},\ \forall P \in \mathrm{range}(f),\ \exists f' \in \mathcal{V},\ \text{s.t.}\ \forall x \in \mathcal{X},\ f'[x] = P,\ f'[\varnothing] = P.$   (1)

A predictive family $\mathcal{V}$ is a set of predictive models the agent is allowed to use, e.g., due to computational or statistical constraints. We refer to the additional condition in Eq.(1) as optional ignorance. Intuitively, it means that the agent can, in the context of the prediction game we define next, ignore the side information if she chooses to.
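To make this concrete, here is a minimal Python sketch (names and parameterization are our own, not from the paper) of one element of a possible predictive family: a fixed-variance Gaussian whose mean is a linear function of the side information, together with the "ignore the side information" counterpart that optional ignorance requires to also be in the family.

```python
import numpy as np

class LinearGaussianPredictor:
    """One element f of a predictive family V (illustrative, not from the paper).

    f maps side information x (or None, i.e. "no side information") to a
    Gaussian over y with fixed variance: f[x] = N(W x + b, sigma^2 I).
    """

    def __init__(self, W, b, sigma=1.0):
        self.W, self.b, self.sigma = W, b, sigma

    def predict_dist(self, x=None):
        # f[None] ignores the side information and returns N(b, sigma^2 I).
        mean = self.b if x is None else self.W @ x + self.b
        return mean, self.sigma

    def log_prob(self, y, x=None):
        mean, sigma = self.predict_dist(x)
        d = y.size
        return (-0.5 * np.sum((y - mean) ** 2) / sigma**2
                - 0.5 * d * np.log(2 * np.pi * sigma**2))

def ignore_side_information(f):
    """Optional ignorance: for any f in V, the model that outputs f[None]
    regardless of x should also be in V (here: set W = 0, keep b)."""
    return LinearGaussianPredictor(np.zeros_like(f.W), f.b, f.sigma)

f = LinearGaussianPredictor(W=np.ones((1, 2)), b=np.zeros(1))
print(f.log_prob(np.array([1.0]), x=np.array([0.4, 0.6])),
      ignore_side_information(f).log_prob(np.array([1.0])))
```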

Definition 2 (Predictive conditional $\mathcal{V}$-entropy).

Let $X, Y$ be two random variables taking values in $\mathcal{X} \times \mathcal{Y}$, and $\mathcal{V}$ be a predictive family. Then the predictive conditional $\mathcal{V}$-entropy is defined as

$H_{\mathcal{V}}(Y \mid X) = \inf_{f \in \mathcal{V}} \mathbb{E}_{x,y \sim X,Y}\left[-\log f[x](y)\right], \qquad H_{\mathcal{V}}(Y \mid \varnothing) = \inf_{f \in \mathcal{V}} \mathbb{E}_{y \sim Y}\left[-\log f[\varnothing](y)\right].$

We additionally call $H_{\mathcal{V}}(Y \mid \varnothing)$ the $\mathcal{V}$-entropy, and also denote it as $H_{\mathcal{V}}(Y)$.

In our notation, $f$ is a function $\mathcal{X} \cup \{\varnothing\} \to \mathcal{P}(\mathcal{Y})$, so $f[x]$ is a probability measure on $\mathcal{Y}$ chosen based on the received side information $x$ (we use $f[x]$ instead of the more conventional $f(x)$); and $f[x](y)$ is the value of the density of $f[x]$ evaluated at $y$. Intuitively, the (conditional) $\mathcal{V}$-entropy is the smallest expected negative log-likelihood that can be achieved predicting $Y$ given observation (side information) $X$ (or no side information $\varnothing$), using models from $\mathcal{V}$. Eq.(1) means that whenever the agent can use a distribution $P$ to predict $Y$'s outcomes, it has the option to ignore the input and use $P$ no matter whether $X$ is observed or not.

Definition 2 generalizes several known definitions of uncertainty. For example, as shown in Proposition 2.1, if $\mathcal{V}$ is the largest possible predictive family that includes all possible models, i.e. $\mathcal{V} = \Omega$, then Definition 2 reduces to Shannon entropy: $H_{\mathcal{V}}(Y) = H(Y)$ and $H_{\mathcal{V}}(Y \mid X) = H(Y \mid X)$. By choosing more restrictive families $\mathcal{V}$, we recover several other notions of uncertainty, such as the trace of the covariance matrix, as will also be shown in Proposition 2.1.

Shannon mutual information is a measure of changes in entropy when conditioning on new variables:

$I(X; Y) = H(Y) - H(Y \mid X)$   (2)

Here, we will use predictive $\mathcal{V}$-entropy to define an analogous quantity, $I_{\mathcal{V}}(X \to Y)$, to represent the change in predictability of an output variable $Y$ when given side information $X$.

Definition 3 (Predictive $\mathcal{V}$-information).

Let $X, Y$ be two random variables taking values in $\mathcal{X} \times \mathcal{Y}$, and $\mathcal{V}$ be a predictive family. The predictive $\mathcal{V}$-information from $X$ to $Y$ is defined as

$I_{\mathcal{V}}(X \to Y) = H_{\mathcal{V}}(Y \mid \varnothing) - H_{\mathcal{V}}(Y \mid X).$   (3)
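As an entirely illustrative instance of Definition 3, the sketch below estimates $I_{\mathcal{V}}(X \to Y)$ from samples when $\mathcal{V}$ is a fixed-variance Gaussian family whose mean is either a constant or a linear function of $x$; both infima in Definition 2 then reduce to least-squares fits. The function name and the unit-variance convention are assumptions made for this sketch.

```python
import numpy as np

def v_information_linear_gaussian(x, y, sigma=1.0):
    """Monte Carlo sketch of I_V(X -> Y) (Def. 3) when V is the family of
    fixed-variance Gaussians whose mean is either a constant or a linear
    function of x. Illustrative only. x: (N, dx), y: (N, dy)."""
    n, dy = y.shape
    const = 0.5 * dy * np.log(2 * np.pi * sigma**2)

    # H_V(Y | no side information): the best constant mean is the sample mean.
    resid0 = y - y.mean(axis=0)
    h_y = const + 0.5 * np.mean(np.sum(resid0**2, axis=1)) / sigma**2

    # H_V(Y | X): the best linear mean via least squares (bias column appended).
    X1 = np.hstack([x, np.ones((n, 1))])
    W, *_ = np.linalg.lstsq(X1, y, rcond=None)
    resid = y - X1 @ W
    h_y_given_x = const + 0.5 * np.mean(np.sum(resid**2, axis=1)) / sigma**2

    return h_y - h_y_given_x  # I_V(X -> Y), non-negative for this family

# Example: Y depends linearly on X, so the restricted family can exploit X.
rng = np.random.default_rng(0)
x = rng.normal(size=(1000, 3))
y = x @ rng.normal(size=(3, 2)) + 0.1 * rng.normal(size=(1000, 2))
print(v_information_linear_gaussian(x, y))
```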

2.1 Important Special Cases

Several important notions of uncertainty and predictiveness are special cases of our definition. Note that when we are defining the $\mathcal{V}$-entropy of a random variable $Y$ in sample space $\mathcal{Y}$ (without side information), out of convenience we can assume $\mathcal{X}$ is empty (this does not conflict with the requirements of Definition 1).

Proposition 2.1.

For $\mathcal{V}$-entropy and $\mathcal{V}$-information, we have

  1. Let $\mathcal{V} = \Omega$ be as in Def. 1. Then $H_{\mathcal{V}}(Y)$ is the Shannon entropy $H(Y)$, $H_{\mathcal{V}}(Y \mid X)$ is the Shannon conditional entropy $H(Y \mid X)$, and $I_{\mathcal{V}}(X \to Y)$ is the Shannon mutual information $I(X; Y)$.

  2. Let $\mathcal{Y} = \mathbb{R}$ and $\mathcal{V} = \{f : f[x] = f[\varnothing] = Q_{\mu},\ \mu \in \mathbb{R}\}$, where $Q_{\mu}$ is the distribution with density $q_{\mu}(y) \propto e^{-|y - \mu|}$; then the $\mathcal{V}$-entropy of a random variable equals its mean absolute deviation, up to an additive constant.

  3. Let $\mathcal{Y} = \mathbb{R}^d$ and $\mathcal{V} = \{f : f[x] = f[\varnothing] = \mathcal{N}(\mu, \tfrac{1}{2}I),\ \mu \in \mathbb{R}^d\}$; then the $\mathcal{V}$-entropy of a random variable $Y$ equals the trace of its covariance $\mathrm{tr}(\mathrm{Cov}(Y))$, up to an additive constant.

  4. Let $\mathcal{V} = \{f : f[x] = f[\varnothing] = P_{\theta},\ \theta \in \Theta\}$, where $P_{\theta}$ is a distribution in a minimal exponential family with sufficient statistics $T(y)$ and set of natural parameters $\Theta$. For a random variable $Y$ with expected sufficient statistics $\mathbb{E}[T(Y)]$, the $\mathcal{V}$-entropy of $Y$ is the maximum Shannon entropy over all random variables with identical expected sufficient statistics, i.e. $H_{\mathcal{V}}(Y) = \sup\{H(\tilde{Y}) : \mathbb{E}[T(\tilde{Y})] = \mathbb{E}[T(Y)]\}$.

  5. Let $\mathcal{Y} = \mathbb{R}$, $\mathcal{X}$ be any vector space, and $\mathcal{V} = \{f : f[x] = \mathcal{N}(\phi(x), \tfrac{1}{2}),\ f[\varnothing] = \mathcal{N}(\mu, \tfrac{1}{2}),\ \phi \in \Phi,\ \mu \in \mathbb{R}\}$, where $\Phi$ is the set of linear functions $\mathcal{X} \to \mathbb{R}$; then the $\mathcal{V}$-information equals the (unnormalized) maximum coefficient of determination for linear regression.

The trace of covariance represents a natural notion of uncertainty: for example, a random variable with zero variance (i.e., $\mathrm{tr}(\mathrm{Cov}(Y)) = 0$) is trivial to predict. Proposition 2.1.3 shows that the trace of covariance corresponds to a notion of surprise (in the Shannon sense) for an agent restricted to make predictions using certain Gaussian models. More broadly, a similar analogy can be drawn for other exponential families of distributions. In the same spirit, the coefficient of determination, also known as the fraction of variance explained, represents a natural notion of informativeness for computationally bounded agents. Also note that in the case of Proposition 2.1.4, the $\mathcal{V}$-entropy is invariant if the expected sufficient statistics remain the same.
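The sketch below is a small numerical sanity check of the flavor of Proposition 2.1.3; it assumes one concrete parameterization (Gaussians with covariance $\tfrac{1}{2}I$, chosen so that the additive-constant claim is exact) and is not the paper's proof.

```python
import numpy as np

rng = np.random.default_rng(1)
d, n = 4, 200_000
A = rng.normal(size=(d, d))
y = rng.normal(size=(n, d)) @ A.T            # Cov(Y) = A @ A.T

# V-entropy for f[x] = f[no side info] = N(mu, 0.5 * I): the optimal mu is E[Y],
# and E[-log f(y)] = E||y - E[Y]||^2 + (d/2) * log(pi).
mu = y.mean(axis=0)
v_entropy = np.mean(np.sum((y - mu) ** 2, axis=1)) + 0.5 * d * np.log(np.pi)

trace_cov = np.trace(np.cov(y, rowvar=False))
print(v_entropy - trace_cov)                 # ~ (d/2) * log(pi), the additive constant
```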

3 Properties of $\mathcal{V}$-information

3.1 Elementary Properties

We first show several elementary properties of $\mathcal{V}$-entropy and $\mathcal{V}$-information. In particular, $\mathcal{V}$-information preserves many properties of Shannon information that are desirable in a machine learning context. For example, mutual information (and $\mathcal{V}$-information) should be non-negative, as conditioning on additional side information should not reduce an agent's ability to predict $Y$.

Proposition 3.1.

Let $X$ and $Y$ be any random variables on $\mathcal{X}$ and $\mathcal{Y}$, and let $\mathcal{U}$ and $\mathcal{V}$ be any predictive families. Then we have

  1. Monotonicity: If $\mathcal{U} \subseteq \mathcal{V}$, then $H_{\mathcal{U}}(Y) \geq H_{\mathcal{V}}(Y)$ and $H_{\mathcal{U}}(Y \mid X) \geq H_{\mathcal{V}}(Y \mid X)$.

  2. Non-Negativity: $I_{\mathcal{V}}(X \to Y) \geq 0$.

  3. Independence: If $X$ is independent of $Y$, then $I_{\mathcal{V}}(X \to Y) = 0$.

The optional ignorance requirement in Eq.(1) is a technical condition needed for these properties to hold. Intuitively, it guarantees that conditioning on side information does not restrict the class of densities the agent can use to predict $Y$. This property is satisfied by many existing machine learning models, often by setting some weights to zero so that an input is effectively ignored.

3.2 On the production of information through preprocessing

The Data Processing Inequality guarantees that computing on data cannot increase its mutual information with other random variables. Formally, letting $t$ be any function, $t(X)$ cannot have higher mutual information with $Y$ than $X$ does: $I(Y; t(X)) \leq I(Y; X)$. But is this property desirable? In analyzing optimal communication, yes: it demonstrates a fundamental limit to the number of bits that can be transmitted through a communication channel. However, we argue that in machine learning settings this property is less appropriate.

Consider an RSA encryption scheme where the public key is known. Given plain text and its corresponding encrypted text, if we have infinite computation, we can perfectly compute one from the other. Therefore, the plain text and the encrypted text should have identical Shannon mutual information with respect to any label we want to predict. However, to any human (or machine learning algorithm), it is certainly easier to predict the label from the plain text than from the encrypted text. In other words, decryption increases a human's ability to predict the label: processing increases the "usable information". More formally, denoting the encrypted text by $X$, the label by $Y$, the decryption algorithm by $t$, and a class of natural language processing functions by $\mathcal{V}$, we have that $I_{\mathcal{V}}(t(X) \to Y) > I_{\mathcal{V}}(X \to Y)$.

As another example, consider the mutual information between an image's pixels and its label. Due to the data processing inequality, we cannot expect to use a function to map raw pixels to "features" that have higher mutual information with the label. However, the fundamental principle of representation learning is precisely the ability to learn predictive features, that is, functions of the raw inputs that enable predictions with higher accuracy. Because of this key difference between $\mathcal{V}$-information and Shannon information, machine learning practices such as representation learning can be justified in an information-theoretic context.
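The following toy sketch (our own construction, not an experiment from the paper) makes this concrete: for a linear-in-features Gaussian family $\mathcal{V}$, the raw inputs carry essentially no $\mathcal{V}$-information about a product-structured label, while a one-line preprocessing function that computes the product makes the label perfectly predictable.

```python
import numpy as np

def linear_gaussian_v_information(features, y, sigma=1.0):
    # I_V under fixed-variance Gaussians with a linearly parameterized mean:
    # (MSE of the best constant fit - MSE of the best linear fit) / (2 * sigma^2).
    n = len(y)
    X1 = np.hstack([features, np.ones((n, 1))])
    W, *_ = np.linalg.lstsq(X1, y, rcond=None)
    mse_lin = np.mean(np.sum((y - X1 @ W) ** 2, axis=1))
    mse_const = np.mean(np.sum((y - y.mean(axis=0)) ** 2, axis=1))
    return (mse_const - mse_lin) / (2 * sigma**2)

rng = np.random.default_rng(2)
x = rng.normal(size=(5000, 2))
y = (x[:, 0] * x[:, 1])[:, None]             # label is a product of the raw inputs

raw = linear_gaussian_v_information(x, y)                          # ~ 0
processed = linear_gaussian_v_information(x[:, :1] * x[:, 1:], y)  # ~ Var(y) / 2
print(raw, processed)                        # preprocessing "creates" usable information
```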

3.3 On the asymmetry of predictive $\mathcal{V}$-information

$\mathcal{V}$-information also captures the intuition that sometimes it is easy to predict $Y$ from $X$ but not vice versa. In fact, modern cryptography is founded on the assumption that certain functions are one-way, meaning that there exists a polynomial-time algorithm to compute the function but no polynomial-time algorithm to compute its inverse. This means that if $Y = g(X)$ for a one-way function $g$ and $\mathcal{V}$ contains only models computable in polynomial time, then $I_{\mathcal{V}}(X \to Y)$ can be much larger than $I_{\mathcal{V}}(Y \to X)$.

This property is also reasonable in the machine learning context. For example, several important methods for causal discovery (Peters et al., 2017) rely on this asymmetry: if $X$ causes $Y$, then it is usually easier to predict $Y$ from $X$ than vice versa; another commonly used assumption is that the conditional $p(y \mid x)$ can be accurately modeled by a Gaussian distribution, while $p(x \mid y)$ cannot (Pearl, 2000).

4 PAC Guarantees for $\mathcal{V}$-information Estimation

For many practical applications of mutual information (e.g., structure learning), we do not know the joint distribution of $X$ and $Y$, so we cannot directly compute the mutual information. Instead, we only have samples from the joint distribution and need to estimate mutual information from data.

Shannon information is notoriously difficult to estimate for high-dimensional random variables. Although non-parametric estimators of mutual information exist (Kraskov et al., 2004; Darbellay and Vajda, 1999; Gao et al., 2017), these estimators do not scale to high dimensions. Several variational estimators for Shannon information have been proposed recently (van den Oord et al., 2018; Nguyen et al., 2010; Belghazi et al., 2018), but they have two shortcomings: due to their variational assumptions, their bias/variance tradeoffs are poorly understood, and they are still not efficient enough for high-dimensional problems. For example, the CPC estimator suffers from large bias, since its estimates saturate at $\log K$, where $K$ is the batch size (van den Oord et al., 2018; Poole et al., 2019); the NWJ estimator suffers from large variance that grows at least exponentially in the ground-truth mutual information (Song and Ermon, 2019). Please see Appendix B for more details and proofs.

On the other hand, $\mathcal{V}$-information is explicit about its assumptions (a feature rather than a bug). $\mathcal{V}$-information is also easy to estimate with guarantees if we can bound the complexity of $\mathcal{V}$ (such as its Rademacher or covering-number complexity). As we will show, bounds on the complexity of $\mathcal{V}$ directly translate to PAC (Valiant, 1984) bounds for $\mathcal{V}$-information estimation. In practice, we can efficiently optimize over $\mathcal{V}$, e.g., via gradient descent. In this paper we present the Rademacher complexity version; other complexity measures (such as covering numbers) can be handled similarly.
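As a minimal illustration of optimizing over $\mathcal{V}$ by gradient descent (our own sketch; the MLP-Gaussian family, function names, and hyperparameters are illustrative assumptions, not the paper's setup):

```python
import math
import torch
import torch.nn as nn

def fit_entropy(x, y, hidden=64, epochs=300, lr=1e-2):
    """Approximate the V-entropy inf_f E[-log f[x](y)] by gradient descent,
    where V is a family of unit-variance Gaussians whose mean is an MLP of x.
    Purely illustrative; returns the final average negative log-likelihood."""
    net = nn.Sequential(nn.Linear(x.shape[1], hidden), nn.ReLU(),
                        nn.Linear(hidden, y.shape[1]))
    opt = torch.optim.Adam(net.parameters(), lr=lr)
    for _ in range(epochs):
        opt.zero_grad()
        nll = (0.5 * ((y - net(x)) ** 2).sum(dim=1).mean()
               + 0.5 * y.shape[1] * math.log(2 * math.pi))
        nll.backward()
        opt.step()
    return nll.item()

torch.manual_seed(0)
x = torch.randn(2000, 5)
y = torch.sin(x[:, :2]) + 0.1 * torch.randn(2000, 2)
h_y = fit_entropy(torch.zeros(2000, 1), y)   # constant input = ignore side information
h_y_given_x = fit_entropy(x, y)
print(h_y - h_y_given_x)                     # estimated I_V(X -> Y) in nats
```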

Definition 4 (Empirical $\mathcal{V}$-information).

Let $X, Y$ be two random variables taking values in $\mathcal{X} \times \mathcal{Y}$, and let $\mathcal{D} = \{(x_i, y_i)\}_{i=1}^N$ denote a set of samples drawn i.i.d. from the joint distribution of $X$ and $Y$. Let $\mathcal{V}$ be a predictive family. The empirical $\mathcal{V}$-information (under $\mathcal{D}$) is the $\mathcal{V}$-information under the empirical distribution defined via $\mathcal{D}$:

$\hat{I}_{\mathcal{V}}(X \to Y; \mathcal{D}) = \inf_{f \in \mathcal{V}} \frac{1}{N}\sum_{i=1}^{N}\left[-\log f[\varnothing](y_i)\right] - \inf_{f \in \mathcal{V}} \frac{1}{N}\sum_{i=1}^{N}\left[-\log f[x_i](y_i)\right].$   (4)
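A minimal sketch of a plug-in estimate in the spirit of Definition 4, for the special case of a discrete target and a multinomial-logistic predictive family; this family choice is ours for illustration, whereas the paper's experiments use richer families such as neural networks.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import log_loss

def empirical_v_information(x, y):
    """Plug-in estimate in the spirit of Definition 4 for a discrete target,
    with V = multinomial logistic models plus all constant (x-ignoring)
    categorical distributions. For an honest estimate, fit on one split
    and evaluate on another; this sketch uses a single dataset."""
    classes = np.unique(y)
    p_marginal = np.array([(y == c).mean() for c in classes])
    h_y = log_loss(y, np.tile(p_marginal, (len(y), 1)), labels=classes)

    clf = LogisticRegression(max_iter=1000).fit(x, y)
    h_y_given_x = log_loss(y, clf.predict_proba(x), labels=classes)
    return h_y - h_y_given_x                 # in nats

rng = np.random.default_rng(3)
x = rng.normal(size=(2000, 5))
y = (x[:, 0] + 0.5 * rng.normal(size=2000) > 0).astype(int)
print(empirical_v_information(x, y))
```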

Then we have the following PAC bound on the empirical $\mathcal{V}$-information:

Theorem 1.

Assume that $\forall f \in \mathcal{V},\ x \in \mathcal{X},\ y \in \mathcal{Y}$, $|\log f[x](y)| \leq B$. Then, for any $\delta \in (0, 1)$, with probability at least $1 - \delta$, we have:

$\left|\hat{I}_{\mathcal{V}}(X \to Y; \mathcal{D}) - I_{\mathcal{V}}(X \to Y)\right| \leq 2\,\mathcal{R}_N(\mathcal{G}_{\mathcal{V}}) + 2B\sqrt{\frac{2\log(1/\delta)}{N}}$   (5)

where we define the function family $\mathcal{G}_{\mathcal{V}} = \{g : (x, y) \mapsto \log f[x](y) \mid f \in \mathcal{V}\} \cup \{g : (x, y) \mapsto \log f[\varnothing](y) \mid f \in \mathcal{V}\}$, and $\mathcal{R}_N(\mathcal{G}_{\mathcal{V}})$ denotes the Rademacher complexity of $\mathcal{G}_{\mathcal{V}}$ with sample number $N$.

Typically, the Rademacher complexity term satisfies $\mathcal{R}_N(\mathcal{G}_{\mathcal{V}}) = O(1/\sqrt{N})$ (Bartlett and Mendelson, 2001; Gao and Zhou, 2016). It is worth noting that an overly complex function family (i.e., one with large Rademacher complexity) could lead to overfitting, while an overly simple $\mathcal{V}$ may not be expressive enough to capture the relationship between $X$ and $Y$. As an example of the theorem, we provide a concrete estimation bound when $\mathcal{V}$ is chosen to be the set of linear functions mapping to the mean of a Gaussian distribution; this was shown in Proposition 2.1 to lead to the coefficient of determination.

Corollary 1.

Assume the inputs and the linear coefficients have bounded norms, and denote the sample size by $N$. Then, for any $\delta \in (0, 1)$, with probability at least $1 - \delta$, the empirical $\mathcal{V}$-information deviates from $I_{\mathcal{V}}(X \to Y)$ by at most $O\!\left(\sqrt{\log(1/\delta)/N}\right)$.

Similar results can be obtained using other classes of machine learning models with known (Rademacher) complexity.

5 Structure learning with $\mathcal{V}$-information

Among the many possible applications of $\mathcal{V}$-information, we show how to use it to perform structure learning with provable guarantees. The goal of structure learning is to learn a directed graphical model (Bayesian network) or undirected graphical model (Markov network) that best captures the (conditional) independence structure of an underlying data-generating process. Structure learning is difficult in general, but if we restrict ourselves to a certain set of graphs $\mathcal{G}$, there are efficient algorithms. In particular, the Chow-Liu algorithm (Chow and Liu, 1968) can efficiently learn tree graphs (i.e., $\mathcal{G}$ is the set of trees). Chow and Liu (1968) show that the problem can be reduced to:

$\max_{T \in \mathcal{G}} \sum_{(i, j) \in \mathrm{edges}(T)} I(X_i; X_j)$   (6)

where $I(X_i; X_j)$ is the Shannon mutual information between variables $X_i$ and $X_j$. In other words, it suffices to construct the maximal weighted spanning tree, where the weight between two vertices is their Shannon mutual information. Chow and Wagner (1973) show that the Chow-Liu algorithm is consistent, i.e., it recovers the true solution as the dataset size goes to infinity. However, the finite-sample behavior of the Chow-Liu algorithm for high-dimensional problems is much less studied, due to the difficulty of estimating mutual information. In fact, we show in our experiments that the empirical performance is often poor, even with state-of-the-art estimators. Additionally, methods based on mutual information cannot take advantage of intrinsically asymmetric relationships, which are common, for example, in gene regulatory networks (Meyer et al., 2007).

To address these issues, we propose a new structure learning algorithm based on $\mathcal{V}$-information instead of Shannon information. The idea is that we can associate to each directed edge $(i, j)$ (i.e., each ordered pair of variables) a suitable predictive family $\mathcal{V}_{ij}$ (cf. Def 1). The main challenge is that we cannot simply replace mutual information with $\mathcal{V}$-information in Eq. 6 because $\mathcal{V}$-information is asymmetric, so we now have to optimize over directed trees:

$\max_{T \in \mathcal{T}} \sum_{j} I_{\mathcal{V}_{\pi_T(j)\,j}}\!\left(X_{\pi_T(j)} \to X_j\right)$   (7)

where $\mathcal{T}$ is the set of directed trees, $\pi_T$ is the function mapping each non-root node of directed tree $T$ to its parent, and $\mathcal{V}_{ij}$ is the predictive family for random variables $X_i$ and $X_j$. After estimating the $\mathcal{V}$-information on each edge, we use the Chu-Liu algorithm (Chu and Liu, 1965) to construct the maximal directed spanning tree. This allows us to solve (7) exactly, even though there is a combinatorially large number of trees to consider. Pseudocode is summarized in Algorithm 1 in the Appendix. We show in the following theorem that, unlike the original Chow-Liu algorithm, our algorithm has guarantees in the finite-sample regime, even in continuous settings:
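The combinatorial step can be sketched with networkx's implementation of the Chu-Liu/Edmonds algorithm; the pairwise $\mathcal{V}$-information matrix below is made up for illustration and would in practice come from an estimator such as the one sketched after Definition 4.

```python
import numpy as np
import networkx as nx

def max_directed_spanning_tree(v_info):
    """Given v_info[i, j] ~ I_V(X_i -> X_j), return the edges of the
    maximum-weight directed spanning tree (Chu-Liu/Edmonds), i.e. the
    combinatorial step behind Eq. (7)."""
    m = v_info.shape[0]
    G = nx.DiGraph()
    for i in range(m):
        for j in range(m):
            if i != j:
                G.add_edge(i, j, weight=float(v_info[i, j]))
    tree = nx.maximum_spanning_arborescence(G, attr="weight")
    return sorted(tree.edges())

# Toy example: a chain 0 -> 1 -> 2 with mildly asymmetric edge scores.
v_info = np.array([[0.0, 1.0, 0.2],
                   [0.8, 0.0, 1.1],
                   [0.1, 0.9, 0.0]])
print(max_directed_spanning_tree(v_info))    # expect [(0, 1), (1, 2)]
```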

Theorem 2.

Let $\{X_1, \ldots, X_m\}$ be a set of $m$ random variables and let $\mathcal{D}$ be a set of $N$ samples drawn i.i.d. from their joint distribution. Denote the directed tree with maximum expected total edge weight as $T^*$ and the optimal directed tree constructed on the dataset $\mathcal{D}$ as $\hat{T}$. Then, under the assumption of Theorem 1, for any $\delta \in (0, 1)$, with probability at least $1 - \delta$, we have:

$\sum_{j} I_{\mathcal{V}_{\pi_{T^*}(j)\,j}}\!\left(X_{\pi_{T^*}(j)} \to X_j\right) - \sum_{j} I_{\mathcal{V}_{\pi_{\hat{T}}(j)\,j}}\!\left(X_{\pi_{\hat{T}}(j)} \to X_j\right) \leq 2(m-1)\left(2\max_{i \neq j}\mathcal{R}_N(\mathcal{G}_{\mathcal{V}_{ij}}) + 2B\sqrt{\frac{2\log(m^2/\delta)}{N}}\right)$   (8)

Theorem 2 shows that the total edge weight of the maximal directed spanning tree constructed by Algorithm 1 is close to the optimal total edge weight whenever the Rademacher term is small. Although a larger family $\mathcal{V}$ does not necessarily lead to better Chow-Liu trees, empirically we find that the optimal tree in the sense of equation (7) is consistent with the optimal tree in equation (6) under commonly used $\mathcal{V}$.

6 Experimental results

6.1 Structure learning with continuous high-dimensional data

We generate synthetic data using various ground-truth tree structures with varying numbers of variables, where each variable is 10-dimensional. We use Gaussians, Exponentials, and Uniforms as ground-truth edge-conditionals. We use $\mathcal{V}$-information (Gaussian) and $\mathcal{V}$-information (Logistic) to denote Algorithm 1 with two different families $\mathcal{V}$. Please refer to Appendix D.1 for more details. We compare with the original Chow-Liu algorithm equipped with state-of-the-art mutual information estimators: CPC (van den Oord et al., 2018), NWJ (Nguyen et al., 2010), and MINE (Belghazi et al., 2018), using the same neural network architecture as the $\mathcal{V}$-families for a fair comparison. All experiments are repeated 10 times. As a performance metric, we use the wrong-edges-ratio (the fraction of edges that differ from the ground truth) as a function of the amount of training data.

We show two illustrative experiments in Figure 4(a); please refer to Appendix D.1 for all simulations. We can see that although the two $\mathcal{V}$-families used are misspecified with respect to the true underlying (conditional) distributions, the estimated Chow-Liu trees are much more accurate across all data regimes, with CPC (blue) being the best alternative. Surprisingly, $\mathcal{V}$-information (Gaussian) works consistently well in all cases and only requires about 100 samples to recover the ground-truth Chow-Liu tree in simulation-A.

Figure 4: (a) Chow-Liu tree construction: expected wrong-edges-ratio of Algorithm 1 with different $\mathcal{V}$ and of the mutual-information-estimator-based algorithms, as a function of sample size. (b) Gene regulatory network inference: AUC curves. (c) Predictive $\mathcal{V}$-information versus frame distance.

6.2 Gene regulatory network inference

Mutual information between pairs of gene expressions is often used to construct gene regulatory networks. We evaluate $\mathcal{V}$-information on the in-silico dataset from the DREAM5 challenge (Marbach et al., 2012) and use the setup of Gao et al. (2017), where 20 genes with 660 datapoints are utilized to evaluate all methods. We compare with state-of-the-art non-parametric Shannon mutual information estimators in this low-dimensional setting: KDE, the traditional kernel density estimator; the KSG estimator (Kraskov et al., 2004); the Mixed KSG estimator (Gao et al., 2017); and Partitioning, an adaptive partitioning estimator (Darbellay and Vajda, 1999) implemented by Szabó (2014). For a fair comparison with these low-dimensional estimators, we select $\mathcal{V}$ to be a Gaussian family whose mean is given by $g(x)$, where $g$ is a 3rd-order polynomial.
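A sketch of such a low-dimensional family (our own rendering; the exact parameterization used in the paper may differ): a Gaussian whose mean is a 3rd-order polynomial of the scalar input, scored in both directions for a candidate regulatory pair.

```python
import numpy as np

def poly_gaussian_v_information(x, y, degree=3, sigma=1.0):
    """I_V(X -> Y) for scalar x, y with V = Gaussians whose mean is a
    polynomial of x (plus the constant predictors). Illustrative sketch."""
    X = np.vander(x, degree + 1)             # columns [x^3, x^2, x, 1]
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    mse_poly = np.mean((y - X @ coef) ** 2)
    mse_const = np.var(y)
    return (mse_const - mse_poly) / (2 * sigma**2)

# Directed scores for a candidate regulatory pair (x_i -> x_j and back):
rng = np.random.default_rng(4)
xi = rng.normal(size=660)
xj = np.tanh(2 * xi) + 0.3 * rng.normal(size=660)
print(poly_gaussian_v_information(xi, xj), poly_gaussian_v_information(xj, xi))
```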

The task is to predict whether a directed edge between genes exists in the ground-truth gene network. We use the estimated mutual information and $\mathcal{V}$-information for gene pairs as the test statistic to obtain the AUC for the various methods. As shown in Figure 4(b), our method outperforms all other methods in network inference under different fractions of data used for estimation. The natural information measure in this task is asymmetric, since the goal is to find the ordered pairs of genes in which one gene regulates the other; thus $\mathcal{V}$-information is more suitable for this setting than mutual information.

6.3 Recovering the order of video frames

Let $X_1, \ldots, X_{20}$ be random variables, each representing a frame in videos from the Moving-MNIST dataset, which contains 10,000 sequences, each of length 20, showing two digits moving with stochastic dynamics. Can Algorithm 1 be used to recover the natural (causal) order of the frames? Intuitively, predictability should be inversely related to frame distance, thus enabling structure learning. Using a conditional PixelCNN++ (Salimans et al., 2017) as the predictive family $\mathcal{V}$, we show in Figure 4(c) that predictive $\mathcal{V}$-information does indeed decrease with frame distance, despite some fluctuations when the frame distances are large. Using Algorithm 1 to construct a Chow-Liu tree, we find that the tree perfectly recovers the relative order of the frames.

We also generate a Deterministic-Moving-MNIST dataset, where digits move according to deterministic dynamics. From the perspective of Shannon mutual information, every pair of frames has the same mutual information. Hence, the standard Chow-Liu tree learning algorithm would fail to discover the natural ordering of the frames (causal structure). In contrast, once we constrain the observer to PixelCNN++ models, Algorithm 1 with predictive $\mathcal{V}$-information can still recover the order of different frames when the frame distances are relatively small (less than 9). Compared to the stochastic-dynamics case, the $\mathcal{V}$-information is more irregular with increasing frame distance, since the PixelCNN++ tends to overfit.

6.4 Information theoretic approaches to fairness

The goal of fair representation learning is to map inputs $X$ to a feature space $Z$ such that the mutual information between $Z$ and some sensitive attribute $U$ (such as race or gender) is minimized. The motivation is that, using $Z$ (instead of $X$) as input, we can no longer use the sensitive attributes to make decisions, thus ensuring some notion of fairness. Existing methods obtain fair representations by optimizing against an "adversarial" discriminator so that the discriminator cannot predict $U$ from $Z$ (Edwards and Storkey, 2015; Louizos et al., 2015; Madras et al., 2018; Song et al., 2018). Under some assumptions, we show in Appendix D.2 that these works actually use $\mathcal{V}$-information minimization as part of their objective, where $\mathcal{V}$ depends on the functional form of the discriminator.

However, it is clear from the $\mathcal{V}$-information perspective that features trained with $\mathcal{V}_i$-information minimization might not generalize to a different family $\mathcal{V}_j$, and vice versa. To illustrate this, we use a function family $\mathcal{V}_j$ as the attacker to extract information from features trained with $\mathcal{V}_i$-information minimization, where all the $\mathcal{V}$s are neural nets. On three datasets commonly used in the fairness literature (Adult, German, Heritage), previous methods work well at preventing information "leak" against the class of adversary they have been trained on, but fail when we consider different ones. As shown in Figure 6b in the Appendix, the diagonal elements of the matrix are usually the smallest in their rows, indicating that an attacker family $\mathcal{V}_j$ extracts more information from features trained with $\mathcal{V}_i$-information minimization ($i \neq j$) than from features trained against $\mathcal{V}_j$ itself. This challenges the generalizability of the fair representations in previous works. Please refer to Appendix D.2 for details.
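The sketch below shows the shape of this cross-family evaluation; everything in it is illustrative: the "fair" representations are stand-in random projections, whereas in the paper they come from encoders trained adversarially against each family.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.neural_network import MLPClassifier
from sklearn.metrics import log_loss

def v_information_leak(z, u, attacker):
    """Estimate I_V(Z -> U) for a discrete sensitive attribute u with an
    attacker model from family V (train/test split omitted for brevity)."""
    classes = np.unique(u)
    p = np.array([(u == c).mean() for c in classes])
    h_u = log_loss(u, np.tile(p, (len(u), 1)), labels=classes)
    h_u_given_z = log_loss(u, attacker.fit(z, u).predict_proba(z), labels=classes)
    return h_u - h_u_given_z

attackers = {                                # attacker families V_j
    "linear": LogisticRegression(max_iter=1000),
    "mlp": MLPClassifier(hidden_layer_sizes=(32,), max_iter=500),
}

# Placeholder "fair" representations: in practice each comes from an encoder
# trained adversarially against the corresponding family V_i.
rng = np.random.default_rng(5)
x = rng.normal(size=(1000, 10))
u = (x[:, 0] > 0).astype(int)
representations = {name: x @ rng.normal(size=(10, 4)) for name in attackers}

for trained_against, z in representations.items():
    row = {name: round(v_information_leak(z, u, a), 3) for name, a in attackers.items()}
    print(trained_against, row)
```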

7 Related work

Alternative definitions of Information

Several alternative definitions of mutual information are available in the literature. Rényi entropy and Rényi mutual information (Lenzi et al., 2000) extend Shannon information by replacing the KL divergence with $\alpha$-divergences. However, they face the same difficulties as Shannon information when applied to high-dimensional problems.

The line of work most related to ours is that on loss-based (generalized) entropy and mutual information (DeGroot, 1962; Grünwald and Dawid, 2004), which associates a definition of entropy with every prediction loss. However, there are two key differences. First, the loss-based entropy literature only considers a few special types of prediction functions that serve specific theoretical purposes; for example, Duchi et al. (2018) consider the set of all functions on a feature space to prove surrogate risk consistency, and Grünwald and Dawid (2004) consider the unrestricted setting to prove the duality between maximum entropy and worst-case loss minimization. In contrast, our definition takes a completely different perspective, emphasizing bounded computation and intuitive properties of "usable" information. Furthermore, loss-based entropy still suffers from the difficulty of estimation in high dimensions because the definitions do not restrict to function classes with small complexity (e.g., small Rademacher complexity).

Mutual information estimation

In machine learning, mutual information usually needs to be estimated for continuous underlying distributions. Many non-parametric estimators exploit the $3H$ principle, $I(X; Y) = H(X) + H(Y) - H(X, Y)$, to calculate the mutual information, such as kernel density estimators (Paninski and Yajima, 2008), k-nearest-neighbor estimators, and the KSG estimator (Kraskov et al., 2004). However, these non-parametric estimators usually do not scale to high dimensions. Recently, several works have utilized variational lower bounds on mutual information to design neural-network-based estimators for high-dimensional continuous random variables (Nguyen et al., 2010; van den Oord et al., 2018; Belghazi et al., 2018).

8 Conclusion

We defined and investigated $\mathcal{V}$-information, a variational extension of classic mutual information that incorporates computational constraints. Unlike Shannon mutual information, $\mathcal{V}$-information attempts to capture usable information, and has very different properties, such as invalidating the data processing inequality. In addition, $\mathcal{V}$-information can be provably estimated, and can thus be more effective for structure learning and fair representation learning.

Acknowledgements

This research was supported by AFOSR (FA9550-19-1-0024), NSF (#1651565, #1522054, #1733686), ONR, and FLI.

Appendix A Proofs

A.1 Proof of Proposition 2.1


Proof.

(1)

Let $p_{Y|X=x}$ denote the density function of the random variable $Y$ conditioned on $X = x$. Then

$H_{\mathcal{V}}(Y \mid X) = \inf_{f \in \Omega} \mathbb{E}_{x,y \sim X,Y}\left[-\log f[x](y)\right] = \mathbb{E}_{x \sim X}\Big[\mathbb{E}_{y \sim Y|X=x}\left[-\log p_{Y|X=x}(y)\right]\Big] = H(Y \mid X)$   (9)

where the infimum is achieved by $f^*$ with $f^*[x] = p_{Y|X=x}$, and $H(Y \mid X)$ is the Shannon (conditional) entropy. The same proof technique can be used to show that $H_{\mathcal{V}}(Y \mid \varnothing) = H(Y)$, with the infimum achieved by $f^*$ with $f^*[\varnothing] = p_Y$. Hence we have

$I_{\mathcal{V}}(X \to Y) = H_{\mathcal{V}}(Y \mid \varnothing) - H_{\mathcal{V}}(Y \mid X) = H(Y) - H(Y \mid X) = I(X; Y).$   (10)

(2)

$H_{\mathcal{V}}(Y) = \inf_{\mu \in \mathbb{R}} \mathbb{E}_{y \sim Y}\left[|y - \mu| + \log 2\right] = \mathbb{E}_{y \sim Y}\left[|y - \mathrm{med}(Y)|\right] + \log 2$   (11)

where $\mathbb{E}_{y \sim Y}\left[|y - \mathrm{med}(Y)|\right]$ denotes the mean absolute deviation (the minimizer of $\mathbb{E}|Y - \mu|$ is the median of $Y$).

(3)

$H_{\mathcal{V}}(Y) = \inf_{\mu \in \mathbb{R}^d} \mathbb{E}_{y \sim Y}\left[\|y - \mu\|_2^2\right] + \frac{d}{2}\log \pi = \mathbb{E}\left[\|Y - \mathbb{E}Y\|_2^2\right] + \frac{d}{2}\log \pi$

$\mathbb{E}\left[\|Y - \mathbb{E}Y\|_2^2\right] = \mathbb{E}\left[\mathrm{tr}\big((Y - \mathbb{E}Y)(Y - \mathbb{E}Y)^{\top}\big)\right]$   (cyclic property of trace)

$= \mathrm{tr}\left(\mathbb{E}\left[(Y - \mathbb{E}Y)(Y - \mathbb{E}Y)^{\top}\right]\right) = \mathrm{tr}(\mathrm{Cov}(Y))$   (linearity of trace)

(4) The density function of an exponential family distribution with sufficient statistics $T(y)$ and natural parameter $\theta$ is $p_{\theta}(y) = \exp\left(\langle \theta, T(y) \rangle - A(\theta)\right)$, where $A(\theta)$ is the log-partition function.

$H_{\mathcal{V}}(Y) = \inf_{\theta \in \Theta} \mathbb{E}_{y \sim Y}\left[A(\theta) - \langle \theta, T(y) \rangle\right] = -\sup_{\theta \in \Theta}\left(\langle \theta, \mu \rangle - A(\theta)\right) = -A^*(\mu), \qquad \mu := \mathbb{E}\left[T(Y)\right]$   (12)

where $A^*$ is the Fenchel dual of the log-partition function $A$. Under mild conditions (Wainwright and Jordan, 2008),

$-A^*(\mu) = H(\tilde{Y}_{\mu})$

where $\tilde{Y}_{\mu}$ is the maximum entropy distribution out of all distributions satisfying $\mathbb{E}[T(\tilde{Y}_{\mu})] = \mu$ (Jaynes, 1982), and $H$ is the Shannon entropy.

(5) Assume random variables $X$, $Y$ with $\mathcal{Y} = \mathbb{R}$. Then the $\mathcal{V}$-information from $X$ to $Y$ is

$I_{\mathcal{V}}(X \to Y) = \inf_{\mu \in \mathbb{R}} \mathbb{E}\left[(Y - \mu)^2\right] - \inf_{\phi \in \Phi} \mathbb{E}\left[(Y - \phi(X))^2\right] = \mathrm{Var}(Y) - \min_{\phi \in \Phi} \mathbb{E}\left[(Y - \phi(X))^2\right]$   (13)

which is the (unnormalized) maximum coefficient of determination for linear regression.

A.2 Proof of Proposition 3.1


Proof.

(1)

$H_{\mathcal{U}}(Y) = \inf_{f \in \mathcal{U}} \mathbb{E}_{y \sim Y}\left[-\log f[\varnothing](y)\right] \geq \inf_{f \in \mathcal{V}} \mathbb{E}_{y \sim Y}\left[-\log f[\varnothing](y)\right] = H_{\mathcal{V}}(Y)$   (14)

$H_{\mathcal{U}}(Y \mid X) = \inf_{f \in \mathcal{U}} \mathbb{E}_{x,y \sim X,Y}\left[-\log f[x](y)\right] \geq \inf_{f \in \mathcal{V}} \mathbb{E}_{x,y \sim X,Y}\left[-\log f[x](y)\right] = H_{\mathcal{V}}(Y \mid X)$   (15)

The inequalities (14) and (15) hold because we are taking the infimum over a larger set ($\mathcal{U} \subseteq \mathcal{V}$).

(2)

Denote $\mathcal{V}_{\varnothing}$ as the subset of $\mathcal{V}$ whose elements satisfy $f[x] = f[\varnothing]$ for all $x \in \mathcal{X}$. Then

$H_{\mathcal{V}}(Y \mid X) = \inf_{f \in \mathcal{V}} \mathbb{E}_{x,y \sim X,Y}\left[-\log f[x](y)\right] \leq \inf_{f \in \mathcal{V}_{\varnothing}} \mathbb{E}_{x,y \sim X,Y}\left[-\log f[x](y)\right] = \inf_{f \in \mathcal{V}_{\varnothing}} \mathbb{E}_{y \sim Y}\left[-\log f[\varnothing](y)\right] = \inf_{f \in \mathcal{V}} \mathbb{E}_{y \sim Y}\left[-\log f[\varnothing](y)\right] = H_{\mathcal{V}}(Y \mid \varnothing)$   (by optional ignorance)

Therefore $I_{\mathcal{V}}(X \to Y) = H_{\mathcal{V}}(Y \mid \varnothing) - H_{\mathcal{V}}(Y \mid X) \geq 0$.

(3)

Denote $\mathcal{V}_{\varnothing}$ as the subset of $\mathcal{V}$ whose elements satisfy $f[x] = f[\varnothing]$ for all $x \in \mathcal{X}$; by optional ignorance, $\inf_{f' \in \mathcal{V}_{\varnothing}} \mathbb{E}_{y \sim Y}\left[-\log f'[\varnothing](y)\right] = H_{\mathcal{V}}(Y \mid \varnothing)$. For any $f \in \mathcal{V}$,

$\mathbb{E}_{x,y \sim X,Y}\left[-\log f[x](y)\right] = \mathbb{E}_{x \sim X}\Big[\mathbb{E}_{y \sim Y}\left[-\log f[x](y)\right]\Big]$   (independence)

$\geq \mathbb{E}_{x \sim X}\Big[\inf_{f' \in \mathcal{V}_{\varnothing}} \mathbb{E}_{y \sim Y}\left[-\log f'[\varnothing](y)\right]\Big]$   (optional ignorance)

$= \inf_{f' \in \mathcal{V}_{\varnothing}} \mathbb{E}_{y \sim Y}\left[-\log f'[\varnothing](y)\right] = H_{\mathcal{V}}(Y \mid \varnothing)$   (no dependence on $x$)

Therefore $H_{\mathcal{V}}(Y \mid X) \geq H_{\mathcal{V}}(Y \mid \varnothing)$, i.e. $I_{\mathcal{V}}(X \to Y) \leq 0$. Combined with Proposition 3.1.2, which states that $I_{\mathcal{V}}(X \to Y)$ must be non-negative, $I_{\mathcal{V}}(X \to Y)$ must be $0$.

A.3 Proof of Theorem 1

Before proving Theorem 1, we introduce two lemmas. The proofs of these lemmas follow the same strategy as Theorem 8 in Bartlett and Mendelson (2001):

Lemma 1.

Let $X, Y$ be two random variables taking values in $\mathcal{X} \times \mathcal{Y}$ and let $\mathcal{D} = \{(x_i, y_i)\}_{i=1}^N$ denote a set of samples drawn i.i.d. from the joint distribution of $X$ and $Y$. Assume that $\forall f \in \mathcal{V},\ x \in \mathcal{X},\ y \in \mathcal{Y}$, $|\log f[x](y)| \leq B$. Then, for any $\delta \in (0, 1)$, with probability at least $1 - \delta$, we have:

$\sup_{g \in \mathcal{G}_{\mathcal{V}}}\left(\mathbb{E}\left[g(X, Y)\right] - \frac{1}{N}\sum_{i=1}^{N} g(x_i, y_i)\right) \leq 2\,\mathcal{R}_N(\mathcal{G}_{\mathcal{V}}) + B\sqrt{\frac{2\log(1/\delta)}{N}}$   (16)
Proof.

We apply McDiarmid's inequality to the function $\Phi$ defined for any sample $\mathcal{D} = \{(x_i, y_i)\}_{i=1}^N$ by

$\Phi(\mathcal{D}) = \sup_{g \in \mathcal{G}_{\mathcal{V}}}\left(\mathbb{E}\left[g(X, Y)\right] - \frac{1}{N}\sum_{i=1}^{N} g(x_i, y_i)\right)$   (17)

Let $\mathcal{D}$ and $\mathcal{D}'$ be two samples differing by exactly one point; then, since the difference of suprema does not exceed the supremum of the difference and $|g| \leq B$, we have $|\Phi(\mathcal{D}) - \Phi(\mathcal{D}')| \leq \frac{2B}{N}$. Then, by McDiarmid's inequality, for any $\delta > 0$, with probability at least $1 - \delta$, the following holds:

$\Phi(\mathcal{D}) \leq \mathbb{E}_{\mathcal{D}}\left[\Phi(\mathcal{D})\right] + B\sqrt{\frac{2\log(1/\delta)}{N}}$   (18)

Then we bound the expectation term $\mathbb{E}_{\mathcal{D}}\left[\Phi(\mathcal{D})\right]$:

(19)
(20)
(21)
(22)
(23)
(24)
(25)
(26)
(27)

where the $\sigma_i$'s are Rademacher variables, uniform on $\{-1, +1\}$. Inequalities (21) and (22) follow from convexity, inequality (24) follows from the symmetrization argument for Rademacher random variables (Ledoux and Talagrand (2013), Section 6.1), and (27) follows from the definition of $\mathcal{G}_{\mathcal{V}}$ and of the Rademacher complexity.

Finally, combining inequalities (18) and (27) yields, for any $\delta \in (0, 1)$, with probability at least $1 - \delta$:

$\sup_{g \in \mathcal{G}_{\mathcal{V}}}\left(\mathbb{E}\left[g(X, Y)\right] - \frac{1}{N}\sum_{i=1}^{N} g(x_i, y_i)\right) \leq 2\,\mathcal{R}_N(\mathcal{G}_{\mathcal{V}}) + B\sqrt{\frac{2\log(1/\delta)}{N}}$   (28)

In particular, the inequality holds for and . Then we have: