Conditional Dependence via Shannon Capacity:
Axioms, Estimators and Applications
† Parts of this manuscript have appeared in the International Conference on Machine Learning (ICML), 2016.
Abstract
We consider axiomatically the problem of estimating the strength of a conditional dependence relationship $P_{Y|X}$ from a random variable $X$ to a random variable $Y$. This has applications in determining the strength of a known causal relationship, where the strength depends only on the conditional distribution of the effect given the cause (and not on the driving distribution of the cause). Shannon capacity, appropriately regularized, emerges as a natural measure under these axioms. We examine the problem of calculating Shannon capacity from the observed samples, propose a novel fixed-$k$ nearest neighbor estimator, and demonstrate its consistency. Finally, we demonstrate an application to single-cell flow cytometry, where the proposed estimators significantly reduce sample complexity.
1 Introduction
The axiomatic study of dependence measures on joint distributions between two random variables $X$ and $Y$ has a long history in statistics [Sha48, Rén59, Csi08]. In this paper, we study the relatively unexplored terrain of measures that depend only on the conditional distribution $P_{Y|X}$. We are motivated to study conditional dependence measures by a problem in causal strength estimation. Causal learning is a basic problem in many areas of scientific learning, where one wants to uncover the cause-effect relationship, usually using interventions or sometimes directly from observational data [Pea09, RE15, MPJ15].
In this paper, we are interested in an even simpler question: given a causal relationship, how does one measure the strength of the relationship? This problem arises in many contexts; for example, one may know causal genetic pathways but only a subset of these may be active in a particular tissue or organ. Deducing how much influence each causal link exerts therefore becomes necessary.
We focus on a simple model: consider a pair of random variables $(X, Y)$ with known causal direction $X \to Y$, and suppose that there are no confounders; we are interested in quantifying the causal influence $X$ has on $Y$. We denote the causal influence quantity by $\mathcal{C}(X \to Y)$. There are two philosophically distinct ways to model the quantity. The first is factual influence, i.e., how much influence $X$ exerts on $Y$ under the current probability distribution of the cause $X$. The second, which one can term potential influence, measures how much influence $X$ can potentially exert on $Y$ without regard to the present distribution of the cause. For example, consider a (hypothetical) city which has very few smokers, but where smoking inevitably leads to lung cancer. In such a city, the factual influence of smoking on lung cancer will be small but the potential influence is very high. Depending on the setting, one may prefer the former or the latter. In this paper, we are interested in the potential influence of a cause on its effect.
We want $\mathcal{C}(X \to Y)$ to be invariant to scaling and one-to-one transformations of the variables $X, Y$. This naturally suggests information theoretic metrics as plausible choices of $\mathcal{C}(X \to Y)$, starting with the mutual information $I(X;Y)$, at least in the case of factual influence. This measures the information through the channel $P_{Y|X}$ from $X$ to $Y$ as given by the prior $P_X$. Observe that this metric is symmetric with respect to the directions $X \to Y$ and $Y \to X$; this property is not always desirable. In fact, this measure is taken as a starting point to develop an axiomatic approach to studying causal strength on general graphs in [JBGW13].
In a recent work [KSM14], potential causal influence is posited as a relevant metric to spot “trends” in gene pathways. In the particular application considered there, rare biological states of a gene $X$ in a given dataset may nevertheless correspond to important biological states (or become common under different biological conditions), and therefore it is important to have causal measures that are not sensitive to the cause distribution but only depend on the relationship between the cause and the effect. To quantify the potential influence of those rare states, the following approach is proposed. Replace the observed distribution $P_X$ by a uniform distribution $U_X$ and calculate the mutual information under the joint distribution $U_X P_{Y|X}$. The resulting causal strength quantification is the mutual information between the input and the output of the channel $P_{Y|X}$ when the input is distributed as $U_X$. We call this quantification Uniform Mutual Information (UMI, pronounced “you-me”). A key challenge is to compute this quantity from i.i.d. samples in a statistically efficient manner, especially when the channel output is continuous valued (and potentially in high dimensions). This is the first focus point of this paper.
UMI is not invariant under bijective transformations (since a uniform distribution on $X$ is different from a uniform distribution on a bijective transform of $X$) and is also sensitive to the estimated support size of $X$. Even more fundamentally, it is unclear why one would prefer the uniform prior to measure potential influence through the channel $P_{Y|X}$. Based on natural axioms of data processing and additivity, we motivate an alternative measure of causal strength: the largest amount of information that can be sent through the channel, namely the Shannon capacity. Formally, this is the mutual information between the input and the output of the channel $P_{Y|X}$, maximized over the input distribution $P_X$. We refer to such a quantification as Capacitated Mutual Information (CMI, pronounced “see-me”). A key challenge is to compute this quantity from i.i.d. samples in a statistically efficient manner, especially when the channel output is continuous valued (and potentially in high dimensions). This is the second focus point of this paper. We make the following main contributions in this paper.

UMI Estimation: We construct a novel estimator to compute UMI from data sampled i.i.d. from a distribution $P_X P_{Y|X}$. The estimator brings together ideas from three disparate threads in statistical estimation theory: nearest-neighbor methods, a correlation boosting idea in the estimation of (standard) mutual information from samples [KSG04], and importance sampling. The estimator has only a single hyperparameter (the number of nearest neighbors considered, set to 4 or 5 in practice), uses an off-the-shelf kernel density estimator of only the marginal $P_X$, and has strong connections to the entropy estimator of [KL87]. Our main technical result is to show that the estimator is consistent (in probability), supposing that the Radon-Nikodym derivative is uniformly bounded over the support. In simulations, the estimator has very strong performance in terms of sample complexity (compared to a baseline of the partition-based estimator in [Mod89]).

CMI Estimation: We build upon the estimator derived for UMI and construct an optimization problem that mimics the optimization problem inherent in computing the capacity directly from the conditional probability distribution of the channel. Our main technical result is to show the consistency of this estimator, supposing that the Radon-Nikodym derivative of the optimizing input distribution of the channel (taken with respect to the sampling distribution) is uniformly bounded over the support. Simulation results show strong empirical performance, compared to a baseline of a partition-based method followed by discrete optimization.

Application to gene pathway influence: In [KSM14], considered an important result in single-cell flow cytometry data analysis, a causal strength metric (termed DREMI) is proposed for measuring the causal influence of a gene. This estimator is a specific way of implementing UMI along with a “channel amplification” step, and DREMI was successfully used to spot gene-pathway trends. We show that our proposed CMI and UMI estimators match the performance of DREMI when supplied with the full dataset, while requiring significantly fewer samples to reach the same performance.
2 An Axiomatic Approach
We formally model an influence measure on conditional probability distributions by postulating five natural axioms. Let $X$ be drawn from an alphabet $\mathcal{X}$, and $Y$ from an alphabet $\mathcal{Y}$. Let the probability distribution of $Y$ given $X$ be $P_{Y|X}$. Let $\mathcal{P}$ be a family of conditional distributions; usually we will consider the case when $\mathcal{P}$ is the set of all possible conditional distributions. Then the influence measure is a function from conditional distributions to nonnegative real numbers, $\mathcal{C}: \mathcal{P} \to \mathbb{R}^+$, and we can write $\mathcal{C}(X \to Y)$ as $\mathcal{C}(P_{Y|X})$. We postulate that the function $\mathcal{C}$ satisfies five axioms on $\mathcal{P}$, numbered 0 through 4, and show that CMI satisfies all five axioms:

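Axiom 0 (Independence): The measure $\mathcal{C}(P_{Y|X}) = 0$ if and only if $P_{Y|X}(y|x)$ depends only on $y$.

Axiom 1 (Data Processing): If $X \to Y \to Z$ is a processing chain, i.e., $Z$ depends on $X$ only through $Y$, then the natural data processing inequalities should hold: (a) $\mathcal{C}(P_{Z|X}) \le \mathcal{C}(P_{Y|X})$; and (b) $\mathcal{C}(P_{Z|X}) \le \mathcal{C}(P_{Z|Y})$.

Axiom 2 (Additivity): For a parallel channel $P_{Y_1 Y_2 | X_1 X_2} = P_{Y_1|X_1} P_{Y_2|X_2}$, we need
(1) $\mathcal{C}(P_{Y_1|X_1} P_{Y_2|X_2}) = \mathcal{C}(P_{Y_1|X_1}) + \mathcal{C}(P_{Y_2|X_2})$.

Axiom 3 (Monotonicity): A causal relationship is strong if many possible values of $P_Y$ are achievable by varying the input probability distribution $P_X$. Thus if we consider $P_{Y|X}$ as a map from the probability simplex in $\mathcal{X}$ to the probability simplex in $\mathcal{Y}$, the larger the range of this map, the stronger should be the causal strength.

(a) $\mathcal{C}(P_{Y|X})$ should only depend on the range of the map $P_{Y|X}$, i.e., the convex hull of the output distributions $\{P_{Y|X=x} : x \in \mathcal{X}\}$.

(b) $\mathcal{C}$ should be a monotonic function of the range of the map. If $P_{Y|X}$ and $Q_{Y|X}$ are such that the range of $P_{Y|X}$ is contained in the range of $Q_{Y|X}$, then $\mathcal{C}(P_{Y|X}) \le \mathcal{C}(Q_{Y|X})$.

Axiom 4 (Maximum value): The maximum value of $\mathcal{C}$ over all possible conditional distributions for a particular output alphabet $\mathcal{Y}$ should be achieved exactly when the relationship is fully causal, i.e., each $Y = y$ can be achieved by setting $X = x$ for some $x$.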
We begin our exploration of appropriate influence measures with the alphabets for $X$ and $Y$ being discrete. Let $I(P_X P_{Y|X})$ denote the mutual information with respect to the joint distribution $P_X P_{Y|X}$. Since we are looking at potential influence measures, Shannon capacity, defined as the maximum over input probability distributions of the mutual information, is a natural choice:
(2) $\mathrm{CMI}(P_{Y|X}) = \max_{P_X} I(P_X P_{Y|X})$.
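For a discrete channel, the maximization over input distributions that defines Shannon capacity can be carried out numerically with the Blahut-Arimoto iteration [Bla72, Ari72]. The following is a minimal sketch, assuming NumPy and a channel given as a row-stochastic matrix `channel[x, y]`; the function name, tolerance and iteration cap are illustrative choices and not the authors' implementation. Capacity is reported in nats.

```python
import numpy as np

def blahut_arimoto(channel, tol=1e-10, max_iter=10_000):
    """Capacity (nats) and a maximizing input pmf of a discrete channel.

    channel[x, y] = P(Y = y | X = x); each row must sum to one.
    """
    channel = np.asarray(channel, dtype=float)
    n_x = channel.shape[0]
    p = np.full(n_x, 1.0 / n_x)  # start from the uniform input
    lower = 0.0
    for _ in range(max_iter):
        q = p @ channel  # output distribution induced by p
        # d[x] = KL divergence between row x of the channel and q
        ratio = np.where(channel > 0, channel / np.where(q > 0, q, 1.0), 1.0)
        d = np.sum(channel * np.log(ratio), axis=1)
        lower = float(p @ d)      # mutual information at the current input
        upper = float(np.max(d))  # standard upper bound on capacity
        if upper - lower < tol:
            break
        p = p * np.exp(d)         # multiplicative Blahut-Arimoto update
        p /= p.sum()
    return lower, p

if __name__ == "__main__":
    bsc = [[0.9, 0.1], [0.1, 0.9]]
    capacity, p_opt = blahut_arimoto(bsc)
    print(capacity, p_opt)
```

For the binary symmetric channel with crossover probability 0.1 used above, the capacity is $\log 2 - H_b(0.1) \approx 0.368$ nats, attained by the uniform input.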
Our first claim is the following:
Proposition 1.
CMI satisfies all the axioms of causal influence.
Proof: The proof is fairly straightforward.

Clearly Axiom 0 holds, cf. Chapter 2 of [CT12].

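Axiom 1: Suppose $\mathrm{CMI}(P_{Z|X})$ is achieved with $P_X^*$. Consider the joint distribution $P_X^* P_{Y|X} P_{Z|Y}$. Utilizing the data-processing inequality for mutual information, we get
(3) $\mathrm{CMI}(P_{Z|X}) = I(P_X^* P_{Z|X}) \le I(P_X^* P_{Y|X}) \le \max_{P_X} I(P_X P_{Y|X}) = \mathrm{CMI}(P_{Y|X})$.
Thus Axiom 1a is satisfied. Now consider Axiom 1b. With the same joint distribution, let $P_Y^*$ be the marginal of $Y$. Then we have
(4) $\mathrm{CMI}(P_{Z|X}) = I(P_X^* P_{Z|X}) \le I(P_Y^* P_{Z|Y}) \le \max_{P_Y} I(P_Y P_{Z|Y}) = \mathrm{CMI}(P_{Z|Y})$.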
Axiom 2: This is a standard result for Shannon capacity and we refer the interested reader to Chapter 7 of [CT12].

Axiom 3a: First we rewrite capacity equivalently as the information-centroid (see [CS04]):
(5) $\mathrm{CMI}(P_{Y|X}) = \min_{q_Y} \max_{P_X} D(P_{Y|X} \,\|\, q_Y \mid P_X)$.
Here the conditional KL divergence is defined in the usual way:
(6) $D(P_{Y|X} \,\|\, q_Y \mid P_X) = \sum_{x} P_X(x)\, D(P_{Y|X=x} \,\|\, q_Y)$.
This characterization allows us to make the observation that the capacity is a function only of the convex hull of the probability distributions $\{P_{Y|X=x}\}_{x \in \mathcal{X}}$. Given a conditional probability distribution $P_{Y|X}$, we augment the input alphabet to have one more input symbol $x'$ such that $P_{Y|X=x'} = \sum_x \alpha_x P_{Y|X=x}$ is a convex combination of the other conditional distributions. We claim that the capacity of the new channel is unchanged. One direction is obvious: the new channel has capacity greater than or equal to the original channel, since adding a new symbol cannot decrease capacity. To show the other direction, we use (5) and observe that, due to the convexity of KL divergence in its arguments, for every $q_Y$ we get
$D(P_{Y|X=x'} \,\|\, q_Y) \le \sum_x \alpha_x D(P_{Y|X=x} \,\|\, q_Y) \le \max_x D(P_{Y|X=x} \,\|\, q_Y)$,
so the new symbol does not increase the inner maximum in (5). Thus Shannon capacity is only a function of the convex hull of the range of the map $P_{Y|X}$, satisfying Axiom 3a. This function is monotonic directly from (5), thus satisfying Axiom 3b.

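Axiom 4: For a fixed output alphabet $\mathcal{Y}$, it is clear that $\mathrm{CMI}(P_{Y|X}) \le \log |\mathcal{Y}|$. Now suppose that for some conditional distribution $P_{Y|X}$ we have $\mathrm{CMI}(P_{Y|X}) = \log |\mathcal{Y}|$. This implies that, with the optimizing input distribution, $I(X;Y) = \log |\mathcal{Y}|$. This in turn implies that $H(Y) = \log |\mathcal{Y}|$ and $H(Y|X) = 0$. Thus $Y$ is a deterministic function of $X$ on the essential support of $P_X$, and since $H(Y) = \log |\mathcal{Y}|$, the output distribution $P_Y$ is the uniform distribution and the deterministic function is onto.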
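Axiomatic View of UMI: Now consider an alternative metric, Uniform Mutual Information (UMI), which is defined as the mutual information with uniform input distribution,
(7) $\mathrm{UMI}(P_{Y|X}) = I(U_X P_{Y|X})$,

where $U_X$ is the uniform distribution on $\mathcal{X}$. This measure is motivated by the recent work in [KSM14]. We investigate how it fares in terms of the proposed axioms.

UMI clearly satisfies Axiom 0. It also satisfies Axiom 1a: the data-processing inequality for mutual information on the joint distribution $U_X P_{Y|X} P_{Z|Y}$ implies that $I(X;Z) \le I(X;Y)$, which is the same as $I(U_X P_{Z|X}) \le I(U_X P_{Y|X})$. Thus $\mathrm{UMI}(P_{Z|X}) \le \mathrm{UMI}(P_{Y|X})$.

UMI, however, does not satisfy Axiom 1b in general. If the transition matrices $P_{Y|X}$ and $P_{Z|Y}$ are both doubly stochastic, then a straightforward calculation shows that UMI satisfies Axiom 1b too.

UMI satisfies Axiom 2 since the uniform distribution on $\mathcal{X}_1 \times \mathcal{X}_2$ naturally factors as $U_{X_1} U_{X_2}$ and we have
(8) $\mathrm{UMI}(P_{Y_1|X_1} P_{Y_2|X_2}) = I(U_{X_1} U_{X_2} P_{Y_1|X_1} P_{Y_2|X_2})$
(9) $= I(U_{X_1} P_{Y_1|X_1}) + I(U_{X_2} P_{Y_2|X_2})$
(10) $= \mathrm{UMI}(P_{Y_1|X_1}) + \mathrm{UMI}(P_{Y_2|X_2})$.

UMI does not satisfy Axiom 3, since repeating the same conditional distribution $P_{Y|X=x}$ at several input values does not alter the convex hull but alters the UMI value.

Interestingly, UMI does satisfy Axiom 4 for the same reason as CMI.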
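The sensitivity of UMI to repeated input symbols can be checked on a toy example. The sketch below assumes NumPy and a discrete channel stored as a row-stochastic matrix; the helper names are illustrative. It computes the mutual information under a uniform input for a noiseless binary channel, and again after one input symbol is duplicated. The convex hull of the rows is the same in both cases, and the Shannon capacity stays at $\log 2$, but the uniform-input value drops.

```python
import numpy as np

def mutual_information(p_x, channel):
    """I(X;Y) in nats for input pmf p_x and channel[x, y] = P(y | x)."""
    p_x = np.asarray(p_x, dtype=float)
    channel = np.asarray(channel, dtype=float)
    q_y = p_x @ channel  # output marginal
    ratio = np.where(channel > 0, channel / np.where(q_y > 0, q_y, 1.0), 1.0)
    return float(np.sum(p_x[:, None] * channel * np.log(ratio)))

def uniform_mutual_information(channel):
    """Mutual information when the input is uniform over the rows."""
    channel = np.asarray(channel, dtype=float)
    n_x = channel.shape[0]
    return mutual_information(np.full(n_x, 1.0 / n_x), channel)

if __name__ == "__main__":
    noiseless = [[1, 0], [0, 1]]
    duplicated = [[1, 0], [1, 0], [0, 1]]  # first input symbol repeated
    print(uniform_mutual_information(noiseless))
    print(uniform_mutual_information(duplicated))
```

In the duplicated channel the output is distributed as $(2/3, 1/3)$ under the uniform input, so the value equals the entropy of that distribution rather than $\log 2$.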
2.1 Realvalued alphabets
For real-valued $X$, the maximized mutual information is not finite without additional regularization. This is also true of other measures of relation such as the Rényi correlation [Rén59], and in each case the measure is studied in the context of some form of penalty term. Typically this is done via a cost constraint on the real-valued input parameters. In this context, one possibility is to consider the following norm-constrained optimization to ensure the causal effect is finite valued:
(11) 
In practice, the constraint level is chosen from the empirical moments of $X$ computed from the samples. This regularization turns out to be the so-called power constraint on the input, common in treatments of additive noise communication channels.
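As a standard textbook instance of such a power constraint, consider an additive Gaussian noise channel. The symbols $P$, $Z$ and $\sigma^2$ below are notation assumed for this illustration and are not taken from the text above. The constrained maximization is finite and has a closed form, attained by a Gaussian input:

```latex
Y = X + Z, \qquad Z \sim \mathcal{N}(0, \sigma^2) \ \text{independent of } X,
\qquad
\max_{P_X :\ \mathbb{E}[X^2] \le P} I(X;Y) \;=\; \frac{1}{2} \log\!\left(1 + \frac{P}{\sigma^2}\right).
```

Without the constraint $\mathbb{E}[X^2] \le P$ the right-hand side grows without bound as $P \to \infty$, which is the sense in which regularization is needed.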
3 Estimators
Although the definitions of UMI and CMI apply seamlessly to both discrete and continuous random variables, the estimation becomes relatively straightforward when both $X$ and $Y$ are discrete; the estimation of the conditional distribution and the computation of UMI and CMI can be separated in a straightforward manner. For this reason, and also due to an application in genomic biology that we study, we focus on the more challenging regime where $Y$ is continuous. Due to certain subtleties in the estimation process, we provide separate estimators customized for the cases of discrete and continuous $X$, respectively.
3.1 Uniform Mutual Information
The idea of applying UMI to infer the strength of conditional dependence was first proposed in [KSM14]. Off-the-shelf 2-dimensional kernel density estimators (KDE) are used to first estimate the joint distribution from given samples. Subsequently, the channel $P_{Y|X}$ is computed directly from the joint distribution, and then UMI is computed via either numerical integration or sampling from the uniform input and the estimated channel. This approach suffers from known drawbacks of KDE, such as sensitivity to the choice of the bandwidth and increased bias for higher dimensional $X$ and $Y$. However, a more critical drawback of using KDE to estimate the joint distribution at all points (and not just at samples) is that it is overkill: we only need to compute a single functional (UMI) of the joint distribution, which could in principle be computed more efficiently directly from the samples. It is not at all clear how to directly estimate UMI.
Perhaps surprisingly, we bring together ideas from three topics in statistical estimation to introduce novel estimators that are also provably convergent. Our estimator is based on $k$-nearest neighbor estimators, e.g. [KL87]; the correlation boosting idea of the estimator from [KSG04], which is widely adopted in practice [KBG07]; and importance sampling techniques to adjust for the uniform prior for UMI. We explain each idea below.
Consider the simpler task of computing the mutual information from samples; several approaches exist for this estimation: [Pan03, KSG04, WKV09, PPS10, SRHI10, PXS12, GSG14, GSG15, KKPW15]. Note that three applications of an entropy estimator, such as those from [BDGVdM97], give an estimate of the mutual information, i.e. $I(X;Y) = H(X) + H(Y) - H(X,Y)$. Each entropy term can be computed using, for example, a KDE based approach, which suffers from the same challenges as in UMI. Alternatively, to bypass estimating the density at every point, the differential entropy estimation can be done via $k$-nearest neighbor ($k$NN) methods (pioneering work in [KL87]). This KL entropy estimator provides the first step in designing the UMI estimator. However, if we take the route of estimating the mutual information via the three differential entropies (two marginals and one joint), it is entirely unclear how to estimate two of these quantities (the differential entropy of $Y$ and that of $(X,Y)$ under the uniform input) directly from samples.
Perhaps surprisingly, an innovative approach undertaken in [KSG04] to improve upon three applications of KL estimators provides a solution. The KSG estimator of [KSG04] is based on the $k$NN distance $\rho_{k,i}$, defined as the distance to the $k$th nearest neighbor from $(X_i, Y_i)$ in $\ell_\infty$ distance. Precisely, the KSG estimator is
(12) 
where $\psi$ is the digamma function ($\psi(x) \approx \log x$ for large $x$), and the $k$NN statistics $n_{x,i}$ and $n_{y,i}$ are defined as
(13)  
(14) 
Note that the numbers of nearest neighbors in $X$ and $Y$ are computed with respect to $\rho_{k,i}$ in the joint space $(X,Y)$. This innovative idea not only gives a simple estimator, but also has the advantage of canceling correlations in the three entropy estimates, giving improved performance. However, despite its popularity in practice due to its simplicity, no convergence result was known until very recently (when [GOV16] showed some consistency and rate of convergence properties).
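The construction described above can be sketched in a few lines. The code below follows the commonly stated form of the KSG estimator, $\psi(k) + \psi(N) - \frac{1}{N}\sum_i \left[\psi(n_{x,i}+1) + \psi(n_{y,i}+1)\right]$, with neighbors counted strictly inside the $k$NN distance; this convention is an assumption and may differ in detail from (12)-(14). It uses brute-force $O(N^2)$ distance matrices with NumPy only, and evaluates the digamma function at positive integers through harmonic numbers. The function name is illustrative.

```python
import numpy as np

def ksg_mutual_information(x, y, k=5):
    """KSG-style mutual information estimate (nats) from paired samples."""
    x = np.asarray(x, dtype=float).reshape(len(x), -1)
    y = np.asarray(y, dtype=float).reshape(len(y), -1)
    n = len(x)
    # Pairwise max-norm distances in each marginal space and in the joint space.
    dx = np.max(np.abs(x[:, None, :] - x[None, :, :]), axis=2)
    dy = np.max(np.abs(y[:, None, :] - y[None, :, :]), axis=2)
    np.fill_diagonal(dx, np.inf)  # exclude each point from its own neighborhood
    np.fill_diagonal(dy, np.inf)
    dz = np.maximum(dx, dy)
    rho = np.sort(dz, axis=1)[:, k - 1]  # distance to the k-th joint neighbor
    # Marginal neighbor counts, both measured with the joint-space radius.
    n_x = np.sum(dx < rho[:, None], axis=1)
    n_y = np.sum(dy < rho[:, None], axis=1)
    # Digamma at a positive integer m equals -gamma + H_{m-1} (harmonic number).
    harmonic = np.concatenate(([0.0], np.cumsum(1.0 / np.arange(1, n + 1))))
    psi = lambda m: -np.euler_gamma + harmonic[np.asarray(m) - 1]
    return float(psi(k) + psi(n) - np.mean(psi(n_x + 1) + psi(n_y + 1)))

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    x = rng.standard_normal(800)
    y = 0.9 * x + np.sqrt(1 - 0.9 ** 2) * rng.standard_normal(800)
    print(ksg_mutual_information(x, y), -0.5 * np.log(1 - 0.9 ** 2))
```

For larger sample sizes a spatial index would replace the dense distance matrices; the brute-force version is kept here only to make the role of the shared radius explicit.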
Inspired by the innovative mutual information estimator in (12), we incorporate importance sampling techniques to adjust for the uniform prior for UMI, and propose a novel estimator. In addition to being provably convergent, our estimator has only one hyperparameter (besides the choice of bandwidth for estimating the marginal distribution of $X$, which is a significantly simpler task than estimating the joint distribution): the number $k$ of nearest neighbors to consider. In practice $k$ is set to a small integer such as 4 or 5.
Continuous $X$. We propose a novel UMI estimator based on the Kraskov mutual information estimator. For a conditional probability density $f_{Y|X}$, we want to compute the uniform mutual information from i.i.d. samples $\{(X_i, Y_i)\}_{i=1}^N$ that are generated from $f_X f_{Y|X}$ for some prior $f_X$ on $X$. Our UMI estimator is based on $k$-nearest neighbor ($k$NN) statistics. Given a choice of $k$ and $N$ samples,
(15) 
where , , is the volume of dimensional unit ball, and is the selfnormalized importance sampling estimate [CMMR12] of :
(16) 
where is the estimate of . We use the standard kernel density estimator with a bandwidth :
(17) 
We define the $k$NN statistics $n_{x,i}$ and $n_{y,i}$ as follows. For each sample $(X_i, Y_i)$, calculate the Euclidean distance $\rho_{k,i}$ (as opposed to the $\ell_\infty$ distance proposed by [KSG04]) to the $k$th nearest neighbor. This determines the (random) number of samples within $\rho_{k,i}$ in each coordinate: first, $n_{x,i}$ is defined in the same way as in (13), but with Euclidean distance; second, we have a weighted number of samples within $\rho_{k,i}$ in $Y$ as
(18) 
Compared to (12), we first use the log function in place of the digamma function applied to $N$, $n_{x,i}$, and $n_{y,i}$. This step (especially for $n_{x,i}$ and $n_{y,i}$) is crucial for proving convergence. We use ideas from importance sampling and introduce new variables $w_i$ that capture the correction for the mismatch in the prior. The volume constants correct for the volume being measured in Euclidean distance.
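The importance-sampling correction can be illustrated in isolation. The sketch below is not the estimator in (15)-(18); it only shows self-normalized importance weights that reweight samples drawn from a non-uniform density toward a uniform target, using a Gaussian kernel density estimate evaluated at the samples. The one-dimensional setting, the Gaussian kernel, the function names, and the convention that the weights sum to $N$ are all assumptions made for this example.

```python
import numpy as np

def gaussian_kde_at_samples(x, bandwidth):
    """Gaussian kernel density estimate of a 1-D sample, evaluated at the sample points."""
    x = np.asarray(x, dtype=float)
    z = (x[:, None] - x[None, :]) / bandwidth
    kernel_sum = np.exp(-0.5 * z ** 2).sum(axis=1)
    return kernel_sum / (len(x) * bandwidth * np.sqrt(2.0 * np.pi))

def uniform_importance_weights(x, bandwidth):
    """Self-normalized weights proportional to 1 / f_hat(X_i), scaled to sum to N."""
    f_hat = gaussian_kde_at_samples(x, bandwidth)
    inverse_density = 1.0 / f_hat  # target density is constant on the support
    return len(x) * inverse_density / inverse_density.sum()

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    u = rng.uniform(size=2000)
    # Inverse-CDF sampling from the density f(x) = 0.5 + x on [0, 1].
    x = (-1.0 + np.sqrt(1.0 + 8.0 * u)) / 2.0
    w = uniform_importance_weights(x, bandwidth=0.06)
    print(x.mean(), np.mean(w * x))  # raw mean vs. mean reweighted toward uniform
```

The raw sample mean is close to $7/12$, the mean under $f$, while the reweighted mean moves toward $1/2$, the mean of the uniform distribution on $[0,1]$. The sampling density here is bounded away from zero, which mirrors the bounded Radon-Nikodym derivative condition used in the analysis.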
Discrete . Similarly, for a discrete random variable , the joint probability density function is denoted by . We propose a UMI estimator, and overload the same notation for this discrete case.
(19) 
where is the number of samples such that , is the selfnormalizing estimate of defined as
(20) 
and $n_{y,i}$ is the weighted $k$NN statistic defined as follows. For each sample $(X_i, Y_i)$, let the distance to the $k$th nearest neighbor be $\rho_{k,i}$, where only those samples that have the same value of $X$ as $X_i$ are considered and the Euclidean distance is measured in $Y$. We define the weighted number of samples within $\rho_{k,i}$ in $Y$ as
(21) 
3.2 Capacitated Mutual Information
Given standard estimators for mutual information and entropy, it is not at all clear how to directly estimate CMI, where the input distribution is changed to the (unknown) optimal input distribution. However, combining the mutual information estimator in (12) with importance sampling techniques, we design a novel estimator as the solution to an optimization over the space of the weights. Our estimator has only one hyperparameter $k$, the number of nearest neighbors to consider.
Continuous . For a conditional distribution , we compute an estimate of CMI from i.i.d. samples generated from for some prior on . We introduce a novel CMI estimator that is based on our UMI estimator. Given a choice of and samples, the estimated CMI is the solution of the following constrained optimization:
where the quantities appearing in the objective are defined in the same way as in (15). We optimize over the weights under the second moment constraint. Observe that no KDE of the input density is needed for CMI estimation, making it particularly simple and robust.
4 Convergence Guarantees
We show that both the proposed UMI and CMI estimators are consistent under typical assumptions on the distribution. While consistency of estimators in the large sample limit is generally only a (basic) first step in understanding their properties, this is not so for fixed-$k$ nearest neighbor based estimators. As far as we know, the only estimator based on fixed-$k$ nearest neighbors that is known to be consistent is the entropy estimator of [KL87], and the convergence rate is only known for the univariate case [TVdM96] (and that too under significant assumptions on the univariate density). Our result below on the consistency of the UMI estimator marks another instance where consistency of fixed-$k$ nearest neighbor based estimators is established.
Uniform Mutual Information: As our estimators use the off-the-shelf kernel density estimator of [DP84, SJ91] and also ideas from nearest-neighbor methods [KL87], we make assumptions on the conditional density that are typical in this literature. One extra assumption we make for UMI is that the Radon-Nikodym derivative is uniformly bounded over the support. This is necessary for controlling the importance-sampling estimates of the $w_i$'s. We refer to Assumption 1 in the supplementary material for a precise description.
Theorem 1.
Under the Assumption 1 in the supplementary material, the UMI estimator converges to the true value in probability, i.e. for all and all ,
(22) 
if for continuous and for discrete .
In practice, we regularize the $k$NN distance in case it is much smaller than its expected order of magnitude. For continuous $X$, we require $k$ to be larger than the ratio of the dimensions, which is a finite constant. For discrete $X$, however, the effective dimension of $X$ is zero, which makes the ratio unbounded. Hence, for concentration of measure to hold, we need $k$ to scale at least logarithmically in the number of samples $N$.
Capacitated Mutual Information: We make analogous assumptions which are described precisely in Assumption 2 in the supplementary material. The following theorem establishes consistency of our estimator when is discrete and we quantize . Our analysis requires uniform convergence over all possible choices of the weights , making the quantization step inevitable; improvements on this technical condition are natural future steps.
Theorem 2.
Under the Assumption 2 in the supplementary material, the CMI estimator converges in probability to the true value up to the resolution of the quantization, i.e. if for some , and , for all and and
5 Numerical Experiments
5.1 Gene Causal Strength from Single Cell Data
We briefly describe the setup of [KSM14] to motivate our numerical experiments. Consider a simple genetic pathway: a cascade of genes having expression values which interact linearly, i.e., . A key question of interest in this case is how the signaling in the pathway varies under different conditions of intervention. Let $t$ denote the time after the intervention (for example, after giving a certain drug). Then we may want to compare the strength of the causal relationship between two genes at different times after the intervention. In the experiments, samples are usually taken at very few time points, so $t$ takes only a few values (for example, before the drug and at two later times after the drug), but at each given time point many cells are interrogated, so we get many samples from the joint distribution of the expression values at that time. For each value of $t$, we observe i.i.d. samples drawn from this distribution. These samples are obtained using a technique called single-cell mass flow cytometry; see [KSM14] for details. We are interested in obtaining a causal strength measure for each link of the cascade at each time point $t$. This measure serves as a high level summary of how signaling proceeds in the cascade as a function of time, and lets one compare the strengths of a given causal relationship at different points after intervention.
If the drug indeed activates the causal pathway, one may expect the causal relationship to follow a certain trend, i.e., at an earlier time the strength of the first link in the cascade will be high, and at a later time the strength of the next link will be high, before the effect of the drug wears off, at which time we expect all the relationships to fall back to their low nominal values. Such an analysis is conducted in [KSM14], where the causal strength function is evaluated via the so-called DREMI estimator (essentially a version of UMI estimation with a “channel amplification” step and a careful choice of hyperparameters therein; no theoretical properties of this estimator were evaluated). In that paper, it is shown that, for two example pathways, DREMI recovers the correct trend, i.e., it correctly identifies the time at which each causal relationship is expected to peak as per prior biological knowledge. This demonstrates the utility of DREMI for causal strength inference in gene networks (see Figure 6 of [KSM14]). The authors there also demonstrate that other metrics which depend on the whole joint distribution, such as mutual information, maximal information coefficient, and correlation, do not capture the trend. As an aside, we note that a somewhat different set of “trend spotting” estimators, primarily trying to find genes which demonstrate a monotonic trend over time from single-cell RNA-sequencing data, have been proposed very recently in [MJG15].
In this paper, we have studied influence measures axiomatically and proposed the UMI and CMI measures. It is natural to apply our estimators to each time point in the same setting as [KSM14], and we look to understand two distinct issues in our experiments with the flow cytometry data. The first is whether the proposed quantities of UMI and CMI are able to capture the same biological trend as DREMI. The second relates to sample complexity: how does the ability to recover the trend vary as a function of the number of samples? To study this, we subsample the original data from [KSM14] multiple times (100 in the experiments) at each subsampling ratio and compute the fraction of times we recover the true biological trend. This is plotted in Figure 1. The figure demonstrates that when the whole dataset is made available, UMI and CMI are able to spot the trend correctly (just as DREMI does). When fewer samples are available, UMI uniformly dominates DREMI and, in turn, CMI uniformly dominates UMI in terms of capturing the biological trend as a function of the number of samples available. We believe that this strong empirical evidence lends credence to our approach. For completeness, we note that the datasets represented in Figure 1 refer to regular T-cells (left figure) and T-cells exposed to an antigen (right figure), for which we expect different biological trends, but both of which are correctly captured by our metrics.
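The subsampling protocol described above can be sketched generically. In the snippet below, `recovers_trend` is a hypothetical user-supplied callable standing in for "estimate the causal strengths on this subsample and check whether the expected biological trend is recovered"; it is not part of any library, and the single-array data layout is a simplification of the per-time-point data in [KSM14].

```python
import numpy as np

def trend_recovery_rate(samples, recovers_trend, ratio, n_trials=100, seed=0):
    """Fraction of random subsamples (without replacement) on which the trend is recovered."""
    samples = np.asarray(samples)
    rng = np.random.default_rng(seed)
    n = len(samples)
    m = max(1, int(round(ratio * n)))  # subsample size at this ratio
    hits = 0
    for _ in range(n_trials):
        idx = rng.choice(n, size=m, replace=False)
        hits += bool(recovers_trend(samples[idx]))
    return hits / n_trials

if __name__ == "__main__":
    data = np.arange(1000.0)
    # Toy stand-in criterion: the subsample mean is close to the full-data mean.
    close_to_mean = lambda s: abs(s.mean() - data.mean()) < 25.0
    for ratio in (0.05, 0.2, 0.8):
        print(ratio, trend_recovery_rate(data, close_to_mean, ratio))
```

Sweeping `ratio` over a grid and plotting the returned fraction gives a curve of the same kind as the one reported in Figure 1.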
5.2 Synthetic data
We demonstrate the accuracy of the proposed UMI and CMI estimators on synthetic experiments. We generate samples from where follows a beta distribution and , , independent of . We present three results with varying . Figure 2 shows the estimate of UMI, averaged over 100 instances. This is compared to the ground truth and the state-of-the-art partition based estimators from [Mod89]. The ground truth has been computed via simulations with samples from the desired distribution using Kraskov’s mutual information estimator [KSG04]. For CMI, we use exactly the same distribution as in UMI, but with varying , which is illustrated in Figure 3. Under the power constraint, the ground truth is given by . The proposed CMI estimator is compared against the Blahut-Arimoto algorithm [Bla72, Ari72] for computing discrete channel capacity, applied to quantized data. Both figures illustrate that the proposed estimators significantly improve over the state-of-the-art partition based methods in terms of sample complexity.
6 Discussion
In this paper we have proposed novel information theoretic measures of the potential influence of one variable on another, and provided novel estimators to compute the measures from i.i.d. samples. The technical innovation has been in proposing these estimators by combining separate threads of ideas in statistics (including importance sampling and nearest-neighbor methods). The consistency proofs suggest that a similar analysis of the very popular estimator of (traditional) mutual information in [KSG04] can be conducted successfully; such work has been recently conducted in [GOV16]. Several other issues in statistical estimation theory intersect with our current work and we discuss some of these topics below.
(a) The main technical results of this paper have been weak consistency of the proposed estimators. Proving stronger consistency guarantees and rates of convergence would be natural improvements, albeit challenging ones. Rates of convergence for nearest-neighbor methods are barely known in the literature even for traditional information theoretic quantities: for instance, [TVdM96] derives $\sqrt{N}$-consistency for the single dimensional case of differential entropy estimation (under strong assumptions on the underlying pdf), leaving higher dimensional scenarios open; these have recently been addressed in [GOV16].
(b) There is a natural generalization of our estimators when the output alphabet is high dimensional, using the $k$NN approach (just as in the differential entropy estimator of [KL87] or in the mutual information estimator of [KSG04]). However, very recent works [GSG14, GSG15, LP16] have shown that boundary biases common in high dimensional scenarios are much better handled using local parametric methods (as in [L96, HJ96]). Adapting these approaches to the estimators for UMI and CMI is an interesting direction of future research.
(c) We have considered both the case of discrete and (single dimensional) continuous alphabet $\mathcal{X}$. The scenario of high dimensional $X$ is significantly more challenging for CMI estimation, because of the (vastly) expanded space of distributions over which the optimization can be performed. Also challenging is to consider application specific regularization of the inputs in this scenario.
(d) While the focus of this paper has been on quantifying potential causal influence, a related question involves testing the direction of causality for a pair of random variables. This is a widely studied topic with a long lineage [Pea09] but also of strong topical interest [JBGW13, JSSS15, MPJ15, SJSB15]. A natural inclination is to explore the efficacy of UMI and CMI measures to test for direction of causality, especially in the context of the benchmark data sets collected in [MPJ15]. Our results are as follows: UMI predicts the correct direction with 45% probability, and CMI with 53% probability. Directly comparing the marginal entropies $H(X)$ and $H(Y)$ using the estimator in [KL87] also only provides 45% accuracy. In [MPJ15], by contrast, different entropy estimators (with appropriate hyperparameter choices) were applied to get an accuracy of up to 60%-70%. Further research is needed to shed conclusive light, although we point out that the benchmark data sets in [MPJ15] have substantial confounding factors that make causal direction hard to measure in the first place.
(e) The axiomatic derivation of potential causal influence naturally suggests CMI as an appropriate measure. We are also able to show that a more general quantity, the so-called Rényi capacity, also meets the axioms. For any , define Rényi entropy as
(23) 
and Rényi divergence as:
(24) 
Now define the asymmetric information measure [Csi95]:
(25) 
which converges to the traditional mutual information when . Now we can define the Rényi capacity for any parameter as, for any fixed conditional distribution :
(26) 
Observe that as , we have , the traditional Shannon capacity. We observe the following.
Proposition 2.
For any we have that satisfies the axioms in Section 2.
The proof is available in Appendix D. In the light of this result, it would be interesting to design estimators for the more general family of Rényi capacity measures and confirm their performance on empirical tasks such as the ones studied in [KSM14]. It would also be very interesting to understand the role of additional axioms that would lead to uniqueness of Shannon capacity (in the same spirit as entropy being uniquely characterized by somewhat similar axioms [Csi08]).
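For orientation, the textbook forms of the Rényi entropy and Rényi divergence on a discrete alphabet are sketched below. The order symbol \alpha, the discrete-sum form, and the names P and Q are assumptions made here for illustration; the notation of (23) and (24) may differ (for example, integrals over densities), and the measures in (25) and (26) are not reproduced.

```latex
% Standard definitions, assuming order \alpha \in (0,1) \cup (1,\infty)
% and distributions P, Q on a common discrete alphabet.
\begin{align*}
  H_\alpha(P) &= \frac{1}{1-\alpha} \log \sum_{x} P(x)^{\alpha}, \\
  D_\alpha(P \,\|\, Q) &= \frac{1}{\alpha-1} \log \sum_{x} P(x)^{\alpha}\, Q(x)^{1-\alpha}.
\end{align*}
% Both recover the Shannon quantities in the limit \alpha \to 1:
\begin{align*}
  \lim_{\alpha \to 1} H_\alpha(P) &= -\sum_{x} P(x) \log P(x), \\
  \lim_{\alpha \to 1} D_\alpha(P \,\|\, Q) &= \sum_{x} P(x) \log \frac{P(x)}{Q(x)}.
\end{align*}
```

These limits are what make a divergence-based information measure of this kind reduce to mutual information, and the corresponding capacity to Shannon capacity, as the order tends to 1.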
(f) Finally, a comment on the optimization problem in CMI estimation: the optimization problem involving the ’s is not necessarily a concave program for a given sample realization, although this program converges to that of Shannon capacity computation, which involves maximizing mutual information, a concave function of the input probability distribution. Standard (stochastic) gradient descent is used in our experiments, and we did not observe any disparity in the converged values over the set of synthetic experiments we conducted.
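To illustrate the concave population-level problem referred to in (f), the sketch below computes the Shannon capacity of a known discrete channel with the Blahut-Arimoto iteration [Ari72, Bla72]. This is not the sample-based CMI estimator, nor the gradient descent procedure used in the experiments; the function names `blahut_arimoto` and `_kl_rows`, the iteration budget, and the stopping tolerance are illustrative assumptions.

```python
import numpy as np


def _kl_rows(W: np.ndarray, q: np.ndarray) -> np.ndarray:
    """Row-wise KL divergence D(W[x, :] || q) in nats, with 0 * log 0 = 0."""
    with np.errstate(divide="ignore", invalid="ignore"):
        log_ratio = np.where(W > 0, np.log(W / q), 0.0)
    return np.sum(W * log_ratio, axis=1)


def blahut_arimoto(W, n_iter: int = 2000, tol: float = 1e-12):
    """Capacity (nats) and a maximizing input pmf for a channel matrix W.

    W[x, y] is P(Y = y | X = x); every row must sum to one.
    """
    W = np.asarray(W, dtype=float)
    p = np.full(W.shape[0], 1.0 / W.shape[0])  # start from the uniform input
    for _ in range(n_iter):
        d = _kl_rows(W, p @ W)   # D(W(.|x) || output marginal) for each x
        new_p = p * np.exp(d)    # multiplicative Blahut-Arimoto update
        new_p /= new_p.sum()
        converged = np.max(np.abs(new_p - p)) < tol
        p = new_p
        if converged:
            break
    # Mutual information I(X;Y) evaluated at the final input distribution.
    capacity = float(np.dot(p, _kl_rows(W, p @ W)))
    return capacity, p


if __name__ == "__main__":
    # Binary symmetric channel with crossover probability 0.1.
    bsc = [[0.9, 0.1], [0.1, 0.9]]
    print(blahut_arimoto(bsc))
```

Because mutual information is concave in the input distribution for a fixed channel, this iteration converges to the global maximum; the non-concavity noted above arises only in the finite-sample program.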
Acknowledgements
This work is supported in part by ARO W911NF-14-1-0220, NSF SaTC award CNS-1527754, NSF CISE award CCF-1553452 and a University of Washington startup grant.
References
 [Ari72] Suguru Arimoto. An algorithm for computing the capacity of arbitrary discrete memoryless channels. Information Theory, IEEE Transactions on, 18(1):14–20, 1972.
 [BDGVdM97] Jan Beirlant, Edward J Dudewicz, László Györfi, and Edward C Van der Meulen. Nonparametric entropy estimation: An overview. International Journal of Mathematical and Statistical Sciences, 6(1):17–39, 1997.
 [Bla72] Richard E Blahut. Computation of channel capacity and rate-distortion functions. Information Theory, IEEE Transactions on, 18(4):460–473, 1972.
 [CMMR12] Jean Cornuet, Jean-Michel Marin, Antonietta Mira, and Christian P Robert. Adaptive multiple importance sampling. Scandinavian Journal of Statistics, 39(4):798–812, 2012.
 [CS04] Imre Csiszár and Paul C Shields. Information theory and statistics: A tutorial. Now Publishers Inc, 2004.
 [Csi95] Imre Csiszár. Generalized cutoff rates and Rényi’s information measures. Information Theory, IEEE Transactions on, 41(1):26–34, 1995.
 [Csi08] Imre Csiszár. Axiomatic characterizations of information measures. Entropy, 10(3):261–273, 2008.
 [CT12] Thomas M Cover and Joy A Thomas. Elements of information theory. John Wiley & Sons, 2012.
 [DP84] Luc Devroye and Clark S Penrod. The consistency of automatic kernel density estimates. The Annals of Statistics, pages 1231–1249, 1984.
 [GOV16] Weihao Gao, Sewoong Oh, and Pramod Viswanath. Demystifying fixed k-nearest neighbor information estimators. arXiv preprint arXiv:1604.03006, 2016.
 [GSG14] Shuyang Gao, Greg Ver Steeg, and Aram Galstyan. Efficient estimation of mutual information for strongly dependent variables. arXiv preprint arXiv:1411.2003, 2014.
 [GSG15] Shuyang Gao, Greg Ver Steeg, and Aram Galstyan. Estimating mutual information by local Gaussian approximation. arXiv preprint arXiv:1508.00536, 2015.
 [HJ96] Nils Lid Hjort and MC Jones. Locally parametric nonparametric density estimation. The Annals of Statistics, pages 1619–1647, 1996.
 [HV15] Siu-Wai Ho and Sergio Verdú. Convexity/concavity of Rényi entropy and mutual information. In Information Theory (ISIT), 2015 IEEE International Symposium on, pages 745–749. IEEE, 2015.
 [JBGW13] Dominik Janzing, David Balduzzi, Moritz Grosse-Wentrup, Bernhard Schölkopf, et al. Quantifying causal influences. The Annals of Statistics, 41(5):2324–2358, 2013.
 [JSSS15] D. Janzing, B. Steudel, N. Shajarisales, and B. Schölkopf. Justifying Information-Geometric Causal Inference, chapter 18, pages 253–265. Springer International Publishing, 2015.
 [KBG07] Shiraj Khan, Sharba Bandyopadhyay, Auroop R Ganguly, Sunil Saigal, David J Erickson III, Vladimir Protopopescu, and George Ostrouchov. Relative performance of mutual information estimation methods for quantifying the dependence among short and noisy data. Physical Review E, 76(2):026209, 2007.
 [KKPW15] Kirthevasan Kandasamy, Akshay Krishnamurthy, Barnabas Poczos, and Larry Wasserman. Nonparametric von Mises estimators for entropies, divergences and mutual informations. In Advances in Neural Information Processing Systems, pages 397–405, 2015.
 [KL87] LF Kozachenko and Nikolai N Leonenko. Sample estimate of the entropy of a random vector. Problemy Peredachi Informatsii, 23(2):9–16, 1987.
 [KSG04] A. Kraskov, H. Stögbauer, and P. Grassberger. Estimating mutual information. Physical review E, 69(6):066138, 2004.
 [KSM14] Smita Krishnaswamy, Matthew H Spitzer, Michael Mingueneau, Sean C Bendall, Oren Litvin, Erica Stone, Dana Pe’er, and Garry P Nolan. Conditional density-based analysis of T cell signaling in single-cell data. Science, 346(6213):1250689, 2014.
 [L96] Clive R Loader et al. Local likelihood density estimation. The Annals of Statistics, 24(4):1602–1618, 1996.
 [LP16] Damiano Lombardi and Sanjay Pant. Nonparametric k-nearest-neighbor entropy estimator. Physical Review E, 93(1):013310, 2016.
 [MJG15] Jonas Mueller, Tommi Jaakkola, and David Gifford. Modeling trends in distributions. arXiv preprint arXiv:1511.04486, 2015.
 [Mod89] Rudy Moddemeijer. On estimation of entropy and mutual information of continuous distributions. Signal processing, 16(3):233–248, 1989.
 [MPJ15] J.M. Mooij, J. Peters, D. Janzing, J. Zscheischler, and B. Schölkopf. Distinguishing cause from effect using observational data: methods and benchmarks. Journal of Machine Learning Research, 2015.
 [Owe13] Art B. Owen. Monte Carlo theory, methods and examples. 2013.
 [Pan03] Liam Paninski. Estimation of entropy and mutual information. Neural computation, 15(6):1191–1253, 2003.
 [Pea09] Judea Pearl. Causality. Cambridge university press, 2009.
 [PPS10] Dávid Pál, Barnabás Póczos, and Csaba Szepesvári. Estimation of Rényi entropy and mutual information based on generalized nearest-neighbor graphs. In Advances in Neural Information Processing Systems, pages 1849–1857, 2010.
 [PV10] Yury Polyanskiy and Sergio Verdú. Arimoto channel coding converse and Rényi divergence. In Communication, Control, and Computing (Allerton), 2010 48th Annual Allerton Conference on, pages 1327–1333. IEEE, 2010.
 [PXS12] Barnabás Póczos, Liang Xiong, and Jeff Schneider. Nonparametric divergence estimation with applications to machine learning on distributions. arXiv preprint arXiv:1202.3758, 2012.
 [RE15] Robin J Richardson and Thomas S Evans. Nonparametric causal models. 2015.
 [Rén59] Alfréd Rényi. On measures of dependence. Acta mathematica hungarica, 10(3–4):441–451, 1959.
 [Sha48] C.E. Shannon. A mathematical theory of communication. Bell System Tech. J., 27:379–423 and 623–656, 1948.
 [SJ91] Simon J Sheather and Michael C Jones. A reliable data-based bandwidth selection method for kernel density estimation. Journal of the Royal Statistical Society. Series B (Methodological), pages 683–690, 1991.
 [SJSB15] N. Shajarisales, D. Janzing, B. Schölkopf, and M. Besserve. Telling cause from effect in deterministic linear dynamical systems. In Proceedings of the 32nd International Conference on Machine Learning, volume 37 of JMLR Workshop and Conference Proceedings, pages 285–294. JMLR, 2015.
 [SMH03] Harshinder Singh, Neeraj Misra, Vladimir Hnizdo, Adam Fedorowicz, and Eugene Demchuk. Nearest neighbor estimates of entropy. American journal of mathematical and management sciences, 23(3–4):301–321, 2003.
 [SRHI10] Kumar Sricharan, Raviv Raich, and Alfred O Hero III. Empirical estimation of entropy functionals with confidence. arXiv preprint arXiv:1012.4188, 2010.
 [TVdM96] Alexandre B Tsybakov and EC Van der Meulen. Root-n consistent estimators of entropy for densities with unbounded support. Scandinavian Journal of Statistics, pages 75–83, 1996.
 [VEH14] Tim Van Erven and Peter Harremoës. Rényi divergence and Kullback-Leibler divergence. Information Theory, IEEE Transactions on, 60(7):3797–3820, 2014.
 [WKV09] Qing Wang, Sanjeev R Kulkarni, and Sergio Verdú. Divergence estimation for multidimensional densities via k-nearest-neighbor distances. Information Theory, IEEE Transactions on, 55(5):2392–2405, 2009.
Appendix
Appendix A Proof of the UMI estimator convergence in Theorem 1
We present the proof of the theorem for two separate UMI estimators: first for continuous and next for discrete . We first state the formal assumptions under which the theorem holds.
Assumption 1.
For continuous , define
(27)  
(28) 
We make the following assumptions:

.

There exists a finite constant such that the Hessian matrix of and exists and
almost everywhere. 
There exists a positive constant such that the conditional pdfs satisfy and almost everywhere.

There exist positive constants such that the marginal pdf satisfies, almost everywhere,

The bandwidth of kernel density estimator is chosen as .
For discrete , define
(29) 
We make the following assumptions:

, for all .

There exists a finite constant such that the Hessian matrix of exists and almost everywhere, for all .

There exists a finite constant such that the conditional pdf almost everywhere, for all .

There exist finite constants such that the prior and almost everywhere.
A.1 The case of continuous
Given these assumptions, we define
(30) 
such that . Define each quantity with the true prior as
(31)  
(32)  
(33) 
With equal to the uniform distribution on the support of , we apply the triangle inequality to show that each term converges to zero in probability.
(35)  
The first term (35) captures the error in the kernel density estimator and we have the following claim, whose proof is deferred to Appendix C.
Lemma 1.
The term in Equation (35) converges to 0 as in probability.
The second term in the error (35) comes from the sample noise in density estimation. Similar to the decomposition of mutual information, I(X;Y) = H(X) + H(Y) − H(X,Y), we decompose our estimator into three terms:
where
(36)  
(37)  
(38) 
Notice that