Tree Structured Synthesis of Gaussian Trees
A new synthesis scheme is proposed to effectively generate a random vector with a prescribed joint density that induces a (latent) Gaussian tree structure. The quality of synthesis is measured by the total variation distance between the synthesized and desired statistics. The proposed layered and successive encoding scheme relies on the learned tree structure to use a minimal number of common random variables to synthesize the desired density. We characterize the achievable rate region for the rate tuples of a multi-layer latent Gaussian tree, through which the number of bits needed to simulate such a Gaussian joint density is determined. The random sources used in our algorithm are the latent variables at the top layer of the tree, the additive independent Gaussian noises, and the Bernoulli sign inputs that capture the ambiguity of correlation signs between the variables.
Consider the problem of simulating a random vector with a prescribed joint density. Such a method can be used in prediction applications, i.e., given a set of inputs we may want to compute the output response statistics. This can be achieved by feeding an appropriate number of random input bits to a stochastic channel whose output vector has empirical statistics that meet the desired ones under a given metric.
We aim to address this synthesis problem for the case where the prescribed output statistics induce a (latent) Gaussian tree structure, i.e., the underlying structure is a tree and the joint density of the variables is captured by a Gaussian density. Gaussian graphical models are widely studied in the literature. They have diverse applications in social networks, biology, and economics , to name a few. Gaussian trees in particular have attracted much attention  due to their sparse structures, as well as the existence of computationally efficient algorithms for learning the underlying topologies . In this paper we assume that the parameters and structure of the latent Gaussian tree are provided.
Our primary concern in this synthesis problem is efficiency, both in the amount of random bits required at the input and in the modeling complexity of the stochastic system through which the Gaussian vector is synthesized. We use Wyner's common information  to quantify the information-theoretic complexity of our scheme. This quantity defines the number of common random bits necessary to generate two correlated outputs through a single common source of randomness and two independent channels.
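As a point of reference, a closed form for this quantity is known in the Gaussian case (this expression is from the literature on Wyner's common information for Gaussian sources, not derived in this paper): for a bivariate Gaussian pair with correlation coefficient $\rho$,

```latex
C(X_1; X_2) \;=\; \frac{1}{2}\,\log\frac{1+|\rho|}{1-|\rho|},
```

which grows without bound as the pair becomes fully correlated and vanishes as they become independent.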
In , Han and Verdú introduced the notion of resolvability of a given channel, defined as the minimal randomness required to generate output statistics in terms of a vanishing total variation distance between the synthesized and prescribed joint densities. In , the authors define the common information of dependent random variables to address the same question in this setting.
In this paper, we show that unlike previous cases, Gaussian trees can be synthesized not only by using a single variable as a common source, but by relying on vectors (usually consisting of more than one variable). In particular, we consider an input vector (and not a single variable) to produce common random bits, and adopt a specific (but natural) structure in our synthesis scheme to decrease the number of parameters needed to model it. It is worth pointing out that the achievability results given in this paper hold under the assumed structured encoding framework. Hence, although we show, through defining optimization problems, that the proposed method is efficient in terms of both modeling and codebook rates, the converse proof, which would show the optimality of this scheme and its rate regions, is postponed to future studies.
We show that latent Gaussian trees always exhibit a sign singularity issue , which we can exploit to make our synthesis approach more efficient. To clarify, consider the following example.
It turns out that such sign singularity can be seen as another noisy source of randomness, which can further help us reduce the code rate corresponding to the latent inputs used to synthesize the latent Gaussian tree. In fact, we may think of the Gaussian tree shown in Figure ? as a communication channel, where information flows from a Gaussian source through three communication channels with independent additive Gaussian noise variables to generate (dependent) outputs with . We introduce  as a binary Bernoulli random variable that reflects the sign information of the pairwise correlations. Our goal is to characterize the achievable rate region and, through an encoding scheme, to synthesize Gaussian outputs with density  using only Gaussian inputs and a channel with additive Gaussian noise, where the synthesized joint density is indistinguishable from the true output density as measured by the total variation metric .
2.1 The signal model of a multi-layer latent Gaussian tree
Here, we suppose a latent graphical model, with  as the set of inputs (hidden variables), , each being a binary Bernoulli random variable with parameter  introducing the sign variables, and  as the set of Gaussian outputs (observed variables) with . We also assume that the underlying network structure is a latent Gaussian tree, so that the joint probability (under each sign realization) is a Gaussian joint density , where the covariance matrix  induces a tree structure , where  is the set of nodes consisting of both vectors  and ;  is the set of edges; and  is the set of edge weights determining the pairwise covariances between adjacent nodes. We consider normalized variances for all variables and . Such constraints do not affect the tree structure, and hence the independence relations captured by . Without loss of generality, we also assume ; this constraint does not change the amount of information carried by the observed vector.
In  we showed that the vectors and are independent. We argued that the intrinsic sign singularity in Gaussian trees is due to the fact that the pairwise correlations can be written as , i.e., the product of correlations on the path from to . Hence, roughly speaking, one can carefully change the signs of several correlations along the path and still maintain the same value for . We showed that if the cardinality of the input vector is , then minimal Gaussian trees (that differ only in the signs of their pairwise correlations) may induce the same joint Gaussian density .
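This product structure can be checked numerically. The sketch below (a hypothetical star-shaped tree, names and numbers illustrative) builds the observed covariance from edge correlations and verifies that negating every edge on the paths through the hidden node leaves the observed statistics unchanged:

```python
import numpy as np

def observed_cov(r):
    """Covariance of the observed leaves of a star-shaped Gaussian tree.

    One hidden node with unit variance connects to each leaf i with edge
    correlation r[i]; the pairwise leaf correlation is the path product
    r[i] * r[j].  All variances are normalized to 1.
    """
    r = np.asarray(r, dtype=float)
    C = np.outer(r, r)          # path products r_i * r_j
    np.fill_diagonal(C, 1.0)    # unit variances on the diagonal
    return C

r = np.array([0.6, 0.7, -0.5])       # illustrative edge correlations
C_plus = observed_cov(r)
C_minus = observed_cov(-r)           # flip the sign of every edge

# The two sign assignments are indistinguishable from the observed density.
assert np.allclose(C_plus, C_minus)
```

Both sign patterns induce the same observed covariance, which is exactly the ambiguity the Bernoulli sign inputs are meant to capture.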
In order to propose the successive synthesis scheme, we need to define the layers of a latent Gaussian tree. We define the latent vector  to be at layer  if the shortest path between each latent input and the observed layer (consisting of the output vector ) is through  edges. In other words, beginning from a given latent Gaussian tree, we take the output to be at layer , then find its immediate latent inputs and define  to include all of them. We iterate this procedure until we have included all latent nodes up to layer , i.e., the top layer. In this setting, the sign input vector  with Bernoulli sign random variables is assigned to the latent inputs .
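The layering rule above is a shortest-path computation from the observed layer, which a multi-source breadth-first search captures directly. A minimal sketch (adjacency lists and node names are illustrative, not from the paper):

```python
from collections import deque

def layers(adj, observed):
    """Group tree nodes by shortest-path distance to the observed layer."""
    dist = {v: 0 for v in observed}       # outputs sit at layer 0
    q = deque(observed)
    while q:                              # BFS from all observed nodes at once
        u = q.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                q.append(v)
    by_layer = {}
    for v, d in dist.items():
        by_layer.setdefault(d, set()).add(v)
    return by_layer

# Toy tree: two hidden nodes x1, x2, each with two observed children.
adj = {"y1": ["x1"], "y2": ["x1"], "y3": ["x2"], "y4": ["x2"],
       "x1": ["y1", "y2", "x2"], "x2": ["y3", "y4", "x1"]}
print(layers(adj, ["y1", "y2", "y3", "y4"]))
# outputs at layer 0, their immediate latent parents at layer 1
```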
We adopt a communication channel to model the relationship between each pair of successive layers. Assuming and as the input vectors, as the output vector, and the noisy channel characterized by the conditional probability distribution , the signal model for such a channel can be written as follows,
where is the transition matrix, which also carries the sign information vector , and is the additive Gaussian noise vector with independent elements, each corresponding to a different communication link from the input layer to the output layer . Hence, the outputs at each layer are generated using the inputs at the layer above. The case is essentially for the outputs in , which are produced using their upper-layer inputs at . As we will see next, this model is the basis for our successive encoding scheme: starting from the top-layer inputs , at each step we generate the outputs at the next lower layer, until we reach the observed layer and synthesize the Gaussian vector . Finally, note that in order to cover all possible latent tree structures, we need to revise the ordering of layers in certain situations, which will be handled in the corresponding subsections. For now, the basic definition of layers suffices.
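The per-layer channel can be sketched as a linear map plus independent Gaussian noise. The toy numbers below are illustrative assumptions, not the paper's parameters; the transition matrix stands in for the (signed) edge weights between two successive layers:

```python
import numpy as np

def next_layer(x, A, noise_std, rng):
    """One layer of the channel: output = A @ input + independent Gaussian noise.

    A carries the signed edge weights from layer l+1 to layer l; one noise
    component is drawn per communication link (i.e., per output coordinate).
    """
    z = rng.normal(scale=noise_std, size=A.shape[0])
    return A @ x + z

rng = np.random.default_rng(0)

# Two top-layer latent inputs feeding four observed outputs (toy weights).
A1 = np.array([[0.8, 0.0],
               [0.7, 0.0],
               [0.0, 0.9],
               [0.0, 0.6]])
x_top = rng.normal(size=2)                     # top-layer latent inputs
y = next_layer(x_top, A1, noise_std=0.5, rng=rng)
print(y.shape)  # (4,)
```

Stacking several such maps, one per layer, gives the successive generation described above: each call consumes the samples of the layer above and emits the layer below.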
2.2 Synthesis Approach Formulation
In this section we provide mathematical formulations to address the following fundamental problem: using channel inputs and , under what rate conditions can we synthesize the Gaussian channel output with a given , for each ? Note that at first we are only given , but using certain tree learning algorithms we can find the jointly Gaussian latent variables at every level . We propose a successive encoding scheme on multiple layers that together induce a latent Gaussian tree, as well as the corresponding bounds on achievable rate tuples. The encoding scheme is efficient because it utilizes the latent Gaussian tree structure to simulate the output. In particular, without resorting to the learned structure we would need to characterize parameters (one for each link between a latent and an output node), while by exploiting the sparsity of a tree we only need to consider parameters (the edges of the tree).
Suppose we transmit input messages through channel uses, in which denotes the time index. We define to be the -th symbol of the -th codeword, with , where is the codebook cardinality, transmitted from the sources present at layer . We assume there are sources at the -th layer, and that the channel has layers. We can similarly define to be the -th symbol of the -th codeword, with , where is the codebook cardinality, for the sign variables at layer . We will further explain that although we define codewords for the Bernoulli sign vectors as well, they are not in fact transmitted through the channel; rather, they act as noisy sources that select a particular sign setting for the latent vector distributions. For sufficiently large rates and , and as grows, the synthesized density of the latent Gaussian tree converges to , i.e., i.i.d. realizations of the given output density , where is a compound random variable consisting of the output, latent, and sign variables. In other words, the average total variation between the two joint densities vanishes as grows ,
where is the synthesized density of the latent Gaussian tree, and represents the average total variation. In this situation, we say that the rates are achievable . Our achievability proofs rely heavily on the soft covering lemma shown in .
Loosely speaking, the soft covering lemma states that one can synthesize the desired statistics with arbitrary accuracy if the codebook size (characterized by its corresponding rate) is sufficient and the channel through which these codewords are sent is noisy enough. In this way, one can cover the desired statistics up to arbitrary accuracy; hence, any random sample that can be drawn from the desired distribution also exists in the synthesized distribution . The main objective is to maximize this rate region (hence minimizing the required codebook size) and to develop a proper encoding scheme to synthesize the desired statistics.
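The accuracy criterion itself is easy to compute numerically in simple cases. The sketch below (a 1-D illustration on a grid, not part of the paper's scheme) evaluates the total variation distance between a target Gaussian density and a slightly mismatched synthesized one:

```python
import numpy as np

def gauss(x, mu, sig):
    """Standard Gaussian density evaluated on a grid."""
    return np.exp(-0.5 * ((x - mu) / sig) ** 2) / (sig * np.sqrt(2 * np.pi))

def total_variation(mu1, sig1, mu2, sig2, lo=-10.0, hi=10.0, n=20001):
    """TV distance 0.5 * integral |p - q|, approximated by a Riemann sum."""
    x = np.linspace(lo, hi, n)
    dx = x[1] - x[0]
    return 0.5 * np.sum(np.abs(gauss(x, mu1, sig1) - gauss(x, mu2, sig2))) * dx

print(total_variation(0, 1, 0, 1))      # identical densities: TV = 0
print(total_variation(0, 1, 0.1, 1))    # small mean mismatch: small TV
```

A synthesized density is declared good when this quantity (averaged over blocks) vanishes as the block length grows.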
For simplicity of notation, we drop the symbol index and use and instead of and , respectively, since they can be understood from the context.
Based on the proposed layered model for a Gaussian tree, we always end up with one of two cases: cases where the variables at the same layer are not adjacent to each other, for any ; and cases where variables at the same layer can be adjacent. An example of the first case is shown in Figure ?, where there is no edge between the variables in the same layer of a two-layered Gaussian tree. Figure 1 shows a Gaussian tree capturing the second case. As we discuss, the synthesis scheme for each of these cases is different. In particular, in the second case we need to pre-process the Gaussian tree and change the ordering of variables at each layer before performing the synthesis.
3.1 The case with observables at the same layer
In this case, Figure ? shows the general encoding scheme used to synthesize the output vector.
In particular, to synthesize the output joint distribution at each layer , we need to generate two codebooks and at its upper layer . Then, we follow a certain synthesis scheme to send these codewords over each of the channels to generate the entire Gaussian tree statistics. To clarify our approach, it is best to begin the synthesis discussion with an illustrative example.
3.2 The case with observables at different layers
To address this case, we need to reform the latent Gaussian tree structure by choosing an appropriate root such that the variables in the newly introduced layers mimic the basic scenario, i.e., have no edges between variables at the same layer. We begin with the top-layer nodes, and as we move to lower layers we search each layer for adjacent nodes at the same layer and move them to a newly added layer between the upper and lower layers. In this way, we introduce new layers consisting of those special nodes, but now we are dealing with the basic case. Note that this procedure might place the output variables at different layers, i.e., not all the output variables are generated using inputs at a single layer. We only need to show that using this procedure and the previously defined achievable rates, one can still simulate the output statistics with vanishing total variation distance. To clarify, consider the case shown in Figure 1.
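The pre-processing step above can be sketched as a simple layer-splitting pass. The layer lists, adjacency, and node names below are illustrative assumptions; whenever two nodes in the same layer are adjacent, one of them is deferred into a freshly inserted layer directly below:

```python
def reform_layers(layers, adj):
    """Split layers so no two nodes in one layer remain adjacent.

    Scans top-down; any node adjacent to an earlier node of its own layer
    is moved into a new layer inserted just below the current one.
    """
    reformed = []
    for layer in layers:
        layer = list(layer)
        deferred = []
        for i, u in enumerate(layer):
            for v in layer[i + 1:]:
                if v in adj.get(u, ()) and v not in deferred:
                    deferred.append(v)      # push v one layer down
        reformed.append([u for u in layer if u not in deferred])
        if deferred:
            reformed.append(deferred)       # the newly introduced layer
    return reformed

# Toy tree: x1 and x2 are adjacent within the same layer.
layers = [["x3"], ["x1", "x2"], ["y1", "y2", "y3", "y4"]]
adj = {"x1": ["x2", "x3", "y1", "y2"], "x2": ["x1", "y3", "y4"]}
print(reform_layers(layers, adj))
# [['x3'], ['x1'], ['x2'], ['y1', 'y2', 'y3', 'y4']]
```

After the pass, every layer again has no internal edges, so the basic successive scheme applies layer by layer.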
As can be seen, there are two adjacent nodes in the first layer, i.e., and are connected. Using the explained procedure, we may move to a newly introduced layer, and then relabel the nodes to capture the layer ordering. The reformed Gaussian tree is shown in Figure 2. In the new ordering, the output variables and will be synthesized one step after the other outputs. The input is used to synthesize the vector , and this vector is used to generate the first-layer outputs, i.e., to and . At the last step, the input is used to simulate the output pair and . By Theorem ? we know that both simulated densities corresponding to and approach their true densities as grows. We need to show that the overall simulated density also approaches .
We need to be extra cautious in preserving the joint dependency among the generated outputs at different layers: for each pair of outputs , there exists an input codeword , which corresponds to the set of generated codewords , where together with they are generated using the second-layer inputs. Hence, in order to maintain the overall joint dependency of the outputs, we always need to match the correct set of outputs to each of the output pairs , and this is done via .
In general, we need to keep track of the indices of the generated output vectors at each layer and match them with the corresponding output vector indices at other layers. This is shown in Lemma ?, whose proof can be found in Appendix Section 8.
4 Maximum Achievable Rate Regions under the Gaussian Tree Assumption
Here, we aim to minimize the bounds on the achievable rates shown in  to make our encoding more efficient. Considering the first lower bound in , we derived an interesting result in  showing that for any Gaussian tree, this mutual information value is only a function of given . However, considering the second inequality in , we show in Theorem ?, whose proof can be found in Appendix Section 6, that under the Gaussian tree assumption, the lower bound is minimized for homogeneous Bernoulli sign inputs.
In this paper, we proposed a new tree-structured synthesis scheme, in which certain Gaussian vectors can be efficiently generated through layered forwarding channels. Our layered encoding approach is shown to be efficient and accurate, both in the reduced number of parameters and random bits needed to simulate the output statistics, and in the closeness of the synthesized statistics to the desired ones in total variation distance.
6 Proof of Theorem
Suppose the latent Gaussian tree has latent variables, i.e., . By adding back the sign variables, the joint density becomes a Gaussian mixture model. One may model such a mixture as a summation of densities that are conditionally Gaussian given the sign vector.
where each captures the overall probability of the binary vector , with . Here, is equivalent to having . The terms are conditional densities of the form
In order to characterize , we need to find  in terms of  and conditional Gaussian densities as well. First, let us show that for any two hidden nodes and in a latent Gaussian tree, we have . The proof goes by induction: we may take the structure shown in Figure ? as the base case, where we proved that . Then, assuming the result holds for any Gaussian tree with hidden nodes, we prove it also holds for any Gaussian tree with hidden nodes. Let us name the newly added hidden node , connected to several hidden and/or observable nodes such that the total structure forms a tree. Now, for each newly added edge we assign , where is one of the neighbors of . Note that this assignment maintains the pairwise sign values between all previous nodes, since to find their pairwise correlations we pass through at most once, and upon entering/exiting we multiply the correlation value by , hence producing , so overall the pairwise correlation sign does not change. The other pairwise correlation signs, which do not pass through , remain unaltered. One may easily check that by assigning the sign value of each newly added edge we make follow the general rule as well. Hence, overall we have shown that for any . In this way we may write , where and is a diagonal matrix. One may easily see that both and its negation induce the same covariance matrix . As a result, if we define as the complement of , we can write the mixture density as follows,
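The key observation at the end of this argument checks numerically in one line: with a covariance written as a two-sided product by a diagonal sign matrix, that matrix and its negation are indistinguishable. The matrices below are arbitrary illustrations:

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.normal(size=(4, 4))
Sigma0 = A @ A.T + 4 * np.eye(4)       # an arbitrary valid covariance matrix
D = np.diag([1.0, -1.0, -1.0, 1.0])    # one diagonal +/-1 sign assignment

# D and -D induce exactly the same covariance D @ Sigma0 @ D.
assert np.allclose(D @ Sigma0 @ D, (-D) @ Sigma0 @ (-D))
```

This is why each sign vector and its complement contribute identical Gaussian components to the mixture.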
where the conditional densities can be characterized as . We know that , where may correspond to either or .
First, we need to show that the mutual information is a convex function of for all . By the equality , and knowing that given the entropy is fixed, we only need to show that the conditional entropy is a concave function of . Using the definition of entropy and replacing for and using equations and , respectively, we may characterize the conditional entropy. Taking the second-order derivative, we deduce the following,
where for simplicity of notations we write instead of . Also, for . Note the following relation,
The same procedure can be used to show,
By the equalities shown in and , it is straightforward that can be turned into the following,
The matrix characterizes the Hessian of the conditional entropy . To prove concavity, we need to show that is non-positive definite. Defining a non-zero real row vector , we form as follows and show that it is non-positive.
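The concavity claim can be illustrated numerically in a simple 1-D instance (the two-component mixture below is an illustrative stand-in for the general sign mixture): the entropy of a mixture is concave in its weights because the density is linear in them, so the second differences along the weight axis should all be negative.

```python
import numpy as np

def gauss(x, mu):
    return np.exp(-0.5 * (x - mu) ** 2) / np.sqrt(2 * np.pi)

def mixture_entropy(p, lo=-12.0, hi=15.0, n=40001):
    """Differential entropy of p*N(0,1) + (1-p)*N(3,1), by grid integration."""
    x = np.linspace(lo, hi, n)
    f = p * gauss(x, 0.0) + (1 - p) * gauss(x, 3.0)
    return -np.sum(f * np.log(f)) * (x[1] - x[0])

ps = np.linspace(0.05, 0.95, 19)
h = np.array([mixture_entropy(p) for p in ps])

# Discrete second differences of a concave function are negative.
second_diff = h[:-2] - 2 * h[1:-1] + h[2:]
assert np.all(second_diff < 0)
```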
Now that we have shown the concavity of the conditional entropy with respect to , we only need to find the optimal solution. The formulation is defined in , where is the Lagrange multiplier.
By taking the derivative with respect to , we deduce the following,
where the last equality is due to . One may find the optimal solution by solving for all , which amounts to showing that , for all . In order to find the joint Gaussian density , observe that we should compute the exponent . Since we are dealing with a latent Gaussian tree, the structure of can be summarized into four blocks as follows: , which has diagonal and off-diagonal entries and , respectively, and does not depend on the edge signs; , with nonzero elements showing the edges between and particular , depending on the correlation signs; ; and , with nonzero off-diagonal elements that are a function of the edge-sign values, while the diagonal elements are independent of the edge-sign values. One may show,
where are the observed neighbors of , and is the edge set corresponding only to hidden nodes, i.e., those hidden nodes that are adjacent to each other. can be defined similarly, with . Also . Suppose and differ at the sign values . Let us write,
Hence, we divide the summation into two parts and . Suppose for all . We may form as follows,
By negating all into , it is apparent that , , and do not change. Also, the terms in either remain intact or are doubly negated; hence, overall remains intact as well. However, by definition, will be negated, so overall the sum will be negated. The same can be argued for , since exactly one variable or in the summation will change its sign, so will also be negated. Overall, we see that by negating , we negate . It remains to show that such negation does not impact . Note that since includes all sign combinations and all of are equiprobable (since we assumed ), is symmetric with respect to , and such a transformation on does not impact the value of , since the negation simply switches the positions of certain Gaussian terms with each other.
For , we should first compute the term . We know , so (note, does not necessarily induce a tree structure). We have,
From this equation, we may interpret the negation of simply as a negation of . Hence, since includes all sign combinations, such a transformation only permutes the terms , so remains fixed, and overall remains unaltered. As a result, we have shown that for any given point in the integral we can find its negation, making the integrand an odd function and the corresponding integral zero. This makes the solution , for all , optimal.
The only thing remaining is to show that from we may conclude that for all . By definition, we may write,
where . Assume all . Consider and find such that the two differ in only one expression, say at the -th place. Since all are equal, one may deduce , so . Note that such a can always be found since the 's cover all possible combinations of -bit vectors. Now, find another , different from at some other spot, say ; again using similar arguments, we may show . This can be done times to show that if all , then . This completes the proof.
7 Proof of Theorem
The signal model can be directly written as follows,
Here, we show the encoding scheme used to generate from . Note that is a vector consisting of the variables . Also, is a vector consisting of the variables . The proof relies on the procedure taken in . Note that our encoding scheme should satisfy the following constraints,
where the first constraint is due to the conditional independence assumption characterized in the signal model . The second one captures the intrinsic sign ambiguity of the latent Gaussian tree. Condition is due to the independence of the joint densities at each time slot . Conditions and correspond to the rates for each of the inputs and . Finally, condition is the synthesis requirement to be satisfied. First, we generate a codebook of sequences, with indices and , according to the procedure explained in Example ?. The codebook consists of all combinations of the sign and latent variable codewords, i.e., . We construct the joint density as depicted in Figure 3.
The indices and are chosen independently and uniformly from the codebook . As can be seen from Figure 3, for each the channel in fact consists of independent channels . The joint density is as follows,
Note that already satisfies the constraints , , and by construction. Next, we need to show that it satisfies the constraint . The marginal density can be deduced by the following,
We know if , then by soft covering lemma  we have,
which shows that satisfies constraint . For simplicity of notation we use instead of , since it can be understood from the context. Next, let us show that nearly satisfies constraints and satisfies . We need to show that as grows, the synthesized density approaches , where the latter satisfies both and . In particular, we need to show that the total variation vanishes as grows. After several algebraic steps similar to those in , we should equivalently show that the following term vanishes as ,
Note that given any fixed , the number of Gaussian codewords is . Also, one can check from the signal model defined in that the statistical properties of the output vector given any fixed sign value do not change. Hence, for sufficiently large rates, i.e., , and by the soft covering lemma, the term in the summation in vanishes as grows, so overall the term shown in vanishes. This shows that in fact nearly satisfies the constraints and . Hence, let us construct another distribution using . Define,
It is not hard to see that such a density satisfies . We only need to show that it satisfies as well. We have,