Smoothness of marginal log-linear parameterizations
We provide results demonstrating the smoothness of some marginal log-linear parameterizations for distributions on multi-way contingency tables. First we give an analytical relationship between log-linear parameters defined within different margins, and use this to prove that some parameterizations are equivalent to ones already known to be smooth. Second we construct an iterative method for recovering joint probability distributions from marginal log-linear pieces, and prove its correctness in particular cases. Finally we use Markov chain theory to prove that certain cyclic conditional parameterizations are also smooth. These results are applied to show that certain conditional independence models are curved exponential families.
1 Introduction
Models for multi-way contingency tables may include restrictions on various marginal or conditional distributions, especially in the context of longitudinal or causal models (see, for example, Lang and Agresti, 1994; Bergsma et al., 2009; Evans and Richardson, 2013, and references therein). Such models can often be parameterized by combining log-linear parameters from within different marginal tables. The resulting marginal log-linear parameterizations, introduced by Bergsma and Rudas (2002), provide an elegant and flexible way to parameterize a multivariate discrete probability distribution.
Setting these marginal log-linear parameters to zero can be used to define arbitrary conditional independence models (Rudas et al., 2010; Forcina et al., 2010), including those corresponding to undirected graphical models or Bayesian networks. If these zero parameters can be embedded into a larger smooth parameterization of the joint distribution, then the model defined by the conditional independence constraints is a curved exponential family, and therefore possesses good statistical properties. This approach is applied by Rudas et al. (2010) and Evans and Richardson (2013) to classes of graphical models.
Unfortunately, there exist models of conditional independence which—though believed to be curved exponential families—cannot be embedded into parameterizations currently known to be smooth. Forcina (2012) studies examples of models defined by ‘loops’ of conditional independences, such as
which can be defined by constraints on the conditional distributions , and respectively. However it is not clear whether a smooth parameterization of the joint distribution can be constructed using these conditionals. The model can also be defined by setting a particular collection of marginal log-linear parameters to zero (see Section 5 for details), but there is no way to embed these parameters into a smooth parameterization of the kind studied by Bergsma and Rudas (2002), so their results do not apply. Forcina (2012) gives a numerical test for this model which is highly suggestive of smoothness, but no formal proof is available.
The contribution of this paper is to show that the class of smooth discrete parameterizations which can be constructed using marginal log-linear (MLL) parameters is considerably larger than had previously been known, and that models such as (1) can indeed be embedded into these parameterizations. We give three different methods for demonstrating smoothness in this context. First we provide an analytical expression for the relationship between log-linear parameters defined within different marginal distributions; this allows us to prove the equivalence of various parameterizations. Second we show that particular fixed point maps relating different parameters are contractions, and hence can be used to uniquely recover the joint probability distribution. Lastly we use Markov chain theory to show that we can smoothly recover joint probability distributions from ‘cyclic’ conditional distributions; this is used to show that certain conditional independence models, including the one above, are curved exponential families of distributions.
The rest of the paper is organized as follows: Section 2 reviews marginal log-linear parameters and their properties. Section 3 specifies the relationship between log-linear parameters defined within different margins, enabling certain parameterizations to be proven equivalent. Section 4 extends this by constructing fixed point methods that smoothly recover a joint distribution. Section 5 further extends the results of Section 3 using Markov chain theory, and demonstrates that certain conditional independence models are curved exponential families. Section 6 contains discussion, and a conjecture on the precise characterization of smooth MLL parameterizations.
2 Marginal Log-Linear Parameters
We consider multivariate distributions over a finite collection of binary random variables $X_v \in \{0,1\}$, for $v \in V$; we denote their joint distribution by $p_V(x_V)$. All the results herein also hold (or have analogues) in the case of general finite discrete variables, but the notation becomes more cumbersome. For $M \subseteq V$ we denote the marginal distribution over $X_M$ by $p_M(x_M)$, and for disjoint $A, B \subseteq V$ we denote the relevant conditional distribution by $p_{A|B}(x_A \mid x_B)$. Distributions are assumed to be strictly positive: $p_V(x_V) > 0$ for all $x_V \in \{0,1\}^V$.
Let $\Delta_k$ be the strictly positive probability simplex of dimension $k$. We say that a homeomorphism $\theta : \Delta_k \to \Theta$ onto an open set $\Theta \subseteq \mathbb{R}^k$ is a smooth parameterization of $\Delta_k$ if $\theta$ is twice continuously differentiable, and its Jacobian has full rank $k$ everywhere.
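The canonical smooth parameterization of the joint distribution $p_V$ is via log-linear parameters $\lambda_L$, $L \subseteq V$, defined by the Möbius expansion
$$\log p_V(x_V) = \sum_{L \subseteq V} (-1)^{|x_L|} \lambda_L;$$
here $|x_L|$ is the number of 1s in $x_L$. It follows by Möbius inversion that
$$\lambda_L = 2^{-|V|} \sum_{x_V \in \{0,1\}^V} (-1)^{|x_L|} \log p_V(x_V);$$
see, for example, Lauritzen (1996). For example, if $V = \{1,2\}$,
$$\lambda_{1} = \frac{1}{4} \log \frac{p_{12}(0,0)\, p_{12}(0,1)}{p_{12}(1,0)\, p_{12}(1,1)}, \qquad \lambda_{12} = \frac{1}{4} \log \frac{p_{12}(0,0)\, p_{12}(1,1)}{p_{12}(1,0)\, p_{12}(0,1)}.$$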
It is well known that the collection provides a smooth parameterization of the joint distribution with .
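The Möbius expansion and its inversion can be checked numerically. The following is a minimal sketch, assuming binary variables indexed from 0 (rather than 1) and a joint table stored as a NumPy array of shape (2, ..., 2); the function names `loglinear_params` and `table_from_params` are ours and purely illustrative.

```python
import itertools
import numpy as np

def loglinear_params(p):
    """Return {L: lambda_L} for a strictly positive table p of shape (2,)*n."""
    n = p.ndim
    cells = list(itertools.product([0, 1], repeat=n))
    lam = {}
    for r in range(n + 1):
        for L in itertools.combinations(range(n), r):
            # lambda_L = 2^{-n} * sum_x (-1)^{|x_L|} * log p(x)
            lam[L] = sum((-1) ** sum(x[v] for v in L) * np.log(p[x])
                         for x in cells) / 2 ** n
    return lam

def table_from_params(lam, n):
    """Invert the map: log p(x) = sum_L (-1)^{|x_L|} * lambda_L."""
    logp = np.zeros((2,) * n)
    for x in itertools.product([0, 1], repeat=n):
        logp[x] = sum((-1) ** sum(x[v] for v in L) * val
                      for L, val in lam.items())
    return np.exp(logp)

rng = np.random.default_rng(0)
p = rng.dirichlet(np.ones(8)).reshape(2, 2, 2)   # arbitrary positive 2x2x2 table
lam = loglinear_params(p)                        # includes the empty effect ()
p_back = table_from_params(lam, 3)               # recovers p up to rounding
```

The parameter for the empty effect is included only so that the inversion is exact; it is fixed by the requirement that the probabilities sum to one, which is why it is not counted among the free parameters.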
Clearly and, for example,
which is the log-odds ratio between and . In order to fit a model with the constraint we could choose a parameterization that includes , and fix it to be zero.
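To illustrate how a parameter depends on the margin in which it is computed, the sketch below takes $\lambda_L^M$ to be the log-linear parameter for the effect $L$ computed from the marginal table $p_M$. The helper `mll_param`, the 0-based variable labels and the particular numbers are our own assumptions. The table is built so that $X_0$ and $X_1$ are marginally independent but both influence $X_2$.

```python
import itertools
import numpy as np

def mll_param(p, L, M):
    """lambda_L^M: effect L (a subset of M) computed within the margin M."""
    assert set(L) <= set(M)
    drop = tuple(v for v in range(p.ndim) if v not in M)
    pM = p.sum(axis=drop) if drop else p      # axes stay in increasing order
    pos = [sorted(M).index(v) for v in L]     # positions of L inside M
    total = 0.0
    for x in itertools.product([0, 1], repeat=len(M)):
        total += (-1) ** sum(x[i] for i in pos) * np.log(pM[x])
    return total / 2 ** len(M)

# X0 and X1 independent; cond[x0, x1] = P(X2 = 1 | x0, x1) (illustrative values)
p0, p1 = np.array([0.3, 0.7]), np.array([0.6, 0.4])
cond = np.array([[0.2, 0.5], [0.5, 0.9]])
p = np.empty((2, 2, 2))
p[:, :, 1] = np.outer(p0, p1) * cond
p[:, :, 0] = np.outer(p0, p1) * (1 - cond)

lam_marg = mll_param(p, (0, 1), (0, 1))       # zero: marginal independence
lam_joint = mll_param(p, (0, 1), (0, 1, 2))   # non-zero in the full table
```

The same effect therefore takes different values in different margins, and constraining it to zero expresses a different independence statement in each.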
One way to characterize the main idea of Bergsma and Rudas (2002) is as follows: given some arbitrary margins of a joint distribution , what additional information does one need to smoothly reconstruct the full joint distribution ? They show that one possibility is to take the collection of log-linear parameters where for any .
It follows that given any inclusion-respecting sequence of margins (i.e. only if ), we can smoothly parameterize with marginal log-linear parameters of the form , where but for any .
Take the inclusion-respecting sequence of margins , , . This gives us the smooth parameterization consisting of the vector below. The pairs are summarized (grouped by margin) in the adjacent table. (Note that here, and in the sequel, we abbreviate sets of integers by omitting the braces and commas in order to avoid overburdened notation: so, for example, $123$ means $\{1,2,3\}$.)
Now, let be an arbitrary collection of effect-margin pairs such that . Define
to be the corresponding vector of marginal log-linear parameters. The main question considered by this paper is: under what circumstances does constitute a smooth parameterization of ?
2.1 Existing Results
We say that is complete if every non-empty subset of appears as an effect in exactly once. If, in addition, the margins can be ordered so that each effect appears with the first margin of which it is a subset, we say that is hierarchical. Parameterizations that can be constructed from an inclusion-respecting sequence of margins in the manner of Example 2.3 correspond precisely to hierarchical . Bergsma and Rudas (2002) show that if is complete and hierarchical then gives a smooth parameterization of the joint distribution; in addition, they show that completeness is necessary for smoothness. Forcina (2012) shows that if is complete and contains only two distinct margins , then is smooth.
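The definitions of completeness and hierarchy are easy to check mechanically for small examples. The sketch below is one possible implementation, assuming a collection is stored as a list of (effect, margin) pairs of frozensets; the function names and the two example collections are ours, and the hierarchy check simply tries every ordering of the margins, so it is only practical for a handful of margins.

```python
from collections import Counter
from itertools import combinations, permutations

def nonempty_subsets(V):
    items = sorted(V)
    return [frozenset(c) for r in range(1, len(items) + 1)
            for c in combinations(items, r)]

def is_complete(pairs, V):
    """Every non-empty subset of V is an effect exactly once, inside its margin."""
    counts = Counter(L for L, _ in pairs)
    targets = nonempty_subsets(V)
    return (all(L <= M for L, M in pairs)
            and len(counts) == len(targets)
            and all(counts[S] == 1 for S in targets))

def is_hierarchical(pairs, V):
    """Complete, and some ordering of the margins puts each effect in the
    first margin that contains it (brute force over orderings)."""
    if not is_complete(pairs, V):
        return False
    margins = list({M for _, M in pairs})
    return any(all(M == next(N for N in order if L <= N) for L, M in pairs)
               for order in permutations(margins))

def make_pairs(spec):
    """spec maps a margin such as "12" to its effects such as ["2", "12"]."""
    return [(frozenset(L), frozenset(M)) for M, Ls in spec.items() for L in Ls]

V = "123"
nested = make_pairs({"1": ["1"], "12": ["2", "12"],
                     "123": ["3", "13", "23", "123"]})
non_hier = make_pairs({"23": ["3", "23"],
                       "123": ["1", "2", "12", "13", "123"]})
```

Here `nested` is complete and hierarchical, while `non_hier` is complete but not hierarchical: the effect 2 is a subset of the margin 23 yet is assigned to 123, and the effects 3 and 23 are subsets of 123 yet are assigned to 23, so neither ordering of the two margins works.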
To our knowledge, these are the only existing results on the smoothness of marginal log-linear parameterizations. No example has been provided of a complete parameterization which is non-smooth. In Sections 3, 4 and 5 we will show that, in fact, many more complete parameterizations are smooth than had previously been known.
The issue of smoothness in non-hierarchical models was raised by Forcina (2012) in the context of loop models of conditional independence, and expanded upon by Colombi and Forcina (2014) for models of context-specific conditional independence; the latter consider a more general class of models than we do, but there is no overlap in the theoretical results. Examples of ordinary conditional independence models that require non-complete parameterizations (and therefore are not curved exponential families) are found in Drton (2009).
3 An Analytical Map between Margins
To parameterize a marginal distribution we can use the marginal log-linear parameters . An analogous result holds for conditional distributions: for disjoint define
in other words, all the MLL parameters for the margin whose effect contains some element of . Then constitutes a smooth parameterization of the conditional distribution
This result helps us to understand the relationship between log-linear parameters defined within different margins. Theorem 3 of Bergsma and Rudas (2002) shows that distinct MLL parameters corresponding to the same effect in different margins (i.e. and with ) are linearly dependent at certain points in the parameter space, and that therefore no smooth parameterization can include two such parameters. The following theorem elucidates the exact relationship between such parameters, and will later be used to demonstrate the smoothness of certain non-hierarchical parameterizations.
Let be disjoint subsets of . The log-linear parameter may be decomposed as
for a smooth function , which vanishes whenever for some .
In addition, if
(where are held fixed).
Since the second term is a smooth function of the conditional probabilities , it follows that it is also a smooth function of the claimed parameters. The implication of independence follows from Lemma 2.9 of Evans and Richardson (2013).
Hence the derivative of (3) in the case becomes
and, since there is no dependence upon , this is the same as
Then note that simply counts the number of 1s in and in , so is even if and only if is. Hence
which gives the required result. ∎
We have shown that if the conditional distribution of given is fixed, then the relationship between and (and indeed any parameter of the form for ) is linear. In particular, if we know , then and become interchangeable as part of a parameterization, preserving smoothness and (when relevant) variation independence.
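This interchangeability can be seen numerically. Writing $M$ for a margin and $A$ for a disjoint set of further variables (our notation for this sketch), the factorization $\log p_{AM} = \log p_M + \log p_{A|M}$ implies that, for any effect $L \subseteq M$, the difference $\lambda_L^{AM} - \lambda_L^{M}$ depends only on the conditional distribution of $X_A$ given $X_M$. The code below, with hypothetical helper names and 0-based labels, builds two joint tables that share the conditional of $X_2$ given $(X_0, X_1)$ but have different margins over $(X_0, X_1)$, and compares the offsets.

```python
import itertools
import numpy as np

def mll_param(p, L, M):
    """lambda_L^M: effect L (a subset of M) computed within the margin M."""
    drop = tuple(v for v in range(p.ndim) if v not in M)
    pM = p.sum(axis=drop) if drop else p
    pos = [sorted(M).index(v) for v in L]
    return sum((-1) ** sum(x[i] for i in pos) * np.log(pM[x])
               for x in itertools.product([0, 1], repeat=len(M))) / 2 ** len(M)

def joint(p01, cond):
    """p(x0, x1, x2) = p01(x0, x1) * P(x2 | x0, x1), cond = P(X2 = 1 | x0, x1)."""
    return np.stack([p01 * (1 - cond), p01 * cond], axis=-1)

rng = np.random.default_rng(2)
cond = rng.uniform(0.1, 0.9, size=(2, 2))        # shared conditional
pa = joint(rng.dirichlet(np.ones(4)).reshape(2, 2), cond)
pb = joint(rng.dirichlet(np.ones(4)).reshape(2, 2), cond)

offsets = []
for L in [(0,), (1,), (0, 1)]:
    off_a = mll_param(pa, L, (0, 1, 2)) - mll_param(pa, L, (0, 1))
    off_b = mll_param(pb, L, (0, 1, 2)) - mll_param(pb, L, (0, 1))
    offsets.append((off_a, off_b))               # the two entries agree
```

With the conditional held fixed, each parameter in the larger margin is the corresponding parameter in the smaller margin plus a constant, so knowing one is equivalent to knowing the other.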
3.1 Constructing Smooth Parameterizations
The following example shows how Theorem 3.1 can be used to prove the smoothness of a parameterization.
Consider the complete collections and below.
is not hierarchical because in any inclusion-respecting ordering the margin 23 must precede 123, in which case the effect 2 (contained in the pair ) is not associated with the first margin of which it is a subset. Existing results therefore cannot tell us whether or not is smooth. However, by fixing the parameters Theorem 3.1 shows that and are interchangeable. Hence is smooth if and only if is also smooth which, since satisfies the conditions of a hierarchical parameterization, it is. In addition, and are both variation independent parameterizations (i.e. any corresponds to a valid probability distribution).
We generalize the approach used in the preceding example with the following definition and proposition.
Let be a collection of MLL parameters, and define
That is, all effects involving are removed, and any margins containing are replaced by .
Let be a complete collection of marginal log-linear parameters over such that the variable is not in any margin except . Then is a smooth parameterization of if and only if is a smooth parameterization of . In addition, is variation independent if and only if is.
Since is the only margin containing and the parameterization is complete, we have the parameters . Hence we can smoothly parameterize the distribution of with these parameters.
By Theorem 3.1, any other parameter such that is (having fixed the distribution of ) a smooth function of . It follows that we have a smooth map between and . Since is a function of , and smoothly parameterizes , it follows that smoothly parameterizes if and only if smoothly parameterizes .
Lastly, the two pieces and are variation independent of one another because this is a parameter cut, and parameters within are all variation independent since they are just ordinary log-linear parameters; therefore is variation independent if and only if is. ∎
Any complete parameterization in which the margins are strictly nested () is smooth and variation independent.
Lemma 6 of Forcina (2012) deals with the special case , which to our knowledge was the only prior result showing that a non-hierarchical MLL parameterization may be smooth.
Let be a complete parameterization, and suppose that for some , and every , the sets and appear as effects within the same margin in .
Then is a smooth parameterization of if and only if is a smooth parameterization of . In addition, is variation independent if and only if is variation independent.
Since and appear in the same margin, say , set
which is zero unless , leaving
But notice this is of the same form as an MLL parameter for the pair over the conditional distribution . It follows that for fixed the parameters form a complete MLL collection of the form for the conditional distribution of . If is smooth then we can smoothly recover the conditional distribution . Furthermore, if the effect is in a margin , then using (3) we obtain
and smoothly recover . In addition is variation independent of (since , constitutes a parameter cut) and has range , so the same is true of .
Conversely if is smooth, then given parameters we can set up a dummy distribution on in which for each , and , thus smoothly recovering . ∎
4 Fixed Point Mappings
The previous section gives analytical maps between some parameterizations, but Propositions 3.5 and 3.8 only apply directly to a relatively small number of cases. In this section we build on these results by presenting conditions for the existence of a smooth map, even without a closed form expression.
For a given this suggests that might be recovered using fixed point methods; the identity (4) gives us information about the Jacobian of .
Consider the parameterization based on
If we can smoothly recover , and from then it follows that is a smooth parameterization. From (3) we have
since , and are given in the parameterization we can assume these to be fixed, so abusing notation slightly
Similarly, for some smooth , so is a solution to the equation
If can be shown to be a contraction mapping, then we are guaranteed to find a unique solution, and therefore recover the joint distribution. In addition, if is a contraction for all , then since it varies smoothly in we will have shown that is a smooth parameterization.
Define to be the smallest amount of probability assigned to any cell in our joint distribution, and to be the probability simplex consisting of such distributions. The Jacobian of an otherwise smooth parameterization can become singular on the boundary of the probability simplex, so it is useful to have control over this quantity.
The next result allows us to control the magnitude of the columns (or rows) of the Jacobian of in certain examples. The proof is given in the appendix.
Let , and . Then
Alternatively, if , then
Returning to the parameterization in Example 4.1, the derivative of is
which is the dot product of the vectors
By applying the two parts of Lemma 4.2, these vectors each have magnitude at most . Hence , and is a contraction on for every . It follows that the equation has a unique solution among all positive probability distributions (and this can be found by iteratively applying to any initial distribution), and by the inverse function theorem it is a smooth function of . Hence is indeed smooth.
Lemma 4.2 enables us to formulate the following generalization of the idea used in the example above.
Let be complete and such that for any with , there is at most one other margin in with . Then is smooth.
By Theorem 3.1,
Since is the only margin in such that , it follows that all the parameters in are known and fixed except for , where is the set of effects contained in the margin . Hence
Now, consider the vector equation obtained by stacking (5) over all pairs . This defines a fixed point equation whose solution is , and the column of the Jacobian corresponding to has non-zero entries
From Lemma 4.2, each column has magnitude at most , and therefore the mapping is a contraction on for each . It follows that the fixed point equation has a unique solution which, by the inverse function theorem, is a smooth function of . ∎
From this result we obtain the following corollary, the conditions of which are easy to verify.
Any complete parameterization with at most three margins is smooth.
Since one of the margins must be , it is clear that the conditions of Lemma 4.5 hold. ∎
Although it does not satisfy the conditions of Lemma 4.5 directly, one can use the basic idea to set up a smooth contraction mapping from to ; since is hierarchical, both parameterizations are smooth.
5 Cyclic Parameterizations
This section takes a third approach to determining smoothness, by using Markov chain theory to recover certain marginal distributions. This method allows us to demonstrate the smoothness of certain conditional independence models.
Forcina (2012, Example 2) considers the model defined (up to some relabelling) by the conditional independences
which is equivalent to setting the parameters
to zero. Note that we cannot embed these parameters into a larger hierarchical parameterization, because each pairwise effect will ‘belong’ to a margin preceding it; for example, is a subset of , so for hierarchy the margin must precede ; by a similar argument, must precede which must precede . We therefore have a cyclic parameterization, referred to as a ‘loop’ by Forcina. None of the methods used in the previous sections seem well suited to dealing with this situation.
Forcina (2012) presents an algorithm for recovering joint distributions given parameterizations of this kind, together with a condition under which it is guaranteed to converge to the unique solution. However, this condition is on the spectral radius of a complicated Jacobian, and is difficult to verify except in a few special cases: a numerical test is suggested, but this does not constitute a proof of smoothness. Here we show that, at least in some cases, Forcina’s algorithm can be recast as a Markov chain whose stationary distribution is some margin of the relevant probability distribution.
Let be a disjoint sequence of sets with such that the conditional distributions for are known, together with . Then the marginal distributions are smoothly recoverable.
Define a matrix with entries
This is a (right) stochastic matrix with strictly positive entries, and the marginal distribution satisfies
In other words, is an invariant distribution for the Markov chain with transition matrix defined by . Since has a finite state-space and all transition probabilities are positive, the chain is positive recurrent and the equations have a unique solution (see, e.g. Norris, 1997). Hence is defined by the kernel of the matrix , and this is a smooth function of the original conditional probabilities. ∎
The Markov chain corresponding to is that which would be obtained by picking some , and then evolving using until we get back to . The equations can be solved iteratively by repeatedly right multiplying any positive vector by , so that it converges to the stationary distribution of the chain; this corresponds precisely to Forcina’s algorithm.
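A small numerical sketch of this construction follows. It assumes three binary variables labelled 0, 1, 2 and a cycle in which we know the conditionals of $X_2$ given $X_0$, of $X_1$ given $X_2$, and of $X_0$ given $X_1$; this labelling, the direction of the cycle and the helper `cond` are our choices for illustration. The conditionals are computed from a known joint table so that the recovered margin can be compared with the truth.

```python
import numpy as np

rng = np.random.default_rng(3)
p = rng.dirichlet(np.ones(8)).reshape(2, 2, 2)    # joint over (X0, X1, X2)

def cond(p, a, b):
    """Matrix C[x_b, x_a] = P(X_a = x_a | X_b = x_b) from the joint table p."""
    other = tuple(v for v in range(p.ndim) if v not in (a, b))
    pab = p.sum(axis=other)           # axes ordered by increasing variable index
    if a < b:
        pab = pab.T                   # now indexed [x_b, x_a]
    return pab / pab.sum(axis=1, keepdims=True)

# One sweep of the chain: x0 -> x2 ~ P(X2|X0) -> x1 ~ P(X1|X2) -> x0' ~ P(X0|X1)
T = cond(p, 2, 0) @ cond(p, 1, 2) @ cond(p, 0, 1)  # right stochastic, positive

pi = np.full(2, 0.5)                  # any positive starting vector
for _ in range(500):
    pi = pi @ T                       # repeated right multiplication
    pi /= pi.sum()                    # guard against rounding drift

p0_true = p.sum(axis=(1, 2))          # margin of X0, for comparison
```

Only the three conditional tables enter `T`; the joint table is used solely to generate them and to check the answer. Replacing the power iteration with a direct solve of the stationarity equations would give the same margin.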
Example 5.3 (Forcina (2012), Example 9).
Consider the cyclic parameterization .
The parameters corresponding to the first three margins in are equivalent to the conditional distributions , and . Using the conditionals in the manner suggested by Theorem 5.1, we can smoothly recover (for example) the margin (or equivalently ), and consequently is equivalent to the hierarchical parameterization .
Consider the model defined by
it consists of setting the parameters in below to zero.
We can embed in the complete parameterization . Note that using and the fact that , means we can construct the conditional distribution . Similarly we have , and . In a manner analogous to the previous example, we can set up a Markov chain whose stationary distribution is the marginal as follows. First pick . Now, for
draw from the distribution ;
draw from the distribution ;
draw from the distribution ;
draw from the distribution .
Then the distribution of converges to . We can therefore smoothly recover a distribution satisfying the conditional independence constraints from the 7 free parameters. The dimension of the model is full, so we have a smooth parameterization of the model, which is therefore a curved exponential family (Lauritzen, 1996).
Note that the construction of the Markov chain in Example 5.5 is only possible when the conditional independence constraints hold, so—unlike in Examples 5.3 and 5.4—we have not actually demonstrated that is generally smooth, only that the model defined by setting is a curved exponential family.
Some conditional independence models are non-smooth: e.g. the model defined by and (Drton, 2009). This is essentially because it requires that , and setting repeated (non-redundant) effects to zero always leads to non-smooth parameterizations.
We remark that all discrete conditional independence models on four variables either require repeated effects to be constrained in different margins, or can be shown to be smooth using the results of this section. However, the next example shows that for five variables the picture is incomplete.
The conditional independence model defined by
contains no repeated effects, and yet does not appear to be approachable using the methods outlined above. Empirically, Forcina’s algorithm seems to converge to the correct solution, which suggests that the model is indeed smooth.
6 Discussion
We have presented three new approaches to demonstrating that complete but non-hierarchical marginal log-linear parameterizations are smooth, although a general result eludes us. Note that each of the approaches provides an explicit algorithm for obtaining the probabilities from the parameterization, either using the map in Section 3, the fixed point iteration in Section 4, or the Markov chain in Section 5.
There are 104 complete MLL parameterizations on three variables, of which 23 are hierarchical and a further 4 consist of only two margins, so are smooth by the results of Bergsma and Rudas (2002) and Forcina (2012) respectively. These 27 were the only ones known to be smooth prior to this paper.
A further 5 can be shown smooth using Proposition 3.5, and one using Proposition 3.8 (Example 3.9). Another 26 can be dealt with using Lemma 4.5 in combination with other methods, and the approach in Example 4.7 can be applied to three more. Example 5.3 brings the total number of known smooth models to 63.
In addition, of the remaining 41 complete parameterizations, there are smooth mappings between a group of four and a group of three, so it remains to establish the smoothness (or otherwise) of at most 36 distinct parameterizations. As an example of a parameterization whose smoothness is still not established, consider:
We conjecture that any complete parameterization is smooth, a result which would enable us to show that models such as that given in Example 5.7 are curved exponential families of distributions.
Conjecture. Any complete MLL parameterization is smooth.
References
- Bergsma et al. (2009) W. Bergsma, M. A. Croon, and J. A. Hagenaars. Marginal models: For dependent, clustered, and longitudinal categorical data. Springer Science & Business Media, 2009.
- Bergsma and Rudas (2002) W. P. Bergsma and T. Rudas. Marginal models for categorical data. Annals of Statistics, 30(1):140–159, 2002.
- Colombi and Forcina (2014) R. Colombi and A. Forcina. A class of smooth models satisfying marginal and context specific conditional independencies. Journal of Multivariate Analysis, 126:75–85, 2014.
- Drton (2009) M. Drton. Discrete chain graph models. Bernoulli, 15(3):736–753, 2009.
- Evans and Richardson (2013) R. J. Evans and T. S. Richardson. Marginal log-linear parameterizations for graphical Markov models. Journal of the Royal Statistical Society, Series B, 75:743–768, 2013.
- Forcina (2012) A. Forcina. Smoothness of conditional independence models for discrete data. Journal of Multivariate Analysis, 106:49–56, 2012.
- Forcina et al. (2010) A. Forcina, M. Lupparelli, and G. M. Marchetti. Marginal parameterizations of discrete models defined by a set of conditional independencies. Journal of Multivariate Analysis, 101:2519–2527, 2010.
- Lang and Agresti (1994) J. B. Lang and A. Agresti. Simultaneously modeling joint and marginal distributions of multivariate categorical responses. Journal of the American Statistical Association, 89(426):625–632, 1994.
- Lauritzen (1996) S. L. Lauritzen. Graphical Models. Clarendon Press, Oxford, UK, 1996.
- Norris (1997) J. R. Norris. Markov Chains. Cambridge University Press, 1997.
- Rudas et al. (2010) T. Rudas, W. P. Bergsma, and R. Németh. Marginal log-linear parameterization of conditional independence models. Biometrika, 97:1006–1012, 2010.
Appendix A Technical Proofs
A.1 Proof of Lemma 4.2
Let be a vector indexed by subsets . Then if and only if for any ,
The -matrix with th entry is orthogonal, and therefore preserves vector lengths. Then the vector has entries with magnitude at most , and therefore has total magnitude at most 1. The same is therefore true of . ∎
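The orthogonality used in this argument can be verified directly for small cases. The sketch below assumes that the matrix in question is the normalized sign matrix indexed by subsets $A, B$ of a $k$-element set, with $(A,B)$th entry $2^{-k/2}(-1)^{|A \cap B|}$; this entry formula is our reading of the argument, and the variable names are illustrative.

```python
import itertools
import numpy as np

k = 3
subsets = [frozenset(c) for r in range(k + 1)
           for c in itertools.combinations(range(k), r)]

# H[A, B] = 2^{-k/2} * (-1)^{|A ∩ B|}, rows and columns indexed by subsets
H = np.array([[(-1) ** len(A & B) for B in subsets]
              for A in subsets]) / 2 ** (k / 2)

gram = H @ H.T                        # identity matrix if H is orthogonal

rng = np.random.default_rng(4)
v = rng.normal(size=2 ** k)
norm_before = np.linalg.norm(v)
norm_after = np.linalg.norm(H @ v)    # equal to norm_before
```

Orthogonality holds because, for distinct subsets $A$ and $C$, the signs $(-1)^{|A \cap B| + |C \cap B|}$ cancel in pairs as $B$ ranges over all subsets.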
Proof of Lemma 4.2.
For , define
so that for . Given ,
and note that
where the expression in braces is 2 if or 0 otherwise, so
which is an alternating sum of probabilities which sum to one, so has absolute value at most . The result follows from Lemma A.1. The second result is essentially identical, due to the symmetry between in (4). ∎