Hierarchical Models, Marginal Polytopes, and Linear Codes
In this paper, we explore a connection between binary hierarchical models, their marginal polytopes and codeword polytopes, the convex hulls of linear codes. The class of linear codes that are realizable by hierarchical models is determined. We classify all full dimensional polytopes with the property that their vertices form a linear code and give an algorithm that determines them.
Key words and phrases: 0/1 polytopes, linear codes, hierarchical models, exponential families
2000 Mathematics Subject Classification: 52B11, 94B05, 60C05
In theoretical statistics the marginal polytope plays an important role. It is the polytope of possible values that a sufficient statistic can take. It encodes in its face lattice the combinatorial structure of the boundary of the exponential family defined by the statistic. For a model on discrete random variables it can be represented with vertices that have only components 0 or 1, commonly called a 0/1 polytope.
In coding theory, when decoding binary linear codes, one can apply techniques from linear programming and optimize a linear function over the convex hull of the codewords, known as the codeword polytope [FWK03].
Observing that for certain choices of sufficient statistics on binary random variables these two notions coincide, our main contribution is a characterization of the corresponding polytopes. We do not address problems that are directly linked to coding theory. However, we do hope that our result will contribute to a better understanding of the closure of exponential families, which is an important problem in statistics.
The paper is organized as follows: In Section 2, we introduce the necessary notions to define hierarchical models and fix the notation. We review different descriptions of so-called interaction spaces in Section 3. In Section 4, we establish the link to coding theory. Finally, in Section 5, we give our main result, the classification of all full-dimensional polytopes whose vertices form a linear code, and give a recursive formula for their number.
2.1. Exponential Families of Hierarchical Models
Given a non-empty finite set , we denote the set of probability distributions on by . The support of is defined as . The set of distributions with full support is denoted as . The set has the geometrical structure of a -dimensional simplex lying in an affine hyperplane of the vector space of real-valued functions on . Statistical models, such as hierarchical models, are subsets of . In this paper, we will only consider so-called exponential families, which are smooth manifolds.
is called the exponential map. It acts componentwise by exponentiating and normalizing. Then, an exponential family (in ) is defined as the image of a linear subspace of .
An exponential family naturally has full support and is therefore contained in the open simplex . However, to get probability distributions with reduced support one has to pass to the closure with respect to the standard topology of .
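The action of the exponential map described above (exponentiate each component, then normalize) can be sketched in a few lines of Python. This is only an illustration: the function name `exp_map` and the representation of a function on the finite set as a plain list are assumptions made here, not notation from the paper.

```python
import math

def exp_map(f):
    """Exponentiate componentwise and normalize to a probability vector.

    `f` is a list of real numbers, one value per element of the finite set.
    """
    m = max(f)  # shifting by a constant leaves the normalized result unchanged
    weights = [math.exp(v - m) for v in f]
    z = sum(weights)  # normalization constant
    return [w / z for w in weights]

p = exp_map([0.0, 1.0, 2.0, 3.0])
uniform = exp_map([5.0, 5.0, 5.0, 5.0])
shifted = exp_map([10.0, 11.0, 12.0, 13.0])
```

Every output has strictly positive entries, which matches the statement that an exponential family has full support; adding a constant to the input does not change the image.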
Now we consider a compositional structure of induced by the set . Given a subset , we define
and the natural projection
In the following, we will abbreviate . One can view as the set of joint probability distributions of the binary random variables . We now use the compositional structure of in order to define exponential families in given by interaction spaces. Now, decompose in the form with , , and define to be the subspace of functions that do not depend on the configurations :
In the following, we apply these interaction spaces as building blocks for more general interaction spaces and associated exponential families [DS83]. The definition of a hierarchical model is based on the notion of a hypergraph [Lau96]:
A pre-hypergraph is a non-empty subset of that contains all atoms for .
A hypergraph is a pre-hypergraph that is (inclusion) complete in the following sense: If and it follows that .
For technical convenience, we have defined hypergraphs to be complete. In this way, it is easy to define a hierarchical model for each hypergraph. However, the notion of a pre-hypergraph turns out to be more natural in the context of the polytopes and linear codes that we consider below.
Given a hypergraph, we define the associated interaction space by
Note that, since a function that depends only on its arguments in , only depends on its arguments in , it suffices to consider the inclusion maximal elements in . We denote them by and have
We consider the corresponding exponential family:
The hierarchical model assigned to the hypergraph is the exponential family
We give two examples for hypergraphs:
(1) Graphical models: Let be an undirected graph, and define
Here, a clique is a set that satisfies the following property:
The exponential family is characterized by Markov properties with respect to (see [Lau96]).
(2) Interaction order: The hypergraph associated with a given interaction order is defined as
The elements of this hierarchy have nice interpretations. It can be seen that the closure of the family contains exactly all probability distributions that factor. This means that
where are the marginal distributions of . Generally, an element will allow a factorization as
where depends only on . However, the are not necessarily probability distributions and not unique. Note that , does not necessarily admit such a product structure.
We will clarify these definitions in the following simple example:
Consider the case . The configuration space is given as
The vector space of real valued functions is 4-dimensional and the probability measures form a 3-dimensional tetrahedron. Considering the hypergraphs of fixed interaction order and their exponential families, one has only two examples here: and , only the first being nontrivial.
Figure 1 shows the situation. The exponential family is a two-dimensional manifold lying inside the simplex. One should already think about this as a square (the two dimensional cube) molded into the simplex.
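For two binary units, the non-trivial family of this example can be explored numerically. The following sketch assumes the monomial parameterization, with energy `theta1 * x1 + theta2 * x2`; the function name and the parameter values are illustrative choices, not taken from the paper. It checks that each member of the family factors into the product of its two marginals.

```python
import math
from itertools import product

def independence_model(theta1, theta2):
    """Distribution on {0,1}^2 proportional to exp(theta1*x1 + theta2*x2)."""
    configs = list(product((0, 1), repeat=2))
    weights = {x: math.exp(theta1 * x[0] + theta2 * x[1]) for x in configs}
    z = sum(weights.values())  # normalization constant
    return {x: w / z for x, w in weights.items()}

p = independence_model(0.7, -1.3)

# One-dimensional marginals of the joint distribution.
p1 = {a: p[(a, 0)] + p[(a, 1)] for a in (0, 1)}
p2 = {b: p[(0, b)] + p[(1, b)] for b in (0, 1)}

# Largest deviation between the joint distribution and the product of marginals.
max_gap = max(abs(p[(a, b)] - p1[a] * p2[b]) for a in (0, 1) for b in (0, 1))
```

The two free parameters correspond to the two dimensions of the manifold inside the tetrahedron.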
In the following, we will study the interaction spaces more thoroughly by comparing different generating systems.
3. Generating Systems of Interaction Spaces
In this section, let be fixed. In statistics, different representations of exponential families have been considered, each of which has its own benefits and highlights different aspects. We will review a number of these representations. The particular choice of parity functions will allow us to make a link to coding theory.
As we have introduced exponential families, the key concept is the interaction space, which is sometimes also called the tangent space to the exponential family. This space completely characterizes the exponential family. However, there is a choice of parameterization of this space, which has been made differently in different fields. In terms of linear algebra, one has to choose a generating system of a linear space.
Let be any finite generating system of . Each such choice gives a different parameterization of the exponential family and a different sufficient statistic. The parameterization is identifiable if is a basis. The exponential family is parameterized as
where again is the normalization and equals the number of parameters. In statistical physics the exponent is commonly called the energy.
To each choice of there is a polytope constructed as follows. Consider the vectors
Each such vector has as its components the evaluation of every element in at . The polytope is
Since contains all atoms, it can be seen that the polytope has vertices and the dimension equals the dimension of the exponential family. By applying some classical theorems from statistics, such as the existence and uniqueness of maximum likelihood estimates [Kul68, Csi75], it can be seen that the points of the polytope are in one-to-one correspondence with points in the closure of the exponential family. As we have introduced it here, it is clear that the different choices of yield different representations of the same polytope in the sense that they are all affinely equivalent. In particular, they have the same face lattice.
The polytope encodes in its face lattice the combinatorial structure of the exponential family in the sense that a knowledge of the face lattice gives precise knowledge about the supports of elements in the closure of the exponential family. However, direct computation is infeasible for real world problems.
In statistical physics, and also for various inference methods it is of interest to compute the free energy, given as the logarithm of the partition function. There, variational principles and the techniques of Legendre transform are applied. In this setting the points in the polytope are then the so called dual parameters. See for instance [WJ03].
We will review a number of choices for :
Statistical Physics - Potentials
In statistical physics one considers so-called potentials [Win03, Geo88]. A potential is a collection of functions , where and , such that the energy can be written as a linear combination thereof. Typically one has a distinguished state called the vacuum. A potential is called normalized if as soon as for some . Given a strictly positive distribution, a corresponding normalized potential exists and is unique. In our binary setting, choosing as the vacuum state, the normalized potential is given by the functions , where .
One has , and a basis of the interaction space is given by together with the constant function . Expanding a function in terms of this basis was called the -expansion in the works of Caianiello [Cai86, Cai75].
In the case of pair interactions where the hypergraph is given by , the polytope coincides with the so called correlation polytope [DL97]. Extending the terminology to an arbitrary hypergraph , we call the moment polytope, as each point in it is the vector of moments of some distribution.
One representation of an exponential family is given via the linear map that computes the marginals. Denote the set of inclusion maximal sets in . Consider the linear map
that, for a given vector , computes the set of its maximal marginals, defined as
When represented as a matrix with respect to the canonical basis, has rows indexed by pairs of a set and a configuration . The columns are indexed by configurations . Each component then contains the value of the indicator .
Denote the -th column of this matrix as . Then the exponential family is parameterized as
In terms of these vectors, the polytope is commonly called the marginal polytope. It is represented as a 0/1 polytope embedded in a high-dimensional space.
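The matrix of the marginal map can be assembled directly from the description above: one row per pair of a maximal set and a configuration on that set, one column per joint configuration, and an indicator in each entry. The sketch below is a minimal illustration for binary units; the name `marginal_matrix`, the zero-based indexing of units, and the tuple encoding of sets are assumptions made here.

```python
from itertools import product

def marginal_matrix(n, maximal_sets):
    """Return the joint configurations and the 0/1 matrix of the marginal map.

    Rows are indexed by pairs (A, y) with A a maximal set and y a configuration
    on A; columns are indexed by joint configurations x in {0,1}^n.
    """
    configs = list(product((0, 1), repeat=n))
    rows = []
    for a in maximal_sets:
        for y in product((0, 1), repeat=len(a)):
            # Indicator that the restriction of x to the set a equals y.
            rows.append([int(tuple(x[i] for i in a) == y) for x in configs])
    return configs, rows

# Two binary units with the two singletons as maximal sets.
configs, matrix = marginal_matrix(2, [(0,), (1,)])
```

Each column of `matrix` is a 0/1 vector, and the convex hull of the columns is the polytope in this representation. Every column contains exactly one 1 per maximal set, which already shows that the columns lie in a proper affine subspace.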
An orthogonal basis of characters
In the binary case , a natural basis for is given by the characters of . Here, we assume pointwise addition modulo 2 as the group operation. For every subset define the function by
where . It can be seen that, if is a hypergraph, together with the constant function is an orthogonal basis of the interaction space . This approach was followed in [KA06]. Various people, starting with Caianiello [Cai75], have called this the -expansion. Note that if one considers random variables taking values in , this basis equals the monomial basis considered above.
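Orthogonality of the characters is easy to confirm numerically for a small number of units. The sketch below assumes the standard form of a character of the group of binary strings under addition modulo 2, namely the sign determined by the parity of the configuration restricted to the subset; the function names and the choice of three units are illustrative.

```python
from itertools import combinations, product

N = 3
configs = list(product((0, 1), repeat=N))
# All subsets of the units, including the empty set (the constant function 1).
subsets = [frozenset(c) for r in range(N + 1) for c in combinations(range(N), r)]

def character(a, x):
    """Assumed form: +1 if x has an even number of ones inside a, else -1."""
    return -1 if sum(x[i] for i in a) % 2 else 1

def inner(a, b):
    """Standard inner product of two characters as functions on {0,1}^N."""
    return sum(character(a, x) * character(b, x) for x in configs)

gram = {(a, b): inner(a, b) for a in subsets for b in subsets}
```

The Gram matrix is diagonal, with every diagonal entry equal to the number of configurations.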
A basis of parity functions
Finally, we will introduce yet another basis of which is derived from the basis of characters. To each , we define a vector in .
The following proposition is easily checked:
Let be the constant function . The set is a basis of .
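The basis property can be checked by brute force in a small case. The sketch below assumes that the parity function of a subset evaluates to the number of ones of the configuration inside that subset, taken modulo 2, and it uses the full hypergraph on three units as an illustrative choice. The helper `rank` is a plain Gaussian elimination over the rationals written for this example.

```python
from fractions import Fraction
from itertools import combinations, product

def rank(rows):
    """Rank of a matrix over the rationals by Gaussian elimination."""
    m = [[Fraction(v) for v in r] for r in rows]
    rk = 0
    for col in range(len(m[0])):
        piv = next((i for i in range(rk, len(m)) if m[i][col] != 0), None)
        if piv is None:
            continue
        m[rk], m[piv] = m[piv], m[rk]
        for i in range(len(m)):
            if i != rk and m[i][col] != 0:
                factor = m[i][col] / m[rk][col]
                m[i] = [a - factor * b for a, b in zip(m[i], m[rk])]
        rk += 1
    return rk

def parity(a, x):
    """Assumed form of the parity function: ones of x inside a, modulo 2."""
    return sum(x[i] for i in a) % 2

N = 3
configs = list(product((0, 1), repeat=N))
nonempty = [c for r in range(1, N + 1) for c in combinations(range(N), r)]
# Constant function 1 followed by one parity function per non-empty subset.
functions = [[1] * len(configs)] + [[parity(a, x) for x in configs] for a in nonempty]
basis_rank = rank(functions)
```

Here the eight functions are linearly independent in the eight-dimensional space of real functions on the configurations, so they form a basis.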
One crucial point about choosing this representation is that it gives, if the constant function is omitted, a full dimensional 0/1 polytope, the vertices of which form an additive group and thereby a linear code (see Proposition 10). For all other choices of discussed in this section the image of is not a subgroup of or the multiplicative group .
While in the construction of a hierarchical model we assumed a hypergraph, the following polytope is an interesting object of study also in the general case of a pre-hypergraph:
Let be a pre-hypergraph. We define
If is a hypergraph, then this is affinely equivalent to the marginal polytope of the corresponding exponential family. In the case of the hypergraphs we write . The rest of the paper is devoted to the study of this class of polytopes.
Namely, to each graph we can associate the hypergraph . This is distinct from what was called a graphical model above, since the cliques are not considered. Some authors refer to the corresponding statistical model as a graph model. From construct the coned graph with an additional vertex:
Then, denoting the CUT polytope of as one has
Using the representation in terms of the vectors , , the proof of this equivalence becomes a simple renaming of coordinates.
Remark (Covariance Mapping).
As remarked above, in the representation with monomials one finds the correlation polytope as a special case. From the last remark it follows that the CUT-polytope of the complete graph is equal to . There exists an affine equivalence between and called the covariance mapping [DL97]. It can be seen that this mapping generalizes to a mapping between binary marginal polytopes and the corresponding moment polytopes. It therefore might be suitable to consider the parity representations of binary marginal polytopes for a generalization of CUT-polytopes to arbitrary (pre)-hypergraphs.
3.1. Computations and elementary properties
Using the geometry software polymake [GJ00], one can compute linear descriptions of polytopes. As an example, we give here the F-Vectors of for the cases . For , the F-Vector is too complicated to be computed by the brute force approach of polymake. However, waiting sufficiently long, one can get the 6800 facet defining inequalities of and the 3835488 facets of .
In Tables 1 and 2, we give the F-Vectors of for , computed using polymake. The rows label the dimension of the faces, the columns the value of . The reader may notice that the face lattices of are, up to a certain dimension, isomorphic to the face lattice of the simplex. This property, commonly called neighborliness, follows from a general result in [Kah08]. The last row refers to whether the polytope is simple or not.
In the following, we will list elementary properties of that follow easily from the definition.
is the -cube.
is the -dimensional simplex.
every has dimension .
every has vertices.
is a vertex.
every is a projection of the -dimensional simplex along coordinate axes.
For every , there is a projection along coordinate axes that projects it to the -cube .
In [HS02] it was remarked that has exactly facets. The extreme points of these facets are also known. A set defines a face if and only if it contains neither nor its complement. Note that the set and its complement are exactly the set of configurations with a fixed parity. As the vertices of have only one affine dependency, it is not difficult to prove this fact using the Gale transform. By the above is combinatorially isomorphic to the so called cyclic polytope [Zie94].
In the following, we develop the connection to coding theory.
4. A Link to Coding Theory
We briefly recall the definition of a linear code. For a detailed introduction to coding theory see for instance [van99]. Consider the finite field with addition and multiplication mod 2. In coding theory, one studies in particular vector spaces over this field.
A binary -linear code is a linear subspace of such that . A generator matrix for is a by matrix which has as its rows a basis of . Given one can find an equivalent code such that it has a generator matrix in standard form, i.e. , where is the by identity matrix. (Two codes are called equivalent if one can be transformed into the other by applying a permutation on the positions in the codewords, and for each position a permutation of the symbols.)
The following proposition states that the vertices of form a linear code for any pre-hypergraph . A special case of this connection has been mentioned in Example 2 in [WJ03].
Let be considered as a vector space over the finite field . Then the image of under is a linear subspace. If we also consider as a vector space over , is an injective homomorphism between vector spaces. Its image forms an -linear code. A generator matrix in standard form has as its rows the vectors for , where is the -th unit vector in .
Since scalar multiplication is trivial, we only need to show
Let , it suffices to show the identity for . To do so, introduce
Then , , and . We find that is the symmetric difference of and :
Since , we have that in
and therefore (2) holds. We now show that is injective. To see this, assume that . Since contains all atoms , we get for every : . This implies and, hence, . Since considered as an vector space has dimension , also has dimension and therefore forms an -linear code. ∎
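The statement just proved can be checked by brute force on a small pre-hypergraph. The sketch below assumes that the map sends a configuration to its vector of parities, one per set of the pre-hypergraph; the particular pre-hypergraph, the zero-based unit labels, and all function names are illustrative choices. The atoms are listed first so that the generator matrix comes out in standard form.

```python
from itertools import product

def codeword(sets, x):
    """Vector of parities of the configuration x, one entry per set."""
    return tuple(sum(x[i] for i in a) % 2 for a in sets)

def xor(u, v):
    """Componentwise addition modulo 2."""
    return tuple((a + b) % 2 for a, b in zip(u, v))

N = 3
# A pre-hypergraph: all atoms plus two further sets (not inclusion complete).
sets = [(0,), (1,), (2,), (0, 1), (0, 1, 2)]
configs = list(product((0, 1), repeat=N))
code = {x: codeword(sets, x) for x in configs}

# The map respects addition modulo 2 and is injective.
is_homomorphism = all(
    code[xor(x, y)] == xor(code[x], code[y]) for x in configs for y in configs
)
is_injective = len(set(code.values())) == len(configs)

# Rows of the generator matrix: images of the unit vectors.
generator = [code[tuple(int(j == i) for j in range(N))] for i in range(N)]
```

The first three columns of `generator` form the identity matrix, which is the standard form mentioned in the proposition.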
To write down the generator matrix, one has to impose a numbering on the elements in . If the numbering is chosen such that for , then the generator matrix is in standard form.
An important property of a linear code is its distance, which is defined as the minimal Hamming distance between different elements of the code. For the hierarchical model of the hypergraph , the distance of the code is given by
Let denote the Hamming distance of . If , then equals the number of subsets of which contain a given element and have cardinality at most . ∎
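For small parameters the distance can be computed by enumeration and compared with the count of subsets through a fixed element. The sketch below uses the fact that the minimum distance of a linear code equals the minimum weight of a non-zero codeword. The function names, and the use of `n` for the number of units and `k` for the interaction order, are assumptions made for this illustration; the comparison is only run for the small cases listed.

```python
from itertools import combinations, product
from math import comb

def min_distance(n, k):
    """Minimum weight of a non-zero codeword for interaction order k on n units."""
    sets = [c for r in range(1, k + 1) for c in combinations(range(n), r)]
    weights = [
        sum(sum(x[i] for i in a) % 2 for a in sets)
        for x in product((0, 1), repeat=n)
        if any(x)
    ]
    return min(weights)

def subsets_through_point(n, k):
    """Number of subsets of the n units that contain a fixed unit and have size at most k."""
    return sum(comb(n - 1, j) for j in range(k))

# Compare both quantities for all orders k on up to four units.
table = {
    (n, k): (min_distance(n, k), subsets_through_point(n, k))
    for n in range(1, 5)
    for k in range(1, n + 1)
}
```

In all of these cases the minimum is attained by a configuration with a single 1, whose weight counts exactly the sets containing that unit.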
In the following, we will elaborate on the opposite direction. Let . Assume we are given an linear code. Without loss of generality, we assume that it has a generator matrix in standard form. We will construct a pre-hypergraph from the columns of the generator matrix. Since is a set, while the columns are a list, repetitions of columns will be lost. If one considers only non-repetitive codes, then our construction is injective, and the codewords are given by the vertices of .
Let denote the identity matrix in dimension . Assume the generator matrix has no two identical columns. (This implies .) Denote by the canonical basis of . Using the columns of , we define sets
Note that the elements of are numbered in a natural way such that we can use as an index set for the columns of .
To see that is the set of rows of the generator matrix, we evaluate
which holds by definition of the .
Summarizing, every binary linear code (in standard form) corresponds to a pre-hypergraph. However, two codes that differ only in repetitions of columns in the generator matrix will be mapped to the same pre-hypergraph. Then, if it is a hypergraph, the linear code is the linear code of a hierarchical model.
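The passage from a generator matrix to a pre-hypergraph can be sketched as follows. The example matrix, the zero-based row labels, and the function names are hypothetical; the reading assumed here is that each column is turned into the set of row indices at which it has a 1, and that the encoding of a message then agrees with the vector of parities over these sets.

```python
from itertools import product

# A hypothetical generator matrix in standard form (identity block first),
# with pairwise distinct columns.
G = [
    (1, 0, 0, 1, 1),
    (0, 1, 0, 1, 1),
    (0, 0, 1, 0, 1),
]
k, n = len(G), len(G[0])

# One set per column: the rows in which that column has a 1.
sets = [tuple(i for i in range(k) if G[i][j]) for j in range(n)]

def encode(x):
    """Codeword x * G over the field with two elements."""
    return tuple(sum(x[i] * G[i][j] for i in range(k)) % 2 for j in range(n))

def parity_vector(x):
    """Vector of parities of x over the sets read off from the columns."""
    return tuple(sum(x[i] for i in a) % 2 for a in sets)

agree = all(encode(x) == parity_vector(x) for x in product((0, 1), repeat=k))
```

The identity block yields the atoms, so the resulting family of sets is a pre-hypergraph. Whether it is also a hypergraph has to be checked separately by testing inclusion completeness.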
As we have seen, the polytopes are full dimensional polytopes such that the vertices form a linear code. In this last section, we classify all polytopes with this property. Then we investigate which of them can be realized as polytopes of hierarchical models. For a convex polytope , let denote the vertex set of . For , put
Hence is an Abelian group that is canonically isomorphic to . We consider as a subset of and write “” whenever we mean addition modulo 2, while “” means ordinary addition in .
In the following, we develop an algorithm that determines - by induction for every - all polytopes with satisfying the following conditions:
is a subgroup of .
has dimension .
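For very small dimensions, the two conditions can be checked by exhaustive search, independently of the inductive algorithm developed in this section. The sketch below enumerates all subgroups of the binary strings of length `d` as spans of at most `d` generators and keeps those whose convex hull is full dimensional. Since the zero vector belongs to every subgroup, the dimension of the hull equals the rank of the vertex matrix over the rationals. All names are illustrative, and the search is only practical for small `d`.

```python
from fractions import Fraction
from itertools import combinations, product

def real_rank(rows):
    """Rank of a matrix over the rationals by Gaussian elimination."""
    m = [[Fraction(v) for v in r] for r in rows]
    rk = 0
    for col in range(len(m[0])):
        piv = next((i for i in range(rk, len(m)) if m[i][col] != 0), None)
        if piv is None:
            continue
        m[rk], m[piv] = m[piv], m[rk]
        for i in range(len(m)):
            if i != rk and m[i][col] != 0:
                factor = m[i][col] / m[rk][col]
                m[i] = [a - factor * b for a, b in zip(m[i], m[rk])]
        rk += 1
    return rk

def span(generators, d):
    """Subgroup of {0,1}^d generated by the given vectors (addition modulo 2)."""
    group = {(0,) * d}
    for g in generators:
        group |= {tuple((a + b) % 2 for a, b in zip(h, g)) for h in group}
    return frozenset(group)

def full_dimensional_subgroups(d):
    """All subgroups of {0,1}^d whose convex hull has dimension d."""
    vectors = [v for v in product((0, 1), repeat=d) if any(v)]
    groups = {span(gens, d) for r in range(d + 1) for gens in combinations(vectors, r)}
    return [g for g in groups if real_rank(sorted(g)) == d]

counts = [len(full_dimensional_subgroups(d)) for d in (1, 2, 3)]
```

For dimensions 1 and 2 the search finds only the cube. In dimension 3 it finds the cube and one subgroup with four elements, whose convex hull is a simplex.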
Note that the number of vertices of such a polytope is a power of two. Of course, the full -dimensional cube satisfies (I) and (II). To start the induction, we remark that there are no further such polytopes in the cases and . For , the 3-dimensional regular simplex with
More generally, by [Wen06, Theorem 2.2], we have the following
For , the following statements are equivalent:
contains some subgroup such that is a regular simplex of dimension .
is some power of 2.
determines a subgroup of ; however, the convex closure has dimension 2.
For fixed , define the bijections and by
Moreover, let denote the center of the -cube .
The polytope , given by
has dimension .
There does not exist some index with such that . In other words, none of the affine hyperplanes separates and .
One has .
One has .
Since has dimension and since , we must have whenever . This means that - and hence also - is contained in the -dimensional hyperplane , in contradiction to .
This means that
where the sum is taken in .
Now fix . Since , we get also
Suppose that the assertion is wrong; hence is contained in some -dimensional -homogeneous- hyperplane . Since
the polytope has the same dimension as , that is . Thus, the restriction is a linear isomorphism from onto , and there exists some -linear map satisfying
By definition of , this means:
Hence, and are linearly separated by the affine hyperplane
By (iv) this is impossible. ∎
We investigate the statement of Proposition 12 for polytopes corresponding to a pre-hypergraph . We start by considering the matrix which has as its rows the vectors , where . The rows are labeled by the binary strings of length , that is by , while the columns are indexed by the non-empty subsets of . Therefore the rows of this matrix are the coordinates of the vertices of the simplex :
We note the following facts:
The columns of this matrix are exactly the non-zero binary strings of length .
There are subgroups of index 2 of the -cube, which correspond to the columns of the matrix. To define them let a column be fixed, then put . The maps are exactly the surjective homomorphisms having the nontrivial subgroups as their kernels.
The vertices of every polytope are given by deleting columns from this matrix that correspond to sets not in .
In particular, by restriction to the first columns, we get the vertices of the -cube .
Now, assume that is the -cube. We choose a column of the matrix, corresponding to a subgroup of index 2. There are two possibilities. If we choose a column corresponding to an atom, then is wrong: the dimension does not grow when adding this column to the coordinates (as we have duplicated a coordinate). If, on the other hand, we choose a column corresponding to a set with cardinality two or more, then we are in the situation of Proposition 12, since holds. The lift (3) will be full dimensional, and its vertices are given by the submatrix with columns . Continuing from here, choosing another subgroup, the dimension will grow if and only if it does not correspond to one of the sets . Iteratively, the choices narrow down and, finally, when all columns have been chosen, the polytope is a simplex.
We will now formalize this procedure. For a fixed polytope as in Proposition 12, put
Initialization for :
Step : Based on construct a new set consisting of all polytopes such that there exists with
where runs through all subgroups of with and for .
Note that in the case , the number of vertices is doubled, while in the other cases the number of vertices of equals the number of vertices of . Furthermore, it is interesting to see that the two possible operations commute in the following sense. Starting from some cube , lifting it to and then choosing a subgroup to apply the lift (4) gives the same polytope as choosing the subgroup from and then taking the prism over the lifted polytope, where is the canonical projection. Therefore, all polytopes that are constructed by the algorithm can be thought of as lifted cubes .
The classification will be complete with:
If satisfies (4), then clearly is a subgroup of , because is a subgroup of with . Moreover, (ii) (i) in Proposition 12 implies that has dimension , because for . Hence, satisfies conditions (I) and (II).
Vice versa, assume that fulfills (I) and (II). Consider again the projection onto the first coordinates, and put . Since has dimension , has dimension . If is not injective, then is the prism , because is a subgroup of . If is injective, put
Then is a subgroup of with , because has dimension . Moreover, equation (4) holds for as just defined. Finally, Proposition 12, (i) (ii) shows that for . Hence, our algorithm includes the determination of . ∎
As a first application of Theorem 15 we can count the number of -dimensional polytopes that satisfy conditions (I) and (II). Let . For , let denote the number of all 0/1 polytopes with that satisfy (I) and (II). Then one has obviously
We have for , because a polytope with at most vertices cannot have dimension . Furthermore, we clearly have for all .
For one has
The first few values are given in Table 3.
It is easy to compute this number also for larger values of . For instance
Finally, using the Corollary we can show that, among the full dimensional 0/1-polytopes with vertices the convex hulls of linear codes are exceptional. For , let denote the number of all 0/1 polytopes with vertices satisfying only condition (II). Hence, the number of all 0/1 polytopes of dimension trivially satisfies
Moreover, we get
For , one has
Suppose that is a proper subgroup of with and .
If is another subgroup of with , then we have
There are subsets of with and ; namely, these are all sets of the form
For as in (9), we get , because otherwise, would be contained in a -unique- hyperplane with , a contradiction to . Together with (8), we obtain the first inequality in (7). The second one is trivial in view of .
As a concluding remark, we study the question of constructing a statistical model from a given polytope. Assume