Hierarchical Models, Marginal Polytopes, and Linear Codes

Thomas Kahle
Max Planck Institute for Mathematics in the Sciences
Inselstrasse 22
D-04103 Leipzig, Germany
kahle@mis.mpg.de

Walter Wenzel
Fachhochschule Magdeburg
Fachbereich Wasser- und Kreislaufwirtschaft
Breitscheidstrasse 2
D-39114 Magdeburg, Germany
walter@math.uni-bielefeld.de

Nihat Ay
Max Planck Institute for Mathematics in the Sciences
Inselstrasse 22
D-04103 Leipzig, Germany
and Santa Fe Institute
1399 Hyde Park Road
Santa Fe, NM 87501, USA
nay@mis.mpg.de
July 16, 2019
Abstract.

In this paper, we explore a connection between binary hierarchical models, their marginal polytopes and codeword polytopes, the convex hulls of linear codes. The class of linear codes that are realizable by hierarchical models is determined. We classify all full dimensional polytopes with the property that their vertices form a linear code and give an algorithm that determines them.

Key words and phrases:
0/1 polytopes, linear codes, hierarchical models, exponential families
2000 Mathematics Subject Classification:
52B11, 94B05, 60C05
The first author is supported by the Volkswagen Foundation
The third author is supported by the Santa Fe Institute

1. Introduction

In theoretical statistics the marginal polytope plays an important role. It is the polytope of possible values that a sufficient statistic can take, and its face lattice encodes the combinatorial structure of the boundary of the exponential family defined by that statistic. For a model on discrete random variables it can be represented with vertices whose components are all 0 or 1; such a polytope is commonly called a 0/1 polytope.

In coding theory when decoding binary linear codes one can apply techniques from linear programming and optimize a linear function over the convex hull of the code words, known as the codeword polytope [FWK03].

Observing that for certain choices of sufficient statistics on binary random variables these two notions coincide, our main contribution is a characterization of the corresponding polytopes. We do not address problems that are directly linked to coding theory. However, we do hope that our result will contribute to a better understanding of the closure of exponential families, which is an important problem in statistics.

The paper is organized as follows: In Section 2, we introduce the necessary notions to define hierarchical models and fix the notation. We review different descriptions of so called interaction spaces in Section 3. In Section 4, we establish the link to coding theory. Finally, in Section 5 we give our main result, the classification of all such full dimensional polytopes whose vertices form a linear code and give a recursive formula for their number.

2. Preliminaries

2.1. Exponential Families of Hierarchical Models

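Given a non-empty finite set $\mathcal{X}$, we denote the set of probability distributions on $\mathcal{X}$ by $\overline{\mathcal{P}}(\mathcal{X})$. The support of $P \in \overline{\mathcal{P}}(\mathcal{X})$ is defined as $\operatorname{supp}(P) := \{x \in \mathcal{X} : P(x) > 0\}$. The set of distributions with full support is denoted as $\mathcal{P}(\mathcal{X})$. The set $\overline{\mathcal{P}}(\mathcal{X})$ has the geometrical structure of a $(|\mathcal{X}| - 1)$-dimensional simplex lying in an affine hyperplane of $\mathbb{R}^{\mathcal{X}}$, the vector space of real valued functions on $\mathcal{X}$. Statistical models, such as hierarchical models, are subsets of $\overline{\mathcal{P}}(\mathcal{X})$. In this paper, we will only consider so-called exponential families, which are smooth manifolds.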

Definition 1.

The map

is called the exponential map. It acts componentwise by exponentiating and normalizing. Then, an exponential family (in ) is defined as the image of a linear subspace of .

An exponential family naturally has full support and is therefore contained in the open simplex . However, to get probability distributions with reduced support one has to pass to the closure with respect to the standard topology of .
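The componentwise action of the exponential map can be sketched as follows. This is a minimal illustration, assuming a function on a finite set is stored as a list of real values; the name `exponential_map` is hypothetical and not taken from the paper.

```python
import math

def exponential_map(f):
    """Send a real-valued function on a finite set (a list of values)
    to the probability vector exp(f) / sum(exp(f))."""
    m = max(f)                               # shift by the maximum for numerical stability
    weights = [math.exp(v - m) for v in f]   # exponentiate componentwise
    z = sum(weights)                         # normalization constant
    return [w / z for w in weights]

# Two functions that differ by a constant are mapped to the same distribution.
p = exponential_map([0.0, 1.0, -2.0, 0.5])
q = exponential_map([3.0, 4.0, 1.0, 3.5])
```

Every component of the result is strictly positive, which reflects the statement that an exponential family has full support; distributions with smaller support only appear in the closure.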

Now we consider a compositional structure of induced by the set . Given a subset , we define

and the natural projection

In the following, we will abbreviate . One can view as the set of joint probability distributions of the binary random variables . We now use the compositional structure of in order to define exponential families in given by interaction spaces. Now, decompose in the form with , , and define to be the subspace of functions that do not depend on the configurations :

In the following, we apply these interaction spaces as building blocks for more general interaction spaces and associated exponential families [DS83]. The definition of a hierarchical model is based on the notion of a hypergraph [Lau96]:

Definition 2.

A pre-hypergraph is a non-empty subset of that contains all atoms for .

A hypergraph is a pre-hypergraph that is (inclusion) complete in the following sense: If and it follows that .

Remark.

For technical convenience, we have defined hypergraphs to be complete. In this way, it is easy to define a hierarchical model for each hypergraph. However, the notion of a pre-hypergraph turns out to be more natural in the context of the polytopes and linear codes that we consider below.

Given a hypergraph, we define the associated interaction space by

Note that, since a function that depends only on its arguments in , only depends on its arguments in , it suffices to consider the inclusion maximal elements in . We denote them by and have

We consider the corresponding exponential family:

Definition 3.

The hierarchical model assigned to the hypergraph is the exponential family

We give two examples for hypergraphs:

Example 4.


(1) Graphical models: Let be an undirected graph, and define

Here, a clique is a set that satisfies the following property:

The exponential family is characterized by Markov properties with respect to (see [Lau96]).
(2) Interaction order: The hypergraph associated with a given interaction order is defined as

If appropriate, we will sometimes drop the and write . We have defined a corresponding hierarchy of exponential families studied in [Ama01, AK06]:

The elements of this hierarchy have nice interpretations. It can be seen that the closure of the family contains exactly all probability distributions that factor. This means that

where are the marginal distributions of . Generally, an element will allow a factorization as

where depends only on . However, the are not necessarily probability distributions and not unique. Note that , does not necessarily admit such a product structure.

We will clarify these definitions in the following simple

Example 5.

Consider the case . The configuration space is given as

The vector space of real valued functions is 4-dimensional and the probability measures form a 3-dimensional tetrahedron. Considering the hypergraphs of fixed interaction order and their exponential families, one has only two examples here: and , only the first being nontrivial.

Figure 1. The exponential family in the simplex of probability distributions.

Figure 1 shows the situation. The exponential family is a two-dimensional manifold lying inside the simplex. One should already think about this as a square (the two dimensional cube) molded into the simplex.

In the following, we will study the interaction spaces more thoroughly by comparing different generating systems.

3. Generating Systems of Interaction Spaces

In this section, let be fixed. In statistics, different representations of exponential families have been considered, each of which has its own benefits and highlights different aspects. We will review a number of these representations. The particular choice of parity functions will allow us to make a link to coding theory.

As we have introduced exponential families, the key concept is the interaction space, which is sometimes also called the tangent space to the exponential family. This space completely characterizes the exponential family. However, there is a choice of the parameterization of this space, which has been made differently in different fields. Speaking in terms of linear algebra, one has to choose a generating system of a linear space.

Let be any finite generating system of . Each such choice gives a different parameterization of the exponential family and different sufficient statistics. The parameterization is identifiable if is a basis. The exponential family is parameterized as

where again is the normalization and equals the number of parameters. In statistical physics the exponent is commonly called the energy.

To each choice of there is a polytope constructed as follows. Consider the vectors

Each such vector has as its components the evaluation of every element in at . The polytope is

Since contains all atoms, it can be seen that the polytope has vertices and the dimension equals the dimension of the exponential family. By applying some classical theorems from statistics, such as the existence and uniqueness of maximum likelihood estimates [Kul68, Csi75], it can be seen that the points of the polytope are in one-to-one correspondence with points in the closure of the exponential family. As we have introduced it here, it is clear that the different choices of yield different representations of the same polytope in the sense that they are all affinely equivalent. In particular, they have the same face lattice.

The polytope encodes in its face lattice the combinatorial structure of the exponential family, in the sense that knowing the face lattice gives precise knowledge about the supports of elements in the closure of the exponential family. However, direct computation is infeasible for real-world problems.

In statistical physics, and also for various inference methods, it is of interest to compute the free energy, given as the logarithm of the partition function. There, variational principles and the techniques of the Legendre transform are applied. In this setting the points in the polytope are then the so-called dual parameters. See for instance [WJ03].

We will review a number of choices for :

Statistical Physics - Potentials

In statistical physics one considers so-called potentials [Win03, Geo88]. A potential is a collection of functions , where and , such that the energy can be written as a linear combination of them. Typically one has a distinguished state called the vacuum. A potential is called normalized if as soon as for some . Given a strictly positive distribution, a corresponding normalized potential exists and is unique. In our binary setting, choosing as the vacuum state, the normalized potential is given by the functions , where .

One has , and a basis of the interaction space is given by together with the constant function . Expanding a function in terms of this basis was called the -expansion in the works of Caianiello [Cai86, Cai75].

In the case of pair interactions where the hypergraph is given by , the polytope coincides with the so called correlation polytope [DL97]. Extending the terminology to an arbitrary hypergraph , we call the moment polytope, as each point in it is the vector of moments of some distribution.

Marginals

One representation of an exponential family is given via the linear map that computes the marginals. Denote the set of inclusion maximal sets in . Consider the linear map

which, for a given vector , computes the set of its maximal marginals, defined as

When represented as a matrix with respect to the canonical basis, has rows indexed by pairs of a set and a configuration . The columns are indexed by configurations . Each component then contains the value of the indicator .

Denote the -th column of this matrix as . Then the exponential family is parameterized as

In terms of these vectors, the polytope is commonly called the marginal polytope. It is represented as a 0/1 polytope embedded in a high dimensional space.
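The matrix of the marginal map described above can be built explicitly for small examples. The sketch below assumes binary variables indexed from 0 and maximal sets given as tuples of indices; the helper names `marginal_matrix` and `apply_matrix` are hypothetical.

```python
from itertools import product

def marginal_matrix(n, maximal_sets):
    """Rows are indexed by pairs (A, y) of a maximal set A and a configuration y
    on A; columns are indexed by the configurations x of all n binary variables.
    The entry is 1 if x restricted to A equals y, and 0 otherwise."""
    configs = list(product((0, 1), repeat=n))
    rows = [(A, y) for A in maximal_sets for y in product((0, 1), repeat=len(A))]
    matrix = [[1 if tuple(x[i] for i in A) == y else 0 for x in configs]
              for (A, y) in rows]
    return rows, configs, matrix

def apply_matrix(matrix, p):
    """Compute the marginals of a vector p indexed by configurations."""
    return [sum(m * q for m, q in zip(row, p)) for row in matrix]

# Three binary variables with maximal sets {0,1} and {1,2}.
rows, configs, M = marginal_matrix(3, [(0, 1), (1, 2)])
uniform_marginals = apply_matrix(M, [1 / 8] * 8)
```

Each column of `M` is one vertex of the marginal polytope in this representation. Every column contains exactly one 1 per maximal set, so all vertices are 0/1 vectors.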

An orthogonal basis of characters

In the binary case , a natural basis for is given by the characters of . Here, we assume pointwise addition modulo 2 as the group operation. For every subset define the function by

where . It can be seen that, if is a hypergraph, together with the constant function is an orthogonal basis of the interaction space . This approach was followed in [KA06]. Various people, starting with Caianiello [Cai75] have called this the -expansion. Note that if one considers random variables taking values in this basis equals the monomial basis considered above.

A basis of parity functions

Finally, we will introduce yet another basis of which is derived from the basis of characters. To each , we define a vector in .

(1)

The following proposition is easily checked:

Proposition 6.

Let be the constant function . The set is a basis of .

One crucial point about choosing this representation is that it gives, if the constant function is omitted, a full dimensional 0/1 polytope, the vertices of which form an additive group and thereby a linear code (see Proposition 10). For all other choices of discussed in this section the image of is not a subgroup of or the multiplicative group .

While in the construction of a hierarchical model we assumed a hypergraph, the following polytope is an interesting object of study also in the general case of a pre-hypergraph:

Definition 7.

Let be a pre-hypergraph. We define

If is a hypergraph, then this is affinely equivalent to the marginal polytope of the corresponding exponential family. In the case of the hypergraphs we write . The rest of the paper is devoted to the study of this class of polytopes.

Remark (CUT-Polytopes).

There is a well known [DL97] affine equivalence between CUT polytopes of graphs [Zie00] and binary marginal polytopes:

Namely, to each graph we can associate the hypergraph . This is distinct from what was called a graphical model above, since the cliques are not considered. Some authors refer to the corresponding statistical model as a graph model. From construct the coned graph with an additional vertex:

and edges

Then, denoting the CUT polytope of as one has

Using the representation in terms of the vectors , , the proof of this equivalence becomes a simple renaming of coordinates.

Remark (Covariance Mapping).

As remarked above, in the representation with monomials one finds the correlation polytope as a special case. From the last remark it follows that the CUT-polytope of the complete graph is equal to . There exists an affine equivalence between and called the covariance mapping [DL97]. It can be seen that this mapping generalizes to a mapping between binary marginal polytopes and the corresponding moment polytopes. It therefore might be suitable to consider the parity representations of binary marginal polytopes for a generalization of CUT-polytopes to arbitrary (pre)-hypergraphs.

3.1. Computations and elementary properties

Using the geometry software polymake [GJ00], one can compute linear descriptions of polytopes. As an example, we give here the F-Vectors of for the cases . For , the F-Vector is too complicated to be computed by the brute force approach of polymake. However, waiting sufficiently long, one can get the 6800 facet defining inequalities of and the 3835488 facets of .

Example 8.

In Tables 1 and 2, we give the F-Vectors of for , computed using polymake. The rows label the dimension of the faces, the columns the value of . The reader might wonder about the fact that the face lattices of are, up to a certain dimension, isomorphic to the face lattice of the simplex. This property, commonly called neighborliness, follows from a general result in [Kah08]. The last row indicates whether the polytope is simple or not.

dim \ order      1      2      3
0                8      8      8
1               12     28     28
2                6     56     56
3                1     68     70
4                -     48     56
5                -     16     28
6                -      1      8
7                -      -      1
sum             27    225    255
simple           y      n      y
Table 1. Face structure of
dim \ order      1       2       3       4
0               16      16      16      16
1               32     120     120     120
2               24     560     560     560
3                8    1780    1820    1820
4                1    3872    4368    4368
5                -    5592    8008    8008
6                -    5060   11440   11440
7                -    2600   12868   12870
8                -     640   11424   11440
9                -      56    7952    8008
10               -       1    4256    4368
11               -       -    1680    1820
12               -       -     448     560
13               -       -      64     120
14               -       -       1      16
15               -       -       -       1
sum             81   20297   65025   65535
simple           y       n       n       y
Table 2. Face structure of

In the following, we will list elementary properties of that follow easily from the definition.

  1. is the -cube.

  2. is the -dimensional simplex.

  3. every has dimension .

  4. every has vertices.

  5. is a vertex.

  6. every is a projection of the -dimensional simplex along coordinate axes.

  7. For every , there is a projection along coordinate axes that projects it to the -cube .

Remark.

In [HS02] it was remarked that has exactly facets. The extreme points of these facets are also known. A set defines a face if and only if it contains neither nor its complement. Note that the set and its complement are exactly the set of configurations with a fixed parity. As the vertices of have only one affine dependency, it is not difficult to prove this fact using the Gale transform. By the above is combinatorially isomorphic to the so called cyclic polytope [Zie94].

In the following, we develop the connection to coding theory.

4. A Link to Coding Theory

We briefly recall the definition of a linear code. For a detailed introduction to coding theory see for instance [van99]. Consider the finite field $\mathbb{F}_2 = \{0, 1\}$ with addition and multiplication mod 2. In coding theory, one studies in particular vector spaces over this field.

Definition 9.

A binary $[n, k]$-linear code is a linear subspace $C$ of $\mathbb{F}_2^n$ such that $\dim C = k$. A generator matrix for $C$ is a $k$ by $n$ matrix which has as its rows a basis of $C$. Given $C$, one can find an equivalent code $C'$ such that it has a generator matrix in standard form, i.e. $G = (I_k \mid H)$ for some $k$ by $(n - k)$ matrix $H$, where $I_k$ is the $k$ by $k$ identity matrix. (Two codes are called equivalent if one can be transformed into the other by applying a permutation on the positions in the codewords and, for each position, a permutation of the symbols.)

The following proposition states that the vertices of form a linear code for any pre-hypergraph . A special case of this connection has been mentioned in Example 2 in [WJ03].

Proposition 10.

Let be considered as a vector space over the finite field . Then the image of under is a linear subspace. If we also consider as a vector space over , is an injective homomorphism between vector spaces. Its image forms an -linear code. A generator matrix in standard form has as its rows the vectors for , where is the -th unit vector in .

Proof.

Since scalar multiplication is trivial, we only need to show

(2)

Let , it suffices to show the identity for . To do so, introduce

Then , , and . We find that is the symmetric difference of and :

Since , we have that in

and therefore (2) holds. We now show that is injective. To see this, assume that . Since contains all atoms , we get for every : . This implies and, hence, . Since considered as an vector space has dimension , also has dimension and therefore forms an -linear code. ∎
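The statement of Proposition 10 can be checked numerically on a small pre-hypergraph. The sketch below assumes that the vector in (1) has the components $e_A(x) = \sum_{i \in A} x_i \bmod 2$, which is the parity convention that the symmetric-difference argument in the proof suggests; the function names are hypothetical and variables are indexed from 0.

```python
from itertools import product

def parity_vector(x, sets):
    """Assumed convention for (1): one parity bit per set A of the pre-hypergraph."""
    return tuple(sum(x[i] for i in A) % 2 for A in sets)

def xor(u, v):
    """Componentwise addition modulo 2."""
    return tuple((a + b) % 2 for a, b in zip(u, v))

def codewords(n, sets):
    """Map every configuration x in {0,1}^n to its parity vector."""
    return {x: parity_vector(x, sets) for x in product((0, 1), repeat=n)}

# A pre-hypergraph on three variables that is not inclusion complete:
# the atoms (listed first) together with the set {0, 1, 2}.
sets = [(0,), (1,), (2,), (0, 1, 2)]
code = codewords(3, sets)

is_homomorphism = all(code[xor(x, y)] == xor(code[x], code[y])
                      for x in code for y in code)
is_injective = len(set(code.values())) == len(code)
generator_rows = [code[(1, 0, 0)], code[(0, 1, 0)], code[(0, 0, 1)]]
```

Because the atoms are listed first, `generator_rows` starts with the identity matrix, so the generator matrix obtained this way is in standard form.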

Remark.

To write down the generator matrix, one has to impose a numbering on the elements in . If the numbering is in such a way that for , then the generator matrix is in standard form.

Remark.

An important property of a linear code is its distance, which is defined as the minimal Hamming distance between different elements of the code. For the hierarchical model of the hypergraph of interaction order $k$ on $N$ binary variables, the distance of the code is given by $d = \sum_{j=0}^{k-1} \binom{N-1}{j}$.

Proof.

Let denote the Hamming distance of . If , then equals the number of subsets of which contain a given element and have cardinality at most . ∎
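The counting argument in this proof can be compared with a brute-force computation. The sketch below assumes that the hypergraph of interaction order k on n binary variables consists of all non-empty subsets of cardinality at most k, and that codewords are parity vectors $e_A(x) = \sum_{i \in A} x_i \bmod 2$; the names `min_distance` and `counting_formula` are hypothetical.

```python
from itertools import combinations, product
from math import comb

def min_distance(n, k):
    """Minimum Hamming weight of a non-zero codeword. For a linear code this
    equals the minimum distance between distinct codewords."""
    sets = [A for r in range(1, k + 1) for A in combinations(range(n), r)]
    weights = [sum(sum(x[i] for i in A) % 2 for A in sets)
               for x in product((0, 1), repeat=n) if any(x)]
    return min(weights)

def counting_formula(n, k):
    """Number of subsets of cardinality at most k that contain a fixed element."""
    return sum(comb(n - 1, j) for j in range(k))

agreement = all(min_distance(n, k) == counting_formula(n, k)
                for n in range(1, 6) for k in range(1, n + 1))
```

For k = 1 the count is 1, the distance of the code whose polytope is the cube; for k = n it is 2^(n-1), the distance of the code whose polytope is the simplex.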

In the following, we will elaborate on the opposite direction. Let . Assume we are given an linear code. Without loss of generality, we assume that it has a generator matrix in standard form. We will construct a pre-hypergraph from the columns of the generator matrix. Since is a set, while the columns are a list, repetitions of columns will be lost. If one considers only non-repetitive codes, then our construction is injective, and the codewords are given by the vertices of .

Let denote the identity matrix in dimension . Assume the generator matrix has no two identical columns. (This implies .) Denote by the canonical basis of . Using the columns of , we define sets

and then,

Note that the elements of are numbered in a natural way such that we can use as an index set for the columns of .

To see that is the set of rows of the generator matrix, we evaluate

which holds by definition of the .

Summarizing, every binary linear code (in standard form) corresponds to a pre-hypergraph. However, two codes that differ only in repetitions of columns in the generator matrix will be mapped to the same pre-hypergraph. Then, if it is a hypergraph, the linear code is the linear code of a hierarchical model.
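The passage from a generator matrix to a pre-hypergraph can be sketched as follows, assuming that the set attached to a column is the set of row indices in which that column has a 1. All function names are hypothetical, and rows are indexed from 0.

```python
from itertools import combinations

def prehypergraph_from_generator(G):
    """G is a k x n 0/1 matrix (list of rows) in standard form with pairwise
    distinct columns. Each column is turned into the set of its non-zero rows."""
    k, n = len(G), len(G[0])
    columns = [tuple(G[i][j] for i in range(k)) for j in range(n)]
    if len(set(columns)) != n:
        raise ValueError("repeated columns would be lost in the set of sets")
    return [frozenset(i for i in range(k) if col[i]) for col in columns]

def generator_from_prehypergraph(k, sets):
    """Row i is the parity vector of the i-th unit vector: 1 exactly where i lies in A."""
    return [[1 if i in A else 0 for A in sets] for i in range(k)]

def is_inclusion_complete(sets):
    """Check the hypergraph condition: non-empty subsets of members are members."""
    family = set(sets)
    return all(frozenset(B) in family
               for A in family
               for r in range(1, len(A))
               for B in combinations(sorted(A), r))

G = [[1, 0, 0, 1, 1],
     [0, 1, 0, 1, 0],
     [0, 0, 1, 0, 1]]
sets = prehypergraph_from_generator(G)
round_trip = generator_from_prehypergraph(3, sets)

G2 = [[1, 0, 0, 1],
      [0, 1, 0, 1],
      [0, 0, 1, 1]]
sets2 = prehypergraph_from_generator(G2)
```

In this example `sets` is inclusion complete, so the code comes from a hierarchical model, while `sets2` contains {0, 1, 2} without its two-element subsets and is only a pre-hypergraph.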

5. Classification

As we have seen, the polytopes are full dimensional polytopes such that the vertices form a linear code. In this last section, we classify all polytopes with this property. Then we investigate which of them can be realized as polytopes of hierarchical models. For a convex polytope , let denote the vertex set of . For , put

Hence is an Abelian group that is canonically isomorphic to . We consider as a subset of and write “” whenever we mean addition modulo 2, while “” means ordinary addition in .

In the following, we develop an algorithm that determines - by induction for every - all polytopes with satisfying the following conditions:

  (I) is a subgroup of .

  (II) has dimension .

Note that the number of vertices of such a polytope is a power of two. Of course, the full -dimensional cube satisfies (I) and (II). To start the induction, we remark that there are no further such polytopes in the cases and . For , the 3-dimensional regular simplex with

satisfies (I) and (II), too.

More generally, by [Wen06, Theorem 2.2], we have the following

Proposition 11.

For , the following statements are equivalent:

  1. contains some subgroup such that is a regular simplex of dimension .

  2. is some power of 2.

In the case , the full 3-cube as well as the regular simplex mentioned above are the only polytopes satisfying conditions (I) and (II). Note that also

determines a subgroup of ; however, the convex closure has dimension 2.
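For small n the two conditions can be tested by exhaustive search. The sketch below encodes 0/1 vectors as bitmasks, enumerates all subgroups of {0,1}^n under addition modulo 2 as spans of at most n generators, and counts those whose convex hull is full dimensional. Since the origin is always a vertex, the dimension of the convex hull equals the rank of the vertex set over the rationals. The function names are hypothetical.

```python
from fractions import Fraction
from itertools import combinations

def span_f2(generators):
    """Subgroup generated by the given bitmask vectors under XOR."""
    group = {0}
    for g in generators:
        group |= {h ^ g for h in group}
    return frozenset(group)

def rational_rank(vectors, n):
    """Rank over the rationals of 0/1 vectors given as bitmasks of length n."""
    rows = [[Fraction((v >> i) & 1) for i in range(n)] for v in vectors]
    rank = 0
    for col in range(n):
        pivot = next((r for r in range(rank, len(rows)) if rows[r][col] != 0), None)
        if pivot is None:
            continue
        rows[rank], rows[pivot] = rows[pivot], rows[rank]
        for r in range(len(rows)):
            if r != rank and rows[r][col] != 0:
                factor = rows[r][col] / rows[rank][col]
                rows[r] = [a - factor * b for a, b in zip(rows[r], rows[rank])]
        rank += 1
    return rank

def count_full_dimensional_subgroups(n):
    """Number of subgroups of {0,1}^n whose convex hull has dimension n."""
    nonzero = range(1, 2 ** n)
    subgroups = {span_f2(gens)
                 for r in range(1, n + 1)
                 for gens in combinations(nonzero, r)}
    return sum(1 for S in subgroups if rational_rank(sorted(S), n) == n)

counts = [count_full_dimensional_subgroups(n) for n in (1, 2, 3, 4)]
```

For n = 3 the search should find exactly the cube and the regular simplex, in line with the discussion above.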

For fixed , define the bijections and by

For put

Moreover, let denote the center of the -cube .

To determine recursively all 0/1-polytopes that fulfill (I) and (II), we prove first the following

Proposition 12.

Suppose that and that is a 0/1-polytope satisfying (I) and (II). Assume that is a subgroup of with . Then the following statements are equivalent:

  (i) The polytope , given by

    (3)

    has dimension .

  (ii) There does not exist some index with such that . In other words, none of the affine hyperplanes separates and .

  (iii) One has .

  (iv) One has .

Proof.

(i) ⇒ (ii):
Suppose that holds for some with . Put

Since has dimension and since , we must have whenever . This means that - and hence also - is contained in the -dimensional hyperplane , in contradiction to .

(ii) ⇒ (iii):
For , let denote the linear map given by . By assumption, is surjective for . Hence we have

This means that

where the sum is taken in .

Now fix . Since , we get also

and hence

(iii) ⇒ (iv) is trivial.

(iv) ⇒ (i):
Consider the projection given by

Suppose that the assertion is wrong; hence is contained in some -dimensional -homogeneous- hyperplane . Since

the polytope has the same dimension as , that is . Thus, the restriction is a linear isomorphism from onto , and there exists some -linear map satisfying

By definition of , this means:

Hence, and are linearly separated by the affine hyperplane

By (iv) this is impossible. ∎

Example 13.

We investigate the statement of Proposition 12 for polytopes corresponding to a pre-hypergraph . We start by considering the matrix which has as its rows the vectors , where . The rows are labeled by the binary strings of length , that is by , while the columns are indexed by the non-empty subsets of . Therefore the rows of this matrix are the coordinates of the vertices of the simplex :

We note the following facts:

  • The columns of this matrix are exactly the non-zero binary strings of length .

  • There are subgroups of index 2 of the -cube, which correspond to the columns of the matrix. To define them let a column be fixed, then put . The maps are exactly the surjective homomorphisms having the nontrivial subgroups as their kernels.

  • The vertices of every polytope are given by deleting columns from this matrix that correspond to sets not in .

  • In particular, by restriction to the first columns, we get the vertices of the -cube .

Now, assume that is the -cube. We choose a column of the matrix, corresponding to a subgroup of index 2. There are two possibilities. If we choose a column corresponding to an atom, then is wrong, the dimension does not grow when adding this column to the coordinates (as we have doubled a coordinate). If, on the other hand, we choose a column corresponding to a set with cardinality two or more, then we are in the situation of Proposition 12, since holds. The lift (3) will be full dimensional, and its vertices are given by the submatrix with columns . Continuing from here, choosing another subgroup, the dimension will grow if and only if it does not correspond to one of the sets . Iteratively, the choices narrow down and, finally, when all columns have been chosen, the polytope is a simplex.

We will now formalize this procedure. For a fixed polytope as in Proposition 12, put

Clearly, conditions (I) and (II) imply that holds whenever .

Based on the equivalence of (i) and (ii) in Proposition 12, we are now able to prove that the following algorithm yields recursively all 0/1-polytopes satisfying (I) and (II).

Algorithm 14.

Initialization for :

  • .

Step : Based on construct a new set consisting of all polytopes such that there exists with

  • or

  • with

    (4)

    where runs through all subgroups of with and for .

Remark.

Note that in the case , the number of vertices is doubled, while in the other cases the number of vertices of equals the number of vertices of . Furthermore, it is interesting to see that the two possible operations commute in the following sense. Starting from some cube , lifting it to and then choosing a subgroup to apply the lift (4) gives the same polytope as choosing the subgroup from and then taking the prism over the lifted polytope, where is the canonical projection. Therefore, all polytopes that are constructed by the algorithm can be thought of as lifted cubes .
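The two operations of Algorithm 14 can be sketched on vertex sets. The code below assumes that the lift (4) appends the coordinate 0 to the vertices in the chosen index-2 subgroup and the coordinate 1 to all other vertices, and that the excluded subgroups are those on which one of the existing coordinates vanishes. Vertices are tuples of 0s and 1s; all function names are hypothetical.

```python
from itertools import product

def prism(V):
    """Vertices of the prism: every vertex is extended by 0 and by 1."""
    return frozenset(v + (b,) for v in V for b in (0, 1))

def index_two_subgroups(V, dim):
    """Kernels of the non-zero homomorphisms from V to F_2. Every such
    homomorphism is x -> <a, x> mod 2 for some a in {0,1}^dim."""
    kernels = set()
    for a in product((0, 1), repeat=dim):
        H = frozenset(v for v in V if sum(x * y for x, y in zip(a, v)) % 2 == 0)
        if len(H) < len(V):          # proper kernel, hence of index 2
            kernels.add(H)
    return kernels

def lifts(V, dim):
    """All lifts of V by one coordinate that increase the dimension."""
    forbidden = {frozenset(v for v in V if v[i] == 0) for i in range(dim)}
    for H in index_two_subgroups(V, dim) - forbidden:
        yield frozenset(v + ((0,) if v in H else (1,)) for v in V)

def run_algorithm(n_max):
    """Return the list of polytope sets for dimensions 1, ..., n_max."""
    level = {frozenset({(0,), (1,)})}    # dimension 1: the unit interval
    levels = [level]
    for dim in range(1, n_max):
        next_level = set()
        for V in level:
            next_level.add(prism(V))
            next_level.update(lifts(V, dim))
        level = next_level
        levels.append(level)
    return levels

levels = run_algorithm(5)
sizes = [len(level) for level in levels]
```

In dimension 3 the procedure produces the cube as the prism over the square and the regular simplex as the only admissible lift of the square.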

The classification will be complete with:

Theorem 15.

For all , the set in Algorithm 14 consists of all -dimensional 0/1 polytopes that satisfy conditions (I) and (II).

Proof.

First we show that all polytopes satisfy conditions (I) and (II), with replaced by . This is clear in the case of the prism .

If satisfies (4), then clearly is a subgroup of , because is a subgroup of with . Moreover, (ii) ⇒ (i) in Proposition 12 implies that has dimension , because for . Hence, satisfies conditions (I) and (II).

Vice versa, assume that fulfills (I) and (II). Consider again the projection onto the first coordinates, and put . Since has dimension , has dimension . If is not injective, then is the prism , because is a subgroup of . If is injective, put

Then is a subgroup of with , because has dimension . Moreover, equation (4) holds for as just defined. Finally, Proposition 12, (i) ⇒ (ii) shows that for . Hence, our algorithm includes the determination of . ∎

As a first application of Theorem 15 we can count the number of -dimensional polytopes that satisfy conditions (I) and (II). Let . For , let denote the number of all 0/1 polytopes with that satisfy (I) and (II). Then one has obviously

(5)

We have for , because a polytope with at most vertices cannot have dimension . Furthermore, we have clearly for all .

As mentioned already in Example 13, a 0/1-polytope that satisfies (I), (II), and has among its vertices exactly subgroups of index 2. Hence by ignoring the groups for , we get

Corollary.

For one has

The first few values are given in Table 3.

n \ k    1    2    3      4      5     6    7    8      sum
1        1                                                1
2        0    1                                           1
3        0    1    1                                      2
4        0    0    5      1                               6
5        0    0   15     16      1                       32
6        0    0   30    175     42     1                248
7        0    0   30   1605   1225    99    1          2960
8        0    0    0  12870  31005  6769  219    1    50864
Table 3. The number of -dimensional 0/1 polytopes with vertices that form a group.
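The entries of Table 3 can be reproduced by a short recursion. The sketch below assumes that the count c(n, k) of n-dimensional polytopes with 2^k vertices satisfies c(n, k) = c(n-1, k-1) + (2^k - n) c(n-1, k): the first term accounts for prisms, the second for the 2^k - 1 subgroups of index 2 minus the n - 1 excluded ones. This reading is inferred from Algorithm 14 and the subgroup count mentioned above; the name `count` is hypothetical.

```python
from functools import lru_cache

@lru_cache(maxsize=None)
def count(n, k):
    """Assumed recursion for the number of n-dimensional 0/1 polytopes with
    2**k vertices whose vertex set is a group."""
    if n == 1:
        return 1 if k == 1 else 0     # only the unit interval
    if k < 1 or k > n:
        return 0                      # too few or too many vertices
    return count(n - 1, k - 1) + (2 ** k - n) * count(n - 1, k)

rows = {n: [count(n, k) for k in range(1, n + 1)] for n in range(1, 9)}
totals = [sum(rows[n]) for n in range(1, 9)]
```

Under this assumption `rows[8]` and `totals` agree with the last row and the last column of Table 3.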

It is easy to compute this number also for larger values of . For instance

Finally, using the Corollary we can show that, among the full dimensional 0/1-polytopes with vertices the convex hulls of linear codes are exceptional. For , let denote the number of all 0/1 polytopes with vertices satisfying only condition (II). Hence, the number of all 0/1 polytopes of dimension trivially satisfies

(6)

Moreover, we get

Proposition 16.
  1. For , one has

    (7)
  2. We have

Proof.
  1. Suppose that is a proper subgroup of with and .

    If is another subgroup of with , then we have

    and, hence,

    This means

    (8)

    There are subsets of with and ; namely, these are all sets of the form

    (9)

    For as in (9), we get , because otherwise, would be contained in a -unique- hyperplane with , a contradiction to . Together with (8), we obtain the first inequality in (7). The second one is trivial in view of .

  2. By (5), (6), and (7) we get for :

    This proves the second statement. ∎

As a concluding remark, we study the question of constructing a statistical model from a given polytope. Assume