Phylogenetic complexity of the Kimura 3-parameter model

Mateusz Michałek and Emanuele Ventura
Abstract

In algebraic statistics, the Kimura 3-parameter model is one of the most interesting and classical phylogenetic models. We prove that the ideals associated to this model are generated in degree four, confirming a conjecture by Sturmfels and Sullivant.

2010 Mathematics Subject Classification. Primary 52B20, Secondary 14M25, 13P25

1 Introduction

The part of computational biology that models evolution and describes mutations in this process is called phylogenetics [40]. This is a fertile subject witnessing many connections to several parts of mathematics such as algebraic geometry [8, 23], combinatorics [4, 15, 34], and representation theory [9, 31]. The methods used in this context of research are powerful and do not only apply to biology, but are employed in several other fields [2] such as modeling changes of words in languages [21], literary studies [3] or linguistics itself [37] with ideas going back to Darwin [14].
A crucial object in phylogenetics is a tree model, which is a parametric family of probability distributions. It consists of a tree , a finite set of states and a family of transition matrices, usually given by a linear subspace of all matrices. The case of particular interest is when , where the basis elements correspond to the four nucleobases of DNA: adenine (A), cytosine (C), guanine (G), and thymine (T).
The models for which is a proper subspace of matrices reflect some symmetries among elements of . These symmetries are usually encoded by the action of a finite group on . In these terms, can be regarded as the space of -invariant matrices or tensors. Such models constitute a class of interest and they are called equivariant [18]. If is the trivial group, we obtain the general Markov model, corresponding, on the algebraic geometry side, to secant varieties of Segre products. When the elements of can be identified with those of , the model is called group-based. Henceforth we assume to be abelian.
The simplest among the equivariant, and group-based, models is the Cavender-Farris-Neyman model. This is the instance for , the group with two elements. A good understanding of this model from the algebraic geometry point of view has led to tremendous advances in this field. Sturmfels and Sullivant [41, Theorem 28] showed that the algebraic varieties arising from it are defined by quadrics. Additionally, Buczyńska and Wiśniewski described many of its remarkable algebro-geometric properties [8]. Consequently, Sturmfels and Xu [44], and Manon [32] described the connections of the model to toric degenerations of moduli spaces of rank two vector bundles on marked curves of fixed genus. For more relations to conformal field theory, we refer to [29, 31].
The Cavender-Farris-Neyman model is the simplest among the hyperbinary models [6, Section 3], that are given by . The most biologically meaningful example of those is the Kimura 3-parameter model; this corresponds to . In this case, , and, moreover, the action of reflects the pairing between purines (A,G) and pyrimidines (C,T). This model was introduced by Kimura [28] long before the setting above was developed. Using numerical experiments, Sturmfels and Sullivant conjectured that the ideals of the algebraic varieties associated to this model are generated by polynomials of degree at most four [41, Conjecture 30]. The confirmation of this conjecture is the main result of the present article. For any group , Sturmfels and Sullivant defined the phylogenetic complexity of .

Definition 1.1 (Phylogenetic complexity [41]).

Let be the star with leaves, and the variety associated to the group-based model. Let be the maximal degree of a generator in a minimal generating set of the ideal . The phylogenetic complexity of is .

In [35], it was shown that for any abelian group , its phylogenetic complexity is finite. The main contribution of this article is a more detailed study of the phylogenetic complexity of .

Main Theorem.

The phylogenetic complexity of the Kimura 3-parameter model equals four.

For more interesting results on the Kimura 3-parameter model we refer to [9, 10, 11, 30].

Algebraic varieties associated to a model.

We recall the explicit construction of the algebraic variety associated to a model. It is the Zariski closure of the locus of all probability distributions on the states of leaves allowed in the model.
A representation of a model on a tree is an association of transition matrices to edges of . The set of all representations is denoted by . (Here we do not mention the root distribution, since it does not affect the family of probability distributions we obtain.) To each vertex of we associate an dimensional vector space with basis . We may regard an element of associated to an edge as an element of the tensor product . We fix a representation and an association . Here is the set of leaves, i.e. vertices of degree one, of . Following the usual Markov rule, we may compute the probability of :

where is the set of vertices of . We may identify with a basis element of . This provides the map:

The image of this map is the family of probability distributions described by the model and its Zariski closure is the algebraic variety that represents the model. For group-based models, we denote this variety , where is the group defining the model and is the tree as above.
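The Markov rule above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `leaf_probability`, the dictionary encoding of the tree, and the two-state matrix `M` are hypothetical and not taken from the original text. Following the convention above, no root distribution is used, so the resulting values are unnormalized.

```python
from itertools import product

def leaf_probability(edge_matrices, inner_vertices, leaf_states, n_states):
    """Sum, over all states of the inner vertices, of the product of the
    transition-matrix entries along the edges (no root distribution)."""
    total = 0.0
    for inner in product(range(n_states), repeat=len(inner_vertices)):
        sigma = dict(leaf_states)                 # states fixed at the leaves
        sigma.update(zip(inner_vertices, inner))  # states chosen at inner vertices
        term = 1.0
        for (u, v), matrix in edge_matrices.items():
            term *= matrix[sigma[u]][sigma[v]]
        total += term
    return total

# Hypothetical example: a star with centre "c" and leaves 1, 2, 3, two states.
M = [[0.9, 0.1], [0.1, 0.9]]
STAR = {("c", 1): M, ("c", 2): M, ("c", 3): M}
p000 = leaf_probability(STAR, ["c"], {1: 0, 2: 0, 3: 0}, 2)
```

Since the root distribution is omitted, summing this quantity over all leaf assignments gives the number of states (here 2) rather than 1; rescaling by a uniform root distribution would normalize it.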

Earlier contributions.

Our proof of the main theorem relies on previous results by many authors that we now recall.
The first fundamental tool is the Discrete Fourier Transform (DFT). This is a linear change of coordinates, based on the representation theory of . For special cases in phylogenetics, it was first used by Hendy and Penny [26], and by Erdős, Székely, and Steel [42]. In higher generality, it is treated in [33, 41]. For group-based models, the DFT turns into a monomial map, proving that the associated algebraic variety is a toric variety. This translates the classical algebraic problem of finding defining equations of a variety into a combinatorial one. For more information about toric methods we refer to [12, 25, 43].
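The monomial behaviour of the DFT can be checked numerically on a small star. The sketch below makes several assumptions that are not in the original text: the group is taken to be Z2 x Z2 encoded as the integers 0..3 with bitwise XOR as the group law, the root distribution is uniform, and the edge parameters `F` are arbitrary illustrative values. Under these assumptions, the Fourier coordinates of the leaf distribution vanish unless the indexing characters sum to the trivial one, and otherwise equal a product of per-edge Fourier parameters.

```python
from itertools import product
from math import prod

G = range(4)  # assumed encoding of Z2 x Z2: integers 0..3, group law = XOR

def chi(c, g):
    """Character of Z2 x Z2 indexed by c, evaluated at g."""
    return -1 if bin(c & g).count("1") % 2 else 1

def star_distribution(f):
    """Leaf distribution on a star with uniform root; f[i][d] is the
    probability that edge i changes the state by the group element d."""
    n = len(f)
    return {s: sum(0.25 * prod(f[i][s[i] ^ r] for i in range(n)) for r in G)
            for s in product(G, repeat=n)}

def fourier(p, n):
    """Discrete Fourier transform of a function p on G^n."""
    return {c: sum(prod(chi(c[i], s[i]) for i in range(n)) * p[s] for s in p)
            for c in product(G, repeat=n)}

# Hypothetical edge parameters for a star with three leaves.
F = [[0.7, 0.1, 0.15, 0.05], [0.6, 0.2, 0.1, 0.1], [0.8, 0.05, 0.05, 0.1]]
# Per-edge Fourier parameters.
F_HAT = [[sum(chi(c, d) * fi[d] for d in G) for c in G] for fi in F]
Q = fourier(star_distribution(F), 3)
```

Each coordinate `Q[c]` can then be compared with `F_HAT[0][c[0]] * F_HAT[1][c[1]] * F_HAT[2][c[2]]` when `c[0] ^ c[1] ^ c[2] == 0`, and with zero otherwise.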
Another key result is the reduction from arbitrary trees to the so-called stars or claw-trees , i.e., trees with one inner vertex and leaves. The general procedure for group-based models to obtain ideals arising from arbitrary trees, knowing the ideals for , was discovered in [41]. Again, this turned out to be very influential, leading, on one hand, to the general constructions of toric fiber products [31, 45], and, on the other, to generalizations for equivariant models [18].
Combinatorial and computational methods in toric geometry are very well developed. As a starting point in our article we need to compute algebraic invariants of toric varieties embedded in very high dimensional ambient spaces. Here the computer algebra packages Normaliz [7], 4ti2 [47], along with previous computational results from [16] and [41] are used. In particular, Castelnuovo-Mumford regularity plays a crucial role in the proof for . These classical invariants are briefly discussed in Appendix 4, for the sake of completeness.
This work may also be seen in the framework of the stabilisation of equations of a family of algebraic varieties. Indeed, our proof not only bounds the degrees of the generators, but in principle provides an inductive procedure to obtain all generators in case of , assuming the generators for to be known. Finding equations of an infinite sequence of algebraic varieties that come naturally in families is an interesting current theme of research. This usually involves classical varieties such as secants of Segre varieties [19] and Grassmannians [20]. Indeed, the main result of Draisma and Eggermont in [17] shows that for equivariant models the associated algebraic variety can always be defined set-theoretically in some bounded degree, once and are both fixed. The fact that is finite constitutes the main result of [35]. Recently, another ideal-theoretic result was proved by Sam [38] showing that the ideal of th secant variety of th Veronese embeddings is generated in bounded degree that is independent of . Interestingly, the ideal-theoretic generation in bounded degree for secants of Segre varieties and Grassmannians is still a central open problem. Finiteness issues are strongly connected with the theory of twisted commutative algebras and -modules by Sam and Snowden [39], and the theory of noetherianity by Draisma and Kuttler [19], Hillar and Sullivant [27], and others.
Apart from beautiful existence results, which are quite often non-constructive or very far from optimal, it is of interest to find an explicit description of phylogenetic algebraic varieties. One of the most well-known examples is the salmon conjecture [1], so named because the prize offered by Allman for the hypothetical solver would be a smoked Copper River salmon. It asks for the description of , the algebraic variety representing the general Markov model for and . The generators of the ideal are still unknown; however, a set-theoretic description was found by Friedland and Gross [24]. More recently, Daleo and Hauenstein [13] gave a numerical proof of the salmon conjecture.
As far as we know, our result is the only ideal-theoretic description, apart from the Jukes-Cantor model, where and is an arbitrary tree.

Plan of the article.

The whole article is devoted solely to the proof of the main theorem. In Section 2 we introduce the notation that is used throughout the proof. As the proof consists of several parts, some of them very technical, we present the overview of its structure in Section 3.1. The main result is established in Sections 3.2 and 3.3.

2 Preliminaries and notation

In this section we collect all the notation and terminology we will use in the rest of the paper. We divide this section into paragraphs to facilitate the reading.

Groups.
Henceforth we set , unless otherwise stated. We denote the elements of by , and . To denote unknown elements of , we use letters We also refer to an unknown element that is not relevant in a specific argument with a question mark “?”.
Apart from , the most natural groups that enter the picture are the symmetric group on leaves , the group of flows , and the automorphism group . The group of flows is the following.

Definition 2.1 (Group of flows).

Let be an abelian group and . The set of flows of length of forms a group under the componentwise group operation. It is non-canonically isomorphic to the group , the direct product of copies of .
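As a small illustration, the flows of a given length can be enumerated directly. The sketch assumes the standard definition (a flow of length n is an n-tuple of group elements summing to the neutral element) and, as an encoding choice that is not from the original text, represents Z2 x Z2 by the integers 0..3 with bitwise XOR as the group law. The function names are hypothetical.

```python
from functools import reduce
from itertools import product
from operator import xor

def flows(n):
    """All n-tuples over Z2 x Z2 (encoded as 0..3, XOR) summing to 0."""
    return [t for t in product(range(4), repeat=n) if reduce(xor, t, 0) == 0]

def add(f, g):
    """Componentwise group operation on two tuples of the same length."""
    return tuple(a ^ b for a, b in zip(f, g))

FLOWS_3 = flows(3)
FLOWS_4 = flows(4)
```

The counts `len(FLOWS_3)` and `len(FLOWS_4)` are 4^2 and 4^3, in agreement with the isomorphism with the direct product of n-1 copies of the group: the first n-1 entries of a flow are free and determine the last one.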

The automorphism group of , , is the group of bijective group homomorphisms from to itself. The automorphism of specified by is simply denoted by ; similarly for all the other automorphisms of having a non-trivial fixed element.

The toric variety .
For any abelian group , the variety is a projective toric variety of dimension living in , where the projective coordinates are in bijection with flows [33].
Let us recall here its corresponding polytope. Let be the lattice whose basis corresponds to the elements of . Consider with the basis indexed by pairs . We define a map of sets from the group of flows to the lattice, , by . The vertices of the polytope of are the images of the flows under the injective map .
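The vertices of the polytope can be listed explicitly for small cases. The following sketch assumes that a flow is an n-tuple summing to the neutral element, encodes Z2 x Z2 as 0..3 with XOR (an encoding chosen here for illustration), and orders the lattice basis indexed by pairs (leaf, group element) so that the pair (i, g) sits at position 4*i + g. The names `vertex` and `VERTICES` are hypothetical.

```python
from functools import reduce
from itertools import product
from operator import xor

def flows(n):
    """All n-tuples over Z2 x Z2 (encoded as 0..3, XOR) summing to 0."""
    return [t for t in product(range(4), repeat=n) if reduce(xor, t, 0) == 0]

def vertex(flow, group_order=4):
    """0/1 vector with a 1 at the position of each pair (i, flow[i])."""
    v = [0] * (len(flow) * group_order)
    for i, g in enumerate(flow):
        v[i * group_order + g] = 1
    return tuple(v)

# Vertices of the polytope for the star with three leaves.
VERTICES = [vertex(f) for f in flows(3)]
```

Distinct flows give distinct vertices, and every vertex has coordinate sum equal to the number of leaves, so all of them lie on a common affine hyperplane.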

Remark 2.2.

The family of varieties has a wealth of symmetries; the group , the group of flows , and the automorphism group all act on the ideals of these varieties.

Binomials, tables, and moves.
Ideals of toric varieties are binomial prime ideals. Thus they admit a minimal generating set of binomials. Binomials may be identified with a pair of tables of the same size, and , of elements of , regarded up to row permutation; this is another natural group in this setting which we implicitly take into account. Indeed, a binomial is a pair of monomials and the variables correspond to rows. Given the number of leaves , coordinates are in bijection with flows of length of . Hence rows are identified with flows of elements in . Columns are in bijection with the leaves. From the definition of the toric ideals [41], it follows that a binomial belongs to if and only if the two tables representing it are compatible, i.e., for each , the th column of and the th column of are equal as multisets. We index the columns of a given pair of tables , with columns, by integers . We refer to the element in the th column of row as .
Let be any table of elements of . The procedure consisting of selecting a subset of rows in of cardinality at most , and replacing it with a compatible set of rows is a move of degree . A binomial, represented by a pair of tables of elements of , is generated by binomials of degree at most if and only if there exists a finite sequence of moves of degree applied to or that transform into .
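Compatibility of tables and the application of a move can be expressed concretely. In the sketch below, tables are lists of tuples, group elements are encoded as 0..3 with XOR as the group law (an assumption made for illustration), and the tables `T0` and `NEW_ROWS` are a hypothetical example, not the one of Example 2.3. Since rows are unordered, the result of a move is returned sorted.

```python
from collections import Counter

def compatible(t0, t1):
    """True if the tables agree column by column as multisets."""
    return [Counter(c) for c in zip(*t0)] == [Counter(c) for c in zip(*t1)]

def apply_move(table, row_indices, new_rows, degree):
    """Replace the selected rows by a compatible set of rows.
    Raises ValueError if this is not a valid move of the given degree."""
    selected = set(row_indices)
    old_rows = [table[i] for i in selected]
    if len(old_rows) > degree or not compatible(old_rows, new_rows):
        raise ValueError("not a valid move of this degree")
    kept = [row for i, row in enumerate(table) if i not in selected]
    return sorted(kept + list(new_rows))

# Hypothetical table of four flows of length three.
T0 = [(1, 1, 0), (2, 0, 2), (0, 3, 3), (3, 3, 0)]
# A cubic move on the first three rows.
NEW_ROWS = [(1, 3, 2), (2, 1, 3), (0, 0, 0)]
T1 = apply_move(T0, [0, 1, 2], NEW_ROWS, degree=3)
```

After the move, `T0` and `T1` are still compatible, which is exactly the condition for the corresponding binomial to lie in the ideal.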

Example 2.3.

Let be the table

The table can be transformed by a move of degree three into the table

Indeed, the set of the first three rows of is compatible with the set of the first three rows of . Note that if the rows in are flows, then the rows of are flows as well. The move described above is denoted by

Remark 2.4.

In the notation for moves, we do not use the indices of the columns involved in the move. Instead, the indices are always clear from the move itself. For instance, the move in Example 2.3 is in columns . Also, note that, in general, the columns used for a move do not need to be consecutive.

Remark 2.5.

The groups , the group of flows , and the automorphism group act on the equations of , and hence on the tables. The group acts permuting the columns of the pair of tables corresponding to a binomial in the ideal of the variety. The groups and act on the entries of the tables in the natural way, i.e., by evaluation.

We now introduce one of the most important concepts for our approach. Given a pair of flows, we define a distance between them, which will enable us to use an inductive procedure on tables. The distance we consider is the classical Hamming distance between two words.

Definition 2.6 (Hamming distance).

Let and be two flows in :

Let and . The multiset constitutes the disagreement string of the pair of flows and . The cardinality is the Hamming distance between and . The multiset constitutes their agreement string. Up to the action of the group of flows on both flows, we may assume that the group elements for all .
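The Hamming distance and the set of disagreeing columns are straightforward to compute. The sketch below uses the XOR encoding of Z2 x Z2 by 0..3 (an illustrative assumption) and hypothetical function names. It also illustrates that acting on both flows with the same flow does not change the Hamming distance, which is what makes normalizations by the group of flows harmless.

```python
def disagreement(f0, f1):
    """Indices of the columns in which the two flows differ."""
    return [i for i, (a, b) in enumerate(zip(f0, f1)) if a != b]

def hamming(f0, f1):
    """Hamming distance between two flows of the same length."""
    return len(disagreement(f0, f1))

def translate(f, g):
    """Act on f by the flow g, componentwise (XOR encoding)."""
    return tuple(a ^ b for a, b in zip(f, g))

# Hypothetical pair of flows of length four.
F0 = (1, 1, 0, 0)
F1 = (1, 3, 2, 0)
SHIFT = (1, 1, 0, 0)          # a flow used to normalize F0 to zero
F0_SHIFTED = translate(F0, SHIFT)
F1_SHIFTED = translate(F1, SHIFT)
```

Here the two flows disagree exactly in the second and third columns, before and after the translation.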

Remark 2.7 (Tables and Hamming distance).

Given a pair of tables , we “compare” them using the notion of Hamming distance as follows. Since the tables come with indistinguishable rows, we may choose as first rows of and two rows that minimize the Hamming distance among all the pairs of rows from and . After fixing the first row in and in , as described in Section 3.1, one of the techniques adopted in Sections 3.2 and 3.3 is as follows. With moves of degree at most four, we create another pair of rows with strictly smaller Hamming distance than the initial one.

Counting functions.
We will make use of counting functions on the tables and . A counting function on the columns of takes the same value as the same counting function on the columns of , since the pairs of tables we are interested in are compatible, i.e., columnwise they are the same as multisets. Given , we denote by the number of copies appearing in the columns in , or in .

Example 2.8.

The function counts the number of copies of in columns and minus two times the number of copies of in column .

From an algebraic point of view, a counting function defines a grading of the variables, that is a specialization of the multi-grading. Thus the fact that the counting function gives the same value on two tables is equivalent to the fact that the two corresponding monomials have the same degree with respect to the induced grading. Additionally, from the perspective of toric geometry, the counting function is induced by restricting the torus action to a special one-parameter subgroup.
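A counting function can be modelled as a dictionary of integer weights indexed by pairs (column, group element). The sketch below is an illustration under assumptions: the XOR encoding of the group by 0..3, the tables `T0` and `T1`, and the weights `W` are all hypothetical and are not those of Example 2.8.

```python
def count_row(row, weights):
    """Value of the counting function on a single row."""
    return sum(weights.get((col, g), 0) for col, g in enumerate(row))

def count(table, weights):
    """Value of the counting function on a whole table."""
    return sum(count_row(row, weights) for row in table)

# Two compatible hypothetical tables of flows of length three.
T0 = [(1, 1, 0), (2, 0, 2), (0, 3, 3)]
T1 = [(1, 3, 2), (2, 1, 3), (0, 0, 0)]

# Copies of 1 in columns 0 and 1, minus twice the copies of 3 in column 2.
W = {(0, 1): 1, (1, 1): 1, (2, 3): -2}

ROWS_T0 = [count_row(r, W) for r in T0]
ROWS_T1 = [count_row(r, W) for r in T1]
```

The totals on the two tables coincide, although the values on individual rows differ. This is the mechanism used repeatedly in the proofs: if the function is large on one row of a table, some other row must compensate.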

Group homomorphisms.
We will make use of group homomorphisms in order to do counting arguments in a given pair of tables. We denote

the group homomorphism given by the quotient map sending each element to its class modulo the subgroup generated by the element .

3 Complexity of the Kimura 3-parameter model

The aim of this section is to establish the phylogenetic complexity of the Kimura 3-parameter model. In Section 3.1, we discuss the structure of the proof, postponing the technical part of it to Sections 3.2 and 3.3.

3.1 Main result and structure of the proof

We now present our main result along with an outline of the proof strategy.

Theorem 3.1.

The phylogenetic complexity of the Kimura 3-parameter model equals four.

Figure 1: Matryoshka of the proof.

The structure of the proof is presented in Figure 1. Our proof is an induction on the number of leaves , i.e., the number of columns of the tables. The base of our induction is . The case of leaves has been studied computationally. More precisely, for the result is presented in [41] and for it is computed in [16]. For we used the program featured in [16] to produce the vertices of the polytope. The computer algebra program 4ti2 [47] specialized for toric ideals was able to compute the Markov basis using a server equipped with a CPU 4 Intel-Xeon E7-8837/32 cores/2.67GHz and a memory of 1024Gb RAM.

Proposition 3.2.

The ideal is minimally generated by polynomials of degree at most four: quadrics, cubics, and quartics.

The case is treated in Section 3.3.3. Methods similar to those of the general case and bounds on Castelnuovo-Mumford regularity obtained using Normaliz [7] allow us to reduce the problem to a computation handled with 4ti2. From the computational point of view, it is interesting to note that we were not able to address the case with computational tools alone. Based on our experiments with 4ti2, we expect the computation to be infeasible: it would run for several years on a server of the same capability as the one mentioned above, and a memory of 1Tb RAM would not be sufficient to finish the computation.
For , we have an induction on the degree of the generators, i.e., the number of rows of the table. Inside a specific degree , we have an induction on the Hamming distance of two rows of the tables. The strategy in this inner induction on the Hamming distance is the following. Suppose we have a binomial generator of degree . Hence, we have a pair of tables consisting of rows each and with columns. Two rows have Hamming distance and we reduce it to ; in other words, the given pair of tables is transformed into a pair of tables that have an identical row. This is a binomial which is a product of a binomial of degree and a variable. By induction on , such a binomial can be generated in degree at most .
Hence the aim of the induction on the Hamming distance is to reduce it to . In order to achieve this, we split the case into two separate propositions in Section 3.2; see Proposition 3.5 and Corollary 3.6, and Proposition 3.12. This reduces the proof to . Recall that there do not exist flows whose Hamming distance is , since they cannot disagree in only one entry.
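The observation that two flows cannot disagree in exactly one entry can be confirmed by brute force for small lengths. The sketch assumes that flows are tuples summing to the neutral element and uses the XOR encoding of Z2 x Z2 by 0..3, chosen here for illustration; the function names are hypothetical.

```python
from functools import reduce
from itertools import combinations, product
from operator import xor

def flows(n):
    """All n-tuples over Z2 x Z2 (encoded as 0..3, XOR) summing to 0."""
    return [t for t in product(range(4), repeat=n) if reduce(xor, t, 0) == 0]

def hamming(f0, f1):
    """Number of positions in which the two flows differ."""
    return sum(a != b for a, b in zip(f0, f1))

def min_distance(n):
    """Smallest Hamming distance between two distinct flows of length n."""
    return min(hamming(f, g) for f, g in combinations(flows(n), 2))
```

The reason is elementary: if two tuples differ in a single entry, their sums differ, so they cannot both sum to the neutral element.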
We now discuss the strategy in case , the technical heart of the proof, which is tackled in Section 3.3. In spite of many symmetries, discussed in Section 2, there are several cases one has to consider: we identify ten cases, indexed by Roman numerals, where the first two rows of the given pair of tables have a disagreement string of length . Here we provide a uniform proof for three crucial cases: Cases I, II, and III. As we prove them simultaneously with the very same techniques, we refer to those as the “main case”. The remaining cases are treated by reducing them to the main case.
For the proof in the main case, we look at the second rows of each of the tables and . Let denote the length of the disagreement string between those two, in columns not involving the first two. By Corollary 3.6, we are able to assume and, since , the length of the agreement string between the second row of and the second row of , outside columns and , is at least . Since the columns are indistinguishable up to the action of , we may assume that the columns and are involved in the agreement string. Now the aim is to reduce to the situation in which no row has two nonzero entries in the columns and : employing moves of degree at most four, we would like to eliminate all the strings which have nonzero entries on both columns and . We call such strings bad pairs.

Definition 3.3 (Bad pairs).

A bad pair is a string , where the elements are such that:

  1. they are both nonzero;

  2. is in column and is in column .

We now show that, after eliminating all the bad pairs, we fall back to the case of leaves, which allows us to conclude by the outermost induction.

Theorem 3.4.

Suppose that a pair of compatible tables with columns does not contain rows with bad pairs. Then the corresponding binomial is generated in degree at most .

Proof.

The assumption implies that for every row of and we have either or . Summing up the columns and , we obtain two tables and . The crucial observation is that and are compatible tables with columns. Hence they correspond to a binomial in . This binomial is generated in degree at most by definition. This implies that and can be transformed into each other by a finite sequence of moves of degree at most . Each of these moves lifts to the tables and , transforming all their columns accordingly, except columns and . Here the moves permute the pairs of elements, where each pair is formed by the two elements in columns and , in a fixed row. These moves transform into . The latter need not be the same though; indeed, they may differ in columns and . As in the proof of [35, Theorem 3.12], we make quadratic moves to adjust the elements in columns and . These transform into . Hence the tables are generated in degree at most . ∎
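The crucial observation in the proof, namely that summing two columns preserves compatibility when there are no bad pairs, can be illustrated on small tables. The sketch below makes assumptions not fixed by the text: the two columns in question are taken to be the last two, the group is encoded as 0..3 with XOR, and all four tables are hypothetical examples.

```python
from collections import Counter

def compatible(t0, t1):
    """True if the tables agree column by column as multisets."""
    return [Counter(c) for c in zip(*t0)] == [Counter(c) for c in zip(*t1)]

def has_bad_pair(row):
    """True if both entries in the last two columns are nonzero."""
    return row[-2] != 0 and row[-1] != 0

def merge_last_two(table):
    """Replace the last two columns by their sum (XOR encoding)."""
    return [row[:-2] + (row[-2] ^ row[-1],) for row in table]

# Compatible tables of flows without bad pairs.
GOOD0 = [(1, 1, 0, 0), (2, 0, 2, 0), (0, 3, 0, 3)]
GOOD1 = [(1, 3, 2, 0), (2, 1, 0, 3), (0, 0, 0, 0)]

# Compatible tables of flows where one row of BAD0 contains a bad pair.
BAD0 = [(0, 3, 1, 2), (2, 2, 0, 0)]
BAD1 = [(2, 3, 1, 0), (0, 2, 0, 2)]
```

For `GOOD0` and `GOOD1` the merged tables are again compatible tables of flows with one column less. For `BAD0` and `BAD1` the merged tables are no longer compatible, which shows why the bad pairs have to be eliminated first.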

Figure 2: Zoom in of Hamming distance step.

3.2 Reduction of Hamming distance 3

In this section, we start our reduction of the Hamming distance. More precisely, we assume the Hamming distance to be at least three and we prove that we can reduce it to two; the latter will be discussed in Section 3.3. We proceed by analyzing the cases when the disagreement string is given by at least four entries.

Proposition 3.5.

The disagreement strings (i) , (ii) , (iii) , and (iv) can be reduced.

Proof.

(i). Consider the function . By the action of the group of flows , we may assume that this counting function is nonpositive on both of the tables. Since the function is strictly positive in the first row of , there exists a row in where there are strictly more copies of than copies of in the columns . On the other hand, cannot contain in two of the columns , since we would exchange those with the corresponding entries in the first row and this would decrease the Hamming distance. Thus has one copy of and no copies of in columns . If the row has both copies of and , we would move the string to the first row of , reducing the Hamming distance. Whence we may assume that contains the string in columns . Notice that in columns of , there are no strings of the form or , otherwise quadratic moves would decrease the Hamming distance. Additionally, in columns there is no string of the form ; for this we can apply in the cubic move . Now, we introduce the counting function on . By the previous discussion about the possible strings in columns , this function is at least one in every row of . Consequently, there exists a row in where this function is three. As a consequence, the row contains either the string or . This would decrease the Hamming distance.

(ii). Consider the counting function . By the action of the group of flows , we may assume it is nonpositive on both of the tables. Since this function is strictly positive on the first row of , there exists a row in where the function is strictly negative. Note that on the row , one has ; otherwise we would make a quadratic move, involving and the first row of , reducing the Hamming distance.
If in the row we have , then , by the value of the counting function on . Hence in the row , there exists , which allows us to make a quadratic move reducing the Hamming distance. Without loss of generality, we have , and . Thus the row contains either the string or the string . In both cases, we exchange with the first row of and we act with the flow on producing , which is (i).

(iii). Consider the function . By the action of the group of flows , we may assume it is nonpositive on both of the tables. Therefore there exists a row in where the function is strictly positive. Note that on the row one has .
If in the row we have and , then we may assume contains the string in columns . We have , as otherwise in each of these circumstances we would make a quadratic move between and the first row of , reducing the Hamming distance. Then the function is zero on , which is not possible by assumption. Analogously, we may conclude when and .
If in the row we have , and , then . In this case we have , because of a quadratic move between and the first row of . Hence the row contains the string in columns , which again would reduce the Hamming distance.
If in the row we have , then either or . If , then in columns the row contains the string ; indeed we cannot have copies of or by quadratic moves with the first row of . This implies that the counting function is zero on the row , which is not possible by the assumption. If in the row we have and , then . In the row we can now exclude all the possible elements in each column by quadratic moves, obtaining the string in columns . We exchange this string with the first row of , reducing the Hamming distance. Analogously, if in the row we have and , we obtain in columns , and we conclude in the same way.

(iv). Consider the counting function . By the action of the group of flows , we may assume it is nonpositive on the tables. Therefore there exists a row in where the function is strictly negative. Thus on the row we have , as .
Suppose that in the row we have . Then and , by the assumption on the value of the counting function on . In two of the columns we cannot have or by quadratic moves, involving and the first row of . Thus we have a copy of ; we now make a quadratic move between and the first row of , which decreases the Hamming distance.
Suppose that in the row we have . If in the row we have , then . In columns we cannot have , as otherwise we would exchange the string with the first row of , thus reducing the Hamming distance. Whence contains the string in columns . If in the row we have , then . In this situation, by the same argument, contains the string (or or ). We claim that having the string can be reduced to the case of having the string up to quadratic moves and group automorphism. Indeed, suppose we have the string in the row . We exchange from with from the first row of in columns . We act with the flow on both tables and we transpose column and column . Now the row contains the string in columns .
By the previous discussion, it is enough to deal only with the string in . Consider the counting function . Note that this function has only odd values. We now show that the function cannot be positive on a row of . Indeed, assume there is a row where the function takes a positive value. Then the row contains either , or in columns . The first two cases are not possible, because we would exchange them with the string in the row ; this would produce or in the row , which we would exchange with in the first row of . We are left with the possibility of having in columns . For this we apply in the cubic move .
In conclusion, the counting function is strictly negative on every row of . Since the value of this function on the first row of is , there exists a row in on which the function is . Thus in we have either or in columns . In this case, we would exchange them with the first row of reducing the Hamming distance. ∎

Corollary 3.6.

Suppose that a table contains two rows and having disagreement string of cardinality four. Then, using moves of degree at most three, can be transformed in such a way that the disagreement string has cardinality at most three. Moreover, only the four columns of the disagreement string are involved in the reduction.

Proof.

Assume two rows and do not agree on four elements. Up to the action of the group of flows and , the elements of in the disagreement string can be set to be ; all the possibilities for the elements of in the disagreement string are , , , and . By Proposition 3.5, these disagreement strings can be reduced. Hence, performing the moves in the proof of Proposition 3.5, we transform the tables in such a way that the cardinality of the disagreement string is at most three. ∎

Now we deal with the disagreement string of length three, . We begin with preparatory lemmas.

Lemma 3.7.

Suppose that the disagreement string between and is , in columns . Then we may assume that there exists a row in containing the string in columns .

Proof.

We introduce the counting function . By the action of the group of flows , we may assume that the sum is nonnegative on . Then there exists a row in where the function is strictly positive.
If in the row we have , then , by the assumption on the counting function evaluated at . By the action of the group of flows , we may assume without loss of generality that contains the string in columns . Then by assumption. Also, , as otherwise we would exchange the string with in the first row of , reducing the Hamming distance between and . Hence . Similarly, and , as otherwise we exchange with in the first row of . Hence contains the string in columns , which we exchange with the first row of . ∎

Lemma 3.8.

We may assume that the row of Lemma 3.7 in contains the string in columns . More generally, for every row containing the string in columns , the nonzero element of in columns coincides with the corresponding entry of the first row of .

Proof.

The row contains a string with exactly two elements equal to in the columns and . By the action of , we may assume that contains the string in columns . Note that , as in both cases we make a quadratic move between and the first row of , reducing the Hamming distance between and . Thus . By the action of the group of flows , in every row containing the string in columns , the nonzero entry coincides with the corresponding entry of the first row of . ∎

Lemma 3.9.

Suppose that in we have a row containing . Then this is the only string that a row with in columns may contain.

Proof.

Since the row in contains , then it cannot contain another copy of , as we would exchange with the first row of , thus reducing the Hamming distance. Hence contains , since it is a flow. Assume there exists another row containing a string with , different from . By Lemma 3.8, the unique nonzero entry in columns of agrees with the corresponding entry of the first row of . Assume that contains in columns . Then we apply the cubic move , reducing the Hamming distance. For a row containing we conclude in the same way. ∎

Lemma 3.10.

As in the proof of Lemma 3.9, we assume that contains in columns . There exists a row in such that and, moreover, contains the string in columns .

Proof.

Such a row exists in by the compatibility of the two tables. The structure of is:

By Lemma 3.9, we have . Analogously, we have by applying Lemma 3.9, upon exchanging the string in the first row with in the second row.
Note that and , as otherwise, exchanging with the first row, in the first case with and in the second with , we would reduce the Hamming distance; analogously, and . Furthermore, by Lemma 3.8, we have as otherwise we would create the string and respectively. Analogously . Hence the only remaining possibility is and . ∎

Lemma 3.11.

The counting function is at most on every row of .

Proof.

For the sake of contradiction, suppose there exists a row in , where the counting function is nonnegative. In , there exists a row with . By Lemma 3.10, the row contains the string .
If in the row we have , then , again, by Lemma 3.10. Hence we have at least two differences with and we can make a quadratic move between and . This reduces the Hamming distance. Thus on the row one has .
If on , we have the following possibilities:

  (i) contains ;

  (ii) contains ;

  (iii) contains ;

  (iv) contains .

In case (i), we have by the assumption on the value of the counting function. Additionally, , as we would exchange the string with the first row in . Consider the differences between and . If , then we can make a move involving column , at most one of columns and either column or between and . This allows us to exchange in with in ; this contradicts Lemma 3.10. Hence , which on the other hand contradicts the nonnegativity of the counting function. Exchanging , the row appearing in Lemma 3.10 containing , with the first row of , case (ii) is the same as case (i).
In case (iii), by the assumption on the value of the counting function. Moreover, since we would exchange in with the string in in columns , contradicting Lemma 3.8. We also have , because we could make a quadratic move in columns between in with in , obtaining the string in . Now, we exchange in columns , the string in with in , which produces the string ; this reduces the Hamming distance. Finally, if , we exchange in columns , the string in with in , obtaining , which again reduces the Hamming distance.
In case (iv), by the assumption on the value of the counting function. Additionally, , because otherwise we would exchange in columns the string in with in , thus contradicting Lemma 3.10. Also, , as we would make a quadratic move on columns between and , contradicting again Lemma 3.10. Analogously, . Hence contains the string . We exchange in columns the string in with in , which produces in , which in turn implies by Lemma 3.8. This contradicts the nonnegativity of the counting function.
If , by symmetry, we may assume or . If , then by Lemma 3.10, contains , which contradicts the nonnegativity of the counting function. If , then contains . Then by the assumption. Moreover, , as we would exchange with in columns , contradicting Lemma 3.10.
If , we now consider the value of . We have by assumption. We have by assumption on the nonnegativity of the counting function. Moreover, , since otherwise we would exchange in columns the string in with in contradicting Lemma 3.10. Hence , i.e., contains the string . Now, , by the assumption on the value of . Moreover , by the assumption on the value of the counting function on . Also notice that , as otherwise we exchange in columns the string of with of the first row of , and then we exchange from the first row with in reducing the Hamming distance. Therefore contains the string , which we exchange with the string in in columns and , contradicting Lemma 3.10.

If , then contains . Furthermore, by assumption on the value of . Moreover, exchanging in columns the string of with of , contradicting Lemma 3.10. Analogously, we would contradict Lemma 3.10 for , exchanging in columns , the string in with in . Hence contains the string . Here , by assumption. Moreover, , because of the nonnegativity of the counting function. Also, , because we would contradict Lemma 3.10, exchanging of with of . Therefore contains , but we exchange it with in columns contradicting Lemma 3.10.
If , then , by the assumption on the nonnegativity of the function on . Thus contains different from in columns . Hence we have two identical differences between and , which allow us to make a quadratic move, contradicting Lemma 3.10. ∎

Proposition 3.12.

The disagreement string can be reduced.

Proof.

By Lemma 3.11, the counting function is at most on every row of . As a consequence, there exists a row in , where the function is at most . By the value of the counting function on the row , the entries in must agree in two, three, four or five entries with .
If agrees in five entries, it contains . We exchange with in the first row of , which reduces the Hamming distance between and . If agrees in four entries, we denote by the element where does not agree with . If , then we would have either the string or , which is also in table ; this reduces the Hamming distance. Suppose contains . If , the table contains the same flow. If or , we exchange or with in the first row of .
If agrees with in three entries, we denote by the remaining two. First, note that if are in columns or in columns , we exchange or with in the first row of ; this decreases the Hamming distance.
Assume that both of and are in columns . If contains , then , because otherwise we would exchange the string or with the first row of reducing the Hamming distance. Whence . Moreover , by definition. Additionally, , because we would move to the first row of , reducing the Hamming distance. It follows that . On the other hand, , since the counting function is at most on . Furthermore, , as otherwise we would exchange with in the first row of , reducing the Hamming distance between and . Hence contains either or . For the first, we exchange in columns , the string with in the first row of , and we exchange in with the first row of . For the second, we exchange with the first row of and in with the first row of , which reduces the Hamming distance.
If contains , then applying the automorphism and a transposition between columns and , we are in the case when the row contains .
If are both in columns , we apply moves analogous to the ones above. Then we may assume that is either in column or , and is either in column or . In all these cases, we have and , as all the other possibilities are excluded by exchanging with the first row of . The fact that contradicts the value of the counting function on .
If agrees with in two entries, we have on , since the value of the counting function on is at most . In columns , there is at least one entry which does not agree with the corresponding entry in , because otherwise we would move to the first row of , reducing the Hamming distance. Denoting the elements where they do not agree by , the strings that may contain are: , , and . Note that these are all the possible cases, as the remaining ones are resolved in the same way upon exchanging the string in the first row with in the second row of . If contains , then we exchange the string of in columns with in . We now exchange the string of with the first row in ; these two rows have lower Hamming distance. If contains in columns , then , by the counting function. Moreover, since . Hence or . Now we exchange or with in the first row of reducing the Hamming distance. If contains , by definition or by quadratic moves we can exclude the cases , and . Hence contains , which we exchange with the first row of , decreasing the Hamming distance. ∎

The preceding results of this section yield the following corollary.

Corollary 3.13.

The Hamming distance of two flows can be reduced to at most two.
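The notions used throughout this section can be illustrated by a minimal computational sketch. It rests on assumptions not restated in this section: the group is taken to be Z2 x Z2 with elements encoded as the integers 0 to 3 and addition given by bitwise XOR, a flow is taken to be a tuple of group elements summing to the neutral element, and a quadratic move is modeled as a swap of entries between two rows in a chosen set of columns that leaves both rows flows. The function names and the sample rows are hypothetical and serve only as an illustration.

```python
from functools import reduce
from operator import xor

# Assumption: elements of Z2 x Z2 are encoded as 0..3, with XOR as addition.

def is_flow(row):
    """Return True if the entries of the row sum to the neutral element."""
    return reduce(xor, row, 0) == 0

def hamming(r, s):
    """Number of columns in which the rows r and s differ."""
    return sum(a != b for a, b in zip(r, s))

def quadratic_move(r, s, cols):
    """Swap the entries of r and s in the columns `cols`.

    The swap is accepted only if both resulting rows are again flows.
    """
    r2, s2 = list(r), list(s)
    for c in cols:
        r2[c], s2[c] = s2[c], r2[c]
    r2, s2 = tuple(r2), tuple(s2)
    if not (is_flow(r2) and is_flow(s2)):
        raise ValueError("the swap does not preserve flows")
    return r2, s2

# Hypothetical data: `target` plays the role of a row in the other table.
target = (1, 2, 3, 0, 0, 0)
r = (1, 1, 3, 3, 0, 0)
s = (0, 2, 0, 0, 2, 0)

before = hamming(r, target)
r2, s2 = quadratic_move(r, s, [1, 3])
after = hamming(r2, target)
```

In this example the swap in columns 1 and 3 is admissible because the swapped entries have the same sum in both rows, and it lowers the Hamming distance to the target row from `before` to `after`.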

3.3 The disagreement string

In this section, we proceed in the case of the disagreement string .

(1)

Let us denote the row in starting with the string by and the row in starting with the string by . After fixing the first rows and the first two columns, we make moves of degree at most four on the rest of the tables in such a way that the number of agreements in and is maximized.

Remark 3.14.

Corollary 3.6 ensures that, after possibly making moves of degree at most four, the rows and in and respectively, agree in at least entries. Up to the action of on the leaves, and hence on the columns, these are the last columns.

Definition 3.15.

The string in the last columns of the rows and is the agreement string between and . Up to the action of the group of flows , these entries are zeros.
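The normalization to zeros can be sketched as follows, under the assumption that the group of flows acts on rows by entrywise addition of a fixed flow, with Z2 x Z2 encoded as the integers 0 to 3 and XOR as addition. The helper names, the sample rows, and the choice of column 0 as the balancing column are hypothetical and made only for illustration; they are not taken from the text.

```python
from functools import reduce
from operator import xor

def translate(row, g):
    """Add the flow g to the row entrywise (XOR is addition in Z2 x Z2)."""
    return tuple(a ^ b for a, b in zip(row, g))

def zeroing_flow(r, s):
    """Build a flow g that cancels the entries where r and s agree.

    Column 0 is used as a balancing column so that g sums to zero;
    this choice is an assumption made for the sketch.
    """
    g = [r[i] if i > 0 and r[i] == s[i] else 0 for i in range(len(r))]
    g[0] = reduce(xor, g[1:], 0)
    return tuple(g)

# Hypothetical rows that agree in the last four columns.
r = (1, 2, 0, 3, 3, 3)
s = (2, 1, 0, 3, 3, 3)

g = zeroing_flow(r, s)
r0, s0 = translate(r, g), translate(s, g)
```

Adding the same flow to both rows leaves their Hamming distance unchanged, since two entries agree after translation exactly when they agreed before.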

Our aim is to prove the following three crucial cases, which we refer to as the main case:

()

In Section 3.3.1, we reduce any other possible case to one of the above.

3.3.1 Reduction to the main case

Up to the action of the group of flows , there are at least as many copies of as copies of in the first two columns of . Up to the action of , we may assume . We will show that all cases can be resolved, by reducing to the main case (3.3).
We first record a useful lemma, which we will use to easily resolve some of the cases.

Lemma 3.16.

If in table in (1) we have , then the corresponding cases can be reduced. If in table in (1) we have , then the corresponding cases can be reduced.

Proof.

If , then in we have either the cubic move or . The second sentence is the symmetric version of the first: acting with the flow on the tables, we produce the same tables as in the first statement. ∎

We now analyze all the possible cases. We refer to the tables and in (1).

Case . In this case, the table has the form:

We may have .

.
Here, is reduced by Lemma 3.16. Hence we have (Case I) or (Case II).

.
Here, (Case X), (Case VII), (Case VI).

.
Here, (Case IV), (Case V), is resolved by Lemma 3.16.

Case . In this case, the table has the form:

We may have .

.
Here, (which is Case II by acting with the flow and ), (Case III), resolved by Lemma 3.16.

.
Here, (Case IX), (which is Case II by acting with the flow , transposing and ), (which is Case V by acting with the flow and a transposition).

.
Here, (which is Case V by acting with the flow