The Complexity of Phylogeny Constraint Satisfaction Problems

The first author has received funding from the European Research Council (ERC, grants no. 257039 and no. 681988) and funding from the German Science Foundation (DFG, project no. 622397). The second author is partially supported by the Swedish Research Council (VR) under grant 621-2012-3239. The third author received funding from the European Research Council under the European Community’s Seventh Framework Programme (Grant no. 257039), the project P27600 of the Austrian Science Fund (FWF), and the Vietnam National Foundation for Science and Technology Development (NAFOSTED) under grant number 101.99-2016.16.


Manuel Bodirsky (Institut für Algebra, TU Dresden, 01062 Dresden, Germany), Peter Jonsson (Department of Computer and Information Science, Linköpings Universitet, SE-581 83 Linköping, Sweden), and Trung Van Pham (Department of Mathematics for Computer Science, Institute of Mathematics, Vietnam Academy of Science and Technology, 18 Hoang Quoc Viet Road, Cau Giay District, Hanoi, Vietnam)
Abstract

We systematically study the computational complexity of a broad class of computational problems in phylogenetic reconstruction. The class contains for example the rooted triple consistency problem, forbidden subtree problems, the quartet consistency problem, and many other problems studied in the bioinformatics literature. The studied problems can be described as constraint satisfaction problems where the constraints have a first-order definition over the rooted triple relation. We show that every such phylogeny problem can be solved in polynomial time or is NP-complete. On the algorithmic side, we generalize a well-known polynomial-time algorithm of Aho, Sagiv, Szymanski, and Ullman for the rooted triple consistency problem. Our algorithm repeatedly solves linear equation systems to construct a solution in polynomial time. We then show that every phylogeny problem that cannot be solved by our algorithm is NP-complete. Our classification establishes a dichotomy for a large class of infinite structures that we believe is of independent interest in universal algebra, model theory, and topology. The proof of our main result combines results and techniques from various research areas: a recent classification of the model-complete cores of the reducts of the homogeneous binary branching C-relation, Leeb’s Ramsey theorem for rooted trees, and universal algebra.

1 Introduction

Phylogenetic consistency problems are computational problems that have been studied for phylogenetic reconstruction in computational biology, but also in other areas dealing with large amounts of possibly inconsistent data about trees, such as database theory [2], computational genealogy, and computational linguistics. Given a collection of partial information about a tree, we would like to know whether the information is consistent in the sense that there exists a single tree that is compatible with all the given partial information. A concrete example of a computational problem in this context is the rooted triple consistency problem. For an informal description of this problem we consider the evolution process as a rooted binary tree in which each node represents a species and the root represents the origin of life. In an instance of the problem, we are given a set $V$ of variables, and a set of triples from $V^3$, written in the form $ab|c$ where $a,b,c \in V$, and we would like to know whether there exists a rooted tree $T$ whose leaves are from $V$ such that for each of the given triples $ab|c$ the youngest common ancestor of $a$ and $b$ in this tree is a descendant of the youngest common ancestor of $a$ and $c$. Aho, Sagiv, Szymanski, and Ullman presented a polynomial-time algorithm for this problem [2].

Many computational problems that are defined similarly to the rooted triple consistency problem have been studied in the literature. Examples include the subtree avoidance problem (Ng, Steel, and Wormald [38]) and the forbidden triple problem (Bryant [23]), which are NP-hard problems. Bodirsky & Mueller [14] have determined the complexity of rooted phylogeny problems for the special case where the constraint relations are disjunctions of atomic formulas of the form $ab|c$. This result covers, for instance, the subtree avoidance problem and the forbidden triple problem.

We present a considerable strengthening of the result of Bodirsky & Mueller [14], and classify the complexity of phylogeny problems for all sets of phylogeny constraints that can be defined as a Boolean combination of the mentioned rooted triple relation and the equality relation (on leaves). The reader should be aware that many problems of this type may appear exotic from a biological point of view; the name “phylogeny” should not be taken too literally. Our results show that each of the problems obtained in this way is polynomial-time solvable or NP-complete. As we will demonstrate later (see Section 2), this class of problems is expressive enough to contain also unrooted phylogeny problems. A famous example of such an unrooted phylogeny problem is the NP-complete quartet consistency problem [40]: here we are given a set $V$ of variables, and a set of quartets $ab{:}cd$ with $a,b,c,d \in V$, and we would like to know whether there exists a tree $T$ with leaves from $V$ such that for each of the given quartets $ab{:}cd$ the shortest path from $a$ to $b$ does not intersect the shortest path from $c$ to $d$ in $T$. Another phylogeny problem that has been studied in the literature and that falls into the framework of this paper (but not into the one in [14]) is the tree discovery problem [2]: here, the input consists of a set of 4-tuples of variables, and the task is to find a rooted tree $T$ such that for each $4$-tuple $(a,b,c,d)$ in the input the youngest common ancestor of $a$ and $b$ is a proper descendant of the youngest common ancestor of $c$ and $d$.

The proof of the complexity classification is based on a variety of methods and results. Our first step is that we give an alternative description of phylogeny problems as constraint satisfaction problems (CSPs) over a countably infinite domain where the constraint relations are first-order definable over the (up to isomorphism unique) homogeneous binary branching C-relation, a well-known structure in model theory. A central result that simplifies our work considerably is a recent analysis of the endomorphism monoids of such relations [10]. Informally, this result implies that there are precisely four types of phylogeny problems: (1) trivial (i.e., if there is a solution, there is a constant solution), (2) rooted, (3) unrooted, and (4) degenerate cases that have been called equality CSPs [12]. We will show that all unrooted phylogeny problems are NP-hard, and the complexity of all equality CSPs is already known.

The basic method to proceed from there is the algebraic approach to constraint satisfaction problems. Here, one studies certain sets of operations (known as polymorphisms) instead of analysing the constraints themselves. An important tool to work with polymorphisms over infinite domains is Ramsey theory. In this paper, we need a Ramsey result for rooted trees due to Leeb [36], for proving that polymorphisms behave canonically on large parts of the domain (in the sense of Bodirsky & Pinsker [16]), and this allows us to perform a simplified combinatorial analysis.

Interestingly, all phylogeny problems that can be solved in polynomial time fall into one class and can be solved by the same algorithm. This algorithm is a considerable extension of the algorithm by Aho, Sagiv, Szymanski, and Ullman [2] for the rooted triple consistency problem. It repeatedly solves systems of linear Boolean equations to decide satisfiability of a phylogeny problem from this class. An illustrative example of a phylogeny problem that can be solved in polynomial time by our algorithm, but not by the algorithms from [2, 14], is the following computational problem: the input is a 4-uniform hypergraph with vertex set $V$; the question is whether there exists a rooted tree $T$ with leaf set $V$ such that for every hyperedge in the input, $T$ has two disjoint subtrees that each contain precisely two of the vertices of the hyperedge.

All phylogeny problems that cannot be solved by our algorithm are NP-complete. Our results are stronger than this complexity dichotomy, though, and we prove that every phylogeny problem satisfies a universal-algebraic dichotomy statement that holds for a large class of infinite structures (Theorem 17), which is of independent interest in the study of homogeneous structures and their polymorphism clones. In this respect, the situation is similar to previous classifications for CSPs where the constraints are first-order definable over the order of the rationals from [13], or the random graph [18]. In comparison to these previous works, the dichotomy we present here is easier to state (there is just one tractable class), but harder to prove with existing methods: in particular, unlike the situation for constraints that are first-order definable over the random graph [18], the polymorphisms that characterise the tractable cases cannot be chosen to be canonical (in the sense of Bodirsky & Pinsker [16]) on the entire domain. As such, our dichotomy provides an important test case for potentially much wider classifications of CSPs of homogeneous structures.

The paper has the following structure. We provide basic definitions concerning phylogeny problems in Section 2, and also explain how these problems can be viewed as constraint satisfaction problems for reducts of the homogeneous binary branching C-relation. Section 3 provides a brief but self-contained introduction to the universal-algebraic approach to the complexity of constraint satisfaction, and in Section 4 we collect known results that we will use in our proof. Section 5 applies the universal-algebraic approach to phylogeny problems, and we derive structural properties of phylogeny problems that do not simulate a known hard phylogeny problem. In Section 6 we translate these structural properties into definability properties in terms of syntactically restricted formulas, called affine Horn formulas. This section also contains our algorithm for solving the tractable cases. In Section 7, we present a characterisation of our tractable class of phylogeny problems based on polymorphisms. Finally, in Section 8 we put everything together and state and prove our main results, including the mentioned complexity dichotomy.

This article is a revised and extended version of an earlier conference publication [9].

2 Phylogeny Problems

In this section, we define (in Sections 2.1 and 2.2) the class of phylogeny problems studied in this article and illustrate it by providing examples from the literature. We continue in Section 2.3 by showing how to formulate such phylogeny problems as constraint satisfaction problems over an infinite domain.

2.1 Rooted trees

We fix some standard terminology concerning rooted trees. Let $T$ be a tree (i.e., an undirected, acyclic, and connected graph) with a distinguished vertex $r$, the root of $T$. The vertices of $T$ are denoted by $V(T)$. All trees in this paper will be binary, i.e., all vertices except for the root have either degree $3$ or $1$, and the root has either degree $2$ or $0$. The leaves $L(T)$ of $T$ are the vertices of $T$ of degree one.

For $u,v \in V(T)$, we say that $u$ lies below $v$ if the path from $u$ to $r$ passes through $v$. We say that $u$ lies strictly below $v$ if $u$ lies below $v$ and $u \neq v$. The youngest common ancestor (yca) of a set of vertices $S \subseteq V(T)$ is the node $u$ that lies above all vertices in $S$ and has maximal distance from $r$; this node is uniquely determined by $S$.

Definition 1.

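The leaf structure of a binary rooted tree $T$ is the relational structure $(L(T);C)$, where $L(T)$ is the set of leaves of $T$ and $C(a,bc)$ holds if and only if $yca(\{b,c\})$ lies strictly below $yca(\{a,b,c\})$ in $T$. We also call $T$ the underlying tree of the leaf structure.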

It is well-known that a rooted tree is uniquely determined by its leaf structure (Theorem 3 in [40]).

Definition 2.

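For finite $S_1,S_2 \subseteq V(T)$, we write $S_1|S_2$ if neither of $yca(S_1)$ and $yca(S_2)$ lies below the other. For arbitrary sequences of (not necessarily distinct) vertices $x_1,\dots,x_n$ and $y_1,\dots,y_m$ with $n,m \geq 1$ we write $x_1 \dots x_n|y_1 \dots y_m$ if $\{x_1,\dots,x_n\}|\{y_1,\dots,y_m\}$.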

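In particular, $xy|z$ (this notation is widespread in the literature on phylogeny problems [40, 41, 34, 43]) is equivalent to $C(z,xy)$. Note that if $xy|z$ then this includes the possibility that $x=y$; however, $xy|z$ implies that $x \neq z$ and $y \neq z$. Hence, for every triple $x,y,z$ of leaves in a rooted binary tree, we either have $xy|z$, $yz|x$, $xz|y$, or $x=y=z$. Also note that $x_1 \dots x_n|y_1 \dots y_m$ if and only if $x_i x_j|y_k$ and $x_i|y_k y_l$ for all $i,j \leq n$ and $k,l \leq m$.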
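The notions of Definition 2 are easy to experiment with on small finite trees. The following is a minimal sketch, assuming a tree is stored as a child-to-parent dictionary; the names `path_to_root`, `lies_below`, `yca`, and `separated` are hypothetical helpers chosen for this illustration and do not come from the paper.

```python
from typing import Dict, Hashable, Iterable, List, Optional

Node = Hashable
Parent = Dict[Node, Optional[Node]]  # child -> parent, the root maps to None


def path_to_root(parent: Parent, v: Node) -> List[Node]:
    """Vertices v, parent(v), ..., root; v lies below each of them."""
    path = [v]
    while parent[path[-1]] is not None:
        path.append(parent[path[-1]])
    return path


def lies_below(parent: Parent, u: Node, v: Node) -> bool:
    """True if the path from u to the root passes through v (u == v allowed)."""
    return v in path_to_root(parent, u)


def yca(parent: Parent, vertices: Iterable[Node]) -> Node:
    """Youngest common ancestor of a non-empty collection of vertices."""
    vs = list(vertices)
    common = set(path_to_root(parent, vs[0]))
    for v in vs[1:]:
        common &= set(path_to_root(parent, v))
    # Walking upwards from vs[0], the first common vertex is farthest from the root.
    return next(u for u in path_to_root(parent, vs[0]) if u in common)


def separated(parent: Parent, s1: Iterable[Node], s2: Iterable[Node]) -> bool:
    """The relation of Definition 2: neither yca lies below the other."""
    a, b = yca(parent, s1), yca(parent, s2)
    return not lies_below(parent, a, b) and not lies_below(parent, b, a)


# Example tree: the root r has children u and c; u has children a and b.
parent = {"r": None, "u": "r", "c": "r", "a": "u", "b": "u"}
```

On this example tree, `separated(parent, ["a", "b"], ["c"])` holds, whereas `separated(parent, ["a", "c"], ["b"])` does not, since the leaf `b` lies below the root, which is the youngest common ancestor of `a` and `c`.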

2.2 Phylogeny problems

An atomic phylogeny formula is a formula of the form $xy|z$ or of the form $x=y$. A phylogeny formula is a quantifier-free formula that is built from atomic phylogeny formulas with the usual Boolean connectives (disjunction, conjunction and negation).

We say that a phylogeny formula $\phi$ with variables $V$ is satisfiable if there exists a rooted binary tree $T$ and a mapping $s$ from $V$ to the leaves $L(T)$ of $T$ such that $\phi$ is satisfied by $T$ under $s$ (with the usual semantics of first-order logic). In this case we also say that $(T,s)$ is a solution to $\phi$.

Let $\Phi$ be a finite set of phylogeny formulas. Then the phylogeny problem for $\Phi$ is the following computational problem.

Phylo($\Phi$)
INSTANCE: A finite set $V$ of variables, and a finite set $\Psi$ of phylogeny formulas obtained from phylogeny formulas $\phi \in \Phi$ by substituting the variables from $\phi$ by variables from $V$.
QUESTION: Is there a tree $T$ and a mapping $s$ from $V$ to the leaves $L(T)$ of $T$ such that $(T,s)$ satisfies all formulas from $\Psi$?

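If $x_1,\dots,x_n$ and $y_1,\dots,y_m$ are sequences of leaves in a binary tree $T$, then by our observation above $x_1 \dots x_n|y_1 \dots y_m$ holds in $T$ if and only if $x_i x_j|y_k$ and $x_i|y_k y_l$ hold in $T$ for arbitrary $i,j \leq n$ and $k,l \leq m$. Thus for variables $x_1,\dots,x_n,y_1,\dots,y_m$, we may use $x_1 \dots x_n|y_1 \dots y_m$ as a shortcut for the formula

$$\bigwedge_{i,j \leq n,\; k,l \leq m} \big( x_i x_j|y_k \,\wedge\, x_i|y_k y_l \big)$$
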
and we use as a shortcut for .

Example 1.

A fundamental problem in phylogenetic reconstruction is the rooted triple consistency problem [2, 24, 32, 40] that was already mentioned in the introduction. This problem can be stated conveniently as Phylo($\{xy|z\}$). That is, an instance of the rooted triple consistency problem consists of a finite set of variables $V$ and a finite set of atomic formulas of the form $xy|z$ where $x,y,z \in V$, and the question is whether there exists a tree $T$ and a mapping $s$ from $V$ to the leaves of $T$ such that for every formula $xy|z$ in the input, $s(x)s(y)|s(z)$ holds in $T$. ∎
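For orientation, the classical approach to rooted triple consistency can be sketched in a few lines. The code below is a textbook-style rendering of the recursive partitioning idea usually attributed to Aho, Sagiv, Szymanski, and Ullman [2]; it is not the generalised algorithm of this paper. It assumes that variables are strings, that every variable becomes its own leaf, and that each triple `(a, b, c)` encodes the constraint "a and b are separated from c" with three distinct variables. The names `components` and `build` are hypothetical.

```python
from typing import Dict, List, Optional, Sequence, Tuple, Union

Triple = Tuple[str, str, str]        # (a, b, c) encodes the rooted triple ab|c
Tree = Union[str, Tuple["Tree", "Tree"]]  # a leaf name or a pair of subtrees


def components(leaves: Sequence[str], triples: Sequence[Triple]) -> List[List[str]]:
    """Connected components of the graph with an edge a-b for every triple ab|c."""
    rep: Dict[str, str] = {v: v for v in leaves}

    def find(v: str) -> str:
        while rep[v] != v:
            rep[v] = rep[rep[v]]  # path halving
            v = rep[v]
        return v

    for a, b, _ in triples:
        rep[find(a)] = find(b)
    groups: Dict[str, List[str]] = {}
    for v in leaves:
        groups.setdefault(find(v), []).append(v)
    return list(groups.values())


def build(leaves: Sequence[str], triples: Sequence[Triple]) -> Optional[Tree]:
    """Return a binary tree satisfying all triples, or None if none exists."""
    leaves = list(leaves)
    if len(leaves) == 1:
        return leaves[0]
    inside = set(leaves)
    # Only triples whose three variables all occur among the current leaves matter.
    relevant = [t for t in triples if all(x in inside for x in t)]
    parts = components(leaves, relevant)
    if len(parts) == 1:
        return None  # the leaves cannot be split at the root: inconsistent
    subtrees = []
    for part in parts:
        sub = build(part, relevant)
        if sub is None:
            return None
        subtrees.append(sub)
    # Combine the components pairwise so that the resulting tree is binary.
    tree: Tree = subtrees[0]
    for sub in subtrees[1:]:
        tree = (tree, sub)
    return tree
```

For instance, `build(["a", "b", "c", "d"], [("a", "b", "c"), ("c", "d", "a")])` groups `a` with `b` and `c` with `d`, while the triples `ab|c` and `bc|a` on three leaves are reported as inconsistent.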

Example 2.

The following NP-complete problem was introduced and studied in a closely related form by Ng, Steel, and Wormald [38]. We are given a set of rooted trees on a common leaf set $V$, and we would like to know whether there exists a tree $T$ with leaf set $V$ such that, intuitively, for each of the given trees $T'$ the subtree of $T$ induced by the leaves of $T'$ is not the same as $T'$.

The hardness proof for this problem given by Ng, Steel, and Wormald [38] shows that already the phylogeny problem , which can be seen as a special case of the problem above, is NP-hard. ∎

Example 3.

The hardness proof for the rooted subtree avoidance problem given by Ng, Steel, and Wormald [38] cannot be adapted to show hardness of Phylo; a hardness proof can be found in Bryant’s PhD thesis [23] (Section 2.6.2). ∎

Example 4.

The quartet consistency problem described in the introduction can be cast as Phylo where is the following phylogeny formula.

Indeed, this formula describes all rooted trees with leaves where the shortest path from to does not intersect the shortest path from to (whether or not this is true is in fact independent from the position of the root). ∎

Example 5.

Let be the formula . Then Phylo models the following computational problem. The input consists of a 4-uniform hypergraph with a finite set of vertices $V$; the task is to determine a binary tree $T$ with leaf set $V$ such that for every hyperedge $e$ in the input, exactly two of the four vertices of $e$ lie below each child of $yca(e)$ in $T$. This example cannot be solved by the algorithm of Aho, Sagiv, Szymanski, and Ullman [2], nor by the generalisation of this algorithm presented in [14]. However, the problem can be solved in polynomial time by the algorithm presented in Section 6.4. ∎
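The condition in this example is easy to verify for a given candidate tree, even though finding such a tree is the hard part. The following sketch only checks a proposed solution; it assumes that binary trees are nested pairs with string leaves, that every hyperedge consists of four distinct leaves of the tree, and it uses the hypothetical names `leaf_set` and `splits_two_two`.

```python
from typing import Iterable, Set, Tuple, Union

Tree = Union[str, Tuple["Tree", "Tree"]]  # a leaf name or a pair of subtrees


def leaf_set(tree: Tree) -> Set[str]:
    """All leaves occurring in the given (sub)tree."""
    if not isinstance(tree, tuple):
        return {tree}
    return leaf_set(tree[0]) | leaf_set(tree[1])


def splits_two_two(tree: Tree, edge: Iterable[str]) -> bool:
    """True if exactly two vertices of the hyperedge lie below each child
    of the youngest common ancestor of the hyperedge."""
    wanted = set(edge)
    node = tree
    while isinstance(node, tuple):
        left = leaf_set(node[0]) & wanted
        right = leaf_set(node[1]) & wanted
        if left and right:
            # node is the youngest common ancestor of the hyperedge
            return len(left) == 2 and len(right) == 2
        node = node[0] if left else node[1]
    return False
```

For the hyperedge `{a, b, c, d}`, the balanced tree `(("a", "b"), ("c", "d"))` passes the check, while the caterpillar `((("a", "b"), "c"), "d")` fails it because three of the four vertices lie below one child of the root.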

Our main results (stated in Section 8) imply a full classification of the computational complexity of Phylo($\Phi$).

Theorem 1.

Let $\Phi$ be a finite set of phylogeny formulas. Then Phylo($\Phi$) is in P or NP-complete.

2.3 Phylogeny problems as CSPs

As mentioned in the introduction, every phylogeny problem can be formulated as a constraint satisfaction problem over an infinite domain. This reformulation will be essential to use universal-algebraic and Ramsey-theoretic tools in our complexity classification of phylogeny problems.

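Let $\Gamma$ be a structure with relational signature $\tau = \{R_1,R_2,\dots\}$. That is, $\Gamma$ is a tuple $(D;R_1^\Gamma,R_2^\Gamma,\dots)$ where $D$ is the (finite or infinite) domain of $\Gamma$ and where $R_i^\Gamma \subseteq D^{k_i}$ is a relation of arity $k_i$ over $D$. When $\Delta$ and $\Gamma$ are two $\tau$-structures, then a homomorphism from $\Delta$ to $\Gamma$ is a mapping $h$ from the domain of $\Delta$ to the domain of $\Gamma$ such that for all $R \in \tau$ and for all $(x_1,\dots,x_k) \in R^\Delta$ we have $(h(x_1),\dots,h(x_k)) \in R^\Gamma$.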

Suppose that the signature $\tau$ of $\Gamma$ is finite. Then the constraint satisfaction problem for $\Gamma$, denoted by CSP($\Gamma$), is the following computational problem.

CSP($\Gamma$)
INSTANCE: A finite $\tau$-structure $\Delta$.
QUESTION: Is there a homomorphism from $\Delta$ to $\Gamma$?
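When the template is finite, the homomorphism question can be decided by exhaustive search, which may help to make the definition concrete. The sketch below is only an illustration for finite templates (the templates studied in this paper are infinite, so it does not apply to them directly); structures are assumed to be dictionaries from relation names to sets of tuples, and `is_homomorphism` and `find_homomorphism` are hypothetical names.

```python
from itertools import product
from typing import Dict, Hashable, Iterable, Optional, Set, Tuple

Structure = Dict[str, Set[Tuple[Hashable, ...]]]  # relation name -> set of tuples


def is_homomorphism(h: Dict[Hashable, Hashable],
                    source: Structure, target: Structure) -> bool:
    """Check that h maps every tuple of every source relation into the target relation."""
    return all(
        tuple(h[x] for x in tup) in target.get(name, set())
        for name, tuples in source.items()
        for tup in tuples
    )


def find_homomorphism(src_domain: Iterable[Hashable], source: Structure,
                      tgt_domain: Iterable[Hashable], target: Structure
                      ) -> Optional[Dict[Hashable, Hashable]]:
    """Brute-force search over all maps between two finite domains."""
    src = list(src_domain)
    tgt = list(tgt_domain)
    for images in product(tgt, repeat=len(src)):
        h = dict(zip(src, images))
        if is_homomorphism(h, source, target):
            return h
    return None


# Template: the complete graph on three vertices (graph 3-colourability).
k3 = {"E": {(a, b) for a in range(3) for b in range(3) if a != b}}
# Instances: a 4-cycle and the complete graph on four vertices.
cycle4 = {"E": {(i, (i + 1) % 4) for i in range(4)}}
k4 = {"E": {(a, b) for a in range(4) for b in range(4) if a != b}}
```

Here `find_homomorphism(range(4), cycle4, range(3), k3)` returns a proper 3-colouring of the 4-cycle, while the same call with `k4` returns `None`.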

We say that $\Gamma$ is the template or constraint language of the problem CSP($\Gamma$). We now formulate phylogeny problems as constraint satisfaction problems. Let $\Phi$ be a finite set of phylogeny formulas. If $x_1,\dots,x_k$ are the variables of $\phi \in \Phi$, then we introduce a new relation symbol $R_\phi$ of arity $k$, and we write $\tau$ for the set of all these relation symbols.

If $\Psi$ is an instance of Phylo($\Phi$) with variables $V$, then we associate to $\Psi$ a $\tau$-structure $\Delta_\Psi$ with domain $V$ as follows. For $R_\phi \in \tau$ of arity $k$, the relation $R_\phi^{\Delta_\Psi}$ contains the tuple $(y_1,\dots,y_k) \in V^k$ if and only if the instance $\Psi$ contains a formula that has been obtained from the formula $\phi \in \Phi$ by replacing the variables $x_1,\dots,x_k$ of $\phi$ by the variables $y_1,\dots,y_k$.

Proposition 1.

Let $\Phi$ be a finite set of phylogeny formulas. Then there exists a $\tau$-structure $\Gamma_\Phi$ with countable domain and the following property: an instance $\Psi$ of Phylo($\Phi$) is satisfiable if and only if $\Delta_\Psi$ homomorphically maps to $\Gamma_\Phi$.

The structure $\Gamma_\Phi$ in Proposition 1 is by no means unique, and such structures are easy to construct. The specific choice for $\Gamma_\Phi$ presented below is important later in the proof of our complexity classification for phylogeny problems; as we will see, it has many pleasant model-theoretic properties. To define $\Gamma_\Phi$, we first define a ‘base structure’ $(L;C)$, and then define $\Gamma_\Phi$ in terms of $(L;C)$. The structure $(L;C)$ is a well-studied object in model theory and the theory of infinite permutation groups, and it will be defined via Fraïssé-amalgamation.

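We need a few preliminaries from model theory. Injective homomorphisms that also preserve the complement of each relation are called embeddings. Let $D$ be the domain of a relational $\tau$-structure $\Gamma$, and arbitrarily choose $S \subseteq D$. Then the substructure induced by $S$ in $\Gamma$ is the $\tau$-structure $\Delta$ with domain $S$ such that $R^\Delta = R^\Gamma \cap S^k$ for each $k$-ary $R \in \tau$; we also write $\Gamma[S]$ for $\Delta$. Let $\Gamma_1$ and $\Gamma_2$ be $\tau$-structures with not necessarily disjoint domains $D_1$ and $D_2$, respectively. The intersection of $\Gamma_1$ and $\Gamma_2$ is the structure $\Delta$ with domain $D_1 \cap D_2$ such that $R^\Delta = R^{\Gamma_1} \cap R^{\Gamma_2}$ for all $R \in \tau$. A $\tau$-structure $\Delta$ is an amalgam of $\Gamma_1$ and $\Gamma_2$ if for $i \in \{1,2\}$ there are embeddings $f_i$ of $\Gamma_i$ to $\Delta$ such that $f_1(a) = f_2(a)$ for all $a \in D_1 \cap D_2$. A class $\mathcal{A}$ of $\tau$-structures has the amalgamation property if for all $\Gamma_1,\Gamma_2 \in \mathcal{A}$ there is a $\Delta \in \mathcal{A}$ that is an amalgam of $\Gamma_1$ and $\Gamma_2$. A class of finite $\tau$-structures that has the amalgamation property and is closed under isomorphism and under taking induced substructures is called an amalgamation class.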

Homomorphisms from $\Gamma$ to $\Gamma$ are called endomorphisms of $\Gamma$. An automorphism of $\Gamma$ is a bijective endomorphism whose inverse is also an endomorphism; that is, automorphisms are bijective embeddings of $\Gamma$ into $\Gamma$. The set containing all endomorphisms of $\Gamma$ is denoted by $\mathrm{End}(\Gamma)$ while the set of all automorphisms is denoted by $\mathrm{Aut}(\Gamma)$. For two arbitrary sets $A$ and $B$, a map from a subset of $A$ to $B$ is called a partial map from $A$ to $B$. Let $f$ be an arbitrary partial map from $D$ to $D$, where $D$ is the domain of $\Gamma$. The map $f$ is called a partial isomorphism of $\Gamma$ if $f$ is an isomorphism from $\Gamma[\mathrm{dom}(f)]$ to $\Gamma[f(\mathrm{dom}(f))]$, where $\mathrm{dom}(f)$ denotes the domain of $f$. A relational structure $\Gamma$ is called homogeneous if every partial isomorphism of $\Gamma$ with a finite domain can be extended to an automorphism of $\Gamma$. In this paper a partial isomorphism always means a partial map with a finite domain. Homogeneous structures $\Gamma$ with finite relational signature are $\omega$-categorical, i.e., all countable structures that satisfy the same first-order sentences as $\Gamma$ are isomorphic (see e.g. [26] or [33]).

Theorem 2 (Fraïssé; see Theorem 7.1.2 in [33]).

Let $\mathcal{A}$ be an amalgamation class with countably many non-isomorphic members. Then there is a countably infinite homogeneous $\tau$-structure $\Gamma$ such that $\mathcal{A}$ is the class of all finite structures that embed into $\Gamma$. The structure $\Gamma$, which is unique up to isomorphism, is called the Fraïssé limit of $\mathcal{A}$.

When working with relational structures, it is often convenient to not distinguish between a relation and its relation symbol. For instance, when we write $(L(T);C)$ for a leaf structure (Definition 1), the letter $C$ stands both for the relation symbol, and for the relation itself. This should never cause confusion.

Proposition 2 (see Proposition 7 in [10]).

The class of all leaf structures of finite rooted binary trees is an amalgamation class.

We write $(L;C)$ for the Fraïssé-limit of the amalgamation class from Proposition 2. This structure is well-studied in the literature, and the relation $C$ is commonly referred to as the binary branching homogeneous C-relation. It has been studied in particular in the context of infinite permutation groups [1, 26]. There is also a substantial literature on C-minimal structures, which are analogous to o-minimal structures, but where a C-relation plays the role of the order in an o-minimal structure [31, 37].

Definition 3.

Let $\Gamma$ be a structure. Then a relational structure $\Delta$ with the same domain as $\Gamma$ is called a reduct of $\Gamma$ if all relations of $\Delta$ have a first-order definition in $\Gamma$ (using conjunction, disjunction, negation, universal and existential quantification, in the usual way, but without parameters). That is, for every relation $R$ of arity $k$ of $\Delta$ there exists a first-order formula $\phi$ with free variables $x_1,\dots,x_k$ such that $R(a_1,\dots,a_k)$ if and only if $\phi(a_1,\dots,a_k)$ holds in $\Gamma$.

It is well-known that all structures with a first-order definition in an $\omega$-categorical structure are again $\omega$-categorical (we refer once again to [33], Theorem 7.3.8; the analogous statement for homogeneity is false).

Proof.

(Proposition 1) Let $\Phi$ be a finite set of phylogeny formulas. Let $\Gamma_\Phi$ be the reduct of $(L;C)$ defined as follows. For every $\phi \in \Phi$ with free variables $x_1,\dots,x_k$, we have the $k$-ary relation $R_\phi$ in $\Gamma_\Phi$ which is defined by the formula $\phi$ over $(L;C)$. It follows (in a straightforward way) from the definitions that this structure has the properties required in the statement of Proposition 1. ∎

Conversely, every CSP for a reduct $\Gamma$ of $(L;C)$ corresponds to a phylogeny problem. To see this, we need the following well-known fact.

Theorem 3 (see, e.g., [33]).

An $\omega$-categorical structure $\Gamma$ is homogeneous if and only if it has quantifier-elimination, that is, every first-order formula over $\Gamma$ is equivalent to a quantifier-free formula.

For each relation symbol $R$ of $\Gamma$, let $\phi_R$ be a quantifier-free first-order definition of $R^\Gamma$ in $(L;C)$, and let $\Phi$ be the set of these formulas. When $\Delta$ is an instance of CSP($\Gamma$), consider the instance $\Psi$ of Phylo($\Phi$) where the variables are the vertices of $\Delta$, and where $\Psi$ contains for every tuple $(v_1,\dots,v_k) \in R^\Delta$ the formula $\phi_R(v_1,\dots,v_k)$. It is again straightforward to verify that $\Delta$ homomorphically maps to $\Gamma$ if and only if $\Psi$ is a satisfiable instance of Phylo($\Phi$).

Therefore, the class of phylogeny problems corresponds precisely to the class of CSPs whose template is a reduct of $(L;C)$.

3 The Universal-Algebraic Approach

We utilize the so-called universal-algebraic approach to obtain our results. For a more detailed introduction to this approach for $\omega$-categorical templates, see Bodirsky [5]. We introduce some central concepts concerning definability (Section 3.1), polymorphisms (Section 3.2), and model-completeness and cores (Section 3.3). In the final subsection (Section 3.4), we discuss how Ramsey theory can be used for analyzing polymorphisms. By using the language of universal algebra, one can elegantly state the border between tractability and NP-hardness for phylogeny problems; we present this border in Section 3.3.

3.1 Primitive Positive Definability

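Let $\phi$ denote a first-order formula over the signature $\tau$, and assume that the variables $x_1,\dots,x_k$ are free in $\phi$. The formula $\phi$ is primitive positive if it is of the form $\exists x_{k+1},\dots,x_l\,(\psi_1 \wedge \dots \wedge \psi_m)$ where $\psi_1,\dots,\psi_m$ are atomic, that is, each $\psi_i$ equals either $x_p = x_q$ or $R(x_{i_1},\dots,x_{i_r})$ where $R \in \tau$ is $r$-ary and $i_1,\dots,i_r \leq l$. When $\Gamma$ is a $\tau$-structure, then $\phi(x_1,\dots,x_k)$ defines over $\Gamma$ a $k$-ary relation, namely the set of all $k$-tuples that satisfy $\phi$ in $\Gamma$. We let $\langle \Gamma \rangle$ denote the set of all finitary relations that are primitive positive definable in $\Gamma$.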
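Over a finite structure, the relation defined by a primitive positive formula can be computed by enumerating all assignments to the free and the existentially quantified variables. The sketch below does exactly that; it is meant only as a small finite illustration (the structures in this paper are infinite), and `pp_define` together with its argument layout is a hypothetical interface. Equality atoms can be handled by adding the equality relation to the dictionary of relations.

```python
from itertools import product
from typing import Dict, Hashable, Iterable, Sequence, Set, Tuple

Relation = Set[Tuple[Hashable, ...]]
Atom = Tuple[str, Tuple[str, ...]]  # (relation name, tuple of variable names)


def pp_define(domain: Iterable[Hashable],
              relations: Dict[str, Relation],
              free_vars: Sequence[str],
              bound_vars: Sequence[str],
              atoms: Sequence[Atom]) -> Relation:
    """Relation defined by: exists bound_vars . (conjunction of atoms)."""
    dom = list(domain)
    all_vars = list(free_vars) + list(bound_vars)
    result: Relation = set()
    for values in product(dom, repeat=len(all_vars)):
        assignment = dict(zip(all_vars, values))
        if all(tuple(assignment[v] for v in args) in relations[name]
               for name, args in atoms):
            # Project the satisfying assignment onto the free variables.
            result.add(tuple(assignment[v] for v in free_vars))
    return result


# Example: a directed path 0 -> 1 -> 2 and the formula  exists z (E(x,z) and E(z,y)).
edges = {"E": {(0, 1), (1, 2)}}
two_step = pp_define(range(3), edges, ["x", "y"], ["z"],
                     [("E", ("x", "z")), ("E", ("z", "y"))])
```

In this example `two_step` contains the single pair `(0, 2)`, the only pair of vertices joined by a directed path of length two.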

Lemma 1 below illustrates the concept of primitive positive definability. The relations that appear in this lemma will be important in later sections.

Lemma 1.

, , and .

Proof.

Note that the formula is equivalent to the primitive positive formulas , , and . Thus, , , and . We have that

so . To see that , note that

Finally,

which implies that , because . ∎

The following result motivates why we are interested in primitive positive definability in connection with the complexity of CSPs.

Lemma 2 ([35]).

Let $\Gamma$ be a template and let $\Gamma'$ be the structure obtained from $\Gamma$ by adding the relation $R$. If $R$ is primitive positive definable in $\Gamma$, then CSP($\Gamma$) and CSP($\Gamma'$) are polynomial-time equivalent.

The following is an application of the above lemma.

Lemma 3.

If then is NP-hard.

Proof.

Lemma 1 shows that . Since implies , and implies , it follows from the definition of that is equivalent to . We have already mentioned in Example 3 that Bryant [23] showed that the CSP for is NP-complete. By Lemma 2, is NP-hard. ∎

Therefore, in the following sections we are particularly interested in those reducts of where . We will prove later that when is a reduct of with finite relational signature such that and , then is in P.

3.2 Polymorphisms

Primitive positive definability can be characterised by preservation under so-called polymorphisms; this is the starting point of the universal-algebraic approach to constraint satisfaction (see, for instance, Bulatov, Jeavons, and Krokhin [25] for this approach over finite domains). The (direct, categorical, or cross) product $\Gamma_1 \times \Gamma_2$ of two relational $\tau$-structures $\Gamma_1$ and $\Gamma_2$ is a $\tau$-structure on the domain $D_1 \times D_2$, where $D_i$ is the domain of $\Gamma_i$. For all relations $R \in \tau$ the relation $R((x_1,y_1),\dots,(x_k,y_k))$ holds in $\Gamma_1 \times \Gamma_2$ iff $R(x_1,\dots,x_k)$ holds in $\Gamma_1$ and $R(y_1,\dots,y_k)$ holds in $\Gamma_2$. Homomorphisms from $\Gamma^k = \Gamma \times \dots \times \Gamma$ to $\Gamma$ are called polymorphisms of $\Gamma$. When $R$ is a relation over the domain $D$, then we say that $f$ preserves $R$ (or that $R$ is closed under $f$) if $f$ is a polymorphism of $(D;R)$. Note that unary polymorphisms of $\Gamma$ are endomorphisms of $\Gamma$. When $\phi$ is a first-order formula that defines $R$, and $f$ preserves $R$, then we also say that $f$ preserves $\phi$. If an operation $f$ does not preserve a relation $R$, we say that $f$ violates $R$.
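For a finite relation, preservation can be tested directly from the definition: apply the operation componentwise to every choice of tuples from the relation and check that the result is again in the relation. The following sketch assumes a finite relation given as a set of tuples; the helper name `preserves` is hypothetical.

```python
from itertools import product
from typing import Callable, Hashable, Set, Tuple

Relation = Set[Tuple[Hashable, ...]]


def preserves(f: Callable[..., Hashable], arity: int, relation: Relation) -> bool:
    """True if the operation f of the given arity preserves the finite relation."""
    for rows in product(relation, repeat=arity):
        # rows are `arity` tuples of the relation; apply f to each coordinate.
        image = tuple(f(*column) for column in zip(*rows))
        if image not in relation:
            return False
    return True


# Example: the order relation <= on {0, 1, 2}.
leq = {(a, b) for a in range(3) for b in range(3) if a <= b}


def sub_mod3(x: int, y: int) -> int:
    """Subtraction modulo 3, an operation that does not respect the order."""
    return (x - y) % 3
```

The binary operation `min` preserves `leq`, whereas `sub_mod3` violates it: applied to the tuples `(0, 1)` and `(1, 1)` it yields `(2, 0)`, which is not in the relation.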

The set of all polymorphisms $\mathrm{Pol}(\Gamma)$ of a relational structure $\Gamma$ forms an algebraic object called a function clone [42], which is a set of operations defined on a set $D$ that is closed under composition and that contains all projections. We write $\mathrm{Pol}^{(k)}(\Gamma)$ for the $k$-ary functions in $\mathrm{Pol}(\Gamma)$. The set $\mathrm{Pol}(\Gamma)$ is locally closed in the following sense. A set of functions $F$ with domain $D$ is locally closed if every function $f$ with the following property belongs to $F$: for every finite subset $A$ of $D$ there is some operation $g \in F$ such that $f(a) = g(a)$ for all $a \in A^k$, where $k$ is the arity of $f$. We write $\overline{F}$ for the smallest set that is locally closed and contains $F$. We say that $F$ generates an operation $g$ if $g$ is in the smallest locally closed function clone that contains $F$.

Polymorphism clones can be used to characterize primitive positive definability over a finite structure; this follows from results by Bodnarčuk, Kalužnin, Kotov, and Romov [22] and Geiger [30]. This is false for general infinite structures. However, the result remains true if the relational structure is $\omega$-categorical.

Theorem 4 (Bodirsky & Nešetřil [15]).

Let $\Gamma$ be a countable $\omega$-categorical structure. Then the primitive positive definable relations in $\Gamma$ are precisely the relations preserved by the polymorphisms of $\Gamma$.

Let $G$ be a permutation group on a set $X$. The orbit of a $k$-tuple $(t_1,\dots,t_k) \in X^k$ under $G$ is the set of all tuples of the form $(\alpha(t_1),\dots,\alpha(t_k))$, where $\alpha$ is a permutation from $G$. The following has been discovered independently by Engeler, Svenonius, and Ryll-Nardzewski.

Theorem 5 (See, e.g., Theorem 7.3.1 in Hodges [33]).

A countable relational structure $\Gamma$ is $\omega$-categorical if and only if the automorphism group of $\Gamma$ is oligomorphic, that is, if for each $k \geq 1$ there are finitely many orbits of $k$-tuples under $\mathrm{Aut}(\Gamma)$. A relation $R$ has a first-order definition in an $\omega$-categorical structure $\Gamma$ if and only if $R$ is preserved by all automorphisms of $\Gamma$.
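Orbits of tuples are simple to compute for a permutation group on a finite set, which may help with intuition even though oligomorphic groups act on infinite sets. The sketch below assumes the group is given by generators, each stored as a dictionary from points to points; for a finite group, closing a tuple under the generators alone already yields its full orbit. The function name `tuple_orbits` is hypothetical.

```python
from itertools import product
from typing import Dict, Hashable, Iterable, List, Sequence, Set, Tuple

Perm = Dict[Hashable, Hashable]


def tuple_orbits(points: Iterable[Hashable], generators: Sequence[Perm],
                 k: int) -> List[Set[Tuple[Hashable, ...]]]:
    """Orbits of k-tuples under the finite group generated by the given permutations."""
    pts = list(points)
    seen: Set[Tuple[Hashable, ...]] = set()
    orbits: List[Set[Tuple[Hashable, ...]]] = []
    for start in product(pts, repeat=k):
        if start in seen:
            continue
        orbit = {start}
        frontier = [start]
        while frontier:
            t = frontier.pop()
            for g in generators:
                image = tuple(g[x] for x in t)  # apply g componentwise
                if image not in orbit:
                    orbit.add(image)
                    frontier.append(image)
        seen |= orbit
        orbits.append(orbit)
    return orbits


# The symmetric group on {0, 1, 2}, generated by a transposition and a 3-cycle.
swap = {0: 1, 1: 0, 2: 2}
rotate = {0: 1, 1: 2, 2: 0}
```

For the symmetric group on three points, pairs fall into two orbits (equal entries, distinct entries) and triples into five orbits, one for each pattern of equalities among the three entries.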

We also need the following observation.

Lemma 4 (Bodirsky & Kara [13]).

Let $\Gamma$ be a relational structure and let $R$ be a $k$-ary relation that is a union of $m$ orbits of $k$-tuples of $\mathrm{Aut}(\Gamma)$. If $\Gamma$ has a polymorphism $f$ that violates $R$, then $\Gamma$ also has an at most $m$-ary polymorphism that violates $R$.

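Given a function $f\colon D^k \to D$, we tacitly extend it to tuples $t_1,\dots,t_k \in D^n$ in the natural way, that is, componentwise:

$$f(t_1,\dots,t_k) := \big(f(t_1[1],\dots,t_k[1]),\dots,f(t_1[n],\dots,t_k[n])\big)$$
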
When , we also write for the set . These conventions will be very convenient when working with polymorphisms.

3.3 Model-Complete Cores and the Border Between Tractability and Hardness

A structure $\Gamma$ is a core if all of its endomorphisms are embeddings. Note that endomorphisms preserve existential positive formulas and embeddings preserve existential formulas. A first-order theory $T$ is said to be model-complete if every embedding between models of $T$ preserves all first-order formulas. A structure is called model-complete if its first-order theory is model-complete. Homogeneous $\omega$-categorical structures provide examples of model-complete structures: the reason is that if $\Gamma$ is $\omega$-categorical and homogeneous, then every first-order formula is equivalent to a quantifier-free formula (Theorem 3). Since embeddings of $\Gamma$ into $\Gamma$ preserve quantifier-free formulas, the statement follows from Lemma 6.

Lemma 5.

The structures and are model-complete cores.

Proof.

Let be an endomorphism of . Suppose for contradiction that for distinct elements of . Then , but not , in contradiction to the assumption that preserves . Hence, is injective. Note that the negation of is equivalent to , and thus has an existential positive definition in . It follows that preserves , too. This implies that is an embedding and is a core. Model-completeness of follows from homogeneity. The structure is a model-complete core, too; the proof is very similar to the proof for and left to the reader. ∎

If $\Gamma$ is $\omega$-categorical, then it is possible to characterize model-completeness in terms of self-embeddings of $\Gamma$, that is, embeddings of $\Gamma$ into $\Gamma$.

Lemma 6 (Lemma 13 in Bodirsky & Pinsker [17]).

A countable $\omega$-categorical structure $\Gamma$ is model-complete if and only if the self-embeddings of $\Gamma$ are generated by the automorphisms of $\Gamma$.

If $\Gamma$ is a core, then every endomorphism of $\Gamma$ is an embedding. We get the following consequence.

Corollary 1.

A countable $\omega$-categorical structure $\Gamma$ is a model-complete core if and only if the endomorphisms of $\Gamma$ are generated by the automorphisms of $\Gamma$.

Note that every first-order expansion of an $\omega$-categorical model-complete core remains a model-complete core.

We say that two structures $\Gamma$ and $\Delta$ are homomorphically equivalent if there exists a homomorphism from $\Gamma$ to $\Delta$, and one from $\Delta$ to $\Gamma$. Clearly, homomorphically equivalent structures have identical CSPs.

Theorem 6 (Theorem 16 in Bodirsky [4]).

Let $\Gamma$ be an $\omega$-categorical structure. Then $\Gamma$ is homomorphically equivalent to a model-complete core $\Delta$. The structure $\Delta$ is unique up to isomorphism, and again $\omega$-categorical.

Hence, we speak in the following of the model-complete core of an $\omega$-categorical structure. Using the concept of polymorphisms and model-complete cores, we can now give a concise description of the border between CSPs for reducts of $(L;C)$ that can be solved in polynomial time, and those that are NP-complete.

Theorem 7.

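Let $\Gamma$ be a reduct of $(L;C)$ with a finite signature, and let $\Delta$ be the model-complete core of $\Gamma$. If $\Delta$ has a binary polymorphism $f$ and endomorphisms $e_1,e_2$ such that $e_1(f(x,y)) = e_2(f(y,x))$ for all elements $x,y$ of $\Delta$, then CSP($\Gamma$) is in P. Otherwise, CSP($\Gamma$) is NP-complete.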

The proof of Theorem 7 can be found in Section 8.

3.4 Ramsey theory for trees

We apply Ramsey theory to find regular behavior in polymorphisms of constraint languages. This approach has succesfully been adopted earlier, see e.g. [13, 16, 21]. The Ramsey theorem we use here is less well known and will be described below. We first give a brief introduction to the way Ramsey theory enters the analysis of constraint languages.

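Let $\mathfrak{A}, \mathfrak{B}$ be finite $\tau$-structures. We write $\binom{\mathfrak{B}}{\mathfrak{A}}$ for the set of all substructures of $\mathfrak{B}$ that are isomorphic to $\mathfrak{A}$. When $\mathfrak{A}, \mathfrak{B}, \mathfrak{C}$ are $\tau$-structures, then we write $\mathfrak{C} \to (\mathfrak{B})^{\mathfrak{A}}_r$ if for all colorings $\chi \colon \binom{\mathfrak{C}}{\mathfrak{A}} \to \{1,\dots,r\}$ there exists $\mathfrak{B}' \in \binom{\mathfrak{C}}{\mathfrak{B}}$ such that $\chi$ is constant on $\binom{\mathfrak{B}'}{\mathfrak{A}}$.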
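The arrow notation can be made concrete with the classical graph-theoretic instance. The sketch below is an illustration only and uses complete graphs rather than the leaf structures of this paper. It colors the copies of an edge inside the complete graph on n vertices with two colors and searches for a triangle whose three edges receive the same color. The function name `arrows_triangle` is hypothetical.

```python
from itertools import combinations, product

def arrows_triangle(n, colors=2):
    """Check by exhaustive search whether every coloring of the edges of K_n
    with the given number of colors has a monochromatic triangle."""
    edges = list(combinations(range(n), 2))
    index = {e: i for i, e in enumerate(edges)}
    # For every triangle, record the positions of its three edges.
    triangles = [tuple(index[e] for e in combinations(t, 3 - 1))
                 for t in combinations(range(n), 3)]
    for coloring in product(range(colors), repeat=len(edges)):
        if not any(coloring[a] == coloring[b] == coloring[c]
                   for a, b, c in triangles):
            return False  # this coloring has no monochromatic triangle
    return True

print(arrows_triangle(5))
print(arrows_triangle(6))
```

The search confirms the well-known fact that six vertices suffice and five do not. The running time is exponential in the number of edges, so this is only feasible for very small n.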

Definition 4.

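A class $\mathcal{K}$ of finite relational structures that is closed under isomorphisms and substructures is called Ramsey if for all $\mathfrak{A}, \mathfrak{B} \in \mathcal{K}$ and for every finite $r \geq 1$ there exists a $\mathfrak{C} \in \mathcal{K}$ such that $\mathfrak{C} \to (\mathfrak{B})^{\mathfrak{A}}_r$.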

A homogeneous structure is called Ramsey if the class of all finite structures that embed into it is Ramsey. We use Ramsey theory to show that the polymorphisms of the structures we study must behave canonically on large parts of the domain, in the sense defined below. A wider introduction to canonical functions can be found in Bodirsky & Pinsker [16] and Bodirsky [6].

Definition 5.

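Let $\Gamma$ be a structure and $S$ be a subset of the domain $D$ of $\Gamma$. A function $f \colon D^l \to D$ is canonical on $S$ with respect to $\Gamma$ if for all $m \geq 1$, $\alpha_1,\dots,\alpha_l \in \mathrm{Aut}(\Gamma)$, and $t_1,\dots,t_l \in S^m$, there exists $\beta \in \mathrm{Aut}(\Gamma)$ such that

$$f(\alpha_1(t_1),\dots,\alpha_l(t_l)) = \beta(f(t_1,\dots,t_l)).$$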

When $\Gamma$ is Ramsey, the following theorem allows us to work with canonical polymorphisms of the expansion of $\Gamma$ by constants.

Theorem 8 (Lemma 21 in Bodirsky, Pinsker, and Tsankov [21]).

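Let $\Gamma$ be a homogeneous ordered Ramsey structure with domain $D$. Let $c_1,\dots,c_m \in D$, and let $f \colon D^l \to D$ be any operation. Then $\{f\} \cup \mathrm{Aut}(\Gamma; c_1,\dots,c_m)$ generates an operation that is canonical with respect to $(\Gamma; c_1,\dots,c_m)$, and which is identical with $f$ on all tuples containing only values from $c_1,\dots,c_m$.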

We now discuss the Ramsey class that is relevant in our context. We have to work with an expansion of $(\mathbb{L};C)$ by a linear order $\prec$ on $\mathbb{L}$, which is also defined as a Fraïssé-limit as follows. A linear order $\prec$ on the elements of a leaf structure is called convex if for all elements $x, y, z$ with $x \prec y \prec z$ we have that either $xy|z$ or $x|yz$ (but not $xz|y$). We consider the class of all convexly ordered leaf structures. The following can be shown by using an appropriate variant of Proposition 2, and we omit the straightforward proof.

Proposition 3.

The class of all convexly ordered leaf structures is an amalgamation class; its Fraïssé-limit is isomorphic to an expansion $(\mathbb{L};C,\prec)$ of $(\mathbb{L};C)$ by a convex linear ordering $\prec$.
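On a finite rooted binary tree, the convexity condition on linear orders of the leaves can be checked directly. The following is a minimal sketch under stated assumptions: trees are encoded as nested Python tuples with strings as leaves, and the rooted triple xy|z is taken to mean that the youngest common ancestor of x and y lies strictly below that of x and z. The encoding and all function names are hypothetical.

```python
from itertools import combinations, permutations

def leaf_paths(tree, prefix=()):
    """Map every leaf to its path of child indices from the root."""
    if not isinstance(tree, tuple):
        return {tree: prefix}
    paths = {}
    for i, child in enumerate(tree):
        paths.update(leaf_paths(child, prefix + (i,)))
    return paths

def lca_depth(p, q):
    """Depth of the youngest common ancestor of two leaves, given their paths."""
    d = 0
    while d < min(len(p), len(q)) and p[d] == q[d]:
        d += 1
    return d

def triple(paths, x, y, z):
    """True if the rooted triple xy|z holds (assumed reading, see lead-in)."""
    return lca_depth(paths[x], paths[y]) > lca_depth(paths[x], paths[z])

def is_convex(tree, order):
    """Check that no leaves x, y, z listed in this order satisfy xz|y."""
    paths = leaf_paths(tree)
    # combinations keeps the positions of the input, so x precedes y precedes z.
    return not any(triple(paths, x, z, y) for x, y, z in combinations(order, 3))

tree = (("a", "b"), ("c", "d"))
print(is_convex(tree, ["a", "b", "c", "d"]))
print(is_convex(tree, ["a", "c", "b", "d"]))
print(sum(is_convex(tree, list(p)) for p in permutations("abcd")))
```

For this tree, exactly the orders in which the leaves below every inner vertex form an interval are convex; there are eight of them, one for each way of swapping the two children at the three inner vertices.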

Clearly, has an automorphism such that if and only if ; we denote this automorphism by .

Theorem 9 (Leeb [36]).

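The structure $(\mathbb{L};C,\prec)$ is Ramsey. In other words, for all convexly ordered leaf structures $\mathfrak{A}, \mathfrak{B}$ and for all $r \geq 1$, there exists a convexly ordered leaf structure $\mathfrak{C}$ such that $\mathfrak{C} \to (\mathfrak{B})^{\mathfrak{A}}_r$.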

A self-contained proof of Theorem 9 can be found in Bodirsky [7].

4 Toolbox

In this section we collect some known results and certain straightforward consequences of them. The section is divided into three parts where we recapitulate results concerning endomorphisms of phylogeny languages in Section 4.1, binary injective polymorphisms in Section 4.2, and equality constraint languages in Section 4.3.

4.1 A Preclassification

We use a fundamental result which can be seen as a classification of the endomorphism monoids of model-complete cores of reducts of $(\mathbb{L};C)$.

Theorem 10 (Bodirsky, Jonsson, & Pham [10]).

Let $\Gamma$ be a reduct of $(\mathbb{L};C)$. Then it satisfies at least one of the following:

  1. $\Gamma$ has a constant endomorphism;

  2. the model-complete core of $\Gamma$ is isomorphic to a reduct of $(\mathbb{L};=)$;

  3. the set of endomorphisms of equals the set of endomorphisms of ;

  4. the set of endomorphisms of equals the set of endomorphisms of .

Item 2 in this theorem has been stated slightly differently in Theorem 1 of [10], namely that $\Gamma$ is homomorphically equivalent to a reduct of $(\mathbb{L};=)$. Note that this is equivalent to the model-complete core of $\Gamma$ being isomorphic to a reduct of $(\mathbb{L};=)$ unless $\Gamma$ has a constant endomorphism. The reason is that the core of a reduct of $(\mathbb{L};=)$ either has one element or is itself a reduct of $(\mathbb{L};=)$.

If has a constant endomorphism, then is trivial. If is homomorphically equivalent to a reduct of , then the complexity of can be determined by known results which we present in Section 4.3 below. In items 3 and 4 of Theorem 10, we can deduce a statement about primitive positive definability of the relation and in .

Lemma 7.

Let be a phylogeny constraint language which does not have a constant endomorphism and which is not homomorphically equivalent to an equality constraint language. Then is a model-complete core, and or is primitive positive definable in .

Proof.

We first show that the relation consists of a single orbit of -tuples of . Arbitrarily choose . Since the entries of the tuples in are pairwise distinct, we have that the map that sends to for all , is a partial isomorphism of . Since is homogeneous, the partial map can be extended to an automorphism of . This implies that consists of one orbit of -tuples of .

If has a primitive positive definition in , then so has by Lemma 1, and because is a model-complete core by Lemma 5, so is by the remark after Corollary 1. If does not have a primitive positive definition in , then there is a polymorphism of that violates by Theorem 4. Since consists of one orbit of -tuples of , there is an endomorphism of that violates by Lemma 4. This implies that is violated by , too, since by Lemma 1 and the polymorphisms of and coincide. Since does not have constant endomorphisms and is not homomorphically equivalent to an equality constraint language, Theorem 10 implies that the relation is preserved by all endomorphisms of . Since is a model-complete core (Lemma 5), it follows that in this case is a model-complete core, too. Recall that is primitive positive definable in by Lemma 1 so is preserved by all endomorphisms of .

For arbitrary tuples , we have that the map that sends to for all is a partial isomorphism of . Since is homogeneous (see e.g. Lemma 14 in [10]), this partial map can be extended to an automorphism of . This implies that is contained in one orbit of -tuples of . If is not preserved by some polymorphism of , it follows from Lemma 4 that is not preserved by an endomorphism of which leads to a contradiction. Therefore, is preserved by all polymorphisms of . We conclude that the relation is primitive positive definable in by Theorem 4. ∎

The problem has been shown to be NP-complete by Steel [40]. Also recall that by Lemma 1. Lemma 7 therefore shows that in order to classify the computational complexity of , we can concentrate on the situation where the relations and are primitive positive definable in .

4.2 Binary Injective Polymorphisms

In this part we present a condition that implies that an -categorical structure has a binary injective polymorphism. The existence of binary injective polymorphisms plays an important role in later parts of the paper.

The following shows a sufficient condition for the existence of a constant endomorphism. A finite subset $S$ of the domain of $\Gamma$ is called a $k$-set if it has $k$ elements. The orbit of a $k$-set $S$ is the set $\{\alpha(S) : \alpha \in \mathrm{Aut}(\Gamma)\}$, where $\alpha(S)$ is the image of $S$ under $\alpha$.

Lemma 8 (Lemma 18 in Bodirsky & Kára [13]).

If $\Gamma$ has only one orbit of $2$-sets and a non-injective polymorphism, then $\Gamma$ has a constant endomorphism.

Definition 6.

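The automorphism group $\mathrm{Aut}(\Gamma)$ is called $k$-transitive if for any two sequences $a_1,\dots,a_k$ and $b_1,\dots,b_k$ of $k$ distinct elements there is $\alpha \in \mathrm{Aut}(\Gamma)$ such that $\alpha(a_i) = b_i$ for all $1 \leq i \leq k$.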

By the homogeneity of $(\mathbb{L};C)$, the structure $(\mathbb{L};C)$ and all its reducts have a $2$-transitive automorphism group. Also note that $k$-transitivity of $\mathrm{Aut}(\Gamma)$ implies that there exists only one orbit of $k$-sets.
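The relationship between transitivity and orbits of sets can be explored on small finite permutation groups. The sketch below is an illustration only: the groups in the paper are automorphism groups of infinite structures, whereas the example uses the symmetric group and the cyclic group on four points. The function names are hypothetical, and `group` is assumed to be closed under composition and inverses.

```python
from itertools import combinations, permutations

def is_k_transitive(group, domain, k):
    """group: list of dicts, each a bijection of the domain.
    For a group it suffices to check that one k-tuple of distinct
    elements can be mapped onto every other such tuple."""
    tuples = list(permutations(domain, k))
    base = tuples[0]
    reachable = {tuple(g[x] for x in base) for g in group}
    return reachable == set(tuples)

def k_set_orbits(group, domain, k):
    """Partition the k-element subsets of the domain into orbits."""
    remaining = {frozenset(s) for s in combinations(domain, k)}
    orbits = []
    while remaining:
        s = next(iter(remaining))
        orbit = {frozenset(g[x] for x in s) for g in group}
        orbits.append(orbit)
        remaining -= orbit
    return orbits

points = range(4)
symmetric_group = [dict(zip(points, p)) for p in permutations(points)]
cyclic_group = [{x: (x + s) % 4 for x in points} for s in range(4)]

print(is_k_transitive(symmetric_group, points, 2), len(k_set_orbits(symmetric_group, points, 2)))
print(is_k_transitive(cyclic_group, points, 2), len(k_set_orbits(cyclic_group, points, 2)))
```

The symmetric group is 2-transitive and therefore has a single orbit of 2-sets. The cyclic group is transitive but not 2-transitive, and its 2-sets split into two orbits (adjacent pairs and opposite pairs).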

Definition 7.

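The relation $\neq$ is 1-independent with respect to $\Gamma$ if for all primitive positive $\tau$-formulas $\phi$, if both $\phi \wedge x \neq y$ and $\phi \wedge u \neq v$ are satisfiable over $\Gamma$, then $\phi \wedge x \neq y \wedge u \neq v$ is satisfiable over $\Gamma$, too.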

This terminology is explained in greater detail by Cohen, Jeavons, Jonsson, and Koubarakis [28]. Let

and

We will use the following known results.

Lemma 9 (Corollary 2.3 in Bodirsky, Jonsson, & von Oertzen [11]).

Let be an infinite set. Then every relation with a first-order definition in is in .

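A function $f \colon D^k \to D$ is called essentially unary if there exists an $i \in \{1,\dots,k\}$ and a function $g \colon D \to D$ such that $f(x_1,\dots,x_k) = g(x_i)$ for all $x_1,\dots,x_k \in D$. Otherwise, $f$ is called essential.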
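For an operation on a finite domain, the definition of essential unarity can be tested by brute force. The following is a minimal sketch, assuming the operation is given as a Python callable; the function name `essentially_unary_witness` and the example operations on a three-element domain are hypothetical, and the paper itself is concerned with infinite domains.

```python
from itertools import product

def essentially_unary_witness(f, domain, arity):
    """Return a 0-based argument index i such that f depends only on
    argument i, or None if f is essential."""
    domain = list(domain)
    for i in range(arity):
        g = {}  # candidate unary function, built from the values of f
        consistent = True
        for args in product(domain, repeat=arity):
            value = f(*args)
            # setdefault stores the first value seen for args[i];
            # any later disagreement means f does not depend on argument i alone.
            if g.setdefault(args[i], value) != value:
                consistent = False
                break
        if consistent:
            return i
    return None

domain = range(3)
print(essentially_unary_witness(lambda x, y: (y + 1) % 3, domain, 2))
print(essentially_unary_witness(lambda x, y: (x + y) % 3, domain, 2))
print(essentially_unary_witness(lambda x, y: 0, domain, 2))
```

Note that constant operations count as essentially unary, since the unary function can itself be constant.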

Lemma 10 (Lemma 1.3.1 in Pöschel & Kalužnin [39]).

Let be an infinite set and let be an operation. If preserves , then is essentially unary.

Lemma 11 (Contraposition of Lemma 5.3 in [11]).

Let be a structure over an infinite domain . If the binary relations in are and , then is 1-independent of .

Lemma 12 (Lemma 42 in Bodirsky & Pinsker [17]).

Let $\Gamma$ be a countable $\omega$-categorical structure such that $\neq$ is in $\langle \Gamma \rangle$. Then the following are equivalent.

  1. $\neq$ is 1-independent of $\Gamma$ and

  2. $\Gamma$ has a binary injective polymorphism.

Theorem 11.

Let $\Gamma$ be an $\omega$-categorical structure over a countably infinite domain $D$ with a 2-transitive automorphism group. Also suppose that $\Gamma$ has an essential polymorphism and no constant endomorphism. Then $\Gamma$ has a binary injective polymorphism.

Proof.

If , then , since by Lemma 9. Lemma 10 implies that is preserved by essentially unary operations only and this contradicts the assumption that is preserved by at least one essential polymorphism.

If is binary, then , since the automorphism group of is 2-transitive. We continue by showing that has a primitive positive definition in . Assume otherwise; then by Theorem 4 there must be a polymorphism of which violates . Since consists of one orbit of pairs under , by Lemma 4 there is an endomorphism of which violates . This implies that is not injective. Since has a -transitive automorphism group, has only one orbit of -sets. Lemma 8 implies that has a constant endomorphism which contradicts our assumptions.

Now, Lemma 11 implies that is 1-independent of since . We can now apply Lemma 12 and conclude that has a binary injective polymorphism. ∎

Corollary 2.

Every reduct of such that and has a binary injective polymorphism.

Proof.

Since all endomorphisms of preserve , there is no constant endomorphism, and all endomorphisms also preserve . Since is violated by some polymorphism of by Theorem 4, it follows that must have an essential polymorphism. Reducts of have a 2-transitive automorphism group, and the statement follows from Theorem 11. ∎

4.3 Equality Constraint Satisfaction Problems

The CSPs for reducts of $(\mathbb{L};=)$ have been called equality constraint satisfaction problems [12], and the statement of Theorem 7 was already known in this special case.

Theorem 12 (Bodirsky & Kára [12]; see also Bodirsky [6]).

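Let $\Gamma$ be a reduct of $(\mathbb{L};=)$. Then $\mathrm{CSP}(\Gamma)$ is in P if $\Gamma$ is preserved by a constant operation or an injective binary operation. In both cases, $\Gamma$ has polymorphisms $f$, $e_1$, and $e_2$ such that $e_1(f(x,y)) = e_2(f(y,x))$ for all elements $x, y$ of $\Gamma$. Otherwise, all polymorphisms of $\Gamma$ are essentially unary, and $\mathrm{CSP}(\Gamma)$ is NP-complete.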

In the case that a reduct $\Gamma$ of $(\mathbb{L};=)$ is preserved by an injective binary operation, the relations of $\Gamma$ can be characterised syntactically. A Horn formula is a formula in conjunctive normal form where there is at most one positive literal per clause.

Lemma 13 (Bodirsky, Chen, & Pinsker [8]).

A relation $R$ with a first-order definition over $(\mathbb{L};=)$ is preserved by a binary injective polymorphism if and only if $R$ has a definition over $(\mathbb{L};=)$ which is quantifier-free Horn. In this particular case, each clause can contain at most one literal of the type $x = y$.
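The connection between Horn definitions and injective binary operations can be probed on small examples. The sketch below makes several assumptions that are not from the paper: a clause is encoded as a list of triples `(x, y, positive)`, standing for the literal x = y or x ≠ y; the injective operation is the Cantor pairing function on the natural numbers; and preservation is only tested on solutions with values from a small finite set, so a positive answer is a sanity check rather than a proof.

```python
from itertools import product

def is_horn(clauses):
    """At most one positive (equality) literal per clause."""
    return all(sum(1 for (_, _, positive) in clause if positive) <= 1
               for clause in clauses)

def holds(clauses, assignment):
    """Evaluate a conjunction of clauses of equality and disequality literals."""
    def literal(x, y, positive):
        return (assignment[x] == assignment[y]) == positive
    return all(any(literal(*lit) for lit in clause) for clause in clauses)

def pair(a, b):
    """Cantor pairing: an injective binary operation on the natural numbers."""
    return (a + b) * (a + b + 1) // 2 + b

def preserved_on_sample(clauses, variables, values):
    """Apply pair componentwise to all pairs of sampled solutions."""
    solutions = [dict(zip(variables, vals))
                 for vals in product(values, repeat=len(variables))]
    solutions = [s for s in solutions if holds(clauses, s)]
    return all(holds(clauses, {v: pair(s[v], t[v]) for v in variables})
               for s in solutions for t in solutions)

horn = [[("x", "y", False), ("u", "v", True)]]      # x != y  or  u = v
non_horn = [[("x", "y", True), ("u", "v", True)]]   # x = y   or  u = v

print(is_horn(horn), preserved_on_sample(horn, "xyuv", range(3)))
print(is_horn(non_horn), preserved_on_sample(non_horn, "xyuv", range(3)))
```

For the non-Horn clause, combining a solution with x = y and u ≠ v and a solution with x ≠ y and u = v yields a tuple that violates both literals, which is why the check fails there.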

5 Violating the Forbidden Triple Relation

In this section we assume that is a reduct of such that and . We will see in the following subsections that these assumptions have quite strong consequences on the relations in .

We begin in Section 5.1 by introducing the central concept of domination, which is used extensively in the rest of the section. We continue in Sections 5.2 to 5.4 by introducing the notions of affine splits, separation, and freeness. These properties will be the basis for the characterization of affine Horn formulas that we present in Section 6.

5.1 Dominance

In this part we introduce the notion of domination for functions $f \colon \mathbb{L}^2 \to \mathbb{L}$.

Definition 8.

Let . A function is called

  • dominated by the first argument on if for all and we have whenever ;

  • dominated by the second argument on if for all and , we have whenever .

When , we simply speak of domination by the first (or by the second) argument. Note that we extend the function f to tuples as described in the end of Section 3.2.

In this section, we will show that binary polymorphisms of