The SCJ Small Parsimony Problem for Weighted Gene Adjacencies

The SCJ Small Parsimony Problem for Weighted Gene Adjacencies

Nina Luhmann, Manuel Lafond, Annelyse Thévenin, Aïda Ouangraoua, Roland Wittler and Cedric Chauve N. Luhmann and R. Wittler are with the International Research Training Group “Computational Methods for the Analysis of the Diversity and Dynamics of Genomes”, Bielefeld University, Germany. N. Luhmann, R. Wittler and A. Thévenin are with the Genome Informatics group, Faculty of Technology and Center for Biotechnology, Bielefeld University, Germany. M. Lafond is with the Department of Computer Science and Operational Research, Université de Montréal, Montréal, Canada. A. Ouangraoua is with the Department of Computer Science, Université de Sherbrooke, Sherbrooke, Canada. C. Chauve is with the Department of Mathematics, Simon Fraser University, Burnaby (BC), Canada.
Abstract

Reconstructing ancestral gene orders in a given phylogeny is a classical problem in comparative genomics. Most existing methods compare conserved features in extant genomes in the phylogeny to define potential ancestral gene adjacencies, and either try to reconstruct all ancestral genomes under a global evolutionary parsimony criterion, or, focusing on a single ancestral genome, use a scaffolding approach to select a subset of ancestral gene adjacencies, generally aiming at reducing the fragmentation of the reconstructed ancestral genome.

In this paper, we describe an exact algorithm for the Small Parsimony Problem that combines both approaches. We consider that gene adjacencies at internal nodes of the species phylogeny are weighted, and we introduce an objective function defined as a convex combination of these weights and the evolutionary cost under the Single-Cut-or-Join (SCJ) model. The weights of ancestral gene adjacencies can e. g. be obtained through the recent availability of ancient DNA sequencing data, which provide a direct hint at the genome structure of the considered ancestor, or through probabilistic analysis of gene adjacencies evolution. We show the NP-hardness of our problem variant and propose a Fixed-Parameter Tractable algorithm based on the Sankoff-Rousseau dynamic programming algorithm that also allows to sample co-optimal solutions. We apply our approach to mammalian and bacterial data providing different degrees of complexity. We show that including adjacency weights in the objective has a significant impact in reducing the fragmentation of the reconstructed ancestral gene orders.
An implementation is available at http://github.com/nluhmann/PhySca.

1 Introduction

Reconstructing ancestral gene orders is a long-standing computational biology problem with important applications, as shown in several recent large-scale projects [15, 28, 29]. Informally, the problem can be defined as follows: Given a phylogenetic tree representing the speciation history leading to a set of extant genomes, we want to reconstruct the structure of the ancestral genomes corresponding to the internal nodes of the tree.

Existing ancestral genome reconstruction methods concentrate on two main strategies. Local approaches consider the reconstruction of one specific ancestor at a time independently from the other ancestors of the tree. Usually, they do not consider an evolutionary model and proceed in two stages: (1) comparing gene orders of ingroup and outgroup species to define potential ancestral gene adjacencies, and (2) selecting a conflict-free subset of ancestral gene adjacencies – where a conflict is defined as an ancestral gene extremity belonging to more than two potential adjacencies, e. g. due to convergent evolution –, to obtain a set of Contiguous Ancestral Regions (CARs) [7, 13, 24]. The second stage of this approach is often defined as a combinatorial optimization problem aiming to minimize the number of discarded ancestral adjacencies, thus maximizing the number of selected adjacencies  [7, 24, 26]. This stage follows principles common in scaffolding methods used to obtain gene orders for extant genomes from sequencing data [10, 25]. This approach was recently used to scaffold an ancestral pathogen genome for which ancient DNA (aDNA) sequencing data could be obtained [30]. Global approaches on the other hand simultaneously reconstruct ancestral gene orders at all internal nodes of the considered phylogeny, generally based on a parsimony criterion within an evolutionary model. This so called Small Parsimony Problem has been studied with several underlying genome rearrangement models, such as the breakpoint distance or the Double-Cut-and-Join (DCJ) distance [2, 21, 37]. While rearrangement scenarios based on complex rearrangement models can give insights into underlying evolutionary mechanisms, from a computational point of view, the Small Parsimony Problem is NP-hard for most rearrangement distances [34]. One exception is the Single-Cut-or-Join (SCJ) distance, for which linear/circular ancestral gene orders can be found in polynomial time [17], however constraints required to ensure algorithmic tractability yield fragmented ancestral gene orders.

The two approaches outlined above optimize somewhat orthogonal criteria. For example, the underlying goal of the local approach is to maximize the agreement between the resulting ancestral gene order and the set of potential ancestral adjacencies, independently of the other ancestral gene orders. Would it be applied independently to all ancestral nodes, potential ancestral adjacencies exhibiting a mixed profile of presence/absence in the extant genomes might then lead to a set of non-parsimonious ancestral gene orders. The global approach aims only at minimizing the evolutionary cost in the phylogeny and can result in more fragmented ancestral gene orders. Nevertheless, there is little ground to claim that one approach or the other is more accurate or to be preferred, and the work we present is an attempt to reconcile both approaches.

We introduce a variant of the Small Parsimony Problem based on an optimality criterion that accounts for both an evolutionary distance and the difference between the initial set of potential ancestral adjacencies and the final consistent subset of adjacencies conserved at each ancestral node. More precisely we consider that each potential ancestral gene adjacency can be provided with a (prior) non-negative weight at every internal node. The contribution of the discarded adjacencies to the objective function is then the sum of their weights. These adjacency weights can e. g. be obtained as probabilities computed by sampling scenarios for each potential adjacency independently [12] or can be based on ancient DNA (aDNA) sequencing data providing direct prior information assigned to certain ancestral nodes. It follows that the phylogenetic framework we present can then also assist in scaffolding fragmented assemblies of aDNA sequencing data [22, 30].

We prove NP-hardness of the problem variant we introduce and describe an exact exponential time algorithm for reconstructing consistent ancestral genomes under this optimality criterion, based on a mixed Dynamic Programming / Integer Linear Programming approach. We show that this Small Parsimony Problem variant is Fixed-Parameter Tractable (FPT), with a parameter linked to the amount of conflict in the data. Moreover, this also allows us to provide an FPT sampling algorithm for co-optimal solutions, – a problem recently addressed in [27] using a MCMC approach. We evaluate our method on a simulated dataset and compare our results to several other methods reconstructing ancestral genomes. Further, we apply our method to two real data sets: mammalian genomes spanning roughly one million years of evolution, and bacterial genomes (pathogen Yersinia) spanning years of evolution and for which some aDNA sequencing data is available. We show that we can reduce the fragmentation of ancestral gene orders in both datasets by integrating adjacency weights while reconstructing robust ancestral genomes.

This paper is an extended version of the work previously presented in [23], particularly including new results on simulated datasets and a hardness proof of the defined problem.

2 Background and problem statement

2.1 Genomes and adjacencies

Genomes consist of chromosomes and plasmids. Each such component can be represented as a linear or circular sequence of oriented markers over a marker alphabet. Markers correspond to homologous sequences between genomes, e. g. genes or synteny blocks. We assume that each marker appears exactly once in each genome, so our model does not consider duplications or deletions. To account for its orientation, each marker  is encoded as a pair of marker extremities or .

An adjacency is an unordered pair of marker extremities, e. g. . The order of markers in a genome can be encoded by a set of adjacencies. Two distinct adjacencies are said to be conflicting if they share a common marker extremity. If a set of adjacencies contains conflicting adjacencies, it is not consistent with a mixed linear/circular genome model. We assume that the set of adjacencies for an extant assembled genome is consistent. The set of adjacencies for one genome naturally defines an adjacency graph, where nodes represent marker extremities and edges represent adjacencies. Conflicting adjacencies can be identified as branching nodes in this graph.

2.2 The Small Parsimony Problem and rearrangement distances

In a global phylogenetic approach, we are given a phylogenetic tree with extant genomes at its leaves and internal nodes representing ancestral genomes. We denote by the set of all adjacencies present in at least one extant genome and assume that every ancestral adjacency belongs to . Then the goal is to find a labeling of the internal nodes by consistent subsets of minimizing a chosen genomic distance over the tree. This is known as the Parsimonious Labeling Problem.

Definition 1 (Parsimonious Labeling Problem)

Let be a tree with each leaf labeled with a consistent set of adjacencies , and a distance between consistent sets of adjacencies. A labeling with for each leaf is parsimonious for if none of the internal nodes contains a conflict and it minimizes the sum of the distances along the branches of :

This problem is NP-hard for most rearrangement distances taken as evolutionary models. The only known exception is the set-theoretic Single-Cut-or-Join (SCJ) distance [17]. It defines a rearrangement distance by two operations: the cut and join of adjacencies. Given two genomes defined by consistent sets of adjacencies and , the SCJ distance between these genomes is

The Small Parsimony Problem under the SCJ model can be solved by computing a parsimonious gain/loss history for each adjacency separately with the dynamic programming Fitch algorithm [18, 19] in polynomial time. Consistent labelings can be ensured with the additional constraint that in case of ambiguity at the root of the tree, the absence of the adjacency is chosen [17]. As each adjacency is treated independently, this constraint might automatically exclude all adjacencies being part of a conflict to ensure consistency. This results in an unnecessarily sparse reconstruction in terms of reconstructed adjacencies and thus more fragmented genomes higher up in the tree.

2.3 Generalization by weighting adjacencies

When considering an internal node , we define node  as its parent node in . We assume that a specific adjacency graph is associated to each ancestral node , whose edges are annotated by a weight  representing a confidence measure for the presence of adjacency  in species . Then in a global reconstruction, cutting an adjacency of a higher weight has higher impact in terms of the optimization criterion than cutting an adjacency of lower weight.

Formally, we define two additional variables for each adjacency at each internal node : The presence (or absence) of at node is represented by , while indicates a change for the status of an adjacency along an edge , i.e., . We consider the problem of optimizing the following objective function, where is a convex combination factor.

Definition 2 (Weighted SCJ Labeling Problem)

Let be a tree with each leaf  labeled with a consistent set of adjacencies and each adjacency is assigned a given weight for each node . A labeling of the internal nodes of  with for each leaf is an optimal weighted SCJ labeling if none of the internal nodes contains a conflict and it minimizes the criterion

Further, we can state the corresponding co-optimal sampling problem. A sampling method is important to examine different co-optimal rearrangement scenarios that can explain evolution toward the structure of extant genomes.

Definition 3 (Weighted SCJ Sampling Problem)

Given the setting of the Weighted SCJ Labeling Problem, sample uniformly from all labelings  of the internal nodes of that are solutions to the Weighted SCJ Labeling Problem.

2.4 Problem complexity

Aside of the many heuristics for the Small Parsimony Problem for non-SCJ rearrangement models (see for example [36, 37, 21] for the DCJ distance), there exist a few positive results for the Weighted SCJ Labeling Problem with specific values of .

If , the objective function corresponds to the Small Parsimony Problem under the SCJ distance and hence a solution can be found in polynomial time [17]. A generalization towards multifurcating, edge-weighted trees including prior information on adjacencies at exactly one internal node of the tree is given in [22]. Recently, Miklós and Smith [27] proposed a Gibbs sampler for sampling optimal labelings under the SCJ model with equal branch lengths. It starts from an optimal labeling obtained as in [17], and then explores the space of co-optimal labelings through repeated constrained parsimonious modifications of a single adjacency evolutionary scenario. This method addresses the issue of the high fragmentation of internal node labelings, but convergence is not proven, and so there is no bound on the computation time.

If , i.e., we do not take evolution in terms of SCJ distance along the branches of the tree into account, we can solve the problem by applying independently a maximum-weight matching algorithm at each internal node [26]. So the extreme cases of the problem are tractable, and while we assume that the general problem is hard, we will now prove it for a small range of .

Theorem 1

The Weighted SCJ Labeling Problem is NP-hard for any .

We show the hardness by reduction from the Maximum Intersection Matching Problem, which is defined as follows. Let and be two graphs on the same vertex set. Find a perfect matching in and such that the number of edges common to both matchings is maximized. We prove NP-hardness of this problem by reduction from 3-Balanced-Max-2-SAT (see appendix for details).

Theorem 2

The Maximum Intersection Matching Problem is NP-complete.

The relation of the Weighted SCJ Labeling Problem and the Maximum Intersection Matching Problem can be sketched as follows. For a given instance of the Maximum Intersection Matching Problem, and , we construct a tree that contains the edges of both graphs as potential adjacencies. For , an optimal labeling of two internal nodes then corresponds to perfect matchings in and . Maximizing the number of common edges of the matching then minimizes the SCJ distance between the nodes. A detailed proof is given in the appendix.

3 Methods

In order to find a solution to the Weighted SCJ Labeling Problem, we first show that we can decompose the problem into smaller independent subproblems. Then, for each subproblem containing conflicting adjacencies, we show that, if it contains a moderate level of conflict, it can be solved using the Sankoff-Rousseau algorithm [32] with a complexity parameterized by the size of the subproblem. For a highly conflicting subproblem, we show that it can be solved by an Integer Linear Program (ILP).

3.1 Decomposition into independent subproblems

We first introduce a graph that encodes all adjacencies present in at least one internal node of the considered phylogeny (Def. 4 and Supplementary Fig. 7). As introduced previously, we consider a tree where each node is augmented with an adjacency graph.

Definition 4 (Global adjacency graph)

The set of vertices of the global adjacency graph consists of all marker extremities present in at least one of the adjacency graphs. There is an edge between two vertices that are not extremities of a same marker, if there is an internal node in the tree whose adjacency graph contains the adjacency . The edge is labeled with the list of all internal nodes that contain this adjacency.

Each connected component  of the global adjacency graph defines a subproblem composed of the species phylogeny, the set of marker extremities equal to the vertex set of , and the set of adjacencies equal to the edge set of . According to the following lemma, whose proof is straightforward, it is sufficient to solve each such subproblem independently.

Lemma 1

The set of all optimal solutions of the Weighted SCJ Labeling Problem is the set-theoretic Cartesian product of the sets of optimal solutions of the instances defined by the connected components of the global adjacency graph.

To solve the problem defined by a connected component  of the global adjacency graph containing conflicts, we rely on an adaptation of the Sankoff-Rousseau algorithm with exponential time complexity, parameterized by the size and nature of conflicts of , and thus can solve subproblems with moderate amount of conflict.

3.2 Overview of the Sankoff-Rousseau algorithm

The Sankoff-Rousseau dynamic programming algorithm [32] solves the general Small Parsimony Problem for discrete characters. Let be the set of all possible labels of a node in the phylogeny. Then for each node  in the tree, the cost of assigning a label to this node is defined recursively as follows:

where in our case is defined as in Definition 2. This equation defines a dynamic programming algorithm whose base case is when is a leaf in which case if and otherwise. Afterwards, choosing a label with the minimum cost at the root node and backtracking in a top-down traversal of the tree results in a most parsimonious labeling. We refer to [14] for a review on the Sankoff-Rousseau algorithm.

3.3 Application to the Weighted SCJ Labeling Problem

In order to use the Sankoff-Rousseau algorithm to solve the problem defined by a connected component  of the global adjacency graph, we define a label of an internal node of the phylogeny as the assignment of at most one adjacency to each marker extremity. More precisely, let be a marker extremity in , an internal node of , and be all edges in the global adjacency graph that are incident to and whose label contains (i.e., represent adjacencies in the adjacency graph of node ). We define the set of possible labels of as . The set of potential labels  of node is then the Cartesian product of the label sets for all resulting in a set of discrete labels for of size . Note that not all of these joint labelings are valid as they can assign an adjacency to but not to , or adjacency to and to thus creating a conflict (see Supplementary Fig. 8 for an example).

For an edge in the tree, we can then define a cost matrix that is indexed by pairs of labels of and , respectively. The cost is infinite if one of the labels is not valid, and defined by the objective function otherwise. We can then apply the Sankoff-Rousseau approach to find an optimal labeling of all internal nodes of the tree for connected component .

Note that, if is a connected component with no conflict, it is composed of two vertices and a single edge, and can be solved in space and time .

3.4 Complexity analysis

The time and space complexity of the algorithm is obviously exponential in the size of . Indeed, the time (resp. space) complexity of the Sankoff-Rousseau algorithm for an instance with a tree having leaves and possible labels for each node is (resp. [14]. In our algorithm, assuming leaves in (i.e. extant species), vertices in the global adjacency graph of and a maximum degree for vertices (marker extremities) in this graph, is an upper bound for the size of the label set  for a node . Moreover, computing the distance between two labels of and , where is an edge of , can trivially be done in time and space : If both labels are valid, it suffices to check how many common adjacencies are present in both labels, while deciding if a label is not valid can be done by a one-pass examination of the label. Combining this with the Sankoff-Rousseau complexity yields a time complexity in and a space complexity in .

Given a general instance, i.e., an instance not limited to a single connected component of the global adjacency graph, we can consider each connected component independently (Lemma 1). For a set of markers and connected components in the global adjacency graph defining a conflicting instance, we define as the maximum degree of a vertex and as the maximum number of vertices in all such components. Then, the complexity analysis above shows that the problem is Fixed-Parameter Tractable (FPT).

Theorem 3

The Weighted SCJ Labeling Problem can be solved in worst-case time and space .

In practice, the exponential complexity of our algorithm depends on the structure of the conflicting connected components of the global adjacency graph. The dynamic programming algorithm will be effective on instances with either small conflicting connected components or small degrees within such components, and will break down with a single component with a large number of vertices of high degree. For such components, the time complexity is provably high and we propose an ILP to solve them.

3.5 An Integer Linear Program

We can formulate the optimization problem as a simple ILP. We consider two variables for any adjacency  and node , and , defined as in Section 2.

The constraints ensure parsimony ( and ), consistency of the solution () and define the correct value for dependent on the value of along an edge (). This ILP has obviously a size that is polynomial in the size of the problem.

3.6 Sampling co-optimal labelings

The Sankoff-Rousseau DP algorithm can easily be modified to sample uniformly from the space of all optimal solutions to the Weighted SCJ labeling Problem in a forward-backward fashion. The principle is to proceed in two stages: first, for any pair we compute the number of optimal solutions under this label for the subtree rooted at . Then, when computing an optimal solution, if a DP equation has several optimal choices, one is randomly picked according to the distribution of optimal solutions induced by each choice (see Appendix for more details). This classical dynamic programming approach leads to the following result.

Theorem 4

The Weighted SCJ Sampling Problem can be solved in worst-case time and space .

For subproblems that are too large for being handled by the Sankoff-Rousseau algorithm, the SCJ Small Parsimony Gibbs sampler recently introduced [27] can easily be modified to incorporate prior weights, although there is currently no proven property regarding its convergence.

3.7 Weighting ancestral adjacencies

A first approach to assign weights to ancestral adjacencies consist in considering evolutionary scenarios for an adjacency independently of the other adjacencies. An evolutionary scenario for an adjacency is a labeling of the internal nodes of the species phylogeny by the presence or absence of the adjacency, and the parsimony score of a scenario is the number of gains/losses of the adjacency along the branch of , i. e. the SCJ score for this single adjacency. For a scenario , we denote by its parsimony score. Its Boltzmann score is then defined as , where is a given constant. If we denote the set of all possible evolutionary scenarios for the adjacency by , the partition function of the adjacency and its Boltzmann probability are defined as

The weight of the adjacency at internal node  is then the sum of the Boltzmann probabilities of all scenarios where the adjacency is present at node . All such quantities can be computed in polynomial time [12].

Parameter is useful to skew the Boltzmann probability distribution: If tends to zero, parsimonious scenarios are heavily favored and the Boltzmann probability distribution tends to the uniform distribution over optimal scenarios, while when tends to , the Boltzmann distribution tends toward the uniform distribution over the whole solution space. In our experiments, we chose a value of that favors parsimonious scenarios but considers also slightly suboptimal scenarios.

When aDNA sequence data is available for one or several ancestral genomes, markers identified in extant species can be related to assembled contigs of the ancestral genome, as in [30] for example. For an ancestral adjacency in a species for which aDNA reads are available, it is then possible to associate a sequence-based weight to the adjacency – either through gap filling methods (see Section 4, where we use the probabilistic model of GAML [11]), or scaffolding methods such as BESST [31] for example. In comparison to the weighting approach described above, these weights are then not directly based on the underlying phylogeny, but provide an external signal for the confidence of adjacencies at the respective internal node.

4 Results

We evaluated our algorithm on a simulated dataset and compared its sensitivity and precision to several other reconstruction methods. Further, we applied our method to two real datasets: mammalian and Yersinia genomes. The mammalian dataset was used in the studies [13] and [27]. It contains six mammalian species and two outgroups, spanning over million years of evolution, and five different marker sets of varying resolution (minimal marker length). Our experimental results consider issues related to the complexity of our algorithm, the use of a pure SCJ reconstruction (obtained when the parameter equals ) and the relative impact of the value of on both the total evolutionary cost and the ancestral gene orders fragmentation. Our second dataset contains eleven Yersinia genomes, an important human pathogen. This dataset contains contigs from the recently sequenced extinct agent of the Black Death pandemic [9] that occurred roughly years ago. We refer to Supplementary Fig. 12 and 12 for the species phylogenies of these two datasets.

4.1 Simulations

We created simulated datasets as described in [16]: with a birth-rate of and a death rate of , we simulated binary trees with leaves and scaled the branch lengths such that the tree has a diameter , where is the number of markers in each unichromosomal genome. The root genome with markers is then evolved along the branches of the tree by applying inversions and translocations with a probability of and respectively. The number of rearrangements at each branch corresponds to the simulated branch length, the total number of rearrangements ranges from to in the simulated trees. We compare results of our implementation PHYSCA for different values of with the tools RINGO [16], MGRA [4], Fitch-SCJ [8], ROCOCO [33, 35] (dense approach for signed adjacencies) and ANGES [20] (adjacencies only). We computed adjacency weights as described in subsection 3.7 with the software DeClone [12] and parameter .

The methods RINGO and MGRA are global approaches minimizing the DCJ-distance in the tree, while ANGES reconstructs specific ancestors locally in the tree and is applied for each node separately. For , our objective is finding a consistent, most parsimonious solution and equals the objectives of Fitch-SCJ and ROCOCO, where Fitch-SCJ always finds the most fragmented solution whereas ROCOCO and our method aim at reporting least fragmented reconstructions.

Fig. 1: Average precision and sensitivity (top), and and (bottom) of reconstructions on simulated datasets. Adjacency weights have been obtained with parameters (left) and (right).

We measured sensitivity and precision of the reconstructions based on the comparison of simulated and reconstructed adjacencies by the different methods. A high sensitivity indicates the ability to recover the true marker order of ancestors in the phylogeny, while a high precision denotes few wrongly reconstructed adjacencies. As shown in Figure 1, our method reaches a high precision of for all values of , while increasing the sensitivity in comparison to the pure Fitch-SCJ solution by reducing the fragmentation of the reconstructed scaffolds. For higher values of , the influence of the weighting becomes apparent: for , the precision only decreases for , while for , the precision decreases also for lower values of , however leading to more complete reconstructions. In comparison, both DCJ-based methods RINGO and MGRA produce less fragmented solutions by recovering more true adjacencies under the jeopardy of also reconstructing more false adjacencies. The sensitivity and precision of Fitch-SCJ, ROCOCO and ANGES are comparable to our method for low to medium values of .

The score assesses the relation of sensitivity and precision with equal importance. RINGO achieves a better score than all other methods. The score emphasizes the precision of a method over its sensitivity. With this measure, our method with and outperforms the other tools, while ROCOCO and ANGES also reach similarly good scores.

In general, it can be seen that the equal contribution of global evolution and local adjacency weights in the objective function provides a reliable reconstruction and further a useful tool to explore the solution space under different values of .

4.2 Mammalian dataset

We used the markers computed in [13] from whole-genome alignments. The extant species contain a diverse number of chromosomes ranging from 9 chromosomes in opossum to 39 chromosomes in pig. Unique and universal markers were computed as synteny blocks with different resolution in terms of minumum marker length. Note that all rearrangement breakpoints are therefore located outside of marker coordinates. It results in five different datasets varying from markers for a resolution of kb to markers for a resolution of kb.

We considered all adjacencies present in at least one extant genome as potentially ancestral. To weight an adjacency at all internal nodes of the tree, we relied on evolutionary scenarios for each single adjacency, in terms of gain/loss, independently of the other adjacencies (i. e. without considering consistency of ancestral marker orders). We obtain these weights using the software DeClone [12], and we refer to them as DeClone weights. We considered two values of the DeClone parameter , and , the former ensuring that only adjacencies appearing in at least one optimal adjacency scenario have a significant DeClone weight, while the latter samples adjacencies outside of optimal scenarios. For the analysis of the ancestral marker orders obtained with our algorithm, we considered the data set at  kb resolution and sampled ancestral marker orders for all ancestral species under different values of .

Complexity

The complexity of our algorithm is dependent on the size of the largest connected component of the global adjacency graph. In order to restrict the complexity, we kept only adjacencies whose weights are above a given threshold . Figure 2 shows the expected decrease in computational complexity correlated to threshold  for the five different minimal marker lengths. In most cases, all connected components are small enough to be handled by our exact algorithm in reasonable time except for very large components in the marker sets with higher resolution under a low threshold . For the  kb dataset with and , the computation of one solution takes on average 200 s on a 2.6 GHz i5 with 8 GB of RAM. It can be reduced to 30 s when DeClone weights are based on . This illustrates that our algorithm, despite an exponential worst-case time complexity, can process realistic datasets in practice.

Fig. 2: Number of different labels for the largest connected component in each of the mammalian datasets. This statistic provides an upper bound for the actual complexity of our reconstruction algorithm.
Fig. 3: Number of reconstructed CARs at each internal node in 500 samples for the mammalian dataset with  kb resolution, and .

Optimal SCJ labelings

Next, we analyzed the optimal SCJ labelings obtained for , i. e. aiming only at minimizing the SCJ distance, and considered the fragmentation of the ancestral gene orders (number of CARs) and the total evolutionary distance. Note that, unlike the Fitch algorithm used in [17], our algorithm does not favor fragmented assemblies by design but rather considers all optimal labelings. Sampling of co-optimal solutions shows that the pure SCJ criterion leads to some significant variation in terms of number of CARs (Figure 3). In contrast, Table I shows that most observed ancestral adjacencies are present in all sampled scenarios. About of adjacencies, mostly located at nodes higher up in the phylogeny, are only present in a fraction of all sampled scenarios, indicating that there is a small number of conflicts between potential adjacencies that can be solved ambiguously at the same parsimony cost.

The optimal SCJ distance in the tree for is , while the related DCJ distance in the sampled reconstructions varies between and (Figure 4). In comparison, we obtained a DCJ distance of with GASTS [36], a small parsimony solver directly aiming at minimizing the DCJ distance. More precisely, over all ancestral nodes, adjacencies found by GASTS do not belong to our predefined set of potential ancestral adjacencies and another appear in the samples with a frequency below . This illustrates both a lack of robustness of the pure SCJ optimal labelings, and some significant difference between the SCJ and DCJ distances.

Ancestor Frequency
Boreoeutheria 94.66 1.07 4.27
Euarchontoglires 95.42 0.88 3.79
Ferungulates 96.53 0.55 2.92
Primates 98.82 0.34 0.84
Rodentia 99.49 0.34 0.17
Theria 97.67 0.89 1.43
root node 92.23 1.23 6.53
TABLE I: Frequency of adjacencies in 500 samples with as percentage of optimal labelings they appear in.

Finally, we compared the Boltzmann probabilities of ancestral adjacencies (DeClone weights) with the frequency observed in the samples. There is a very strong agreement for DeClone weights obtained with as only ancestral adjacency have a DeClone weight that differs more than from the observed frequency in the samples. This shows that, despite the fact that the DeClone approach disregards the notion of conflict, it provides a good approximation of the optimal solutions of the SCJ Small Parsimony Problem.

Ancestral reconstruction with DeClone weights and varying values of .

For , our method minimizes a combination of the SCJ distance with the DeClone weights of the adjacencies discarded to ensure valid ancestral gene orders. Again, we sampled solutions each for different values of with the  kb data set. We distinguish between DeClone parameter and . Figures 4 and 5 show the respective observed results in terms of evolutionary distance and fragmentation.

Fig. 4: SCJ distance (upper half) and DCJ (lower half) distance in the whole tree for all samples and selected values of in the mammalian dataset.
Fig. 5: Number of CARs in the mammalian dataset in all samples at selected internal nodes for different values of reconstructed with DeClone weights under . While the number of CARs differs in the case of where the adjacency weights are not considered, the fragmentation stays constant for the other values of .

For , the optimal SCJ and DCJ distance over the whole tree hardly depends on . Including the DeClone weights in the objective actually results in the same solution, independent of . In fact, while applying a low weight threshold of , the set of potential adjacencies is already consistent at all internal nodes except for a few conflicts at the root that are solved unambiguously for all values of . This indicates that building DeClone weights on the basis of mostly optimal adjacency scenarios (low ) results in a weighting scheme that agrees with the evolution along the tree for this dataset. More importantly, Figures 4 and  5 show that the combination of DeClone weights followed by our algorithm, leads to a robust set of ancestral gene orders.

In comparison, for , we see an increase in SCJ and DCJ distance for higher , while the number of CARs at internal nodes decreases, together with a loss of the robustness of the sampled optimal results when gets close to . It can be explained by the observation that the weight distribution of ancestral adjacencies obtained with DeClone and is more balanced than with as it considers suboptimal scenarios of adjacencies with a higher probability. It further illustrates that, when the global evolutionary cost of a solution has less weight in the objective function, the algorithm favors the inclusion of an adjacency of moderate weight that joins two CARs while implying a moderate number of evolutionary events (for example an adjacency shared by only a subset of extant genomes). From that point of view, our algorithm – being efficient enough to be run on several values of – provides a useful tool to evaluate the relation between global evolution and prior confidence for adjacencies whose pattern of presence/absence in extant genomes is mixed.

4.3 Yersinia pestis dataset

We started from fully assembled DNA sequences of seven Yersinia pestis and four Yersinia pseudotuberculosis genomes. In addition, we included aDNA single-end reads and contigs of length bp assembled from these reads for the Black Death agent, considered as ancestral to several extant strains [9]. We refer to this augmented ancestral node as the Black Death (BD) node. The marker sequences for all extant genomes were computed as described in [30], restricting the set of markers to be unique and universal. We obtained a total of markers in all extant genomes and different extant adjacencies, thus showing a relatively low level of syntenic conflict compared to the number of markers, although it implies a highly dynamic rearrangement history over the short period of evolution [30].

As for the mammalian dataset, we considered as potentially ancestral any adjacency that appears in at least one extant genome. However for this dataset, reducing the complexity by applying a weight threshold  was not necessary. For the BD node, adjacency weights can be based on the given aDNA reads for a given potential ancestral adjacency as follows. First, we used FPSAC [30] to compute DNA sequences filling the gaps between any two adjacent marker extremities (obtained by aligning the gap sequences of the corresponding conserved extant adjacencies and reconstructing a consensus ancestral sequence using the Fitch algorithm). Then we computed the weights as a likelihood of this putative gap sequence given the aDNA reads, using the GAML probabilistic model described in [11]. Each adjacency together with its template gap sequence details a proposition for an assembly  as a piece of the real ancestral sequence, and given the aDNA read set , the model then defines a probability for observing the reads  given that is the correct assembly. The probability can be computed by aligning to the assembly  while the alignment is evaluated under an appropriate sequencing error model. We refer to [11] for details.

Ancestral reconstruction with aDNA weights

Again we sampled solutions for this dataset. We computed the weights at the BD node based on the aDNA data, while adjacencies at all other nodes were given weight . Hence we can investigate the influence of including the aDNA sequencing data in the reconstruction while for the rest of the tree, the weights do not impact the objective function. Moreover, this weighting scheme addresses the issue of potential BD adjacencies with a low weight due to the difficulty of sequencing ancient DNA.

Fig. 6: Reconstructed number of CARs in the yersinia dataset with aDNA weights at the BD node and 0 otherwise, for four ancestral nodes.

As shown in Figure 6, for selected internal nodes of the phylogeny, the pure SCJ solutions at result in the highest fragmentation, while the number of CARs decreases as we increase the importance of the adjacency weights in the objective of our method. For the BD node, when including the aDNA weights, the fragmentation is decreasing while the reconstructions for each are robust. At the other nodes, the applied sequencing weights also reduce the fragmentation except for node6 which is located in the pseudotuberculosis subtree and hence more distant to the BD node. This shows that the aDNA weights not only influence the reconstructed adjacencies at the BD node, but also other nodes of the tree.

5 Conclusion

Our main contributions are the introduction of the Small Parsimony Problem under the SCJ model with adjacency weights, together with an exact parameterized algorithm for the optimization and sampling versions of the problem. The motivation for this problem is twofold: incorporating sequence signal from aDNA data when it is available, and recent works showing that the reconstruction of ancestral genomes through the independent analysis of adjacencies is an interesting approach [5, 12, 17, 27].

Regarding the latter motivation, we address a general issue of these approaches that either ancestral gene orders are not consistent or are quite fragmented if the methods are constrained to ensure consistency. The main idea we introduce is to take advantage of sampling approaches recently introduced in [12] to weight potential ancestral adjacencies and thus direct, through an appropriate objective function, the reconstruction of ancestral gene orders. Our results on the mammalian dataset suggest that this approach leads to a robust ancestral genome structure. However, we can observe a significant difference with a DCJ-based ancestral reconstruction, a phenomenon that deserves to be explored further. Our algorithm, which is based on the Sankoff-Rousseau algorithm similarly to several recent ancestral reconstruction algorithms [5, 12, 27], is a parameterized algorithm that can handle real instances containing a moderate level of syntenic conflict. Our experimental results on both the mammalian and bacterial datasets suggest that introducing prior weights on adjacencies in the objective function has a significant impact in reducing the fragmentation of ancestral gene orders, even with an objective function with balanced contributions of the SCJ evolution and adjacency weights. For highly conflicting instances, it can be discussed if a reconstruction through small parsimony is the right approach to solve these conflicts or if these should be addressed differently.

Our sampling algorithm improves on the Gibbs sampler introduced in [27] in terms of computational complexity and provides a useful tool to study ancestral genome reconstruction from a Bayesian perspective. Moreover, our algorithm is flexible regarding the potential ancestral gene adjacencies provided as input and could easily be associated with other ideas, such as intermediate genomes for example [16].

There are several research avenues opened by our work. From a theoretical point of view, we know the problem we introduced is tractable for and , and we show it is hard for , but it remains to see whether it is hard otherwise. Further, given that the considered objective is a combination of two objectives to be optimized simultaneously, Pareto optimization is an interesting aspect that should be considered. Our model could also be extended towards other syntenic characters than adjacencies, i. e. groups of more than two markers, following the ancient gene clusters reconstruction approach introduced in [33]. As ancestral gene orders are defined by consistent sets of adjacencies, the principle of our dynamic programming algorithm could be conserved and it would only be a matter of integrating gene clusters into the objective function. From a more applied point of view, one would like to incorporate duplicated and deleted markers into our Small Parsimony Problem. There exist efficient algorithms for the case of a single adjacency [5, 12] that can provide adjacency weights, and natural extensions of the SCJ model to incorporate duplicated genes. However it remains to effectively combine these ideas. Finally, again due to the flexibility and simplicity of the Sankoff-Rousseau dynamic programming algorithm, one could easily extend our method towards the inference of extant adjacencies if some extant genomes are provided in partially assembled form following the general approach described in  [1, 3].

This would pave the way towards a fully integrated phylogenetic scaffolding method that combines evolution and sequencing data for selected extant and ancestral genomes.

Acknowledgements

NL and RW are funded by the International DFG Research Training Group GRK 1906/1. CC is funded by NSERC grant RGPIN-249834.

References

  • [1] Aganezov Jr., S., Sitdykova, N., Alekseyev, M.: Scaffold assembly based on genome rearrangement analysis. Comput. Biol. and Chemistry 57, 46–53 (2015)
  • [2] Alekseyev, M., Pevzner, P.A.: Breakpoint graphs and ancestral genome reconstructions. Genome Res. 19, 943–957 (2009)
  • [3] Anselmetti, Y., Berry, V., pthers: Ancestral gene synteny reconstruction improves extant species scaffolding. BMC Genomics 16(Suppl 10), S11 (2015)
  • [4] Avdeyev, P., Jiang, S., Aganezov, S., Hu, F., Alekseyev, M.A.: Reconstruction of ancestral genomes in presence of gene gain and loss. Journal of Computational Biology 23(3), 150–164 (2016)
  • [5] Bérard, S., Gallien, C., et al.: Evolution of gene neighborhoods within reconciled phylogenies. Bioinformatics 28, 382–388 (2012)
  • [6] Berman, P., Karpinski, M.: On some tighter inapproximability results. In: International Colloquium on Automata, Languages, and Programming. pp. 200–209. Springer (1999)
  • [7] Bertrand, D., Gagnon, Y., et al.: Reconstruction of ancestral genome subject to whole genome duplication, speciation, rearrangement and loss. In: WABI 2010, LNCS, vol. 6293, pp. 78–89. Springer (2010)
  • [8] Biller, P., Feijão, P., Meidanis, J.: Rearrangement-based phylogeny using the single-cut-or-join operation. IEEE/ACM TCBB 10(1), 122–134 (2013)
  • [9] Bos, K.I., Schuenemann, V., et al.: A draft genome of yersinia pestis from victims of the black death. Nature 478, 506–510 (2011)
  • [10] Bosi, E., Donati, B., et al.: Medusa: a multi-draft based scaffolder. Bioinformatics 31(15), 2443–2451 (2015)
  • [11] Boža, V., Brejová, B., Vinař, T.: GAML: Genome Assembly by Maximum Likelihood. Algorithms Mol. Biol. 10,  18 (2015)
  • [12] Chauve, C., Ponty, Y., Zanetti, J.: Evolution of genes neighborhood within reconciled phylogenies: An ensemble approach. BMC Bioinformatics 16(Suppl. 19),  S6 (2015)
  • [13] Chauve, C., Tannier, E.: A methodological framework for the reconstruction of contiguous regions of ancestral genomes and its application to mammalian genomes. PLoS Comput. Biol. 4, e1000234 (2008)
  • [14] Csűrös, M.: How to infer ancestral genome features by parsimony: Dynamic programming over an evolutionary tree. In: Models and Algorithms for Genome Evolution, pp. 29–45. Springer (2013)
  • [15] Denoeud, F., Carretero-Paulet, L., et al.: The coffee genome provides insight into the convergent evolution of caffeine biosynthesis. Science 345, 125527 (2013)
  • [16] Feijǎo, P.: Reconstruction of ancestral gene orders using intermediate genomes. BMC Bioinformatics 16(Suppl. 14),  S3 (2015)
  • [17] Feijǎo, P., Meidanis, J.: SCJ: a breakpoint-like distance that simplifies several rearrangement problems. IEEE/ACM TCBB 8, 1318–1329 (2011)
  • [18] Fitch, W.: Toward defining the course of evolution: minimum change for a specific tree topology. Syst. Biol. 20, 406–416 (1971)
  • [19] Hartigan, J.A.: Minimum mutation fits to a given tree. Biometrics 29(1), 53–65 (1973)
  • [20] Jones, B.R., Rajaraman, A., Tannier, E., Chauve, C.: Anges: reconstructing ancestral genomes maps. Bioinformatics 28(18), 2388–2390 (2012)
  • [21] Kovác, J., Brejová, B., Vinar, T.: A practical algorithm for ancestral rearrangement reconstruction. In: WABI 2011, LNCS, vol. 6833, pp. 163–174. Springer (2011)
  • [22] Luhmann, N., Chauve, C., et al.: Scaffolding of ancient contigs and ancestral reconstruction in a phylogenetic framework. In: BSB 2014, LNCS, vol. 8826, pp. 135–143. Springer (2014)
  • [23] Luhmann, N., Thévenin, A., Ouangraoua, A., Wittler, R., Chauve, C.: The scj small parsimony problem for weighted gene adjacencies. In: International Symposium on Bioinformatics Research and Applications. pp. 200–210. Springer (2016)
  • [24] Ma, J., Zhang, L., et al.: Reconstructing contiguous regions of an ancestral genome. Genome Res. 16, 1557–1565 (2006)
  • [25] Mandric, I., Zelikovsky, A.: Scaffmatch: scaffolding algorithm based on maximum weight matching. Bioinformatics 31(16), 2632–2638 (2015)
  • [26] Maňuch, J., Patterson, M., et al.: Linearization of ancestral multichromosomal genomes. BMC Bioinformatics 13(Suppl 19), S11 (2012)
  • [27] Miklós, I., Smith, H.: Sampling and counting genome rearrangement scenarios. BMC Bioinformatics 16(Suppl 14),  S6 (2015)
  • [28] Ming, R., Van Buren, R., et al.: The pineapple genome and the evolution of CAM photosynthesis. Nature Genetics (2015), in press
  • [29] Neafsey, D.E., Waterhouse, R.M., et al.: Highly evolvable malaria vectors: The genome of 16 Anopheles mosquitoes. Science 347, 1258522 (2015)
  • [30] Rajaraman, A., Tannier, E., Chauve, C.: FPSAC: fast phylogenetic scaffolding of ancient contigs. Bioinformatics 29, 2987–2994 (2013)
  • [31] Sahlin, K., Vezzi, F., et al.: BESST - efficient scaffolding of large fragmented assemblies. BMC Bioinformatics 15, 281 (2014)
  • [32] Sankoff, D., Rousseau, P.: Locating the vertices of a steiner tree in an arbitrary metric space. Math. Programming 9, 240–246 (1975)
  • [33] Stoye, J., Wittler, R.: A unified approach for reconstructing ancient gene clusters. IEEE/ACM TCBB 6(3), 387–400 (2009)
  • [34] Tannier, E., Zheng, C., Sankoff, D.: Multichromosomal median and halving problems under different genomic distances. BMC Bioinformatics 10(1), 120 (2009)
  • [35] Wittler, R.: Phylogeny-based Analysis of Gene Clusters. Ph.D. Thesis, Faculty of Technology, Bielefeld University (2010)
  • [36] Xu, A., Moret, B.: GASTS: parsimony scoring under rearrangements. In: WABI 2011, LNCS, vol. 6833, pp. 351–363. Springer (2011)
  • [37] Zheng, C., Sankoff, D.: On the pathgroups approach to rapid small phylogeny. BMC Bioinformatics 12(Suppl 1),  S4 (2011)

Methods: supplementary material

Fig. 7: Simplified example of a connected component in the global adjacency graph. All internal nodes of the tree are augmented with adjacency graphs. At node , the graph contains adjacency , at node , the graph contains adjacency . Along the edge , one adjacency has been cut, the other has been joined. In the global adjacency graph, we see a connected component that contains two edges incident to . This component does not contain a conflict, just a rearrangement. See Fig. 8 below for a that contains a conflict.
Fig. 8: Example for a given connected component . At each node of the tree, possible assignments are defined by the connected component containing all edges annotated with . Possible assignments for a marker extremity, e. g. , are defined by the incident adjacency edges, hence in this example we can e. g. assign , or to . Here two valid joint labeling for are shown, while the third one assigns different adjacencies for and and is therefore invalid.

A DP sampling algorithm.

In the bottom-up traversal, in addition to the minimal cost induced by labeling a node  with a specific label , we can also store the number of optimal solutions under this label for the subtree rooted at . Let and be the children of , and and the sets of labels that induced the minimum value for at (which means a label out of these sets is assigned in the backtracking phase if is labeled with ). Then

gives the number of optimal solutions for the subproblem rooted at . At the root, we might have the choice between different labels with minimum cost. Let be the set of these labels, then the number of overall possible co-optimal solutions is simply the sum of the number of solutions for all optimal root labels: Subsequently in the top-down traversal, choose a label with probability . If at an internal node more than one label in a child node induced the minimum value, choose one of these labels analogously.

Results: supplementary material

Fig. 10: Underlying phylogeny of the considered genomes for the Yersinia dataset used in the evaluation.
Fig. 11: Yersinia dataset: Reconstructed number of CARs with DeClone weights at .
Fig. 9: Underlying phylogeny of the considered genomes for the mammalian dataset used in the evaluation.
Fig. 10: Underlying phylogeny of the considered genomes for the Yersinia dataset used in the evaluation.
Fig. 11: Yersinia dataset: Reconstructed number of CARs with DeClone weights at .
Fig. 12: Yersinia dataset: Reconstructed number of CARs with DeClone weights at .
Fig. 9: Underlying phylogeny of the considered genomes for the mammalian dataset used in the evaluation.

NP-Hardness for high values of

In this section we show that Weighted SCJ labeling is NP-hard for .

The Maximum Intersection Matching Problem

Maximum Intersection Matching Problem (MIMP)
Input: two graphs and with , and an integer .
Question: does there exist a perfect matching in and a perfect matching in such that ?

We now introduce the problem used for showing the hardness of MIMP. The 3-Balanced-Max-2-SAT problem is, given boolean variables and a set of clauses , each with exactly two literals (e.g. ), such that each variable appears in the clauses times positively and times negatively, to find an assignment to the variables such that a maximum of clauses are satisfied. Note that since each variable appears in clauses and each clause has variables. This problem is NP-hard by a reduction from Max-Cut on -regular graphs [6]. We reduce 3-Balanced-Max-2-SAT to MIMP, and we begin by describing how we transform a given 3-Balanced-Max-2-SAT instance into a MIMP instance. Figure 13 illustrates the main ideas behind the reduction.

Let be the set of variables and be the set of clauses of the given 3-Balanced-Max-2-SAT instance. In our construction the graph corresponds to the variables and to the clauses.

The graph is the disjoint union of cycles each of length , one for each variable . For each , the cycle corresponding to has vertices in cyclic order. Call the edges and positive, and the edges and negative. Note that has exactly two perfect matchings: one containing all the positive edges, and one containing all the negative edges. Since has such cycles, has perfect matchings, which are in bijection with the possible assignments. Also note that no two positive or negative edges share a vertex.

Some additional notations are needed before we can construct . Let be the clauses containing variable positively. Map each of the positive edges of to a distinct clause arbitrarily (e.g. we map to , to and to ). Similarly let be the clauses in which appears negatively, and map each negative edge of to one of these clauses in a distinct manner.

Now, has the same vertex set as , and is the union of cycles of length , one for each clause in . Let and be the two variables contained in a clause . Let be the edge of the cycle mapped to , and be the edge of the cycle mapped to . Then the cycle corresponding to is, in cyclic order, , where addition is to be taken modulo . Thus contains exactly two edges in common with : is from and is from . Importantly, has possible perfect matchings: one contains and the other has .

Fig. 13: Illustration of a clause gadget used for the MIMP reduction. The cycles and are the cycles of corresponding to variables and . The ‘+’ edges are positive, and the ‘-’ edges are negative., The cycle (dashed edges) corresponds to the clause ), where here the edges and are hypothetically mapped to . Other clause cycles are not shown.

Intuitively speaking, if an edge is in both and , say on the cycles of and of , then corresponds to the fact that appears in clause . If is positive, setting satisfies , and if is negative, setting satisfies .

We first prove some structural properties of and which will be of use later on. We denote by the complement of a graph .

Lemma 2

The graphs and satisfy the following conditions:

  1. are even;

  2. is the disjoint union of cycles of length at most ;

  3. is the disjoint union of cycles of length at most ;

  4. and have a perfect matching in common, i.e. there is a set such that forms a perfect matching in both and .

Proof:

Conditions 1) and 2) are immediate from our construction. For 3), we need to show that the cycles of are disjoint. We argue that each vertex of appears in exactly one cycle. First, each vertex gets included in at least one cycle : if is an endpoint of a positive or a negative edge , then is mapped to some clause , and so is on the cycle. Otherwise, is not on a positive or negative edge, and thus we have . Therefore (modulo ). In this case, is on a positive or negative edge, and one can check that is in the same cycle as . To see that no vertex gets included in two cycles, observe that . Thus has cycles of length spanning vertices, which is only possible if no two cycles of intersect. For condition 4), there are many choices, and we exhibit one. Let and . One can check that in both and , is not a neighbor of . Thus the set (addition modulo ) is a perfect matching of and . \qed

Observe that since is a disjoint union of cycles, it has perfect matchings since each cycle has possible perfect matchings. We are now ready to prove the hardness of MIMP.

Proof:

It is easy to see that MIMP is in NP, since checking that two given matchings are perfect and share edges in the two graphs can be done in polynomial time. Also, and can obviously be constructed in polynomial time. To prove NP-hardness, we claim that, using the construction described above, at least clauses of are satisfiable if and only if and admit a perfect matching with at least edges in common.

() Let be an assignment of satisfying clauses of . In , for each cycle, take the positive perfect matching of if is positive in , and the negative perfect matching otherwise. The result is a perfect matching of . To construct a perfect matching of , let be a clause satisfied by the value of a variable in . Then, take in the perfect matching of that contains the edge of mapped to . This edge is shared by both our choices in and . We apply this choice of perfect matching for each clause satisfied by (and pick the matching of arbitrarily for an unsatisfied clause ). Since clauses are satisfied, there must be common edges in and .

() Let and be the perfect matchings of and with common edges , . Observe that each edge of must be either positive or negative in . We claim that the following assignment satisfies at least clauses: set if has a positive edge in , set if has a negative edge in , and set arbitrarily if has no positive or negative edge in . No can be assigned two values since and is in a perfect matching of (and thus cannot contain both positive and negative edges from the same cycle ). Let be the clause cycles of containing an edge of . For each in , there can only be one edge of in (recall that each possible perfect matching of has exactly one edge in common with ). Hence since , there must be at least clause cycles in , and thus at least satisfied clauses. \qed

Hardness of Weighted-SCJ-labeling

Given an MIMP instance and that satisfies the conditions of Lemma 2, we construct a Weighted-SCJ-labeling instance consisting of a phylogeny , consistent extant adjacencies on the leaves of and weighted ancestral adjacencies on its internal nodes. Let . The Weighted-SCJ-labeling instance has markers, and the phylogeny is a tree with leaves and internal nodes and . The node is the parent of and , the node is the parent of and and the node is the parent of and . Let such that forms a perfect matching in both and . The marker extremities are and the marker head-tail pairs are given by (i.e. if , then and are the two extremities of a marker).

Since is a union of disjoint cycles of even length, its edges can be partitioned into two disjoint perfect matchings and . More precisely, let be a perfect matching of . By removing the set of edges from , we obtain another perfect matching . The extant adjacencies for the leaf are given by , and those of by . Similarly, can be split into two disjoint perfect matchings and . The extant adjacencies of leaf are given by and those of leaf by .

The set of adjacencies under consideration is . In (respectively ), all adjacencies corresponding to an edge of (respectively ) have weight . All other adjacencies, including all adjacencies in , have weight . Note that in this construction, any ancestral adjacency with non-zero weight can be found in some leaf of , which corresponds to the initialization in our method.

Also note that no adjacency connects two extremities from the same marker. Indeed, all adjacencies on , extant or ancestral, are taken from either or , whereas marker extremities are defined by , which is taken from the complement of and . Thus there is no marker/adjacency inconsistency.

We show that if , Weighted SCJ labeling is NP-hard. This is mainly because we are “forced” to assign a perfect matching to and , as we show in the next Lemma. We then need to choose these matchings so as to maximize their intersection, since this minimizes the SCJ distance on the and branches.

Lemma 3

Let . Let be any labeling for the instance constructed from and as above. Then there is a labeling such that the edges corresponding to (respectively ) form a perfect matching of (respectively ), and .

Proof:

Call a cycle of (resp. ) covered by if (resp. ) has exactly edges of , i.e. (resp. ) contains a perfect matching of . If every cycle of and every cycle of is covered by , then the Lemma holds. Otherwise, we show that if there is an uncovered cycle , it is advantageous to modify in order to make covered.

Thus let be a cycle of either or that is not covered by , and denote . We have . Let if is in , and if is in instead. Let be the edges of that correspond to an adjacency labeled by . Let be the number of vertices of that do not belong to an edge of (i.e. vertices of that do not belong to any adjacency of , or that belong to an adjacency that does not correspond to an edge of ). Since is not covered we assume that . Now, has edges. As is even, this implies as it must also be even. Obtain from by first removing all adjacencies of including a vertex of , then adding to the perfect matching of that has the most edges common to , observing that we retain at least edges of (and hence remove at most ). Then the number of adjacencies in but not in is at most . On the other hand at most adjacencies are assigned by but not by . As and have incident edges, the increase in the SCJ cost incurred by changing from to is at most . On the other hand, the cost of adjacency weights is decreased by , since in , edges of weight are removed, and edges of weight are added, and so new adjacencies of weight are gained. The total difference in cost from to is at most . One can check that for and , this total difference is no more than . Therefore, we can modify to cover at an equal or better cost. By applying this idea on every cycle of and , we get a labeling such that is a perfect matching of and is a perfect matching of . \qed

We can finally show the hardness of Weighted-SCJ-labeling.

Proof:

To prove NP-hardness of Weighted-SCJ-labeling, we use the construction described above, which can easily be seen to be feasible in polynomial time. We show that and admit a perfect matching with common edges if and only if as constructed above admits a labeling total cost at most if .

() Let and be perfect matchings of and , respectively, that share edge set , with . We label the internal node of with the adjacencies corresponding to , the internal node with the adjacencies corresponding to , and internal node with the adjacencies corresponding to . Now, each of the adjacencies in is found in either or (but not both). Thus . Conversely because half of is in . Thus the and branches, together, induce an SCJ cost of . Similarly, the and branches induce an SCJ cost of . There are adjacencies of not in and all adjacencies of are in , and so the branch has SCJ cost . Likewise, has SCJ cost . The total SCJ cost is .

As for the cost of adjacency weights, has adjacencies of weight , and of them are used by . The same holds for . Thus and , together, induce an ancestral adjacency cost of . The total weighted cost is as desired.

() Suppose that and is a “no” instance, i. e. no pair of perfect matching have edges in common. We show that any labeling has cost strictly greater than . As per Lemma 3, there is an optimal labeling for such that is a perfect matching of and is a perfect matching of . Choose such an optimal that labels with a minimum number of adjacencies. By the same argument as above, the branches and incur, together, an SCJ cost of . As for and , let be an adjacency in , and recall that has weight in . Then must be in one of or , since if is in neither of nor , then it should be removed as is suboptimal. If is in but not in , presence or absence of in has the same cost, and thus by our choice of . Thus . Since and form a “no” instance and and are labeled by a perfect matching, we must have , and therefore . The cost incurred by the and branches is . We get that the total SCJ cost incurred by is strictly greater than . Exactly of the putative ancestral adjacencies for and are taken by , and so the ancestral adjacency cost is . In total, the cost of the labeling is strictly higher than , and is also a “no” instance. \qed

The above analysis, in particular Lemma 3, can certainly be improved in order to yield better bounds for . Nevertheless, the question of whether Weighted SCJ labeling is NP-hard for every is wide open.

Comments 0
Request Comment
You are adding the first comment!
How to quickly get a good reply:
  • Give credit where it’s due by listing out the positive aspects of a paper before getting into which changes should be made.
  • Be specific in your critique, and provide supporting evidence with appropriate references to substantiate general statements.
  • Your comment should inspire ideas to flow and help the author improves the paper.

The better we are at sharing our knowledge with each other, the faster we move forward.
""
The feedback must be of minimum 40 characters and the title a minimum of 5 characters
   
Add comment
Cancel
Loading ...
112481
This is a comment super asjknd jkasnjk adsnkj
Upvote
Downvote
""
The feedback must be of minumum 40 characters
The feedback must be of minumum 40 characters
Submit
Cancel

You are asking your first question!
How to quickly get a good answer:
  • Keep your question short and to the point
  • Check for grammar or spelling errors.
  • Phrase it like a question
Test
Test description