Faster and Simpler Minimal Conflicting Set Identification

This work is partly supported by the French MAPPI project (ANR-2010-COSI-004).

Aida Ouangraoua INRIA, Centre de recherche INRIA Haute-Borne, Bât. A, Park Plaza 40 avenue Halley, 59650 Villeneuve d’Ascq, France. aida.ouangraoua@inria.fr    Mathieu Raffinot CNRS/LIAFA, Université Paris Diderot - Paris 7, France, raffinot@liafa.jussieu.fr
Abstract

Let be a finite set of elements and a family of subsets of . A subset of verifies the Consecutive Ones Property (C1P) if there exists a permutation of such that each in is an interval of . A Minimal Conflicting Set (MCS) is a subset of that does not verify the C1P, but such that any of its proper subsets does. In this paper, we present a new simpler and faster algorithm to decide if a given element belongs to at least one MCS. Our algorithm runs in , largely improving the current fastest algorithm of [Blin et al, CSR 2011]. The new algorithm is based on an alternative approach considering minimal forbidden induced subgraphs of interval graphs instead of Tucker matrices.

1 Introduction

Let be a finite set of elements and a family of subsets of . Those sets can be seen as a 0-1 matrix , such that the set represents the columns of the matrix, and the set the rows of the matrix: each represents the set of columns where row has an entry 1.

A subset of verifies the consecutive ones property (C1P) if there exists a permutation of such that each in is an interval of . Testing the consecutive ones property is the core of many algorithms that have applications in a wide range of domains, from VLSI circuit conception through planar embeddings [8] to computational biology for the reconstruction of ancestral genomes [1, 2, 4, 5, 9]. We focus on this last field in this paper.
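As an illustration of the definition above, the following minimal Python sketch tests the C1P by brute force, trying every permutation of the columns. It is only meant for tiny examples and is not one of the algorithms discussed in this paper; the function names `is_interval` and `has_c1p` and the sample rows are hypothetical.

```python
from itertools import permutations


def is_interval(row, order):
    """True if the columns of `row` occupy consecutive positions in `order`."""
    positions = sorted(order.index(c) for c in row)
    return not positions or positions[-1] - positions[0] + 1 == len(positions)


def has_c1p(rows, columns):
    """Brute-force C1P test: some column permutation makes every row an interval."""
    return any(
        all(is_interval(row, order) for row in rows)
        for order in permutations(columns)
    )


columns = ["a", "b", "c", "d"]
# Rows forming a path: the order a, b, c, d makes each row an interval.
path_rows = [{"a", "b"}, {"b", "c"}, {"c", "d"}]
# Column "a" would need three distinct neighbours in the order: impossible.
claw_rows = [{"a", "b"}, {"a", "c"}, {"a", "d"}]

print(has_c1p(path_rows, columns))
print(has_c1p(claw_rows, columns))
```

The exhaustive search is factorial in the number of columns, so it only serves to make the definition concrete.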

On real biological matrices, the C1P is rarely verified, and only some subsets of rows might verify the desired property. However, the combinatorics of such sets is difficult to handle, and a strategy to deal with them has been proposed in [1, 5, 9]. It consists in identifying the rows belonging to minimal conflicting subsets of rows, that is, subsets that do not verify the C1P but all of whose proper row subsets do.

Definition 1

A set is a Minimal Conflicting Set (MCS) if does not verify the C1P, but such that , the set verifies the C1P.
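Definition 1 can be checked directly on small inputs. The sketch below is a hypothetical brute-force helper (`is_mcs`), not the method of this paper: it verifies that the whole set fails the C1P while every set obtained by removing a single row satisfies it. Removing one row at a time is enough because any column order that works for a set of rows also works for each of its subsets.

```python
from itertools import permutations


def has_c1p(rows, columns):
    """Brute-force C1P test over all column permutations (tiny inputs only)."""
    def is_interval(row, order):
        pos = sorted(order.index(c) for c in row)
        return not pos or pos[-1] - pos[0] + 1 == len(pos)

    return any(
        all(is_interval(row, order) for row in rows)
        for order in permutations(columns)
    )


def is_mcs(rows, columns):
    """True if `rows` fails the C1P but every set with one row removed has it."""
    if has_c1p(rows, columns):
        return False
    return all(
        has_c1p(rows[:i] + rows[i + 1:], columns)
        for i in range(len(rows))
    )


columns = ["a", "b", "c", "d"]
claw = [{"a", "b"}, {"a", "c"}, {"a", "d"}]

print(is_mcs(claw, columns))            # conflicting and minimal
print(is_mcs(claw + [{"b"}], columns))  # conflicting but not minimal
```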

However, it is not difficult to build examples of matrices such that the number of MCS is polynomial or even exponential in the number of rows.

Figure 1 shows such an example in which each subset of rows is a MCS. Thus, such a construction with rows gives MCS. Note that, on this example, a single row is included in MCS.

Figure 2-(a) shows another example where the number of MCS is exponential in the number of rows. Let be the number of nodes of external rows, which are and on the figure. The total number of rows is , the number of columns , and the number of MCS is since any induced chordless cycle in the row intersection graph of the matrix (Figure 2-(b)) constitutes a MCS.

From a computational point of view, the first question that arises is the following: is a given row included in at least one MCS? This question was raised in [1], recalled in [4, 5], and recently solved in polynomial time in [3]. That algorithm, currently the fastest, is based on the identification of minimal Tucker forbidden submatrices [10, 6].

In this paper we present a new, simpler time algorithm that decides whether a given row belongs to at least one MCS and, if so, exhibits one. Our algorithm is based on an alternative approach considering minimal forbidden induced subgraphs of interval graphs [7] instead of Tucker matrices. Moreover, our central paradigm consists in reducing the recognition of complex forbidden induced subgraphs to the detection of induced cycles in ad-hoc graphs, while in [3] only induced paths are considered. Our approach is faster and simpler, but a limitation shared by both approaches is that neither reports the number of MCS to which a given row belongs.

2 MCS and Forbidden induced subgraphs

The row-column intersection graph of a 0-1 matrix is a vertex-colored bipartite graph whose set of vertices is ; the vertices corresponding to rows (resp. columns) are black (resp. white); there exists an edge between two rows and if , and there exists an edge between a row and a column if .

It should be noted that a column vertex (white) is only connected to row vertices (black).

The neighborhood of a row is the set of rows intersecting , and . The span of a column is the set of rows containing , .
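The objects just defined can be computed directly from the rows seen as sets of columns. The following Python sketch is illustrative only: the names `neighborhood`, `span`, and `row_column_graph` are hypothetical, rows are identified by their index, and the neighborhood is assumed here to exclude the row itself (the formula is not recoverable from the text).

```python
def neighborhood(rows, i):
    """Indices of the rows intersecting row i (row i itself excluded by assumption)."""
    return {j for j, other in enumerate(rows) if j != i and rows[i] & other}


def span(rows, column):
    """Indices of the rows that contain `column`."""
    return {i for i, row in enumerate(rows) if column in row}


def row_column_graph(rows):
    """Edge set of the row-column intersection graph.

    Black vertices are ("row", i), white vertices are ("col", c).
    """
    edges = set()
    for i, row in enumerate(rows):
        for c in row:
            edges.add((("row", i), ("col", c)))      # row contains column
        for j in range(i + 1, len(rows)):
            if row & rows[j]:
                edges.add((("row", i), ("row", j)))  # rows intersect
    return edges


rows = [{"a", "b"}, {"b", "c"}, {"d"}]
print(neighborhood(rows, 0))
print(span(rows, "b"))
print(len(row_column_graph(rows)))
```

No edge is ever created between two white vertices, in line with the remark that a column vertex is only connected to row vertices.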

Theorem 1 ([7], Theorem 4)

A 0-1 matrix verifies the C1P if and only if its row-column intersection graph does not contain a forbidden induced subgraph of the form I, II, III, IV, or V (Figure 3).

Property 1

From Theorem 1, a set is a MCS if the row-column intersection graph contains a subgraph of the form I, II, III, IV, or V; and for any , does not contain a subgraph of the form I, II, III, IV, or V.

Given a MCS , a forbidden induced subgraph contained in is said to be responsible for the MCS . If this forbidden induced subgraph is of the form I (resp. II; III; IV; V), we simply say that is a MCS of the form I (resp. II; III; IV; V).

Definition 2

A row of a MCS that intersects all other rows of is called a kernel of . In a forbidden induced subgraph responsible for , any kernel of constitutes a black vertex that is connected to all other black vertices.

Property 2

Note that an induced subgraph of the form II, III, IV, or V necessarily contains at least one kernel, while an induced subgraph of the form I contains no kernel.

We denote by , the subgraph of induced by the set of rows , thus containing only black vertices.

Graph sizes. has vertices and at most edges, while has vertices and at most edges.

3 A global algorithm

Our algorithm to decide if a row of a 0-1 matrix belongs to at least one MCS is based on a sequence of algorithms for finding a forbidden subgraph of responsible for a MCS containing . It looks for forbidden subgraphs of the forms I to V in the following order: 1. MCS of type I, 2. MCS of size (types IV or V), 3. MCS of type II, 4. MCS of type III, 5. MCS of type IV and size larger than or equal to , and MCS of type V and size larger than or equal to . See Figure 4 for an overview. Steps 1 to 4 are based on straightforward brute-force algorithms, while the last two steps rely on a reduction to the detection of induced chordless cycles in ad-hoc graphs.

In the following, we simply write as and as .

3.1 Step 1: Forbidden induced subgraph I

We first test if belongs to a MCS of the form I. If it is true, then belongs to an induced chordless cycle of of length at least containing only black vertices. Such a cycle exists in if and only if is also a chordless cycle in since is the subgraph of induced by the set of rows . Thus it suffices to search for an induced chordless cycle in .

Proposition 1

Algorithm Check_I is correct and runs in worst case time.

Proof. The correctness of Algorithm Check_I comes from the fact that, is contained in a MCS of the form I if and only if belongs to an induced chordless cycle of of length at least whose set of vertices constitutes the MCS (Figure 4.I). A of is an induced chordless path of containing vertices. In this case, Algorithm Check_I returns such a set of vertices since an induced chordless cycle of of length at least containing is a containing whose extremities are linked by a chordless path in the subgraph of that does not contain the neighborhood of the internal vertices of the . This set cannot contain a smaller subset of rows that is a MCS, as no subset of can be a MCS of the form I, or a MCS of any other form because of Property 2.

Algorithm Check_I might be implemented in . The test performed on a given containing (lines 2-5 of the algorithm) can be achieved in as follows: removing the neighborhood of its internal vertices might be done in time, and finding a chordless path between the two extremities might be performed using Dijkstra’s algorithm in time. Enumerating all containing might be done in time using a BFS from stopping at depth . Overall, the whole algorithm runs in time.
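The idea behind Step 1 can be sketched in Python as follows. This is a simplified variant written under stated assumptions, not the authors' Check_I: it assumes the cycle must have length at least 4, it always takes the query row as the middle vertex of a three-vertex induced path, and it uses a plain BFS instead of Dijkstra's algorithm since the graph is unweighted. The function names and the adjacency-dictionary representation are hypothetical.

```python
from collections import deque


def shortest_path(adj, source, target, allowed):
    """BFS shortest path from source to target visiting only `allowed` vertices."""
    parent = {source: None}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        if u == target:
            path = []
            while u is not None:
                path.append(u)
                u = parent[u]
            return path[::-1]
        for v in adj[u]:
            if v in allowed and v not in parent:
                parent[v] = u
                queue.append(v)
    return None


def chordless_cycle_through(adj, r):
    """Return a chordless cycle of length >= 4 containing r, or None."""
    neighbors = sorted(adj[r])
    for i, a in enumerate(neighbors):
        for c in neighbors[i + 1:]:
            if c in adj[a]:
                continue  # a, r, c would form a triangle, not an induced path
            # Forbid r and all its other neighbours, so that r touches the
            # connecting path only at its two extremities a and c.
            allowed = set(adj) - {r} - (adj[r] - {a, c})
            path = shortest_path(adj, a, c, allowed)
            if path is not None:
                return [r] + path  # a shortest path has no chord
    return None


# A 5-cycle 0-1-2-3-4 with a pendant vertex 5 attached to 0.
cycle5 = {0: {1, 4, 5}, 1: {0, 2}, 2: {1, 3}, 3: {2, 4}, 4: {0, 3}, 5: {0}}
triangle = {0: {1, 2}, 1: {0, 2}, 2: {0, 1}}
print(chordless_cycle_through(cycle5, 0))
print(chordless_cycle_through(cycle5, 5))
print(chordless_cycle_through(triangle, 0))
```

The returned cycle is chordless because the connecting path is a shortest path in a graph from which every other neighbor of the query vertex has been removed.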

Precomputation. In the following steps, we assume that the following precomputations have been achieved:

• For any triplet of rows that are pairwise intersecting, i.e., each couple is an edge in , and are precomputed;

• Two rows and are overlapping if and and . The overlapping relation between any couple of rows is precomputed;

• For any quadruplet of rows such that , and overlap , is precomputed.

All those precomputations can simply be performed in time using straightforward algorithms, that is, scanning the columns of the input matrix for each triplet or quadruplet of rows.
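As an example of such a precomputation, the overlapping relation can be tabulated for every pair of rows. The sketch below assumes the usual definition of overlapping (the two rows intersect and neither contains the other), since the exact formula is not recoverable from the text above; `overlap` and `overlap_table` are hypothetical names.

```python
from itertools import combinations


def overlap(r1, r2):
    """Assumed definition: the rows intersect and neither contains the other."""
    return bool(r1 & r2) and bool(r1 - r2) and bool(r2 - r1)


def overlap_table(rows):
    """Precompute the overlapping relation for every pair of row indices i < j."""
    return {
        (i, j): overlap(rows[i], rows[j])
        for i, j in combinations(range(len(rows)), 2)
    }


rows = [{"a", "b"}, {"b", "c"}, {"b"}, {"d"}]
table = overlap_table(rows)
print(table[(0, 1)])  # rows share "b" and each has a private column
print(table[(0, 2)])  # row 2 is contained in row 0
print(table[(0, 3)])  # rows are disjoint
```

Once the table is built, each later overlap test reduces to a dictionary lookup.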

3.2 Step 2: Forbidden induced subgraph responsible for a MCS of size 3

We test here if r belongs to a MCS of size 3. A MCS of size 3 is necessarily caused by a forbidden induced subgraph of the form IV or V. As a consequence, the following property is immediate.

Property 3

A MCS of size 3 is always composed of rows that are pairwise overlapping.

Proposition 2

Algorithm Check_IV_V_3 is correct and runs in time.

Proof. The correctness of Algorithm Check_IV_V_3 comes from the fact that, is contained in a MCS of size if and only if this MCS is caused by a forbidden induced subgraph of the form IV or V (Property 3). Thus, should belong to a triplet of rows that are pairwise overlapping, and satisfy the conditions given in:

• either, line 2 of the algorithm to produce a forbidden induced subgraph of the form IV (left-end graph in Figure 4.IV_V_3),

• or, line 5 of the algorithm to produce a forbidden induced subgraph of the form V (right-end graph in Figure 4.IV_V_3).

In both cases, Algorithm Check_IV_V_3 returns the set as a MCS if such a set of rows exists. This set cannot contain a smaller subset of rows that is a MCS as is the minimum size of any MCS.

Algorithm Check_IV_V_3 runs in time since, given , there might be couples on which the tests performed (lines 2-8 of the algorithm) might be achieved in , thanks to the precomputations that have been done.

3.3 Step 3: Forbidden induced subgraph II

We test here if belongs to a MCS of the form II, with the assumption that is not contained in any MCS of size . Note that such a MCS is of size .

Proposition 3

Algorithm Check_II_4 is correct and runs in time.

Proof. The correctness of Algorithm Check_II_4 comes from the fact that, if belongs to a MCS of the form II, then should belong to a quadruplet of rows such that one of these rows is a kernel, and the three other rows do not intersect each other. Thus, the row is:

• either, a kernel of the MCS, tested in lines 1-5 of the algorithm (left-end graph in Figure 4.II_4),

• or, not a kernel of the MCS tested in lines 6-10 of the algorithm (right-end graph in Figure 4.II_4).

In both cases, Algorithm Check_II_4 returns the set as a MCS if such a set of rows exists. This set cannot contain a smaller subset of rows that is a MCS as this subset would be a subset of rows that cannot satisfy Property 3.

Algorithm Check_II_4 runs in time since all the tests performed on a given triplet in lines 2-4 and 7-9 of the algorithm can be achieved in , and given there might be such triplets.

3.4 Step 4: Forbidden induced subgraph III

We test here if belongs to a MCS of the form III, with the assumption that is not contained in a MCS of size . Note that such a MCS is of size .

Proposition 4

Algorithm Check_III_4 is correct and runs in time.

Proof. The correctness of Algorithm Check_III_4 comes from the fact that, belongs to a MCS of the form III if and only if should belong to a quadruplet of rows included in an induced subgraph of the form III such that two of these rows are kernels of the subgraph, and one of these kernels contains a column of the induced subgraph that is not shared with any of the other rows. Let us call this kernel kernel_1, and the other kernel kernel_2. For example in the left-end graph in Figure 4.III_4, kernel_1=, and kernel_2=.

Thus, the row is:

• either, kernel_1, tested in lines 1-5 of the algorithm (left-end graph in Figure 4.III_4),

• or, not a kernel, tested in lines 6-10 of the algorithm (middle graph in Figure 4.III_4).

• or, kernel_2, tested in lines 11-15 of the algorithm (right-end graph in Figure 4.III_4).

In the first and third cases, the set cannot be a MCS because such a set cannot satisfy Property 3. In all cases, Algorithm Check_III_4 returns the set as a MCS if such a set of rows exists, and is not a MCS (in the second case). Since we made the assumption that is not contained in a MCS of size , there cannot exist a smaller subset of containing that is a MCS.

Algorithm Check_III_4 runs in time by an argument similar to the complexity proof for Check_II_4: all the tests performed by the algorithm (lines 2-4, 7-9, and 12-14 of the algorithm) on a given triplet are achieved in thanks to the precomputations, and given there might be such triplets.

3.5 Step 5: Forbidden induced subgraph IV

We test here if belongs to a MCS of the form IV, with the assumption that is contained neither in a MCS of size nor in a MCS of type I. Depending on whether the size of the MCS is or larger than , we describe two algorithms.

3.5.1 MCS of size 4

We first test if belongs to a MCS of the form IV of size . We look for a triplet of rows such that the set is a MCS of the form IV (Figure 4.IV_4). In an induced subgraph of the form IV containing rows , two rows are kernels, and in that case, is either a kernel of the MCS, or not. If is a kernel, then it is either a kernel containing a column of the induced subgraph that is not shared with any of the other rows (called kernel_1), or not (called kernel_2). For example, in the left-end graph in Figure 4.IV_4, the two kernels are the two central black vertices of the graph: the top one is a kernel_1, and the bottom one a kernel_2. Algorithm Check_IV_4 looks for each of these configurations:

• is a kernel_1, tested in lines 1-5 of the algorithm;

• is not a kernel, tested in lines 6-10 of the algorithm;

• is a kernel_2, tested in lines 11-15 of the algorithm.

The proof of the correctness of Algorithm Check_IV_4 is similar to the proof for Algorithm Check_III_4.

Proposition 5

Algorithm Check_IV_4 is correct and runs in time.

Proof. The proof for Algorithm Check_IV_4 is similar to the proof for Algorithm Check_III_4.

3.5.2 MCS of size larger than 4

We test here if r belongs to a MCS of the form IV of size larger than 4. A MCS of the form IV of size larger than 4 contains one and only one kernel. Depending on whether r is the kernel or not, we distinguish two cases here.

Case 1: If row r is the kernel of the MCS

Algorithm Check_IV recovers a MCS of the form IV of size larger than containing as a kernel, with the assumption that is not contained in a MCS of size (Figure 4.IV). The principle of the algorithm relies in first choosing the column , of the forbidden induced subgraph of type IV responsible for , that is contained in , and in no other row of the MCS (see Figure 4.IV). Next, it considers the subgraph of induced by the set of black vertices (rows) that are neighbors of , but do not contain the column . We denote this subgraph by . Then, it looks for a set of rows , constituting a chordless path in , such that is a MCS of the form IV.

Proposition 6

Algorithm is correct and runs in time.

Proof. Note that, if the MCS exists, then all the rows belonging to the MCS, except , belong to the same connected component of . Thus, in each connected component of , the algorithm looks for a chordless path linking two vertices satisfying 1) and are not connected, and 2) overlap , and 3) does not contain any smaller subpath satisfying conditions 1) and 2). These conditions are necessary and sufficient for the set to form the rows of an induced subgraph of the form . The set cannot contain a subset that is a MCS as such a smaller MCS should be:

• either a MCS of size including , which is impossible by assumption,

• or a MCS of type II or III necessarily including as kernel,

• or a MCS of type IV and size larger than having as kernel.

The last two cases are also impossible, since would not satisfy condition 3) in these cases.

Next, there might be columns and up to couples of black vertices to test before finding a valid couple satisfying the conditions in line 4 of the algorithm. Up to this point, the complexity is in . Assume now that such a couple exists. Then finding a chordless path between and might be done by searching for a shortest path between and in the connected component using Dijkstra’s algorithm, which thus requires at worst time. The path is of length at most , and thus identifying and is bounded by testing each pair on this path in , which requires at worst time. Thus, in total, the algorithm runs in worst case time.
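The graph construction at the start of Case 1 can be sketched as follows: once the kernel row and its private column are fixed, the candidate rows are the neighbors of the kernel that do not contain that column, and the search then proceeds inside each connected component. This Python fragment is only a sketch of that first stage, with hypothetical names (`candidate_rows`, `components`) and rows identified by their index; the path search and the overlap conditions of the proof are not reproduced.

```python
def candidate_rows(rows, r, c):
    """Indices of rows that intersect row r but do not contain column c."""
    return {
        i for i, row in enumerate(rows)
        if i != r and rows[r] & row and c not in row
    }


def components(rows, vertices):
    """Connected components of the row intersection graph induced by `vertices`."""
    remaining = set(vertices)
    comps = []
    while remaining:
        start = remaining.pop()
        comp, stack = {start}, [start]
        while stack:
            u = stack.pop()
            # Unvisited candidate rows that intersect row u.
            linked = {v for v in remaining if rows[u] & rows[v]}
            remaining -= linked
            comp |= linked
            stack.extend(linked)
        comps.append(comp)
    return comps


rows = [
    {"x", "a", "b", "c", "d"},  # row 0: assumed kernel, "x" its chosen column
    {"a", "b"},
    {"b", "c"},
    {"d"},
    {"x", "a"},                 # contains "x", hence excluded
]
cands = candidate_rows(rows, 0, "x")
print(sorted(cands))
print(sorted(sorted(comp) for comp in components(rows, cands)))
```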

Case 2: If row r is not the kernel of the MCS

Algorithm Check_IV recovers a MCS of the form IV of size larger than containing , but not as a kernel, with the assumptions that is not contained in a MCS of size , and does not belong to an induced chordless cycle of (Figure 4.IV). The principle of the algorithm consists in first choosing the kernel of among the black vertices (rows) neighbors of , and the column , of the induced subgraph of type IV responsible for , that is contained in , but in no other row of the MCS (see Figure 4.IV). Next, the algorithm calls Algorithm Check_IV to look for the MCS with , , , and given as parameters.

Algorithm Check_IV is called in Algorithm Check_IV. It recovers a MCS of the form IV of size larger than containing , given the row , the kernel of the MCS , and the column , of the induced subgraph of type IV responsible for , that is contained in , but in no other row of the MCS (Figure 4.IV).

Proposition 7

Algorithm is correct, and runs in time.

Proof.

The correctness and the complexity of follow directly from the correctness and the complexity of Algorithm Check_IV that is called in Algorithm .

The correctness of Check_IV comes from the fact that does not belong to any chordless cycle in the graph computed at line 2 of the algorithm by assumption. Then at line 6 of the algorithm, any chordless cycle in the graph containing vertex necessarily contains at least one edge belonging to the set . The number of edges belonging to the set in such a chordless cycle cannot be greater than as any couple of such edges in the chordless cycle would induce a chord. Indeed, if contains more than one edge belonging to , any two such edges would have two extremities in , one from each of the two edges, that are not connected in the graph . These extremities would thus be linked by an edge in , creating a chord for the cycle in the graph .

Therefore, the set of vertices of the chordless cycle induces a chordless path in such that each vertex of is connected to vertex by definition of the graph , and the extremities and of satisfy 1) and are not connected in , and 2) overlap , and 3) does not contain any smaller subpath satisfying conditions 1) and 2). These conditions are necessary and sufficient for the set to form the rows of an induced subgraph of the form , and this set cannot contain a smaller MCS since such a MCS would be:

• either a MCS of size including ,

• or a MCS of type II or III necessarily including as kernel,

• or a MCS of type IV and size larger than having as kernel.

The 3 cases are impossible, since they would induce a chord from the set in the chordless cycle induced by in the graph .

Algorithm Check_IV calls Algorithm Check_I. Both algorithms have the same time complexity in time. It follows immediately that Algorithm runs in time.

3.6 Step 6: Forbidden induced subgraph V

We test here if belongs to a MCS of the form V, with the assumption that is contained neither in a MCS of size , nor in a MCS of type I. Depending on whether the size of the MCS is , or larger than , we describe three algorithms.

3.6.1 MCS of size 4 or 5

We first test if belongs to a MCS of the form V of size or . For a MCS of size 4, we look for a triplet of rows such that the set is a MCS of the form V. In such a case, we look for an induced subgraph responsible for the MCS, containing as four black vertices pairwise connected, and we can pick three different couples of such that each couple shares a column (white vertex) that is not shared with the two other rows of the MCS (see Figure 4.V_4).

Proposition 8

Algorithm Check_V_4 is correct and runs in time.

Proof. Algorithm Check_V_4 looks for an induced subgraph with black vertices , that are pairwise connected to each other. These black vertices should be such that there exist three different couples of vertices among them, such that two couples are disjoint and the third one (called couple_kernel) overlaps the first two, and the rows of each of these couples share a column that is not shared with the two other rows of the set. In this case, if is not a MCS, then the subgraph induced by and the columns (white vertices) connected to the couples of rows is of the form V, and is responsible for a MCS . Algorithm Check_V_4 looks for two cases, depending on whether belongs to couple_kernel (lines 3-5), or not (lines 6-8).

Next, all the tests performed by Algorithm Check_V_4 (lines 2-9 of the algorithm) on a given triplet are achieved in thanks to the precomputations, and given there might be such triplets. Thus, Algorithm Check_V_4 runs in time.

Next, for a MCS of size 5, we look for a quadruplet of rows such that the set is a MCS of the form V (Figure 4.V_5). Algorithm Check_V_5 looks for an induced subgraph of the form V, consisting of rows (black vertices) that are pairwise connected, except for one missing edge, say in , and three columns (white vertices) satisfying the configuration of Figure 4.V_5.

Proposition 9

Algorithm Check_V_5 is correct and runs in time.

Proof. Algorithm Check_V_5 looks for an induced subgraph with black vertices , that are pairwise connected, except for one missing edge in . The black vertices that belong to the set with , should correspond to a set of rows that is C1P. Moreover, there should exist two particular rows (black vertices) of the set, with three columns (white vertices) that satisfy the conditions on line 4 of the algorithm in order to fit the configuration depicted in Figure 4.V_5.

Next, all the tests performed by Algorithm Check_V_5 (lines 2-8 of the algorithm) on a given quadruplet are achieved in thanks to the precomputations, and given there might be such quadruplets. Thus, Algorithm Check_V_5 runs in time.

3.6.2 MCS of size larger than 5

A MCS of the form V of size larger than 5 contains exactly two kernels. Depending on whether r is a kernel or not, we distinguish two cases.

Case 1: If row r is a kernel of the MCS

Algorithm Check_V recovers a MCS of the form V of size larger than containing as a kernel, with the assumption that is not contained in a MCS of size , or (Figure 4.V). The principle of the algorithm is similar to Algorithm Check_IV. It relies on first choosing the second kernel of the MCS, and the column , of the induced subgraph of type V responsible for , that is contained in both and , but in no other row of the MCS (see Figure 4.V). Next, it considers the subgraph of induced by the set of black vertices (rows) that are neighbors of and , but do not contain . We denote this subgraph by . Then, it looks for a set of rows , constituting a chordless path in , such that is a MCS of the form V.

Proposition 10

Algorithm is correct and runs in time.

Proof. The proofs are similar to the proofs for the correctness and the complexity of Algorithm as the two algorithms are based on the same principle. However, here the complexity is multiplied by a factor due to considering all black vertices .

Case 2: If row r is not a kernel of the MCS

Algorithm Check_V recovers a MCS S of the form V of size larger than containing r, but not as a kernel, with the assumptions that is not contained in a MCS of size or , and r does not belong to an induced chordless cycle of (Figure 4.V).

The principle of the algorithm is similar to the principle of Algorithm Check_IV. It consists in first choosing the two kernels of S among the black vertices (rows) neighbors of , and the column , of the induced subgraph responsible for S, that is contained in both and , but in no other row of the MCS. Next, the algorithm calls Algorithm Check_V to look for the MCS S with , , , and given as parameters.

Algorithm Check_V is called in Algorithm Check_V. It recovers a MCS of the form V of size larger than 5 containing , given the row , the kernels and of the MCS, and the column , of the induced subgraph responsible for , that is contained in and , but in no other row of the MCS.

Proposition 11

Algorithm is correct and runs in time.

Proof. In order to prove the correctness and the complexity of Algorithm , we need to prove the correctness and give the complexity of Algorithm Check_V that is called in .

The correctness of Check_V comes from the fact that does not belong to any chordless cycle in the graph computed at line 2 of the algorithm by assumption. Let be a chordless cycle in the graph containing vertex , computed at line 9 of the algorithm. Since does not belong to an induced chordless cycle of the by assumption, then necessarily contains at least one edge belonging to the set .

We first give two trivial but useful properties for the remainder of the proof:

• For any two edges of , there always exist two extremities and of these edges, one in each edge, that are not disjoint in the graph , i.e.,

• , and .

We also prove the following useful property:

• and . Let there exists such that and Then, either in which case , or which implies that . The proof is similar for

We now prove that the cycle necessarily contains at most one edge of the set . Indeed, if contains two edges of