1 Introduction

INSTITUT NATIONAL DE RECHERCHE EN INFORMATIQUE ET EN AUTOMATIQUE

Solving Maximum Clique Problem for Protein Structure Similarity [1cm] Noël Malod-Dognin — Rumen Andonov — Nicola Yanev N° 6818 Janvier 2009

Solving Maximum Clique Problem for Protein Structure Similarity

Noël Malod-Dogninthanks: INRIA Rennes - Bretagne Atlantique, and University of Rennes 1, France. , Rumen Andonov00footnotemark: 0 , Nicola Yanevthanks: Faculty of Mathematics and Informatics, University of Sofia.thanks: Institute of Mathematics and Informatics, Bulgarian Academy of Sciences.

Thème BIO — Systèmes biologiques

Équipes-Projets Symbiose

Rapport de recherche n° 6818 — Janvier 2009 — ?? pages

Abstract: A basic assumption of molecular biology is that proteins sharing close three-dimensional (3D) structures are likely to share a common function and in most cases derive from a same ancestor. Computing the similarity between two protein structures is therefore a crucial task and has been extensively investigated. Evaluating the similarity of two proteins can be done by finding an optimal one-to-one matching between their components, which is equivalent to identifying a maximum weighted clique in a specific “alignment graph”.

In this paper we present a new integer programming formulation for solving such clique problems. The model has been implemented using the ILOG CPLEX Callable Library. In addition, we designed a dedicated branch and bound algorithm for solving the maximum cardinality clique problem. Both approaches have been integrated in VAST (Vector Alignment Search Tool) - a software for aligning protein 3D structures largely used in NCBI (National Center for Biotechnology Information). The original VAST clique solver uses the well known Bron and Kerbosh algorithm (BK). Our computational results on real life protein alignment instances show that our branch and bound algorithm is up to 116 times faster than BK for the largest proteins.

Key-words: Branch and bound, integer programming, maximum clique problems, protein structure alignment.

Problème de clique maximum

pour la similarité des structures de protéines

Résumé : Une hypothèse fondamentale en biologie moléculaire est que des protéines partageant des structures 3D similaires ont de fortes chances de partager une fonction commune, et dans la plupart des cas dérivent d’un ancêtre commun. Évaluer la similarité entre deux structures de protéines est donc une tâche essentielle qui a été largement étudiée. Comparer deux protéines revient à chercher un mariage stable optimal de leurs composants, ce qui est équivalent à rechercher une clique de poids maximum dans un “graphe d’alignement” spécifique.

Dans ce rapport, nous présentons un nouveau modèle de programmation linéaire en nombres entiers pour résoudre ces problèmes de cliques. Ce modèle a été implémenté en utilisant CPLEX 10. Nous avons également conçu un algorithme de branch and bound dédié pour résoudre le problème de clique de cardinalité maximum. Ces deux approches ont été intégrées dans VAST (Vector Alignement Search Tool) - un logiciel pour l’alignement des structures 3D de protéines largement utilisé au NCBI (National Center for Biotechnology Information). Pour résoudre ses problèmes de cliques, VAST utilise l’algorithme de Bron et Kerbosh (BK). Les résultats expérimentaux sur des instances réelles d’alignements de protéines montrent que notre algorithme de branch and bound est jusqu’à 116 fois plus rapide que BK pour les plus grandes protéines.

Mots-clés : Branch and bound, programmation linéaire en nombres entiers, clique maximum, alignement de structures de protéines.

1 Introduction

A protein is an ordered sequence of amino acids. Under specific physiological conditions, the linear arrangement of amino acids will fold and adopt a complex 3D shape. This 3D shape contains some highly regular sub-structures called secondary structures elements (SSEs), such as helices and strands.

A fruitful assumption of molecular biology is that proteins sharing close three-dimensional (3D) structures are likely to share a common function and in most cases derive from a same ancestor. Computing the similarity between two protein structures is therefore a crucial task and has been extensively investigated [1001, xu-gang, halperin, strickland05]. Evaluating the similarity of two proteins and is usually done by evaluating the best - according to some criterion - one-to-one matching between their components (also called alignment).

Among the various protein structure alignment methods, we are interested in VAST [gibrat96] (Vector Alignment Search Tool) - a software for aligning protein 3D structures largely used in NCBI (National Center for Biotechnology Information) - where the problem of aligning two proteins is converted into finding a maximum clique in a specific graph.

In VAST, two proteins and are represented by their ordered sets of SSEs ( and respectively), and a structural alignment of and is an order-preserving one-to-one matching between the SSEs of and the ones of . Such matchings are represented here by the so called alignment graph . In our approach, the vertex set is depicted by a grid, where each row corresponds to a SSE of , while each column corresponds to a SSE of . A vertex , situated on the intersection of row with column exists (), iff the SSEs and are “compatible” (i.e. are of the same type and have similar lengths). An edge between two vertices and exists (), iff the pair is “compatible” (i.e. and (for order preserving) and if the SSEs couple and from can be well superimposed in 3D-space to the SSEs couple and from )111See [gibrat96] for the exact definition of superimposing SSEs couples in 3D-space.. To each vertex we can associate a weight , and to each edge we can associate a weight . Finding a suitable secondary structure alignment is then equivalent to discovering a clique in the grid graph .

Looking for cliques in this kind of graphs arises in different situations, where matching in bipartite graphs preserving the order of vertices is sought. Such an example is another protein structure alignment method called Contact Map Overlap Maximization (CMO) [strickland05].

Various clique related problems can be formulated in such grid graph. The Maximum Cardinality Clique problem MCC consists in finding in a clique of maximum cardinality, denoted by MCC(). MCC is one of the first problem shown to be NP-Complete [karp72]. The Maximum Vertex Weighted clique problem MVW consists in finding a clique with a maximum sum of vertex weights. If the vertex weights are all equal to 1, then MVW is equivalent to MCC. Since MCC is a special case of MVW, then MVW is also NP-Complete. The Maximum Edge Weighted clique problem MEW consists in finding a clique with a maximum sum of edge weights. Again, if the weights are all equal to 1, then MEW is equivalent to MCC, so MEW is also NP-Complete. All these clique problems have been intensively investigated [bron_kerbosh73, abello99, pardalos99, regin03, busygin06]. Moreover, these three problems are all special cases of the Maximum Weighted Clique problem MWC, which consists in finding the clique having the maximum sum of vertex and edge weights. If the vertex weights are equal to zero, then MWC is equivalent to MEW, if edge weights are all equal to zero, then MWC is equivalent to MVW. If the vertex weights are all equal to 1 and the edge weights are all equal to zero, then MWC is equivalent to MCC.

In this article, we present a new mathematical programming model for solving the most general clique problem MWC. The model has been implemented using ILOG CPLEX 10 Callable Library, and validated on the so called Skolnick set (widely used in protein structure comparison articles [1001, cmos_07, VNS]). In addition, we also design a dedicated branch and bound algorithm (B&B) for MCC. The B&B solver has been integrated in VAST and compared to the original VAST clique solver BK (the Bron and Kerbosh algorithm [bron_kerbosh73]) on large real life protein structures. The obtained results show that our B&B algorithm is up to 116 times faster than BK for the largest proteins.

2 Mathematical programming model for MWC

The use of mathematical programming is not new in the field of maximum cliques [pardalos92, balas]. However, by using the properties of our graphs, we designed a new linear mathematical programming model for solving the maximum weighted clique problem (and thus solving MCC, MVW and MEW) on a grid , where the weights associated to the vertices and edges are all in .

To each vertex (in row and column ), we associate a binary variable such that :

In the same way, to each edge , we associate a binary variable such that :

The goal is to find a clique which maximizes the sum of its vertex weights and the sum of its edge weights. This leads to the objective function :

(1)

The one-to-one matching implies that in each row , at most one vertex can be chosen (only one can be set to 1).

(2)

The same holds for the columns.

(3)

These special order set constraints (maximum one activated vertex per row and per column) lead to compact formulations of the relations between vertices and edges.

Denote by the set of columns with indices greater than and such that there exists at least one edge outgoing from the vertex and heading to a vertex in column . In a similar way we introduce the notations , and . More precisely, is the set of columns with indices smaller than and such that there exists at least one edge heading to the vertex and coming from a vertex in column . is the set of rows with indices greater than and such that there exists at least one edge outgoing from and heading to a vertex in row . And finally, is the set of rows with indices smaller than and such that there exists at least one edge heading to and coming from a vertex in row .

Edges driven activations of vertices can be formulated with the following compact inequalities :

(4)
(5)
(6)
(7)

Vertices driven activations of edges can be formulated with the following compact inequalities :

(8)
(9)

Remarks:

Our model is designed to perform on alignment grids, and thus takes advantages of characteristics that grid alike graphs do not to share with randomly generated graphs. The properties “one-to-one matching” and “order preserving” lead to the creation of special order sets. In the mathematical model this is illustrated by constraints (2) and (3). Moreover, this leads to the implication “ implies and ”. As a consequence, if is in a clique , all vertices such that and , or and , cannot be in the clique .

3 Branch and Bound approach for MCC

We present here a new branch and bound algorithm (B&B) for solving the MCC problem in the previously defined grid , where the vertices in are labeled by their coordinates , for the row number and for the column number.

Definition 1

A successor of a vertex is an element of the set s.t. and .

Definition 2

A predecessor of a vertex is an element of the set s.t. and .

We also use the following notations: denotes the subgraph of induced by the vertices in ;   denotes the subgraph of induced by the vertices in ;   denotes the subgraph of induced by the set of vertices .

3.1 Branch and Bound rules

3.1.1 Branching:

Each node of the B&B is characterized by a couple (, ) where is the clique under construction and is the set of candidate vertices to be added to . All B&B nodes can also access to , the best clique found so far during the exploration of the B&B tree. At the root of the B&B tree, these arguments are initially set to , and .

For a given B&B node , the vertices in will be visited according to their lexicographic increasing order (row first). We call a function returning the vertex in having the smallest lexicographic index. Denote a direct descendant of by . The first descendant is created by a recursive call with arguments and . This corresponds to exploring deeper the tree staying on the same branch and is realized by the step 7 of algorithm 1. Visiting the next direct descendent of is done by a recursive call with arguments and . This corresponds to exploring a neighboring branch of the B&B tree (a wider move) and is realized by the step 8 of algorithm 1.

3.1.2 Fathoming:

A leaf in the B&B tree is interesting only if it contains a clique with a cardinality strictly greater than the cardinality of . Being given a B&B node (, ), the current best clique , and a candidate vertex , denote by the maximum cardinality clique in containing the vertex . If , then we do not miss the solution by discarding from (removing a vertex from fathoms all the leafs of the current branch leading to a clique containing ). In our approach we do not compute but we use some upper bounds (see section 3.2).

Denote by the best clique that is found by branching on the vertex , and by the maximum cardinality clique in containing . It is easily seen that . Any vertex such that leads to non-interesting leafs, and thus, can be removed from . Again, we are going to use some upper bounds of .

3.1.3 Main algorithm:

Denote by a procedure which removes from all vertices such that : (i) , or : (ii) , according to some upper bounds of and .

Algorithm 1 gives the global procedure MCC_BB for solving .

1:  if  then
2:      # Recording the new best clique.
3:  end if
4:  REMOVE()
5:  if  then
6:     
7:     MCC_BB() # exploring deeper the branch
8:     MCC_BB() # visiting another candidate (a wider move)
9:  end if
10:  return
Algorithm 1 MCC_BB() # initially called with , and

3.2 Maximum cardinality estimator

The efficiency of the MCC_BB algorithm greatly depends on the ability of the procedure REMOVE to fathom non-interesting vertices from . In order to do this, we need to tightly estimate and . These bounds are based on what we will call feasible paths in a grid.

Definition 3

A feasible path in a grid is an ordered sequence “, , , ” of vertices , such that , and , .

Denote by the longest (in terms of vertices) feasible path in . Note that computing can be done by dynamic programming in time.

3.2.1 Estimation of :

For any vertex , we denote by the longest feasible path in containing , such that for any vertex in the feasible path, is connected to (i.e. or ). As illustrated in figure 1, , and . It is easily seen that :

(10)

Thus, any vertex such that can be safely removed from . Note that computing all can be done once-for-all in a preprocessing step in O() time.

Figure 1: A graphical view of . In this example, the longest feasible path in contains 8 vertices, the longest feasible path in contains 3 vertices, so the longest feasible path in such that any vertex is connected to contains 12 vertices (including ).

3.2.2 Estimation of :

It is obvious that , so if we can safely fathom all vertices from .

In the same way as we estimated , it is easily seen that :

(11)

Any vertex such that can be safely removed from . In the subgraph , computing all can be done in O() (see figure 2 for a graphical view of ).

Figure 2: A graphical view of . In this example, by fixing the clique , we created the subgraph of the candidate vertices to be added to . In this subgraph, for each candidate vertex , is found by computing the longest feasible paths in and in .

Using the abovementioned estimators, we obtain the fathoming procedure REMOVE described in algorithm 2.

1:  for  do
2:     if  then
3:        .
4:     end if
5:  end for
6:  if   then
7:     
8:  end if
9:  for  do
10:     if  then
11:        .
12:     end if
13:  end for
14:  return
Algorithm 2 REMOVE()

4 Results

All results presented here were obtained on the same desktop PC with one Intel Pentium CPU at 3Ghz and 2GB of RAM. The mathematical programming based solver (MIP), was implemented in C using Ilog Cplex 10.0 Callable Library. The B&B solver, was also implemented in C. This two clique solvers were compared to the original VAST solver which is based on Bron and Kerbosh algorithm (BK) [bron_kerbosh73]. All algorithms were used to solve maximum cardinality clique problems. Note that in fact BK computes and evaluates all maximal cliques in a graph, and hence, can be used to solve any kind of clique problems.

The comparison of the above algorithms was performed on real-life proteins. We used two different benchmarks which significantly differ by the size of the instances (number of SSEs). The first benchmark is the Skolnick set which was recently largely used in protein structure comparison papers [1001, cmos_07, VNS]). This set contains 40 small protein chains (containing one domain), with a SSEs number varying from 5 to 20. The second set benchmark (S2) contains 36 long protein chains (containing from 4 to 6 domains), with a number of SSEs varying from 51 to 87. Note that for the skolnick set, we only considered instances leading to an alignement graph with at least 100 vertices.

The characteristics of the corresponding alignment grids are given in table 1. One peculiarity is their low density, less than 20% for the Skolnick set, and less than 6% for S2 set.

Number of vertices Number of edges Density
Set name Min Average Max Min Average Max Min Average Max
Skolnick 100 158.92 208 886 2368.69 3547 0.16 0.18 0.20
S2 1390 2384.97 5582 45278 144206.44 604793 0.03 0.05 0.06
Table 1: Characteristics of the grid graphs corresponding to the considered benchmarks.

Figure 3 compares the time needed by MIP to the one of BK on the Skolnick set. On the 170 instances containing more than 100 vertices, MIP is always slower than BK (3.35 times slower in average). This is not surprising, since dedicated solvers are expected to be faster than general purpose solvers (CPLEX in this case). However, this observation motivated us to go further in developing a fast special purpose clique solver.

MIP vs BK running time comparison on Skolnick set

Figure 3: For each instance the execution time of MIP is plotted on the x-axis, while the one of BK is depicted on the y-axis. All points are above the line (i.e. BK is always faster than MIP).

Figure 4 compares the time needed by B&B to the one of BK on set S2. Here we observed that B&B is about 15.57 times faster than BK in average, and on the biggest instances (where both proteins contain more than 80 SSEs), it is up to 116.7 times faster. Such big instances are solved by B&B in less than 79 seconds (25 sec. on average) while BK needs up to 2660 seconds (1521 sec. on average). These results are detailed in table 2.

B&B vs BK running time comparison on S2 set.

Figure 4: The execution time of B&B is presented on the x-axis, while the one of BK is on the y-axis (in log scale). The x=y line is also given, and any point above it is an instance for which B&B is faster than BK. In average, B&B is 15.57 times faster than BK. This superiority goes up to 116.7 times for the biggest instances.
Instance B&B BK
time (sec.) time (sec.)
(d1n6fA_,d1n6eA_) 86 83 5220 526586 78 22,78 2659,27
(d1n6eA_,d1n6fA_) 83 86 5220 527354 79 22,01 2380,55
(d1n6fA_,d1n6dA_) 86 85 5305 541073 75 32,03 2296,89
(d1n6dA_,d1n6fA_) 85 86 5305 548087 75 36,81 1978,24
(d1n6eK_,d1n6fD_) 87 87 5582 604793 81 22,22 1903,39
(d1n6dD_,d1n6fF_) 85 86 5304 540911 75 78,91 1419,21
(d1n6fB_,d1n6dD_) 85 85 5234 523072 75 47,82 1390,55
(d1k32F_,d1n6eG_) 85 84 5279 528476 76 40,08 1491,81
(d1n6dD_,d1n6fB_) 85 85 5234 532998 75 39,04 1344,8
(d1n6dC_,d1n6fD_) 85 87 5376 576604 76 38,27 1765,77
Table 2: Details of S2 benchmark. The density is 0.04 for each instance. The first five instances correspond to the biggest running times of BK versus B&B, while the second five instances present the biggest running times of B&B versus BK. Note that aligning with is not the same as aligning with , since the superimposition function (from [gibrat96]) that we use to define edges is not symmetrical.

5 Conclusion

We presented a new mathematical programming model for solving the maximum weighted clique problem arising in the context of protein structure comparison. This model was implemented and validated on a small benchmark. We also presented a new dedicated branch and bound algorithm for the maximum cardinality clique problem. The computationnal results show that on big instances, our branch and bound is significantly faster than the Bron and Kerbosh algorithm (up to 116 times for the largest proteins). In the near futur, we intend to study the behavior of the proposed algorithms on arbitrary graphs, conveniently transformed into grid graphs in a preprocessing step.

References

Comments 0
Request Comment
You are adding the first comment!
How to quickly get a good reply:
  • Give credit where it’s due by listing out the positive aspects of a paper before getting into which changes should be made.
  • Be specific in your critique, and provide supporting evidence with appropriate references to substantiate general statements.
  • Your comment should inspire ideas to flow and help the author improves the paper.

The better we are at sharing our knowledge with each other, the faster we move forward.
""
The feedback must be of minimum 40 characters and the title a minimum of 5 characters
   
Add comment
Cancel
Loading ...
70139
This is a comment super asjknd jkasnjk adsnkj
Upvote
Downvote
""
The feedback must be of minumum 40 characters
The feedback must be of minumum 40 characters
Submit
Cancel

You are asking your first question!
How to quickly get a good answer:
  • Keep your question short and to the point
  • Check for grammar or spelling errors.
  • Phrase it like a question
Test
Test description