INSTITUT NATIONAL DE RECHERCHE EN INFORMATIQUE ET EN AUTOMATIQUE
Solving Maximum Clique Problem for Protein Structure Similarity
Noël Malod-Dognin††thanks: INRIA Rennes - Bretagne Atlantique, and University of Rennes 1, France. , Rumen Andonov00footnotemark: 0 , Nicola Yanev††thanks: Faculty of Mathematics and Informatics, University of Sofia.††thanks: Institute of Mathematics and Informatics, Bulgarian Academy of Sciences.
Thème BIO — Systèmes biologiques
Rapport de recherche n° 6818 — Janvier 2009 — ?? pages
Abstract: A basic assumption of molecular biology is that proteins sharing close three-dimensional (3D) structures are likely to share a common function and in most cases derive from a same ancestor. Computing the similarity between two protein structures is therefore a crucial task and has been extensively investigated. Evaluating the similarity of two proteins can be done by finding an optimal one-to-one matching between their components, which is equivalent to identifying a maximum weighted clique in a specific “alignment graph”.
In this paper we present a new integer programming formulation for solving such clique problems. The model has been implemented using the ILOG CPLEX Callable Library. In addition, we designed a dedicated branch and bound algorithm for solving the maximum cardinality clique problem. Both approaches have been integrated in VAST (Vector Alignment Search Tool) - a software for aligning protein 3D structures largely used in NCBI (National Center for Biotechnology Information). The original VAST clique solver uses the well known Bron and Kerbosh algorithm (BK). Our computational results on real life protein alignment instances show that our branch and bound algorithm is up to 116 times faster than BK for the largest proteins.
Key-words: Branch and bound, integer programming, maximum clique problems, protein structure alignment.
Problème de clique maximum
pour la similarité des structures de protéines
Résumé : Une hypothèse fondamentale en biologie moléculaire est que des protéines partageant des structures 3D similaires ont de fortes chances de partager une fonction commune, et dans la plupart des cas dérivent d’un ancêtre commun. Évaluer la similarité entre deux structures de protéines est donc une tâche essentielle qui a été largement étudiée. Comparer deux protéines revient à chercher un mariage stable optimal de leurs composants, ce qui est équivalent à rechercher une clique de poids maximum dans un “graphe d’alignement” spécifique.
Dans ce rapport, nous présentons un nouveau modèle de programmation linéaire en nombres entiers pour résoudre ces problèmes de cliques. Ce modèle a été implémenté en utilisant CPLEX 10. Nous avons également conçu un algorithme de branch and bound dédié pour résoudre le problème de clique de cardinalité maximum. Ces deux approches ont été intégrées dans VAST (Vector Alignement Search Tool) - un logiciel pour l’alignement des structures 3D de protÃ©ines largement utilisé au NCBI (National Center for Biotechnology Information). Pour résoudre ses problèmes de cliques, VAST utilise l’algorithme de Bron et Kerbosh (BK). Les résultats expérimentaux sur des instances réelles d’alignements de protéines montrent que notre algorithme de branch and bound est jusqu’à 116 fois plus rapide que BK pour les plus grandes protéines.
Mots-clés : Branch and bound, programmation linéaire en nombres entiers, clique maximum, alignement de structures de protéines.
A protein is an ordered sequence of amino acids. Under specific physiological conditions, the linear arrangement of amino acids will fold and adopt a complex 3D shape. This 3D shape contains some highly regular sub-structures called secondary structures elements (SSEs), such as helices and strands.
A fruitful assumption of molecular biology is that proteins sharing close three-dimensional (3D) structures are likely to share a common function and in most cases derive from a same ancestor. Computing the similarity between two protein structures is therefore a crucial task and has been extensively investigated [1001, xu-gang, halperin, strickland05]. Evaluating the similarity of two proteins and is usually done by evaluating the best - according to some criterion - one-to-one matching between their components (also called alignment).
Among the various protein structure alignment methods, we are interested in VAST [gibrat96] (Vector Alignment Search Tool) - a software for aligning protein 3D structures largely used in NCBI (National Center for Biotechnology Information) - where the problem of aligning two proteins is converted into finding a maximum clique in a specific graph.
In VAST, two proteins and are represented by their ordered sets of SSEs ( and respectively), and a structural alignment of and is an order-preserving one-to-one matching between the SSEs of and the ones of . Such matchings are represented here by the so called alignment graph . In our approach, the vertex set is depicted by a grid, where each row corresponds to a SSE of , while each column corresponds to a SSE of . A vertex , situated on the intersection of row with column exists (), iff the SSEs and are “compatible” (i.e. are of the same type and have similar lengths). An edge between two vertices and exists (), iff the pair is “compatible” (i.e. and (for order preserving) and if the SSEs couple and from can be well superimposed in 3D-space to the SSEs couple and from )111See [gibrat96] for the exact definition of superimposing SSEs couples in 3D-space.. To each vertex we can associate a weight , and to each edge we can associate a weight . Finding a suitable secondary structure alignment is then equivalent to discovering a clique in the grid graph .
Looking for cliques in this kind of graphs arises in different situations, where matching in bipartite graphs preserving the order of vertices is sought. Such an example is another protein structure alignment method called Contact Map Overlap Maximization (CMO) [strickland05].
Various clique related problems can be formulated in such grid graph. The Maximum Cardinality Clique problem MCC consists in finding in a clique of maximum cardinality, denoted by MCC(). MCC is one of the first problem shown to be NP-Complete [karp72]. The Maximum Vertex Weighted clique problem MVW consists in finding a clique with a maximum sum of vertex weights. If the vertex weights are all equal to 1, then MVW is equivalent to MCC. Since MCC is a special case of MVW, then MVW is also NP-Complete. The Maximum Edge Weighted clique problem MEW consists in finding a clique with a maximum sum of edge weights. Again, if the weights are all equal to 1, then MEW is equivalent to MCC, so MEW is also NP-Complete. All these clique problems have been intensively investigated [bron_kerbosh73, abello99, pardalos99, regin03, busygin06]. Moreover, these three problems are all special cases of the Maximum Weighted Clique problem MWC, which consists in finding the clique having the maximum sum of vertex and edge weights. If the vertex weights are equal to zero, then MWC is equivalent to MEW, if edge weights are all equal to zero, then MWC is equivalent to MVW. If the vertex weights are all equal to 1 and the edge weights are all equal to zero, then MWC is equivalent to MCC.
In this article, we present a new mathematical programming model for solving the most general clique problem MWC. The model has been implemented using ILOG CPLEX 10 Callable Library, and validated on the so called Skolnick set (widely used in protein structure comparison articles [1001, cmos_07, VNS]). In addition, we also design a dedicated branch and bound algorithm (B&B) for MCC. The B&B solver has been integrated in VAST and compared to the original VAST clique solver BK (the Bron and Kerbosh algorithm [bron_kerbosh73]) on large real life protein structures. The obtained results show that our B&B algorithm is up to 116 times faster than BK for the largest proteins.
2 Mathematical programming model for MWC
The use of mathematical programming is not new in the field of maximum cliques [pardalos92, balas]. However, by using the properties of our graphs, we designed a new linear mathematical programming model for solving the maximum weighted clique problem (and thus solving MCC, MVW and MEW) on a grid , where the weights associated to the vertices and edges are all in .
To each vertex (in row and column ), we associate a binary variable such that :
In the same way, to each edge , we associate a binary variable such that :
The goal is to find a clique which maximizes the sum of its vertex weights and the sum of its edge weights. This leads to the objective function :
The one-to-one matching implies that in each row , at most one vertex can be chosen (only one can be set to 1).
The same holds for the columns.
These special order set constraints (maximum one activated vertex per row and per column) lead to compact formulations of the relations between vertices and edges.
Denote by the set of columns with indices greater than and such that there exists at least one edge outgoing from the vertex and heading to a vertex in column . In a similar way we introduce the notations , and . More precisely, is the set of columns with indices smaller than and such that there exists at least one edge heading to the vertex and coming from a vertex in column . is the set of rows with indices greater than and such that there exists at least one edge outgoing from and heading to a vertex in row . And finally, is the set of rows with indices smaller than and such that there exists at least one edge heading to and coming from a vertex in row .
Edges driven activations of vertices can be formulated with the following compact inequalities :
Vertices driven activations of edges can be formulated with the following compact inequalities :
Our model is designed to perform on alignment grids, and thus takes advantages of characteristics that grid alike graphs do not to share with randomly generated graphs. The properties “one-to-one matching” and “order preserving” lead to the creation of special order sets. In the mathematical model this is illustrated by constraints (2) and (3). Moreover, this leads to the implication “ implies and ”. As a consequence, if is in a clique , all vertices such that and , or and , cannot be in the clique .
3 Branch and Bound approach for MCC
We present here a new branch and bound algorithm (B&B) for solving the MCC problem in the previously defined grid , where the vertices in are labeled by their coordinates , for the row number and for the column number.
A successor of a vertex is an element of the set s.t. and .
A predecessor of a vertex is an element of the set s.t. and .
We also use the following notations: denotes the subgraph of induced by the vertices in ; denotes the subgraph of induced by the vertices in ; denotes the subgraph of induced by the set of vertices .
3.1 Branch and Bound rules
Each node of the B&B is characterized by a couple (, ) where is the clique under construction and is the set of candidate vertices to be added to . All B&B nodes can also access to , the best clique found so far during the exploration of the B&B tree. At the root of the B&B tree, these arguments are initially set to , and .
For a given B&B node , the vertices in will be visited according to their lexicographic increasing order (row first). We call a function returning the vertex in having the smallest lexicographic index. Denote a direct descendant of by . The first descendant is created by a recursive call with arguments and . This corresponds to exploring deeper the tree staying on the same branch and is realized by the step 7 of algorithm 1. Visiting the next direct descendent of is done by a recursive call with arguments and . This corresponds to exploring a neighboring branch of the B&B tree (a wider move) and is realized by the step 8 of algorithm 1.
A leaf in the B&B tree is interesting only if it contains a clique with a cardinality strictly greater than the cardinality of . Being given a B&B node (, ), the current best clique , and a candidate vertex , denote by the maximum cardinality clique in containing the vertex . If , then we do not miss the solution by discarding from (removing a vertex from fathoms all the leafs of the current branch leading to a clique containing ). In our approach we do not compute but we use some upper bounds (see section 3.2).
Denote by the best clique that is found by branching on the vertex , and by the maximum cardinality clique in containing . It is easily seen that . Any vertex such that leads to non-interesting leafs, and thus, can be removed from . Again, we are going to use some upper bounds of .
3.1.3 Main algorithm:
Denote by a procedure which removes from all vertices such that : (i) , or : (ii) , according to some upper bounds of and .
Algorithm 1 gives the global procedure MCC_BB for solving .
3.2 Maximum cardinality estimator
The efficiency of the MCC_BB algorithm greatly depends on the ability of the procedure REMOVE to fathom non-interesting vertices from . In order to do this, we need to tightly estimate and . These bounds are based on what we will call feasible paths in a grid.
A feasible path in a grid is an ordered sequence “, , , ” of vertices , such that , and , .
Denote by the longest (in terms of vertices) feasible path in . Note that computing can be done by dynamic programming in time.
3.2.1 Estimation of :
For any vertex , we denote by the longest feasible path in containing , such that for any vertex in the feasible path, is connected to (i.e. or ). As illustrated in figure 1, , and . It is easily seen that :
Thus, any vertex such that can be safely removed from . Note that computing all can be done once-for-all in a preprocessing step in O() time.
3.2.2 Estimation of :
It is obvious that , so if we can safely fathom all vertices from .
In the same way as we estimated , it is easily seen that :
Any vertex such that can be safely removed from . In the subgraph , computing all can be done in O() (see figure 2 for a graphical view of ).
Using the abovementioned estimators, we obtain the fathoming procedure REMOVE described in algorithm 2.
All results presented here were obtained on the same desktop PC with one Intel Pentium CPU at 3Ghz and 2GB of RAM. The mathematical programming based solver (MIP), was implemented in C using Ilog Cplex 10.0 Callable Library. The B&B solver, was also implemented in C. This two clique solvers were compared to the original VAST solver which is based on Bron and Kerbosh algorithm (BK) [bron_kerbosh73]. All algorithms were used to solve maximum cardinality clique problems. Note that in fact BK computes and evaluates all maximal cliques in a graph, and hence, can be used to solve any kind of clique problems.
The comparison of the above algorithms was performed on real-life proteins. We used two different benchmarks which significantly differ by the size of the instances (number of SSEs). The first benchmark is the Skolnick set which was recently largely used in protein structure comparison papers [1001, cmos_07, VNS]). This set contains 40 small protein chains (containing one domain), with a SSEs number varying from 5 to 20. The second set benchmark (S2) contains 36 long protein chains (containing from 4 to 6 domains), with a number of SSEs varying from 51 to 87. Note that for the skolnick set, we only considered instances leading to an alignement graph with at least 100 vertices.
The characteristics of the corresponding alignment grids are given in table 1. One peculiarity is their low density, less than 20% for the Skolnick set, and less than 6% for S2 set.
|Number of vertices||Number of edges||Density|
Figure 3 compares the time needed by MIP to the one of BK on the Skolnick set. On the 170 instances containing more than 100 vertices, MIP is always slower than BK (3.35 times slower in average). This is not surprising, since dedicated solvers are expected to be faster than general purpose solvers (CPLEX in this case). However, this observation motivated us to go further in developing a fast special purpose clique solver.
Figure 4 compares the time needed by B&B to the one of BK on set S2. Here we observed that B&B is about 15.57 times faster than BK in average, and on the biggest instances (where both proteins contain more than 80 SSEs), it is up to 116.7 times faster. Such big instances are solved by B&B in less than 79 seconds (25 sec. on average) while BK needs up to 2660 seconds (1521 sec. on average). These results are detailed in table 2.
|time (sec.)||time (sec.)|
We presented a new mathematical programming model for solving the maximum weighted clique problem arising in the context of protein structure comparison. This model was implemented and validated on a small benchmark. We also presented a new dedicated branch and bound algorithm for the maximum cardinality clique problem. The computationnal results show that on big instances, our branch and bound is significantly faster than the Bron and Kerbosh algorithm (up to 116 times for the largest proteins). In the near futur, we intend to study the behavior of the proposed algorithms on arbitrary graphs, conveniently transformed into grid graphs in a preprocessing step.