Solving the Minimum Common String Partition Problem with the Help of Ants

S. M. Ferdous, M. Sohel Rahman
AEDA Group, Department of CSE, BUET, Dhaka-1000, Bangladesh

Abstract

In this paper, we consider the problem of finding a minimum common partition of two strings. The problem has applications in genome comparison. As it is an NP-hard, discrete combinatorial optimization problem, we employ a metaheuristic technique, namely the MAX-MIN ant system, to solve it. To achieve better efficiency, we first map the problem instance into a special kind of graph. Subsequently, we run a MAX-MIN ant system on this graph to obtain high-quality solutions. Experimental results show that our algorithm outperforms the state-of-the-art algorithm in the literature. The improvement is also confirmed by a standard statistical test.

keywords:
Ant Colony Optimization, Stringology, Genome sequencing, Combinatorial Optimization, Swarm Intelligence, String partitioning

1 Introduction

String comparison is one of the important problems in Computer Science, with diverse applications in areas including genome sequencing, text processing and compression. In this paper, we address the problem of finding a minimum common string partition (MCSP) of two strings. MCSP is closely related to genome rearrangement, which is an important topic in computational biology. Given two DNA sequences, MCSP asks for the smallest set of common building blocks of the sequences.

In the MCSP problem, we are given two related strings $(X, Y)$. Two strings are related if every letter appears the same number of times in each of them. Clearly, two strings have a common partition if and only if they are related. So, the lengths of the two strings are also the same (say, $n$). Our goal is to partition each string into $c$ segments called blocks, so that the blocks in the partition of $X$ and those in the partition of $Y$ constitute the same multiset of substrings. The cardinality of the partition set, i.e., $c$, is to be minimized. A partition of a string $X$ is a sequence $P = (B_1, B_2, \ldots, B_c)$ of strings whose concatenation is equal to $X$, that is, $B_1 B_2 \cdots B_c = X$. The strings $B_i$ are called the blocks of $P$. Given a partition $P$ of a string $X$ and a partition $Q$ of a string $Y$, we say that the pair $\pi = \langle P, Q \rangle$ is a common partition of $X$ and $Y$ if $Q$ is a permutation of $P$. The minimum common string partition problem is to find a common partition of $X$ and $Y$ with the minimum number of blocks. For example, if $(X, Y)$ = {“ababcab”,“abcabab”}, then {“ab”,“abc”,“ab”} is a common partition of size 3, while {“ab”,“abcab”} is a minimum common partition, so the minimum common partition size is 2. The restricted version of MCSP where each letter occurs at most $k$ times in each input string is denoted by $k$-MCSP.
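The two definitions above (related strings and common partition) can be checked mechanically. The following is a minimal sketch; the function names `are_related` and `is_common_partition` are illustrative and do not come from the paper.

```python
from collections import Counter


def are_related(x, y):
    """Two strings are related if every letter occurs equally often in both."""
    return Counter(x) == Counter(y)


def is_common_partition(x, y, p, q):
    """Check that p partitions x, q partitions y, and q is a permutation of p."""
    return "".join(p) == x and "".join(q) == y and Counter(p) == Counter(q)


x, y = "ababcab", "abcabab"
print(are_related(x, y))
print(is_common_partition(x, y, ["ab", "abc", "ab"], ["abc", "ab", "ab"]))
```

Both checks run in linear time; they only verify a given partition and say nothing about whether it is minimum.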

MCSP has many applications rooted in comparative genomics. Given two DNA strings, MCSP answers the possibilities of rearrangement of one DNA string to another peter . MCSP is also important in ortholog assignment. In chen , the authors present a new approach to ortholog assignment that takes into account both sequence similarity and evolutionary events at a genomic level. In that approach, the problem is first formulated as that of computing the signed reversal distance with duplicates between the two genomes of interest. Then, the problem is decomposed into two optimization problems, namely the minimum common partition problem and the maximum cycle decomposition problem. Thus MCSP plays an integral part in computing the ortholog assignment of genes.

1.1 Our Contribution

In this paper, we consider metaheuristic approaches to solve the problem. To the best of our knowledge, there exists no attempt to solve the problem with metaheuristic approaches; only theoretical works are present in the literature. We are particularly interested in nature-inspired algorithms. As the problem is a discrete combinatorial optimization problem, the natural choice is Ant Colony Optimization (ACO). Before applying ACO, it is necessary to map the problem into a graph, and we have developed this mapping. In this paper, we implement a variant of the ACO algorithm, namely the MAX-MIN Ant System (MMAS), to solve the MCSP problem. We conduct experiments on both random and real data to compare our algorithm with the state-of-the-art algorithm in the literature and achieve excellent results. Notably, a preliminary version of the paper appeared at FerdousR13 .

2 Literature Review

1-MCSP is essentially the breakpoint distance problem chromosome between two permutations, which is to count the number of ordered pairs of symbols that are adjacent in the first string but not in the other; this problem is obviously solvable in polynomial time goldstein . The 2-MCSP is proved to be NP-hard and moreover APX-hard in goldstein . The authors in goldstein also presented several approximation algorithms. Chen et al. chen studied the problem of Signed Reversal Distance with Duplicates (SRDD), which is a generalization of MCSP. They gave a 1.5-approximation algorithm for 2-MCSP. In peter , the author analyzed the fixed-parameter tractability of MCSP considering different parameters. In jiang , the authors investigated $k$-MCSP along with two other variants: $MCSP^c$, where the alphabet size is at most $c$; and $x$-balanced MCSP, which requires that the length of the blocks must be within the range $(n/d - x, n/d + x)$, where $d$ is the number of blocks in the optimal common partition and $x$ is a constant integer. They showed that $MCSP^c$ is NP-hard when $c \ge 2$. As for $k$-MCSP, they presented an FPT algorithm which runs in $O^*((k!)^{2d})$ time.

Chrobak et al. chrobak analyzed a natural greedy heuristic for MCSP: iteratively, at each step, it extracts a longest common substring from the input strings. They showed that for 2-MCSP, the approximation ratio (for the greedy heuristic) is exactly 3. They also proved that for 4-MCSP the ratio is $\Omega(\log n)$ and, for the general MCSP, between $\Omega(n^{0.43})$ and $O(n^{0.69})$.
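The greedy heuristic described above can be sketched as follows. This is a straightforward quadratic-time-per-step illustration, not the implementation analyzed in chrobak ; the function name `greedy_mcsp` and the tie-breaking rule (first longest match found) are assumptions.

```python
from collections import Counter


def greedy_mcsp(x, y):
    """Repeatedly extract a longest common substring of the unmarked parts."""
    if Counter(x) != Counter(y):
        raise ValueError("strings must be related")
    n = len(x)
    used_x, used_y = [False] * n, [False] * n
    blocks, remaining = [], n
    while remaining > 0:
        best_len, best_i, best_j = 0, -1, -1
        prev = [0] * (n + 1)
        for i in range(n):
            cur = [0] * (n + 1)
            if not used_x[i]:
                for j in range(n):
                    if not used_y[j] and x[i] == y[j]:
                        # Extend the common run ending at (i - 1, j - 1).
                        cur[j + 1] = prev[j] + 1
                        if cur[j + 1] > best_len:
                            best_len, best_i, best_j = cur[j + 1], i, j
            prev = cur
        si, sj = best_i - best_len + 1, best_j - best_len + 1
        for k in range(best_len):
            used_x[si + k] = used_y[sj + k] = True  # mark the block as taken
        blocks.append(x[si:si + best_len])
        remaining -= best_len
    return blocks


print(greedy_mcsp("bceabcd", "abcdbec"))
```

Marked positions reset the dynamic-programming run, so an extracted substring never crosses a block that was taken earlier.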

Ant colony optimization (ACO) jour_dorigo ; jour1_dorigo ; book_dorigo was introduced by M. Dorigo and colleagues as a novel nature-inspired metaheuristic for the solution of hard combinatorial optimization (CO) problems. The inspiring source of ACO is the pheromone trail laying and following behavior of real ants which use pheromones as a communication medium. In analogy to the biological example, ACO is based on the indirect communication of a colony of simple agents, called (artificial) ants, mediated by (artificial) pheromone trails. The pheromone trails in ACO serve as a distributed, numerical information which the ants use to probabilistically construct solutions to the problem being solved and which the ants adapt during the algorithm’s execution to reflect their search experience.

Different ACO algorithms have been proposed in the literature. The original algorithm is known as the Ant System (AS) pos_dorigo ; dis_dorigo ; jour3_dorigo . Other variants include Elitist AS dis_dorigo ; jour3_dorigo , ANT-Q antq , Ant Colony System (ACS) jour1_dorigo and MAX-MIN AS mmas1 ; mmas2 ; jour_Utzle .

Recently, there has been growing interest in ACO in the scientific community. Several successful implementations of the ACO metaheuristic are now available for a number of different discrete combinatorial optimization problems. In jour_dorigo the authors distinguish between two classes of applications of ACO: those to static combinatorial optimization problems and those to dynamic ones. A problem is termed static when it is fully defined beforehand and does not change while it is being solved. The authors list some static combinatorial optimization problems that have been successfully solved by different variants of ACO, including travelling salesperson, quadratic assignment, job-shop scheduling, vehicle routing, sequential ordering and graph coloring. Dynamic problems are defined as a function of some quantities whose values are set by the dynamics of an underlying system. The problem therefore changes at run time, and the optimization algorithm must be capable of adapting online to the changing environment. The authors list connection-oriented and connectionless network routing as examples of dynamic problems that have been successfully solved by ACO.

In 2010, a non-exhaustive list of applications of ACO algorithms grouped by problem type was presented in survey_dorigo_2010 . The authors categorized the problems into different types, namely routing, assignment, scheduling, subset, machine learning and bioinformatics. For each type they listed the problems that have been successfully solved by some variant of ACO.

Not many string-related problems have been solved by ACO in the literature. In blum_seq , the authors addressed the reconstruction of DNA sequences from DNA fragments by ACO. Several ACO algorithms have been proposed for the longest common subsequence (LCS) problem in lcs_aco_shyu ; lcs_aco_christ . Recently, the minimum string cover problem was solved by ACO in mscp_aco . Finally, we note that a preliminary version of this work was presented at confVersion .

3 Preliminaries

In this section, we present some definitions and notations that are used throughout the paper.

Definition (Related strings). Two strings $(X, Y)$, each of length $n$, over an alphabet $\Sigma$ are called related if every letter appears the same number of times in each of them.

Example. If $X$ = “abacbd” and $Y$ = “acbbad”, then they are related. But if $X$ = “aeacbd” and $Y$ = “acbbad”, they are not related.

Definition (Block). A block $B = [id, i, j]$, $0 \le i \le j < n$, of a string $S$ is a data structure having three fields: $id$ is an identifier of $S$, and the starting and ending positions of the block in $S$ are represented by $i$ and $j$, respectively. Naturally, the length of a block $[id, i, j]$ is $(j - i + 1)$. We use $\mathrm{substring}([id, i, j])$ to denote the substring of $S$ induced by the block $[id, i, j]$. Throughout the paper we will use 0 and 1 as the identifiers of $X$ (i.e., $id(X)$) and $Y$ (i.e., $id(Y)$), respectively. We use $[\,]$ to denote an empty block.

Example. If we have two strings $(X, Y)$ = {“abcdab”,“bcdaba”}, then $[0, 0, 1]$ and $[0, 4, 5]$ both represent the substring “ab” of $X$. In other words, $\mathrm{substring}([0, 0, 1]) = \mathrm{substring}([0, 4, 5]) =$ “ab”.

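Two blocks can be intersected or unioned. The intersection of two blocks (with the same ids) is a block that contains the common portion of the two.

Definition (Intersection of blocks). Formally, the intersection of $B_1 = [id, i, j]$ and $B_2 = [id, i', j']$ is defined as follows:

$$B_1 \cap B_2 = \begin{cases} [\,] & \text{if } i' > j \text{ or } i > j' \\ [id, \max(i, i'), \min(j, j')] & \text{otherwise} \end{cases} \qquad (1)$$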
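Example. If $B_1 = [0, 0, 5]$ and $B_2 = [0, 3, 8]$, then $B_1 \cap B_2 = [0, 3, 5]$. On the other hand, if $B_1 = [0, 0, 3]$ and $B_2 = [0, 5, 8]$, then $B_1 \cap B_2 = [\,]$.

Definition (Union of blocks). The union of two blocks (with the same ids) is either another block or an ordered (based on the starting position) set of blocks. Without loss of generality, we suppose that $i \le i'$ for $B_1 = [id, i, j]$ and $B_2 = [id, i', j']$. Then, formally, the union of $B_1$ and $B_2$ is defined as follows:

$$B_1 \cup B_2 = \begin{cases} [id, i, \max(j, j')] & \text{if } i' \le j + 1 \\ \{B_1, B_2\} & \text{otherwise} \end{cases} \qquad (2)$$

Example. If $B_1 = [0, 1, 5]$ and $B_2 = [0, 3, 6]$, then $B_1 \cup B_2 = [0, 1, 6]$. On the other hand, if $B_1 = [0, 1, 5]$ and $B_2 = [0, 11, 12]$, then $B_1 \cup B_2 = \{[0, 1, 5], [0, 11, 12]\}$.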

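The union rule for an ordered set of blocks $\mathcal{B} = \{B_1, B_2, \ldots, B_m\}$ and a block $B'$ can be defined as follows. We have to find the position where $B'$ can be placed in $\mathcal{B}$, i.e., we have to find the block $B_k \in \mathcal{B}$ after which $B'$ can be placed. Then, we have to replace the ordered subset $\{B_k, B_{k+1}\}$ with $B_k \cup B' \cup B_{k+1}$.

Example. Suppose we have three blocks, namely $B_1 = [0, 5, 7]$, $B_2 = [0, 11, 12]$ and $B_3 = [0, 8, 10]$. Then $B_1 \cup B_2 = \{[0, 5, 7], [0, 11, 12]\}$. On the other hand, $(B_1 \cup B_2) \cup B_3 = \{[0, 5, 7], [0, 8, 10], [0, 11, 12]\}$, which is basically identical to $[0, 5, 12]$.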
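The block operations can be sketched in a few lines. The sketch below assumes 0-based inclusive positions, represents the empty block by `None`, and assumes that adjacent blocks (not only overlapping ones) are merged by the union; the names `Block`, `intersect` and `union` are illustrative.

```python
from typing import NamedTuple


class Block(NamedTuple):
    sid: int  # string identifier: 0 for X, 1 for Y
    i: int    # starting position (inclusive)
    j: int    # ending position (inclusive)


def intersect(b1, b2):
    """Return the common portion of two blocks, or None for the empty block."""
    assert b1.sid == b2.sid
    lo, hi = max(b1.i, b2.i), min(b1.j, b2.j)
    return Block(b1.sid, lo, hi) if lo <= hi else None


def union(blocks, b):
    """Insert b into an ordered list of blocks, merging overlapping or adjacent ones."""
    merged = []
    for cur in sorted(blocks + [b], key=lambda t: t.i):
        if merged and cur.i <= merged[-1].j + 1:
            last = merged[-1]
            merged[-1] = Block(last.sid, last.i, max(last.j, cur.j))
        else:
            merged.append(cur)
    return merged


print(intersect(Block(0, 0, 5), Block(0, 3, 8)))
print(union([Block(0, 5, 7), Block(0, 11, 12)], Block(0, 8, 10)))
```

Merging adjacent blocks is what lets a union of all blocks of a partition collapse into a single block spanning the whole string.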

Two blocks $B_1$ and $B_2$ (in the same string or in two different strings) match if $\mathrm{substring}(B_1) = \mathrm{substring}(B_2)$. If the two matched blocks are in two different strings, then the matched substring is called a common substring of the two strings, denoted by cstring($X, Y$).

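Definition (Span). Given a list of blocks with the same id, the span of a block $B$ in the list, denoted by $\mathrm{span}(B)$, is the length of the block (also in the list) that contains $B$ and whose length is maximum over all such blocks in the list. Note that a block is assumed to contain itself. More formally, given a list of blocks $L$, $\mathrm{span}(B) = \max\{\mathrm{length}(B') : B' \in L,\ B \cap B' = B\}$.

Example. If $L = \{[0, 0, 0], [0, 0, 1], [0, 0, 2], [0, 4, 5]\}$, then $\mathrm{span}([0, 0, 0]) = \mathrm{span}([0, 0, 1]) = 3$, whereas $\mathrm{span}([0, 4, 5]) = 2$. In other words, the span of a block is the maximum length of the superstring that contains the substring induced by the block.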
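A direct reading of the span definition is sketched below. Blocks are plain `(id, i, j)` tuples with inclusive positions, and the block list is an illustrative one, not data from the paper.

```python
def span(block, blocks):
    """Length of the longest block in `blocks` that contains `block`."""
    sid, i, j = block
    return max(e - s + 1 for (d, s, e) in blocks if d == sid and s <= i and j <= e)


blocks = [(0, 0, 0), (0, 0, 1), (0, 0, 2), (0, 4, 5)]
print([span(b, blocks) for b in blocks])
```

Because a block contains itself, the maximum is always taken over a non-empty set as long as the queried block is in the list.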

Definition (Partition). A partition of a string $X$ is a list of blocks, all with id $id(X)$, having the following two properties:

  1. Non-overlapping: The blocks must be disjoint, i.e., no block should overlap with another block. So the intersection of any two blocks must be empty.

  2. Cover: The blocks must cover the whole string.

In other words, a partition of a string $X$ is a sequence $P = (B_1, B_2, \ldots, B_c)$ of strings whose concatenation is equal to $X$, that is, $B_1 B_2 \cdots B_c = X$, where the $B_i$'s are blocks.

3.1 Basics of ACO

In ACO, a combinatorial optimization (CO) problem is solved by iterating the following two steps. First, solutions are constructed using a parameterized probability distribution over the solution space, which is called the pheromone model. Second, the pheromone values are modified using the solutions that were constructed in earlier iterations, in a way that is deemed to bias the search towards high-quality solutions.

3.2 Ant Based Solutions Construction

The basic ingredient of an ACO algorithm is a constructive heuristic that constructs solutions probabilistically. A constructive heuristic assembles solutions as sequences of solution components taken from a finite set of solution components $C = \{c_1, c_2, \ldots, c_n\}$. A solution is constructed starting with an empty partial solution $s^p = \langle \rangle$. Then, at each construction step, the current partial solution $s^p$ is extended by adding a feasible solution component from the solution space $C$. The definition of a feasible solution component is problem specific. Typically a problem is mapped into a construction graph $G_C = (C, L)$ whose vertices are the solution components $C$ and whose set $L$ contains the connections (i.e., edges). The process of constructing solutions can be regarded as a walk (or a path) on the construction graph.

3.3 Heuristic Information

In most ACO algorithms the transition probabilities, i.e., the probabilities for choosing the next solution component, are defined as follows:

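$$p(c_i \mid s^p) = \frac{\tau_i^{\alpha} \cdot \eta(c_i)^{\beta}}{\sum_{c_j \in N(s^p)} \tau_j^{\alpha} \cdot \eta(c_j)^{\beta}}, \quad \forall c_i \in N(s^p) \qquad (3)$$

Here, $c_i$ is a candidate component and $s^p$ is the partial solution. The current partial solution is extended by adding a feasible solution component from the set of feasible neighbors $N(s^p)$. $\eta$ is a weight function that contains heuristic information, and $\alpha$, $\beta$ are positive parameters whose values determine the relation between the pheromone information and the heuristic information. The pheromones deployed by the ants are denoted by $\tau$.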
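The usual ACO transition rule, in which each feasible component is weighted by pheromone raised to $\alpha$ times heuristic value raised to $\beta$ and the weights are normalized, can be sketched as follows. The function names and the toy numbers are illustrative assumptions.

```python
import random


def transition_probabilities(tau, eta, alpha, beta):
    """Normalize tau^alpha * eta^beta over the feasible components."""
    weights = [(t ** alpha) * (h ** beta) for t, h in zip(tau, eta)]
    total = sum(weights)
    return [w / total for w in weights]


def choose_component(components, tau, eta, alpha, beta, rng=random):
    """Sample one feasible component according to the transition probabilities."""
    probs = transition_probabilities(tau, eta, alpha, beta)
    return rng.choices(components, weights=probs, k=1)[0]


# Two feasible components with equal pheromone but different heuristic values.
print(transition_probabilities([1.0, 1.0], [1.0, 2.0], alpha=1, beta=2))
print(choose_component(["c1", "c2"], [1.0, 1.0], [1.0, 2.0], alpha=1, beta=2))
```

A larger $\beta$ makes the choice greedier with respect to the heuristic, while a larger $\alpha$ makes it follow the accumulated pheromone more closely.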

3.4 Pheromone Update

The pheromone update consists of two parts. The first part is pheromone evaporation, which uniformly decreases all the pheromone values $\tau_i$. From a practical point of view, pheromone evaporation prevents too rapid convergence of the algorithm toward a sub-optimal region. Thus it helps to avoid locally optimal solutions and favors the exploration of new areas in the search space. Then, one or more solutions from the current or from earlier iterations (the set is denoted by $S_{upd}$) are used to increase the values of the pheromone trail parameters on solution components that are part of these solutions:

$$\tau_i \leftarrow (1 - \rho) \cdot \tau_i + \sum_{s \in S_{upd} \mid c_i \in s} F(s) \qquad (4)$$

Let $f$ be the cost function. Here, $S_{upd}$ is the set of local best or global best solutions, $\rho \in (0, 1]$ is a parameter called the evaporation rate, and $F$ is a function such that $f(s) < f(s') \Rightarrow F(s) \ge F(s')$. The function $F$ is commonly called the fitness function.

In general, different versions of ACO algorithms differ in the way they update the pheromone values. This also holds for the two currently best-performing ACO variants in practice, namely, the Ant Colony System (ACS) jour1_dorigo and the MAX-MIN Ant System (MMAS) jour_Utzle . Since in our algorithm we hybridize ACS with MMAS, below we give a brief description of MMAS.

3.5 MAX-MIN Ant System (MMAS)

MMAS algorithms are characterized as follows. First, the pheromone values are limited to an interval $[\tau_{min}, \tau_{max}]$ with $0 < \tau_{min} < \tau_{max}$. Pheromone trails are initialized to $\tau_{max}$ to favor diversification during the early iterations, so that premature convergence is prevented. Explicit limits on the pheromone values ensure that the chance of finding a global optimum never becomes zero. Second, in case the algorithm detects that the search is too confined to a certain area of the search space, a restart is performed. This is done by initializing all the pheromone values again. Third, the pheromone update is always performed with either the iteration-best solution, the restart-best solution (i.e., the best solution found since the last restart was performed), or the best-so-far solution.
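The first and third MMAS ingredients (bounded pheromone values and an update driven by a single best solution) can be sketched as one update step. This is a minimal illustration; the names `mmas_update`, `rho` and `quality`, and the choice to deposit the raw quality value, are assumptions of this sketch.

```python
def mmas_update(pheromone, best_solution, quality, rho, tau_min, tau_max):
    """Evaporate, reinforce the components of one best solution, then clamp."""
    for c in pheromone:
        pheromone[c] *= (1.0 - rho)          # evaporation on every component
    for c in best_solution:
        pheromone[c] += quality              # deposit only on the best solution
    for c in pheromone:
        pheromone[c] = min(tau_max, max(tau_min, pheromone[c]))  # MAX-MIN bounds
    return pheromone


trails = {"a": 1.0, "b": 1.0, "c": 0.05}
print(mmas_update(trails, ["a"], quality=0.5, rho=0.1, tau_min=0.1, tau_max=1.2))
```

Clamping keeps every component selectable with non-zero probability, which is the property the explicit limits are meant to guarantee.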

4 Our Approach: MAX-MIN Ant System on the Common Substring Graph

4.1 Formulation of Common Substring Graph

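We define a common substring graph, $G_{cs}(V, E, id(X))$, of a string $X$ with respect to $Y$ as follows. Here $V$ is the vertex set of the graph and $E$ is the edge set. Vertices are the positions of string $X$, i.e., for each $v \in V$, $v \in [0, |X| - 1]$. Two vertices $v_i \le v_j$ are connected with an edge, i.e., $(v_i, v_j) \in E$, if the substring induced by the block $[id(X), v_i, v_j]$ matches some substring of $Y$. More formally, we have:

$$E = \{(v_i, v_j) : \mathrm{substring}([id(X), v_i, v_j]) = \mathrm{substring}(B') \text{ for some block } B' \text{ of } Y\}$$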

In other words, each edge in the edge set corresponds to a block satisfying the above condition. For convenience, we will refer to the edges as edge blocks and use the list of edge blocks (instead of edges) to define the edge set $E$. Notably, each edge block in the edge set of $G_{cs}$ of string $X$ may match more than one block of $Y$. For each edge block $B$, a list is maintained containing all the blocks of string $Y$ that match $B$. This list is called the matchList.

For example, suppose $(X, Y)$ = {“abad”,“adab”}. Now consider the corresponding common substring graph $G_{cs}$. Then, we have $V = \{0, 1, 2, 3\}$ and $E = \{[0,0,0], [0,0,1], [0,1,1], [0,2,2], [0,2,3], [0,3,3]\}$. The construction steps are shown in Figure 1.

Figure 1: Construction of the common substring graph $G_{cs}$ of $(X, Y)$ = {“abad”,“adab”}. (a) Vertex 0 is connected with itself because “a” is a common substring of $X$ and $Y$. (b) An edge connects vertices 0 and 1, as “ab” is a common substring of $X$ and $Y$. (c) Vertex 1 is connected with itself. (d) Vertices 2 and 3 are connected, as “ad” is a common substring of $X$ and $Y$. (e) Vertex 3 is connected with itself.
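The construction of the common substring graph can be sketched as a dictionary that maps every edge block of $X$ to its list of matching blocks in $Y$. This is a brute-force illustration; the function name `common_substring_graph` and the tuple layout `(id, i, j)` are assumptions.

```python
def common_substring_graph(x, y):
    """Map each edge block (0, i, j) of x to the blocks (1, s, e) of y it matches."""
    edges = {}
    for i in range(len(x)):
        for j in range(i, len(x)):
            sub = x[i:j + 1]
            matches = [(1, s, s + len(sub) - 1)
                       for s in range(len(y) - len(sub) + 1)
                       if y.startswith(sub, s)]
            if not matches:
                break  # no longer substring starting at i can occur in y
            edges[(0, i, j)] = matches
    return edges


graph = common_substring_graph("abad", "adab")
for edge_block, match_list in sorted(graph.items()):
    print(edge_block, match_list)
```

The early `break` is valid because any substring of $Y$ that extends a non-occurring prefix cannot occur either.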

To find a common partition of two strings $(X, Y)$, we first construct the common substring graph of $(X, Y)$. Then, from a vertex $v_i$ on the graph, we take an edge block $[id(X), i, j]$. Suppose $M$ is the matchList of this block. We take a block from $M$. Then we advance to the next vertex, that is, $(j + 1) \bmod |X|$, and choose another corresponding edge block as before. We continue this until we come back to the starting vertex. Let partitionList and mappedList be two lists, each of length $c$, containing the traversed edge blocks and the corresponding matched blocks, respectively. Now we have the following lemma.

Lemma 1

mappedList $= (B_1, B_2, \ldots, B_c)$ is a common partition of length $c$ iff

$$B_i \cap B_j = [\,] \quad \forall\, i \ne j \qquad (5)$$

and

$$B_1 \cup B_2 \cup \cdots \cup B_c = [id(Y), 0, |Y| - 1] \qquad (6)$$

Proof. By construction, partitionList is a partition of $X$. We need to prove that mappedList is a partition of $Y$; with the one-to-one correspondence between partitionList and mappedList, it is then obvious that mappedList is a common partition of $(X, Y)$. Equation 5 asserts the non-overlapping property of mappedList and Equation 6 assures the cover property. So, mappedList is a partition of $Y$ if Equations 5 and 6 are satisfied.

On the other hand, let mappedList along with partitionList be a common partition of $(X, Y)$. According to the construction, partitionList satisfies the two properties of a partition, and mappedList is a partition of $Y$. Assume that mappedList does not satisfy Equation 5 or 6. Then either some of its blocks overlap or its blocks do not cover the string $Y$, a contradiction. This completes the proof.
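The two conditions of the lemma (pairwise disjoint blocks that together cover the whole string) can be tested in one pass over the matched blocks sorted by starting position. This is a sketch; the function name `is_partition_of` and the `(id, i, j)` tuple layout are assumptions.

```python
def is_partition_of(blocks, n):
    """True iff the blocks are pairwise disjoint and cover positions 0..n-1."""
    expected = 0
    for _, i, j in sorted(blocks, key=lambda b: b[1]):
        if i != expected:
            # i < expected means an overlap, i > expected means a gap.
            return False
        expected = j + 1
    return expected == n


# Matched blocks in Y = "abcab" for the partition {"ab", "abc"} of X = "ababc".
print(is_partition_of([(1, 3, 4), (1, 0, 2)], 5))
print(is_partition_of([(1, 0, 1), (1, 1, 2), (1, 3, 4)], 5))
```

After sorting, each block must start exactly where the previous one ended, which checks both conditions at once.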

4.2 Heuristics

Heuristics ($\eta$) contain the problem-specific information. We propose two different (types of) heuristics for MCSP. Firstly, we propose a static heuristic that does not change during the runs of the algorithm. The other heuristic we propose is dynamic in the sense that it changes between the runs.

4.2.1 The Static Heuristic for MCSP

We employ an intuitive idea. It is obvious that the larger the blocks are, the smaller the partition set is. To capture this phenomenon, we assign to each edge of the common substring graph a numerical value that is proportional to the length of the substring corresponding to the edge block. Formally, the static heuristic ($\eta_s$) of an edge block $B$ is defined as follows:

$$\eta_s(B) = \frac{\mathrm{length}(B)}{\max_{B' \in E} \mathrm{length}(B')} \qquad (7)$$

4.2.2 The Dynamic Heuristic for MCSP

We observe that the static heuristic can sometimes lead us to very bad solutions. For example, if $(X, Y)$ = {“bceabcd”,“abcdbec”}, then according to the static heuristic a much higher value will be assigned to edge block $[0,0,1]$ than to $[0,0,0]$. But if we take $[0,0,1]$, we must match it to the block $[1,1,2]$, and we then miss the opportunity to take $[0,3,6]$ later. The resultant partition will be {“bc”,“e”,“a”,“b”,“c”,“d”}, but if we take $[0,0,0]$ at the first step, then one of the resultant partitions is {“b”,“c”,“e”,“abcd”}. To overcome this shortcoming of the static heuristic we define a dynamic heuristic as follows. The dynamic heuristic ($\eta_d$) of an edge block ($B$) is inversely proportional to the difference between the length of the block and the minimum span of its corresponding blocks in its matchList. More formally, $\eta_d(B)$ is defined as follows:

$$\eta_d(B) = \frac{1}{\mathrm{minSpan}(B) - \mathrm{length}(B) + 1} \qquad (8)$$

where

$$\mathrm{minSpan}(B) = \min\{\mathrm{span}(B') : B' \in \mathit{matchList}(B)\} \qquad (9)$$

In the example, $\mathrm{minSpan}([0,0,0])$ is 1 as follows: $\mathrm{span}([1,1,1]) = 4$ and $\mathrm{span}([1,4,4]) = 1$. On the other hand, $\mathrm{minSpan}([0,0,1])$ is 4. So, according to the dynamic heuristic, a much higher value will be assigned to block $[0,0,0]$ than to block $[0,0,1]$.

We define the total heuristic ($\eta$) to be the linear combination of the static heuristic ($\eta_s$) and the dynamic heuristic ($\eta_d$). Formally, the total heuristic of an edge block $B$ is

$$\eta(B) = a \cdot \eta_s(B) + b \cdot \eta_d(B) \qquad (10)$$

where $a$, $b$ are real-valued constants. The algorithms for the dynamic, static and total heuristics are shown in Algorithms 1 to 3.

E ← edge blocks of G_cs
for all Block B in E do
     minspan ← find minimum free span of B by Equation 9
     dynamicHeuristic(B) ← 1 / (minspan − length(B) + 1)
end for
Algorithm 1 addDynamicHeuristic(G_cs)

E ← edge blocks of G_cs
max ← length of the maximum length edge block of G_cs
for all Block B in E do
     staticHeuristic(B) ← length(B) / max
end for
Algorithm 2 addStaticHeuristic(G_cs)

E ← edge blocks of G_cs
addStaticHeuristic(G_cs)
addDynamicHeuristic(G_cs)
for all Block B in E do
     heuristic(B) ← a · staticHeuristic(B) + b · dynamicHeuristic(B)
end for
Algorithm 3 addHeuristic(G_cs, a, b)
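The three heuristic routines can be sketched together on the example (“bceabcd”, “abcdbec”). The sketch assumes that the static value is the block length divided by the longest edge-block length, that the dynamic value is 1 / (minimum span − length + 1), and that spans are taken over all blocks of $Y$ matching some edge block before any block has been consumed; all function names are illustrative.

```python
def match_lists(x, y):
    """Map each edge block (i, j) of x to the matching (start, end) blocks of y."""
    edges = {}
    for i in range(len(x)):
        for j in range(i, len(x)):
            sub = x[i:j + 1]
            starts = [s for s in range(len(y) - len(sub) + 1) if y.startswith(sub, s)]
            if not starts:
                break
            edges[(i, j)] = [(s, s + j - i) for s in starts]
    return edges


def heuristics(x, y, a=1.0, b=1.0):
    """Return {edge block: (static, dynamic, total)} for the initial graph."""
    edges = match_lists(x, y)
    all_matches = {m for ms in edges.values() for m in ms}

    def span(m):
        return max(e - s + 1 for s, e in all_matches if s <= m[0] and m[1] <= e)

    max_len = max(j - i + 1 for i, j in edges)
    eta = {}
    for (i, j), ms in edges.items():
        length = j - i + 1
        static = length / max_len
        dynamic = 1.0 / (min(span(m) for m in ms) - length + 1)
        eta[(i, j)] = (static, dynamic, a * static + b * dynamic)
    return eta


eta = heuristics("bceabcd", "abcdbec")
print(eta[(0, 0)], eta[(0, 1)])
```

In this example the static part prefers the edge block for “bc”, while the dynamic part prefers the single “b”, which is the behavior the dynamic heuristic is designed to produce.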

4.3 Initialization and Configuration

Given two strings $(X, Y)$, we first construct the common substring graph $G_{cs}$. We use the following notations. The local best solution ($\pi_{lb}$) is the best solution found in each iteration. The global best solution ($\pi_{gb}$) is the best solution found so far among all iterations. The pheromone of each edge block is bounded between $\tau_{min}$ and $\tau_{max}$. Like jour_Utzle , we use the following values for $\tau_{max}$ and $\tau_{min}$: $\tau_{max} = \frac{1}{\rho \cdot \mathrm{cost}(\pi_{gb})}$, and $\tau_{min} = \frac{\tau_{max}\,(1 - \sqrt[n]{p_{best}})}{(avg - 1)\,\sqrt[n]{p_{best}}}$. Here, $avg$ is the average number of choices an ant has in the construction phase; $n$ is the length of the string; $p_{best}$ is the probability of finding the best solution when the system converges and $\rho$ is the evaporation rate. Initially, the pheromone values of all edge blocks (substrings) are initialized to $\mathit{initPheromone}$, which is a large value to favor exploration in the first iteration jour_Utzle . The steps of the initialization are shown in Algorithm 4.

initialize
initialize
set Parameters
E ← edge blocks of G_cs
for all Block B in E do
     pheromone(B) ← initPheromone
end for
Algorithm 4 initialize(G_cs)
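The pheromone limits can be computed as in the sketch below. It assumes the limits commonly used for MMAS, written in terms of the evaporation rate: $\tau_{max} = 1 / (\rho \cdot \mathrm{cost})$ and $\tau_{min} = \tau_{max} (1 - p_{best}^{1/n}) / ((avg - 1)\, p_{best}^{1/n})$. The function name, the guard against $\tau_{min} > \tau_{max}$ and the sample numbers are assumptions for illustration, not values from the paper.

```python
def pheromone_limits(rho, best_cost, n, avg, p_best):
    """Return (tau_min, tau_max) for the current global best cost."""
    tau_max = 1.0 / (rho * best_cost)
    p_dec = p_best ** (1.0 / n)  # per-decision probability of the best choice
    tau_min = tau_max * (1.0 - p_dec) / ((avg - 1.0) * p_dec)
    return min(tau_min, tau_max), tau_max


tau_min, tau_max = pheromone_limits(rho=0.05, best_cost=40, n=100, avg=5, p_best=0.05)
print(tau_min, tau_max)
```

Since the upper limit depends on the cost of the global best solution, both limits have to be recomputed whenever a better solution is found.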

4.4 Construction of a Solution

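Let $nAnts$ denote the total number of ants in the colony. Each ant is deployed randomly to a vertex $v_s$ of $G_{cs}$. A solution for an ant starting at a vertex $v_s$ is constructed by the following steps:

step 1: Let $v_i = v_s$. Choose an available edge block starting from $v_i$ by the discrete probability distribution defined below. An edge block $B$ is available if its matchList is not empty and its inclusion in the partitionList and mappedList obeys Equation 5. The probability of choosing edge block $[0, i, j]$ is:

$$p([0, i, j]) = \frac{\tau([0, i, j])^{\alpha} \cdot \eta([0, i, j])^{\beta}}{\sum_{l} \tau([0, i, l])^{\alpha} \cdot \eta([0, i, l])^{\beta}}, \quad l \text{ ranging over all available edge blocks } [0, i, l] \qquad (11)$$

step 2: Suppose $[0, i, j]$ is chosen according to Equation 11 above. We choose a match block $M$ from the matchList of $[0, i, j]$ and delete $M$ from the matchList. We also delete every block that overlaps with $M$ from every matchList of every edge block. Formally, we delete a block $B$ if $B \cap M \ne [\,]$.

We add $[0, i, j]$ to the partitionList and $M$ to the mappedList.

step 3: If $(j + 1) \bmod |X| = v_s$ and the mappedList obeys Equation 6, then we have found a common partition of $X$ and $Y$. The size of the partition is the length of the partitionList. Otherwise, we set $v_i = (j + 1) \bmod |X|$ and jump to step 1.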

The construction is shown in Algorithm 5.

partitionList ← empty list of blocks
mappedList ← empty list of blocks
startpos ← random vertex of G_cs
k ← startpos
repeat
     addHeuristic(G_cs, a, b)
     constructPDF(k, G_cs) using Equation 11
     B ← choose an edge block from PDF
     M ← choose a match block of B by Intelligent Positioning
     update the matchLists of G_cs
     add B to partitionList
     add M to mappedList
     k ← (B.j + 1) mod |X|
until k = startpos
Algorithm 5 constructSolution(i, G_cs)

4.5 Intelligent Positioning

For every edge block of $X$ in $G_{cs}$, we have a matchList that contains the matched blocks of string $Y$. In the construction (step 1), when an edge block is chosen by the probability distribution, we take a block from the matchList of the chosen edge block. We could choose the matched block randomly, but we observe that a random choice may lead to a very bad partition. For example, if $(X, Y)$ = {“ababc”,“abcab”}, then $\mathit{matchList}([0,0,1]) = \{[1,0,1], [1,3,4]\}$. If we choose the first match block $[1,0,1]$, then eventually we will get the partition {“ab”,“ab”,“c”}, but a smaller partition exists, namely {“ab”,“abc”}.

To overcome this problem, we have imposed a rule for choosing the matched block. We select a block from the matchList having the lowest possible span. Formally, for the edge block $B$, a block $M \in \mathit{matchList}(B)$ is selected such that $\mathrm{span}(M)$ is the minimum.

In our example, $\mathrm{span}([1,0,1]) = 3$ whereas $\mathrm{span}([1,3,4]) = 2$. So it is better to select the second block, so that we do not miss the opportunity to match a larger block.
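The minimum-span rule can be sketched on the same example. The sketch recomputes all common blocks of $Y$ by brute force and picks the candidate with the smallest span; the function name `choose_match` and the `(start, end)` tuple layout are assumptions.

```python
def choose_match(x, y, i, j):
    """Pick the block of y matching x[i..j] whose span is minimum."""
    # Every (start, end) block of y whose substring also occurs in x.
    common = {(s, s + l - 1)
              for l in range(1, len(y) + 1)
              for s in range(len(y) - l + 1)
              if y[s:s + l] in x}

    def span(m):
        return max(e - s + 1 for s, e in common if s <= m[0] and m[1] <= e)

    sub = x[i:j + 1]
    candidates = [(s, s + len(sub) - 1)
                  for s in range(len(y) - len(sub) + 1)
                  if y.startswith(sub, s)]
    return min(candidates, key=span)


print(choose_match("ababc", "abcab", 0, 1))
```

Choosing the low-span occurrence leaves the occurrence inside “abc” free, so the longer block can still be matched later.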

4.6 Pheromone Update

When each of the ants in the colony has constructed a solution (i.e., a common partition), an iteration completes. We set the local best solution to the best partition of the iteration, that is, the minimum length partition found in it. The global best solution after $n$ iterations is defined as the minimum length common partition over all the iterations.

We define the fitness $F(\pi)$ of a solution $\pi$ as the reciprocal of the length of $\pi$. The pheromone of each interval (edge block) is computed according to Equation 4 after each iteration. The pheromone values are bounded within the range $\tau_{min}$ and $\tau_{max}$. We update the pheromone values according to $\pi_{lb}$ or $\pi_{gb}$. Initially, for the first 50 iterations, we update the pheromone by $\pi_{lb}$ only, to favor search exploration. After that, we use a schedule in which the frequency of updating with $\pi_{lb}$ decreases and that of $\pi_{gb}$ increases, to facilitate exploitation. The pheromone update algorithm is listed in Algorithm 8.

for all Block B in E do
     pheromone(B) ← pheromone(B) − ρ · pheromone(B)
end for
Algorithm 6 decreasePheromone(Blocklist E)

for all Block B in E do
     pheromone(B) ← pheromone(B) + 1 / |E|
end for
Algorithm 7 increasePheromone(Blocklist E)

E ← edge blocks of G_cs
decreasePheromone(E)
if iterationCounter ≤ 50 then
     increasePheromone(π_lb)
else if  then
     if  then
         increasePheromone()
     else
         increasePheromone()
     end if
else if  then
     if  then
         increasePheromone()
     else
         increasePheromone()
     end if
else if  then
     if  then
         increasePheromone()
     else
         increasePheromone()
     end if
else if  then
     if  then
         increasePheromone()
     else
         increasePheromone()
     end if
else
     increasePheromone()
end if
Update τ_max and τ_min
for all Block B in E do
     Bound pheromone(B) between τ_min and τ_max
end for
Algorithm 8 updatePheromoneSchedule(iterationCounter,,,)

4.7 The Pseudocode

The pseudocode of our approach for solving MCSP is given in Algorithm 9.

construct common substring graph G_cs of string X and Y
for run = 1 to number of Runs do
     initialize(G_cs)
     iterationCounter ← 0
     repeat
         iterationCounter ← iterationCounter + 1
         Initialize local best π_lb
         for i = 1 to nAnts do
              constructSolution(i, G_cs)
              update localBest (π_lb)
         end for
         update globalBest (π_gb)
         updatePheromoneSchedule(iterationCounter,)
     until time reaches or No update found for
end for
Algorithm 9 MMAS(X,Y)

5 Experiments

We have conducted our experiments on a computer with an Intel Core 2 Quad CPU running at 2.33 GHz and 4.00 GB of RAM. The operating system was Windows 7. The algorithm was implemented in Java (JRE version “1.7.0_15”), using JCreator as the Integrated Development Environment. The maximum allowed time for each test case instance was 120 minutes.

5.1 Datasets

We have conducted our experiments on two types of data: randomly generated DNA sequences and real gene sequences.

5.1.1 Random DNA sequences:

We have generated 30 random DNA sequences, each of length at most 600, using seq . The fraction of the bases A, T, G and C is assumed to be 0.25 each. We shuffle each DNA sequence to create a new DNA sequence. The shuffling is done using the online toolbox shuffle . The original random DNA sequence and its shuffled pair constitute a single input $(X, Y)$ in our experiment. This dataset is divided into 3 classes. The first 10 have lengths within [100-200] bps (base-pairs), the next 10 have lengths within [201-400] bps and the remaining 10 have lengths within [401-600] bps.

5.1.2 Real Gene Sequences:

We have collected the real gene sequence data from the NCBI GenBank (http://www.ncbi.nlm.nih.gov). For simulation, we have chosen Bacterial Sequencing (part 14). We have taken the first 15 gene sequences whose lengths are within .

5.2 Parameter Tuning

There are several parameters which have to be carefully set to obtain good results. To obtain a good set of parameters we have done a preliminary experiment. In our experiment we have chosen 3 values for each of the parameters, so there are $3^5 = 243$ possible combinations of the 5 parameters. The values of the parameters used in our experiment are listed in Table 1. We have chosen 2 input cases from each of the groups (group1, group2, group3 and realgene). The time limits are set to 10, 20, 30 and 20 minutes for the 4 groups, respectively. The algorithm is run 4 times and the average result is recorded. Let the partition size of each case be denoted by $P_i$, where $1 \le i \le 8$. With these settings, we find the rank of a combination by the following rule:

After computing the rank, we find the combination of the parameters for which the rank is minimum. The best found parameters are reported in Table 2.
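The tuning grid can be enumerated as sketched below. The value sets are the ones listed in Table 1; the dictionary keys (`alpha`, `beta`, `rho`, `n_ants`, `p_best`) are illustrative names, and the ranking rule itself is not reproduced here.

```python
from itertools import product

# Three candidate values per parameter, taken from Table 1.
grid = {
    "alpha": [1, 2, 3],             # pheromone information
    "beta": [3, 5, 10],             # heuristic information
    "rho": [0.02, 0.04, 0.05],      # evaporation rate
    "n_ants": [20, 60, 100],        # number of ants
    "p_best": [0.005, 0.05, 0.5],   # probability of best solution
}

configs = [dict(zip(grid, values)) for values in product(*grid.values())]
print(len(configs))
print(configs[0])
```

Each configuration would then be run on the eight tuning instances and ranked, after which the configuration with the minimum rank is kept.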

Name Symbol Value set
Pheromone information $\alpha$ {1,2,3}
Heuristic information $\beta$ {3,5,10}
Evaporation rate $\rho$ {0.02,0.04,0.05}
Number of ants $nAnts$ {20,60,100}
Probability of best solution $p_{best}$ {0.005,0.05,0.5}
Table 1: List of parameters. The first column gives the name, the second column the symbol of the parameter and the third column the set of values used for tuning.

Parameters Value
Evaporation rate,
100
Maximum Allowed Time min
Table 2: Best found values of the parameters. The first column is the symbol of the parameter and the second column is the best found value

5.3 Results and Analysis

We have compared our approach with the greedy algorithm of chrobak because none of the other algorithms in the literature is designed for the general MCSP: each of the other approximation algorithms puts some restrictions on the parameters. As expected, the greedy algorithm runs very fast. All of the greedy results presented in this paper were obtained within 2 minutes.

5.3.1 Random DNA sequence:

Table 3, Table 4 and Table 5 present the comparison between our approach and the greedy approach chrobak for the random DNA sequences. For each DNA sequence, the experiment was run 15 times and the average result is reported. The first column of each table reports the partition size computed by the greedy approach and the second column is the average partition size found by MMAS. The third and fourth columns report the worst and best results among the 15 runs, and the fifth column gives the difference between the two approaches. A positive (negative) difference indicates that the greedy result is better (worse) than the MMAS result by that amount. The sixth column reports the standard deviation of the 15 runs of MMAS, and the seventh column is the average time in seconds by which the reported partition size is achieved. The last 3 columns summarize the t-statistic result for greedy vs. MMAS. The first of them reports the t-value of the two-sample t-test; a positive t-value indicates a significant improvement. The second presents the p-value; a lower p-value represents a more significant improvement. The third reports whether the null hypothesis is rejected or accepted. Here the null hypothesis is that the two random populations (partition sizes from greedy and MMAS) have equal means. We have used +, − and ≈ to denote improvement, deterioration and almost equal, respectively. According to the t-statistic at the 5% significance level, MMAS found a better solution in 28 cases. In the other 2 cases we got a worse result at the 5% significance level.
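The t-values in the tables can be approximated from the summary columns alone. The sketch below assumes that, because the greedy algorithm yields a single deterministic value per instance, the statistic reduces to the one-sample form (greedy − mean) / (std / √runs); this reading is an assumption, although it agrees with the reported values for the rows tried here.

```python
import math


def t_statistic(greedy, mmas_mean, mmas_std, runs):
    """t-value of the MMAS runs against the fixed greedy partition size."""
    return (greedy - mmas_mean) / (mmas_std / math.sqrt(runs))


# First row of Table 3: Greedy 46, MMAS(Avg.) 42.8667, Std.Dev. 0.3519, 15 runs.
print(round(t_statistic(46, 42.8667, 0.3519, 15), 2))
```

A positive value means that the MMAS average is below the greedy partition size, i.e., an improvement.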

Greedy MMAS(Avg.) Worst Best Difference Std.Dev.(MMAS) Time in sec(MMAS) tstat p-value significance
46 42.8667 43 42 -3.1333 0.3519 114.6243 34.4886 0.0000 +
56 51.8667 52 51 -4.1333 0.5164 100.823 31 0.0000 +
62 57 58 55 -5 0.6547 207.5253 29.5804 0.0000 +
46 43.3333 43 43 -2.6667 0.488 168.3098 21.166 0.0000 +
44 42.9333 43 43 -1.0667 0.2582 42.7058 16 0.0000 +
48 42.8 43 42 -5.2 0.414 75.2033 48.6415 0.0000 +
65 60.6 60 60 -4.4 0.5071 131.9478 33.6056 0.0000 +
51 46.9333 47 47 -4.0667 0.4577 201.2292 34.4086 0.0000 +
46 45.5333 46 45 -0.4667 0.5164 172.6809 3.5 0.0016 +
63 59.7333 60 59 -3.2667 0.7037 288.4226 17.9781 0.0000 +

Table 3: Comparison between the greedy approach of Chrobak et al. (7) and MMAS on random DNA sequences (Group 1, [100-200] bps). Here, Difference = MMAS(Avg.) - Greedy. Worst and Best report the maximum and minimum partition size among 15 runs using MMAS.

Greedy MMAS(Avg.) Worst Best Difference Std.Dev.(MMAS) Time in sec(MMAS) tstat p-value significance
119 113.9333 116 111 -5.0667 1.3345 1534.1015 14.7042 0.0000 +
122 118.9333 121 117 -3.0667 0.9612 1683.1146 12.3572 0.0000 +
114 112.5333 114 111 -1.4667 0.8338 1398.5315 6.8126 0.0000 +
116 116.4 117 115 0.4 0.7368 1739.3478 -2.1026 0.0446 -
135 132.2 135 130 -2.8 1.3202 1814.7264 8.2143 0.0000 +
108 106.0667 107 105 -1.9333 0.8837 1480.2378 8.4731 0.0000 +
108 98.4 101 96 -9.6 1.2421 1295.2485 29.9333 0.0000 +
123 118.4 120 117 -4.6 0.7368 1125.2353 24.1802 0.0000 +
124 119.4667 121 117 -4.5333 1.0601 1044.4141 16.5622 0.0000 +
105 101.8667 103 101 -3.1333 0.7432 1360.1529 16.328 0.0000 +

Table 4: Comparison between the greedy approach of Chrobak et al. (7) and MMAS on random DNA sequences (Group 2, [201-400] bps). Here, Difference = MMAS(Avg.) - Greedy. Worst and Best report the maximum and minimum partition size among 15 runs using MMAS.

Greedy MMAS(Avg.) Worst Best Difference Std.Dev.(MMAS) Time in sec(MMAS) tstat p-value significance
182 179.9333 181 177 -2.0667 1.7099 1773.0398 4.6810 0.0001 +
175 176.2000 177 175 1.2000 0.8619 3966.8293 -5.3923 0.0000 -
196 187.8667 189 187 -8.1333 0.7432 1589.2953 42.3833 0.0000 +
192 184.2667 185 184 -7.7333 0.4577 2431.1580 65.4328 0.0000 +
176 171.5333 173 171 -4.4667 0.9155 1224.8943 18.8965 0.0000 +
170 163.4667 165 160 -6.5333 1.8465 1826.1438 13.7036 0.0000 +
173 168.4667 170 167 -4.5333 1.1872 1802.1655 14.7886 0.0000 +
185 176.3333 177 175 -8.6667 0.8165 1838.5603 41.1096 0.0000 +
174 172.8000 175 172 -1.2000 1.5675 4897.4688 2.9649 0.0061 +
171 167.2000 168 167 -3.8000 0.5606 1886.2098 26.2523 0.0000 +

Table 5: Comparison between the greedy approach of Chrobak et al. (7) and MMAS on random DNA sequences (Group 3, [401-600] bps). Here, Difference = MMAS(Avg.) - Greedy. Worst and Best report the maximum and minimum partition size among 15 runs using MMAS.
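
The t-values in Tables 3 to 5 can be recomputed from the summary columns alone. The text does not spell out how the test was set up, so the following is only a sketch under an assumption: the deterministic greedy result is treated as a second sample of 15 identical values, and a pooled-variance two-sample t-test is applied. This assumption reproduces the tabulated t-values to within rounding. The function name `pooled_t_from_summary` is hypothetical.

```python
import math


def pooled_t_from_summary(greedy, mmas_mean, mmas_std, n_runs=15):
    """Pooled two-sample t statistic, greedy vs. MMAS, from summary values.

    Assumption: the greedy sample is n_runs copies of one value (variance 0).
    """
    if mmas_std <= 0:
        raise ValueError("t statistic is undefined when the MMAS runs do not vary")
    df = 2 * n_runs - 2
    # Pooled variance of the two samples; the greedy term contributes 0.
    pooled_var = ((n_runs - 1) * mmas_std ** 2 + (n_runs - 1) * 0.0) / df
    std_err = math.sqrt(pooled_var * (2.0 / n_runs))
    # Positive when the MMAS average is smaller (better) than greedy.
    return (greedy - mmas_mean) / std_err


# First row of Table 3 and fourth row of Table 4.
t_improved = pooled_t_from_summary(46, 42.8667, 0.3519)
t_worse = pooled_t_from_summary(116, 116.4, 0.7368)
print(round(t_improved, 2), round(t_worse, 2))
```

Under this assumption the test has 2 x 15 - 2 = 28 degrees of freedom, for which the two-sided critical value at the 5% level is about 2.048; a row is marked + or - when the absolute t-value exceeds it.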

5.3.2 Effects of Dynamic Heuristics:

In Section 4.2.2, we discussed the dynamic heuristic employed in our algorithm. To verify its effect, we ran two versions of our algorithm, one with and one without the dynamic heuristic. The results are presented in Table 6, which reports, for each group, the average partition size with and without the dynamic heuristic. A positive difference indicates an improvement due to the dynamic heuristic. Out of 30 cases, we found positive differences in 27, which shows a clear improvement from using the dynamic heuristic. It can also be observed that the positive differences grow as the sequence length increases. Figures 2, 3 and 4 show the case-by-case results: the blue bars represent the partition size with the dynamic heuristic and the red bars the partition size without it.

Group 1 (200 bps) Group 2 (400 bps) Group 3 (600 bps)
MMAS MMAS(w/o heuristic) Difference MMAS MMAS(w/o heuristic) Difference MMAS MMAS(w/o heuristic) Difference
42.7500 43.2500 0.5000 114.2500 115.5000 1.2500 180.0000 183.2500 3.2500
51.5000 50.7500 -0.7500 119.0000 121.0000 2.0000 176.2500 183.2500 7.0000
56.7500 56.5000 -0.2500 112.2500 113.5000 1.2500 188.0000 193.7500 5.7500
43.0000 44.0000 1.0000 116.2500 120.5000 4.2500 184.2500 189.2500 5.0000
43.0000 42.7500 -0.2500 132.2500 134.0000 1.7500 171.7500 173.5000 1.7500
42.2500 42.5000 0.2500 105.5000 107.7500 2.2500 163.2500 168.0000 4.7500
60.0000 60.5000 0.5000 99.0000 99.7500 0.7500 168.5000 170.5000 2.0000
47.0000 47.5000 0.5000 118.0000 121.7500 3.7500 176.2500 178.7500 2.5000
45.7500 46.0000 0.2500 119.5000 120.7500 1.2500 172.7500 179.2500 6.5000
59.2500 61.5000 2.2500 101.7500 103.7500 2.0000 167.2500 172.2500 5.0000

Table 6: Comparison between MMAS with and without the dynamic heuristic on random DNA sequences. Here, Difference = MMAS(w/o heuristic) - MMAS.
Figure 2: Comparison between MMAS with and without dynamic heuristic (Group 1)
Figure 3: Comparison between MMAS with and without dynamic heuristic (Group 2)
Figure 4: Comparison between MMAS with and without dynamic heuristic (Group 3)
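
The counts quoted for Table 6 can be checked directly from its three Difference columns. The short sketch below copies those columns from the table and, as an illustration only, also computes the mean difference per group; the helper name `summarize` is hypothetical.

```python
# Difference columns of Table 6 (MMAS without heuristic minus MMAS with it).
differences = {
    "Group 1": [0.50, -0.75, -0.25, 1.00, -0.25, 0.25, 0.50, 0.50, 0.25, 2.25],
    "Group 2": [1.25, 2.00, 1.25, 4.25, 1.75, 2.25, 0.75, 3.75, 1.25, 2.00],
    "Group 3": [3.25, 7.00, 5.75, 5.00, 1.75, 4.75, 2.00, 2.50, 6.50, 5.00],
}


def summarize(diffs):
    """Return (number of positive differences, mean difference)."""
    positives = sum(1 for d in diffs if d > 0)
    return positives, sum(diffs) / len(diffs)


summary = {group: summarize(d) for group, d in differences.items()}
total_positive = sum(p for p, _ in summary.values())
for group, (positives, mean_diff) in summary.items():
    print(f"{group}: {positives}/10 positive, mean difference {mean_diff:.2f}")
print(f"positive in {total_positive} of 30 cases")
```

The mean differences come out as 0.40, 2.05 and 4.35 for Groups 1, 2 and 3, which matches the observation that the benefit of the dynamic heuristic grows with sequence length.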

5.3.3 Real Gene Sequence:

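Table 7 shows the minimum common partition size found by our approach and by the greedy approach for the real gene sequences. Out of 15 cases, MMAS achieves a significant improvement in 10 cases at the 5% significance level; it is significantly worse in 3 cases, and the difference is not significant in the remaining 2.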

Greedy MMAS(Avg.) Worst Best Difference Std.Dev.(MMAS) Time in sec(MMAS) tstat p-value significance
95 87.66666667 88 87 -7.333333333 0.487950036 863.8083333 58.2065 0.0000 +
161 156.3333333 162 154 -4.666666667 2.350278606 1748.34 7.6901 0.0000 +
121 117.0666667 118 116 -3.933333333 0.883715102 1823.4922 17.2383 0.0000 +
173 164.8666667 167 163 -8.133333333 1.187233679 1823.012533 26.5325 0.0000 +
172 170.3333 172 169 -1.2 1.207121724 2210.153533 3.8501 0.0006 +
153 146 148 143 -7 1.309307341 1953.838267 20.7063 0.0000 +
140 141 142 140 1 0.755928946 2439.0346 -5.1235 0.0000 -
134 133.1333333 136 130 -0.866666667 1.807392228 1406.804533 1.8571 0.0738
149 147.5333333 150 145 -1.466666667 1.505545305 2547.519267 3.7730 0.0008 +
151 150.5333333 152 148 -0.466666667 1.597617273 1619.6364 1.1313 0.2675
126 125 127 123 -1 1 1873.3868 3.8730 0.0006 +
143 139.1333333 141 137 -3.866666667 1.245945806 2473.249067 12.0194 0.0000 +
180 181.5333333 184 179 1.533333333 1.35576371 2931.665333 -4.3802 0.0002 -
152 149.3333333 151 147 -2.666666667 1.290994449 2224.403733 8.0000 0.0000 +
157 161.6 164 160 4.6 1.242118007 1739.612133 -14.3430 0.0000 -

Table 7: Comparison between the greedy approach of Chrobak et al. (7) and MMAS on real gene sequences. Here, Difference = MMAS(Avg.) - Greedy. Worst and Best report the maximum and minimum partition size among 15 runs using MMAS.

6 Conclusion

The Minimum Common String Partition problem has important applications in computational biology. In this paper, we have described a metaheuristic approach to solve the problem. The approach uses static and dynamic heuristic information together with intelligent positioning. Experiments were conducted on random DNA sequences and real gene sequences. In most cases the results are better than those of the greedy algorithm, and the t-test confirms that the improvement is significant. As future work, other metaheuristic techniques may be applied to obtain better solutions to the problem.

References

  • (1) Damaschke, P.: Minimum common string partition parameterized. In Crandall, K., Lagergren, J., eds.: Algorithms in Bioinformatics. Volume 5251 of Lecture Notes in Computer Science. Springer Berlin Heidelberg (2008) 87–98
  • (2) Chen, X., Zheng, J., Fu, Z., Nan, P., Zhong, Y., Lonardi, S., Jiang, T.: Assignment of orthologous genes via genome rearrangement. IEEE/ACM Trans. Comput. Biol. Bioinformatics 2(4) (October 2005) 302–315
  • (3) Ferdous, S.M., Rahman, M.S.: Solving the minimum common string partition problem with the help of ants. In Tan, Y., Shi, Y., Mo, H., eds.: ICSI (1). Volume 7928 of Lecture Notes in Computer Science., Springer (2013) 306–313
  • (4) Watterson, G., Ewens, W., Hall, T., Morgan, A.: The chromosome inversion problem. Journal of Theoretical Biology 99(1) (1982) 1 – 7
  • (5) Goldstein, A., Kolman, P., Zheng, J.: Minimum common string partitioning problem: Hardness and approximations. The Electronic Journal of Combinatorics 12(R50) (2005)
  • (6) Jiang, H., Zhu, B., Zhu, D., Zhu, H.: Minimum common string partition revisited. In: Proceedings of the 4th International Conference on Frontiers in Algorithmics. FAW’10, Berlin, Heidelberg, Springer-Verlag (2010) 45–52
  • (7) Chrobak, M., Kolman, P., Sgall, J.: The greedy algorithm for the minimum common string partition problem. ACM Trans. Algorithms 1(2) (October 2005) 350–366
  • (8) Dorigo, M., Di Caro, G., Gambardella, L.M.: Ant algorithms for discrete optimization. Artif. Life 5(2) (April 1999) 137–172
  • (9) Dorigo, M., Gambardella, L.M.: Ant colony system: A cooperative learning approach to the traveling salesman problem. Trans. Evol. Comp 1(1) (April 1997) 53–66
  • (10) Dorigo, M., Stützle, T.: Ant Colony Optimization. Bradford Company, Scituate, MA, USA (2004)
  • (11) Dorigo, M., Colorni, A., Maniezzo, V.: Positive feedback as a search strategy. Technical Report 91-016, Dipartimento di Elettronica, Politecnico di Milano, Milan, Italy (1991)
  • (12) Dorigo, M.: Optimization, Learning and Natural Algorithms. PhD thesis, Politecnico di Milano, Italy (1992)
  • (13) Dorigo, M., Maniezzo, V., Colorni, A.: The ant system: Optimization by a colony of cooperating agents. IEEE Transactions on Systems, Man, and Cybernetics, Part B 26(1) (1996) 29–41
  • (14) Gambardella, L., Dorigo, M.: Ant-Q: A reinforcement learning approach to the traveling salesman problem, Morgan Kaufmann (1995) 252–260
  • (15) Stützle, T., Hoos, H.: Improving the ant system: A detailed report on the max-min ant system. Technical report (1996)
  • (16) Stützle, T., Hoos, H.: Max-min ant system and local search for the traveling salesman problem. In: IEEE International Conference on Evolutionary Computation (ICEC’97), IEEE Press (1997) 309–314
  • (17) Stützle, T., Hoos, H.H.: Max-min ant system. Future Gener. Comput. Syst. 16(9) (June 2000) 889–914
  • (18) Dorigo, M., Stützle, T.: Ant colony optimization: Overview and recent advances. In Gendreau, M., Potvin, J.Y., eds.: Handbook of Metaheuristics. Volume 146 of International Series in Operations Research & Management Science. Springer US (2010) 227–263
  • (19) Blum, C., Vallès, M.Y., Blesa, M.J.: An ant colony optimization algorithm for DNA sequencing by hybridization. Comput. Oper. Res. 35(11) (November 2008) 3620–3635
  • (20) Shyu, S.J., Tsai, C.Y.: Finding the longest common subsequence for multiple biological sequences by ant colony optimization. Comput. Oper. Res. 36(1) (January 2009) 73–91
  • (21) Blum, C.: Beam-ACO for the longest common subsequence problem. In: IEEE Congress on Evolutionary Computation, IEEE (2010) 1–8
  • (22) Ferdous, S., Das, A., Rahman, M.S., Rahman, M.M.: Ant colony optimization approach to solve the minimum string cover problem. In: International Conference on Informatics, Electronics & Vision (ICIEV), IEEE (2012) 741–746
  • (23) Ferdous, S., Rahman, M.: Solving the minimum common string partition problem with the help of ants. In Tan, Y., Shi, Y., Mo, H., eds.: Advances in Swarm Intelligence. Volume 7928 of Lecture Notes in Computer Science. Springer Berlin Heidelberg (2013) 306–313
  • (24) Stothard, P.: The sequence manipulation suite: JavaScript programs for analyzing and formatting protein and DNA sequences. Biotechniques 28(6) (2000) 1102
  • (25) Villesen, P.: FaBox: An online FASTA sequence toolbox (2007)