Iterative Rounding for the Closest String Problem

Jingchao Chen
School of Informatics, Donghua University, 2999 North Renmin Road, Songjiang District, Shanghai 201620, P. R. China. Email: chen-jc@dhu.edu.cn
Abstract

The closest string problem (CSP) is an NP-hard problem whose task is to find a string that minimizes the maximum Hamming distance to a given set of strings. The CSP can be reduced to an integer program (IP); however, to date there is no known polynomial-time algorithm for solving general IPs. In 2004, Meneses et al. introduced a branch-and-bound (B&B) method for solving the resulting IP. Their algorithm is not always efficient and has exponential time complexity. In this paper, we attempt to solve the IP efficiently with a greedy iterative rounding technique. The proposed algorithm runs in polynomial time and is much faster than the existing B&B IP approach for the CSP. If the number of strings is limited to 3, the algorithm is provably at most 1 away from the optimum. The empirical results show that in many cases we can find an exact solution, and even when we fail to do so, the solution found is very close to the exact one.

Keywords:
closest string problem; mathematical programming; NP-hard problem; integer programming; iterative rounding.

1 Introduction

The task of finding a string that is close to each string in a given set of strings is a combinatorial optimization problem arising in computational molecular biology and coding theory. This problem is called the closest string problem (CSP). We introduce some notation to define the CSP more precisely. Let Σ stand for a fixed finite alphabet. Its elements are called characters, and a sequence of characters over Σ is called a string, denoted by s. The length of s and its j-th character are denoted by |s| and s[j], respectively. The Hamming distance d(s, t) between two equal-length strings s and t is the number of positions at which they disagree. This may be formulated as d(s, t) = ∑_{j=1}^{|s|} δ(s[j], t[j]), where δ(a, b) is one if a ≠ b, and zero otherwise. Let Σ^m be the set of all strings of length m over Σ. Then, the CSP is defined exactly as follows.

Given a finite set S = {s_1, …, s_n} of strings, each of which is in Σ^m, the objective is to find a center string t of length m over Σ minimizing the distance d such that, for every string s_i in S, d(t, s_i) ≤ d.
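To make the definitions concrete, here is a small Python sketch (function names are ours) of the Hamming distance and a brute-force exact CSP solver; the exhaustive search over Σ^m is feasible only for tiny m and alphabets, and serves purely to illustrate the objective max_i d(t, s_i):

```python
from itertools import product

def hamming(s, t):
    """Number of positions where equal-length strings s and t disagree."""
    assert len(s) == len(t)
    return sum(a != b for a, b in zip(s, t))

def closest_string_bruteforce(strings, alphabet):
    """Exact CSP solver by exhaustive search over all strings of length m.

    Only practical for very small m and |alphabet|; shown to make the
    objective max_i d(t, s_i) concrete.
    """
    m = len(strings[0])
    best_t, best_d = None, m + 1
    for t in product(alphabet, repeat=m):
        t = "".join(t)
        d = max(hamming(t, s) for s in strings)
        if d < best_d:
            best_t, best_d = t, d
    return best_t, best_d

center, dist = closest_string_bruteforce(["aaa", "aab", "aba"], "ab")
print(center, dist)  # prints: aaa 1
```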

The CSP has received the attention of many researchers in recent years, and the literature on it is abundant. In theory, Frances and Litman [FL1] have proven that it is NP-hard. However, if the distance d is fixed, an exact solution can be found in polynomial time [GN1, BG1]. For the general case where d is variable, one turns to approximation algorithms, and several with good theoretical precision exist. For example, Gasieniec et al. [GJ1] and Lanctot et al. [LL1] independently developed a (4/3)-approximation algorithm. On this basis, Li et al. [LM1] presented a polynomial-time approximation scheme (PTAS). However, the PTAS is not practical.

Meneses et al. [ML1] studied many approximation algorithms and found that the above-mentioned algorithms are of only theoretical importance, not directly applicable in bioinformatics practice because of their high time complexity. For this reason, they suggested reducing the CSP to an integer-programming (IP) problem and then solving the IP with a branch-and-bound (B&B) algorithm. Unfortunately, integer programs are also NP-hard; so far, no polynomial-time algorithm for solving general integer programs has been found. Furthermore, B&B has its own drawbacks: it easily leads to memory explosion due to excessive accumulation of active nodes. In fact, our empirical results show that the B&B IP is not efficient. Even on instances of moderate size, the B&B IP sometimes fails to find an optimal solution.

We want to find an exact solution efficiently via a technique called iterative rounding. We chose this technique because Jain [JA1] and Cheriyan and Vempala [CV1] used it successfully to obtain better approximation algorithms for the generalized Steiner network problem. Although our problem is different from theirs, both are NP-hard, so we believed the technique would also be applicable to the CSP. The iterative rounding method used here is greedy and may be outlined as follows: first we formulate the CSP as an IP; then we use the LP solution to round some of the highest-valued variables; finally we repeatedly re-solve the LP for the remaining variables until all variables are set. The method has a small memory requirement and avoids the memory explosion of the B&B IP. It is a polynomial-time algorithm which, in many cases, finds an exact solution in a very short time for a CSP instance of moderate size. The computational experiments reveal that our algorithm is not only much faster than the existing one but also of high quality. If the number of strings is limited to 3, the error of the algorithm is proven to be at most one.

Unlike existing rounding schemes such as that of Lanctot et al. [LL1], which are randomized, our rounding scheme is iterative and deterministic. An important contribution of our work is a new approach toward a polynomial-time exact algorithm for the CSP.

2 Iterative rounding for the CSP

The CSP can be reduced to a 0-1 integer programming problem as follows:

    minimize    d
    subject to  ∑_{a ∈ Σ} x_{j,a} = 1,                j = 1, …, m
                m − ∑_{j=1}^{m} x_{j, s_i[j]} ≤ d,    i = 1, …, n
                x_{j,a} ∈ {0, 1},

where x_{j,a} = 1 means that character a is chosen at position j of the center string, and d is a non-negative integer. Solving this IP by directly applying LP (linear programming) relaxation and randomized rounding does not work well, because the randomized rounding procedure incurs large errors, especially when the optimal distance is small [LM1]. Therefore, we decided to look for other rounding techniques. Jain [JA1] used iterative rounding to obtain a 2-approximation algorithm for the generalized Steiner network problem. Based on our observations, iterative rounding also suits the CSP, and we use it here. The following pseudo-code is a CSP algorithm with iterative rounding.

Algorithm A

1. Formulate the CSP as an IP.

2. S_1 ← ∅, S_0 ← ∅.

3. for k = 1 to m do

     Fix all variables in S_1 to 1 and all variables in S_0 to 0.

     Solve the LP relaxation for the sub-CSP on the unfixed variables.

     Pick an unfixed variable x_{j,a} with the highest value.

     S_1 ← S_1 ∪ {x_{j,a}},  S_0 ← S_0 ∪ {x_{j,b} : b ≠ a}.

   end for

4. Convert S_1 into a solution (a center string t) to the CSP as follows:

   t[j] = a whenever x_{j,a} ∈ S_1, for all j = 1, …, m.
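Algorithm A above can be sketched in Python. This is a minimal, illustrative implementation under the standard IP formulation, not the author's reference code: it assumes `scipy.optimize.linprog` as the LP solver, and the variable layout and function names are ours.

```python
from scipy.optimize import linprog

def iterative_rounding_csp(strings, alphabet):
    """Greedy iterative rounding for the CSP (a sketch of Algorithm A).

    Variables: x[j, a] for each position j and character a, plus the
    distance d.  Each round solves the LP relaxation with the already
    fixed variables pinned via their bounds, then fixes the unfixed
    variable of highest fractional value to 1 (its siblings to 0).
    """
    m = len(strings[0])
    k = len(alphabet)
    idx = {(j, a): j * k + ai
           for j in range(m) for ai, a in enumerate(alphabet)}
    nx = m * k                      # x-variables; index nx is d
    c = [0.0] * nx + [1.0]          # objective: minimize d

    # One row per input string s_i:  m - sum_j x[j, s_i[j]] <= d.
    A_ub, b_ub = [], []
    for s in strings:
        row = [0.0] * (nx + 1)
        for j in range(m):
            row[idx[(j, s[j])]] = -1.0
        row[nx] = -1.0              # the -d term
        A_ub.append(row)
        b_ub.append(-float(m))

    # One row per position j: exactly one character is chosen.
    A_eq, b_eq = [], []
    for j in range(m):
        row = [0.0] * (nx + 1)
        for a in alphabet:
            row[idx[(j, a)]] = 1.0
        A_eq.append(row)
        b_eq.append(1.0)

    fixed = {}                      # position -> chosen character
    for _ in range(m):
        bounds = []
        for j in range(m):
            for a in alphabet:
                if j in fixed:
                    bounds.append((1.0, 1.0) if fixed[j] == a else (0.0, 0.0))
                else:
                    bounds.append((0.0, 1.0))
        bounds.append((0.0, float(m)))   # 0 <= d <= m
        res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq,
                      bounds=bounds, method="highs")
        # Pick the unfixed variable of highest fractional value.
        j_best, a_best, v_best = None, None, -1.0
        for j in range(m):
            if j in fixed:
                continue
            for a in alphabet:
                v = res.x[idx[(j, a)]]
                if v > v_best:
                    j_best, a_best, v_best = j, a, v
        fixed[j_best] = a_best
    return "".join(fixed[j] for j in range(m))
```

For two input strings, Theorem 2.1 guarantees an exact solution; e.g. `iterative_rounding_csp(["aaaa", "aabb"], "ab")` returns a center at Hamming distance 1 from both strings.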

Clearly, Algorithm A is a polynomial-time algorithm. Furthermore, we have

Theorem 2.1

If the input consists of only two strings, i.e., n = 2, then Algorithm A always finds an exact solution to the CSP.

Proof

Without loss of generality, we assume

(Notice that in the case when some positions of the two strings have the same characters, the proof is simpler than in the above case.)
It is easy to see that the 1st LP optimal solution to the CSP is


where for .
Without loss of generality, assume .
If , there exists such that


Say, is just a solution to this equation. Then, when , setting to 1, we can get the 2nd LP optimal solution

,
By induction on , we can prove that the -th () LP optimal solution satisfies

,
where at least values out of are integers.
Hence, if m is even, this is just an optimal solution to the CSP.
If m is odd, assume one variable is not an integer; whether it is set to 0 or not, both cases yield an optimal solution to the CSP. Therefore, the theorem is proved. ∎

Define the error of an algorithm as the difference between the exact (optimal) distance and the distance obtained. We have

Theorem 2.2

If the input consists of only three binary strings, i.e., n = 3 and Σ = {0, 1}, then the error of Algorithm A is at most one.

Proof

In general, any three strings can be simplified into


Assume that , and the closest string (optimal solution) is of the following form,


where is the number of 0's in the corresponding substring of ; similarly for the others.
Assume the distances between the center string and the three strings are , and , respectively. Then we have


The optimal distance is denoted by . Then .
The following proposition is true.


If it is false, by (1) we have


It follows that , which is a contradiction. By (2), we have that one of the following three propositions is true.

(a) can constitute an optimal solution, but cannot.

(b) can constitute an optimal solution, but cannot.

(c) can constitute an optimal solution, but cannot.
Here we consider only the second case to prove the theorem, since the other cases are similar. That is, assume


This implies


If it is false, (1) can be rewritten as


is also an optimal solution, which contradicts (3).
Without loss of generality, suppose


(If , the subsequent proof is similar). This implies


If it is false, letting and , we can rewrite (1) as


By (4), we have


However, in fact, by fixing , solving (1) yields . Then by (4) and (5), , which is a contradiction.
By (3) and (6), (1) can be rewritten as


where and .
This implies


If it is false, by (7), we can obtain a solution with , which contradicts (3).
Let be 0-variables of the LP, 1-variables. Define


Let , and denote the distances between the three strings and the center string of the LP, respectively. Then,


Let denote the optimal distance of the LP. Then . The goal of the LP is to find a minimum satisfying (9). Next we analyze the error caused by Algorithm A to solve the LP given in (9).
Depending on whether the condition holds or not, we proceed with our proof. First, let us consider


This implies


If it is false, by (8), we have

or
Then by (4), we have that is also an optimal solution, which contradicts (3).
By (11) and (7), it is easy to verify . Then


The addition of the 2nd and 3rd equations in (9) yields


It follows that . Thus


This implies that, without rounding error, Algorithm A fixes all the letters in the substring to 1. It remains to show how to compute and .
Let be the maximum of such that


Similarly, , and are the maxima of , and s.t. (13). Let be the number of letters in the substring fixed to 0 by the -th rounding operation of Algorithm A. Similarly for , and . If for all , , , and , Algorithm A attains an exact solution. Otherwise, there exists such that only one of , , and exceeds its maximum. Without loss of generality, assume = (for the other cases, the proof is similar). By (13), there exist and such that

or or


Below we justify

for all ,
Assume the solution of the -th () LP is


Clearly ,
By (14), we have


Therefore . Namely, decreases as decreases. Thus, by (17) we have


The claim of (15) is proved. Next we shall show that
for all implies


By (16), (19) and (15), we have


Therefore, by (18), we have


(14) can be rewritten as


Clearly, is a feasible solution of the -th () LP, but not necessarily optimal. Therefore . By the constraints on and in (14), it is easy to verify

.
Thus, by (21) and (22), the claim of (20) is proved.
Below we shall prove

s.t. implies ,
Assume , ,
Then, the of the -th LP satisfies
Then, by (19) and (24), we have


Thus
On the other hand, by (14), we can prove


Therefore is a feasible solution of the -th LP. It means .
Thus, by (26), . This implies . Then by (25), the claim of (23) is proven.
In a way similar to the proof of (23), we can prove

s.t. implies ,
By (20), (23), (27) and the previous proof, we conclude that in the case , , the error of Algorithm A is at most one. Now we consider the case


The addition of the 1st and 2nd equations in (7) yields


The addition of the 1st and 2nd equations in (9) yields


Then by (28), . This is equivalent to . That is, without rounding error, Algorithm A fixes all the letters of the substring to 0. It remains to show how to compute , , and . By symmetry, we can prove, in a way similar to the previous case, that Algorithm A computes , , and within one error of the optimal distance. ∎

Based on our empirical observations, the error caused by the algorithm was always within one. Hence, for any n (the number of input strings), we have

Conjecture 1

For any input, the error of Algorithm A is at most one.

3 Improving the running time and quality of the solution

To speed up the algorithm, we present Algorithm B, which picks multiple (rather than single) high-valued variables to round up at a time. That is, in the rounding phase, the algorithm always searches for all variables whose values are high enough, sets them to one, and sets the other variables at the same positions to zero. Selection is controlled by a threshold parameter α, which is set to 0.9 in our experiments: whenever the value of a variable x_{j,a} is at least α, we fix the j-th position of the solution to the character a.

Algorithm B
Input: a set S of strings and a threshold α
Output: a center string t close to every string in S

1. Mark all positions j = 1, …, m as unfixed.

2. Repeat the following process until all positions are fixed:

2.1 Solve the LP relaxation on the unfixed positions.

2.2 Let x_{j,a} be the value of the variable for character a at position j in the LP optimal solution.

    If there exist unfixed positions j with max_a x_{j,a} ≥ α,

    then for all such j, fix t[j] to the character attaining the maximum;

    else find the single pair (j, a) with the highest value x_{j,a} and fix t[j] = a.
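The threshold rounding step (2.2) can be illustrated with a small pure-Python sketch operating on a given fractional LP solution; the data layout and helper name are ours:

```python
def threshold_round(frac, alphabet, alpha=0.9):
    """One rounding pass in the style of Algorithm B's step 2.2.

    frac[j][a] is the LP value of character a at position j.  Every
    position whose best character has value >= alpha is fixed at once;
    if no position qualifies, only the single best one is fixed.
    Returns {position: character}.
    """
    best = {j: max(alphabet, key=lambda a: frac[j][a]) for j in frac}
    hits = {j: a for j, a in best.items() if frac[j][a] >= alpha}
    if hits:
        return hits
    # Fall back to fixing the single highest-valued variable.
    j_star = max(frac, key=lambda j: frac[j][best[j]])
    return {j_star: best[j_star]}

frac = {0: {"a": 0.95, "b": 0.05},
        1: {"a": 0.55, "b": 0.45},
        2: {"a": 0.02, "b": 0.98}}
print(threshold_round(frac, "ab"))  # prints: {0: 'a', 2: 'b'}
```

With α = 0.9, positions 0 and 2 are fixed in one pass, while the uncertain position 1 is left for the next LP re-solve.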

To obtain higher precision, we extend Algorithm B to Algorithm C, which tries not only the best but also the second-best character at the least certain positions. If the first solution is not optimal, we select the 8 positions most likely to need re-solving, in increasing order of their best variable values. Each restart overrides one of these 8 positions, setting its value to the character corresponding to the second-best variable, and re-solves from this modified initial setting. Thus, with 8 different settings, we find 8 further solutions, and finally we choose the best of the 9 solutions, including the first one.

Algorithm C

1. For each position j, let first[j] store the character with the largest variable value at position j, and second[j] the character with the second-largest value.

2. Invoke Algorithm B with the following modification: the "else" statement of Algorithm B also records first[j] and second[j] for the selected position.

3. If the objective value of the solution equals the rounded-up LP value, return.

4. Otherwise, for i = 1 to 8 do:

   Set t[j] = second[j], where j is the position with the i-th smallest value of first[j].

   Use Step 2 of Algorithm B to re-solve the CSP.

   If the current solution is better than the best found so far, keep it.

5. Return the best solution found.
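The restart selection in step 4 can be sketched in pure Python; the `first`/`second` bookkeeping follows the description above, and the helper name is ours:

```python
def restart_candidates(first_val, second_char, k=8):
    """Pick the k positions where the best character won least decisively.

    first_val[j]  : LP value of the best character at position j
    second_char[j]: the second-best character at position j
    Returns [(position, second_best_character)] for the k positions with
    the smallest winning values -- the positions Algorithm C re-solves,
    each restart overriding one position with its second-best character.
    """
    order = sorted(first_val, key=lambda j: first_val[j])
    return [(j, second_char[j]) for j in order[:k]]

first_val = {0: 0.99, 1: 0.52, 2: 0.70, 3: 0.55}
second_char = {0: "b", 1: "b", 2: "a", 3: "a"}
print(restart_candidates(first_val, second_char, k=2))  # prints: [(1, 'b'), (3, 'a')]
```

Positions with a near-tie between the best and second-best character (here positions 1 and 3) are exactly those where greedy rounding is most likely to err, so they are re-solved first.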

 n    m | Average distance: LP / Alg.C / B&B IP | Max distance error: Alg.C / B&B IP | Average time in ms: LP / Alg.C / B&B IP
10 300 175.00 175.00 175.00 0.80 0.80 52 182 8203
10 400 231.67 231.67 231.67 0.60 0.60 78 266 15271
10 500 293.00 293.00 293.00 0.80 0.80 114 349 25261
10 600 347.00 347.00 347.00 0.80 0.80 151 843 39344
10 700 409.00 409.00 409.00 0.60 0.60 192 886 55786
10 800 462.67 462.67 462.67 0.70 0.70 234 609 78167
15 300 185.33 185.67 185.67 1.02 1.02 104 375 342166
15 400 246.67 247.33 246.67 1.23 0.80 130 1094 263583
15 500 306.67 307.00 306.67 1.07 0.40 172 838 37786
15 600 366.67 367.00 366.67 1.27 0.46 229 1813 59198
15 700 428.67 428.67 428.67 0.97 0.97 281 495 81906
15 800 491.00 491.00 491.00 0.80 0.80 308 552 107703
20 300 190.67 191.00 191.00 1.12 1.12 130 880 344474
20 400 252.33 252.67 252.67 1.03 1.03 182 937 353969
20 500 315.33 315.33 315.33 0.59 0.59 260 1135 53875
20 600 379.67 380.00 380.00 1.22 1.22 312 1823 385182
20 700 443.33 443.33 443.33 0.73 0.73 401 917 121641
20 800 505.00 505.00 505.00 0.88 0.88 474 547 171245
25 300 195.00 196.00 196.00 1.34 1.34 151 1911 1000021
25 400 259.00 260.00 259.67 1.49 1.33 239 2729 694192
25 500 323.00 323.67 323.67 1.27 1.27 334 2589 689667
25 600 387.67 388.00 387.67 1.40 0.76 411 1817 113396
25 700 451.00 451.33 451.33 1.09 1.09 516 2594 435693
25 800 515.67 516.67 516.67 1.11 1.11 594 4776 1000016
30 300 197.33 197.67 197.67 1.26 1.26 172 1114 349266
30 400 263.00 263.67 263.33 1.71 1.02 276 2646 370468
30 500 328.33 329.00 328.67 1.63 1.04 401 2797 398458
30 600 392.67 393.00 393.33 1.39 1.54 516 4089 708740
30 700 459.33 460.00 459.67 1.57 1.44 609 4625 459099
30 800 523.00 523.33 523.67 1.50 1.52 740 5953 755380
Table 1: Empirical Results for the Alphabet with 4 Characters

4 Simulations

On a Celeron 2.2 GHz CPU, we tested two algorithms: our Algorithm C and the B&B IP of Meneses et al., which is regarded as the best IP-based method for the CSP so far.

We carried out many experiments, including the McClure data set [ML1] and random instances over alphabets with 2, 4 and 20 characters. In all experiments, our algorithm performed very well. Owing to space limits, we present only the empirical results for random instances over the alphabet with 4 characters. In Table 1, each entry is the average over three instances. Parameters n and m stand for the number of strings and the string length. "Distance" and "time" refer to the minimum distance found and the running time in milliseconds. The LP average distance is computed by taking the ceiling of the LP value; the reason for taking the ceiling is that the optimal CSP solution is no less than the ceiling of the LP value. In the 6th and 7th columns, the max distance error is defined as the maximum difference between a solution and the corresponding LP fractional solution. The maximum time allowed for each instance was 1000 seconds. As seen in Table 1, we found an exact solution in all but a few instances. In terms of running time, our improvement was huge: our algorithm was from 32 up to 912 times faster than the B&B IP, and in other experiments, not listed here, it was up to 1765 times faster. In some cases its running time was even close to that of computing a single LP. Notice that our algorithm generally invokes many LP solves; even so, in the worst case it was only 20 times slower than computing a single LP.

References

  • [BL1] Ben-Dor, A., Lancia, G., Perone, J., Ravi, R.: Banishing bias from consensus sequences. 8th Ann. Symp. Comb. Pat. Match., LNCS 1264, 247–261 (1997)
  • [BG1] Berman, P., Gumucio, D., Hardison, R., Miller, W., Stojanovic, N.: A linear-time algorithm for the 1-mismatch problem. Workshop Alg. & Data Stru., 125–135 (1997)
  • [CV1] Cheriyan, J., Vempala, S.: Edge covers of setpairs and the iterative rounding method. IPCO 2001, LNCS 2081, 30–44 (2001)
  • [FL1] Frances, M., Litman, A.: On covering problems of codes. Theor. Comput. Syst. 30, 113–119 (1997)
  • [GJ1] Gasieniec, L., Jansson, J., Lingas, A.: Efficient approximation algorithms for the Hamming center problem. 10th ACM-SIAM Symp. Discr. Alg., 905–906 (1999)
  • [GN1] Gramm, J., Niedermeier, R., Rossmanith, P.: Exact solutions for closest string and related problems. ISAAC 2001, LNCS 2223, 441–452 (2001)
  • [JA1] Jain, K.: A factor 2 approximation algorithm for the generalized Steiner network problem. Combinatorica 21(1), 39–60 (2001)
  • [LL1] Lanctot, K., Li, M., Ma, B., Wang, S., Zhang, L.: Distinguishing string selection problems. Information and Computation 185, 41–55 (2003)
  • [LM1] Li, M., Ma, B., Wang, L.: On the closest string and substring problems. J. ACM 49, 157–171 (2002)
  • [ML1] Meneses, C., Lu, Z., Oliveira, C., Pardalos, P.: Optimal solutions for the closest string problem via integer programming. INFORMS J. Comput. 16(4), 419–429 (2004)