Iterative Rounding for the Closest String Problem
The closest string problem (CSP) is an NP-hard problem whose task is to find a string minimizing the maximum Hamming distance to a given set of strings. The problem can be reduced to an integer program (IP); however, to date, no polynomial-time algorithm for IP is known. In 2004, Meneses et al. introduced a branch-and-bound (B&B) method for solving the IP. Their algorithm is not always efficient and has exponential time complexity. In this paper, we attempt to solve the IP efficiently by a greedy iterative rounding technique. The proposed algorithm runs in polynomial time and is much faster than the existing B&B IP for the CSP. If the number of strings is limited to 3, the algorithm is provably at most 1 away from the optimum. The empirical results show that in many cases we can find an exact solution. Even when we fail to find an exact solution, the solution found is very close to the exact one.
Keywords: closest string problem; mathematical programming; NP-hard problem; integer programming; iterative rounding.
1 Introduction
The task of finding a string that is close to each string in a given set of strings is a combinatorial optimization problem arising in computational molecular biology and coding theory. This problem is called the closest string problem (CSP). We introduce some notation to define the CSP more precisely. Let Σ stand for a fixed finite alphabet. Its elements are called characters, and a sequence of characters over Σ is called a string, denoted by s. The length and j-th character of s are denoted by |s| and s[j], respectively. d(s, t) is defined as the Hamming distance between two equal-length strings s and t, i.e., the number of positions where they do not agree. This may be formulated as d(s, t) = Σ_{j=1}^{|s|} δ(s[j], t[j]), where δ(s[j], t[j]) is one if s[j] ≠ t[j], and zero otherwise. Let Σ^L be the set of all strings of length L over Σ. Then, the CSP is defined exactly as follows.
Given a finite set S = {s_1, s_2, …, s_n} of strings, each of which is in Σ^L, the objective is to find a center string t of length L over Σ minimizing the distance d such that, for every string s_i in S, d(t, s_i) ≤ d.
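For intuition, the definition can be checked by brute force on a tiny instance: enumerate every candidate string in Σ^L and keep one minimizing the maximum Hamming distance. This illustrative sketch is ours, not part of the paper, and is only feasible for very small L:

```python
from itertools import product

def hamming(s, t):
    """Hamming distance between two equal-length strings."""
    return sum(a != b for a, b in zip(s, t))

def closest_string_bruteforce(strings):
    """Exact CSP solver by exhaustive search over Sigma^L.

    Shown purely to illustrate the problem definition; the paper's
    algorithm avoids this exponential enumeration.
    """
    alphabet = sorted(set("".join(strings)))
    L = len(strings[0])
    best_t, best_d = None, None
    for cand in product(alphabet, repeat=L):
        t = "".join(cand)
        d = max(hamming(t, s) for s in strings)
        if best_d is None or d < best_d:
            best_t, best_d = t, d
    return best_t, best_d

# For S = {"abab", "abba", "aabb"}, the center "abbb" is within
# distance 1 of every input string, which is optimal here.
t, d = closest_string_bruteforce(["abab", "abba", "aabb"])
```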
The CSP has received the attention of many researchers in recent years, and the literature on it abounds. In theory, Frances and Litman [FL1] have proven that it is NP-hard. However, if the distance d is fixed, an exact solution to the problem can be found in polynomial time [GN1, BG1]. For the general case where d is variable, one turns to approximation algorithms. There have been several approximation algorithms with good theoretical precision. For example, Gasieniec et al. [GJ1] and Lanctot et al. [LL1] independently developed a 4/3-approximation algorithm. Building on this, Li et al. [LM1] presented a polynomial-time approximation scheme (PTAS). However, the PTAS is not practical.
Meneses et al. [ML1] studied many approximation algorithms and found that the above-mentioned algorithms are of only theoretical importance, not directly applicable in bioinformatics practice, because of their high time complexity. For this reason, they suggested reducing the CSP to an integer-programming (IP) problem and then using a branch-and-bound (B&B) algorithm to solve it. Unfortunately, integer programs are also NP-hard; so far, no polynomial-time algorithm for solving them has been found. Furthermore, B&B has its own drawbacks: it easily leads to memory explosion due to excessive accumulation of active nodes. In fact, our empirical results show that the B&B IP is not efficient. Even for instances of moderate size, the B&B IP sometimes fails to find an optimal solution.
We want to find an exact solution efficiently via a technique called iterative rounding. The reason for using this technique is that Jain [JA1] and Cheriyan and Vempala [CV1] used it successfully to obtain better approximation algorithms for the generalized Steiner network problem. Although our problem is different from theirs, both are NP-hard; therefore, we believe the technique is applicable to the CSP. The iterative rounding method used here is greedy. It may be outlined as follows: first we formulate the CSP as an IP; then we use the LP solution to round some of the higher-valued variables; finally, we repeatedly re-solve the LP for the remaining variables until all variables are set. The method has small memory requirements and avoids the memory explosion of the B&B IP. It is a polynomial-time algorithm which, in many cases, finds an exact solution in a very short time for a CSP instance of moderate size. The computational experiments reveal that our algorithm is not only much faster than the existing one, but also produces high-quality solutions. If the number of strings is limited to 3, the error of the algorithm is proven to be at most one.
Unlike existing rounding schemes, such as the randomized rounding scheme of Lanctot et al. [LL1], our rounding scheme is iterative rather than randomized. An important contribution of our algorithm is in setting up a new approach toward a polynomial-time exact algorithm for the CSP.
2 Iterative rounding for the CSP
The CSP can be reduced to a 0-1 integer programming problem as follows:

minimize d
subject to Σ_{a∈Σ} x_{j,a} = 1, for j = 1, …, L,
Σ_{j=1}^{L} x_{j, s_i[j]} ≥ L − d, for i = 1, …, n,

where x_{j,a} ∈ {0, 1} for every position j and character a, and d is a non-negative integer. Here x_{j,a} = 1 means that the j-th character of the center string is a, so that d(t, s_i) = L − Σ_{j} x_{j, s_i[j]} ≤ d. Solving this IP problem by directly applying LP (linear programming) relaxation and randomized rounding does not work well, because the randomized rounding procedure leads to large errors, especially when the optimal distance is small [LM1]. Therefore, we looked for other rounding techniques. Jain [JA1] used iterative rounding to get a 2-approximation algorithm for the generalized Steiner network problem. Based on our observation, iterative rounding is also suited to the CSP; hence, we use it to solve the CSP. The following pseudo-code is a CSP algorithm with iterative rounding.
Algorithm A
Formulate the CSP as an IP; initialize the fixed-variable sets F_1 and F_0 to be empty.
for k = 1 to L do
    Fix all variables in F_1 to 1, and all variables in F_0 to 0
    Solve the LP for the sub-CSP on the unfixed variables
    Pick an unfixed variable x_{j,a} with the highest value, i.e., x_{j,a} = max over all unfixed variables
    Put x_{j,a} into F_1, and every other variable x_{j,a'} (a' ≠ a) at position j into F_0
Convert x into a solution (a center string t) to the CSP as follows:
    t[j] = a if x_{j,a} = 1, for all j.
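The loop above maps directly onto an off-the-shelf LP solver. Below is a minimal Python sketch of the greedy iterative rounding, using scipy's `linprog` for the LP relaxation; the variable layout, function names, and tie-breaking are our own illustration, not the paper's implementation:

```python
import numpy as np
from scipy.optimize import linprog

def hamming(s, t):
    return sum(a != b for a, b in zip(s, t))

def closest_string_iter_round(strings):
    """Greedy iterative LP rounding for the CSP (a sketch of Algorithm A).

    One variable x[j,a] per position j and character a, plus the distance d.
    Each round solves the LP relaxation, then fixes the highest-valued
    unfixed variable to 1 (and its siblings at the same position to 0).
    """
    alphabet = sorted(set("".join(strings)))
    L, A = len(strings[0]), len(alphabet)
    col = {(j, a): j * A + k for j in range(L) for k, a in enumerate(alphabet)}
    nv = L * A + 1                       # last column is d
    c = np.zeros(nv); c[-1] = 1.0        # objective: minimize d

    # sum_a x[j,a] = 1 for every position j
    A_eq = np.zeros((L, nv))
    for j in range(L):
        A_eq[j, j * A:(j + 1) * A] = 1.0
    b_eq = np.ones(L)

    # L - sum_j x[j, s_i[j]] <= d  for every input string s_i
    A_ub = np.zeros((len(strings), nv))
    for i, s in enumerate(strings):
        for j in range(L):
            A_ub[i, col[(j, s[j])]] = -1.0
        A_ub[i, -1] = -1.0
    b_ub = -L * np.ones(len(strings))

    bounds = [(0.0, 1.0)] * (L * A) + [(0.0, None)]
    center = [None] * L
    for _ in range(L):                   # one position is fixed per round
        res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq,
                      bounds=bounds)
        j_best, a_best, v_best = None, None, -1.0
        for (j, a), k in col.items():
            if center[j] is None and res.x[k] > v_best:
                j_best, a_best, v_best = j, a, res.x[k]
        center[j_best] = a_best
        for a in alphabet:               # freeze the chosen position via bounds
            bounds[col[(j_best, a)]] = (1.0, 1.0) if a == a_best else (0.0, 0.0)
    t = "".join(center)
    return t, max(hamming(t, s) for s in strings)
```

Fixing variables through the `bounds` argument, rather than rebuilding the LP, keeps each round cheap: only the box constraints change between solves.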
Clearly, Algorithm A is a polynomial-time algorithm. Furthermore, we have
If the input consists of only two strings, i.e., n = 2, then Algorithm A always finds an exact solution to the CSP.
Proof. Without loss of generality, we assume that the two strings s_1 and s_2 differ in every position. (Notice that in the case when some positions of the two strings s_1, s_2 have the same characters, the proof is simpler than in the above case.)
It is easy to see that the first LP optimal solution to the CSP is x_{j,s_1[j]} = x_{j,s_2[j]} = 1/2,
where j = 1, …, L; its objective value is d = L/2.
Without loss of generality, assume .
If , there exists such that
Suppose is a solution to this equation. Then, when , setting it to 1, we can get the 2nd LP optimal solution
By induction on , we can prove that the -th () LP optimal solution satisfies
where at least values out of are integers.
Hence, if L is even, d = L/2 if and only if there are L/2 ones among x_{1,s_1[1]}, …, x_{L,s_1[L]}. This is just an optimal solution to the CSP.
If L is odd, assume x_{k,s_1[k]} is not an integer. We have (L−1)/2 ones among x_{1,s_1[1]}, …, x_{L,s_1[L]} if we set x_{k,s_1[k]} to 0, and (L+1)/2 ones otherwise. Both cases yield an optimal solution to the CSP with d = ⌈L/2⌉. Therefore, the theorem is proved. ∎
Define the error of an algorithm as the difference between the optimal distance and the distance obtained. We have
If the input consists of only three binary strings, i.e., n = 3 and Σ = {0, 1}, then the error of Algorithm A is at most one.
In general, any three strings can be simplified into
Assume that , and the closest string (optimal solution) is of the following form,
where is the number of 0’s in the substring of ; similarly for .
Assume that the distances between and the three strings are , and , respectively; we have
The optimal distance is denoted by . Then .
Then the following proposition is true.
If it is false, by (1) we have
It follows that , which is a contradiction. By (2), we have that one of the following three propositions is true.
(a) can constitute an optimal solution, but cannot.
(b) can constitute an optimal solution, but cannot.
(c) can constitute an optimal solution,
Here we consider only the 2nd case to prove the theorem, since the other cases are similar. That is, assume
If it is false, (1) can be rewritten as
is also an optimal solution, which contradicts (3).
Without loss of generality, suppose
(If , the subsequent proof is similar.) This implies
If it is false, letting and , we can rewrite (1) as
By (4), we have
However, in fact, by fixing , solving (1) yields . Then by (4) and (5), , which is a contradiction.
By (3) and (6), (1) can be rewritten as
where and .
If it is false, by (7), we can obtain a solution with , which contradicts (3).
Let denote the 0-variables of the LP, and the 1-variables. Define
Let , and denote the distances between the three strings and the center string of the LP, respectively. Then,
Let denote the optimal distance of the LP. Then . The goal of the LP is to find a minimum satisfying (9). Next we analyze the error caused by Algorithm A to solve the LP given in (9).
We proceed with the proof according to whether or not. First, let us consider
If it is false, by (8), we have
Then, by (4), we have that is also an optimal solution, which contradicts (3).
By (11) and (7), it is easy to verify . Then
The addition of the 2nd and 3rd equations in (9) yields
It follows that . Thus
This implies that, without rounding error, Algorithm A fixes all the letters in the substring to 1. It remains to show how to compute and .
Let be the maximum of such that
Similarly, , and are the maxima of , and satisfying (13). Let be the number of letters in the substring fixed to 0 by the -th rounding operation of Algorithm A; similarly for , and . If, for all , , , and , then Algorithm A attains an exact solution. Otherwise, there exists such that only one of , , and exceeds its maximum. Without loss of generality, assume = (for the other cases the proof is similar). By (13), there exist and such that
Below we justify
for all ,
Assume the solution of the -th () LP is
By (14), we have
Therefore . Namely, decreases as decreases. Thus, by (17) we have
The claim of (15) is proved. Next we shall show that
for all implies
By (16), (19), and (15), we have
Therefore, by (18), we have
(14) can be rewritten as
Clearly, is a feasible solution of the -th () LP, but not necessarily optimal. Therefore . By the constraint of and in (14), it is easy to verify
Thus, by (21) and (22), the claim of (20) is proved.
Below we shall prove
Assume , ,
Then, the of the -th LP satisfies
Then, by (19) and (24), we have
On the other hand, by (14), we can prove
Therefore is a feasible solution of the -th LP. It means .
Thus, by (26), . This implies . Then, by (25), the claim of (23) is proven.
In a way similar to the proof of (23), we can prove
By (20), (23), (27), and the previous proof, we conclude that in the case , , the error of Algorithm A is at most one. Now we consider the case
The addition of the 1st and 2nd equations in (7) yields
The addition of the 1st and 2nd equations in (9) yields
Then, by (28), . This is equivalent to . That is, without rounding error, Algorithm A fixes all the letters of the substring to 0. It remains to show how to compute , , and . By symmetry, we can prove, in a way similar to the above, that Algorithm A computes , , and within one error of the optimal distance. ∎
Based on our empirical observations, the error caused by the algorithm was always within one. Hence, for any , the number of input strings, we conjecture:
For any input, the error of Algorithm A is at most one.
3 Improving the running time and quality of the solution
To speed up the algorithm, we present Algorithm B, which rounds multiple (not single) higher-valued variables at a time. That is, in the rounding phase, this algorithm always searches for multiple higher-valued variables, sets them to one, and sets the other variables at the same positions to zero. Selection is controlled by a threshold parameter α, which is set to 0.9 in our experiments. As long as x_{j,a} ≥ α, we set the solution at the j-th position to a.
Input: a set S of strings and a threshold α
Output: a center string t close to every string in S
1. for j = 1 to L do t[j] ← undefined.
2. repeat the following process until all t[j] are defined.
2.1 Solve the LP-relaxation
2.2 Let v_{j,a} be the value of x_{j,a} for the LP optimal solution.
    if there exists an undefined position j with v_{j,a} ≥ α
    then for all such j and a do t[j] ← a
    else find the (j, a) with undefined t[j] such that v_{j,a} = max over the undefined positions, and set t[j] ← a
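Algorithm B's multi-variable rounding step can be sketched in the same style as before: after each LP solve, every unfixed variable whose value reaches the threshold (called `alpha` here, our name) is rounded in one pass, falling back to the single best variable when none qualifies. Again, this is an illustrative sketch using scipy, not the authors' code:

```python
import numpy as np
from scipy.optimize import linprog

def closest_string_threshold_round(strings, alpha=0.9):
    """Sketch of Algorithm B: round every unfixed variable whose LP value
    reaches the threshold alpha in one pass; if none qualifies, fall back
    to fixing the single highest-valued variable (as in Algorithm A)."""
    alphabet = sorted(set("".join(strings)))
    L, A = len(strings[0]), len(alphabet)
    col = {(j, a): j * A + k for j in range(L) for k, a in enumerate(alphabet)}
    nv = L * A + 1                       # last column is the distance d
    c = np.zeros(nv); c[-1] = 1.0

    A_eq = np.zeros((L, nv)); b_eq = np.ones(L)
    for j in range(L):
        A_eq[j, j * A:(j + 1) * A] = 1.0
    A_ub = np.zeros((len(strings), nv)); b_ub = -L * np.ones(len(strings))
    for i, s in enumerate(strings):
        for j in range(L):
            A_ub[i, col[(j, s[j])]] = -1.0
        A_ub[i, -1] = -1.0

    bounds = [(0.0, 1.0)] * (L * A) + [(0.0, None)]
    center = [None] * L
    while any(ch is None for ch in center):
        res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq,
                      bounds=bounds)
        hits = [(j, a) for (j, a), k in col.items()
                if center[j] is None and res.x[k] >= alpha]
        if not hits:                     # fall back to the single best variable
            hits = [max(((j, a) for (j, a) in col if center[j] is None),
                        key=lambda ja: res.x[col[ja]])]
        done = set()
        for j, a in hits:
            if j in done:                # at most one character per position
                continue
            done.add(j)
            center[j] = a
            for a2 in alphabet:          # freeze the position via bounds
                bounds[col[(j, a2)]] = (1.0, 1.0) if a2 == a else (0.0, 0.0)
    t = "".join(center)
    return t, max(sum(x != y for x, y in zip(t, s)) for s in strings)
```

When many LP values are already near-integral, a whole batch of positions is decided per solve, which is the source of Algorithm B's speedup over one-variable-per-round Algorithm A.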
To get higher precision, we improve Algorithm B into Algorithm C. It tries not only the best but also the second-best character. If the first solution is not optimal, we select the 8 positions most likely to need re-solving, in increasing order of variable values. The first position of a solution to be re-solved is one of these 8 positions; its value is set to the character corresponding to the second-best-valued variable. We update the initial setting to find a new solution. Thus, using 8 different settings, we can find 8 different solutions. Finally, we choose the best of the 9 solutions, including the first one.
1. Let first store the character with the largest value among the variables of the -th
position, and second the character with the second largest value.
2. Invoke Algorithm B with the following modification: the "else" statement of
Algorithm B is revised as
second with =
3. if the objective value of = that of the LP rounded up, return.
4. for do
second, where first is the -th smallest
Use Step 2 of Algorithm B to re-solve the CSP
if the current solution is better than then
Table 1. Empirical results. Columns: Instance; Average distance (LP, Alg. C, B&B IP); Max distance error (Alg. C, B&B IP); Average time in ms (LP, Alg. C, B&B IP).
On a Celeron 2.2 GHz CPU, we tested two algorithms: our Algorithm C and the B&B IP by Meneses et al., which is regarded as the best IP-based method for the CSP so far.
We carried out many experiments, including the McClure data set [ML1] and random instances over alphabets with 2 characters, 4 characters, and 20 characters. In all experiments, our algorithm's performance was very good. For reasons of space, we present only the empirical results for random instances over the alphabet with 4 characters. In Table 1, we provide three instances for each entry. Parameters and stand for the number of strings and the string length. "Distance" and "time" refer to the minimum distance found and the running time in milliseconds. The LP average distance is computed as . The reason for taking the ceiling here is that the optimal solution for the CSP is no less than the ceiling of the LP value. In the 6th and 7th columns, the max distance error is defined as , where is the -th solution, and is the -th LP fractional solution. The maximum time allowed for each instance was set to 1000 seconds. As seen in Table 1, we found an exact solution for all but a few instances. In terms of running time, our improvement was huge: our algorithm was from 32 up to 912 times faster than the B&B IP. In other experiments, not listed here, it was even 1765 times faster. In some cases, its running time was even close to that of computing a single LP. Notice that our algorithm generally invokes the LP solver many times; even so, in the worst case, it was only 20 times slower than computing an LP.
- [BL1] Ben-Dor, A., Lancia, G., Perone, J., Ravi, R.: Banishing bias from consensus sequences. 8th Ann. Symp. Comb. Pat. Match., LNCS 1264, 247–261 (1997)
- [BG1] Berman, P., Gumucio, D., Hardison, R., Miller, W., Stojanovic, N.: A linear-time algorithm for the 1-mismatch problem. Workshop Alg. & Data Stru., 125–135 (1997)
- [CV1] Cheriyan, J., Vempala, S.: Edge covers of setpairs and the iterative rounding method. IPCO 2001, LNCS 2081, 30–44 (2001)
- [FL1] Frances, M., Litman, A.: On covering problems of codes. Theor. Comput. Syst. 30, 113–119 (1997)
- [GJ1] Gasieniec, L., Jansson, J., Lingas, A.: Efficient approximation algorithms for the Hamming center problem. 10th ACM-SIAM Symp. Discr. Alg., 905–906 (1999)
- [GN1] Gramm, J., Niedermeier, R., Rossmanith, P.: Exact solutions for closest string and related problems. ISAAC 2001, LNCS 2223, 441–452 (2001)
- [JA1] Jain, K.: A factor 2 approximation algorithm for the generalized Steiner network problem. Combinatorica 21(1), 39–60 (2001)
- [LL1] Lanctot, K., Li, M., Ma, B., Wang, S., Zhang, L.: Distinguishing string selection problems. Information and Computation 185, 41–55 (2003)
- [LM1] Li, M., Ma, B., Wang, L.: On the closest string and substring problems. J. ACM 49, 157–171 (2002)
- [ML1] Meneses, C., Lu, Z., Oliveira, C., Pardalos, P.: Optimal solutions for the closest string problem via integer programming. INFORMS J. Comput. 16(4), 419–429 (2004)