Preemptive Online Partitioning of Sequences
C. Konrad is supported by the Centre for Discrete Mathematics and its Applications (DIMAP) at Warwick University and by EPSRC award EP/N011163/1.
Online algorithms process their inputs piece by piece, taking irrevocable decisions for each data item. This model is too restrictive for most partitioning problems, since data that is yet to arrive may render it impossible to extend partial partitionings to the entire data set reasonably well.
In this work, we show that preemption is a potential remedy. We consider the problem of partitioning online sequences, where separators need to be inserted into a sequence of integers that arrives online so as to create contiguous partitions of similar weight. Without preemption, no algorithm with non-trivial competitive ratio is possible. If preemption is allowed, i.e., inserted partition separators may be removed but not reinserted, then we show that constant-competitive algorithms can be obtained. Our contributions include:
We first give a simple deterministic -competitive preemptive algorithm for arbitrary and arbitrary sequences. Our main contribution is the design of a highly non-trivial partitioning scheme, which, under some natural conditions and being a power of two, allows us to improve the competitiveness to . We also show that the competitiveness of deterministic (randomized) algorithms is at least (resp. ).
For , the problem corresponds to the interesting special case of preemptively guessing the center of a weighted request sequence. While deterministic algorithms fail here, we provide a randomized -competitive algorithm for all-ones sequences and prove that this is optimal. For weighted sequences, we give a -competitive algorithm and a lower bound of .
Online algorithms receive their inputs sequentially piece by piece. For each incoming piece of data (=request), the algorithm takes an immediate and irrevocable decision on how to process it. Taking good decisions can be challenging or even impossible, since decisions are based only on the requests seen and choices taken so far, and, in particular, they cannot be based on future requests. For many problems, taking a few bad decisions is forgivable and good competitive algorithms can nevertheless be designed (e.g. maximum matching , bin packing , -server ), while for other problems, even a single bad decision may make it impossible to obtain non-trivial solutions (e.g. the maximum independent set problem ). Generally, a necessary condition for a problem to admit good online algorithms is that solutions can be incrementally built by extending partial solutions.
In this paper, we are interested in whether data partitioning problems can be solved online. In an online data partitioning problem, the input data arrives online and is to be partitioned into parts by computing a partitioning function such that an application-specific cost function is optimized. Unfortunately, most partitioning problems are inherently non-incremental and thus poorly suited to the online model, which may explain why the literature on online data partitioning is exceptionally scarce (see related works section). This raises the following research questions:
Is online data partitioning with provable guarantees really hopeless to achieve?
If so, how can we minimally augment the power of online algorithms to render data partitioning possible?
In this work, we address these questions with regard to the problem of partitioning integer sequences. It is one of the simplest data partitioning problems and thus a good candidate problem to answer the questions raised. We prove that even for this rather simple problem, non-trivial quality guarantees are indeed impossible to achieve in the online model. However, if we augment the online model with preemption, i.e., the ability to remove a previously inserted element from the solution, or, in the context of data partitioning, the ability to merge a subset of current partitions by removing previously inserted partition separators, then competitive ratios of at most can be obtained.
Partitioning Online Integer Sequences. Let be an integer sequence of length , and an integer. In the problem of partitioning integer sequences (abbreviated by Part), the goal is to partition into contiguous blocks (by determining the position of partition separators) such that the maximum weight of a block, denoted the bottleneck value of the partitioning, is minimized.
In the (non-preemptive) online model, parameter is given to the algorithm beforehand, and the sequence arrives online, integer by integer. When processing an integer, the algorithm has to decide whether or not to place a partition separator after the current integer. We are interested in the competitive ratio  of an algorithm, i.e., the (expected) ratio of the bottleneck value of the computed partitioning and the optimal bottleneck value of the input, maximized over all inputs. Placing no separator at all results in a single partition that is trivially -competitive. We prove, however, that this is essentially best possible in the non-preemptive online model, even if the integer sequence is an all-ones sequence.
Given this strong impossibility result, we then augment the online model with preemption. When processing the current integer of the input sequence, a preemptive online algorithm for Part is allowed to remove a previously placed separator, which results in the merging of the two adjacent partitions incident to the separator. The removed separator can then be reinserted again (however, only after the current integer). In this paper, we show that the additional flexibility gained through preemption allows us to obtain algorithms with competitive ratio at most . Even though our original motivation for studying Part in the preemptive online model is the fact that non-trivial algorithms cannot be obtained in the non-preemptive case, preemptive online algorithms for Part are extremely space efficient (only the weights of partitions and positions of separators need to be remembered) and thus work well in a data streaming context for massive data sets.
Partitioning Continuous Online Flows. Algorithms for Part have to cope with the following two difficulties: First, the weights in the input sequence may vary hugely, which implies that algorithms cannot establish partitions of predictable weights. Second, algorithms need to find a way to continuously merge adjacent partitions while keeping the bottleneck value small. While the first point can be tackled via rounding approaches, the second constitutes the core difficulty of Part. We define a problem denoted Flow, which abstracts away the varying weights of the integers and allows us to focus on the second point. Flow differs from Part in that the preemptive online algorithm is allowed to determine the weight of every incoming element (we now even allow for positive rational weights). The difficulty in Flow stems from the fact that the algorithm is not aware of the total weight of the input. While at first glance this problem appears to be substantially easier than Part, we show that any algorithm for Flow can be used for Part while incurring an error term that depends on the ratio of the largest weight of an element and the total weight of the input sequence. Flow can be seen as a continuous version of Part and can be interpreted as the problem of partitioning a continuous online flow (details follow in the preliminaries section).
Our Results. We first show that every algorithm for Part in the non-preemptive online model has an approximation ratio of , even if the input is guaranteed to be a sequence of ones (Theorem 1). We then turn to the preemptive model and consider the special case first, which corresponds to preemptively guessing the center of a weighted request sequence. It is easy to see that every deterministic algorithm for the case has a competitive ratio of . We then give a randomized -competitive algorithm for unweighted sequences (Theorem 2) and prove that this is best possible (Theorem 3). We extend this algorithm to weighted sequences and give a -competitive algorithm (Theorem 4) and a lower bound of on the competitiveness (Theorem 5).
For general , we first give a simple deterministic -competitive online algorithm for Part (Theorem 6) and prove a lower bound of () on the competitiveness of every deterministic (resp. randomized) algorithm (Theorem 7). We then turn to Flow and give a highly non-trivial deterministic partitioning scheme with competitive ratio (Theorem 8) for the case that is a power of two, which constitutes the main contribution of this paper. This scheme translates to Part while incurring a small additive term in the competitive ratio that stems from the varying weights in Part. We discuss extensions of our scheme to arbitrary values of and demonstrate experimentally that competitive ratios better than can still be obtained. Last, we give a lower bound of on the competitive ratio of every deterministic algorithm for Flow (Theorem 9). Unlike the lower bounds for Part, this lower bound does not rely on the discrete properties of integers.
Techniques. Consider first the case and all-ones sequences, which corresponds to preemptively guessing the center of the request sequence. One potential technique is reservoir sampling , which allows the sampling of a uniform random element while processing the input sequence. It naturally suits the preemptive online model and can be used to place a separator at a uniform random position in the request sequence, giving a randomized algorithm with expected competitive ratio . We show that the randomized geometric guessing technique allows us to improve on this bound: For a random and a carefully chosen value , reset the single separator to the current position every time the total weight seen so far equals , for , giving a -competitive algorithm. Via Yao’s principle, we prove that this algorithm is optimal. We then analyze essentially the same algorithm on weighted sequences and show that it is -competitive.
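The reservoir-sampling baseline mentioned above can be sketched as follows for all-ones sequences with a single separator. This is a minimal illustration rather than code from the paper: the function names `reservoir_separator` and `bottleneck_two_parts` are hypothetical, and the sketch only demonstrates that moving the single separator to the current position with probability 1/j at the j-th request fits the preemptive model (the old separator is removed, the new one is placed after the current request).

```python
import random


def reservoir_separator(n: int, rng: random.Random) -> int:
    """Process n unit requests online and return the separator position.

    The returned position is uniformly distributed over 1..n
    (classic reservoir sampling with a reservoir of size one).
    """
    sep = 0  # 0 means "no separator placed yet"
    for j in range(1, n + 1):
        # With probability 1/j, drop the old separator and place a new
        # one after the current (j-th) request.
        if rng.randrange(j) == 0:
            sep = j
    return sep


def bottleneck_two_parts(sep: int, n: int) -> int:
    """Bottleneck value of an all-ones sequence of length n split after position sep."""
    return max(sep, n - sep)
```

For example, `bottleneck_two_parts(reservoir_separator(100, random.Random(1)), 100)` evaluates one random run; averaging over many runs and dividing by the optimum gives an empirical estimate of the expected ratio.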
For general , consider the following algorithm for all-ones sequences of unknown length (assume also that is even): First, fill all partitions with weight . Whenever partitions are entirely filled, merge them pairwise creating partitions of weight and then update . Then always constitutes the bottleneck value of this partitioning. After the merging, fill the empty partitions with weight and repeat. Note that the optimal bottleneck value is bounded as . Since at least half of all partitions computed by the algorithm have weight , it holds that , which together with implies , giving a -competitive algorithm. The -competitive algorithm given in this paper is based on the intuition provided and also works for sequences with arbitrary weights.
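The doubling idea described above can be simulated in a few lines. The sketch below is an illustration under stated assumptions, not the algorithm analyzed in the paper: it handles all-ones sequences only, assumes an even number of partitions `p`, and assumes the initial target weight is 1 (the initial value is not legible in the text above). The name `doubling_partition` is hypothetical.

```python
from typing import List


def doubling_partition(n: int, p: int) -> List[int]:
    """Return the partition weights after n unit requests, using at most p partitions."""
    if p < 2 or p % 2 != 0:
        raise ValueError("this sketch assumes an even p >= 2")
    x = 1  # current target weight of a full partition (assumed initial value)
    parts: List[int] = []
    for _ in range(n):
        # Open a new partition if there is none yet or the last one is full.
        if not parts or parts[-1] == x:
            parts.append(0)
        parts[-1] += 1
        # All p partitions are full: merge them pairwise (this removes
        # every second separator) and double the target weight.
        if len(parts) == p and parts[-1] == x:
            parts = [parts[k] + parts[k + 1] for k in range(0, p, 2)]
            x *= 2
    return parts
```

On an all-ones sequence of length n the optimal bottleneck value is ⌈n/p⌉, so `max(doubling_partition(n, p))` can be compared against it directly; in this sketch the ratio is worst right after a merge.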
To go beyond the competitiveness of , consider the key moment that leads to the -competitiveness of the above algorithm: Just after merging all partitions of weight into partitions of weight , the competitive ratio is . To avoid this situation, note that merging only a single pair of partitions would not help. Instead, after merging two partitions, we need to guarantee that the new bottleneck value is substantially smaller than twice the current bottleneck value. This implies that the weights of the merged partitions cannot both be close to the current bottleneck value, and it is hence beneficial to establish and merge partitions with different weights. In fact, our lower bound for Flow makes use of this observation: If at some moment the competitive ratio is too good, then most partitions have similar weight, which implies that when these partitions have to be merged in the future the competitive ratio will be large.
From an upper bound perspective, a clever merging scheme is thus required, which establishes partitions of different weights and, in particular, remains analyzable. We give such a scheme for Flow, when is a power of two. As an illustration, consider the case as depicted in Table 1. Recall that in Flow we are allowed to determine the weight of the incoming elements. We first initialize the partitions with values , for , and evolve the partitions as in the table. Note that, at every moment, all partition weights are different, but at the same time never differ by more than a factor of . A key property is that at the end of the scheme the weights of the partitions are a multiple of the initial weights of the partitions. This allows us to repeat the scheme and limits the analysis to one cycle of the scheme. We prove that the competitive ratio of our scheme is for every that is a power of two. We discuss how our scheme can be extended to other values of and demonstrate experimentally that competitive ratios better than can still be obtained.
Further Related Work. The study of Part in the offline setting has a rich history with early works dating back to the 80s [1, 7, 11, 18, 19, 14, 10, 20, 21]. After a series of improvements, Frederickson gave a highly non-trivial linear time algorithm . Part finds many applications, especially in load balancing scenarios (e.g. [20, 15, 16]). It has recently been studied in the context of streaming algorithms where it serves as a building block for partitioning XML documents .
Recently, Stanton and Kliot expressed interest in simple online strategies for data partitioning. They studied online graph partitioning heuristics (phrased in the context of streaming algorithms, but their algorithms are in fact online) for the balanced graph partitioning problem and demonstrated experimentally that simple heuristics work well in practice. Stanton later analyzed the behavior of these heuristics on random graphs and gave good quality bounds. Interestingly, besides this line of research, we are unaware of any further attempts at online data partitioning.
Many works provide additional power to the online algorithm. Besides preemption, common resource augmentations include lookahead (e.g. ), distributions on the input (e.g. ), or advice (e.g. ). Preemptive online algorithms have been studied for various online problems. One example with a rich history is the matching problem (e.g. [6, 27, 5, 3]).
Outline. In Section 2, we formally define the studied problems and the preemptive online model. In Section 3, we prove that every non-preemptive algorithm for Part has a competitive ratio of . Then, we give our algorithms and lower bounds for Part for the special case in Section 4. All our results for Part and Flow for arbitrary are given in Section 5.
Partitioning Integer Sequences. In this paper, we study the following problem:
Definition 1 (Partitioning Integer Sequences).
Let be an integer sequence, and let be an integer. The problem of partitioning integer sequences consists of finding separators such that and the maximum weight of a partition is minimized, i.e., is minimized. The weight of a heaviest partition is the bottleneck value of the partitioning. This problem is abbreviated by Part.
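To make the objective concrete, the optimal bottleneck value of a small instance can be computed offline. The sketch below uses a textbook binary search over the bottleneck value with a greedy feasibility check; it is not the linear-time algorithm referred to in the related work, and it assumes a non-empty sequence of positive integers and at most `p` contiguous blocks. The names `fits` and `optimal_bottleneck` are hypothetical.

```python
from typing import Sequence


def fits(weights: Sequence[int], p: int, cap: int) -> bool:
    """Greedy check: can weights be cut into at most p contiguous blocks of weight <= cap?"""
    blocks, current = 1, 0
    for w in weights:
        if w > cap:
            return False  # a single element already exceeds the cap
        if current + w > cap:
            blocks += 1   # close the current block and start a new one
            current = w
        else:
            current += w
    return blocks <= p


def optimal_bottleneck(weights: Sequence[int], p: int) -> int:
    """Smallest achievable maximum block weight, found by binary search."""
    lo, hi = max(weights), sum(weights)
    while lo < hi:
        mid = (lo + hi) // 2
        if fits(weights, p, mid):
            hi = mid
        else:
            lo = mid + 1
    return lo
```

For instance, `optimal_bottleneck([7, 2, 5, 10, 8], 2)` splits the sequence as `[7, 2, 5]` and `[10, 8]`, and an all-ones sequence of length 10 with three blocks has optimal bottleneck value 4. Such an offline routine is useful as the denominator when measuring the competitive ratio of an online algorithm empirically.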
Online Model. In the online model, parameter is given to the algorithm beforehand, and the integers arrive online. Upon reception of an integer (also called a request), the algorithm has to decide whether to place a partition separator after . In the non-preemptive online model, placing a separator is final, while in the preemptive model, when processing previously placed separators may be removed. The total number of separators in place never exceeds and separators can only be inserted at the current request.
Competitive Ratio. The competitive ratio of a deterministic online algorithm for Part is the ratio between the bottleneck value of the computed partitioning and the bottleneck value of an optimal partitioning, maximized over all potential inputs. If the algorithm is randomized, then we are interested in the expected competitive ratio, where the expectation is taken over the coin flips of the algorithm.
Partitioning Online Flows. We connect Part to the problem of partitioning a continuous online flow, abbreviated by Flow.
Definition 2 (Preemptive Partitioning of a Continuous Online Flow).
In Flow, time is continuous starting at time . Flow enters the system with unit and constant speed such that at time , the total volume of flow has been injected. The goal is to ensure that when the flow stops at time ( is an arbitrarily small initial warm-up period), which is unknown to the algorithm, the total amount of flow is partitioned into parts such that the weight of a heaviest partition is minimized. More formally, at time , the objective is that partition separators with are in place such that: is minimized. Similar to Part, in the preemptive online model, partition separators can only be inserted at the current time , and previously inserted partition separators can be removed.
Even though Flow is defined as a continuous problem, it can be seen as a special case of Part where the algorithm can determine the weight of every incoming element.
3 A Lower Bound In The Non-preemptive Online Model
Solving Part in the non-preemptive online model is difficult since the total weight of the input sequence is unknown to the algorithm. Since inserted partition separators cannot be removed, any partition created by the algorithm may be too small if the input sequence is heavier than expected. This intuition is formalized in the following theorem:
Theorem 1.
For every , every randomized non-preemptive online algorithm for Part has expected competitive ratio , even on all-ones sequences.
Let and let be a set of request sequences where is the all-ones sequence of length . Let be a randomized algorithm for Part, and assume that its expected competitive ratio on every instance of is at most . Then by Yao’s lemma, there exists a deterministic algorithm with average approximation ratio at most over the instances of .
Let denote the set of separators output by on . Note that since is a prefix of , the output of on is a subset of the separators . Partition now into and such that the approximation ratio of on is strictly smaller than , and the approximation ratio of on is at least . Consider now a . Then, there exists a separator with , since otherwise the bottleneck value of the partitioning output by on was at least . This in turn would imply that the approximation ratio was at least (since the optimal bottleneck value on is ), contradicting the fact that . Next, note that the separators and , for and , are necessarily different. Thus, since the number of separators is , the size of is bounded by . The average approximation factor of on instances is thus at least
4 Guessing the Center: Part for
In this section, we consider Part for , i.e., a single separator needs to be introduced into the request sequence so as to split it into two parts of similar weight. We start with all-ones sequences and present an asymptotically optimal preemptive online algorithm. Then, we show how to extend this algorithm to sequences of arbitrary weights.
4.1 All-ones Sequences
The special case on all-ones sequences corresponds to preemptively guessing the center of a request sequence of unknown length. Deterministic algorithms cannot achieve a competitive ratio better than here, since request sequences that end just after a deterministic algorithm placed a separator give a competitive ratio of .
Every deterministic preemptive online algorithm for Part with has a competitive ratio of .
Using a single random bit, the competitiveness can be improved to . This barely-random algorithm is given in Appendix A. Using random bits, we can improve the competitive ratio to , which is best possible; we present this algorithm next.
Algorithm , as depicted in Algorithm 2, is parametrized by a real , which will be optimized in Theorem 2. It moves the separator to the current position as soon as the -th request is processed, where is any integer and is a random number.
Remark. The continuous random variable is only taken for convenience in the analysis; a bit precision of is enough.
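The randomized geometric guessing rule described above can be sketched as follows for all-ones sequences. This is an illustration under explicit assumptions, not the paper's pseudocode: the exact threshold expression is not legible in the text above, so the sketch assumes the separator is moved at requests number ⌈alpha^(i + delta)⌉ for integers i ≥ 0, with a parameter `alpha > 1` and a random offset `delta` in (0, 1]. The function name and both parameter names are hypothetical, and the optimized value of the parameter is not reproduced here.

```python
import math


def geometric_guess_separator(n: int, alpha: float, delta: float) -> int:
    """Separator position after n unit requests (0 if none was placed).

    Assumption: the separator moves to request j whenever
    j == ceil(alpha ** (i + delta)) for some integer i >= 0.
    """
    if alpha <= 1.0 or not (0.0 < delta <= 1.0):
        raise ValueError("this sketch assumes alpha > 1 and 0 < delta <= 1")
    sep = 0
    i = 0
    threshold = math.ceil(alpha ** (i + delta))
    for j in range(1, n + 1):
        # Skip thresholds that already lie in the past (several exponents
        # may round up to the same integer for small i).
        while threshold < j:
            i += 1
            threshold = math.ceil(alpha ** (i + delta))
        if j == threshold:
            sep = j  # remove the old separator, place one after request j
    return sep
```

A randomized run would draw the offset once up front, for example `delta = 1.0 - random.random()`, and the bottleneck value of the result is `max(sep, n - sep)`.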
In the following, denote by the competitive ratio of on a sequence of length .
Theorem 2.
There is a constant such that .
Let and be such that . Then, the bottleneck value of an optimal partition is . We now bound the bottleneck value of the computed partitioning , which depends on various ranges of and . We distinguish two ranges for , and within each case, we distinguish three ranges of :
Case 1: (note that we assumed ). In order to bound , we split the possible values of into three subsets:
If , then we have that . In this case, the bottleneck value is .
If , then we have that but . In this case, .
If , then we have that and . In this case, .
Using these observations, we can bound the expected competitive ratio as follows:
Case 2: . We deal with this case similarly, but we need to group the possible values for in a different way:
If , then but . In this case, .
If , then and . In this case, .
If , then . In this case, (note that here).
Plugging the values above in the formula for the expected value, we obtain a different sum of integrals, which however leads to the same function as above:
Moreover, the formulas above are independent of . Thus, it remains to find a value of that minimizes . Observe that , and if and only if . With a simple transformation, the latter is equivalent to with , so the value that minimizes can be computed as , where is the lower branch of Lambert’s function. The claim of the theorem follows by calculating . ∎
Next, we prove that no algorithm can achieve an (expected) competitive ratio better than the one claimed in Theorem 2. The proof applies Yao’s Minimax principle and uses a hard input distribution over all-ones sequences of length , for some large values of and , where the probability that the sequence is of length is proportional to .
Theorem 3.
For any randomized algorithm , .
We will prove the theorem by using Yao’s Minimax principle . To this end, let us first consider an arbitrary deterministic algorithm . Assume the length of the sequence is random in the interval for large values of and with and has the following distribution: The sequence ends at position with probability which is proportional to , i.e., using the definition , we have
We will show that for each deterministic algorithm , if the input sequence is distributed as above, then , where the expectation is taken over the distribution of .
Let denote the set of requests at which places the separator when processing the all-ones sequence of length . Note that the set of separator positions placed by on sequences of shorter lengths are a subset of . Let (the are ordered with increasing value).
For , let be the bottleneck value of the partitioning computed by on the sequence of length . We bound by separately bounding every partial sum in the following decomposition:
where for each , . The first and last terms need special care, so we start by bounding all other terms. In the following, denotes partial harmonic sums for . In particular, .
Thus, we proceed in three steps:
Consider an index and let us bound the sum . Let us denote and . We need to consider two cases.
Case 1: . Then for all , the bottleneck value computed by the algorithm is (since ). Then:
where the last inequality can be proved as follows. First, it is easily checked that the ratio decreases when increases with kept fixed, implying that (recall that and using standard approximations of harmonic sums)
where the last inequality holds when is large enough (say ).
Case 2: . In this case, for all , if the sequence is of length , then , as in case 1. However, when , then , so . Using these observations, we can bound as follows:
where the third line is obtained by using the identity . Again, using a standard approximation for the harmonic sums, and setting , we can approximate:
Note that the function is exactly the one that was minimized in the proof of Thm. 2, and achieves its minimum in at , giving . Thus, we have .
The term can be bounded by by an identical argument as above.
The term needs a slightly different approach. Let denote the last separator that the algorithm placed before . We can assume that , as otherwise the algorithm could only profit by moving the separator to . We consider two cases. First, if , then we simply assume the algorithm performs optimally in the range :
since (recalling that ), .
On the other hand, when (and by the discussion above, ), then with calculations similar to those in Case 2 above, we can obtain:
because and thus .
It remains to plug the obtained estimates in (1):
Last, by Yao’s principle, every randomized algorithm has a competitive ratio of at least . ∎
4.2 General Weights
Algorithm can be adapted to weighted sequences of positive integers as follows: can be thought of as a sequence of unit weight requests and we simulate on this unit weight sequence. Whenever attempts to place a separator, but the position does not fall at the end of a weight , the separator is placed after .
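The adaptation to weighted sequences can be sketched in the same spirit. As in any illustration of this kind, the details are assumptions rather than the paper's code: the sketch reuses the hypothetical threshold form ⌈alpha^(i + delta)⌉ with `alpha > 1` and `delta` in (0, 1], treats each weight as a run of unit requests, and places the separator after the current element whenever a threshold falls anywhere inside that run. The function name is hypothetical.

```python
import math
from typing import Sequence


def weighted_geometric_separator(weights: Sequence[int], alpha: float, delta: float) -> int:
    """Return k such that the separator sits after the first k elements (0 if none)."""
    if alpha <= 1.0 or not (0.0 < delta <= 1.0):
        raise ValueError("this sketch assumes alpha > 1 and 0 < delta <= 1")
    sep_index = 0
    i = 0
    threshold = math.ceil(alpha ** (i + delta))
    total = 0  # total weight seen so far
    for k, w in enumerate(weights, start=1):
        total += w
        moved = False
        # Invariant: threshold > total - w, so every threshold consumed
        # here falls inside the unit-weight run of the current element.
        while threshold <= total:
            moved = True
            i += 1
            threshold = math.ceil(alpha ** (i + delta))
        if moved:
            sep_index = k  # the separator is placed after the current element
    return sep_index
```

On unit weights this coincides with the unweighted rule; on `[3, 3, 3, 3]` with `alpha = 2.0` and `delta = 1.0` the thresholds 2, 4 and 8 fall inside the first three elements, so the separator ends up after the third element.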
If the weights of the sequence are bounded, algorithm can be analyzed similarly to Theorem 2, by treating all requests as unit weights. This introduces an additional error term:
There is a constant value of such that for any sequence with weights .
When arbitrary weights are possible, a non-trivial bound can still be proved. Interestingly, the optimal gap size between the separator positions is larger than in the case of unweighted sequences.
Theorem 4.
There is a constant value of such that for each sequence of total weight .
Let be the input sequence of total weight , and let . Then, is the central weight of the sequence, and we denote the central point. We will argue first that replacing all left of by a sequence of unit requests, and replacing all right of by a single large request of weight worsens the approximation factor of the algorithm. Indeed, suppose that the algorithm attempts to place a separator at a position that falls on an element , which is located left of . Then the algorithm places the separator after , which brings the separator closer to the center and thus improves the partitioning. Similarly, suppose that the algorithm attempts to place a separator at position that falls on an element , which is located right of . By replacing all weights located to the right of by a single heavy element, the algorithm has to place the separator at the end of the sequence, which gives the worst possible approximation ratio. Thus, we suppose from now on that is of the form , where may be non-existent.
Assume that the central point splits the request with weight into two parts , such that . Clearly, the optimal bottleneck value is . Further assume that for and . Then, is the central point. Let be such that is the starting point of weight , and is the starting point of weight . Note that is non-negative, but can be negative. However, if , we can replace without decreasing the approximation ratio, because in both cases the algorithm places the separator after , while in the case of negative the optimum can only be larger than when . Thus, we assume w.l.o.g. that , so and .
Again, we consider several cases. In the estimates below, we ignore the rounding terms as they are all , and hence the error term in the approximation ratio is . Also, we use in place of for brevity.
Case 1: . In this case, . So for all , is the position where the last separator would be (in the unit weights case). However, when then and is the last separator.
Here we have:
: (all unit weights before ).
: the separator is placed after , and .
: assume the worst case bound .
: , as the last separator is at .
Computing the expectation gives:
First, assume that , which implies that . In this case . Let us see which values of maximize the term in the parentheses. We have , since by assumption (that ). So is increasing in the interval , and is the maximizer. Plugging this value and in the expectation formula and rearranging the terms gives:
Note that the right hand side is a decreasing function of and the maximum is achieved when :
Now assume that . Then . By plugging the value in (2) we see again that the expression is a decreasing function of , so it is maximum when takes its minimum value :
where we also used . This is again a decreasing function of and gives exactly the same bound (3).
Case 2: . Here for all , so we assume that the algorithm places the separator at the end of the sequence for such . Here we have to distinguish two sub-cases.
Case 2.1: . We have: