An Algorithm for Approximation by Step Functions
Quentin F. Stout
Computer Science and Engineering
University of Michigan
Ann Arbor, MI 48109-2121
firstname.lastname@example.org +1 734.763.1518
An algorithm is given for determining an optimal -step approximation of weighted data, where the error is measured with respect to the norm. For data presorted by the independent variable the algorithm takes time and space. This is in the worst case and when . A minor change determines an optimal reduced isotonic regression in the same time and space bounds, and the algorithm also solves the -center problem for 1-dimensional weighted data.
Keywords: step function approximation; reduced isotonic regression; variable width histogram; weighted k-center; interval tree of bounded envelopes
Step functions are a fundamental form of approximation, arising in variable width histograms, databases, segmentation, approximating sets of planar points, piecewise constant approximations, etc. Here we are interested in stepwise approximation of weighted data. By weighted data on we mean values , , where is an arbitrary real number and (the weight) is a positive real number. For integers let denote . A function on is a -step function iff there are indices and real values , , such that for . is an optimal -step approximation of iff it minimizes the weighted error, , among all -step functions. Since a step can be split into smaller ones, we do not differentiate between “ steps” and “no more than steps”.
Many algorithms have been developed for -step regression [2, 3, 4, 5, 6, 8, 9, 11, 12, 13, 14]. The first time deterministic algorithms [6, 11] were decidedly impractical, relying on parametric search. A more feasible algorithm appeared in . However, the time of these algorithms does not improve when is small, which is the typical case of interest. We present a faster algorithm that is when the data is presorted by independent coordinate.
2 -Step Approximation
Build an interval tree to determine the regression error of an arbitrary interval if it is a single step.
Use “search in a sorted matrix” to find the minimal possible error for a -step approximation.
The search uses a feasibility test which is given and decides if there is a -step approximation with error . If there is such an approximation then the test produces one. We incorporate important improvements to this approach: feasibility tests are used during tree construction, not just during the search; tests do not determine the minimal regression error of an interval, merely that it is sufficiently small or too large; previous searches, except the randomized version in , were not search in a sorted matrix; and we exploit the fact that calculations at one stage of the search are related to those of the previous stage. We will show:
Given weighted data , sorted by the independent coordinate, and number of steps , one can determine an optimal -step approximation in time and space.
Given a set of weighted values and , the 1-dimensional weighted -center problem is to find a set of real numbers that minimizes , where is the weighted distance to , i.e., . Note that a set is an optimal -center iff it is the step values of an optimal -step approximation of the values in sorted order.
From now on we generally omit mention of “” and “optimal” since they are implied.
2.1 Interval Tree of Bounded Envelopes
For a weighted value , in the - plane the error of using as its regression value is given by the ray in the upper half-plane that starts at with slope . Given a set of weighted data , its upward error envelope is the topmost sequence of line segments corresponding to all such rays. For each , it gives the maximum error of using as the regression value for all points where . The downward error envelope uses rays in the upper half-plane starting at with slope , representing the error of using a regression value . The intersection of the downward and upward error envelopes gives the regression value minimizing the error over the entire set, i.e., the weighted mean, and its error.
To simplify exposition we assume that is an integral power of 2. A binary interval tree has a root corresponding to the interval , its two children correspond to and , their children represent intervals of length , etc. The leaves are , , …. The intervals corresponding to nodes will be called binary intervals.
Some authors [2, 5, 9] used interval trees where each node contains the upward and downward error envelopes of the data in its interval, but most of the envelopes’ segments are unnecessary. Let be the (unknown) error of an optimal -step approximation, and suppose bounds are known. Then can be determined using only the segments representing errors in . These will be called essential segments, and they form bounded envelopes. All others, the inessential segments, are discarded. Figure 1 shows how essential segments can become inessential when a better error bound is determined. Initially and , and the algorithm continually improves these bounds. In our interval tree each node contains its upward and downward bounded envelopes. Throughout, the essential segments are precisely the segments in the standard, unbounded, envelopes that represent errors in . Thus correctness depends on properties of the standard envelopes, though timing does not.
Bounded envelopes are stored as a doubly-linked list ordered by slope. Whenever a node is visited, by starting at both ends, inessential segments (i.e., those with no errors in ) are discarded. The time is charged to the segments removed, not the search visiting the node. Only segments are ever created, hence the total time to remove inessential ones is . Whenever the number of remaining segments is counted the count is only of the essential segments given the current values of and .
2.2 Feasibility Tests
Given an arbitrary interval let denote the upward bounded envelope of the data in , and the downward bounded envelope. For the error of making a single step is , , iff , , (see Figure 1). Since this can be calculated as follows: let for some , where each is a binary interval. Then and . For a binary interval , to determine , and similarly , go through the segments of its bounded envelope until the segment containing is found. This search alternates back and forth starting at the topmost and bottommost essential segments. This is only performed during a feasibility test, which will result in either the segments above , or those below , becoming inessential. Thus the time to find is at most a constant plus a term linear in the number of segments that become inessential. Here too the linear term is charged to the inessential segments.
Any interval can be decomposed into binary intervals where the sizes increase and then decrease, with perhaps two intervals of the same size in the middle. E.g., . These can be visited in time by a tree traversal starting at the leaf , moving upward to the least common ancestor of and , and then downward to the leaf . Suppose, given and , we want to find the largest such that the error of making a single step is . We do this by a traversal to locate . By incrementally updating the values to determine , and values used for , when moving upward at node one can determine if is in ’s subtree (and hence the traversal should start going downward) by using ’s envelopes to decide if adding the entire subtree gives an error . When moving downward, is in ’s left subtree iff adding the left subtree gives error , otherwise it is in ’s right subtree. Not counting the queries of children’s envelopes, the nodes visited are the same as those in going from to when is known.
Given and , a feasibility test determines if there is a -step function with regression error . This can be accomplished by starting at 1 and determining the largest for which the error of making a single step is , then starting at and determining the largest for which the error of making a single step is , etc. If the step is finished before is reached then is infeasible, the test stops, and . Otherwise, is feasible, the steps have been identified, and .
To count the number of nodes visited, for each step the traversal visits nodes at a given height at most twice, once moving upward and once moving downward. Thus at any height at most nodes are visited. The top levels have a total of nodes. There are lower levels, so in total nodes are visited. Each visit takes time, so this is also the time required.
2.3 Constructing the Tree
We reduce the time to construct the interval tree of bounded envelopes by continually shrinking . At the end, is so small that each bounded envelope is a single segment. See Figure 2.
First a feasibility test with is performed using only the base level. If it passes then the algorithm is done. Otherwise, set , , and . Throughout, is an unordered multiset containing the errors of all segment endpoints (e.g., the error of the joint endpoint of a and b in Fig. 1) remaining in . At height 0 each interval is a singleton, with single rays in its upward and downward envelopes for a total of rays. In general, at the end of height , and the total number of essential segments in the envelopes at height is . Moving upward, bounded envelopes from height are merged to form those at height , creating segments and segment endpoints (the number of segment endpoints is the number of segments minus one per envelope). Add the errors of the segment endpoints to those already in , resulting in . Take the median error in and do a feasibility test. Depending on the outcome, one of and is adjusted and 1/2 the entries in can be eliminated. Doing this 3 times results in . The number of essential segments is 1 per envelope (since included segment endpoints at level ), and hence is .
When the top is finished and 2 feasibility tests are used to eliminate the remaining endpoint errors, i.e., at every node of the interval tree the upward and downward bounded envelopes have only one segment. Complete the tree construction by removing all inessential segments, taking time. Feasibility tests during tree construction have a slight change from standard traversals in that when height is being constructed, when the test’s traversals reach height they go sideways, not upwards, from one node to the next since nodes at higher levels haven’t yet been constructed. This increases the total number of nodes visited per test by at most . Since only 3 tests are done per height (see Figure 2), this adds total time over all heights. Thus the total time to construct the tree is .
2.4 Search for Minimal Feasible Error
The error of a stepwise approximation is the maximum of the errors of its steps, thus there is an interval such that the error of an optimal -step approximation is the error of using the weighted mean as the step value on . Thus searching through such errors can determine the minimal feasible error. “Parametric search” was used in [6, 11] but this is only of theoretical interest since parametric search is completely impractical, involving very complex data structures and quite large constants.
Search in a sorted matrix provides a practical approach . Let be the matrix where is the error of using the mean on if , and is 0 if . is not actually created, but serves as a conceptual guide. Few of its entries are ever determined. Its rows are nondecreasing and the columns are nonincreasing, so for any submatrix its largest entry is in the upper right and the smallest is in the lower left.
The algorithm has stages , where at the start of stage there is a collection of disjoint square submatrices of size . Stage 0 starts with all of . At each stage, divide each of the matrices into quadrants, and let be a median of the smallest value from each quadrant and a median of their largest values. Values determined to be outside , as in Figure 1, are not calculated exactly and are set to or , as appropriate. Feasibility tests for and are done, resulting in improvements to and/or . Quadrants with smallest value , or largest value , are eliminated. The remaining quadrants are the matrices that start the next stage. Note that if then any quadrant with an entry of is not eliminated, hence at the end of each stage, either or one of the entries in the remaining submatrices is .
After the last stage the remaining matrices are and a standard binary search on these values is used to finish the determination of . The search uses feasibility tests, and the less obvious fact, proven in , is that only entries of are evaluated.
2.5 Evaluating During The Search
For intervals let denote the submatrix , i.e., the submatrix corresponding to intervals starting at some and ending at some . At the start of stage of the search there is a collection of submatrices of the form for binary intervals , of size . Either , or is to the left of and there is a (perhaps empty) gap between them with length an integral multiple of ( in Figure 3). During stage , the quadrants of are formed by cutting and in half into , and , , respectively, creating quadrants , , , and . The smallest and largest value in each quadrant needs to be determined, and as Figure 3 shows, one can evaluate the smallest entry in (i.e., ), by using the envelopes from and the binary intervals , , and . The largest entry in the quadrant, , uses envelopes from and the binary intervals and . Similar results hold for all of the other quadrants of . Exact values for entries outside are irrelevant and or is used, as appropriate.
The bounded envelopes for gap are associated with , and if, say, is kept for stage then the envelopes for are associated with it. Just as for the tree construction, as search in a sorted matrix is proceeding the number of segments in the gap envelopes is reduced by interleaving feasibility tests based on the segment endpoint errors with tests for the basic search. At the end of stage each gap envelope is copied at most 4 times, and each quadrant passed on to the next level adds 3 binary intervals which have envelopes that are only single segments. As shown in  there are such quadrants, so by using 2 additional feasibility tests at each stage the total number of segments in the bounded envelopes at stage is . Since the time to evaluate an entry of is linear in the number of segments involved, the total time to evaluate entries of over all stages of the algorithm is , and the total time of the feasibility tests is plus the time to remove inessential segments.
This completes the proof of Theorem 1.
3 Final Comments
We have shown how to find an -step approximation of weighted data, presorted by its independent coordinate, in time. No previous algorithm [2, 3, 4, 5, 6, 8, 9, 11, 12, 13, 14] was whenever , nor whenever . For sorted data the algorithm solves the 1-dimensional weighted -center problem in the same time.
With a small change the algorithm also produces a “reduced isotonic” -step function. is an isotonic function iff , and is an optimal -step reduced isotonic regression of iff it minimizes the error among all isotonic -step functions. Isotonic regression is an important form of nonparametric regression that allows researchers to replace parametric assumptions with weaker shape constraints [1, 15]. Some researchers were concerned that it can overfit the data and/or be too complicated [10, 16, 17] and resorted to reduced isotonic regression. However, they used approximations because previous exact algorithms were too slow. Merely changing the feasibility test to insure increasing steps finds -step reduced isotonic regression in the same time bounds as -step approximation.
Acknowledgement: Research partially supported by NSF grant CDI-1027192
-  Barlow, RE, Bartholomew, DJ, Bremner, JM, and Brunk, HD, Statistical Inference Under Order Restrictions: The Theory and Application of Isotonic Regression, John Wiley, 1972.
-  Chen, DZ and Wang, H, “Approximating points by a piecewise linear function”, Algorithmica 66 (2013), pp. 682–713.
-  Chen, DZ and Wang, H, “A note on searching line arrangements and applications”, Info. Proc. Let. 113 (2013), pp. 518–521.
-  Díaz-Báñnez, JM and Mesa, JA, “Fitting rectilinear polygonal curves to a set of points in the plane”, Eur. J. Operational Res. 130 (2001), pp. 214–222.
-  Fournier, H and Vigneron, A, “Fitting a step function to a point set”, Algor. 60 (2011), pp. 95–109.
-  Fournier, H and Vigneron, A, “A deterministic algorithm for fitting a step function to a weighted point-set”, Info. Proc. Let. 113 (2013), pp. 51–54.
-  Frederickson, G and Johnson, D, “Generalized selection and ranking: Sorted matrices”, SIAM J. Comp. 13 (1984), pp. 14–30.
-  Fülöp, J and Prill, M, “On the minimax approximation in the class of the univariate piecewise constant functions”, Oper. Res. Let. 12 (1992), pp. 307–312.
-  Guha, S and Shim, K, “A note on linear time algorithms for maximum error histograms”, IEEE Trans. Knowledge and Data Engin. 19 (2007), pp. 993–997.
-  Haiminen, N, Gionis, A, and Laasonen, K, “Algorithms for unimodal segmentation with applications to unimodality detection”, Knowl. Info. Sys. 14 (2008), pp. 39–57.
-  Hardwick, J and Stout, QF, “Optimal reduced isotonic reduction”, Proc. Interface 2012.
-  Karras, P, Sacharidis, D., and Mamoulis, N, “Exploiting duality in summarization with deterministic guarantees”, Proc. Int’l. Conf. Knowledge Discovery and Data Mining (KDD) (2007), pp. 380–389.
-  Liu, J-Y, “A randomized algorithm for weighted approximation of points by a step function”, COCOA 1 (2010), pp. 300–308.
-  Mayster, Y and Lopez, MA, “Approximating a set of points by a step function”, J. Vis. Commun. Image R. 17 (2006), pp. 1178–1189.
-  Robertson, T, Wright, FT, and Dykstra, RL, Order Restricted Statistical Inference, Wiley, 1988.
-  Salanti, G and Ulm, K, “A nonparametric changepoint model for stratifying continuous variables under order restrictions and binary outcome”, Stat. Methods Med. Res. 12 (2003), pp. 351–367.
-  Schell, MJ and Singh, B, “The reduced monotonic regression method”, J. Amer. Stat. Assoc. 92 (1997), pp. 128–135.