Isotonic Regression for Linear, Multidimensional, and Tree Orders
Quentin F. Stout
qstout@umich.edu
University of Michigan
Ann Arbor, MI
Abstract
Algorithms are given for determining isotonic regression of weighted data. For a linear order, grid in multidimensional space, or tree, of vertices, optimal algorithms are given, taking time. These improve upon previous algorithms by a factor of . For vertices at arbitrary positions in dimensional space a algorithm employs iterative sorting to yield the functionality of a multidimensional structure while using only space. The algorithms utilize a new nonconstructive feasibility test on a rendezvous graph, with bounded error envelopes at each vertex.
Keywords: weighted isotonic regression, shapeconstrained nonparametric regression, linear order, multidimensional domination, tree, rendezvous graph, coordinatewise ordering
1 Introduction
This paper gives efficient algorithms for determining optimal isotonic regression functions for weighted data. For example, consider predicting weight as a function of height and S M L XL shirt size. Average weight is an increasing function of height and of shirt size, but there may be no assumptions about the relative weights of people shorter and with a larger shirt size vs. taller and smaller shirt size. A parametric model, such as linear regression, may not be justified, and would require a metric on shirt sizes, not just an ordering. Isotonic regression is useful here since it merely assumes a direction on each variable.
Isotonic regression is a fundamental nonparametric regression, long used for numerous applications [3, 38] and mathematically equivalent problems [24], and is getting increased attention in machine learning and data mining due to its flexibility and minimal assumptions [8, 9, 16, 21, 25, 32, 34, 36, 49]. For example, the Isotron algorithm provably learns single index models [21]. As another example, a classification system may have some confidence that an item in an image is squirrel, of it being a rat, a mammal, etc., with the isotonic requirement that as one moves up the taxonomy tree the confidence does not decrease [22, 34].
Formally, given a directed acyclic graph (dag) , for vertices , iff there is a path in from to . A realvalued function on is isotonic iff whenever then . By weighted data on we mean functions on where is an arbitrary real value and (the weight) is nonnegative, for . Given weighted data , an isotonic function on is an isotonic regression iff it minimizes the weighted error
among all isotonic functions. See Figure 1. Note that when is points on the real line, is not necessarily defined at points not in . For example, if all weights are the same and the values are (1,7), (2,5), (3,8), then for all an optimal regression is 6 on [1,2] and 8 at 3, but for any and there is an optimal isotonic regression for which . Further, while isotonic regression is unique when , for it is not necessarily unique. For example, given values 4, 1, 3, with weights 1, 1, 1, on 0, 1, 2 with the natural ordering, a function is an isotonic regression iff its values are 2.5, 2.5, , for .
For a dag of vertices and edges, previous algorithms for weighted regression were no matter what the dag was [24, 27, 29, 44], though a randomized algorithm taking expected time appears in [26]. Here we give algorithms for the dags of most interest, namely linear, tree, and multidimensional grids. We also give an algorithm for arbitrary points in dimensional space with componentwise ordering, , taking time and only space. For all , previous algorithms for isotonic regression for such inputs required space and more time.
2 Preliminaries
For linear orders it is well known that the , , and isotonic regressions can easily be found in , time, and time, respectively, by using “pool adjacent violators” (PAV) where whenever adjacent level sets are not isotonic they are pooled together into a larger set where all the regression values are the same. Algorithms for multidimensional orderings have concentrated on 2 dimensional grids [5, 12, 13, 17, 31, 35, 37, 42], with the fastest taking time for [43], for [42], and for [44]. For grids of dimension , and dimensional points in arbitrary positions, , the fastest algorithms just apply the fastest algorithm for arbitrary dags, which takes time for [2] ( if the data is unweighted [45]), for [20], and for [44] ( if the data is unweighted).
Given data on dag , for , with and , let and . For any isotonic function , the weighted error of at least one of and is , which is minimized when . Because of the isotonic restriction, a larger value at forces a larger value at , increasing the error there, and, similarly, a smaller value at increases the error at . The most widely known isotonic regression is given by , where
Note that the optimal regression error, , is .
A simplistic use of this could take time, so more efficient methods are used instead. Given an , to decide if , for let , i.e., the smallest value at with weighted error no more than , and let . is the smallest possible value at of any isotonic function with error on and its predecessors, and thus there is an isotonic function with error iff is one, i.e., iff for all . This is usually called a feasibility test, but here will be called a feasibility construction. It can be computed in time via topological sort.
For the metric, isotonic regression and related nonparametric regressions such as step (where the regression is piecewise constant with at most pieces), researchers have used an approach based on searching through values to find , using a feasibility construction to guide the search and to generate the final regression [11, 15, 18, 23, 24, 28, 44]. feasibility constructions are performed, so the time is .
This paper gives a technique for determining more rapidly, using a nonconstructive feasibility test (“pairwise feasibility test”, Sec. 2.3). It utilizes a “rendezvous graph” (Sec. 2.2) as a succinct representation of the transitive closure of the original dag. This graph has levels, with the base corresponding to the vertices of the original dag. The algorithm traverses the rendezvous graph level by level, with “bounded error envelopes” (Sec. 2.1) used to generate values and for the pairwise error test.
2.1 Bounded Error Envelopes
There is a geometric interpretation of and : for a weighted value , the ray in the upper half plane that starts at with slope gives the error of using a regression value greater than (this will be called the upward ray) and the ray that starts at with slope gives the error of using a smaller value (this will be called the downward ray). If then and are given by the intersection of the upward ray of and the downward ray of . Given disjoint sets , let . If every element of precedes every element of , then any isotonic function must have error at least . To determine let be the set of downward rays corresponding to the elements of and the set of upward rays corresponding to elements of . The downward envelope of is the piecewise linear function which is the pointwise maximum over all rays in , and the upward envelope of is the pointwise maximum of all rays in (see Figure 2). The value of is the error of the intersection of the downward envelop of and the upward envelope of (if they don’t intersect then ), and the ordinate of this intersection is the regression value achieving .
While error envelopes have been used by many authors, in general not all of the segments in the envelopes are necessary. Suppose bounds , are known, where . Then only the portions of the envelopes representing errors in are needed to determine . The segments of the envelope representing errors in are essential, and the others are inessential and can be pruned. The essential segments form a bounded envelope, as shown in Figure 2. The interval is continually narrowed. However, this is only used to reduce the number of essential segments at the current level or above: nodes at lower levels are never revisited.
Bounded envelopes will be associated with vertices of the rendezvous graph described below, where the downward bounded envelope at vertex is denoted and its upward bounded envelope is . Since we will always have an upper bound , and the intersection of and is no greater than , with bounded envelopes we can determine the intersection of and if it is , and if it is then we can determine that fact though not necessarily its (not needed) value. Bounded envelopes are stored as simple lists ordered by slope, and all operations on bounded envelopes can be done in time linear in the number of segments.
2.2 Rendezvous Graphs
For a dag , a rendezvous graph is used to help speed up the calculations. These were introduced in [45], though here they are modified slightly. The nodes of have height , for some , where the leaves are at height 0 and correspond to the vertices of . A node at height only has edges to nodes at height and . For dags more complicated than a linear order the rendezvous graph is not a tree, but we will still call the adjacent nodes below children and adjacent nodes above parents (see Figure 3). A node covers a set , where corresponds to the set of leaf nodes in the descendants of . A node may have several children, but there is a unique small child, denoted , and a unique large child, denoted , such if and covers and covers , then in . Further, for every pair of vertices , if then there is a vertex such that is covered by and is covered by . This is their rendezvous node. Here their rendezvous node is unique, but the algorithm only depends upon the existence of at least one rendezvous for them.
To explain the presence of other children, suppose covers a rectangular matrix, with children nodes , , , and , each covering a quadrant of the rectangle. If covers the lower left quadrant (in standard ordering), then it is the small child, and if covers the upper left quadrant it is the large child. Both of and cover vertices that are successors of some vertices covered by but not all, and vertices that are predecessors of some elements covered by but not all. Vertices in covered by or rendezvous with those covered by or at lower levels. However, data values in all 4 quadrants are needed to construct ’s bounded error envelopes to be used at higher levels.
For every vertex , the intersection of and gives
values involving vertices not in nor , are ignored. Since every pair in rendezvous somewhere, . Whenever is calculated we set and never revisit .
2.3 Pairwise Feasibility Test and Reducing the Number of Segments
A pairwise feasibility test is given , and determines if for all pairs where . If true, . It can be proven by showing for all rendezvous nodes . Let , i.e., the regression value of the point on with error , and be the corresponding value for the upward envelope. Then iff . A test is initiated at level after has been determined for all vertices at level and below, in which case is the largest such value ( isn’t computed exactly if since ). Thus we only need to determine if for vertices at level and above. First, for all vertices at height , construct and from the union of the envelopes of all of ’s children.
For a pairwise feasibility test of initiated at level :

For all vertices at height h, compute and .

Level by level, starting at level , for a vertex , if then the test fails. Otherwise, and .
The pairwise feasibility test succeeds iff it does not fail anywhere, in which case . Its time is the time to compute the bounded envelopes at level , which is linear in the number of envelope segments at that level, plus time linear in the sum, over all nodes at height and above, of the number of children. The algorithms insure that both terms are .
To reduce the time the number of envelope segments is continually reduced. After and are created for all nodes at height , for a bounded upward or downward envelope at a node, consider the errors of the endpoints of its essential segments (see Fig. 2) and let be the union of these errors over all nodes at level ( is a multiset). Then either or there are two consecutive errors , such that (or is less than the smallest, or greater than the largest, value in ). By taking the median error in and using a pairwise feasibility test to determine if one can eliminate 1/2 of the endpoint errors in and eliminate corresponding segments in the envelopes. This is not quite the same as eliminating 1/2 the segments since each bounded envelope has at least one segment, hence the number of essential segments is , where is the number of vertices at level .
Observation 1: Let be given. Suppose there are essential segments at level , each rendezvous node has at most parents (and hence each bounded envelope from level is used at most times to create envelopes at level ), and the number of nodes at level is . Then initially there are at most essential segments at level , and hence at most this many segment endpoint errors. Using the median times to reduce the number of segments errors results in at most segment endpoint errors, and thus at most essential segments, at level . Thus if we always use the median this many times at every level and start with segments at the base there will be essential segments at level .
3 Dimensional Points in Grids and Arbitrary Position
For points in dimensional space, , the componentwise ordering, also known as domination or product order, is used, i.e., iff for all . Since it is only the ordering of the independent variable that is important, not their values, for linear orders we assume the values are . A dimensional grid of size has vertices , where for all . To avoid degeneracy we assume that for all . For points in arbitrary position the coordinate, , has values , where each value appears at least once. If the original coordinates are different they can be converted to this form in time, but this is only needed for points in arbitrary position.
For grids the input is a array, but for points in arbitrary position we merely assume the data is given in a list or linear array format since it may be that .
Theorem 1
Given weighted data on a set of vertices in dimensional space with componentwise ordering, , an isotonic regression of the data can be determined in

time and space if the vertices form a grid, and

time and space if the vertices are in arbitrary positions,
where the implied constants depend upon .
a) is proven in Sec. 3.1 for linear orders and in Sec. 3.2 for dimensional grids. b) is proven in Sec. 3.3.
3.1 Linear Orders
Let be a linear order with vertices . The rendezvous graph for is a simple binary tree (see Figure 3). The only unusual aspect is that the vertices in at height 0 have labels twice that of the ones they correspond to in , which is used merely to simplify the description of the algorithm. A vertex in at height has two children , though the larger child will be absent if . The maximum height is and .
Algorithm A gives the algorithm for isotonic regression on a linear order. It finds the optimal regression error by using and to compute the maximum error at the rendezvous nodes. Technically line 7 is skipped if it is determined that the intersection is less than . For the repeat loop, lines 10–13, Observation 1 shows that only 3 iterations are required to insure that the number of essential segments at level is (a more careful analysis shows that 2 suffice).
To determine the time of the pairwise feasibility test, in Section 2.3 it was shown that a test started at level is linear in the time for determining and for all nodes at level , which is linear in the number of segments in envelopes at level , plus time linear in the number of nodes at higher levels. Both terms are , and thus for each iteration of the forloop at lines 3—15 the time is . Summing over all levels gives , proving Theorem 1 a) for linear orders.
3.2 Multidimensional Grids
Let be a dimensional grid. Its rendezvous graph has vertices , where is the rendezvous graph for a linear order on . For vertex with label in let denote the label of its parent and its height (these are independent of ). For vertex of , . As before, corresponds to the vertices of of height 0.
The parents of rendezvous node are of the form , , where
Thus has at most parents. If any of ’s components are already at the maximum height in their dimension then has no parents.
For example, suppose and . Then , (see Fig. 3). ’s parents are (7,8) and (7,9). The ancestors of that keep 8 as their second coordinate are part of the 1dimensional tree that covers a single row. Meanwhile, covers a array, its unique parent (15,11) covers a array, etc., with all of its ancestors covering a rectangle with twice as many rows as columns.
Let , , , be vertices in . If then their rendezvous node in is where is the rendezvous node of and in the linear ordering on dimension . Since for all , is in the small child of and is in its large child.
A simple weak upper bound on the number of rendezvous nodes at height is if one of the coordinates is at height , and all others are unconstrained. A linear order of size has rendezvous nodes at height and total rendezvous nodes, so an overestimate is
For fixed this is for a constant . Further, each node as at most parents, and thus Observation 1 shows that for a fixed only a constant number of pairwise feasibility tests using the median (lines 1013 in Algorithm A) are needed. Thus the total time is as claimed in Theorem 1 a).
3.3 Points in Arbitrary Position
For points at arbitrary positions in dimensional space their ordering is given implicitly via their coordinates, but representing this with an explicit set of edges can result in a large dag. For example, for 2dimensions, let and . Then every point in is dominated by every point in and representing this explicitly in a dag requires edges. However, in a dag with vertices the ordering can be represented by an edge from each vertex of to and an edge from to each vertex of , using only edges. This is an orderpreserving embedding of with coordinatewise ordering into a dag with slightly more vertices but significantly fewer edges. is sometimes called a Steiner point.
This observation was used in [45] to create an orderpreserving embedding of a set of points in dimensional space into a dag of vertices and edges. Modifying the construction slightly, let be the index of the ancestor of at height in 1dimensional rendezvous graphs and let be the index of the rendezvous node for in 1d rendezvous graphs (these do not depend on the size of the linear ordering as long as it is large enough for them to be defined). For dimensional points , , if then a natural place for them to rendezvous is the dimensional vertex . This was used for grids, but here they rendezvous at a line in the th dimension, with its other coordinates being (see Figure 4). is mapped to point on this line.
To determine we need only find the maximum over all rendezvous lines of the maximum on the line. The lines are independent of each other and we do not use bounded envelopes from one for another. Each point is mapped to lines. The points are first sorted into lexical order and the max on the lines is determined, then into order , etc., as shown in Algorithm B.
In Algorithm B the structure in Figure 4 is never explicitly constructed. Instead, sorting puts the vertices in a line into consecutive locations, and iterating through the sorting orders creates all of the rendezvous lines. Thus the total space is only instead of the needed in [45] to create a multidimensional structure. Each sort can be done in time by using radix sort (the implied constants depend on ), and the maximum on a line can be found in time linear in the number of points on the line (Algorithm A), so the total time is .
There are some aspects in this process that need to be explicated. While the maximum on a line can be determined in time linear in the length of the line, may be superlinear in this length. However, if a given line has points their actual first coordinate is ignored and they are treated as if the coordinates are . Thus the time to find maximum is . Further, on each line one only uses downward bounded envelopes from the small points rendezvousing here, and only upward bounded envelopes from the large points. If rendezvouses at line , it is a small point iff for , and is large iff for . If neither of these conditions hold then it is ignored. All of the bookkeeping necessary to determine a point’s rendezvous coordinates in each iteration, and whether it is small or large, can be done in time, where the implied constants depend on .
This concludes the proof of Theorem 1 b).
4 Trees
Recent applications of isotonic regression on trees include taxonomies and analyzing web and GIS data [1, 9, 22, 34]. Here the isotonic ordering is towards the root. For trees the previously fastest isotonic regression algorithms take time for the [43], [33], and metrics [44]. We will show:
Theorem 2
Given weighted data on a rooted tree of vertices, an isotonic regression of the data can be determined in time.
The algorithm is based on a hierarchical decomposition. First, given a rooted tree one can convert it to a rooted binary tree by replacing a vertex with children by a binary subtree with vertices. The parent of the root of this subtree is ’s parent, and the external leaves are links to ’s children. The number of vertices in is less than twice the number in . Data on it is extended to data on by assigning to all vertices in the binary tree representing . An optimal isotonic regression of this data on yields an optimal isotonic regression on by assigning to the value of the regression at the root of the subtree representing in . Because all of the above transformations can be done in linear time, from now on we assume the tree is binary.
The rendezvous graph is created incrementally, with each level being constructed from the one below. Each rendezvous node covers a subtree of of , where at most 2 of the leaves of aren’t also leaves of . No two nodes at one level cover the same vertex in , and thus each level partitions into subtrees and the nodes at that level form a binary tree. This is similar to the partitioning for a linear order, where the partitioning was into subintervals. The tree edges within a level are not used, but serve as a guide to construct the level above.
Given a tree at level of the rendezvous graph, to create the tree at level ,

First make a copy of .

For every vertex of this copy, if has a single child then if is at an odd depth it merges with its parent and otherwise merges with its child.

In the resulting tree, every leaf merges with its parent, finishing the construction of .
See Figure 5. This 2step process is repeated level by level until only a single node remains. Note that the rules insure that the tree remains binary at each level.
When nodes merge they are rendezvousing. With is stored the downward envelop of all of the vertices in (the envelope could be viewed as being associated with the root of ), and for each of the leaf nodes of that aren’t leaf nodes of it has the upward envelope of the vertices in on the path to the leaf. There are nodes with 2 children, and all other nodes must merge with some node, hence the next level has nodes (better bounds are possible), and thus the total size of the rendezvous graph is linear in the size of the original tree. To determine the number of pairwise feasibility tests needed at each level, downward envelopes are merged at most twice, while upward envelopes are only merged once, so and in Observation 1, so using 3 tests to reduce the number of segment endpoint errors suffices.
This proves Theorem 2.
5 Final Remarks
Isotonic regression is a fundamental nonparametric regression, making only a weak shapeconstrained assumption. It can be applied to linear and multidimensional orders without artificially requiring a metric on natural orderings such as S M L XL, and can be applied to more general ordering such as classification trees. This makes it of increasing interest in machine learning and data mining [8, 9, 14, 16, 21, 32, 34, 36, 49]. The algorithms herein improve upon previous results by and are optimal for grids and trees. This improvement also occurs for dimensional points in general position, an area where previous algorithms were criticized as being too slow, forcing researchers to use inferior substitutes such as approximations or additive models [7, 16, 30, 39, 40, 46].
The algorithms use a mix of a nonconstructive feasibility test, rendezvous graphs, and bounded error envelopes. The test and bounded envelopes are new to this paper. The nonconstructive pairwise test allows one to move up the rendezvous graph, rather than continually returning to the base graph for the constructive test used previously. Bounded error envelopes are important since standard error envelopes require time and space just to build them [10, 18, 45].
Rendezvous graphs for isotonic regression on multidimensional points in arbitrary position were introduced in [45] (a preliminary version was posted in 2008). Two variants were introduced: one had a strong property that, given a set of dimensional points , the rendezvous dag had , and for any , iff there is a rendezvous node such that , . If then there is no path in from to , and thus the transitive closure of the domination ordering is represented by paths of length 2. This is also known as a Steiner 2transitiveclosure spanner, and its size, , is optimal [6]. The second variant corresponds to the form used here, reducing the size to , though the transitive closure can involve paths of length (this can be reduced to by replacing lines with binary trees). This is the smallest known dag where its vertices contain the original points and there is a path in the dag from point to point iff . For the optimal size of such a dag is an open question. Optimality for appears in [45].
Finally, the iterative sorting approach in Algorithm B, which uses only space to provide the functionality of a multidimensional rendezvous dag with edges, gives simple solutions to other multidimensional problems such the domination and empirical cumulative distribution function (ECDF) problems in [4]. It gives the same time bounds but without needing modified algorithms for all lower dimensions. The approach can also produce the transitive closure in time and space, where is the number of edges in the transitive closure. Further, all previous algorithms for isotonic regression on dimensional points, for any , required that the dag be given explicitly, and hence took space. The only exception is that one can do unweighted isotonic regression in space since it only utilizes simple operations that can be accumulated pairwise, but this approach takes time.
References
 [1] Agarwal, PK, Phillips, JM, and Sadri, B (2010), “Lipschitz unimodal and isotonic regression on paths and trees”, LATIN 2010, LNCS 6034, pp. 384–396.
 [2] Angelov, S, Harb, B, Kannan, S, and Wang, LS (2006), “Weighted isotonic regression under the L1 norm”, Symp. Discrete Algor. (SODA), pp. 783–791.
 [3] Barlow, RE, Bartholomew, DJ, Bremner, JM, and Brunk, HD (1972), Statistical Inference Under Order Restrictions: The Theory and Application of Isotonic Regression, John Wiley.
 [4] Bentley, JL (1980, “Multidimensional divideandconquer”, Comm. ACM 23, pp. 214–229.
 [5] Beran, R and Dümbgen, L (2010), “Least squares and shrinkage estimation under bimonotonicity constraints”, Statistics and Computing 20, pp. 177–189.
 [6] Berman, P, Bhattacharyya, A, Grigorescu, E, Rashkhodnikova, S, Woodruff, DP, and Yaroslavtsev, G (2014), “Steiner transitiveclosure spanners of lowdimensional posets”, Combinatorica 34, pp. 255–277.
 [7] Burdakov, O, Grimvall, A and Hussian, M (2004), “A generalized PAV algorithm for monotonic regression in several variables”, Proceedings of COMPSTAT’2004
 [8] Caruana, R and NiculescuMizil, A (2006), “An empirical comparison of supervised learning algorithms”, Proc. Int’l. Conf. Machine Learning.
 [9] Chakrabarti, D, Kumar, R and Punera, K (2007), “Pagelevel template detection via isotonic smoothing”, Proc. 16th Int’l. World Wide Web Conf.
 [10] Chen, DZ and Wang, H (2013), “Approximating points by a piecewise linear function”, Algorithmica 66, pp. 682–713.
 [11] DíazBáñnez, JM and Mesa, JA (2001), “Fitting rectilinear polygonal curves to a set of points in the plane”, Eur. J. Operational Res. 130, pp. 214–222.
 [12] Dykstra, R, Hewett, J and Robertson, T (1999), “Nonparametric, isotonic discriminant procedures”, Biometrika 86, pp. 429–438.
 [13] Dykstra, RL and Robertson, T (1982), “An algorithm for isotonic regression of two or more independent variables”, Annals of Statistics 10, pp. 708–716.
 [14] Feelders, Ad (2010), “Monotone relabeling in ordinal classification”, Proc. IEEE Int?l Conf. Data Mining (ICDM), pp. 803–808
 [15] Fournier, H and Vigneron, A (2013), “A deterministic algorithm for fitting a step function to a weighted pointset”, Info. Proc. Let. 113, pp. 51–54.
 [16] Gamarnik, D (1998), “Efficient learning of monotone concepts via quadratic optimization”, Proceedings of Computational Learning Theory (COLT), pp. 134–143.
 [17] Gebhardt, F (1970), “An algorithm for monotone regression with one or more independent variables”, Biometrika 57, pp. 263–271.
 [18] Guha, S and Shim, K (2007), “A note on linear time algorithms for maximum error histograms”, IEEE Trans. Knowledge and Data Engin. 19, pp. 993–997.
 [19] Hardwick, J and Stout, QF (2012), “Optimal reduced isotonic reduction”, Proc. Interface 2012.
 [20] Hochbaum, DS and Queyranne, M (2003), “Minimizing a convex cost closure set”, SIAM J. Discrete Math 16, pp. 192–207.
 [21] Kalai, A.T. and Sastry, F. (2009), “The Isotron algorithm: Highdimensional isotonic regression”, COLT ’09.
 [22] van de Kamp, R, Feelders, A, and Barile, N, (2009), “Isotonic classification trees”, Advances in Intel. Data Analysis VIII, LNCS 5772, pp. 405–416.
 [23] Karras, P., Sacharidis, D., and Mamoulis, N. (2007), “Exploiting duality in summarization with deterministic guarantees”, Proc. Int’l. Conf. Knowledge Discovery and Data Mining (KDD), pp. 380–389.
 [24] Kaufman, Y and Tamir, A (1993), “Locating service centers with precedence constraints”, Discrete Applied Math. 47, pp. 251–261.
 [25] Kotlowski, W and Slowinski, R (2013), “On nonparametric ordinal classification with monotonicity constraints”, IEEE Trans. Knowledge and Data Engin. 25, pp. 2576–2589.
 [26] Kyng, R, Rao, A, and Sachdeva, S (2015), “Fast, provable algorithms for isotonic regression in all norms”, Proc. NIPS, pp. 2719–2727.
 [27] Lin, T.C., Kuo, C.C., Hsieh, Y.H., and Wang, B.F. (2009), “Efficient algorithms for the inverse sorting problem with bound constraints under the norm and the Hamming distance”, J. Comp. and Sys. Sci. 75, pp. 451–464.
 [28] Liu, JY (2010), “A randomized algorithm for weighted approximation of points by a step function”, COCOA 1, pp. 300–308.
 [29] Liu, MH and Ubhaya, VA (1997), “Integer isotone optimization”, SIAM J. Optimization 7, pp. 1152–1159.
 [30] Luss, R, Rosset, S and Shahar, M (2010), “Decomposing isotonic regression for efficiently solving large problems”, Proceedings of Neural Information Processing Systems (NIPS).

[31]
Meyer, M (2010),
“Inference for multiple isotonic regression”,
www.stat.colostate.edu/research/TechnicalReports/2010/2010_2.pdf  [32] Moon, T, Smola, A, Chang, Y and Zheng, Z (2010), “IntervalRank — isotonic regression with listwise and pairwise constraints”, Proc. Web Search and Data Mining, pp. 151–160.
 [33] Pardalos, PM and Xue, G (1999), “Algorithms for a class of isotonic regression problems”, Algorithmica 23, pp. 211–222.
 [34] Punera, K and Ghosh, J (2008), “Enhanced hierarchical classification via isotonic smoothing”, Proceedings International Conference on the World Wide Web 2008, pp. 151–160.
 [35] Qian, S and Eddy, WF (1996), “An algorithm for isotonic regression on ordered rectangular grids”, Journal of Computational and Graphical Statistics 5, pp. 225–235.
 [36] Rademacker, M, De Baets, R, and De Meyer, H (2009), “Loss optimal monotone relabeling of noisy multicriteria data sets”, Info. Sciences 179, pp. 4089–4096.
 [37] Robertson, T and Wright, FT (1973), “Multiple isotonic median regression”, Annals of Statistics 1, pp. 422–432.
 [38] Robertson, T, Wright, FT, and Dykstra, RL (1988), Order Restricted Statistical Inference, Wiley.
 [39] Saarela, O and Arjas, E (2011), “A method for Bayesian monotonic multiple regression”, Scandinavian Journal of Statistics 38, pp. 499–513.
 [40] Salanti, G and Ulm, K (2001), “Multidimensional isotonic regression and estimation of the threshold value”, Discussion paper 234, Institute für Statistik, LudwigMaximilians Universität, Munchen.
 [41] Salanti, G and Ulm, K (2003), “A nonparametric changepoint model for stratifying continuous variables under order restrictions and binary outcome”, Stat. Methods Med. Res. 12, pp. 351–367.
 [42] Spouge, J, Wan, H, and Wilber, WJ (2003), “Least squares isotonic regression in two dimensions”, Journal of Optimization Theory and Applications 117, pp. 585–605.

[43]
Stout, QF (2013),
“Isotonic regression via partitioning”,
Algorithmica 66, pp. 93–112.
www.eecs.umich.edu/ ~qstout/abs/L1IsoReg.html 
[44]
Stout, QF (2013),
“Weighted L isotonic regression”, submitted.
www.eecs.umich.edu/ ~qstout/abs/LinfinityIsoReg.html  [45] Stout, QF (2015), “Isotonic regression for multiple independent variables”, Algorithmica 71, pp. 450–470. www.eecs.umich.edu/ ~qstout/abs/MultidimIsoReg.html
 [46] Sysoev, O, Burdakov, O, Grimvall, A (2011), “A segmentationbased algorithm for largescale partially ordered monotonic regression”, Computational Statistics and Data Analysis 55, pp. 2463–2476.
 [47] Thompson,WA Jr (1962), “The problem of negative estimates of variance components”, Annals of Math. Stat. 33, pp. 273–289.
 [48] Ubhaya, VA (1974), “Isotone optimization, I, II”, J. Approx. Theory 12, pp. 146–159, 315–331.
 [49] Velikova, M and Daniels, H (2008), Monotone Prediction Models in Data Mining, VDM Verlag.