Abstract
The Binary Space Partitioning (BSP)-Tree process was proposed to produce flexible 2-D partition structures, and was originally used as a Bayesian nonparametric prior for relational modelling. It can hardly be applied to other learning tasks, such as regression trees, because extending the BSP-Tree process to a higher-dimensional space is nontrivial. This paper is the first attempt to extend the BSP-Tree process to a d-dimensional (d > 2) space. We propose to generate a cutting hyperplane, which is assumed to be parallel to d − 2 dimensions, to cut each node in the d-dimensional BSP-tree. By designing a subtle strategy to sample two free dimensions from the d dimensions, the extended BSP-Tree process can inherit the essential self-consistency property from the original version. Based on the extended BSP-Tree process, an ensemble model, named the BSP-Forest, is further developed for regression tasks. Thanks to the retained self-consistency property, we can significantly reduce the geometric calculations required in the inference stage. Compared to its counterpart, the Mondrian Forest, the BSP-Forest can achieve similar performance with fewer cuts due to its flexibility. The BSP-Forest also outperforms other (Bayesian) regression forests on a number of real-world data sets.
1 Introduction
Several machine learning methods, such as decision trees for regression and relational modelling for identifying interaction patterns, rely on space-partitioning strategies to identify meaningful “blocks” in a product space. Models may then be fitted to the data in each block, within which the data exhibit certain types of homogeneity. These techniques have found application in relational modelling [14, 1], community detection [22, 13], collaborative filtering [24, 17], and random forests [16]; the space-partitioning strategy has shown promising prospects in real-world applications. As a result, a number of structured space-partitioning priors have been developed, including the Mondrian process [28, 27, 26], the Rectangular Tiling process [21], the Ostomachion process [9], and the Binary Space Partitioning (BSP)-Tree process [8]. Other strategies have also been developed for this task; for example, the Rectangular Bounding Process [7] was recently proposed to partition the space with a bounding strategy, and claims to obtain a parsimonious result.
Among the aforementioned approaches, the BSP-Tree process [8] is an efficient way to partition a two-dimensional space. Instead of the axis-aligned cuts adopted in most conventional approaches [14, 28, 21], the BSP-Tree process uses angled cuts to better describe the potential dependence between dimensions. The BSP-tree process is attractive because it is self-consistent, a property which ensures distributional invariance when restricting the process from a larger domain to a smaller one. However, as the BSP-Tree process was originally proposed for relational modelling, which only requires partitions of a two-dimensional space, it can hardly be extended to a d-dimensional (d > 2) space. This restriction prohibits its application to decision-tree-style models, which usually involve more than two feature dimensions. Unfortunately, we cannot straightforwardly extend the BSP-Tree process to a higher-dimensional space, because doing so would violate self-consistency. It is possible to consider a proper extension that retains the self-consistency property; however, in a d-dimensional space a cutting hyperplane generally lies in a (d − 1)-dimensional space, and the measure over all potential cutting hyperplanes involves complicated integrals, which would incur an unacceptable computational cost for a machine learning task.
In this work, we make the first endeavour to extend the domain of the BSP-Tree process to a d-dimensional (d > 2) space while still keeping its self-consistency. To simplify the process, we propose to generate a cutting hyperplane, which is assumed to be parallel to d − 2 dimensions, to cut each node in the d-dimensional BSP-tree. That is to say, each cutting hyperplane is allowed to have two degrees (dimensions) of freedom during the generative process. The current node, which is a convex polyhedron, is first projected onto a pair of dimensions; the pair is sampled, among all possible configurations, in proportion to the perimeter of the node's projection onto that pair of dimensions. The subsequent cut on this node is then generated on the projected convex polygon in the same way as in the BSP-Tree process, and the cutting hyperplane is parallel to the remaining d − 2 dimensions. This geometrically simple construction permits a flexible multi-dimensional extension of the BSP-tree process that provably retains the self-consistency property.
Our second contribution is to construct an ensemble of BSP-trees, the BSP-forest, to enable complex and flexible regression modelling. In contrast to Bayesian additive regression trees (BART) [5] and the Mondrian Forest [16], which only implement node cuts in a single dimension to generate the tree structure, the BSP-forest uses two dimensions jointly to form a hyperplane cut in the feature space. As a result, the BSP-forest can achieve higher performance with a lower-depth hierarchical tree structure. In addition, because it consists of multiple BSP-trees which select different pairs of dimensions to describe the data in a local region, the BSP-Forest is able to capture all-round pairwise dimensional dependence.
In the inference stage, an efficient Particle Gibbs sampler is developed to infer the BSP-Forest. In particular, instead of cutting the entire node, we can further simplify the partitioning process by only conducting the hyperplane cut in the convex hull of the training data within the node, circumventing the complicated polyhedron-related evaluation. Thanks to the self-consistency of the proposed extended BSP-Tree process, these two strategies lead to the same equilibrium distribution; we can thus largely simplify the sampling procedure and improve efficiency.
The effectiveness of the proposed BSP-Forest is validated through an extensive study on the well-known Friedman's function and then on five real-world data sets. Compared to its counterpart, the Mondrian Forest, the BSP-Forest can achieve similar performance with fewer cuts due to its flexibility. The BSP-Forest also outperforms other (Bayesian) regression forests on a number of real-world data sets.
2 Preliminary: The BSP-Tree Process
The BSP-tree process [8] generates partitions, ⊞, on an arbitrary two-dimensional convex polygon △. The BSP-tree process is an almost surely right-continuous Markov jump process on [0, τ], where τ is a pre-fixed budget. Let ⊞(t) denote the BSP-tree partition at time t. ⊞(t) is a hierarchical partition of △, which can be represented as a triplet (𝒯, θ, u). 𝒯 is a finite binary space partitioning tree, and θ and u respectively specify, for each intermediate node of the tree, the orthogonal slope θ of the cutting line and the cut position u on the projected segment of the node.
The probability density function of θ is proportional to the length of the projected segment, and the cut position u is uniformly distributed on that segment. 𝒯 represents a hierarchical partition of △ into the root node (△ itself; the base of the domain), intermediate nodes, and terminal nodes. Intermediate nodes are generated from their parental node through a node split and are themselves split into two child nodes through the cutting line.
Terminal nodes are the final generated blocks, which do not contain cutting lines.
Defining the time of the m-th cut as τ_m, the incremental cut time τ_{m+1} − τ_m depends only on the partition at time τ_m. Given an existing partition ⊞(τ_m), the time to the next cut follows an Exponential distribution:

(1)  τ_{m+1} − τ_m ∼ Exp( Σ_k PE(▵^{(k)}(τ_m)) ),

where PE(▵^{(k)}(τ_m)) denotes the perimeter of the k-th block in the partition ⊞(τ_m). Each cut divides one block (chosen with probability proportional to its perimeter) into two new blocks and forms a new partition. If the time index of the new cut exceeds the budget τ, the BSP-tree process terminates and returns the partition ⊞(τ) as the final realisation.
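As a minimal sketch of these jump-time dynamics, the Python fragment below (illustrative only; block perimeters are taken as given numbers rather than computed from polygons, and the function names are ours) samples the next cut time from the Exponential distribution in (1) and selects the block to cut with probability proportional to its perimeter:

```python
import random

def next_cut_time(t, perimeters, rng=random):
    """Sample the time of the next cut: the waiting time is Exponential
    with rate equal to the total perimeter of all current blocks (Eq. 1)."""
    rate = sum(perimeters)
    return t + rng.expovariate(rate)

def choose_block(perimeters, rng=random):
    """Choose the block to cut with probability proportional to its perimeter."""
    total = sum(perimeters)
    u = rng.uniform(0.0, total)
    acc = 0.0
    for k, p in enumerate(perimeters):
        acc += p
        if u <= acc:
            return k
    return len(perimeters) - 1
```

If the sampled time exceeds the budget τ, the loop terminates; otherwise the chosen block is split and the two child perimeters replace the parent's in the list.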
Self-consistency
Self-consistency of the BSP-Tree process refers to the property that, when the BSP-tree process on a convex polygon △ is restricted to any sub-region △′ ⊆ △, the resulting partitioning on the sub-region is distributed as if it were directly generated on △′ through the BSP-tree's generative process. This property can be usefully exploited in settings such as online learning and domain expansion. See [8] for a more detailed description of the BSP-tree process.
3 Extending the BSP-tree process to d-dimensional space
One motivation for extending the BSP-tree process to a multi-dimensional space is regression. Here, we have labelled data points {(x_i, y_i)}_{i=1}^n, where we aim to predict the unknown labels y from the d-dimensional predictors x. Other applications, such as classification, simply require straightforward changes to the likelihood construction for the labels.
The proposed multi-dimensional BSP-Tree process works similarly to its original version. It is still a continuous-time Markov jump process where, for 0 ≤ t ≤ τ, the value taken at t, denoted ⊞(t), is a BSP-Tree partition of a convex polyhedron □ and also a further refinement of the value taken at any earlier time t′ < t (see Figure 1). For the domain (root node) □, the partition result is composed of a set of convex polyhedra and is recursively generated through a series of cutting hyperplanes.
In order to extend the domain of the BSP-Tree process to a d-dimensional (d > 2) space while retaining its self-consistency, we consider a reduced generative process where each cutting hyperplane is only allowed to have two degrees (dimensions) of freedom. In this way, each potential cutting hyperplane on □ is assumed to be parallel to the remaining d − 2 dimensions, i.e., all dimensions except the selected two.
Given the current partition ⊞(τ_m) and the cut time τ_m, the next cutting hyperplane is generated in the following steps (see the illustrations of Steps 1–4 in Figure 2):

(1) Sample a candidate polyhedron (leaf node) ▵ from all the existing leaf nodes in proportion to Σ_{(i,j)} PE(▵_{(i,j)}), where (i, j) denotes an arbitrary pair of dimensions from the d dimensions, ▵_{(i,j)} denotes the projection of ▵ onto the dimensions (i, j), and PE(·) denotes the perimeter of the projection (i.e., a two-dimensional polygon);

(2) Sample a pair of free dimensions (i*, j*) from all possible pairs in proportion to PE(▵_{(i,j)});

(3) On the projection ▵_{(i*,j*)}, sample a direction θ, where the probability density function of θ is proportional to the length of the line segment l(θ) onto which ▵_{(i*,j*)} is projected in the direction of θ, and sample the cutting position u uniformly on the line segment l(θ).^1 The proposed cutting hyperplane is formed as the straight line passing through u and crossing through the projection ▵_{(i*,j*)}, orthogonal to l(θ) in the dimensions (i*, j*) and parallel to the remaining d − 2 dimensions;

(4) Sample the incremental time for the new cut as τ_{m+1} − τ_m ∼ Exp(Σ_▵ Σ_{(i,j)} PE(▵_{(i,j)})). If τ_{m+1} > τ, reject the proposed cutting hyperplane and return ⊞(τ) as the final partition structure; otherwise accept the proposed cutting hyperplane, increase m to m + 1, and go back to Step (1).
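The sampling in Steps 1–3 can be sketched in Python. As a simplification, the leaves below are assumed to be axis-aligned hyperrectangles, so that the projection onto a pair of dimensions (i, j) is a rectangle with perimeter 2(ℓ_i + ℓ_j); the actual model works with general convex polyhedra, and all function names here are our own:

```python
import math
import random
from itertools import combinations

def sample_pair_and_block(boxes, rng=random):
    """Steps 1-2: sample a (leaf, dimension pair) with probability
    proportional to the perimeter of the leaf's projection onto that pair.
    `boxes` is a list of (lower, upper) coordinate bounds, one per leaf."""
    entries, weights = [], []
    for k, (lo, hi) in enumerate(boxes):
        d = len(lo)
        for i, j in combinations(range(d), 2):
            entries.append((k, i, j))
            # projected perimeter of a hyperrectangle onto dims (i, j)
            weights.append(2.0 * ((hi[i] - lo[i]) + (hi[j] - lo[j])))
    return rng.choices(entries, weights=weights)[0]

def sample_cut(lo, hi, i, j, rng=random):
    """Step 3: sample a direction theta with density proportional to the
    projected segment length (by rejection), then a position u uniformly on
    that segment.  The cut is the set of x with
    x_i*cos(theta) + x_j*sin(theta) = u."""
    li, lj = hi[i] - lo[i], hi[j] - lo[j]
    lmax = li + lj  # upper bound on the projected length, for rejection
    while True:
        theta = rng.uniform(0.0, math.pi)
        seg = li * abs(math.cos(theta)) + lj * abs(math.sin(theta))
        if rng.uniform(0.0, lmax) < seg:
            break
    # project the rectangle's corners onto direction theta, pick u uniformly
    corners = [(x, y) for x in (lo[i], hi[i]) for y in (lo[j], hi[j])]
    proj = [x * math.cos(theta) + y * math.sin(theta) for x, y in corners]
    u = rng.uniform(min(proj), max(proj))
    return theta, u
```

Step 4 then reuses the Exponential jump-time mechanism of the two-dimensional process, with the rate replaced by the sum of projected perimeters over all leaves and dimension pairs.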
Sampling a two-dimensional pair (Step (2)) is the novel key step that extends the BSP-tree process to d-dimensional spaces; all other steps are natural and logical extensions of the generative process of the two-dimensional BSP-tree process.
Through the above generative process, the cutting hyperplane can be parameterised by the index of the selected polyhedron (leaf node), the selected pair of dimensions (i*, j*), and the direction and position (θ, u) of the cut on the projection ▵_{(i*,j*)}. The cutting hyperplane is parallel to all dimensions except i* and j*, such that it is fully characterised on ▵_{(i*,j*)}.
Moreover, the cutting hyperplane is only meaningful (i.e., it intersects ▵) if and only if it intersects the projection ▵_{(i*,j*)} on dimensions (i*, j*). As ▵_{(i*,j*)} is a convex polygon, it follows from Proposition 2 of [8] that the measure of the hyperplane cuts on dimensions (i*, j*) is uniform over the perimeter of ▵_{(i*,j*)}. Thus, the measure of the hyperplane cuts for ▵ is uniform over the sum of the perimeters of its projections over all unique pairs of dimensions, i.e., Σ_{(i,j)} PE(▵_{(i,j)}).
When d = 2, the above extended BSP-tree process reduces to the original two-dimensional version of [8]. According to Propositions 1 and 2 in [8], the likelihood of a cutting line on the two dimensions (i*, j*) is 1/PE(▵_{(i*,j*)}). Extending this cut to d dimensions, the likelihood of the hyperplane is 1/Σ_{(i,j)} PE(▵_{(i,j)}). That is, all potential partitions are equally favoured and the hyperplane cut is uniformly distributed, without involving additional (prior) knowledge about the cutting hyperplane.
Self-consistency of the extended BSP-tree process:
As a result of this particular technique for generating d-dimensional cutting hyperplanes, the BSP-tree process retains the self-consistency property in d dimensions.
Proposition 1
The extended BSP-tree process is self-consistent in d-dimensional (d ≥ 2) space, and maintains distributional invariance when restricting its domain from a convex polyhedron to a sub-domain.
That is, if the extended BSP-Tree process on a finite convex polyhedron □ is restricted to a sub-region □′ ⊆ □, then the resulting partitions restricted to □′ are distributed as if they were directly generated on □′ through the BSP-tree generative process. Subsequent application of the Kolmogorov Extension Theorem [15] shows that the self-consistency property then enables the BSP-tree process to be defined on infinite multi-dimensional space.
This useful property can greatly simplify the sampling procedure and improve algorithmic efficiency in d-dimensional (d > 2) implementations, which would otherwise be extremely complicated (see the inference section below).
4 Binary Space Partitioning (BSP)-Forests
We consider the label-predicting regression task outlined earlier: for observed data {(x_i, y_i)}_{i=1}^n, we aim to predict the unknown labels y from the d-dimensional predictors x.
4.1 Model
We apply the extended BSP-tree process in a random-forest-style model. For the regression, we attach mean intensity parameters, μ, to the leaf nodes of each tree structure, and use the sum of the intensities from the M BSP-trees to predict the unknown label. That is,

(2)  y_i ∼ N( Σ_{m=1}^{M} Σ_{k=1}^{K_m} μ_k^{(m)} · 1(x_i ∈ ▵_k^{(m)}),  σ² ),

where {μ_k^{(m)}}_{k=1}^{K_m} denotes the set of mean parameters associated with the leaf nodes of the m-th BSP-tree, K_m is the number of leaf nodes in the m-th BSP-tree, 1(x_i ∈ ▵_k^{(m)}) = 1 if x_i belongs to the k-th leaf node of the m-th BSP-tree (and 0 otherwise), and σ² denotes the observation variance.
Determination of node assignments for x_i follows the typical procedure used for a decision tree. To begin, x_i belongs to the root node of the BSP-tree. At any non-leaf node, the node's cutting hyperplane determines which of the two child nodes x_i belongs to, according to which side of the hyperplane x_i falls on. Through a sequence of such binary node assignments from the top to the bottom of the tree, x_i finally falls into a single leaf node of the tree, which allocates its intensity μ to the regression.
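A minimal sketch of this routing procedure, assuming each internal node stores its free dimension pair (i, j) and its cut (theta, u) from the generative process (the dict-based tree layout and the left/right convention are our own illustration, not the paper's data structure):

```python
import math

def route(x, node):
    """Route a data point to a leaf of a BSP-tree and return its intensity.
    Internal nodes hold the free dimensions (i, j), the cut (theta, u), and
    two children; leaves hold the mean parameter `mu`.  The sign of
    x_i*cos(theta) + x_j*sin(theta) - u decides the child."""
    while 'mu' not in node:
        i, j = node['i'], node['j']
        theta, u = node['theta'], node['u']
        side = x[i] * math.cos(theta) + x[j] * math.sin(theta) - u
        node = node['left'] if side <= 0.0 else node['right']
    return node['mu']
```

Summing `route(x, tree_m)` over the M trees gives the sum-of-trees mean in Eq. (2).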
In common with other regression forests, each individual tree contributes only a fraction of the model for label y, which prevents any one tree from dominating the prediction. Further, because the ensemble comprises multiple BSP-trees, each potentially selecting different dimensional pairs to describe the data in local regions, the BSP-forest is able to capture all pairwise dependence among the d dimensions. Similarly, the potentially different depths of each tree allow the BSP-forest to reflect different granularities of interactions between dimensions.
Each regression tree in the BSP-forest presents a hierarchical partition structure on the d-dimensional space. From this perspective, the sum-of-trees forest mechanism provides a compositional “smoothing” of the partition structure in local regions of the space, thereby greatly increasing the flexibility of the regression model. Given the number of trees M, a BSP-forest is determined by the BSP-tree structures {⊞_m}_{m=1}^M with their associated leaf-node mean parameters {μ^{(m)}}_{m=1}^M, and the observational variance parameter σ².
Prior specification for (σ², M, μ):
We specify the prior distributions for σ², the number of trees M in the BSP-forest, and the mean parameters μ for all trees following [5]. Namely: (1) σ² ∼ Inverse-Gamma(ν/2, νλ/2), where, given an estimate σ̂² of the sample variance, λ is set to satisfy the condition P(σ² ≤ σ̂²) = q, evaluated through the Inverse-Gamma c.d.f.; (2) as the labels y are typically standardised, the prior for each mean parameter is specified as a zero-mean Gaussian; (3) while a prior for M can be specified, and inference performed by a reversible-jump-type algorithm [29], for simplicity we fix M, which we find performs well in practice.
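The quantile condition in (1) can be solved numerically without a statistics library. The sketch below is our own helper (it assumes the BART-style parameterisation σ² ∼ IG(ν/2, νλ/2) and the condition P(σ² ≤ σ̂²) = q); it evaluates the Inverse-Gamma c.d.f. through the regularised lower incomplete gamma function and bisects for λ on a log scale:

```python
import math

def reg_lower_gamma(a, x, terms=500):
    """Regularised lower incomplete gamma P(a, x), via its power series
    gamma(a, x) = x^a e^{-x} * sum_n x^n / (a (a+1) ... (a+n))."""
    if x <= 0.0:
        return 0.0
    total, term = 0.0, 1.0 / a
    for n in range(1, terms):
        total += term
        term *= x / (a + n)
    return total * math.exp(a * math.log(x) - x - math.lgamma(a))

def invgamma_cdf(s, shape, scale):
    """P(X <= s) for X ~ Inverse-Gamma(shape, scale), using
    P(X <= s) = P(Gamma(shape, rate=scale) >= 1/s) = 1 - P(shape, scale/s)."""
    return 1.0 - reg_lower_gamma(shape, scale / s)

def calibrate_lambda(sigma2_hat, nu=3.0, q=0.9, lo=1e-8, hi=1e8):
    """Bisect for lambda so that sigma^2 ~ IG(nu/2, nu*lambda/2) places
    prior probability q below the sample estimate sigma2_hat."""
    for _ in range(200):
        mid = math.sqrt(lo * hi)  # geometric bisection
        if invgamma_cdf(sigma2_hat, nu / 2.0, nu * mid / 2.0) > q:
            lo = mid  # c.d.f. too high: increase lambda
        else:
            hi = mid
    return math.sqrt(lo * hi)
```

The defaults ν = 3 and q = 0.9 above are illustrative choices in the spirit of [5], not values stated in this paper.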
Comparison with BART and MF:
The BSP-forest is directly related to two popular Bayesian regression-tree models: Bayesian additive regression trees (BART) and the Mondrian forest (MF). The key difference and advantage here is that the BSP-forest uses angled hyperplane cuts constructed from two dimensions, rather than the one-dimensional, axis-aligned cuts in both BART and MF. This provides a more flexible way to model the observed data, and as a result a lower tree depth is required to obtain the same or even improved modelling capability.
4.2 Inference
Posterior inference for the BSP-forest is implemented using Markov chain Monte Carlo (MCMC). The joint distribution of all parameters, including the partition structures of the extended BSP-trees {⊞_m}_{m=1}^M, the leaf-node mean parameters of each tree {μ^{(m)}}_{m=1}^M, and the observational variance σ², can be written as:

p(σ², {⊞_m, μ^{(m)}}_{m=1}^M | X, y) ∝ p(σ²) · Π_{m=1}^M [ p(⊞_m) p(μ^{(m)}) ] · Π_{i=1}^n N(y_i; ŷ_i, σ²),

where ŷ_i = Σ_{m=1}^M Σ_{k=1}^{K_m} μ_k^{(m)} 1(x_i ∈ ▵_k^{(m)}) and N(·; ŷ_i, σ²) denotes the Gaussian density. Algorithm 1 outlines the MCMC algorithm, which iteratively samples the variance σ² (Line 3) and each BSP-tree structure ⊞_m with its mean parameters μ^{(m)} (Line 5).
Sampling σ²:
Through conjugacy, the full conditional posterior distribution of σ² is

(3)  σ² | − ∼ Inverse-Gamma( ν/2 + n/2,  νλ/2 + Σ_{i=1}^n (y_i − ŷ_i)²/2 ),

where ŷ_i = Σ_{m=1}^M Σ_{k=1}^{K_m} μ_k^{(m)} 1(x_i ∈ ▵_k^{(m)}) denotes the sum-of-trees prediction for x_i.
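For illustration, a draw from this full conditional can be obtained by inverting a Gamma variate. The helper below is a hedged sketch with a generic IG(a, b) prior in place of the paper's (ν/2, νλ/2) parameterisation:

```python
import random

def sample_sigma2(residuals, a, b, rng=random):
    """Draw sigma^2 from its Inverse-Gamma full conditional (cf. Eq. 3):
    with prior IG(a, b) and Gaussian residuals r_i = y_i - yhat_i, the
    posterior is IG(a + n/2, b + sum_i r_i^2 / 2).  An IG(shape, scale)
    draw is 1 / Gamma(shape, rate=scale); random.gammavariate takes
    (shape, scale), so the Gamma rate becomes 1/scale."""
    n = len(residuals)
    shape = a + 0.5 * n
    scale = b + 0.5 * sum(r * r for r in residuals)
    return 1.0 / rng.gammavariate(shape, 1.0 / scale)
```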
Sampling {⊞_m, μ^{(m)}}:
We use the Conditional Sequential Monte Carlo (C-SMC) sampler [2] to infer the partition structures of all the extended BSP-trees and the mean parameters μ. There is a key difference between our sampler and those used in [16, 8]: these previous works take node-splitting events on the time line as the dimensions of the sampled variable, so the particles at the same step are under different budget constraints and the target distribution of the C-SMC sampler may not mix well. In our implementation, we instead adopt fixed intervals on the time line as the dimensions of the variable sampled by the C-SMC sampler. At each step, all the particles are under the same budget constraint, and each interval may involve one, more than one, or even no splitting events. This fixes the number of “dimensions” in C-SMC and improves the efficiency of the sampler. Algorithm 2 shows the detailed strategy to infer the tree structure in each iteration for the m-th BSP-tree.
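The fixed-interval discretisation can be illustrated with a small helper that bins cut times into equal-width intervals of [0, τ]; the interval layout and function name are our own choices for the sketch:

```python
def bin_cut_times(cut_times, budget, num_intervals):
    """Group cut times into fixed, equal-width intervals of [0, budget].
    Each interval may contain zero, one, or several cuts, but every particle
    shares the same interval boundaries (and hence the same budget
    constraint at every C-SMC step)."""
    width = budget / num_intervals
    bins = [[] for _ in range(num_intervals)]
    for t in cut_times:
        if t < budget:  # cuts past the budget are rejected
            bins[min(int(t // width), num_intervals - 1)].append(t)
    return bins
```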
Implementing the cutting hyperplane on the convex hull:
Cutting the full d-dimensional polyhedron generated by a series of cutting hyperplanes on the original domain is both a challenging and an unnecessary task. This is because: (1) complete indexing of this polyhedron requires extensive geometric calculations, including computing the intersections of multiple hyperplanes, deciding whether two hyperplanes intersect in the presence of other hyperplanes, and listing all the vertices of the polyhedron; (2) in some cases the cutting hyperplane falls outside the convex hull of the data (i.e., the minimum convex polyhedron that contains all the points in the block), so it does not partition the available data and therefore does not influence the regression. In practice, data usually lie in a small portion of the feature space, so implementing cutting hyperplanes directly on the full polyhedron is particularly inefficient.
To address these issues, we implement the hyperplane cut on the convex hull only. The projection of the convex hull onto any two dimensions can be obtained by simply slicing the coordinates on these two dimensions and then applying a conventional convex hull detection algorithm (such as the Graham Scan algorithm [12]). Algorithm 3 describes our approach to generating the hyperplane cut in the convex hull, and Fig. 3 visualises the convex hull projection in d-dimensional space.
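A sketch of this projection-plus-hull step: slice two coordinates, then compute a planar convex hull. We use Andrew's monotone chain here as a stand-in for the Graham scan cited in the text (both run in O(n log n) and return the same hull):

```python
def convex_hull_2d(points):
    """Andrew's monotone-chain convex hull of 2-D points; returns the hull
    vertices in counter-clockwise order, interior points removed."""
    pts = sorted(set(points))
    if len(pts) <= 2:
        return pts

    def cross(o, a, b):
        # z-component of (a - o) x (b - o); > 0 means a left turn
        return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])

    lower, upper = [], []
    for p in pts:
        while len(lower) >= 2 and cross(lower[-2], lower[-1], p) <= 0:
            lower.pop()
        lower.append(p)
    for p in reversed(pts):
        while len(upper) >= 2 and cross(upper[-2], upper[-1], p) <= 0:
            upper.pop()
        upper.append(p)
    return lower[:-1] + upper[:-1]

def project_hull(X, i, j):
    """Project d-dimensional data onto dimensions (i, j) by slicing the two
    coordinates, then take the convex hull of the projected points."""
    return convex_hull_2d([(x[i], x[j]) for x in X])
```

The cut of Step 3 is then sampled on `project_hull(X_leaf, i, j)` instead of on the projection of the full polyhedron.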
Due to the self-consistency property of the (extended) BSP-tree process, we conveniently have that:
Proposition 2
The hyperplane cut restricted to the convex hull is distributed the same as if we first partition the “full” polyhedron and then restrict attention to the convex hull. Both methods lead to the identical equilibrium distribution in the MCMC algorithm.
Computational cost
The computational cost of the BSP-Forest is almost the same as that of the batched Mondrian Forest, except that the number of candidate features (used to generate a hyperplane) increases from the d single dimensions to the d(d − 1)/2 dimension pairs.
5 Related Work
Random forests have been studied for decades, and the volume of literature is too large to review here; interested readers can refer to [6] for a recent review. Regarding the structure of a tree in the forest, node splitting in a binary tree usually consists of two steps: (1) choose the most suitable dimension to cut; (2) find the best cutting position on the chosen dimension. A widely adopted strategy is to greedily optimise some information-gain-related criterion. On top of the decision tree, a random forest uses the sum-of-trees strategy to further improve model capability. Additional strategies to de-correlate the trees include bagging and sub-sampling the data set.
Here we take Breiman's Random Forest [3] and the Gradient Boosting regressor [11] as two representative works. Breiman's RF adopts the bagging strategy and chooses a subset of the dimensions for each tree. The Gradient Boosting regressor uses the boosting strategy, which proposes regression trees to fit the negative gradient of the loss function at the current step.
For Bayesian decision trees, Bayesian Additive Regression Trees (BART) [5] and the Mondrian Forest [16] are the two representative counterparts of the proposed BSP-Forest. BART assigns probability distributions to the structure variables of all the trees, while MF uses the Mondrian process as the prior on the tree structure for the forest. In contrast to these two methods, which select only one dimension to conduct each node cut, the BSP-Forest adopts two dimensions together to form a hyperplane cut in the feature space.
Multivariate decision trees [4] (a single-tree model) may be the closest approach to the proposed BSP-tree in terms of inter-dimensional dependence. Their lower popularity might be due to the high computational cost of finding an optimised cutting hyperplane. Compared to our BSP-tree, the lack of prior information in the multivariate decision tree requires more regularisation techniques.
6 Experiments
6.1 Toy Data: Friedman's Function
We first evaluate the performance of the BSP-Forest on Friedman's function [10, 5, 20]. In this test setting, each data point x is generated from the uniform distribution on [0, 1]^d, while its label takes the following form:

y = 10 sin(π x₁ x₂) + 20 (x₃ − 0.5)² + 10 x₄ + 5 x₅ + ε,

where ε denotes the white-noise effect and x_k denotes the k-th dimension of x. Friedman's function consists of two nonlinear terms, two linear terms, and an interaction term between the dimensions.
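For reference, the benchmark can be reproduced in a few lines of Python. The formula follows the standard Friedman test function, and the generator below (our own helper names; the noise scale is an illustrative choice) draws points uniformly on [0, 1]^d:

```python
import math
import random

def friedman(x):
    """Friedman's test function: only the first five dimensions matter."""
    return (10.0 * math.sin(math.pi * x[0] * x[1])
            + 20.0 * (x[2] - 0.5) ** 2
            + 10.0 * x[3]
            + 5.0 * x[4])

def make_data(n, d=10, noise_sd=1.0, rng=random):
    """Generate n points uniformly on [0, 1]^d with noisy Friedman labels."""
    X = [[rng.random() for _ in range(d)] for _ in range(n)]
    y = [friedman(x) + rng.gauss(0.0, noise_sd) for x in X]
    return X, y
```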
Posterior Mean Estimation
We begin with a simple application on posterior mean estimates of the BSP-Forest. The number of dimensions is set to d = 10, so that the function considers only the first five dimensions of the data points and the remaining five dimensions are irrelevant; the number of data points is set to n. The results in Figure 4 show the estimated labels against the ground-truth y on the training and test label sets. A vertical line on each point indicates the posterior confidence interval. We can see that, on the training data set, the predicted values correlate very well with the true values y, and the confidence intervals tend to cover the true values as well. On the test data set, one can observe a slight degradation of the correlation and wider intervals, which indicates a larger variance estimate at new values of x.
Partial Dependence
Partial dependence is commonly used in statistics to show the marginal effect of a feature on a model. Figure 5 shows point and interval estimates of the partial dependence functions for the individual dimensions from the MCMC samples. The non-zero marginal effects of the first five dimensions and the (near-)zero marginal effects of the remaining dimensions are consistent with the form of Friedman's function.
Table 1: RMAE of Random Forest, Gradient Boosting, BART, Mondrian Forest, and the BSP-Forest on the Concrete, diamonds, hatco, servo, and tecator data sets (numerical entries not recoverable).
Dimension Usage
The BSP-Forest can also be used to identify the dimensions most correlated with the regression function. This can be achieved by recording the pair of dimensions used for each cut in each regression tree and counting the number of times each dimension is involved. Figure 6 plots these usage proportions for different numbers of trees M. As M gets smaller, the fitted sum-of-trees models increasingly incorporate only those dimensions that are critical to explain the label y. Without making use of any assumptions or information about the actual function, the BSP-Forest exactly identifies the underlying dimensions that determine the label.
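The counting described above can be sketched as follows, assuming each tree is summarised by the list of dimension pairs (i, j) used in its cuts (a simplification of the full tree structure; the helper name is ours):

```python
from collections import Counter

def dimension_usage(trees, d):
    """Count how often each dimension appears in the sampled (i, j) cut
    pairs across all trees and return per-dimension usage proportions."""
    counts = Counter()
    for tree in trees:  # each tree: a list of (i, j) cut pairs
        for i, j in tree:
            counts[i] += 1
            counts[j] += 1
    total = sum(counts.values()) or 1
    return [counts[k] / total for k in range(d)]
```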
Performance Evaluation w.r.t. Budget
For the Mondrian Forest and the BSP-Forest, the budget parameter τ, which determines the expected depth of the tree, has a large impact on performance. Figures 7 and 8 show the Root Mean Absolute Error (RMAE) and the average number of cuts for the BSP-Forest and the Mondrian Forest as the budget τ takes values from 0.4 to 1.4. We can see that the BSP-Forest obtains better RMAE performance for all budget values. Also, the number of cuts in the BSP-Forest is always smaller than in the Mondrian Forest. This efficiency in the number of cuts is consistent with our expectations of the BSP-Forest. The RMAE curve is quite flat across the values of τ considered; the reason may be that the average number of cuts does not change much: it only changes from 4 to 5 in Figure 8 as the budget varies from 0.4 to 1.4.
Table 2: summary statistics (e.g., number of instances and dimensions) of the Concrete, diamonds, hatco, servo, and tecator data sets (numerical entries not recoverable).
6.2 Real-world Data Sets
The performance of the BSP-Forest is compared with other tree-based models on a number of real-world data sets. The comparison methods include Breiman's Random Forest [3], the Gradient Boosting regressor [11], Bayesian Additive Regression Trees (BART) [5], and the batch version of the Mondrian Forest (MF) [19]. The implementations of Breiman's Random Forest and the Gradient Boosting regressor are imported from the Python module Scikit-Learn [23]. BART is implemented through the R package BART [25]. We implement the Mondrian Forest using the same strategy as the BSP-Forest.
For a fair comparison, we set the number of trees in all the comparison methods to the same value M. Apart from this, the other parameters of Breiman's RF, the Gradient Boosting regressor, and BART are set to the defaults in their modules. The budget τ of MF and the BSP-Forest is set to the same value. In addition, we add scaling parameters to the distribution of the cost variable such that the expectation of the cost at each stage is the same for both MF and the BSP-Forest.
Five real-world data sets [18] are used for the performance evaluation (see Table 2). We normalise the data into [0, 1]. Each data set is randomly divided into five equal parts, of which four are used for training and the rest for testing. We report the Root Mean Absolute Error (RMAE) as a robust measure over random runs in Table 1. It is easy to see that our BSP-Forest performs the best among all the compared methods.
7 Conclusion
This paper makes the first endeavour to extend the BSP-Tree process to a d-dimensional (d > 2) space. By designing a subtle strategy to sample only two free dimensions at a time from the d-dimensional space, the extended BSP-Tree process retains the essential self-consistency property. We further use the extended BSP-trees to construct the BSP-Forest for regression. Compared to other (Bayesian) regression forests, the BSP-Forest performs best due to its flexibility.
8 Acknowledgements
Xuhui Fan and Scott A. Sisson are supported by the Australian Research Council through the Australian Centre of Excellence in Mathematical and Statistical Frontiers (ACEMS, CE140100049), and Scott A. Sisson through the Discovery Project Scheme (DP160102544). Bin Li is supported by Fudan University Startup Research Grant and Shanghai Municipal Science & Technology Commission (16JC1420401).
Footnotes
 Generating a cutting line parameterised by θ and u on the projection (a two-dimensional polygon) follows the same sampling method used in the BSP-Tree process [8].
References
 Edoardo M. Airoldi, David M. Blei, Stephen E. Fienberg, and Eric P. Xing. Mixed membership stochastic blockmodels. In NIPS, pages 33–40, 2009.
 Christophe Andrieu, Arnaud Doucet, and Roman Holenstein. Particle Markov chain Monte Carlo methods. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 72(3):269–342, 2010.
 Leo Breiman. Random forests. Machine learning, 45(1):5–32, 2001.
 Carla E Brodley and Paul E Utgoff. Multivariate decision trees. Machine learning, 19(1):45–77, 1995.
 Hugh A. Chipman, Edward I. George, and Robert E. McCulloch. BART: Bayesian additive regression trees. The Annals of Applied Statistics, 4(1):266–298, 2010.
 Antonio Criminisi, Jamie Shotton, Ender Konukoglu, et al. Decision forests: A unified framework for classification, regression, density estimation, manifold learning and semisupervised learning. Foundations and Trends® in Computer Graphics and Vision, 7(2–3):81–227, 2012.
 Xuhui Fan, Bin Li, and Scott Sisson. Rectangular bounding process. In NeurIPS, pages 7631–7641, 2018.
 Xuhui Fan, Bin Li, and Scott A. Sisson. The binary space partitioningtree process. In AISTATS, volume 84 of Proceedings of Machine Learning Research, pages 1859–1867, 2018.
 Xuhui Fan, Bin Li, Yi Wang, Yang Wang, and Fang Chen. The Ostomachion Process. In AAAI Conference on Artificial Intelligence, pages 1547–1553, 2016.
 Jerome H. Friedman. Multivariate adaptive regression splines. The Annals of Statistics, 19:1–67, 1991.
 Jerome H. Friedman. Greedy function approximation: A gradient boosting machine. The Annals of Statistics, 29(5):1189–1232, 2001.
 Ronald L. Graham. An efficient algorithm for determining the convex hull of a finite planar set. Inf. Process. Lett., 1(4):132–133, 1972.
 Brian Karrer and Mark E.J. Newman. Stochastic blockmodels and community structure in networks. Physical Review E, 83(1):016107, 2011.
 Charles Kemp, Joshua B. Tenenbaum, Thomas L. Griffiths, Takeshi Yamada, and Naonori Ueda. Learning systems of concepts with an infinite relational model. In AAAI, volume 3, pages 381–388, 2006.
 Kai Lai Chung. A Course in Probability Theory. Academic Press, 2001.
 Balaji Lakshminarayanan, Daniel M. Roy, and Yee Whye Teh. Mondrian forests: Efficient online random forests. In NIPS, pages 3140–3148, 2014.
 Bin Li, Qiang Yang, and Xiangyang Xue. Transfer learning for collaborative filtering via a ratingmatrix generative model. In ICML, pages 617–624, 2009.
 M. Lichman. UCI machine learning repository, 2013.
 Dahua Lin, Eric Grimson, and John W. Fisher. Construction of dependent Dirichlet processes based on Poisson processes. In NIPS, pages 1396–1404, 2010.
 Antonio Ricardo Linero and Yun Yang. Bayesian regression tree ensembles that adapt to smoothness and sparsity. arXiv preprint arXiv:1707.09461, 2017.
 Masahiro Nakano, Katsuhiko Ishiguro, Akisato Kimura, Takeshi Yamada, and Naonori Ueda. Rectangular tiling process. In ICML, pages 361–369, 2014.
 Krzysztof Nowicki and Tom A.B. Snijders. Estimation and prediction for stochastic block structures. Journal of the American Statistical Association, 96(455):1077–1087, 2001.
 F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay. Scikitlearn: Machine learning in Python. Journal of Machine Learning Research, 12:2825–2830, 2011.
 Ian Porteous, Evgeniy Bart, and Max Welling. MultiHDP: A non parametric Bayesian model for tensor factorization. In AAAI, pages 1487–1490, 2008.
 Robert McCulloch, Rodney Sparapani, Robert Gramacy, Charles Spanbauer, Matthew Pratola. BART. R Foundation for Statistical Computing, 2018.
 Daniel M. Roy. Computability, Inference and Modeling in Probabilistic Programming. PhD thesis, MIT, 2011.
 Daniel M. Roy, Charles Kemp, Vikash Mansinghka, and Joshua B. Tenenbaum. Learning annotated hierarchies from relational data. In NIPS, pages 1185–1192, 2007.
 Daniel M. Roy and Yee Whye Teh. The Mondrian process. In NIPS, pages 1377–1384, 2009.
 Scott A. Sisson. Transdimensional Markov chains: A decade of progress and future perspectives. Journal of the American Statistical Association, 100:1077–1089, 2005.