Neural Regression Trees
Abstract
Regression-via-Classification (RvC) is the process of converting a regression problem into a classification one. Current approaches to RvC use ad-hoc discretization strategies and are suboptimal. We propose a neural regression tree model for RvC. In this model, we employ a joint optimization framework in which we learn optimal discretization thresholds while simultaneously optimizing the features for each node in the tree. We empirically validate our model on two challenging regression tasks, on which we establish the state of the art.
1 Introduction
One of the most challenging problems in machine learning is that of predicting a numeric attribute of a datum from other features, a task commonly referred to as regression¹ (terminology borrowed from the statistics literature). The relationship between the features and the predicted variable (which, for want of a better term, we will call a "response") can be defined in the form of a regression function. The function is generally unknown and may not be deterministic. The general approach to the problem is to assume a formula for the relationship, and to estimate the details of the formula from training data. Linear regression models assume a linear relationship between the features and the response. Other models, such as neural networks, assume a nonlinear relationship. The problem here is that model parameters that are appropriate for one regime of the data may not be appropriate for other regimes. Statistical fits of the model to the data minimize a measure of the overall prediction error, but may not be truly appropriate for any specific subset of the data. Nonparametric regression models such as kernel regressions and regression trees attempt to deal with this by partitioning the feature space and computing separate regressors within each partition. For high-dimensional data, however, any computed partition runs the risk of overfitting to the training data, necessitating simplifying strategies such as axis-aligned partitions [1, 2].
An alternate strategy, and one that is explored in this paper, is to partition the space based on the response variable. Formally, we define a partition $\mathcal{P}$ on a set $\mathcal{Y}$ as $\mathcal{P}(\mathcal{Y}) = \{\mathcal{Y}_1, \dots, \mathcal{Y}_K\}$,
satisfying $\bigcup_{j=1}^{K} \mathcal{Y}_j = \mathcal{Y}$, where the $\mathcal{Y}_j$s are disjoint. When acting on an element $y \in \mathcal{Y}$, $\mathcal{P}(y)$ returns the index $j$ of the subset $\mathcal{Y}_j$ containing $y$. Now, given a response variable $y$ that takes values in some range $[y_{\min}, y_{\max})$, we find a set of threshold values $t_0 < t_1 < \dots < t_K$ such that $\mathcal{P}(y) = j$ if $y \in [t_{j-1}, t_j)$. This process, which effectively converts the continuous-valued response variable $y$ into a categorical one $\mathcal{P}(y)$, is often referred to as discretization. The new response variable is, in fact, not merely categorical, but ordinal, since its values can be strictly ordered. To determine the value of $y$ for any feature vector $x$ we must now only find out which bin $x$ belongs to; once the appropriate bin has been identified, the actual estimate $\hat{y}$ can be computed in a variety of ways, e.g. as the mean or median of the bin. The problem of regression is thus transformed into one of classification. This process of converting a regression problem into a classification one is commonly known as Regression-via-Classification (RvC) [3].
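As a toy illustration of this pipeline, the following sketch (plain Python, with made-up responses and fixed ad-hoc thresholds of the kind this paper argues against) discretizes a continuous response into bins and regresses by predicting each bin's mean:

```python
import bisect

def discretize(y, thresholds):
    """Map a continuous response y to a bin index via sorted thresholds."""
    return bisect.bisect_right(thresholds, y)

# Toy data: responses and fixed (ad-hoc) bin boundaries.
ys = [1.2, 3.5, 4.1, 7.8, 8.5, 9.9]
thresholds = [4.0, 8.0]            # induces 3 bins: y<4, 4<=y<8, y>=8
labels = [discretize(y, thresholds) for y in ys]   # -> [0, 0, 1, 1, 2, 2]

# Per-bin regression rule: predict the mean response of the bin.
bins = {}
for y, c in zip(ys, labels):
    bins.setdefault(c, []).append(y)
bin_means = {c: sum(v) / len(v) for c, v in bins.items()}

def rvc_predict(bin_index):
    """Regression rule r applied to the classified bin."""
    return bin_means[bin_index]
```

In a full RvC system the bin index would come from a classifier operating on the features $x$; here the true $y$ is discretized directly, only to show the bin/regress decomposition.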
The key aspect of an RvC system is its method of partitioning the response. A naive implementation of RvC can, however, result in very poor regression. An inappropriate choice of bin boundaries can result in bins that are too difficult to classify (since classification accuracy depends on the distribution of the data within the bins). On the other hand, bins that permit near-perfect classification may be too wide, rendering the corresponding "regression" meaningless. Traditional methods define the partition by prior knowledge, e.g. equally probable intervals, equal-width intervals, or k-means clustering [4, 5, 3]. However, these approaches are ad-hoc and can result in unbalanced partitions that are too easy or too difficult to regress; partitions that are not fine enough can lead to high regression bias. Ideally, an RvC method must explicitly optimize the bin boundaries for both classification accuracy and regression accuracy. In addition, the actual classifier employed cannot be ignored; since the decision boundaries are themselves variable, the classifier and the boundaries must conform to one another.
We propose a hierarchical tree-structured model for RvC, which addresses all of the problems mentioned above. Jointly optimizing all the parties involved, namely the partition, the classifier, and the regressor, is a combinatorially hard problem. Instead of solving this directly, inspired by the original idea of regression trees [2], we follow a greedy strategy of hierarchical binary partitioning of the response variable, where each split is locally optimized. This results in a tree-structured RvC model with a classifier at each node. We employ a simple margin-based linear classifier for each classification. However, the features derived from the data may not be optimized for each classifier. Hence, the structure of our model affords us an additional optimization: instead of using a single generic feature representation for all classifiers, we can optimize the features extracted from the data individually for each classifier in the tree. Since we employ a neural network to optimize the features for classification, we refer to this framework as a Neural Regression Tree (NRT).
Specifically, we adopt a top-down binary splitting approach to recursively grow the partition tree. To determine the splitting threshold $t_n$ for a node $n$, we use a binary classifier $c_n$ and find the optimal $t_n$ by minimizing the classification error $e_n(t_n)$ of $c_n$ on $\mathcal{D}_n$, the data arriving at node $n$. The number of thresholds, which determines the number of partitions and hence the depth of the tree, stops increasing when the classification performance saturates. Finally, to determine the value of $y$ for any feature vector $x$, we find out which leaf-node bin $x$ belongs to by following the partition tree, and estimate $\hat{y}$ using a regression function on that bin.
To demonstrate the utility of the proposed approach, we conduct experiments on a pair of notoriously challenging regression tasks in the speech community: estimating the age and height of speakers from their voice. We show that our model performs significantly better than other regression models, including those that are known to achieve the current state of the art on these problems.
2 Related Work
Regression Trees
Tree-structured models have been around for a long time. Among them, the most closely related to ours are regression trees. In our terminology, a regression tree can be described as an RvC system in which the partition is performed on the independent variable $x$ rather than the dependent variable $y$. The first regression tree algorithm was presented by [6], who propose a greedy approach to fit a piecewise constant function by recursively splitting the data into two subsets based on a partition of $x$. The optimal split minimizes an impurity measure that quantifies the homogeneity of the split. This algorithm set the basis for a whole line of research on classification and regression trees. Refined algorithms include CART [1], ID3 [2], M5 [7], and C4.5 [8]. Recent work combines tree structures and neural nets to gain the power of both structure learning and representation learning. Such works include convolutional decision trees [9], neural decision trees [10, 11], adaptive neural trees [12], deep neural decision forests [13], and deep regression forests [14].
We emphasize that there is a fundamental difference between our approach and traditional regression-tree-based approaches: instead of splitting based on the feature space, our splitting criterion is based on the dependent variable, enabling the features to adapt to the partitions of the dependent variable.
Regression via Classification (RvC)
The idea of RvC was presented by [15]. Their algorithm used k-means clustering to categorize numerical variables. Other conventional approaches [5, 3] to the discretization of continuous values are based on equally probable (EP) or equal-width (EW) intervals, where EP creates a set of intervals with the same number of elements, while EW divides the range into intervals of equal width. These approaches are ad-hoc. Instead, we propose a discretization strategy that learns the optimal thresholds jointly with neural classifiers.
3 Neural Regression Tree
In this section, we formulate the neural regression tree model for optimal discretization of RvC, and provide algorithms to optimize the model.
3.1 Model Formulation
Formally, following the description of RvC in Section 1, a classification rule $c(x; \Theta)$ classifies the data $x$ into $K$ disjoint bins $\{\mathcal{Y}_1, \dots, \mathcal{Y}_K\}$, where bin $\mathcal{Y}_j$ corresponds to the interval $[t_{j-1}, t_j)$, and $\Theta$ parameterizes the classifier. Then, a regression rule predicts the value of the dependent variable
(1)  $\hat{y} = r_j(x), \quad j = c(x; \Theta),$
where $r_j$ is any regression function that operates locally on all instances that are assigned to the bin $\mathcal{Y}_j$.
Alternatively, the classification rule may compute the probability $p(j \mid x; \Theta)$ of a data point being classified into bin $\mathcal{Y}_j$, and the regression rule is given by
(2)  $\hat{y} = \sum_{j=1}^{K} p(j \mid x; \Theta)\, r_j(x).$
Defining an error $E(y, \hat{y})$ between the true $y$ and the estimated $\hat{y}$, our objective is to learn the thresholds $T = \{t_1, \dots, t_{K-1}\}$ and the parameters $\Theta$ such that the expected error is minimized:
(3)  $T^*, \Theta^* = \arg\min_{T,\, \Theta} \; \mathbb{E}\!\left[ E(y, \hat{y}) \right].$
Note that the number of thresholds $K$ is itself a variable that may either be set manually or optimized explicitly. In practice, instead of minimizing the expected error, we minimize the empirical average error computed over a training set.
However, joint optimization of $T$ and $\Theta$ is a hard problem, as it scales exponentially with $K$. To solve this problem, we recast RvC in terms of a binary classification tree in which each node is greedily optimized. The structure of the proposed binary tree is shown in Figure 1.
We now describe the tree-growing algorithm. For notational convenience, the nodes are numbered such that for any two nodes $n$ and $n'$, if $n < n'$, then $n$ occurs either to the left of $n'$ or above it in the tree. Each node $n$ has an associated threshold $t_n$, which is used to partition its data into its two children $n'$ and $n''$ (we will assume w.l.o.g. that $n' < n''$). A datum $(x, y)$ is assigned to the "left" child $n'$ if $y < t_n$, and to the "right" child $n''$ otherwise. The actual partitions of the dependent variable correspond to the leaves of the tree. To partition the data, each node carries a classifier $c_n(x; \Theta_n)$, which assigns any instance with features $x$ to one of $n'$ or $n''$. In our instantiation of this model, the classifier is a neural classifier that not only classifies the features but also adapts and refines the features for each node.
Given an entire tree along with all of its parameters and an input $x$, we can compute the a posteriori probability of the partitions (i.e. the leaves) as follows. For any leaf $l$, let $n_0 \to n_1 \to \dots \to n_L$ represent the chain of nodes from root to leaf, where $n_0$ is the root and $n_L = l$ is the leaf itself. The a posteriori probability of the leaf is given by $P(l \mid x) = \prod_{i=0}^{L-1} P(n_{i+1} \mid n_i, x)$, where each $P(n_{i+1} \mid n_i, x)$ is given by the neural classifier at node $n_i$. Substitution into (2) yields the final regressed value of the dependent variable
(4)  $\hat{y} = \sum_{l} P(l \mid x)\, r_l,$
where $r_l$, in our setting, is simply the mean response value of the leaf bin. Other options include the center of gravity of the leaf bin, a learned regression function, etc.
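The chain-product computation of the leaf posteriors and the expectation in (4) can be sketched as follows; the per-node routing probabilities are hard-coded stand-ins for what the neural classifiers would output for one particular input $x$:

```python
# Depth-2 binary tree: the root and its two children each carry a classifier.
# Each classifier outputs P(route right | x); here we hard-code plausible
# outputs for a single input x instead of running actual neural networks.
p_right = {"root": 0.8, "left": 0.3, "right": 0.6}

# Leaf posteriors: product of branch probabilities along the root-to-leaf path.
leaf_post = {
    "ll": (1 - p_right["root"]) * (1 - p_right["left"]),
    "lr": (1 - p_right["root"]) * p_right["left"],
    "rl": p_right["root"] * (1 - p_right["right"]),
    "rr": p_right["root"] * p_right["right"],
}

# Regression rule (Eq. 4): expectation of per-leaf bin means under the posterior.
leaf_mean = {"ll": 20.0, "lr": 40.0, "rl": 60.0, "rr": 80.0}
y_hat = sum(leaf_post[l] * leaf_mean[l] for l in leaf_post)
```

Because each node's two branch probabilities sum to one, the leaf posteriors form a proper distribution, so the soft prediction is a convex combination of the leaf-bin means.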
3.2 Learning the Tree
We learn the tree greedily, optimizing each node individually. The procedure to optimize an individual node $n$ is as follows. Let $\mathcal{D}_n$ represent the set of training instances arriving at node $n$, and let $n'$ and $n''$ be the children induced through threshold $t_n$. In principle, to locally optimize node $n$, we must minimize the average regression error between the true response values and the estimated responses computed using only the subtree rooted at $n$. In practice, this error is not computable, since the subtree at $n$ is as yet unknown. Instead, we approximate it through the classification accuracy of the classifier at $n$, with safeguards to ensure that the resultant classification is not trivial and permits useful regression.
Let $d(y; t_n) \in \{0, 1\}$ be a binary indicator function that indicates whether an instance $(x, y)$ must be assigned to child $n'$ (i.e. $y < t_n$) or $n''$. Let $e\big(c_n(x; \Theta_n), d(y; t_n)\big)$ be a quantifier of the classification error (the binary cross-entropy loss in our case) for any instance. We define the classification loss at node $n$ as
(5)  $L_c(t_n, \Theta_n) = \frac{1}{|\mathcal{D}_n|} \sum_{(x, y) \in \mathcal{D}_n} e\big(c_n(x; \Theta_n),\, d(y; t_n)\big).$
The classification loss cannot be directly minimized w.r.t. $t_n$, since this can lead to trivial solutions, e.g. setting $t_n$ to an extreme value such that all data are assigned to a single class. While such a setting would result in perfect classification, it would contribute little to the regression. To prevent such solutions, we include a triviality penalty that attempts to ensure that the tree remains balanced in terms of the number of instances at each node. For our purpose, we define the triviality penalty at node $n$ via the entropy of the distribution of instances over the partition induced by $t_n$, so that balanced splits incur the lowest penalty (other criteria, such as the Gini index [1], may also apply):
(6)  $L_t(t_n) = p(t_n) \log p(t_n) + \big(1 - p(t_n)\big) \log\big(1 - p(t_n)\big),$
where $p(t_n) = \frac{1}{|\mathcal{D}_n|} \sum_{(x, y) \in \mathcal{D}_n} d(y; t_n)$ is the fraction of instances assigned to the left child.
The overall optimization of node $n$ is performed as
(7)  $t_n^*, \Theta_n^* = \arg\min_{t_n,\, \Theta_n} \; \lambda L_c(t_n, \Theta_n) + (1 - \lambda) L_t(t_n),$
where $\lambda \in [0, 1]$ assigns the relative importance of the two components of the loss.
In the optimization of (7), the loss function depends on $t_n$ through $d(y; t_n)$, which is a discrete function of $t_n$. Hence, we have two possible ways of optimizing (7). In the first, we can scan through all possible values of $t_n$ to select the one that results in the minimal loss. Alternatively, a faster gradient-descent approach is obtained by making the objective differentiable w.r.t. $t_n$. Here the discrete indicator is replaced by a smooth, differentiable relaxation $d(y; t_n) = \sigma\big(\beta(t_n - y)\big)$, where $\sigma$ is the sigmoid function and $\beta$ controls its steepness and must typically be set to a large value. The triviality penalty is also redefined (to be differentiable w.r.t. $t_n$) as the proximity of $t_n$ to the median of the responses in $\mathcal{D}_n$, since the median is the minimizer of (6). We use coordinate descent to optimize the resultant loss.
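A minimal sketch of the relaxed node objective, assuming a sigmoid relaxation with steepness $\beta$ and the median-proximity surrogate for the triviality penalty; the classifier outputs here are stand-in probabilities rather than an actual neural network:

```python
import math

def soft_label(y, t, beta=10.0):
    """Differentiable relaxation of the hard assignment 1[y < t]."""
    return 1.0 / (1.0 + math.exp(-beta * (t - y)))

def node_loss(t, data, lam=0.5, beta=10.0):
    """Relaxed node objective: lam * BCE + (1 - lam) * |t - median|.

    data: list of (p, y) pairs, where p is the node classifier's predicted
    probability of routing the instance to the left child (a stand-in for a
    neural classifier's output) and y is the true response.
    """
    ys = sorted(y for _, y in data)
    median = ys[len(ys) // 2]
    bce = 0.0
    for p, y in data:
        q = soft_label(y, t, beta)          # relaxed target label
        p = min(max(p, 1e-7), 1.0 - 1e-7)   # clip for numerical stability
        bce += -(q * math.log(p) + (1 - q) * math.log(1 - p))
    bce /= len(data)
    triviality = abs(t - median)            # differentiable surrogate for Eq. (6)
    return lam * bce + (1 - lam) * triviality
```

With this relaxation both loss terms are smooth in $t$, so the threshold and classifier parameters can be updated alternately by gradient steps, as in the coordinate descent described above.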
Once optimized, the data at node $n$ are partitioned into its two children according to the threshold $t_n$, and the process recurses down the tree. Algorithm 1 describes the entire training procedure. The growth of the tree may be continued until the regression performance on a held-out set saturates.
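The greedy recursion can be sketched as below; for brevity the per-node classifier training is elided, and the scan over candidate thresholds uses child balance as a stand-in for the full objective of (7):

```python
def grow(data, min_size=4, depth=0, max_depth=3):
    """Greedy recursive partitioning of the *response* values.

    data: list of (x, y) pairs. Returns a nested dict; leaves store the
    bin mean used as the local regression rule. (Sketch only: training of
    the per-node neural classifier is elided.)
    """
    ys = sorted(y for _, y in data)
    if len(data) < min_size or depth >= max_depth or ys[0] == ys[-1]:
        return {"leaf": True, "mean": sum(ys) / len(ys)}
    # Scan candidate thresholds (midpoints of consecutive distinct responses)
    # and keep the one yielding the most balanced children.
    best_t, best_score = None, float("inf")
    for a, b in zip(ys, ys[1:]):
        if a == b:
            continue
        t = (a + b) / 2.0
        n_left = sum(1 for _, y in data if y < t)
        score = abs(n_left - (len(data) - n_left))   # imbalance proxy
        if score < best_score:
            best_t, best_score = t, score
    left = [(x, y) for x, y in data if y < best_t]
    right = [(x, y) for x, y in data if y >= best_t]
    return {"leaf": False, "t": best_t,
            "left": grow(left, min_size, depth + 1, max_depth),
            "right": grow(right, min_size, depth + 1, max_depth)}
```

In the full model, each recursive call would also fit the node's neural classifier and refine its features, and recursion would stop when held-out performance saturates rather than at a fixed depth.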
4 Experiments
We consider two regression tasks in the domain of speaker profiling: age estimation and height estimation from voice. The two tasks are generally considered among the most challenging in the speech literature [16, 17, 18, 19, 20, 21, 22, 23, 24].
We compare our model with (1) a regression baseline using support vector regression (SVR) [25], and (2) a regression tree baseline using classification and regression trees (CART) [1]. Furthermore, to show the effectiveness of the "neural" part of our NRT model, we compare our neural regression tree with a third baseline, (3) a regression tree with support vector machine classifiers (SVM-RT).
4.1 Data
To promote a fair comparison, we select two well-established public datasets from the speech community. For age estimation, we use the Fisher English corpus [26], which consists of 2-channel conversational telephone speech. After removing speakers with no age specified, we are left with 11,913 speakers (5,100 male and 6,813 female). To the best of our knowledge, the Fisher corpus is the largest English-language database that includes the speaker age information needed for the age estimation task. The division of the data for the age estimation task is shown in Table 1. The division is made such that there is no overlap of speakers and all age groups are represented across the splits. Furthermore, Figure 2 shows the age distribution of the database for the three splits (train, development, and test) corresponding to Table 1.
Table 1: Fisher corpus splits for age estimation (# of speakers / utterances).

          Male              Female
  Train   3,100 / 28,178    4,813 / 45,041
  Dev     1,000 /  9,860    1,000 /  9,587
  Test    1,000 /  9,813    1,000 /  9,799
For height estimation, we use the NIST speaker recognition evaluation (SRE) 2008 corpus [27]. Heights are available for only 384 male and 651 female speakers in it. Because of this data scarcity, we evaluate the height task using cross-validation. Table 2 and Figure 3 show the statistics of the NIST-SRE8 database.
Table 2: NIST-SRE8 corpus statistics for height estimation (# of speakers / utterances).

  Male            Female
  384 / 33,493    651 / 59,530
Since the recordings in both datasets contain long silences, and the silences do not contribute to the information gain, Gaussian-based voice activity detection (VAD) is performed on the recordings. The resulting recordings are then segmented into 1-minute segments.
To properly represent the speech signals, we adopt one of the most effective and well-studied representations, i-vectors [28]. I-vectors are low-dimensional statistical representations of the distributions of spectral features, and are commonly used in state-of-the-art speaker recognition systems [29] and age estimation systems [30, 31]. I-vectors are extracted for the Fisher and SRE datasets using the state-of-the-art speaker identification system of [32].
4.2 Model
The proposed neural regression tree is a binary tree with neural classification models, as discussed in Section 3.1. The specifications of our model and the baseline models are shown in Table 3. The NRT is composed of a collection of ReLU neural networks, one per node. The kernels, regularizations, and parameters for the SVRs and SVMs are obtained by experimenting on the development set.
Table 3: Specifications of the NRT, SVM-RT, SVR, and CART models for the age and height tasks.
4.3 Results
As measures of the performance of our model and the baseline models on age and height estimation, we use the mean absolute error (MAE) and the root mean squared error (RMSE). The results are summarized in Table 4.
Table 4: Age and height estimation results (MAE / RMSE).

                             Male              Female
  Task    Dataset  Method    MAE     RMSE     MAE     RMSE
  Age     Fisher   SVR        9.22   12.03     8.75   11.35
                   CART      11.73   15.22    10.75   13.97
                   SVM-RT     8.83   11.47     8.61   11.17
                   NRT        7.20    9.02     6.81    8.53
  Height  SRE      SVR        6.27    6.98     5.24    5.77
                   CART       8.01    9.34     7.08    8.46
                   SVM-RT     5.70    7.07     4.85    6.22
                   NRT        5.43    6.40     4.27    6.07
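The two reported metrics are standard; a plain-Python version for reference:

```python
import math

def mae(y_true, y_pred):
    """Mean absolute error over paired true/predicted responses."""
    return sum(abs(a - b) for a, b in zip(y_true, y_pred)) / len(y_true)

def rmse(y_true, y_pred):
    """Root mean squared error over paired true/predicted responses."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(y_true, y_pred)) / len(y_true))
```

Because RMSE squares the residuals, it penalizes large errors more heavily than MAE, which is why a model can improve MAE while slightly worsening RMSE, as observed for NRT vs. SVR on the height task.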
For both age and height estimation, the proposed neural regression tree generally outperforms the baselines in both MAE and RMSE, except that on the height task the neural regression tree has slightly higher RMSE than SVR, indicating higher variance. This is reasonable, as our NRT does not directly optimize the mean squared error; bagging or forest mechanisms may be used to reduce the variance. Furthermore, when the neural classifier in the NRT is replaced by an SVM classifier (SVM-RT), we obtain higher error than NRT, implying the effectiveness of the neural part of the NRT, as it enables the features to be refined with each partition and adapted to each node. Nevertheless, SVM-RT still yields smaller MAE values than SVR and CART, strengthening our hypothesis that our model can find optimal thresholds for dependent-variable discretization even without the use of a neural network.
To test the significance of the results, we further conduct pairwise statistical significance tests. We hypothesize that the errors achieved by our NRT method are significantly smaller than those of its closest competitors. Paired t-tests for SVR vs. SVM-RT and for SVM-RT vs. NRT on the age task yield very small p-values, indicating strong significance of the improvement. Similar results are obtained for the height experiments. Hence we validate the significant performance improvement of our NRT method in estimating age and height over the baseline methods.
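The paired t statistic underlying such tests can be computed directly from per-instance error differences; obtaining a p-value from it would additionally require the t distribution's CDF (e.g. via `scipy.stats.ttest_rel`):

```python
import math

def paired_t_statistic(errors_a, errors_b):
    """t statistic of a paired test on per-instance errors of two methods.

    A positive value indicates method B's errors are smaller on average.
    """
    diffs = [a - b for a, b in zip(errors_a, errors_b)]
    n = len(diffs)
    mean = sum(diffs) / n
    var = sum((d - mean) ** 2 for d in diffs) / (n - 1)  # sample variance
    return mean / math.sqrt(var / n)
```

The per-instance errors here would be, e.g., the absolute age-estimation errors of SVM-RT and NRT on the same test utterances; pairing on the same instances removes per-instance difficulty as a source of variance.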
4.4 Nodebased Error Analysis
The hierarchical nature of our formulation allows us to analyze our model at every level and every node of the tree in terms of its classification and regression error. In our evaluation on the speaker age estimation task, we noticed that the regression error for younger speakers was lower than that for older speakers. In other words, our model was able to discriminate better between younger speakers. This is in agreement with the fact that the vocal characteristics of humans undergo noticeable changes during earlier ages and then relatively stabilize over a certain age interval [33]. Figure 4 shows the per-level MAE for female and male speakers. The nodes represent the age thresholds used as splitting criteria at each level, and the edges represent the regression error. The inherent structural properties of our model not only improve the overall regression performance, as shown in the previous section, but, in the case of age estimation, also model a real-world phenomenon. This is visible in Figure 4, where the regression error can be seen to increase from left to right for both female and male speakers (except at the leftmost nodes, where the behavior does not follow, possibly due to data scarcity).
4.5 Limitations
We acknowledge that our model might not be universal in its utility across all regression tasks. Our hypothesis is that it works well for tasks that can benefit from a partition-based formulation, and we have shown this empirically for two such tasks above. However, in the future we would like to test our model on other standard regression tasks. Furthermore, because our model formulation inherits its properties from the regression-via-classification (RvC) framework, the objective function is optimized to reduce the classification error rather than the regression error. This limits our ability to compare our model directly with other regression methods. In the future, we intend to explore ways to directly minimize the regression error while still employing the RvC framework.
5 Conclusions
In this paper, we proposed a neural regression tree for optimal discretization of dependent variables in regression-via-classification tasks. It targets the two difficulties of regression-via-classification: finding optimal discretization thresholds and selecting an optimal set of features. We developed a discretization strategy based on recursive binary partitioning driven by the optimality of neural classifiers. Furthermore, for each partition node in the tree, the model locally optimizes features to be more discriminative. In addition, we proposed a scan method and a gradient method to optimize the tree. The proposed neural regression tree outperformed the baselines in age and height estimation experiments, demonstrating significant improvements.
References
 [1] L. Breiman, J. H. Friedman, R. A. Olshen, and C. J. Stone. Classification and regression trees. Wadsworth and Brooks, Monterey, CA, 1984.
 [2] J. R. Quinlan. Induction of decision trees. Machine Learning, 1(1):81–106, 1986.
 [3] L. Torgo and J. Gama. Regression using classification algorithms. Intelligent Data Analysis, 1(4):275–292, 1997.
 [4] J. Dougherty, R. Kohavi, and M. Sahami. Supervised and unsupervised discretization of continuous features. In Machine Learning Proceedings, pages 194–202. Elsevier, 1995.
 [5] L. Torgo and J. Gama. Regression by classification. In Brazilian Symposium on Artificial Intelligence, pages 51–60. Springer, 1996.
 [6] J. N. Morgan and J. A. Sonquist. Problems in the analysis of survey data, and a proposal. Journal of the American Statistical Association, 58(302):415–434, 1963.
 [7] J. R. Quinlan et al. Learning with continuous classes. In 5th Australian Joint Conference on Artificial Intelligence, volume 92, pages 343–348. World Scientific, 1992.
 [8] J. R. Quinlan. C4.5: Programs for machine learning. Elsevier, 2014.
 [9] D. Laptev and J. M. Buhmann. Convolutional decision trees for feature learning and segmentation. In German Conference on Pattern Recognition, pages 95–106. Springer, 2014.
 [10] H. Xiao. NDT: Neural decision tree towards fully functioned neural graph. arXiv preprint arXiv:1712.05934, 2017.
 [11] R. Balestriero. Neural decision trees. arXiv preprint arXiv:1702.07360, 2017.
 [12] R. Tanno, K. Arulkumaran, D. C. Alexander, A. Criminisi, and A. Nori. Adaptive neural trees. arXiv preprint arXiv:1807.06699, 2018.
 [13] P. Kontschieder, M. Fiterau, A. Criminisi, and S. R. Bulo. Deep neural decision forests. In 2015 IEEE International Conference on Computer Vision, pages 1467–1475. IEEE, 2015.
 [14] W. Shen, Y. Guo, Y. Wang, K. Zhao, B. Wang, and A. Yuille. Deep regression forests for age estimation. arXiv preprint arXiv:1712.07195, 2017.
 [15] S. M. Weiss and N. Indurkhya. Rule-based machine learning methods for functional prediction. Journal of Artificial Intelligence Research, 3:383–403, 1995.
 [16] H. Kim, K. Bae, and H. Yoon. Age and gender classification for a home-robot service. In The 16th IEEE International Symposium on Robot and Human Interactive Communication, pages 122–126. IEEE, 2007.
 [17] F. Metze, J. Ajmera, R. Englert, U. Bub, F. Burkhardt, J. Stegmann, C. Muller, R. Huber, B. Andrassy, J. G. Bauer, et al. Comparison of four approaches to age and gender recognition for telephone applications. In IEEE International Conference on Acoustics, Speech and Signal Processing, volume 4, pages IV–1089. IEEE, 2007.
 [18] M. Li, C. Jung, and K. J. Han. Combining five acoustic level modeling methods for automatic speaker age and gender recognition. In Eleventh Annual Conference of the International Speech Communication Association, 2010.
 [19] G. Dobry, R. M. Hecht, M. Avigal, and Y. Zigel. Supervector dimension reduction for efficient speaker age estimation based on the acoustic speech signal. IEEE Transactions on Audio, Speech, and Language Processing, 19(7):1975–1985, 2011.
 [20] M. Lee and K. Kwak. Performance comparison of gender and age group recognition for human-robot interaction. International Journal of Advanced Computer Science and Applications, 3(12), 2012.
 [21] M. H. Bahari, M. McLaren, H. Van Hamme, and D. A. Van Leeuwen. Speaker age estimation using i-vectors. Engineering Applications of Artificial Intelligence, 34:99–108, 2014.
 [22] B. D. Barkana and J. Zhou. A new pitch-range based feature set for a speaker's age and gender classification. Applied Acoustics, 98:52–61, 2015.
 [23] A. H. Poorjam, M. H. Bahari, V. Vasilakakis, et al. Height estimation from speech signals using i-vectors and least-squares support vector regression. In 2015 38th International Conference on Telecommunications and Signal Processing, pages 1–5. IEEE, 2015.
 [24] Y. Fu and T. S. Huang. Human age estimation with regression on discriminative aging manifold. IEEE Transactions on Multimedia, 10(4):578–584, 2008.
 [25] D. Basak, S. Pal, and D. C. Patranabis. Support vector regression. Neural Information Processing – Letters and Reviews, 11(10):203–224, 2007.
 [26] C. Cieri, D. Miller, and K. Walker. The Fisher corpus: A resource for the next generations of speech-to-text. In LREC, volume 4, pages 69–71, 2004.
 [27] S. S. Kajarekar, N. Scheffer, M. Graciarena, E. Shriberg, A. Stolcke, L. Ferrer, and T. Bocklet. The SRI NIST 2008 speaker recognition evaluation system. In IEEE International Conference on Acoustics, Speech and Signal Processing, pages 4205–4208. IEEE, 2009.
 [28] N. Dehak, P. J. Kenny, R. Dehak, P. Dumouchel, and P. Ouellet. Front-end factor analysis for speaker verification. IEEE Transactions on Audio, Speech, and Language Processing, 19(4):788–798, 2011.
 [29] S. O. Sadjadi, S. Ganapathy, and J. W. Pelecanos. Speaker age estimation on conversational telephone speech using senone posterior based i-vectors. In IEEE International Conference on Acoustics, Speech and Signal Processing, pages 5040–5044. IEEE, 2016.
 [30] P. G. Shivakumar, M. Li, V. Dhandhania, and S. S. Narayanan. Simplified and supervised i-vector modeling for speaker age regression. In IEEE International Conference on Acoustics, Speech and Signal Processing, pages 4833–4837, 2014.
 [31] J. Grzybowska and S. Kacprzak. Speaker age classification and regression using i-vectors. In INTERSPEECH, pages 1402–1406, 2016.
 [32] H. Dhamyal, T. Zhou, B. Raj, and R. Singh. Optimizing neural network embeddings using pairwise loss for text-independent speaker matching. In INTERSPEECH, 2018.
 [33] E. T. Stathopoulos, J. E. Huber, and J. E. Sussman. Changes in acoustic characteristics of the voice across the life span: Measures from individuals 4–93 years of age. Journal of Speech, Language, and Hearing Research, 54(4):1011–1021, 2011.