Neural Oblivious Decision Ensembles
for Deep Learning on Tabular Data
Abstract
Nowadays, deep neural networks (DNNs) have become the main instrument for machine learning tasks within a wide range of domains, including vision, NLP, and speech. Meanwhile, in an important case of heterogeneous tabular data, the advantage of DNNs over shallow counterparts remains questionable. In particular, there is no sufficient evidence that deep learning machinery allows constructing methods that outperform gradient boosting decision trees (GBDT), which are often the top choice for tabular problems. In this paper, we introduce Neural Oblivious Decision Ensembles (NODE), a new deep learning architecture, designed to work with any tabular data. In a nutshell, the proposed NODE architecture generalizes ensembles of oblivious decision trees, but benefits from both end-to-end gradient-based optimization and the power of multi-layer hierarchical representation learning. With an extensive experimental comparison to the leading GBDT packages on a large number of tabular datasets, we demonstrate the advantage of the proposed NODE architecture, which outperforms the competitors on most of the tasks. We open-source the PyTorch implementation of NODE and believe that it will become a universal framework for machine learning on tabular data.
1 Introduction
The recent rise of deep neural networks (DNNs) resulted in a substantial breakthrough for a large number of machine learning tasks in computer vision, natural language processing, speech recognition, and reinforcement learning (Goodfellow et al., 2016). Both gradient-based optimization via backpropagation (Rumelhart et al., 1985) and hierarchical representation learning appear to be crucial in increasing the performance of machine learning for these problems by a large margin.
While the superiority of deep architectures in these domains is undoubtful, machine learning for tabular data still did not fully benefit from the DNN power. Namely, the state-of-the-art performance in problems with tabular heterogeneous data is often achieved by "shallow" models, such as gradient boosted decision trees (GBDT) (Friedman, 2001; Chen and Guestrin, 2016; Ke et al., 2017; Prokhorenkova et al., 2018). While the importance of deep learning on tabular data is recognized by the ML community, and many works address this problem (Zhou and Feng, 2017; Yang et al., 2018; Miller et al., 2017; Lay et al., 2018; Feng et al., 2018; Ke et al., 2018), the proposed DNN approaches do not consistently outperform the state-of-the-art shallow models by a notable margin. In particular, to the best of our knowledge, there is still no universal DNN approach that was shown to systematically outperform the leading GBDT packages (e.g., XGBoost (Chen and Guestrin, 2016)). As additional evidence, a large number of Kaggle ML competitions with tabular data are still won by the shallow GBDT methods (Harasymiv, 2015). Overall, at the moment, there is no dominant deep learning solution for tabular data problems, and we aim to reduce this gap by our paper.
We introduce Neural Oblivious Decision Ensembles (NODE), a new DNN architecture, designed to work with tabular problems. The NODE architecture is partially inspired by the recent CatBoost package (Prokhorenkova et al., 2018), which was shown to provide state-of-the-art performance on a large number of tabular datasets. In a nutshell, CatBoost performs gradient boosting on oblivious decision trees (decision tables) (Kohavi, 1994; Lou and Obukhov, 2017), which makes inference very efficient, and the method is quite resistant to overfitting. In its essence, the proposed NODE architecture generalizes CatBoost, making the splitting feature choice and decision tree routing differentiable. As a result, the NODE architecture is fully differentiable and can be incorporated into any computational graph of existing DL packages, such as TensorFlow or PyTorch. Furthermore, NODE allows constructing multi-layer architectures, which resemble "deep" GBDTs trained end-to-end, which was never proposed before. Besides the usage of oblivious decision tables, another important design choice is the recent entmax transformation (Peters et al., 2019), which effectively performs a "soft" splitting feature choice in decision trees inside the NODE architecture. As discussed in the following sections, these design choices are critical to obtain state-of-the-art performance. In a large number of experiments, we compare the proposed approach with the leading GBDT implementations with tuned hyperparameters and demonstrate that NODE outperforms competitors consistently on most of the datasets.
Overall, the main contributions of our paper can be summarized as follows:

We introduce a new DNN architecture for machine learning on tabular data. To the best of our knowledge, our method is the first successful example of a deep architecture that substantially outperforms leading GBDT packages on tabular data.

Via an extensive experimental evaluation on a large number of datasets, we show that the proposed NODE architecture outperforms existing GBDT implementations.

The PyTorch implementation of NODE is available online: https://github.com/Qwicen/node
2 Related work
In this section, we briefly review the main ideas from prior work that are relevant to our method.
The state-of-the-art for tabular data. Ensembles of decision trees, such as GBDT (Friedman, 2001) or random forests (Barandiaran, 1998), are currently the top choice for tabular data problems. Currently, there are several leading GBDT packages, such as XGBoost (Chen and Guestrin, 2016), LightGBM (Ke et al., 2017), and CatBoost (Prokhorenkova et al., 2018), which are widely used by both academicians and ML practitioners. While these implementations vary in details, on most of the tasks their performances do not differ much (Prokhorenkova et al., 2018; Anghel et al.). The most important distinction of CatBoost is that it uses oblivious decision trees (ODTs) as weak learners. As ODTs are also an important ingredient of our NODE architecture, we discuss them below.
Oblivious Decision Trees. An oblivious decision tree is a regular tree of depth d that is constrained to use the same splitting feature and splitting threshold in all internal nodes of the same depth. This constraint essentially allows representing an ODT as a table with 2^d entries, corresponding to all possible combinations of d splits (Lou and Obukhov, 2017). Of course, due to the constraints above, ODTs are significantly weaker learners compared to unconstrained decision trees. However, when used in an ensemble, such trees are less prone to overfitting, which was shown to synergize well with gradient boosting (Prokhorenkova et al., 2018). Furthermore, the inference in ODTs is very efficient: one can compute the d independent binary splits in parallel and return the appropriate table entry. In contrast, non-oblivious decision trees require evaluating splits sequentially.
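To make the table-lookup view concrete, the following minimal sketch (illustrative, not the CatBoost implementation; all names are ours) performs inference in a classic hard ODT: the d shared (feature, threshold) pairs form a binary index into a response table of 2^d entries.

```python
def odt_predict(x, features, thresholds, responses):
    """Hard oblivious decision tree inference as a table lookup.
    x: list of feature values; features/thresholds: length-d lists sharing
    one (feature, threshold) pair per depth; responses: flat list of 2**d
    leaf values."""
    index = 0
    for f, b in zip(features, thresholds):
        bit = 1 if x[f] >= b else 0   # Heaviside split at this depth
        index = (index << 1) | bit    # d comparisons build the table index
    return responses[index]
```

The d comparisons are independent of each other, which is why they can run in parallel, unlike the sequential routing of a generic decision tree.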
Differentiable trees. The significant drawback of tree-based approaches is that they usually do not allow end-to-end optimization and employ greedy, local optimization procedures for tree construction. Thus, they cannot be used as a component for pipelines trained in an end-to-end fashion. To address this issue, several works (Kontschieder et al., 2015; Yang et al., 2018; Lay et al., 2018) propose to "soften" decision functions in the internal tree nodes to make the overall tree function and tree routing differentiable. In our work, we advocate the usage of the recent entmax transformation (Peters et al., 2019) to "soften" decision trees. We confirm its advantages over the previously proposed approaches in the experimental section.
Entmax. The key building block of our model is the entmax transformation (Peters et al., 2019), which maps a vector of real-valued scores to a discrete probability distribution. This transformation generalizes the traditional softmax and its sparsity-enforcing alternative sparsemax (Martins and Astudillo, 2016), which has already received significant attention in a wide range of applications: probabilistic inference, topic modeling, neural attention (Niculae and Blondel, 2017; Niculae et al., 2018; Lin et al., 2019). Entmax is capable of producing sparse probability distributions, where the majority of probabilities are exactly equal to 0. In this work, we argue that entmax is also an appropriate inductive bias in our model, which allows differentiable split decision construction in the internal tree nodes. Intuitively, entmax can learn splitting decisions based on a small subset of data features (up to one, as in classical decision trees), avoiding undesired influence from others. As an additional advantage, using entmax for feature selection allows for computationally efficient inference using the sparse precomputed choice vectors, as described below in Section 3.
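As a rough illustration of the sparsity, the sketch below computes the α = 1.5 case of entmax by bisection on the normalization threshold τ. This is not the exact sorting-based algorithm of Peters et al. (2019), only a numerically convergent stand-in with the same closed-form solution shape, and the function name is ours.

```python
def entmax15(scores, n_iter=60):
    """Alpha-1.5 entmax via bisection on the threshold tau.
    Solution form: p_i = max(s_i / 2 - tau, 0) ** 2 with sum(p) == 1,
    so scores far below the maximum get exactly zero probability."""
    z = [s / 2.0 for s in scores]
    lo, hi = max(z) - 1.0, max(z)   # sum(p) >= 1 at lo, == 0 at hi
    for _ in range(n_iter):
        tau = (lo + hi) / 2.0
        total = sum(max(v - tau, 0.0) ** 2 for v in z)
        if total > 1.0:
            lo = tau                 # threshold too small: raise it
        else:
            hi = tau
    tau = (lo + hi) / 2.0
    return [max(v - tau, 0.0) ** 2 for v in z]
```

Note how low-scoring entries receive probability exactly 0, not merely a small value as with softmax; this is what later enables the sparse inference trick of Section 3.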
Multi-layer non-differentiable architectures. Another line of work (Miller et al., 2017; Zhou and Feng, 2017; Feng et al., 2018) promotes the construction of multi-layer architectures from non-differentiable blocks, such as random forests or GBDT ensembles. For instance, (Zhou and Feng, 2017; Miller et al., 2017) propose to use stacking of several random forests, which are trained separately. In recent work, (Feng et al., 2018) introduces multi-layer GBDTs and proposes a training procedure that does not require each layer component to be differentiable. While these works report marginal improvements over shallow counterparts, they lack the capability for end-to-end training, which could result in inferior performance. In contrast, we argue that end-to-end training is crucial and confirm this claim in the experimental section.
Specific DNNs for tabular data. While a number of prior works propose architectures designed for tabular data (Ke et al., 2018; Shavitt and Segal, 2018), they mostly do not compare with properly tuned GBDT implementations, which are the most appropriate baselines. The recent preprint (Ke et al., 2018) reports a marginal improvement over GBDT with default parameters, but in our experiments, the baseline performance is much higher. To the best of our knowledge, our approach is the first to consistently outperform the tuned GBDTs over a large number of datasets.
3 Neural Oblivious Decision Ensembles
We introduce the Neural Oblivious Decision Ensemble (NODE) architecture with a layer-wise structure similar to existing deep learning models. In a nutshell, our architecture consists of differentiable oblivious decision trees (ODTs) that are trained end-to-end by backpropagation. We describe our implementation of the differentiable NODE layer in Section 3.1, the full model architecture in Section 3.2, and the training and inference procedures in Section 3.3.
3.1 Differentiable Oblivious Decision Trees
The core building block of our model is a Neural Oblivious Decision Ensemble (NODE) layer. The layer is composed of m differentiable oblivious decision trees (ODTs) of equal depth d. As an input, all m trees get a common vector x ∈ R^n, containing n numeric features. Below we describe the design of a single differentiable ODT.
In its essence, an ODT is a "decision table" that splits the data along d splitting features and compares each feature to a learned threshold. Then, the tree returns one of the 2^d possible responses, corresponding to the comparison results. Therefore, each ODT is completely determined by its splitting features f_1, …, f_d, splitting thresholds b_1, …, b_d, and a d-dimensional tensor of responses R ∈ R^{2×2×…×2}. In this notation, the tree output is defined as:

    h(x) = R[𝟙(f_1(x) − b_1), 𝟙(f_2(x) − b_2), …, 𝟙(f_d(x) − b_d)],    (1)

where 𝟙(·) denotes the Heaviside step function.
To make the tree output (1) differentiable, we replace the splitting feature choice and the comparison operator by their continuous counterparts. There are several existing approaches that can be used for modelling differentiable choice functions in decision trees (Yang et al., 2018), for instance, REINFORCE (Williams, 1992) or Gumbel-softmax (Jang et al., 2016). However, these approaches typically require long training times, which can be prohibitive in practice.
Instead, we propose to use the entmax transformation (Peters et al., 2019) as it is able to learn sparse choices, depending only on a few features, via standard gradient descent. The choice function is hence replaced by a weighted sum of features, with weights computed as entmax over the rows of the learnable feature selection matrix F ∈ R^{d×n}:

    f̂_i(x) = Σ_{j=1}^{n} x_j · entmax_α(F_i)_j.    (2)
Similarly, we relax the Heaviside function 𝟙(f̂_i(x) − b_i) as a two-class entmax, which we denote as σ_α(x) = entmax_α([x, 0]). As different features can have different characteristic scales, we use the scaled version c_i(x) = σ_α((f̂_i(x) − b_i)/τ_i), where b_i and τ_i are learnable parameters for thresholds and scales respectively.
Based on the c_i(x) values, we define a "choice" tensor C of the same size as the response tensor R by computing the outer product of all c_i:

    C(x) = [c_1(x), 1 − c_1(x)] ⊗ [c_2(x), 1 − c_2(x)] ⊗ … ⊗ [c_d(x), 1 − c_d(x)].    (3)
The final prediction is then computed as a weighted linear combination of response tensor entries R with weights from the entries of the choice tensor C:

    ĥ(x) = Σ_{i_1,…,i_d ∈ {0,1}} R_{i_1,…,i_d} · C_{i_1,…,i_d}(x).    (4)
Note that this relaxation equals the classic non-differentiable ODT (1) iff both the feature selection and threshold functions reach the "one-hot" state, i.e., entmax always returns a non-zero weight for a single feature and c_i(x) is always exactly zero or one.
Finally, the output of the NODE layer is composed as a concatenation of the outputs of the m individual trees: [ĥ_1(x), …, ĥ_m(x)].
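Putting equations (2)–(4) together, the forward pass of a single differentiable ODT can be sketched as follows. For brevity, this illustration substitutes softmax and sigmoid for the paper's entmax_α and σ_α (the structure of the computation is the same), and all names are ours.

```python
import numpy as np

def soft_odt_forward(x, F, b, tau, R):
    """Differentiable ODT sketch.
    x: (n,) input; F: (d, n) feature-selection logits; b, tau: (d,)
    thresholds and scales; R: response tensor of shape (2,) * d."""
    # Soft feature choice per depth (softmax stand-in for entmax), cf. eq. (2)
    W = np.exp(F - F.max(axis=1, keepdims=True))
    W /= W.sum(axis=1, keepdims=True)
    f = W @ x                                    # (d,) chosen feature values
    # Soft comparison (sigmoid stand-in for the two-class entmax)
    c = 1.0 / (1.0 + np.exp(-(f - b) / tau))
    # Outer product of [c_i, 1 - c_i] builds the choice tensor C, cf. eq. (3)
    C = np.array(1.0)
    for ci in c:
        C = np.multiply.outer(C, np.array([ci, 1.0 - ci]))
    # Weighted combination of responses, cf. eq. (4)
    return float((R * C).sum())
```

With near-one-hot selection logits and small scales, the soft tree collapses to the hard table lookup of equation (1), which matches the remark above about the one-hot state.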
Multi-dimensional tree outputs. In the description above, we assumed that tree outputs are one-dimensional, ĥ(x) ∈ R. For classification problems, where NODE predicts probabilities of each class, we use multi-dimensional tree outputs ĥ(x) ∈ R^C, where C is the number of classes.
3.2 Going deeper with the NODE architecture
The NODE layer, described above, can be trained alone or within a complex structure, like fully-connected layers, and can be organized into multi-layer architectures. In this work, we introduce a new architecture, following the popular DenseNet (Huang et al., 2017) model, and train it end-to-end via backpropagation.
Similar to DenseNet, our architecture is a sequence of NODE layers (see Section 3.1), where each layer uses a concatenation of all previous layers as its input. The input "layer 0" of this architecture corresponds to the input features x, accessible by all successor layers. Due to such a design, our architecture is capable of learning both shallow and deep decision rules. A single tree on the i-th layer can rely on chains of up to i − 1 layer outputs as features, allowing it to capture complex dependencies. The resulting prediction is a simple average of all decision trees from all layers.
Note that, in the multi-layer architecture described above, tree outputs from early layers are used as inputs for subsequent layers. Therefore, we do not restrict the dimensionality of the tree outputs ĥ(x) to be equal to the number of classes and allow it to have an arbitrary dimensionality l, which corresponds to the response tensor R ∈ R^{2×…×2×l}. When averaging the predictions from all layers, only the first C coordinates of ĥ(x) are used for classification problems and the first one for regression problems. Overall, l is an additional hyperparameter with typical values in [1, 3].
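The dense connectivity pattern above can be sketched in a few lines (names are ours; the per-layer callables stand in for full NODE layers):

```python
import numpy as np

def node_forward(x, layers, out_dim):
    """DenseNet-style stacking sketch: each layer consumes the concatenation
    of the raw features and all previous layers' outputs; the prediction
    averages the first out_dim coordinates of every layer output.
    `layers` is a list of callables from an input vector to an output vector
    (stand-ins for NODE layers)."""
    inputs = x
    collected = []
    for layer in layers:
        out = layer(inputs)
        collected.append(out)
        inputs = np.concatenate([inputs, out])   # dense concatenation
    return np.mean([o[:out_dim] for o in collected], axis=0)
```

Because later callables see earlier outputs as extra input coordinates, a layer-2 "tree" here can depend on a layer-1 output, mirroring how deep NODE layers build on learned features.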
3.3 Training
Here we summarize the details of our training protocol.
Data preprocessing. First, we transform each data feature to follow a normal distribution via a quantile transform (sklearn.preprocessing.QuantileTransformer). In experiments, we observed that this step was important for stable training and faster convergence.
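A stdlib-only sketch of such a rank-based quantile transform (mimicking, not calling, sklearn's QuantileTransformer with normal output; ties are ignored for brevity, and the function name is ours):

```python
import statistics

def quantile_to_normal(column):
    """Map each value to its empirical midpoint quantile, then through the
    standard normal inverse CDF, so the column becomes roughly N(0, 1)."""
    nd = statistics.NormalDist()
    n = len(column)
    order = sorted(range(n), key=lambda i: column[i])
    out = [0.0] * n
    for rank, i in enumerate(order):
        q = (rank + 0.5) / n          # midpoint quantile in (0, 1)
        out[i] = nd.inv_cdf(q)
    return out
```

The transform is monotone per column, so it preserves split orderings while equalizing feature scales, which plausibly explains the more stable training reported above.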
Initialization. Before training, we perform data-aware initialization (Mishkin and Matas, 2016) to obtain good initial parameter values. In particular, we initialize the feature selection matrix F uniformly, while the thresholds b are initialized with random feature values observed in the first data batch. The scales τ are initialized in such a way that all the samples in the first batch belong to the linear region of the thresholding function, and hence receive nonzero gradients. Finally, the response tensor entries are initialized with the standard normal distribution N(0, 1).
Training. As for existing DNN architectures, NODE is trained end-to-end via mini-batch SGD. We jointly optimize all model parameters (the feature selection matrices, thresholds, scales, and response tensors). In this work, we experimented with traditional objective functions (cross-entropy for classification and mean squared error for regression), but any differentiable objective can be used as well. As an optimization method, we use the recent Quasi-Hyperbolic Adam with parameters recommended in the original paper (Ma and Yarats, 2018). We also average the model parameters over consecutive checkpoints (Izmailov et al., 2018) and pick the optimal stopping point on the hold-out validation dataset.
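The checkpoint-averaging step can be sketched as follows (a simplified stand-in for the weight-averaging procedure of Izmailov et al. (2018); the dict-of-lists parameter format and names are ours):

```python
def average_checkpoints(checkpoints):
    """Element-wise mean of model parameters over consecutive checkpoints.
    Each checkpoint is a dict mapping parameter names to lists of floats."""
    n = len(checkpoints)
    return {
        name: [sum(ck[name][i] for ck in checkpoints) / n
               for i in range(len(checkpoints[0][name]))]
        for name in checkpoints[0]
    }
```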
Inference. During training, a significant fraction of time is spent computing the entmax function and multiplying by the choice tensor. Once the model is trained, one can precompute the entmax feature selectors and store them as sparse vectors (e.g., in coordinate (coo) format), making inference more efficient.
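The sparsification trick can be sketched as follows (illustrative names; a plain list of (index, weight) pairs stands in for a COO-format sparse tensor):

```python
def precompute_sparse_selector(weights):
    """Entmax feature-selection weights are mostly exact zeros, so only the
    non-zero (index, weight) pairs need to be stored for inference."""
    return [(j, w) for j, w in enumerate(weights) if w != 0.0]

def select_feature(x, sparse_selector):
    """Gather only the few referenced features instead of a dense dot product."""
    return sum(w * x[j] for j, w in sparse_selector)
```

When entmax has collapsed to a single feature per depth, each selector holds one pair, and inference reduces to the plain table lookup of a hard ODT.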
4 Experiments
In this section, we report the results of a comparison between our approach and the leading GBDT packages. We also provide several ablation studies that demonstrate the influence of each design choice in the proposed NODE architecture.
4.1 Comparison to the state-of-the-art
As our main experiments, we compare the proposed NODE architecture with two state-of-the-art GBDT implementations on a large number of datasets. In all the experiments, we set the parameter α in the entmax transformation to 1.5. All other details of the comparison protocol are described below.
Datasets. We perform most of the experiments on six open-source tabular datasets from different domains: Epsilon, YearPrediction, Higgs, Microsoft, Yahoo, and Click. The detailed description of the datasets is available in the appendix. All the datasets provide train/test splits, and we used 20% of the train set samples as a validation set to tune the hyperparameters. For each dataset, we fix the train/val/test splits for a fair comparison. For the classification datasets (Epsilon, Higgs, Click), we minimize cross-entropy loss and report the classification error. For the regression and ranking datasets (YearPrediction, Microsoft, Yahoo), we minimize and report mean squared error (which corresponds to the pointwise approach to learning-to-rank).
Methods. We compare the proposed NODE architecture to the following baselines:


CatBoost. The recent GBDT implementation (Prokhorenkova et al., 2018) that uses oblivious decision trees as weak learners. We use the open-source implementation provided by the authors.

XGBoost. The most popular GBDT implementation, widely used in machine learning competitions (Chen and Guestrin, 2016). We use the open-source implementation provided by the authors.

FCNN. A deep neural network, consisting of several fully-connected layers with ReLU nonlinearities (Nair and Hinton, 2010).
Regimes. We perform the comparison in the two following regimes, which are the most important in practice:


Default hyperparameters. In this regime, we compare the methods as easy-to-tune toolkits that could be used by a non-professional audience. Namely, here we do not tune hyperparameters and use the default ones provided by the GBDT packages. The only tunable parameter here is the number of trees (up to 2048) in CatBoost/XGBoost, which is set based on the validation set. We do not compare with FCNN in this regime, as it typically requires much tuning, and we did not find a set of parameters appropriate for all datasets. The default architecture in our model contains only a single layer with 2048 decision trees of depth six. Both of these hyperparameters were inherited from the CatBoost package settings for oblivious decision trees. With these parameters, the NODE architecture is shallow, but it still benefits from end-to-end training via backpropagation.

Tuned hyperparameters. In this regime, we tune the hyperparameters for both NODE and the competitors on the validation subsets. The optimal configuration for NODE contains between two and eight NODE layers, while the total number of trees across all the layers does not exceed 2048. The details of hyperparameter optimization are provided in the appendix.
The results of the comparison are summarized in Table 1 and Table 2. For all methods, we report mean performance and standard deviations computed over ten runs with different random seeds. Several key observations are highlighted below:
[Table 1: Comparison with default hyperparameters on Epsilon, YearPrediction, Higgs, Microsoft, Yahoo, and Click for CatBoost, XGBoost, and NODE.]
[Table 2: Comparison with tuned hyperparameters on the same six datasets for CatBoost, XGBoost, FCNN, NODE, mGBDT (out-of-memory on five datasets), and DeepForest (classification datasets only).]


With default hyperparameters, the proposed NODE architecture consistently outperforms both CatBoost and XGBoost on all datasets. The results advocate the usage of NODE as a handy tool for machine learning on tabular problems.

With tuned hyperparameters, NODE also outperforms the competitors on most of the tasks. Two exceptions are the Yahoo and Microsoft datasets, where tuned XGBoost provides the highest performance. Given the large advantage of XGBoost over CatBoost on Yahoo, we speculate that the usage of oblivious decision trees is an inappropriate inductive bias for this dataset. This implies that NODE should be extended to non-oblivious trees, which we leave for future work.

In the regime with tuned hyperparameters, FCNN outperforms GBDT on some datasets, while GBDT is superior on others. Meanwhile, the proposed NODE architecture appears to be a universal instrument, providing the highest performance on most of the tasks.
For completeness, we also aimed to compare to previously proposed architectures for deep learning on tabular data. Unfortunately, many works did not publish the source code. We were only able to perform a partial comparison with mGBDT (Feng et al., 2018) and DeepForest (Zhou and Feng, 2017), whose source code is available. For both baselines, we use the implementations provided by the authors and tune the parameters on the validation set. Note that the DeepForest implementation is available only for classification problems. Moreover, both implementations do not scale well, and on many datasets we obtained an Out-Of-Memory error (OOM). On the datasets in our experiments, it turns out that properly tuned GBDTs outperform both (Feng et al., 2018) and (Zhou and Feng, 2017).
4.2 Ablative analysis
In this section, we analyze the key architecture components that define our model.
Choice functions. Constructing differentiable decision trees requires a function that selects items from a set. Such a function is required for both splitting feature selection and decision tree routing. We experimented with four possible options, each having different implications:


Softmax learns “dense” decision rules where all items have nonzero weights;

GumbelSoftmax (Jang et al., 2016) learns to stochastically sample a single element from a set;

Sparsemax (Martins and Astudillo, 2016) learns sparse decision rules, where only a few items have nonzero weights;

Entmax (Peters et al., 2019) generalizes both sparsemax and softmax; it is able to learn sparse decision rules, but is smoother than sparsemax, being more appropriate for gradient-based optimization. In this comparison, the entmax parameter α was set to 1.5.
We experimentally compare the four options above with both shallow and deep architectures in Table 3. We use the same choice function for both feature selection and tree routing across all experiments. For Gumbel-Softmax, we replaced it with a hard argmax one-hot vector during inference. The results clearly show that entmax with α = 1.5 outperforms the competitors across all experiments. First, Table 3 demonstrates that sparsemax and softmax are not universal choice functions. For instance, on the YearPrediction dataset, sparsemax outperforms softmax, while on the Epsilon dataset, softmax is superior. In turn, entmax provides great empirical performance across all datasets. Another observation is that Gumbel-Softmax is unable to learn deep architectures with both constant and annealed temperature schedules. This behavior is probably caused by the stochasticity of Gumbel-Softmax: the responses of the earlier layers are too noisy to produce useful features for the later layers.
Dataset  | YearPrediction                         | Epsilon
Function | softmax | Gumbel | sparsemax | entmax  | softmax | Gumbel | sparsemax | entmax
1 layer  | 78.41   | 79.39  | 78.13     | 77.43   | 0.1045  | 0.1979 | 0.1083    | 0.1043
2 layers | 77.61   | 79.31  | 76.81     | 77.05   | 0.1041  | 0.2884 | 0.1052    | 0.1031
4 layers | 77.58   | 79.69  | 76.60     | 76.21   | 0.1034  | 0.2908 | 0.1058    | 0.1033
8 layers | 77.47   | 80.49  | 76.31     | 76.17   | 0.1036  | 0.3081 | 0.1058    | 0.1036
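For reference, the Gumbel-Softmax baseline from the comparison above can be sketched in a few stdlib-only lines (names ours; at inference the ablation replaces the sample with a hard argmax one-hot):

```python
import math
import random

def gumbel_softmax(logits, temperature, rng=None):
    """Gumbel-Softmax relaxation sketch (Jang et al., 2016): perturb the
    logits with Gumbel noise, then apply a temperature-scaled softmax."""
    rng = rng or random.Random(0)
    gumbel = [-math.log(-math.log(rng.random() + 1e-12)) for _ in logits]
    scaled = [(l + g) / temperature for l, g in zip(logits, gumbel)]
    m = max(scaled)                    # subtract max for numerical stability
    exps = [math.exp(v - m) for v in scaled]
    total = sum(exps)
    return [e / total for e in exps]
```

The noise makes every forward pass stochastic, which is consistent with the observation above that early-layer responses become too noisy to serve as features for deeper layers.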
Feature importance. In this series of experiments, we analyze the internal representations learned by the NODE architecture. We begin by estimating the feature importances from different layers of a multi-layer ensemble via permutation feature importance, initially introduced in (Breiman, 2001). Namely, for 10,000 objects from the Higgs dataset, we randomly shuffle the values of each feature (original or learned on some NODE layer) and compute the increase in the classification error. Then, for each layer, we split the feature importance values into seven equal bins and calculate the total feature importance of each bin, shown in Figure 3 (left-top). We discovered that the features from the first layer are used the most, with feature importances decreasing with depth. This figure shows that deep layers are able to produce important features, even though earlier layers have an advantage because of the DenseNet architecture. Next, we estimated the mean absolute contribution of individual trees to the final response, reported in Figure 3 (left-bottom). One can see the reverse trend: deep trees tend to contribute more to the final response. Figure 3 (right) clearly shows an anti-correlation between feature importances and contributions to the final response, which implies that the main role of earlier layers is to produce informative features, while the later layers mostly use them for accurate prediction.
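Permutation feature importance itself is straightforward to sketch (pure-Python illustration of Breiman's procedure; the function and parameter names are ours):

```python
import random

def permutation_importance(model, X, y, feature, metric, rng=None):
    """Shuffle one feature column and measure the increase in error.
    model: callable row -> prediction; X: list of feature rows; y: targets;
    metric: callable (predictions, targets) -> error."""
    rng = rng or random.Random(0)
    base = metric([model(row) for row in X], y)
    shuffled_col = [row[feature] for row in X]
    rng.shuffle(shuffled_col)
    X_perm = [row[:feature] + [v] + row[feature + 1:]
              for row, v in zip(X, shuffled_col)]
    return metric([model(row) for row in X_perm], y) - base
```

A feature the model never reads yields zero importance, while shuffling a feature the model relies on increases the error.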
Training/Inference runtime. Finally, we compare the NODE runtime to the timings of the state-of-the-art GBDT implementations. In Table 4, we report the training and inference time for a million objects from the YearPrediction dataset. In this experiment, we evaluate ensembles of 1024 trees of depth six with all other parameters set to their default values. Our GPU setup has a single 1080Ti GPU and 2 CPU cores. In turn, our CPU setup has a 28-core Xeon E5-2660 v4 processor (which costs almost twice as much as the GPU). We use CatBoost v0.15 and XGBoost v0.90 as baselines, while NODE inference runs on PyTorch v1.1.0. Overall, NODE inference time is on par with the heavily optimized GBDT libraries, despite being implemented in pure PyTorch (i.e., no custom kernels).
Method    | NODE, 8 layers (1080Ti) | XGBoost (Xeon) | XGBoost (1080Ti) | CatBoost (Xeon)
Training  | 7min 42s                | 5min 39s       | 1min 13s         | 41s
Inference | 8.56s                   | 5.94s          | 4.45s            | 4.62s
5 Conclusion
In this paper, we introduce a new DNN architecture for deep learning on heterogeneous tabular data. The architecture consists of differentiable deep GBDTs trained end-to-end via backpropagation. In extensive experiments, we demonstrate the advantages of our architecture over existing competitors with both default and tuned hyperparameters. A promising research direction is incorporating the NODE layer into complex pipelines trained via backpropagation. For instance, in multi-modal problems, the NODE layer could be employed as a way to incorporate the tabular data, as CNNs are currently used for images, or RNNs are used for sequences.
References
Anghel et al. Benchmarking and optimization of gradient boosting decision tree algorithms.
Barandiaran (1998). The random subspace method for constructing decision forests. IEEE Transactions on Pattern Analysis and Machine Intelligence.
Breiman (2001). Random forests. Machine Learning 45 (1), pp. 5–32.
Chen and Guestrin (2016). XGBoost: a scalable tree boosting system. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 785–794.
Feng et al. (2018). Multi-layered gradient boosting decision trees. In Advances in Neural Information Processing Systems 31 (NeurIPS 2018).
Friedman (2001). Greedy function approximation: a gradient boosting machine. Annals of Statistics, pp. 1189–1232.
Goodfellow et al. (2016). Deep learning. MIT Press.
Harasymiv (2015). Lessons from 2 million machine learning models on Kaggle.
Huang et al. (2017). Densely connected convolutional networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition.
Izmailov et al. (2018). Averaging weights leads to wider optima and better generalization. arXiv preprint arXiv:1803.05407.
Jang et al. (2016). Categorical reparameterization with Gumbel-softmax. CoRR abs/1611.01144.
Ke et al. (2017). LightGBM: a highly efficient gradient boosting decision tree. In Advances in Neural Information Processing Systems, pp. 3146–3154.
Ke et al. (2018). TabNN: a universal neural network solution for tabular data.
Kohavi (1994). Bottom-up induction of oblivious read-once decision graphs: strengths and limitations. In AAAI.
Kontschieder et al. (2015). Deep neural decision forests. In Proceedings of the IEEE International Conference on Computer Vision.
Lay et al. (2018). Random hinge forest for differentiable learning. arXiv preprint arXiv:1802.03882.
Lin et al. (2019). Sparsemax and relaxed Wasserstein for topic sparsity. In Proceedings of the Twelfth ACM International Conference on Web Search and Data Mining.
Lou and Obukhov (2017). BDT: gradient boosted decision tables for high accuracy and scoring efficiency. In Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining.
Ma and Yarats (2018). Quasi-hyperbolic momentum and Adam for deep learning. arXiv preprint arXiv:1810.06801.
Martins and Astudillo (2016). From softmax to sparsemax: a sparse model of attention and multi-label classification. In International Conference on Machine Learning.
Miller et al. (2017). Forward thinking: building deep random forests. arXiv preprint arXiv:1705.07366.
Mishkin and Matas (2016). All you need is a good init. In 4th International Conference on Learning Representations (ICLR 2016).
Nair and Hinton (2010). Rectified linear units improve restricted Boltzmann machines. In Proceedings of the 27th International Conference on Machine Learning (ICML-10).
Niculae and Blondel (2017). A regularized framework for sparse and structured neural attention. In Advances in Neural Information Processing Systems.
Niculae et al. (2018). SparseMAP: differentiable sparse structured inference. arXiv preprint arXiv:1802.04223.
Peters et al. (2019). Sparse sequence-to-sequence models. In ACL 2019, pp. 1504–1519.
Prokhorenkova et al. (2018). CatBoost: unbiased boosting with categorical features. In Advances in Neural Information Processing Systems, pp. 6638–6648.
Rumelhart et al. (1985). Learning internal representations by error propagation. Technical report, California Univ San Diego La Jolla Inst for Cognitive Science.
Shavitt and Segal (2018). Regularization learning networks: deep learning for tabular datasets. In Advances in Neural Information Processing Systems, pp. 1379–1389.
Williams (1992). Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine Learning 8, pp. 229–256.
Yang et al. (2018). Deep neural decision trees. arXiv preprint arXiv:1806.06988.
Zhou and Feng (2017). Deep forest: towards an alternative to deep neural networks. In Proceedings of the Twenty-Sixth International Joint Conference on Artificial Intelligence (IJCAI).
Appendix A Appendix
a.1 Description of the datasets
In our experiments, we used six tabular datasets, described in Table 5. (1) Epsilon is a high-dimensional dataset from the PASCAL Large Scale Learning Challenge 2008. The problem is binary classification. (2) YearPrediction is a subset of the Million Song Dataset. It is a regression dataset, and the task is to predict the release year of a song from its audio features. It contains tracks from 1922 to 2011. (3) Higgs is a dataset from the UCI ML Repository. The problem is to predict whether the given event produces Higgs bosons or not. (4) Microsoft is a learning-to-rank dataset. It consists of 136-dimensional feature vectors extracted from query-url pairs. Each pair has a relevance judgment label, which takes values from 0 (irrelevant) to 4 (perfectly relevant). (5) Yahoo is a very similar ranking dataset with query-url pairs labeled from 0 to 4. We treat both ranking problems as regression (which corresponds to the pointwise approach to learning-to-rank). (6) Click is a subset of data from the 2012 KDD Cup. For the subset construction, we randomly sampled 500,000 objects of the positive class and 500,000 objects of the negative class. The categorical features were converted to numerical ones via the Leave-One-Out encoder from the category_encoders package, which is compatible with scikit-learn.
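The leave-one-out encoding can be sketched as follows (a simplified stand-in for the LeaveOneOutEncoder of the category_encoders package, without its regularization options; all names are ours):

```python
def leave_one_out_encode(categories, targets):
    """Replace each object's category with the mean target of all *other*
    objects sharing that category; singleton categories fall back to the
    global target mean."""
    sums, counts = {}, {}
    for c, t in zip(categories, targets):
        sums[c] = sums.get(c, 0.0) + t
        counts[c] = counts.get(c, 0) + 1
    global_mean = sum(targets) / len(targets)
    encoded = []
    for c, t in zip(categories, targets):
        if counts[c] > 1:
            encoded.append((sums[c] - t) / (counts[c] - 1))
        else:
            encoded.append(global_mean)
    return encoded
```

Excluding the object's own target from its category mean is what limits the target leakage of plain mean-target encoding.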
A.2 Optimization of hyperparameters
In order to tune the hyperparameters, we performed a random stratified split of the full training data into a train set (80%) and a validation set (20%) for the Epsilon, YearPrediction, Higgs, Microsoft, and Click datasets. For Yahoo, we use the train/validation/test split provided by the dataset authors. We use the Hyperopt library (https://github.com/hyperopt/hyperopt) to optimize the CatBoost, XGBoost, and FCNN hyperparameters. For each method, we perform 50 steps of the Tree-structured Parzen Estimator (TPE) optimization algorithm. As a final configuration, we choose the set of hyperparameters corresponding to the smallest loss on the validation set.
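The protocol above (50 trials, keep the configuration with the smallest validation loss) can be sketched in pure Python. In this sketch, plain random sampling stands in for Hyperopt's adaptive TPE, and `validation_loss` is a hypothetical stand-in for training a model and scoring it on the 20% validation split; the parameter names and ranges are illustrative:

```python
import math
import random

def log_uniform(rng, low, high):
    """Sample from a log-uniform distribution on [low, high]."""
    return math.exp(rng.uniform(math.log(low), math.log(high)))

def tune(validation_loss, n_trials=50, seed=0):
    """Try n_trials random configurations and keep the one with the
    smallest validation loss (TPE would sample more adaptively)."""
    rng = random.Random(seed)
    best_params, best_loss = None, float("inf")
    for _ in range(n_trials):
        params = {
            "learning_rate": log_uniform(rng, 1e-5, 1.0),
            "depth": rng.randint(1, 10),
        }
        loss = validation_loss(params)
        if loss < best_loss:
            best_params, best_loss = params, loss
    return best_params, best_loss

# toy quadratic "validation loss" purely for illustration
best, loss = tune(lambda p: (p["learning_rate"] - 0.1) ** 2 + 0.01 * p["depth"])
```

With a fixed seed the search is reproducible, and running more trials can only improve (or match) the best validation loss found.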
A.2.1 CatBoost and XGBoost
On each iteration of Hyperopt, the number of trees was chosen based on the validation set, with the maximal number of trees set to 2048. Below is the list of hyperparameters and their search spaces for CatBoost.
- learning_rate: Log-uniform distribution
- random_strength: Discrete uniform distribution
- one_hot_max_size: Discrete uniform distribution
- l2_leaf_reg: Log-uniform distribution
- bagging_temperature: Uniform distribution
- leaf_estimation_iterations: Discrete uniform distribution
Table 5: Description of the datasets.

| Dataset | Train | Test | Features | Task | Metric | Description |
|---|---|---|---|---|---|---|
| Epsilon (https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/binary.html) | 400K | 100K | 2000 | Classification | Error | PASCAL Challenge 2008 |
| YearPrediction (https://archive.ics.uci.edu/ml/datasets/yearpredictionmsd) | 463K | 51.6K | 90 | Regression | MSE | Million Song Dataset |
| Higgs (https://archive.ics.uci.edu/ml/datasets/HIGGS) | 10.5M | 500K | 28 | Classification | Error | UCI ML Higgs |
| Microsoft (https://www.microsoft.com/en-us/research/project/mslr/) | 723K | 241K | 136 | Regression | MSE | MSLR-WEB10K |
| Yahoo (https://webscope.sandbox.yahoo.com/catalog.php?datatype=c) | 544K | 165K | 699 | Regression | MSE | Yahoo LETOR dataset |
| Click (http://www.kdd.org/kddcup/view/kddcup2012track2) | 800K | 200K | 11 | Classification | Error | 2012 KDD Cup |
Below is the list of tuned XGBoost hyperparameters and their search spaces.
- eta: Log-uniform distribution
- max_depth: Discrete uniform distribution
- subsample: Uniform distribution
- colsample_bytree: Uniform distribution
- colsample_bylevel: Uniform distribution
- min_child_weight: Log-uniform distribution
- alpha: Uniform choice {0, Log-uniform distribution}
- lambda: Uniform choice {0, Log-uniform distribution}
- gamma: Uniform choice {0, Log-uniform distribution}
A.2.2 FCNN
Fully connected neural networks were tuned using the Hyperas library (https://github.com/maxpumperla/hyperas), which is a Keras wrapper for Hyperopt. We consider FCNNs constructed from Dense-ReLU-Dropout blocks. The number of units in each layer is independent of the others, and the dropout rate is the same for the whole network. The networks are trained with the Adam optimizer, with averaging of the model parameters over consecutive checkpoints (Izmailov et al., 2018) and early stopping on the validation set. The batch size is fixed to 1024 for all datasets. Below is the list of tuned hyperparameters.
- Number of layers: Discrete uniform distribution
- Number of units: Discrete uniform distribution over a set
- Learning rate: Uniform distribution
- Dropout: Uniform distribution
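The checkpoint averaging mentioned above (Izmailov et al., 2018) simply averages model weights elementwise over several consecutive checkpoints. A framework-agnostic sketch, with each checkpoint represented as a plain dict of parameter lists (a real implementation would operate on framework tensors):

```python
def average_checkpoints(checkpoints):
    """Average model parameters elementwise over consecutive checkpoints.

    Each checkpoint is a dict mapping parameter names to lists of floats;
    all checkpoints must share the same parameter names and shapes.
    """
    assert checkpoints, "need at least one checkpoint"
    n = len(checkpoints)
    averaged = {}
    for name in checkpoints[0]:
        params = [ckpt[name] for ckpt in checkpoints]
        # elementwise mean across the n checkpoints
        averaged[name] = [sum(vals) / n for vals in zip(*params)]
    return averaged
```

Averaging checkpoints from the tail of training typically yields a model that generalizes better than any single checkpoint, which is why it is combined with early stopping here.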
A.2.3 NODE
Neural Oblivious Decision Ensembles were tuned by grid search over the following hyperparameter values. In the multi-layer NODE, we use the same architecture for all layers, i.e., the same number of trees of the same depth. Here, total_tree_count denotes the total number of trees over all layers. For each dataset, we use the maximal batch size that fits in GPU memory. We always use the same learning rate.
- num_layers: {2, 4, 8}
- total_tree_count: {1024, 2048}
- tree_depth: {6, 8}
- tree_output_dim: {2, 3}
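The grid above is exhaustively enumerated as a Cartesian product, giving 3 × 2 × 2 × 2 = 24 candidate configurations per dataset; each configuration would then be trained and scored on the validation set. A minimal sketch of the enumeration:

```python
from itertools import product

# the NODE hyperparameter grid from the text
GRID = {
    "num_layers": [2, 4, 8],
    "total_tree_count": [1024, 2048],
    "tree_depth": [6, 8],
    "tree_output_dim": [2, 3],
}

def grid_configs(grid):
    """Yield every combination of hyperparameter values as a dict."""
    keys = sorted(grid)
    for values in product(*(grid[k] for k in keys)):
        yield dict(zip(keys, values))

configs = list(grid_configs(GRID))  # 3 * 2 * 2 * 2 = 24 configurations
```

Exhaustive enumeration is affordable here because the grid is small; for larger spaces the TPE search used for the baselines would be the natural choice.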