Residual Networks Behave Like Boosting Algorithms
† This work was supported by and completed whilst the author was at Suncorp Group Limited.
Abstract
We show that Residual Networks (ResNet) are equivalent to boosting feature representations, without any modification to the underlying ResNet training algorithm. We prove a regret bound based on Online Gradient Boosting theory, which suggests that ResNet could achieve Online Gradient Boosting regret bounds through architectural changes: adding a shrinkage parameter to the identity skip-connections and using residual modules with max-norm bounds. Through this relation between ResNet and Online Boosting, novel feature representation boosting algorithms can be constructed by altering the residual modules. We demonstrate this by proposing decision tree residual modules to construct a new boosted decision tree algorithm, by demonstrating generalization error bounds for both approaches, and by relaxing constraints within the BoostResNet algorithm to allow it to be trained in an out-of-core manner. We evaluate convolutional ResNet with and without shrinkage modifications to demonstrate its efficacy, and show that our online boosted decision tree algorithm is comparable to state-of-the-art offline boosted decision tree algorithms without the drawbacks of offline approaches.
1 Introduction
Residual Networks (ResNet) He et al. (2016) have attracted considerable attention due to their performance and their ability to construct “deep” networks while largely avoiding the problem of vanishing (or exploding) gradients. Some attempts have been made to explain ResNets through: unravelling their representation Veit et al. (2016); observing identity loops and finding no spurious local optima Hardt and Ma (2017); and reinterpreting residual modules as weak classifiers, which allows sequential training under boosting theory Huang et al. (2018).
Empirical evidence shows that these deep residual networks, and subsequent architectures with the same parameterizations, are easier to optimize. They also outperform non-residual ones, and have consistently achieved state-of-the-art performance on various computer vision tasks such as CIFAR-10 and ImageNet He et al. (2016).
1.1 Summary of Results
We demonstrate the equivalence of ResNet and Online Boosting in Section 3. We show that the layer-by-layer boosting method of ResNet is equivalent to the additive modelling approach first demonstrated in LogitBoost Friedman et al. (2000), which boosts feature representations rather than the label directly, i.e. ResNet can be framed as an Online Boosting algorithm with composite loss functions. Although traditional boosting results may not apply because ResNet is not a naive weighted ensemble, ResNet can be treated as an Online Boosting analogue, for which regret bound guarantees are available. We demonstrate that under “nice” conditions for the composite loss function, the regret bound for Online Boosting holds, and by extension also applies to ResNet architectures.
Taking inspiration from Online Boosting, we also modify the architecture of ResNet with an additional learnable shrinkage parameter (vanilla ResNet can be interpreted as an Online Boosting algorithm in which the shrinkage factor is fixed rather than learned). As this approach only modifies the neural network architecture, the same underlying ResNet training algorithms can still be used.
Experimentally, we compare vanilla ResNet with our modified ResNet using a convolutional residual network (ResNetCNN) on multiple image datasets. Our modified ResNet shows some improvement over the vanilla ResNet architecture.
We also evaluate our boosted neural decision tree residual network on multiple benchmark datasets and compare its results against other decision tree ensemble methods, including Deep Neural Decision Forests Kontschieder et al. (2016), neural decision trees ensembled via AdaNet Cortes et al. (2017), and off-the-shelf algorithms (gradient boosting decision tree/random forest) using LightGBM Ke et al. (2017). In our experiments, the neural decision tree residual network showed superior performance to the other neural decision tree variants, and comparable performance to traditional offline gradient boosting decision tree models.
1.2 Related Works
In recent years researchers have sought to understand why ResNet performs the way it does. The BoostResNet algorithm reinterprets ResNet as a multi-channel telescoping sum boosting problem for the purpose of introducing a new algorithm for sequential training Huang et al. (2018). Theoretical justification for the representational power of ResNet has been provided under linear neural network constraints Hardt and Ma (2017). One interpretation of residual networks is as a collection of many paths of differing lengths which behave like a shallow ensemble; empirical studies demonstrate that residual networks introduce short paths which can carry gradients throughout the extent of very deep networks Veit et al. (2016).
Comparison with BoostResNet and AdaNet
Combining neural networks and boosting has previously been explored in architectures such as AdaNet Cortes et al. (2017) and BoostResNet Huang et al. (2018). We seek to understand why ResNet achieves its level of performance, without altering how ResNet is trained.
In the case of BoostResNet, a distribution must be explicitly maintained over all examples during training, and parts of ResNet are trained sequentially, so the model cannot be updated in a truly online, out-of-core manner. In the case of AdaNet, which does not always work for a ResNet structure, additional feature vectors are sequentially added and the network chooses its own structure during learning. In our proposed approach, we do not require these modifications and can train the model in the same way as an unmodified ResNet. A ResNet-style architecture is a special case of AdaNet, so the AdaNet generalization guarantee applies here, and our generalization analysis is built upon that work. Furthermore, we demonstrate that Neural Decision Trees belong to the same family of feedforward neural networks as AdaNet, so the AdaNet generalization guarantee also applies to Neural Decision Tree ResNet modules.
2 Preliminaries
In this section we cover the background of residual neural networks and boosting. We also explore the conditions which enable regret bounds in the Online Gradient Boosting setting, and the class of feedforward neural networks for which AdaNet generalization bounds hold.
2.1 Residual Neural Networks
A residual neural network (ResNet) is composed of stacked entities referred to as residual blocks.
A residual block of ResNet contains a module and an identity loop. Let each module map its input $x$ to $f_t(x)$, where $t$ denotes the level of the module and $f_t$ is typically a sequence of convolutions, batch normalizations or nonlinearities. These formulations may differ depending on context and the model architecture.
We denote the output of the $t$-th residual block by $g_{t+1}(x)$:
$$g_{t+1}(x) = f_t(g_t(x)) + g_t(x) \qquad (1)$$
where $x$ is the input of the ResNet.
Because the output of a ResNet satisfies the recursive relation in equation 1, the output of the $t$-th residual block equals the sum of the lower module outputs, i.e., $g_{t+1}(x) = \sum_{s=0}^{t} f_s(g_s(x))$, where $g_0(x) = 0$ and $f_0(g_0(x)) = x$. For classification tasks, the output of a ResNet is rendered after a linear classifier $w \in \mathbb{R}^{n \times C}$ on the representation $g_{T+1}(x)$, where $C$ is the number of classes and $n$ is the number of channels:
$$\hat{y} = \sigma\big(w^\top g_{T+1}(x)\big) = \sigma\Big(w^\top \sum_{t=0}^{T} f_t(g_t(x))\Big) \qquad (2)$$
where $\sigma$ denotes a map from classifier output to labels. For example, $\sigma$ could represent a softmax function.
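The relation between the recursive form and the summation form can be checked numerically. The following is a minimal sketch, assuming small tanh layers as stand-ins for the residual modules; the names `module`, `resnet_recursive` and `resnet_as_sum` are illustrative and do not come from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
dim, num_modules = 4, 3

# Hypothetical residual modules: each is a small tanh layer with its own weights.
weights = [rng.normal(scale=0.5, size=(dim, dim)) for _ in range(num_modules)]

def module(t, h):
    """Residual module at level t (illustrative stand-in for conv/BN/ReLU)."""
    return np.tanh(weights[t] @ h)

def resnet_recursive(x):
    """Apply the recursion: block output = module(block input) + block input."""
    h = x
    for t in range(num_modules):
        h = module(t, h) + h
    return h

def resnet_as_sum(x):
    """Same output written as the input plus the sum of all module outputs."""
    h, terms = x, [x]
    for t in range(num_modules):
        out = module(t, h)
        terms.append(out)
        h = h + out
    return np.sum(terms, axis=0)

x = rng.normal(size=dim)
recursive_out = resnet_recursive(x)
summed_out = resnet_as_sum(x)
```

Both functions return the same representation, which is the property that lets a shared linear classifier act on a sum of module outputs.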
2.2 Boosting
The goal of boosting is to combine weak learners into a strong learner. There are many variations of boosting. For example, in AdaBoost and its derivatives, the boosting algorithm must choose training sets for the weak classifier to force it to make novel inferences Friedman et al. (2000). This was the approach used by BoostResNet Huang et al. (2018). In gradient boosting, this requirement is removed by training against pseudo-residuals, and the method can even be extended to the online learning setting Beygelzimer et al. (2015).
In either scenario, boosting can be viewed as an additive model or linear combination of models $F(x) = \sum_{t} \alpha_t f_t(x)$, where $f_t$ is a function of the input and $\alpha_t$ is the corresponding multiplier for the $t$-th model Friedman et al. (2000); Beygelzimer et al. (2015).
LogitBoost is an algorithm first introduced in “Additive Logistic Regression: A Statistical View of Boosting” Friedman et al. (2000), which introduces boosting on the input feature representation, including for neural networks. In the general binary classification scenario, the formulation relies on boosting over the logit or softmax transformation
(3) 
Where represents the softmax function. This form is similar to the linear classifier layer which is used by ResNet algorithm.
Online Boosting introduces a framework for training boosted models in an online manner. Within this formulation, two adjustments are required to make offline boosting models online. First, the partial sums of the weak learners' predictions are multiplied by a shrinkage factor, which is tuned using gradient descent. Second, the partial sum outputs must be bounded Beygelzimer et al. (2015).
The bounds presented for online gradient boosting are based on regret. The regret of a learner is defined as the difference between the total loss of the learner and the total loss of the best hypothesis in hindsight.
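As a worked illustration of this definition, the sketch below computes regret against a finite set of fixed hypotheses. The loss values are made up for the example, and the helper name `regret` is an assumption rather than anything defined in the paper.

```python
import numpy as np

def regret(learner_losses, hypothesis_losses):
    """Total learner loss minus the total loss of the best fixed hypothesis.

    learner_losses: shape (T,), loss the online learner suffered each round.
    hypothesis_losses: shape (K, T), loss each fixed hypothesis would have suffered.
    """
    total_learner = float(np.sum(learner_losses))
    best_in_hindsight = float(np.min(np.sum(hypothesis_losses, axis=1)))
    return total_learner - best_in_hindsight

# Toy data (made up for illustration): 4 rounds, 2 candidate hypotheses.
learner_losses = np.array([1.0, 0.5, 0.5, 0.25])
hypothesis_losses = np.array([
    [0.5, 0.5, 0.5, 0.5],     # hypothesis A: total loss 2.0
    [1.0, 0.25, 0.25, 0.25],  # hypothesis B: total loss 1.75
])
toy_regret = regret(learner_losses, hypothesis_losses)
```

Here the learner's total loss is 2.25 and the best hypothesis in hindsight has total loss 1.75, so the regret is 0.5. A regret bound guarantees that this gap grows slowly relative to the number of rounds.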
Online gradient boosting regret bounds can be applied to any linear combination of a given base weak learner, for convex, linear loss functions with a bounded Lipschitz constant.
Corollary 2.1
(From Corollary 1 Beygelzimer et al. (2015)) Let the learning rate , number of weak learners , be given parameters. Algorithm 1 is an online learning algorithm for for set of convex, linear loss functions with Lipschitz constant bounded by with the following regret bound for any :
where , or the initial error, and is the regret or excess loss for the base learner algorithm.
The regret bound in this theorem depends on two conditions: first, every weak learner must have a finite upper bound on its output; second, the set of loss functions must admit an efficiently computable subgradient with a finite upper bound.
Compared with the boosting approach used in BoostResNet, which is based on AdaBoost Huang et al. (2018), the online gradient boosting algorithm does not require maintaining an explicit distribution of weights over the whole training data set and is a “true” online, out-of-core algorithm. Leveraging online gradient boosting allows us to overcome the constraints of the BoostResNet approach.
AdaNet Generalization Bounds for feedforward neural networks defined to be a multilayer architecture where units in each layer are only connected to those in the layer below has been provided by Cortes et al. (2017). It requires the weights of each layer to be bounded by norm, with , and all activation functions between each layer to be coordinatewise and Lipschitz activation functions. This yields the following generalization error bounds provided by Lemma 2 from Cortes et al. (2017):
Corollary 2.2
(From Lemma 2 Cortes et al. (2017)) Let be distribution over and be a sample of examples chosen independently at a random according to . With probability at least , for , the strong decision tree classifier satisfies that
As this bound depends only logarithmically on the depth of the network, it demonstrates the importance of strong performance in the earlier layers of the feedforward network.
Now that we have a formulation for ResNet and boosting, we explore further properties of ResNet, and how we may evolve and create more novel architectures.
3 ResNet are Equivalent to a Boosted Model
As we recall from equations 2 and 3, ResNet indeed has a form similar to LogitBoost. In this scenario, both formulations aim to boost the underlying feature representation. One consequence of the ResNet formulation is that the linear classifier is shared across all ResNet modules.
Assumption 3.1
The th residual module with a trainable linear classifier layer defined by
(4) 
Is a weak learner for all . We will call this weak learner the hypothesis module.
This assumption is required so that the hypothesis module is a weak learner, as needed for the learning bounds in Corollary 2.1. We show in Section 6.3 that the different ResNet module variants used in our experiments satisfy this assumption.
Overall this demonstrates that the proposed framework is equivalent to traditional boosting frameworks which boost on the feature representation. However, to further analyse the algorithmic results, we need to first consider additional restrictions which are placed within the “Online Boosting Algorithm” framework.
3.1 Online Boosting Considerations
Our representation is a special case of online gradient boosting as shown in Algorithm 1, and our regret bound analysis is built upon the work of Beygelzimer et al. (2015). The regret bounds for an online boosting algorithm that competes with linear combinations of the base weak learners apply to a class of convex, linear loss functions with a bounded Lipschitz constant.
4 Analysis of Online Boosting for Composite Loss Functions
In this section we provide analysis on corollary 2.1. Corollary 2.1 holds for the learning algorithm with losses in , where is defined to be set of convex, linear loss functions with Lipschitz constant bounded by . Next we describe the conditions in which composite loss functions belongs in .
A composite loss function , where is the link function, belongs to if is the canonical link function. This has been shown to be a sufficient but not necessary condition for canonical link to lead to convex composite loss functions Reid and Williamson (2010).
Lemma 4.1
Composite loss functions retain smoothness
Proof: If the function composed with the loss is Lipschitz continuous (e.g. the logistic function/softmax, whose derivative is bounded everywhere), then the composite loss is also Lipschitz continuous, since a composition of Lipschitz continuous functions is itself Lipschitz continuous, with a Lipschitz constant bounded by the product of the constants of the composed functions.
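The composition step can be written out explicitly. The symbols here are assumed notation for illustration only: $\ell$ is a loss with Lipschitz constant $L_\ell$, and $h$ is the function it is composed with, having Lipschitz constant $L_h$.

```latex
\begin{align*}
|\ell(h(u)) - \ell(h(v))| &\le L_\ell \, |h(u) - h(v)| \\
                          &\le L_\ell L_h \, |u - v| ,
\end{align*}
```

so $\ell \circ h$ is Lipschitz continuous with constant at most $L_\ell L_h$.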
Hence, if has Lipschitz constant bounded by , then the composition of the particular loss function with Lipschitz constant of also has Lipschitz constant bounded by and belongs to the base loss function class. An example of such a link function is the logit function, which has a Lipschitz constant of 1 and is the canonical link function for log loss (cross entropy loss), which suggests that the composite loss function is indeed convex and belongs in loss function class
This demonstrates that ResNet, which boosts on the feature representation and has a logit link, satisfies the regret bound shown in Corollary 2.1.
5 Recovering Loss For Intermediary Residual Modules
When relating ResNet to the Online Boosting algorithm, the Online Boosting algorithm requires the gradient of the underlying boosting function to be recovered as part of the update process, as shown in line 12 of Algorithm 1. One approach to this challenge was suggested in BoostResNet, where a common auxiliary linear classifier is used across all residual modules. However, this approach was not explored in that work, as BoostResNet focused on sequential training of residual modules and such a constraint was deemed inappropriate. Instead, BoostResNet constructs a different linear classifier layer at each stage, which is dropped once the residual module has been trained.
Our approach to remedy this is to formulate the ResNet architecture as a single-input, multi-output training problem, whereby each residual module has an explicit ‘shortcut’ to the output labels whilst sharing the same linear classifier layer. This architecture is shown in Figure 1.
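A minimal forward-pass sketch of this single-input, multi-output arrangement is given below. It assumes dense ReLU modules and a summed cross-entropy objective; these choices, and the names `shared_classifier` and `multi_output_forward`, are illustrative assumptions rather than the paper's implementation.

```python
import numpy as np

def softmax(z):
    z = z - np.max(z)
    e = np.exp(z)
    return e / np.sum(e)

rng = np.random.default_rng(1)
channels, classes, num_modules = 5, 3, 4

# Hypothetical parameters: one weight matrix per residual module and a single
# linear classifier shared by every output head.
module_weights = [rng.normal(scale=0.3, size=(channels, channels))
                  for _ in range(num_modules)]
shared_classifier = rng.normal(size=(channels, classes))

def multi_output_forward(x):
    """Return one class distribution per residual block, all via the same classifier."""
    h, heads = x, []
    for w_t in module_weights:
        h = np.maximum(w_t @ h, 0.0) + h                # residual block
        heads.append(softmax(shared_classifier.T @ h))  # shortcut to the label
    return heads

def total_cross_entropy(heads, label):
    """Sum of per-head losses, so every module receives a signal from the label."""
    return -sum(np.log(p[label]) for p in heads)

heads = multi_output_forward(rng.normal(size=channels))
loss = total_cross_entropy(heads, label=2)
```

Because every head is compared against the same label, back propagation through the summed loss reaches each residual module directly instead of only through the final output.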
Remark: It has been demonstrated that, by carefully constructing a ResNet, the last layer need not be trained and can instead be a fixed random projection Hardt and Ma (2017). This has been justified theoretically for linear neural networks.
In our ResNet algorithm, if the linear classifier layer is shared, then the model would be framed as a single input, multioutput residual network, where the outputs are all predicting the same output . The predicted output of the network, which corresponds to each of the weak learners would correspond to on lines 8 and 9 of Algorithm 2. Through this setup, it allows each residual module, to be updated by back propagation with respect to the label in the same manner as line 12 in Algorithm 1. In a similar manner the shrinkage layers in Algorithm 2 would be updated as shown in Algorithm 1 as per line .
Through unravelling a ResNet, the paths of a ResNet are distributed in a binomial manner Veit et al. (2016); that is, there is one path that passes through all $n$ modules and $n$ paths that go through exactly one module, with an average path length of $n/2$ Veit et al. (2016). This means that even without a shared layer, there will be paths within the ResNet framework along which the gradient is recovered to the residual modules. This approach is shown in Figure 2, and has an identical setup to Algorithm 2, except that in the back propagation step we update all layers based on the final output of the whole network only.
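The binomial distribution of path lengths can be enumerated directly. This is a small sketch under the unravelled view, in which each module is either traversed or skipped; the function names are illustrative.

```python
from math import comb

def path_length_distribution(num_modules):
    """Number of paths through an unravelled ResNet, indexed by path length.

    Each module is either traversed or skipped, so there are C(n, k) paths
    that pass through exactly k of the n modules.
    """
    return [comb(num_modules, k) for k in range(num_modules + 1)]

def mean_path_length(num_modules):
    """Average number of modules traversed, taken over all paths."""
    counts = path_length_distribution(num_modules)
    return sum(k * c for k, c in enumerate(counts)) / sum(counts)

counts = path_length_distribution(10)
average = mean_path_length(10)
```

For 10 modules this gives 1024 paths in total, of which one traverses every module and ten traverse exactly one, with an average path length of 5.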
Remark: if a residual network is reframed as the Online Boosting Algorithm 1, it would be equivalent to choosing , with being fixed or untrainable. For the regret bounds to hold, we require shrinkage parameter to be trainable, and the outputs of each residual module to be bounded by a predetermined maxnorm.
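The two requirements in this remark can be sketched as a modified residual block. The exact form used in the paper's Algorithm 2 is not reproduced in this text, so the block below is an assumed form: the skip-connection is scaled by a shrinkage value and the module output is clipped to a max-norm. In a real network the shrinkage would be a trainable parameter; here it is a plain float and only the forward computation is shown.

```python
import numpy as np

def clip_to_max_norm(v, max_norm):
    """Rescale v so that its Euclidean norm is at most max_norm."""
    norm = np.linalg.norm(v)
    return v if norm <= max_norm else v * (max_norm / norm)

def shrinkage_residual_block(x, module, shrinkage, max_norm):
    """Residual block whose skip-connection is scaled by a shrinkage value.

    Assumed form for illustration: shrinkage * x + clipped module output.
    """
    return shrinkage * x + clip_to_max_norm(module(x), max_norm)

module = lambda h: np.tanh(2.0 * h)   # hypothetical residual module
x = np.array([0.5, -1.0, 2.0])

vanilla = module(x) + x
fixed = shrinkage_residual_block(x, module, shrinkage=1.0, max_norm=np.inf)
shrunk = shrinkage_residual_block(x, module, shrinkage=0.9, max_norm=1.0)
```

In this sketch, a shrinkage of 1 with no norm bound reproduces the ordinary residual block, which shows that the change is purely architectural and leaves the training procedure untouched.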
In Section 6 we will provide empirical evidence validating both approaches.
5.1 Neural Decision Tree
Another popular application of boosting algorithms is through the construction of decision trees. In order to demonstrate how ResNet could be used to boost a variety of models with different residual module representations, we describe our construction for our Neural Decision Tree ResNet and the associated generalization error analysis.
5.1.1 Construction of Neural Decision Tree and Generalization Error Analysis
To demonstrate that the decision tree formulation based on Deep Neural Decision Forests belongs to this family of neural network models, consider the residual module shown in Figure 5, where the split functions are realized by randomized multilayer perceptrons Kontschieder et al. (2016). This construction is a neural network with three sets of layers that belongs to the family of artificial neural networks defined by Cortes et al. (2017), which requires the weights of each layer to be norm-bounded and all activation functions between layers to be coordinate-wise and Lipschitz. The size of these layers is based on a predetermined number of nodes with a corresponding number of leaves . Let the input space be and for any , let denote the corresponding feature vector.
The first layer is decision node layer. This is defined by trainable parameters , with and . Define and , which represent the positive and negative routes of each node. Then the output of the first layer is . This is interpreted as the linear decision boundary which dictates how each node is to be routed.
The next is the probability routing layer, which are all untrainable, and are a predetermined binary matrix . This matrix is constructed to define an explicit form for routing within a decision tree. We observe that routes in a decision tree are fixed and predetermined. We introduce a routing matrix which is a binary matrix which describes the relationship between the nodes and the leaves. If there are nodes and leaves, then , where the rows of represents the presence of each binary decision of the nodes for the corresponding leaf . We define the activation function to be . Then the output of the second layer is . As is 1Lipschitz bounded function in the domain and the range of , then by extension, is a 1Lipschitz bounded function for . As is a binary matrix, then the output of must also be in the range .
The final output layer is the leaf layer, a fully connected layer to the previous layer, which is defined by parameter , which represents the number of leaves. The activation function is defined to be . The output of the last layer is defined to be . Since has range , then is a 1-Lipschitz bounded function as is 1-Lipschitz bounded in the domain . As each activation function is 1-Lipschitz, our decision tree neural network belongs to the same family of artificial neural networks defined by Cortes et al. (2017), and thus our decision trees have the corresponding generalisation error bounds related to AdaNet.
The formulation of these equations and their parameters is shown in Figure 3, which demonstrates how a decision tree trained in Python scikit-learn can have its parameters converted to a neural decision tree; Figure 4 shows the formulation of the three-layer network which constructs this decision tree.
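The three-layer construction can be sketched for a small tree. The code below is a minimal illustration under stated assumptions: sigmoid soft decisions, a hand-written routing matrix for a depth-2 tree, routing computed as the exponential of summed logs, and softmax leaf distributions. The exact activation functions in the paper are not recoverable from this text, so these choices and all names are hypothetical.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def softmax_rows(z):
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

# Routing matrix for a depth-2 tree with 3 decision nodes and 4 leaves.
# Rows 0-2: "positive" route of nodes 0-2; rows 3-5: "negative" route.
# Column j marks the decisions on the path from the root to leaf j.
ROUTING = np.array([
    [1, 1, 0, 0],   # node 0 positive
    [1, 0, 0, 0],   # node 1 positive
    [0, 0, 1, 0],   # node 2 positive
    [0, 0, 1, 1],   # node 0 negative
    [0, 1, 0, 0],   # node 1 negative
    [0, 0, 0, 1],   # node 2 negative
])

def neural_tree_forward(x, node_w, node_b, leaf_logits):
    # Layer 1 (decision nodes): soft decision per node, both routes concatenated.
    go_positive = sigmoid(node_w @ x + node_b)
    routes = np.concatenate([go_positive, 1.0 - go_positive])
    # Layer 2 (fixed routing): product of decisions along each root-to-leaf path,
    # computed as exp(sum(log)); the small constant avoids log(0).
    leaf_prob = np.exp(ROUTING.T @ np.log(routes + 1e-12))
    # Layer 3 (leaves): mix the per-leaf class distributions.
    return leaf_prob, leaf_prob @ softmax_rows(leaf_logits)

rng = np.random.default_rng(2)
leaf_prob, class_prob = neural_tree_forward(
    x=rng.normal(size=5),
    node_w=rng.normal(size=(3, 5)),
    node_b=np.zeros(3),
    leaf_logits=rng.normal(size=(4, 2)),
)
```

Only the decision-node weights and the leaf parameters are trainable in this sketch; the routing matrix is fixed in advance by the tree shape, and the leaf probabilities always sum to one.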
5.1.2 Extending Neural Decision Trees to ResNet
For our Neural Decision Tree ResNet, in order to ensure that the feature representation is invariant to the number of leaves in the tree, we add a linear projection so that the shortcut connection matches the dimensions, as suggested in the original ResNet implementation He et al. (2016).
6 Experiments
Below, we perform experiments on two different ResNet architectures.
First, we examine the ResNet convolution network variant He et al. (2016), with and without the addition of a trainable shrinkage parameter. Both models are assessed on the Street View House Numbers (SVHN) Netzer et al. (2011) and CIFAR-10 Krizhevsky et al. (2012) benchmark datasets.
Second, we examine the efficacy of creating boosted decision tree models in the ResNet framework. Our approach was compared against other neural decision tree ensemble models and offline models, including Deep Neural Decision Forests Kontschieder et al. (2016), neural decision trees ensembled via AdaNet Cortes et al. (2017), and off-the-shelf algorithms (gradient boosting decision tree/random forest) using LightGBM Ke et al. (2017). All models were assessed using UCI datasets, which are detailed in Section B of the appendix.
In both scenarios, the datasets were divided using a split into a training and test dataset respectively.
6.1 Convolution Network ResNet
In both the CIFAR-10 and SVHN datasets we fit the same 20-layer ResNet. This ResNet consists of one convolution, followed by stacks of layers with convolutions of the feature maps sizes of respectively, with layers for each feature map size. The number of filters are . The subsampling is performed by convolutions with a stride of 2 and the network ends with global average pooling, a way fully connected layer and softmax. The implementation is taken directly from the Keras CIFAR-10 ResNet sample code. The model was run without image augmentation, and with a batch size of for epochs. To compare against the original ResNet, we augment the ResNet model by adding a trainable shrinkage parameter as described in Section 5 (ResNetShrinkage), and also evaluate an augmented ResNet model with both the shrinkage parameter and a shared linear layer (ResNetShared).
Model               Train    Test
ResNet              0.93886  0.94630
ResNet (Shrinkage)  0.93917  0.94852
ResNet (Shared)     0.93689  0.94626

Model               Train    Test
ResNet              0.98412  0.91530
ResNet (Shrinkage)  0.98504  0.91870
ResNet (Shared)     0.94496  0.88570
We find that the model with shrinkage only has marginally higher accuracy than the vanilla ResNet-20 implementation on both datasets. The ResNetShared model is comparable on the SVHN task but falls short on the CIFAR-10 task. In general, adding shrinkage does not degrade the performance of ResNet models, and in certain cases it improves it.
6.2 Neural Decision Tree ResNet
The next experiment addresses whether ResNet can be used to boost a variety of models with different residual module representations. We compared our decision tree in ResNet (ResNetDT), and ResNet with a shared linear classifier layer (ResNetDT Shared), with Deep Neural Decision Forests Kontschieder et al. (2016) (DNDF), neural decision trees ensembled via AdaNet Cortes et al. (2017) (AdaNetDT), and off-the-shelf algorithms (gradient boosting decision tree/random forest) using LightGBM Ke et al. (2017), which we denote as LightGBDT and LightRF respectively.
For ResNetDT, ResNetDT Shared, DNDF, LightGBM and LightRF, all models used an ensemble of 15 trees with a maximum depth of 5 (i.e. 32 nodes). Each of these models was run for 200 epochs.
For AdaNetDT, the candidate subnetworks used are decision trees identical to implementation in DNDF. This means that at every iteration, a candidate neural decision tree was either added or discarded with no change to the ensemble. The complexity measure function was defined to be where is the number of hidden layers (i.e. number of nodes) in the decision tree Golowich et al. (2018). For AdaNetDT, the algorithm started with tree, and was run times with epoch per iteration, allowing AdaNet to build up to trees. Once the final neural network structure was chosen, it was run for another 200 epoch and used for comparison with the other models.
To assess the efficacy, we used a variety of datasets from the UCI repository. Full results for the training and test data sets are provided in section B of the appendix.
Model              Mean Improv.  Mean Reciprocal Rank
LightGBM           26.140%       0.545
LightRF            17.414%       0.2683
AdaNetDT           16.519%       0.345
ResNetDT           25.360%       0.5783
ResNetDT (Shared)  20.306%       0.545

Model              Mean Improv.  Mean Reciprocal Rank
LightGBM           20.904%       0.67
LightRF            13.665%       0.3367
AdaNetDT           14.771%       0.315
ResNetDT           17.892%       0.5117
ResNetDT (Shared)  12.949%       0.4067
In order to construct a baseline against which all models are comparable, the results presented are the average and median error improvement compared with DNDF models, as they were the worst-performing models on these benchmarks. From the results in Table 4, LightGBM performed the best, with the best average improvement in error relative to the baseline DNDF model. Interestingly, our ResNetDT model performed second best, beating the LightRF and AdaNetDT models. It is important to note that our setup for AdaNetDT only allowed a “bushy” candidate model; this did not allow AdaNetDT to build deeper layers as in the ResNetDT approach, only a wider and shallower architecture through appending additional decision trees. Despite this, the AdaNetDT implementation did outperform the DNDF implementation.
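The two summary statistics reported in the tables can be computed as in the sketch below. The paper does not spell out its exact formulas in this text, so the definitions here are assumptions: improvement is taken relative to the baseline error, and rank 1 is assigned to the lowest error on each dataset. The error values are made up for illustration.

```python
import numpy as np

def mean_relative_improvement(errors, baseline_errors):
    """Average of (baseline - model) / baseline across datasets."""
    errors, baseline_errors = np.asarray(errors), np.asarray(baseline_errors)
    return float(np.mean((baseline_errors - errors) / baseline_errors))

def mean_reciprocal_rank(error_table, model):
    """Average of 1/rank of `model`, where rank 1 is the lowest error on a dataset."""
    names = list(error_table)
    num_datasets = len(error_table[model])
    reciprocal_ranks = []
    for d in range(num_datasets):
        ordered = sorted(names, key=lambda name: error_table[name][d])
        reciprocal_ranks.append(1.0 / (ordered.index(model) + 1))
    return float(np.mean(reciprocal_ranks))

# Made-up error rates on three datasets, for illustration only.
error_table = {
    "baseline": [0.20, 0.10, 0.40],
    "model_a":  [0.10, 0.08, 0.30],
    "model_b":  [0.15, 0.05, 0.35],
}
improvement_a = mean_relative_improvement(error_table["model_a"], error_table["baseline"])
mrr_a = mean_reciprocal_rank(error_table, "model_a")
```

Mean improvement rewards large error reductions on individual datasets, while mean reciprocal rank rewards being consistently near the top, which is why the two columns can order the models differently.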
When examining relative improvement, it is important to understand how the values are distributed. Figures 6 and 7 contain boxplots of relative performance on the train and test datasets respectively. Our empirical experiments suggest that the difference between ResNetDT and ResNetDT Shared is within the variance of the results. One interpretation is that joint training lowers the variability in performance and may possibly provide more stable models. As to whether joint training should be used, we believe it should be treated as an optional parameter that is learned at training time.
In general, our ResNetDT performance appears comparable to LightGBM models, whilst providing the ability to update the tree ensemble in an online manner and producing non-greedy decision splits. As this approach can be performed in an online or mini-batch manner, it can be used to incrementally update and train over large datasets, in contrast to LightGBM models, which can only operate in an offline manner.
6.3 Weak Learning Condition Check
We present a summarised proof demonstrating ResNetCNN and ResNetDT satisfy the weak learning condition as stated in Assumption 3.1. The full proof is provided in Section A of the appendix.
For both cases, it is sufficient to demonstrate that there exists a parameterization such that the residual module . Applying this parameterization over the recursive relation , suggests there exists a parameterization of the residual module such that . As is a learnable weight and a linear model, which is a known weak learner Mannor and Meir (2002), demonstrating that hypothesis modules created through residual modules are weak learners.
ResNetCNN: We briefly demonstrate that a ResNet He et al. (2016) built with dense layers can recover the identity. We defer the convolutional layer scenario to Section A of the appendix.
We will ignore the batch normalization function in ResNet, noting that batch normalization layer with centering value of and scale of is a valid parameterization. As such the residual module can be expressed as
Where , are the appropriate weights matrices with , being the respective biases and is ReLu activation. Suppose , is chosen to be the identity matrix and is chosen to be a matrix containing a single value representing , and . Hence there exists a parameterization of ResNetCNN where as required.
Remark: ResNet built under the constraints of a linear residual module with only convolution layers and ReLU activations has been shown to have perfect finite sample expressivity, which is a much stronger condition than recovering only the identity Hardt and Ma (2017).
ResNetDT: The weak learning condition can be trivially demonstrated by routing the input deterministically to a single leaf with probability 1. Under this condition the final linear projection layer, projecting only the target leaf, results in an identity mapping. This demonstrates that a decision tree which routes to only one leaf admits the required parameterization. This can also be interpreted as a “decision stump”, which is commonly used in boosting applications.
7 Conclusions and Future Work
We have demonstrated the equivalence between ResNet and the Online Boosting algorithm, and provided a regret bound for ResNet based on the interpretation of residual modules with the linear classifier as weak learners. We have proposed the addition of shrinkage parameters to ResNet, which initial results suggest is a promising approach for refining ResNet models. We have also demonstrated a method to remove the “offline” restriction of BoostResNet, which requires maintaining a distribution of weights over all training data, by extending it to an online gradient boosting algorithm. Together these provide insight into the interpretation of ResNet as well as extensions of residual modules to novel feature representations, such as neural decision trees. These representations allow us to create new boosting variations of decision trees. We have additionally demonstrated that this approach is superior to other neural network decision tree ensemble variants and comparable with state-of-the-art offline variations, without the drawbacks of offline approaches. In addition, we have provided generalization bounds for our residual module implementations. The insights into the relation between boosting and ResNet could spur other changes to the default ResNet architecture, such as challenging the default size of the step parameter in the identity skip-connection. These insights may also change how residual modules are optimized and built, and encourage the development of new residual module architectures.
References
Online gradient boosting. In Advances in Neural Information Processing Systems, pp. 2458–2466. Cited by: §2.2, §2.2, §2.2, Corollary 2.1, §3.1, §3.1.
Comparative accuracies of artificial neural networks and discriminant analysis in predicting forest cover types from cartographic variables. Computers and Electronics in Agriculture 24 (3), pp. 131–151. Cited by: 2nd item.
AdaNet: adaptive structural learning of artificial neural networks. In Proceedings of the 34th International Conference on Machine Learning, D. Precup and Y. W. Teh (Eds.), Proceedings of Machine Learning Research, Vol. 70, International Convention Centre, Sydney, Australia, pp. 874–883. Cited by: §1.1, §1.2, §2.2, Corollary 2.2, §5.1.1, §5.1.1, §5.1.2, §6.2, §6.
Letter recognition using Holland-style adaptive classifiers. Machine Learning 6, pp. 161. Cited by: 8th item.
Additive logistic regression: a statistical view of boosting (with discussion and a rejoinder by the authors). The Annals of Statistics 28 (2), pp. 337–407. Cited by: §1.1, §2.2, §2.2, §2.2.
Size-independent sample complexity of neural networks. In Proceedings of the 31st Conference On Learning Theory, S. Bubeck, V. Perchet, and P. Rigollet (Eds.), Proceedings of Machine Learning Research, Vol. 75, pp. 297–299. Cited by: §6.2.
Result analysis of the NIPS 2003 feature selection challenge. In Advances in Neural Information Processing Systems 17, L. K. Saul, Y. Weiss, and L. Bottou (Eds.), pp. 545–552. Cited by: 5th item.
Identity matters in deep learning. In International Conference on Learning Representations, Cited by: §1.2, §1, §5, §6.3.
Deep residual learning for image recognition. 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 770–778. Cited by: §1, §1, §5.1.2, Figure 8, §6.3, §6.
Learning deep resnet blocks sequentially using boosting theory. International Conference on Machine Learning 2018 abs/1706.04964. External Links: 1706.04964 Cited by: §1.2, §1.2, §1, §2.2, §2.2, §5.1.2.
LightGBM: a highly efficient gradient boosting decision tree. In Advances in Neural Information Processing Systems 30, I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett (Eds.), pp. 3146–3154. Cited by: §1.1, §6.2, §6.
Scaling up the accuracy of naive-Bayes classifiers: a decision-tree hybrid. In Proceedings of the Second International Conference on Knowledge Discovery and Data Mining, pp. 202–207. Cited by: 1st item.
Deep neural decision forests. In Proceedings of the Twenty-Fifth International Joint Conference on Artificial Intelligence, IJCAI 2016, New York, NY, USA, 9–15 July 2016, pp. 4190–4194. Cited by: §1.1, Figure 5, §5.1.1, §6.2, §6.
ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems 25, F. Pereira, C. J. C. Burges, L. Bottou, and K. Q. Weinberger (Eds.), pp. 1097–1105. Cited by: §6.
On the existence of linear weak learners and applications to boosting. Machine Learning 48 (1–3), pp. 219–251. Cited by: 1.§, §6.3.
A knowledge base for predicting protein localization sites in eukaryotic cells. Genomics 14, pp. 897–911. Note: MEDLINE Abstract Cited by: 7th item.
Reading digits in natural images with unsupervised feature learning. In NIPS Workshop on Deep Learning and Unsupervised Feature Learning 2011, Cited by: §6.
Composite binary losses. Journal of Machine Learning Research 11 (Sep), pp. 2387–2422. Cited by: §4.
Using weighted networks to represent classification knowledge in noisy domains. In ML, pp. 121. Cited by: 6th item.
Residual networks behave like ensembles of relatively shallow networks. In Proceedings of the 30th International Conference on Neural Information Processing Systems, NIPS’16, USA, pp. 550–558. External Links: ISBN 9781510838819 Cited by: §1.2, §1, §5.
Appendix
A.1 Weak Learners
To demonstrate that both ResNetCNN and ResNetDT modules are weak learners, we first prove that the existence of the parameterization is a sufficient condition to construct a weak learner.
Lemma A.1
If there exists a parameterization , then for any , .
(By Induction Hypothesis) It is given that and , then using the definition and , we have .
Assume, for some , holds true, then
Since both the base case and the inductive step have been performed, then by mathematical induction for .
Lemma A.2
If there exists a parameterization , then the hypothesis module is a weak learner.
Using Lemma A.1, we can easily see that the hypothesis module
As is a learnable parameter, then the hypothesis module is a weak learner as linear models are weak learners Mannor and Meir (2002).
ResNetCNN
The ResNetCNN modules generally consist of blocks of convolution, batch normalization, and ReLU activation repeated several times (see the resnet50.py implementation in the Keras examples, under conv_block and identity_block).
Lemma A.3
There exists convolution layer and batch normalization weights such that
where is the weights of the convolution layer, BN is the batch normalization function, is the ReLU function, and is some constant scalar.
As before, we ignore the batch normalization function in ResNet, noting that a batch normalization layer with centering value of and scale of is a valid parameterization. We then construct the convolution layer by choosing only the identity kernel, with the bias set to the absolute value of the minimal element in , which we will call . Then
This holds because all elements in would be greater than , which negates the effect of the ReLU activation function.
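The construction used in this proof can be checked numerically. The following is a minimal sketch in plain Python on a one-dimensional signal, assuming (as in the proof) that batch normalization is ignored. The helper names conv1d_same and relu, the zero-padding convention, and the variable names x and c are illustrative choices rather than notation taken from the paper.

```python
def conv1d_same(x, kernel, bias):
    """1-D cross-correlation with zero padding (output has the same length as x),
    plus a scalar bias added to every output element."""
    half = len(kernel) // 2
    out = []
    for i in range(len(x)):
        acc = bias
        for j, w in enumerate(kernel):
            k = i + j - half
            if 0 <= k < len(x):
                acc += w * x[k]
        out.append(acc)
    return out


def relu(values):
    """Element-wise ReLU."""
    return [max(0.0, v) for v in values]


x = [-2.0, 0.5, 3.0, -0.25]

# Identity kernel: only the centre tap is non-zero, so the convolution
# passes each element through unchanged before the bias is added.
identity_kernel = [0.0, 1.0, 0.0]

# Bias chosen as the absolute value of the minimal element of the input.
c = abs(min(x))

y = relu(conv1d_same(x, identity_kernel, c))
shifted = [v + c for v in x]

# Every pre-activation value is non-negative, so ReLU leaves it untouched
# and the block returns the input shifted by the constant c.
print(y == shifted)
```

With this choice of bias the smallest pre-activation value is never negative, which is why the ReLU has no effect and the block acts as the identity up to a constant shift.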
Lemma A.4
We define the residual module as the composition of arbitrarily many , as defined below
Using this definition for some constant scalar
This can trivially be shown via induction. This holds for by Lemma A.3. Assume it holds for , i.e. . Then for
For some constant scalar . Since both the base case and the inductive step have been performed, then by mathematical induction for some constant scalar , which suggests that there exists a parameterization for convolution ResNet modules of any depth which recovers the identity.
Therefore, by Lemma A.2 and Lemma A.4, the hypothesis module for the convolutional ResNet model is a weak learner.
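The inductive argument of Lemma A.4 can be illustrated in the same spirit. The sketch below is a simplification under stated assumptions: with an identity kernel the convolution reduces to adding the bias to every element, so each block is modelled directly as an element-wise shift followed by ReLU, and batch normalization is again ignored. The names block and residual_module are hypothetical helpers introduced only for this illustration.

```python
def block(x):
    """One identity-kernel convolution + ReLU block.
    Returns the block output and the bias (constant shift) it used."""
    c = abs(min(x))  # bias chosen as in Lemma A.3
    return [max(0.0, v + c) for v in x], c


def residual_module(x, depth):
    """Compose `depth` blocks and accumulate the constant shifts."""
    total_shift = 0.0
    for _ in range(depth):
        x, c = block(x)
        total_shift += c
    return x, total_shift


x = [1.0, 2.5, 4.0]
out, total_c = residual_module(x, depth=3)

# The composition is still the input plus a single constant scalar,
# here the sum of the per-block biases.
print(out == [v + total_c for v in x])
```

Because the accumulated shift is a single scalar, subtracting it from the output recovers the input, which is the sense in which a module of any depth admits a parameterization that recovers the identity.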
A.2 Description of Data Sets
The full results are shown in Table 5. Performance on all datasets is measured by accuracy.
The datasets used come from the UCI repository and are listed as follows:

Letter Recognition Frey and Slate (1991)