EnsembleNet: End-to-End Optimization of Multi-headed Models

Hanhan Li   Joe Yue-Hei Ng   Paul Natsev
{uniqueness,yhng,natsev}@google.com
Google AI
1600 Amphitheatre Parkway
Mountain View, CA 94043
Abstract

Ensembling is a universally useful approach for boosting the performance of machine learning models. However, the individual models in an ensemble are traditionally trained independently in separate stages, without access to information about the overall ensemble. Many co-distillation approaches have been proposed to treat model ensembling as a first-class citizen. In this paper, we reveal a deeper connection between ensembling and distillation, and come up with a simpler yet more effective co-distillation architecture. On large-scale datasets including ImageNet, YouTube-8M, and Kinetics, we demonstrate a general procedure that converts a single deep neural network into a multi-headed model that is not only smaller but also more accurate. The model can be optimized end-to-end with our proposed co-distillation loss in a single stage, without human intervention.

1 Introduction

In machine learning, ensemble methods combine multiple learning algorithms to obtain better predictive performance than could be obtained from any of the constituent learning algorithms alone (Opitz & Maclin, 1999; Polikar, 2006; Rokach, 2010). Ensembling has proven useful in a variety of domains including machine perception, natural language processing, user behavior prediction, and optimal control. Many top entries in the Netflix Prize and Kaggle competitions were generated by large ensembles of models.

Traditionally, constituent models in an ensemble are trained independently in different stages and combined afterwards. The process is laborious and requires manual intervention. Moreover, the constituent models are not ensemble-aware and therefore not properly optimized. To jointly optimize the ensemble, one could naively train with a single loss on the final prediction, but the increased model size often leads to overfitting. Many recent works have studied different strategies to learn ensembles end-to-end by simultaneously optimizing multiple loss heads. In particular, Lee et al. (2015b) studies multi-headed convolutional neural network (CNN) ensembles with shared base networks. Many recent approaches (Lan et al., 2018; Song & Chai, 2018; Lin et al., 2018) also incorporate co-distillation losses.

In this paper, we extend the multi-headed approach and present the EnsembleNet architecture, which uses light-weight heads and a co-distillation loss that is simpler and more effective than those used previously. We also reveal a similarity between conventional ensembling and distillation. Moreover, whereas previous ensembling approaches scale up the model size, an EnsembleNet achieves much better performance than a single network of comparable size in both training and inference, where model size is measured in both the number of parameters and the number of FLOPs. We demonstrate this behavior extensively on a variety of large-scale vision datasets including ImageNet (Russakovsky et al., 2015), YouTube-8M (Abu-El-Haija et al., 2016), and Kinetics (Kay et al., 2017).

2 Related Work

There is extensive literature, going back decades, on ways to come up with and combine a group of strong and diverse models. Works related to ensembling can be broadly grouped into the following categories.

Ensembling Theory.  Empirically, the prediction errors from individual models tend to cancel out when we ensemble them, and more diverse architectures tend to make better ensembles. This behavior can be explained from the point of view of Bayesian Model Combination (Domingos, 2000; Minka, 2000; Carroll et al., 2011; Kim & Ghahramani, 2012). There are various theoretical models that estimate error bounds of specific ensemble formulations, given score distribution assumptions on the outputs and independence assumptions on the inputs (Kuncheva, 2014).

Ensembling Methods.  There are many works proposing specific ways to select candidate models and combine the predictions. A simple and popular ensembler just averages the predictions of the individual models. Other types of ensembling include greedy selection (Partalas et al., 2012; Li et al., 2012), Mixture-of-Experts (MoE) fusion (Lan et al., 2018), and sparsely gated MoE fusion (Shazeer et al., 2017; Wang et al., 2018). One may optionally incorporate the computation or memory cost into the optimization (e.g., the AdaNet algorithm in Cortes et al. (2017)). The ensembler may be trained either on the same data partition as the individual models, or on a separate partition.

Parameter Sharing.  Sharing a common base structure among multiple individual models may produce better ensembles, and this technique is used in Lee et al. (2015b); Lan et al. (2018). Furthermore, hierarchical sharing may give an additional performance boost (Song & Chai, 2018).

Ensemble-Aware Learning.  We would like a training strategy where individual models are aware of the ensembler during optimization. A simple approach is to add a loss on the ensembler prediction. A related approach is co-distillation (Zhang et al., 2018; Anil et al., 2018; Lan et al., 2018; Song & Chai, 2018), where constituent models are encouraged to learn from each other by regressing their predictions to the ensembler prediction. Our work provides deeper insight into these approaches.

Also related to our work are shortcut auxiliary classifiers (Szegedy et al., 2015; Lee et al., 2015a), which are used during training and discarded during inference. An EnsembleNet treats all individual models on an equal footing, so we do not have to tune their loss weights. Nevertheless, one may add shortcut auxiliary classifiers to an EnsembleNet as well.

3 Approach

Figure 1: A multi-headed network with N branches.

We use a multi-headed network (Figure 1), where the output of branch $i$ is an auxiliary prediction $P_i$. As observed by Lee et al. (2015b), properly sharing the base network not only reduces computational resources but also increases model accuracy. An ensembler takes in all auxiliary predictions and outputs the final prediction $P_E$, which is used for inference. During training, a loss is computed for each prediction head, and the final loss is the sum of all of them:

$L = L_E + \sum_{i=1}^{N} L_i$.   (1)

Let $\mathcal{L}(G, P)$ be any loss function measuring the discrepancy between a ground truth $G$ and a prediction $P$. We compare two ways of constructing $L_E$ and $L_i$.

The first one is the ensembling loss structure.

$L_E = \lambda_E \, \mathcal{L}(G, P_E), \qquad L_i = \frac{1 - \lambda_E}{N} \, \mathcal{L}(G, P_i),$   (2)

where we compute the loss for each prediction head against the same ground truth $G$ for this task, and $\lambda_E$ is a scalar hyper-parameter that needs to be tuned. The rationale behind the coefficients is as follows: suppose the simple average is used as the ensembler and the $N$ branches have the same structure as well as initialization; then the loss is independent of $N$.

The second one is the co-distillation loss structure.

$L_E = \mathcal{L}(G, P_E), \qquad L_i = \frac{\lambda_C}{N} \, \mathcal{L}(P_E, P_i),$   (3)

where we compute the loss for each auxiliary prediction head against the ensembler prediction $P_E$ instead of the ground truth, and $\lambda_C$ is a scalar hyper-parameter that needs to be tuned.

For the ensembling loss structure, if $\lambda_E = 1$, we are directly optimizing for the ensembler prediction. This naive approach usually leads to severe overfitting and bad generalization for strongly performing individual models, which are typically quite large. If $\lambda_E = 0$, it is equivalent to conventional ensembling, and the resulting ensembler is typically much better than the individual models. The overfitting problem is relieved here because an auxiliary loss is not influenced by the other branches, but one downside is that the auxiliary loss heads are not ensemble-aware. One might expect that an optimal $\lambda_E$ should lie somewhere between 0 and 1. However, for most of the strongly performing networks we experimented with on YouTube-8M and ImageNet, the optimal values for $\lambda_E$ are negative! Basically, as we decrease $\lambda_E$ from 1 all the way into moderately negative values, the performance of the resulting model decreases on the train set and increases on the holdout set, and the gap between them becomes smaller.

This observation is counter-intuitive, but we prove in Appendix A.1 that if the ensembler does the simple averaging and $\mathcal{L}$ is the squared ($L_2$) loss, Equation 2 and Equation 3 produce exactly the same loss when $\lambda_C = 1 - \lambda_E$. The co-distillation term is an intuitive regularizer that encourages the individual model predictions to agree with each other, and choosing a negative $\lambda_E$ is nothing more than applying a strong distillation ($\lambda_C > 1$). A negative ensembler loss, which seemingly regresses the final prediction away from the ground truth, in fact helps regularize the models due to the presence of the auxiliary losses. For other types of ensemblers and other loss functions, we should still expect the effects of the two loss structures to be similar. In our experiments, we always use the simple average ensembler and the cross-entropy loss function, with the gradient stopped on $P_E$ in $\mathcal{L}(P_E, P_i)$. We found that the co-distillation loss with the best $\lambda_C$ always slightly outperforms the ensembling loss with the best $\lambda_E$, and that combining the two loss structures does not provide additional gains. This also means that co-distillation is preferable to conventional ensembling.
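To make the two loss structures concrete, below is a minimal TensorFlow sketch (the function and argument names are ours, not from the paper). It assumes a simple-average ensembler over per-branch class probabilities and a cross-entropy loss, with the gradient stopped on $P_E$ in the co-distillation term, following the coefficient convention of Equations 2 and 3.

```python
import tensorflow as tf

def ensemblenet_losses(labels, branch_probs, lambda_e=None, lambda_c=None):
    """Sketch of the two loss structures for a simple-average ensembler.

    labels:       one-hot (or multi-hot) ground truth, shape [batch, classes].
    branch_probs: list of N per-branch class probabilities, each [batch, classes].
    Set exactly one of lambda_e (ensembling loss) or lambda_c (co-distillation).
    """
    eps = 1e-8
    n = len(branch_probs)
    p_e = tf.add_n(branch_probs) / n  # simple-average ensembler prediction

    def xent(target, pred):
        # Cross entropy between a target distribution and a prediction.
        return tf.reduce_mean(-tf.reduce_sum(target * tf.math.log(pred + eps), axis=-1))

    loss_e = xent(labels, p_e)
    if lambda_e is not None:
        # Ensembling loss (Eq. 2): auxiliary heads regress to the ground truth.
        aux = tf.add_n([xent(labels, p) for p in branch_probs]) / n
        return lambda_e * loss_e + (1.0 - lambda_e) * aux
    else:
        # Co-distillation loss (Eq. 3): auxiliary heads regress to the ensembler
        # prediction, with the gradient stopped on P_E.
        target = tf.stop_gradient(p_e)
        aux = tf.add_n([xent(target, p) for p in branch_probs]) / n
        return loss_e + lambda_c * aux
```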

Unlike previous end-to-end ensembling approaches (Zhang et al., 2018; Anil et al., 2018; Lan et al., 2018; Song & Chai, 2018) that mix the two loss structures by using both $\mathcal{L}(G, P_i)$ and $\mathcal{L}(P_E, P_i)$ terms for the auxiliary heads, our EnsembleNet applies only the co-distillation loss (i.e., Equation 3), which is structurally simpler and performs better.

Although we take the simple average as the ensembler in our experiments, both loss structures can be generalized to any differentiable parametric ensembler, such as an MoE model as in Lan et al. (2018), potentially with better results. In this case, we should also remove the coefficient $\lambda_E$ on $\mathcal{L}(G, P_E)$ in Equation 2 and instead apply it as a gradient multiplier to the input of the ensembler.
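One common way to realize such a gradient multiplier is the stop-gradient identity sketched below; this illustrates the general trick rather than the paper's exact implementation, and the function name is ours.

```python
import tensorflow as tf

def scale_gradient(x, multiplier):
    """Identity in the forward pass; scales the backward gradient by `multiplier`.

    Applying this to the ensembler's inputs (the auxiliary predictions) plays the
    role of the lambda_E coefficient without rescaling the gradients that update
    the ensembler's own parameters. Works for negative multipliers as well.
    """
    return multiplier * x + (1.0 - multiplier) * tf.stop_gradient(x)
```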

We invoke the following procedure to construct an EnsembleNet from a general deep neural network, such that the EnsembleNet has both a smaller size and better performance. We first take the upper half of the network (for example, a ResNet (He et al., 2016a) has 4 blocks and we take the upper 2 blocks) and shrink its width (for example, the number of channels in a CNN layer) to reduce its size by more than half; typically each branch is about 1.5 times narrower than the original. We then duplicate this shrunken upper part once to build an EnsembleNet with two heads. Due to randomness in the initialization, the two branches with the same architecture learn differently and become complementary. The performance gain is robust to the exact layer at which we fork the network, and typically we fork at some middle layer. Using different architectures for the two branches may yield even more complementary models and better results, but for simplicity we use the same architecture for both in the current experiments. We also found that using more than two branches gives at most marginally better performance, so those results are omitted from the paper.
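As an illustration of this procedure, the following Keras sketch converts a toy CNN into a two-headed model with a shared base, two narrowed branches, and a simple-average ensembler. All layer sizes and names are placeholders rather than the configurations used in our experiments.

```python
import tensorflow as tf
from tensorflow.keras import layers

def build_two_headed_model(num_classes=1000, width_ratio=1.5):
    """Toy illustration: keep the lower part of a network shared, then duplicate
    a narrowed copy of the upper part as two branches and average their heads."""
    inputs = tf.keras.Input(shape=(224, 224, 3))

    # Shared lower blocks, kept at the original width.
    x = layers.Conv2D(64, 3, strides=2, padding='same', activation='relu')(inputs)
    x = layers.Conv2D(128, 3, strides=2, padding='same', activation='relu')(x)

    def branch(x, name):
        # Upper blocks, narrowed by roughly `width_ratio` and duplicated per branch.
        w1, w2 = int(256 / width_ratio), int(512 / width_ratio)
        x = layers.Conv2D(w1, 3, strides=2, padding='same', activation='relu',
                          name=name + '_conv1')(x)
        x = layers.Conv2D(w2, 3, strides=2, padding='same', activation='relu',
                          name=name + '_conv2')(x)
        x = layers.GlobalAveragePooling2D()(x)
        return layers.Dense(num_classes, activation='softmax', name=name + '_pred')(x)

    p1, p2 = branch(x, 'branch1'), branch(x, 'branch2')
    p_e = layers.Average(name='ensembler')([p1, p2])  # simple-average ensembler
    return tf.keras.Model(inputs, [p1, p2, p_e])
```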

4 Experiments

This section presents our detailed experiments on ImageNet (Russakovsky et al., 2015), YouTube-8M (Abu-El-Haija et al., 2016), and Kinetics (Kay et al., 2017). The reported accuracy metrics are sample means over about 3 runs. The uncertainty of a sample mean over $n$ runs is estimated as the sample standard deviation divided by $\sqrt{n}$.

We report model sizes in terms of both the number of parameters and the FLOP count. The FLOP count is computed with tensorflow.profiler, which treats multiplication and addition as two separate ops. We fold batch norm operations during inference when possible.

4.1 ImageNet Classification

ImageNet (Russakovsky et al., 2015) is a large scale quality-controlled and human-annotated image dataset. We use the ILSVRC2012 classification dataset which consists of 1.2 million training images with 1000 classes. Following the standard practice, the top-1 and top-5 accuracy on the validation set are reported.

Table 1(a) shows the performance and architecture comparisons between the baseline ResNet-152 (He et al., 2016a) and its multi-headed versions. Figure 2 plots the validation set top-1 accuracies for the ensembling loss with different $\lambda_E$ and the co-distillation loss with different $\lambda_C$. ResNet-152 has 4 blocks. For the multi-headed variants, we leave the lower two blocks as they are and construct two copies of the upper two blocks as the two branches. Each branch has its bottleneck depth dimensions reduced, and the depth dimensions of the expanding convolutions are reduced proportionally. The width config in the table specifies the bottleneck dimensions for the 4 blocks.

The top-1 accuracy of our baseline ResNet-152 is 78.54%, which is higher than reported in the original implementation (He et al., 2016b). The performance of the ensembling loss with $\lambda_E = 1$ is lower than the baseline, showing that naively optimizing for the final prediction is not a good approach. The ensembling loss with $\lambda_E = 0$ already gives a sizable top-1 accuracy gain over the baseline despite the model being slightly smaller. By decreasing $\lambda_E$ from 0 to a properly tuned negative value, we get an additional gain in top-1 accuracy, indicating that a negative $\lambda_E$ does reduce overfitting; an even more negative $\lambda_E$ degrades performance. For the co-distillation loss, the performance improves as $\lambda_C$ increases from 0 up to an optimum and slightly degrades when $\lambda_C$ increases further. Note that the best co-distillation result is slightly higher than the best ensembling result.

Table 1(b) shows comparisons between the Squeeze-Excitation ResNet-152 (Hu et al., 2018) and its multi-headed versions. We use the standard reduction ratio and add batch normalization in the Squeeze-Excitation layers. The top-1 accuracy of our baseline SE-ResNet-152 is 78.85%, which is higher than reported in Hu et al. (2018). The ensembling loss with $\lambda_E = 0$ gives about a 1.9% gain in top-1 accuracy over the baseline, a tuned negative $\lambda_E$ gives another 0.6%, and the co-distillation loss with the best $\lambda_C$ adds a further 0.4%. All the observations are similar to those without Squeeze-Excitation.

Overall, the EnsembleNet with the co-distillation loss improves top-1 accuracy by about 2.0% for ResNet-152 and 2.9% for SE-ResNet-152 on ImageNet. For comparison, at similar model sizes, the improvements reported for co-distillation combined with MoE (Lan et al., 2018) and for sparsely gated Deep MoE (Wang et al., 2018) are smaller.

ResNet-152 based model | Width Config | Top-1 Acc. | Top-5 Acc. | #Params | #FLOPs
Baseline | (64, 128, 256, 512) | 78.54% | 94.05% | 60.1M | 21.8B
Ensembling loss, $\lambda_E = 1$ (equivalently co-distillation, $\lambda_C = 0$) | base: (64, 128); 2 branches: (176, 352) | 77.74% | 93.43% | 58.2M | 21.0B
Ensembling loss, $\lambda_E = 0$ | base: (64, 128); 2 branches: (176, 352) | 79.85% | 94.98% | 58.2M | 21.0B
Ensembling loss, best negative $\lambda_E$ | base: (64, 128); 2 branches: (176, 352) | 80.28% | 95.27% | 58.2M | 21.0B
Co-distillation loss, best $\lambda_C$ | base: (64, 128); 2 branches: (176, 352) | 80.58% | 95.38% | 58.2M | 21.0B
(a) ResNet-152 based models.
SE-ResNet-152 based model | Width Config | Top-1 Acc. | Top-5 Acc. | #Params | #FLOPs
Baseline | (64, 128, 256, 512) | 78.85% | 94.25% | 66.7M | 21.9B
Ensembling loss, $\lambda_E = 1$ (equivalently co-distillation, $\lambda_C = 0$) | base: (64, 128); 2 branches: (176, 352) | 78.83% | 93.75% | 64.5M | 21.1B
Ensembling loss, $\lambda_E = 0$ | base: (64, 128); 2 branches: (176, 352) | 80.76% | 95.33% | 64.5M | 21.1B
Ensembling loss, best negative $\lambda_E$ | base: (64, 128); 2 branches: (176, 352) | 81.36% | 95.67% | 64.5M | 21.1B
Co-distillation loss, best $\lambda_C$ | base: (64, 128); 2 branches: (176, 352) | 81.75% | 95.82% | 64.5M | 21.1B
(b) Squeeze-Excitation ResNet-152 based models.
Table 1: Comparison of baseline models and their multi-headed versions on the ImageNet dataset. The same input resolution is used for both training and evaluation, and performance is evaluated using a single crop. The run-to-run uncertainties of the accuracy metrics are small relative to the differences discussed in the text.
Figure 2: The validation set top-1 accuracies of multi-headed ResNet-152 with the ensembling loss for different $\lambda_E$ (top) and the co-distillation loss for different $\lambda_C$ (bottom). Detailed settings are specified in Table 1(a).

Implementation Details. The original ResNet architecture without pre-activation is used as the backbone for all experiments in this section, with two slight modifications. First, for memory efficiency, we subsample the output activations in the last residual unit of each block instead of subsampling the input activations in the first residual unit of each block. Second, the rectified linear units of the bottleneck layers are capped at a fixed maximum value.

We use color augmentation as in Howard (2013) in addition to the scale and aspect ratio augmentation of Szegedy et al. (2015). Label smoothing is set to 0.1, and a small L2 regularization is applied to all weights and batch norm parameters. All convolutional weights are initialized from the same zero-mean normal distribution. We use 128 TPU v3 cores (8x8 configuration) with a batch size of 32 per core. The models are trained with the Momentum optimizer, where the learning rate starts at 0.01 and decays by a factor of 0.2 every 60 epochs.
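A sketch of this stepwise schedule in Keras is shown below; the momentum coefficient and the steps-per-epoch bookkeeping are illustrative assumptions rather than values from the paper.

```python
import tensorflow as tf

# ImageNet train size divided by the global batch size (128 cores * 32 per core).
steps_per_epoch = 1281167 // (128 * 32)

# Stepwise decay: start at 0.01 and multiply by 0.2 every 60 epochs.
schedule = tf.keras.optimizers.schedules.ExponentialDecay(
    initial_learning_rate=0.01,
    decay_steps=60 * steps_per_epoch,
    decay_rate=0.2,
    staircase=True)

# Momentum optimizer; the momentum value 0.9 is an assumption, not from the paper.
optimizer = tf.keras.optimizers.SGD(learning_rate=schedule, momentum=0.9)
```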

4.2 The YouTube-8M Video Classification

YouTube-8M (Abu-El-Haija et al., 2016) is a large-scale labeled video dataset that consists of features from millions of YouTube videos with high-quality machine-generated annotations. We use the 2018 version for our experiments so that we can compare with the best performing models in the Kaggle competition on this dataset. This version has about 6 million videos with a diverse vocabulary of 3862 audio-visual entities. 1024-dimensional visual features and 128-dimensional audio features, extracted at 1 frame per second from bottleneck layers of pre-trained deep neural networks, are provided as input features for this dataset.

Following Abu-El-Haija et al. (2016), we measure the performance of our models in both global average precision (GAP) and mean average precision (mAP), with the number of predicted entities per video capped at 20. GAP is the area under the precision-recall curve computed over the predictions of all video-entity pairs. For mAP, we first compute the area under the precision-recall curve for each entity across all videos, and then average across all entities. Following the practice of many YouTube-8M Kaggle participants, we train our models on the union of the train set and a large portion of the validation set, and evaluate on the remaining portion of the validation set.
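The following is a minimal sketch of the GAP computation, assuming dense score and binary label matrices and using scikit-learn's average precision; it approximates, rather than reproduces, the official evaluation code.

```python
import numpy as np
from sklearn.metrics import average_precision_score

def global_average_precision(scores, labels, top_k=20):
    """Sketch of GAP: keep the top_k scored entities per video, pool all
    resulting video-entity pairs, and compute average precision over the pool.

    scores: [num_videos, num_classes] predicted scores.
    labels: [num_videos, num_classes] binary ground-truth matrix.
    """
    pooled_scores, pooled_labels = [], []
    for s, y in zip(scores, labels):
        top = np.argsort(s)[::-1][:top_k]   # indices of the top_k predictions
        pooled_scores.append(s[top])
        pooled_labels.append(y[top])
    pooled_scores = np.concatenate(pooled_scores)
    pooled_labels = np.concatenate(pooled_labels)
    return average_precision_score(pooled_labels, pooled_scores)
```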

Figure 3: The architecture of the deep bag of frame (DBoF) model for YouTube-8M.

Our baseline model is a variant of deep bag of frame (DBoF) model (Abu-El-Haija et al., 2016) that incorporates context gates (Miech et al., 2017). The architecture of the model is shown in Figure 3. We first pass the features of each frame through a fully-connected clustering layer to get the cluster representation, followed by a context-gating layer, and then a feature-wise weighted average pooling is used to extract a single compact representation. Afterward, we use a fully-connected hidden layer, followed by an MoE (Jordan & Jacobs, 1994) classification layer to compute the final class scores. The classification layer uses one MoE for each class independently, and each MoE consists of several logistic experts.

The feature-wise weighted average frame pooling is defined as

$\mathrm{SWAP}(x)_j = \frac{\sum_{i=1}^{F} x_{ij}^2}{\sum_{i=1}^{F} x_{ij}},$   (4)

where $x_{ij}$ is the $j$-th feature unit of the $i$-th frame and $F$ is the number of frames. The intuition behind this pooling is that we would like to up-weight the features with large values so they don't get washed out in pooling. We refer to this non-parametric pooling method as Self-Weighted Average Pooling (SWAP).
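A minimal sketch of SWAP as written in Equation 4 is given below, assuming non-negative frame features of shape [batch, frames, features]; the epsilon term is an implementation detail we add for numerical stability, not something taken from the paper.

```python
import tensorflow as tf

def swap_pooling(frames, epsilon=1e-6):
    """Self-Weighted Average Pooling over the frame axis.

    frames: [batch, num_frames, feature_dim] non-negative frame features.
    Each feature unit is weighted by its own magnitude, so large activations
    are not washed out by plain averaging.
    """
    weighted_sum = tf.reduce_sum(frames * frames, axis=1)  # sum_i x_ij^2
    normalizer = tf.reduce_sum(frames, axis=1) + epsilon    # sum_i x_ij
    return weighted_sum / normalizer
```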

The context gate (Miech et al., 2017), which is a multiplicative layer with a skip connection, is added in two places: one between the clustering layer and the frame pooling, and the other after the classification layer. We also use batch normalization (Ioffe & Szegedy, 2015) on the input and after every fully connected layer in the model.
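A sketch of such a context-gating layer is shown below, in the spirit of Miech et al. (2017): an elementwise sigmoid gate computed from the input and multiplied back onto it. The placement of batch normalization inside the gate is our assumption for illustration rather than a detail taken from the paper.

```python
import tensorflow as tf
from tensorflow.keras import layers

class ContextGate(layers.Layer):
    """Multiplicative gating: output = sigmoid(W x + b) * x.

    `units` must equal the feature dimension of the input so that the
    elementwise product is well defined.
    """

    def __init__(self, units):
        super().__init__()
        self.fc = layers.Dense(units)
        self.bn = layers.BatchNormalization()

    def call(self, x, training=False):
        gate = tf.sigmoid(self.bn(self.fc(x), training=training))
        return gate * x
```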

For the multi-headed version, we fork right after the frame pooling layer, and in each branch the hidden layer size and the number of mixtures are reduced compared with the baseline DBoF.

Table 2 shows the performance and architecture comparisons between the baseline DBoF and its multi-headed versions. The width config specifies the number of neurons in the cluster and hidden layers, as well as the number of logistic experts in the MoE classification layer. Our baseline DBoF has a GAP of 87.93%. The NeXtVLAD model (Lin et al., 2018), the best performing single model in the 2018 YouTube-8M Kaggle competition, has a similar GAP. Although our DBoF is not as parameter efficient as NeXtVLAD, it is structurally simpler and serves as a good baseline for ensembling architectures. Our findings are similar to those from the ImageNet experiments. The ensembling loss with $\lambda_E = 0$ gives a sizable performance gain despite the model being smaller. A properly tuned negative $\lambda_E$ produces roughly the best ensembling result, with a small additional gain in GAP at the same mAP. The co-distillation loss with the best $\lambda_C$ is slightly better in both GAP and mAP.

DBoF based model | Width Config | GAP | mAP | #Params | #FLOPs
Baseline | cluster-4096, hidden-4096, MoE-5 | 87.93% | 59.65% | 229M | 415M
Ensembling loss, $\lambda_E = 0$ | base: cluster-4096; 2 branches: (hidden-3000, MoE-3) | 88.30% | 60.06% | 226M | 410M
Ensembling loss, best negative $\lambda_E$ | base: cluster-4096; 2 branches: (hidden-3000, MoE-3) | 88.35% | 60.06% | 226M | 410M
Co-distillation loss, best $\lambda_C$ | base: cluster-4096; 2 branches: (hidden-3000, MoE-3) | 88.36% | 60.07% | 226M | 410M
Table 2: Comparison of the baseline DBoF with its multi-headed versions on the YouTube-8M dataset. The uncertainties of the accuracy metrics are very small; the run-to-run variation is tiny because the dataset uses pre-extracted high-level features instead of pixels as input.

Implementation Details. The models are trained by randomly sampling 25 frames per video and evaluated using all frames of each video. We use 32 TPU v3 cores (4x4 configuration) with a batch size of 16 per core. We use the Adam optimizer, where the learning rate starts at 0.005 and decays by a factor of 0.95 every 1,500,000 examples.

4.3 Kinetics Action Recognition

The Kinetics-400 dataset (Kay et al., 2017) contains about 240,000 short video clips with 400 action classes. Some videos are deleted over time so the train and validation sets have slightly fewer videos compared to the original version. We use the dataset snapshot captured in May 2019. Following the standard practice, the top-1 and top-5 accuracy on the validation set are reported.

We experimented with the S3D-G (Xie et al., 2018) model and its multi-headed counterparts. S3D-G consists of separable spatio-temporal convolutions and feature gating, and it gives a good speed-accuracy trade-off.

The performance and architecture comparisons are shown in Table 3. Similar to the notation in Section 4.1, we specify the model configuration with 4 numbers indicating the number of channels in the feature map right after each spatial sub-sampling in the network. Despite video removal from the dataset, our S3D-G baseline model closely matches the originally reported accuracy (74.6% for ours vs. 74.7% in Xie et al. (2018)). We use two branches as before, and the number of channels in each branch is reduced by a factor of about 1.5 to maintain a comparable number of parameters and FLOPs with the original model. The multi-headed versions give a large improvement over the baseline S3D-G model, similar to what we observed before. The ensembling loss performs roughly the best with $\lambda_E$ at zero or slightly negative, and the co-distillation loss with the best $\lambda_C$ is slightly better.

S3D-G based model | Width Config | Top-1 Acc. | Top-5 Acc. | #Params | #FLOPs
Baseline | (64, 192, 480, 832) | 74.6% | 91.4% | 9.7M | 129B
Ensembling loss, $\lambda_E = 0$ | base: (64, 192, 480); 2 branches: (554) | 75.7% | 92.2% | 9.7M | 129B
Ensembling loss, best negative $\lambda_E$ | base: (64, 192, 480); 2 branches: (554) | 75.8% | 92.1% | 9.7M | 129B
Co-distillation loss, best $\lambda_C$ | base: (64, 192, 480); 2 branches: (554) | 75.9% | 92.3% | 9.7M | 129B
Table 3: Comparison of the baseline S3D-G with its multi-headed versions on the Kinetics dataset. Our models take 64 RGB frames at a fixed spatial resolution as input. The run-to-run uncertainties of the accuracy metrics are small relative to the differences discussed in the text.

Implementation Details. Following Feichtenhofer et al. (2018), our S3D-G models use random initialization instead of ImageNet pretraining, together with a longer training schedule of 196 epochs. We use 128 TPU v3 cores (8x8 configuration) with a batch size of 16 per core. The models are trained with the Momentum optimizer, where the learning rate starts at 0.8 and gradually decays to 0 at the end of training following a half-cosine decay schedule (Loshchilov & Hutter, 2016).
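A sketch of this half-cosine schedule with Keras is shown below; the steps-per-epoch arithmetic and the momentum coefficient are illustrative assumptions rather than values from the paper.

```python
import tensorflow as tf

# Kinetics-400 has roughly 240k training clips; global batch = 128 cores * 16 per core.
steps_per_epoch = 240000 // (128 * 16)   # illustrative, not an exact figure
total_steps = 196 * steps_per_epoch

# Half-cosine decay from 0.8 down to 0 over the full training schedule.
schedule = tf.keras.optimizers.schedules.CosineDecay(
    initial_learning_rate=0.8, decay_steps=total_steps)
optimizer = tf.keras.optimizers.SGD(learning_rate=schedule, momentum=0.9)  # momentum assumed
```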

5 Conclusions

In this paper, we presented a multi-headed architecture to train model ensembles end-to-end in a single stage. Unlike many previous works (Zhang et al., 2018; Anil et al., 2018; Lan et al., 2018; Song & Chai, 2018) that mix the ensembling loss and the co-distillation loss, we only apply the co-distillation loss, so there is no need to explicitly regress individual model predictions to the ground truth. The co-distillation loss is theoretically related to the ensembling loss, and co-distillation empirically performs slightly better. Moreover, contrary to the conventional belief that ensembling gains performance at the cost of higher computational resources, we demonstrated on a variety of large-scale image and video datasets that we can scale down the size of an ensembled model to that of a single model while still maintaining a large accuracy improvement. Therefore, our approach is valuable for developing model ensembles from the perspectives of both automation and performance. Finally, EnsembleNet provides guidance on how to incorporate multiple loss heads in neural architecture search, through which we may potentially discover even better models.

References

Appendix A Appendix

A.1 The Connection Between Ensembling and Co-distillation

In this section, we prove that for the simple average ensembler and the squared ($L_2$) loss, the ensembling loss is equivalent to the co-distillation loss. As a corollary, conventional ensembling, which sets $\lambda_E = 0$, has the co-distillation effect built into it as well.

Let $P_E = \frac{1}{N}\sum_{i=1}^{N} P_i$ be the simple average of all of the predictions $P_i$. With $\mathcal{L}(a, b) = \|a - b\|^2$, the ensembling loss (Equation 2) can be written as

$\lambda_E \|G - P_E\|^2 + \frac{1-\lambda_E}{N}\sum_{i=1}^{N} \|G - P_i\|^2$
$= \lambda_E \|G - P_E\|^2 + \frac{1-\lambda_E}{N}\sum_{i=1}^{N} \left( \|G - P_E\|^2 + 2(G - P_E)\cdot(P_E - P_i) + \|P_E - P_i\|^2 \right)$
$= \|G - P_E\|^2 + \frac{1-\lambda_E}{N}\sum_{i=1}^{N} \|P_E - P_i\|^2,$

where the cross terms vanish because $\sum_{i=1}^{N}(P_E - P_i) = 0$. This is exactly the co-distillation loss (Equation 3) with $\lambda_C = 1 - \lambda_E$.
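The identity is easy to verify numerically; the following NumPy check (with arbitrary shapes and an arbitrary $\lambda_E$) confirms that the two losses coincide when $\lambda_C = 1 - \lambda_E$.

```python
import numpy as np

rng = np.random.default_rng(0)
N, d = 4, 10                      # number of branches, prediction dimension
G = rng.normal(size=d)            # ground truth
P = rng.normal(size=(N, d))       # branch predictions P_1..P_N
P_E = P.mean(axis=0)              # simple-average ensembler

lam_e = -0.5                      # any lambda_E; lambda_C = 1 - lambda_E
lam_c = 1.0 - lam_e

sq = lambda a, b: np.sum((a - b) ** 2)
ensembling = lam_e * sq(G, P_E) + (1 - lam_e) / N * sum(sq(G, p) for p in P)
codistill = sq(G, P_E) + lam_c / N * sum(sq(P_E, p) for p in P)

assert np.isclose(ensembling, codistill)   # the two losses coincide
```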
