Deep Architectures and Ensembles for Semantic Video Classification
Abstract
This work addresses the problem of accurate semantic labelling of short videos. We advance the state of the art by proposing a new residual architecture, with state-of-the-art classification performance at significantly reduced complexity. Further, we propose four new approaches to diversity-driven multi-net ensembling, one based on a fast correlation measure and three incorporating a DNN-based combiner. We show that significant performance gains can be achieved by "clever" ensembling of diverse nets and we investigate factors contributing to high diversity. Based on the extensive YouTube-8M dataset, we perform a detailed evaluation of a broad range of deep architectures, including designs based on recurrent networks (RNN), feature-space aggregation (FV, VLAD, BoW), simple statistical aggregation, mid-stage AV fusion and others, presenting for the first time an in-depth evaluation and analysis of their behaviour.
I Introduction
Accurate clip-level video classification, utilising a rich vocabulary of sophisticated terms, remains a challenging problem. One of the contributing factors is the complexity and ambiguity of the interrelations between linguistic terms and the actual audiovisual content of the video. For example, while a "travel" video can depict any location with any accompanying sound, it is the intent of the producer or even the perception of the viewer that makes it a "travel" video, as opposed to a "news" or "real estate" clip. Hence true understanding of the video's overall meaning is called for, and not mere recognition of a 'sum' of the constituent locations, objects or sounds.
Another factor is the multidimensional (space and time) and multimodal (audio and video) characteristics of the input data, which exponentially amplify the complexity of the task compared to the already challenging problems of semantic annotation of images or audio snippets. For videos, a successful approach has to identify and localise important semantic entities not only in space, but also in time; it has to understand not only spatial but also temporal interactions between semantic entities or events, and it also has to link and balance the sometimes contradictory clues originating from the audio and video tracks.
The recent Kaggle competition entitled "Google Cloud & YouTube-8M Video Understanding Challenge" provided a unique platform to benchmark existing methods and to develop new approaches to video analysis and classification. The associated YouTube-8M (v.2) dataset contains approximately 7 million individual video clips, corresponding to almost half a million hours (totalling 50 years!), annotated with a rich vocabulary of 4716 semantic labels [1]. The challenge is to develop classification algorithms which accurately assign video-level semantic labels.
Given the complexity of the task, where humans are known to use diverse clues, we hypothesise that a successful solution must efficiently combine different expert models. Here, we pose several important questions:

- What are the best architectures for this task?

- How can we construct diverse models and optimally combine them?

- Do we need to individually train and combine discrete models, or can we simply train a single very large/flexible Deep Neural Network (DNN) to obtain a fully trained end-to-end solution?
The first question clearly links to ensemble-based classifiers, where a significant body of prior work demonstrates that diversity is important. However, do we know all the different ways to promote diversity in DNN architectures? On the second question, our analysis shows that training a single network results in suboptimal solutions as compared to an ensemble.
This manuscript is based on our work on the above Kaggle competition [3]. However, this work significantly extends it and advances the field in a number of ways. Firstly, we propose a new deep residual architecture for semantic classification and demonstrate that it achieves state-of-the-art classification performance with significantly faster training and reduced complexity. Secondly, in order to advance beyond the state of the art, we propose four new approaches to ensembling of multiple classifiers. We show a very simple but effective method which is based on optimal weight approximation and determined by a fast correlation measure. Further, we also propose and investigate three (learning) approaches incorporating DNN-based ensemblers. Our extensive experiments demonstrate that significant performance gains can be achieved by optimal ensembling of diverse nets and we investigate, for the first time, factors contributing to productive diversity. Based on the extensive YouTube-8M dataset, we study and comparatively evaluate a broad range of deep architectures, including designs based on recurrent networks (RNN, LSTM), feature-space aggregation (FV, VLAD, BoW), simple statistical aggregation, mid-stage AV fusion and others. Finally, we show that our diversity-guided solution delivers a GAP of 85.12% (on the Kaggle evaluation set), which is the best result published to date. Importantly, our solution has a significantly reduced complexity, compared to the previous state of the art.
I-A Related Work
We first review existing approaches to video classification before discussing ensemble-based classifiers. Ng et al. [20] introduced two methods which aggregate frame-level features into video-level predictions: Long Short-Term Memory (LSTM) and feature pooling. Fernando et al. [7] proposed a novel rank-based pooling method that captures the latent structure of video sequence data. Karpathy et al. [13] investigated several methods for fusing information across the temporal domain and introduced multi-resolution CNNs for efficient video classification. Wu et al. [28] developed a multi-stream architecture to model short-term motion, spatial and audio information respectively. LSTMs are then used to capture long-term temporal dynamics.
DNNs are known to provide significant improvement in performance over traditional classifiers across a wide range of datasets. However, it has also been found that further significant gains can be achieved by constructing ensembles of DNNs. One example is the ImageNet Large Scale Visual Recognition Challenge (ILSVRC) [23]. Here, improvements of up to 5% were achieved over individual DNN performance (e.g. GoogLeNet [26]) by using ensembles of existing networks. Furthermore, all the top entries in this challenge employed ensembles of some form.
One of the key reasons for such a large improvement was found to be the diversity present across different base classifiers (i.e. different classifiers specialise to different data or label subsets) [10, 14]. An increase in diversity of classifiers of equal performance will usually increase the ensemble performance. There are numerous methods for achieving this, such as random initialisation of the same models, or data modification using Bagging [4] or Boosting [24] processes. Recently, work was carried out on end-to-end training of an ensemble based on diversity-aware loss functions. Chen et al. [5] proposed to use Negative Correlation Learning for promoting diversity in an ensemble of DNNs, where a penalty term based on the covariance of classifier outputs is added to the loss function. An alternative was proposed by Lee et al. [16] based on the approach of Multiple Choice Learning (MCL) [9]. Here, DNNs are trained based on a loss function that uses the final prediction chosen from the individual DNN with the lowest independent loss value.
I-B Contribution and Overview
The rest of the paper is organised as follows: In Section II, we evaluate the performance of a wide range of different DNN architectures and video features. We also propose a novel DNN architecture, inspired by ResNet [11], that achieves state-of-the-art performance for individual classifiers. We then provide a detailed analysis of their individual performances across different classes to show that they are indeed diverse, and offer strong potential for ensembling (Section IV-A). In order to advance the state of the art, we propose four different methods for DNN ensembling, leading to performance that is significantly higher than that of individual DNN classifiers (Section V). We also provide an analysis of where the improvements were obtained by an ensemble of classifiers. Finally, we draw conclusions in Section VIII.
II DNN Models
In this section, we describe the three different classes of DNN architectures that can be used for semantic labelling. The first class are fully connected NNs that use feature vectors based on the mean and standard deviation of the frames in a video (i.e. the frames of each video are aggregated using mean and standard deviation operations). This approach has the advantage of working with a representation that is smaller in size and computational complexity. The next class are the recurrent networks: LSTMs and GRUs. These have the advantage of being able to explicitly model the temporal nature of the data. Finally, we have a class of models that account for individual frames in a video via aggregation mechanisms that are agnostic to the temporal ordering. These include NetFV, NetBOW and NetVLAD.
Table I: Performance of the individual DNN models.

| Architecture | Architecture details | Params. | Av. GAP | Time (hours) | Num. epochs | Kaggle GAP |
| ROI FC | FC layers: 12K-12K-12K, DO: 0.3 | 502 | 81.84 | 60 | 40 | 81.74 |
| AV Fusion | (see Sec. II-A1) | (see Sec. II-A1) | 81.8 | 48 | 40 | 81.7 |
| Gated ResNet 8K | 3 ResNet layers: 8K-8K-8K-Gated, DO: 0.4 | - | 82.89 | 48 | 40 | 82.81 |
| Gated ResNet 10K | 2 (2-conv) ResNet layers: 10K-10K-Gated, DO: 0.3 | - | 82.8 | 48 | 40 | 82.55 |
| Gated ResNet 10K | 2 (2-conv) ResNet layers: 10K-10K-Gated, DO: 0.4 | - | 82.8 | 48 | 40 | 82.64 |
| Gated FC 10K | 3 FC layers: 10K-10K-10K-Gated, DO: 0.3 | - | 82.49 | 48 | 40 | 82.22 |
| Gated FC 10K | 3 FC layers: 10K-10K-10K-Gated, DO: 0.4 | - | 82.77 | 48 | 40 | 82.76 |
| Gated FC 12K | 3 FC layers: 12K-12K-12K-Gated, DO: 0.3 | - | 82.41 | 60 | 40 | 82.22 |
| Gated FC 12K | 3 FC layers: 12K-12K-12K-Gated, DO: 0.4 | - | 82.72 | 60 | 40 | 82.70 |
| LSTM | 2 layers, 1024 cells | 31 | 81.88 | 160 | 12 | 81.84 |
| GRU | 2 layers, 1200 cells | 34 | 82.15 | 199 | 9 | 82.07 |
| Gated BOW | 4096 clusters | 39 | 81.97 | 348 | 20 | 81.79 |
| Non-Gated BOW | 8000 clusters | 17 | 81.88 | 178 | 16 | 81.81 |
| Gated NetVLAD | 256 clusters | 308 | 82.89 | 304 | 12 | 82.83 |
| Gated NetRVLAD | 256 clusters | 307 | 82.95 | 140 | 12 | 82.85 |
| Gated NetFV | 128 clusters | 308 | 82.37 | 131 | 12 | 82.23 |
II-A Gated Fully Connected NN Architecture
For our work, we use a 3-hidden-layer fully connected neural network, with layers FC6, FC7 and FC8. The size of the input layer is dependent on the feature vectors chosen. We also employ dropout on each hidden layer. These will be described in more detail in the sections below.
The activation function for all hidden units is ReLU. We have considered numbers of hidden units of 8000, 10000 and 12000 (also referred to as 8K, 10K and 12K), with dropout rates of 0.3 and 0.4. The output layer is again a fully connected layer, with a total of 4716 output units, one for each class. In order to provide class-prediction probability values, each output unit uses a sigmoid activation function.
We have found significant improvements can be obtained when a simple fully connected output layer is replaced by a Gating Layer, as was found in [19]. An illustration of this layer can be seen in Fig. 1a. The structure is similar to a fully connected layer with two important features: 1) The number of hidden units is the same as the input dimension; 2) the input vector is multiplied elementwise with the hidden layer output, resulting in the output vector. Essentially, the hidden layer here acts as a “gate”, determining how much of a particular input dimension to let through. Crucially, this gate is calculated using all the input values. This allows this layer to learn and exploit correlations (or decorrelations) amongst different classes.
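The gating mechanism described above can be sketched in a few lines of NumPy. Note this is a minimal illustration: the sigmoid activation for the gate is an assumption (the text only specifies a gate in the multiplicative sense), and `W` and `b` stand for hypothetical learned parameters.

```python
import numpy as np

def gating_layer(x, W, b):
    """Context-gating sketch: a gate computed from the FULL input vector
    scales each input dimension element-wise.  W is (d, d), b is (d,);
    both are hypothetical learned parameters, sigmoid is assumed."""
    gate = 1.0 / (1.0 + np.exp(-(W @ x + b)))  # gate in (0, 1), uses all inputs
    return x * gate                            # element-wise multiplication
```

Because the gate is a function of the whole input vector, a strong activation on one class can suppress or boost correlated classes, which is exactly the cross-class behaviour the layer is meant to capture.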
II-A1 FC-NN Features
We employed the following types of input features:

- Video-Level Mean Features (MF):
The frame-level features were obtained by two separate Inception DNNs, one for video and another for audio. They give 1024 and 128 output values respectively, which are then concatenated into a 1152-dimensional frame-level feature. The mean feature for each video was obtained by averaging these frame-level features across the time dimension.

- Video-Level Mean Features + Standard Deviation (MF+STD):
We extract the standard deviation feature from each video. This signature is L2-normalised and concatenated with the L2-normalised mean feature to form a 2304-dimensional representation.

- Region of Interest pooling (ROI):
The ROI-pooling based descriptor, proposed by Tolias et al. [22], is a global image representation for image retrieval and classification. We compute a new video-level representation using the ROI-pooling approach, where the frame-level features are max-pooled across 10 temporal-scale overlapping regions, obtained from a rigid grid covering the frame-level features, producing a single signature per region. These region-level signatures are independently L2-normalised, PCA-transformed and whitened. The transformed vectors are then sum-aggregated and L2-normalised. The dimensionality of the final video-level representation is 1152, the same as that of the video-level mean features.
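The temporal ROI-pooling step can be sketched as follows. This is an illustrative simplification: the windowing scheme (window length and 50% overlap) is an assumption, and the PCA/whitening stage described above is omitted for brevity.

```python
import numpy as np

def l2n(v, eps=1e-12):
    """L2-normalise along the last axis."""
    return v / (np.linalg.norm(v, axis=-1, keepdims=True) + eps)

def roi_temporal_pool(frames, n_regions=10, overlap=0.5):
    """Sketch of the ROI-style video descriptor: max-pool frame features
    over overlapping temporal windows, L2-normalise each region signature,
    then sum-aggregate and L2-normalise again.  The PCA/whitening step
    from the text is omitted here; window sizing is an assumption."""
    T, d = frames.shape
    win = max(1, int(np.ceil(T / (n_regions * (1 - overlap) + overlap))))
    step = max(1, int(win * (1 - overlap)))
    regions = []
    for start in range(0, T, step):
        regions.append(frames[start:start + win].max(axis=0))  # temporal max-pool
        if len(regions) == n_regions:
            break
    pooled = l2n(np.stack(regions))    # per-region L2 normalisation
    return l2n(pooled.sum(axis=0))     # sum-aggregate + final L2 norm
```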
- Audio-Visual Fusion network (AVF):
The idea behind the AVF network is to perform audio-visual fusion in order to maximise the information extracted from each modality. This method comprises two stages: (i) first, the audio and visual networks are trained separately to minimise the classification loss, and then (ii) they are combined in a fusion network consisting of two fully-connected layers [29]. We first train the audio and video networks individually. We use three fully connected layers similar to FC6, FC7 and FC8, respectively, all of size 4096. Each FC layer is followed by a ReLU and a dropout layer. The output of the FC8 layer is passed through another fully connected layer, FC9, which computes the predictions; the network parameters are updated to minimise the cross-entropy loss over the training data.
After training the audio and video networks, we discard their FC9 layers and connect their FC8 layers to the fusion network shown in Fig. 1(f). In this way, 4096-dimensional audio and 4096-dimensional video features are concatenated to form an 8192-dimensional representation as input to the fusion network. This fusion network contains two fully connected layers of size 8192, followed by a fully connected prediction layer and cross-entropy optimisation.
II-B Temporal Models: LSTMs and GRUs
In order to explicitly model the video frames as a temporal sequence, two recurrent neural networks were used: Long Short-Term Memory (LSTM) [8] and Gated Recurrent Units (GRU) [6]. Both LSTMs and GRUs encode information seen in the past using high-dimensional vectors called memory cells. These memory cells have the same dimensionality as the LSTM and GRU output vectors.
In the LSTM model (Fig. 1c), the input and memory cell vectors are linearly transformed (via weight matrices and bias vectors), with a sigmoid function applied to yield three different vectors used for gating. These gates are applied by means of element-wise multiplication with another vector and are called the input, forget and output gates. The gates, together with the current memory state, input vector and previous output vector, are recursively used to update the memory state for each time step and also produce the current output vector. In this paper, a stack of two LSTMs was used, where the output of an initial LSTM is fed as input to a second LSTM, and the latter's output is used instead. The dimensionality of the output (and memory state) of the LSTMs is 1024.
GRUs can be thought of as a simplified version of the LSTM, where the output gate is removed. Additionally, GRUs do not contain the internal hidden memory state found in LSTMs. Instead, the information from previous frames is encoded in the output vector. As a result, the GRU architecture contains a smaller number of parameters compared to LSTMs. Despite this, the performance of GRUs is often better than that of LSTMs. As with LSTMs, a stack of two GRUs is used. The dimensionality of the output state of the GRUs is 1200.
The output values of both the LSTMs and GRUs were then passed into a gated fully connected layer that provides the 4716 class output values.
II-C Temporal-Agnostic Aggregation Models: NetVLAD, DeepFV and DeepBOW
II-C1 NetVLAD
NetVLAD [2] is a CNN architecture that is trainable in an end-to-end manner directly for computer vision tasks such as image retrieval, place recognition and action recognition. The NetVLAD network typically consists of a standard CNN (VGG [25], ResNet [12]) followed by a Vector of Locally Aggregated Descriptors (VLAD) layer that aggregates the last convolutional features into a fixed-dimensional signature, whose parameters are trainable via backpropagation.
The VLAD block encodes the positions of convolutional descriptors in each Voronoi region by computing their residuals with respect to the nearest visual words. A codebook of cluster centres is first computed offline using K-means clustering. The descriptors are soft-assigned to each cluster centre and the residual vectors are accumulated to obtain cluster-level representations. The final VLAD representation is obtained by concatenating the aggregated vectors of all clusters. The VLAD block can be implemented effectively using standard CNN blocks (convolution, softmax and sum-pooling).
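The aggregation step of the VLAD block can be sketched numerically as below. This is a NumPy approximation of the forward pass only (no trainable convolution); the softmax temperature `alpha` and the intra-normalisation scheme are assumptions.

```python
import numpy as np

def netvlad(descriptors, centres, alpha=10.0):
    """Sketch of (Net)VLAD aggregation: soft-assign each local descriptor
    to the cluster centres, accumulate residuals per cluster, then
    intra-normalise and L2-normalise the concatenation.
    descriptors: (N, D) local features; centres: (K, D) visual words."""
    # soft-assignment: softmax over negative squared distances to centres
    d2 = ((descriptors[:, None, :] - centres[None, :, :]) ** 2).sum(-1)
    a = np.exp(-alpha * d2)
    a /= a.sum(axis=1, keepdims=True)                # (N, K) assignments
    residuals = descriptors[:, None, :] - centres    # (N, K, D) residuals
    vlad = (a[:, :, None] * residuals).sum(axis=0)   # (K, D) accumulation
    vlad /= np.linalg.norm(vlad, axis=1, keepdims=True) + 1e-12  # intra-norm
    vlad = vlad.ravel()                              # concatenate clusters
    return vlad / (np.linalg.norm(vlad) + 1e-12)     # final L2 norm
```

In the trainable NetVLAD layer the soft-assignment is produced by a 1x1 convolution plus softmax rather than explicit distances, which makes the whole block differentiable.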
II-C2 Deep Fisher Vectors
Another popular method for generating global descriptors for image matching is the Fisher Vector (FV) method, which aggregates local image descriptors (e.g. SIFT [17]) based on the Fisher Kernel framework. A Gaussian Mixture Model (GMM) is used to model the distribution of local image descriptors, and the global descriptor for a video is obtained by computing and concatenating the gradients of the log-likelihoods with respect to the model parameters. One advantage of the FV approach is its encoding of higher-order statistics, resulting in a more discriminative representation and hence better performance [21]. Here, we have used a model that learns the FV parameters in an end-to-end manner.
II-C3 Deep BOW
The Deep Bag-of-Words encoding is another orderless representation constructed from frame-level descriptors by grouping similar audio and visual features into clusters (known as visual or audio-words). A video sequence is represented as a sparse histogram over the vocabulary. The model tested uses a soft-assignment strategy, which has been shown to deliver better performance in AV retrieval and classification applications.
III Novel Residual DNNs for Learning Semantic Video Content
III-A Fully Connected ResNet
Inspired by the success of ResNet [11] for image recognition, we propose a Fully Connected ResNet (FCRN) architecture to tackle the problem of video classification. More precisely, let $x$ be the video-level features (mean + standard deviation) extracted from a video. The ResNet block can be defined as:

$y = F(x, \{W_i\}) + x$    (1)

where $y$ and $\{W_i\}$ are the output and the weights of the convolutional layers respectively. The function $F$ represents the residual mapping to be learned. The ResNet block is illustrated in Fig. 1g, in which $\sigma$ denotes the ReLU and $D$ is the randomly sampled dropout mask. The operation $F + x$ is computed by a shortcut connection and element-wise addition.
In the FCRN architecture (Fig. 1h), the input (mean + standard deviation feature vector) is first fed to a fully connected layer. The output of the FC layer is then passed through a series of ResNet blocks. Finally, the resultant representation is forwarded to a gated fully connected layer that provides the 4716 class output values.
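A single fully connected residual block of Eq. 1 can be sketched as follows. The two-layer residual branch and the placement of ReLU/dropout are illustrative assumptions, not the exact FCRN configuration.

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def fc_resnet_block(x, W1, b1, W2, b2, drop_mask=None):
    """Sketch of one fully connected ResNet block (Eq. 1): a learned
    residual mapping F(x) is added back to the input via the identity
    shortcut.  Two FC layers with ReLU and an optional dropout mask
    stand in for the residual branch; the layer count is an assumption."""
    h = relu(W1 @ x + b1)
    if drop_mask is not None:
        h = h * drop_mask                 # randomly sampled dropout mask D
    f = W2 @ h + b2                       # residual mapping F(x, {W_i})
    return relu(f + x)                    # shortcut + element-wise addition
```

Because the shortcut carries the input unchanged, a block whose residual branch outputs zero simply passes the (rectified) features through, which is what makes deep stacks of such blocks easy to optimise.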
Table II: Impact of the FC layer size and the number of ResNet blocks on the GAP of the FCRN network.

| FC layer size | Number of ResNet blocks | GAP (%) |
| 4096 | 5 | 82.35 |
| 8192 | 4 | 82.89 |
| 10240 | 3 | 82.81 |
IV Individual DNN Experimental Results
The complete YouTube-8M dataset consists of approximately 7 million YouTube videos, each approximately 25 minutes in length, with at least 1000 views each. There are 4716 possible classes for each video, given in a multi-label form. For the Kaggle challenge, we were provided with 6.3 million labelled videos (i.e. each video was associated with a 4716-dimensional binary label vector). For test purposes, approximately 700K unlabelled videos were provided. The resulting class test predictions from our trained models were uploaded to the Kaggle website for evaluation.
The evaluation measure used is called 'GAP-20'. This is essentially the mean average precision of the top-20 ranked predictions across all examples. To calculate its value, the top-20 predictions (and their corresponding ground-truth labels) are extracted for each test video. The sets of top-20 predictions for all videos are concatenated into a long list of predictions. A similar process is performed for the corresponding ground-truth labels. Both lists are then sorted according to the confidence prediction values and mean average precision is calculated on the resulting list.
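The pooled computation described above can be sketched as follows. This is a simplified reimplementation, not the official evaluation code: in particular, it normalises by the number of positives recovered in the pooled top-k list, which approximates (but may not exactly match) the official normalisation.

```python
import numpy as np

def gap_at_k(preds, labels, k=20):
    """Sketch of GAP-20: pool the top-k predictions of every video into
    one global list, sort by confidence, then compute average precision
    over that list.  preds/labels: (n_videos, n_classes) arrays."""
    conf, hits = [], []
    for p, y in zip(preds, labels):
        top = np.argsort(p)[::-1][:k]   # indices of the top-k predictions
        conf.extend(p[top])
        hits.extend(y[top])
    order = np.argsort(conf)[::-1]      # global sort by confidence
    hits = np.asarray(hits, dtype=float)[order]
    if hits.sum() == 0:
        return 0.0
    precision_at_i = np.cumsum(hits) / (np.arange(len(hits)) + 1)
    return float((precision_at_i * hits).sum() / hits.sum())
```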
The performances of the individual DNNs are summarised in Table I. Here, we show both the GAP-20 scores on the validation set, where ground-truth labels are available, and the GAP-20 scores on unseen test data, obtained by uploading the test-data inferences to the Kaggle website. We see that the performances of the different DNNs fall in the range of approximately 82% to 83%.
As expected, GRUs perform better than LSTMs. The temporal-agnostic models based on VLAD and Fisher Vectors consistently achieve high Kaggle GAP-20 scores above 82%. Interestingly, the BOW models achieve lower scores, below 82%. However, using the considerably simpler mean and standard deviation features in the fully connected and ResNet models provides similar performance. We also performed experiments to find the optimum depth and size of the fully connected layer in the ResNet block. Table II demonstrates the impact of different ResNet architectures on the classification performance of the FCRN network. It can be seen that 4 residual blocks of 8K hidden units achieve the best performance of 82.89%.
However, none of the DNN performances exceed 83%. Nonetheless, we find that whilst the performances of the DNNs are roughly similar, different types of DNNs perform well (and poorly) on different sets of classes. This in turn provides significant benefits in the GAP score after ensembling these individual DNNs together. To see this, we next provide an analysis of how well each DNN did for different classes, based on a measure of how accurate each DNN is relative to the final GAP-20 score.
IV-A Class-Dependent DNN Performance Measure
In this section, we analyse the performance of individual DNNs in order to understand how they can contribute to improvements in the final ensembled system. To achieve this, we calculate separate accuracy scores for each classifier on each video label. Whilst this is not exactly the GAP-20 score, it is highly related.
The classifier accuracy is based on "oracle" outputs; that is, for a given class and example, an oracle informs us with 1 if that example's class label was output correctly by a classifier, and 0 otherwise. Given that our classifiers output probability values, their outputs need to be binarised first. We chose a threshold of 0.5, so that any classifier output greater than or equal to 0.5 is equated to 1, and 0 otherwise.
Now, let the number of classifiers be $L$. For each classifier, we obtain an oracle output matrix specific to it by comparing its thresholded output with the ground-truth labels. Specifically, suppose classifier $l$ has the binarised output matrix $Y^{(l)}$, with each of its elements denoted as $y_{c,n}^{(l)}$ for class $c$ and example $n$. Suppose, as before, the ground-truth label for class $c$ and example $n$ is denoted as $t_{c,n}$; then the oracle matrix for this classifier is denoted as $O^{(l)}$, with each element $o_{c,n}^{(l)} = 1$ if $y_{c,n}^{(l)} = t_{c,n}$, and $0$ otherwise.
IV-A1 Class Accuracy of Base DNNs
The accuracy for class $c$ of the DNN with index $l$ can be directly obtained from the oracle matrix as follows:

$A_c^{(l)} = \frac{1}{N} \sum_{n=1}^{N} o_{c,n}^{(l)}$    (2)

where $N$ is the number of examples.
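The oracle binarisation and the per-class accuracy of Eq. 2 can be sketched together in NumPy (the function names are ours):

```python
import numpy as np

def oracle_matrix(probs, truth, thresh=0.5):
    """Binarise classifier outputs at `thresh` and compare with the
    ground truth: o[c, n] = 1 if class c of example n was predicted
    correctly, else 0 (the oracle output of Section IV-A).
    probs, truth: (n_classes, n_examples) arrays."""
    return ((probs >= thresh) == truth.astype(bool)).astype(int)

def class_accuracy(oracle):
    """Per-class accuracy of Eq. 2: the mean of the oracle outputs over
    the N examples, for each class c."""
    return oracle.mean(axis=1)
```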
The performance of the classifiers for the 100 most frequent classes can be seen in Fig. 2. It can be seen from Fig. 2a that all the classifiers in the ensemble perform roughly the same. This correlates well with the overall GAP scores shown in Table I, where all the individual DNNs had fairly similar GAP-20 scores. We find that the accuracy for all classes is very high. This is due to the imbalance between the occurrence of a class (i.e. video label) and its non-occurrence. That is, the majority of the videos considered will not have a specific video label associated with them. As an example, the most frequent label, "Games", only occurs in about 10% of the videos.
IV-A2 Mean Delta-Class Accuracy
There are variations amongst individual DNN performances that can be exploited by ensembling to raise the final ensemble GAP score. To see this more clearly, we compare the values of $A_c^{(l)}$ with the class mean accuracy curve $\bar{A}_c = \frac{1}{L} \sum_{l=1}^{L} A_c^{(l)}$. The deviation of each classifier from the mean accuracy can then be obtained as $\Delta A_c^{(l)} = A_c^{(l)} - \bar{A}_c$.
The deviations between the mean class accuracy and the individual DNN performances are plotted in Fig. 2b. In this figure, we see that the largest discrepancies between DNN performances exist at the most frequent labels, decreasing as the labels become less frequent.
In order to see the overall pattern, we extract a correlation matrix from the per-class performance deviations $\Delta A_c^{(l)}$. Each element of this correlation matrix is calculated as:

$R_{i,j} = \frac{\sum_{c} \Delta A_c^{(i)} \, \Delta A_c^{(j)}}{\sqrt{\sum_{c} \big(\Delta A_c^{(i)}\big)^2} \sqrt{\sum_{c} \big(\Delta A_c^{(j)}\big)^2}}$
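This correlation matrix can be sketched directly from the per-class accuracies (`deviation_correlation` is a hypothetical helper name):

```python
import numpy as np

def deviation_correlation(acc):
    """Correlation matrix of per-class performance deviations.
    acc: (L, C) array of per-class accuracies for L classifiers.
    Subtracting the mean over classifiers gives the deviations, and
    np.corrcoef then yields the L x L correlation matrix."""
    dev = acc - acc.mean(axis=0, keepdims=True)   # per-class deviations
    return np.corrcoef(dev)
```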
The correlation matrix is shown in Fig. 3. Of particular interest are the negative correlations in the performances of pairs of individual DNNs (shown as dark blue). A negative correlation indicates that when one DNN underperforms (falls below the mean accuracy), the other DNN overperforms. Here, we find that different classes of DNNs in the ensemble tend to exhibit negative correlations. In particular, BoW-type methods are negatively correlated with VLAD approaches.
[Fig. 2: (a) per-class accuracy of the individual DNNs; (b) deviations from the mean class accuracy. Fig. 3: correlation matrix of the per-class performance deviations.]
IV-B DNN Diversity Analysis
It is well known that by combining the outputs from a set of base classifiers with similar performance, significant improvements in accuracy can be obtained. However, this improvement is governed by the diversity present in the group of the base classifiers. Roughly speaking, a set of classifiers are deemed diverse if they perform well on different examples or classes. However, there is no agreed diversity measure for a set of classifiers.
Multiple authors have proposed a multitude of potential diversity measures. Ten such measures were considered in [15]. It was observed there that correlations exist between the ensemble accuracy and the amount of diversity present in the base classifiers of an ensemble, across the diversity measures analysed. These 10 measures can be split into two categories: pairwise and non-pairwise measures. Pairwise measures only quantify the diversity of two classifiers; to obtain a single number representing ensemble diversity, the average over all pairs is usually taken. Non-pairwise measures combine the performances of all classifiers throughout the dataset into a single measure. The analysis here uses the non-pairwise measures of entropy and inter-rater agreement.
One issue with the above measures is that they assume single-label classification, whereas here each video example is associated with multiple labels. Consequently, we have chosen to extract the diversity scores per class. The aim is to show how the different diversity scores change across classes as we add more base classifiers.
The diversity measures are all based on the oracle matrices described in Section IV-A. They all use a common quantity: the number of classifiers that have recognised class $c$ in example $n$ correctly, which we denote as $l(c,n) = \sum_{l=1}^{L} o_{c,n}^{(l)}$. We can also compute the average accuracy for class $c$ across all classifiers as:

$\bar{p}_c = \frac{1}{NL} \sum_{n=1}^{N} l(c,n)$
We can then calculate the inter-rater agreement for this ensemble of classifiers as:

$\kappa_c = 1 - \frac{\frac{1}{L} \sum_{n=1}^{N} l(c,n) \big(L - l(c,n)\big)}{N (L-1) \, \bar{p}_c (1 - \bar{p}_c)}$

The value of $\kappa_c$ indicates the amount of agreement between the different classifiers in an ensemble, whilst correcting for chance. The smaller the value, the greater the diversity (since the classifiers agree less with each other).
Another measure is based on entropy, and can be calculated as follows:

$E_c = \frac{1}{N} \sum_{n=1}^{N} \frac{\min\{l(c,n),\; L - l(c,n)\}}{L - \lceil L/2 \rceil}$
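Both per-class measures can be computed directly from the oracle outputs, as sketched below (the Kuncheva-style formulations are assumed; `diversity_measures` is our name, and the sketch assumes the average accuracy is strictly between 0 and 1):

```python
import numpy as np
from math import ceil

def diversity_measures(oracle_c):
    """Non-pairwise diversity sketch for one class.
    oracle_c: (L, N) 0/1 oracle outputs of L classifiers on N examples.
    Returns (kappa, entropy); smaller kappa and larger entropy both
    indicate a more diverse ensemble.  Assumes 0 < p_bar < 1."""
    L, N = oracle_c.shape
    l = oracle_c.sum(axis=0)              # classifiers correct per example
    p_bar = l.sum() / (N * L)             # average accuracy for the class
    kappa = 1.0 - (l * (L - l)).sum() / (L * N * (L - 1) * p_bar * (1 - p_bar))
    entropy = np.minimum(l, L - l).sum() / (N * (L - ceil(L / 2)))
    return kappa, entropy
```

For two classifiers that are correct on disjoint halves of the data, the sketch returns the extreme values kappa = -1 and entropy = 1, i.e. maximal diversity.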
For this measure, a larger value indicates larger diversity amongst the ensemble. Using the above two measures, we exhaustively calculated the diversity measure for all possible combinations of the DNN classifiers in Table I, ranging from 2 classifiers to the final single ensemble of 14 DNNs. The results can be seen in Fig. 4.
We find that, in general, the lower bound of the results from the entropy measure increases as classifiers are added. We have also shown the curve of average entropy across all possible combinations of DNNs for a fixed number of classifiers. This measure has an interesting artefact where odd-numbered and even-numbered ensembles have different ranges of entropy values. This is caused by the ceiling operator in the equation for $E_c$ above. We also observe that the upper bound of the inter-rater agreement decreases; both trends indicate that the diversity of the ensemble is increasing. Interestingly, we observe that the less frequent the classes are, the less diverse the ensemble becomes.
V DNN Ensembling
We have shown in Table I that individual DNNs are not able to improve beyond 83% GAP-20 scores. Nonetheless, the analysis in Section IV-A indicated that whilst the overall scores of individual DNNs are similar, the classes on which they perform well are quite diverse. This suggests that ensembling sets of individual DNNs will provide further improvements.
A common approach to learning the ensembling coefficients is first to propose a diversity measure, such as those considered in [15] or [27]. This is followed by simultaneously training multiple classifiers to also maximise the selected diversity measure. To this end, Lee et al. [16] proposed end-to-end training of multiple classifiers simultaneously, with the diversity measure factored into the loss function. This approach is infeasible here, as the individual DNNs are complex and learn at potentially different rates. Additionally, it is still an open question whether explicit optimisation of diversity scores will always correlate with increased ensemble accuracy. To analyse this, a theoretical study of 6 different diversity measures was carried out by Tang et al. [27]. It was found that diversity measures themselves can be ambiguous predictors of the generalisation accuracy of an ensemble.
Consequently, the approach chosen here assumes that the base classifiers in the ensemble are fixed (i.e. pre-learnt). We have also chosen to perform ensembling by linearly combining the outputs of each DNN. Therefore, the ensemble learning task is to determine the classifier coefficients.
The number of linear combination coefficients commonly falls into two classes: a single coefficient per DNN, or separate coefficients for each DNN and class. Specifically, suppose the number of DNNs available is $L$, and the output of DNN $l$ for a $C$-class problem is denoted as $\mathbf{p}^{(l)} \in \mathbb{R}^{C}$. The single-coefficient-per-DNN ensembling approach is then:

$\mathbf{p} = \sum_{l=1}^{L} \alpha_l \, \mathbf{p}^{(l)}$    (3)

where the $\alpha_l$ are scalars. The second ensembling approach is:

$\mathbf{p} = \sum_{l=1}^{L} \boldsymbol{\alpha}^{(l)} \odot \mathbf{p}^{(l)}$    (4)

where $\odot$ represents element-wise multiplication and $\boldsymbol{\alpha}^{(l)} \in \mathbb{R}^{C}$ is the set of coefficients (one per class) for each DNN. In order to learn the ensembling coefficients in Eq. 3 and Eq. 4, two approaches are considered:
V-A Correlation-based Ensembling
In this section, we propose a novel non-DNN-based method for finding optimal DNN coefficients by greedy selection using Pearson's correlation of pairs of DNN outputs.
The benefits of this method are that it is very fast compared to DNN-based methods, and that it directly tries to find the combination of weights that optimises GAP, as opposed to requiring a differentiable loss function such as cross-entropy as a proxy for GAP.
The algorithm works as follows:

1. First, we compute a Pearson's correlation matrix based on the predictions of all the candidate DNN models.

2. We greedily select the pair of DNN models A, B with the lowest correlation.

3. For this pair of models, we compute the GAP score for a range of weight combinations (e.g. weight $w$ for model A and $1-w$ for model B) over a discrete set of values of $w$.

4. We then fit a quadratic curve to the GAP score as a function of the weight for this pair of models. We found experimentally that, across several evaluation metrics, the score is very well approximated by a quadratic fit, greatly cutting down the time that would be required for an exhaustive search.

5. Using this quadratic, we estimate the optimal weight for this pair of models, and combine their predictions linearly using this weight.

6. Next, we compute correlations between the current ensemble and the remaining models, and once again select the model with the lowest correlation.

7. We return to step 3, using the ensemble and the selected candidate model as models A and B.

8. The loop is repeated until all candidate models have been added to the ensemble, and the final weights are calculated.
As the algorithm adds models iteratively, it can be terminated early, retaining almost all of the final score with only a subset of the models.
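The steps above can be sketched as follows. This is an illustrative implementation under our own assumptions, not the authors' code: `score_fn` stands in for the GAP computation (any higher-is-better metric works), and `n_weights` is an arbitrary sampling density, since the exact number used in the experiments is not stated here.

```python
import numpy as np

def best_pair_weight(pa, pb, score_fn, n_weights=11):
    """Steps 3-5: sample weights w in [0, 1], score w*pa + (1-w)*pb,
    fit a quadratic to score-vs-w and return the weight at its maximum."""
    ws = np.linspace(0.0, 1.0, n_weights)
    scores = [score_fn(w * pa + (1.0 - w) * pb) for w in ws]
    c2, c1, _ = np.polyfit(ws, scores, 2)
    if c2 >= 0:                                   # degenerate fit: fall back
        return float(ws[int(np.argmax(scores))])  # to the best sampled weight
    return float(np.clip(-c1 / (2.0 * c2), 0.0, 1.0))

def greedy_correlation_ensemble(preds, score_fn):
    """Greedy loop of steps 1-8 over a list of (videos x classes) arrays."""
    flat = np.stack([p.ravel() for p in preds])
    corr = np.corrcoef(flat)                               # step 1
    np.fill_diagonal(corr, np.inf)                         # ignore self-correlation
    i, j = np.unravel_index(np.argmin(corr), corr.shape)   # step 2
    w = best_pair_weight(preds[i], preds[j], score_fn)
    ensemble = w * preds[i] + (1.0 - w) * preds[j]         # steps 3-5
    remaining = set(range(len(preds))) - {i, j}
    while remaining:                                       # steps 6-8
        k = min(remaining, key=lambda m: np.corrcoef(
            ensemble.ravel(), preds[m].ravel())[0, 1])
        w = best_pair_weight(ensemble, preds[k], score_fn)
        ensemble = w * ensemble + (1.0 - w) * preds[k]
        remaining.remove(k)
    return ensemble
```

With a metric that is exactly quadratic in the weight (e.g. negative mean squared error), each merge is guaranteed not to reduce the score, since the clipped vertex of a concave quadratic dominates both endpoints.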
V-B Mixture of Experts for Ensembling DNNs
In this section, we describe the approach of using a mixture of experts (MoE) for ensembling different DNNs. There are numerous methods commonly used for learning MoEs, and we refer the reader to [18] for a complete review. For our purposes, we have chosen to ensemble the outputs of different pre-trained DNNs using a gating network, whose weights are learnt using stochastic gradient descent.
We consider the following configurations of mixture of experts: a single coefficient per DNN; class-dependent linear combination coefficients; and a dual-stream model that combines both approaches.
V-B1 Single-Coefficient MoE
The ensembling coefficients can be learnt by posing the problem as a single-layer convolutional network (Fig. 5(a)). The input to this network is the concatenation of the outputs of the $N$ base DNNs into a $C \times N$ tensor. The ensembling is achieved by introducing a convolutional layer with a single $1 \times 1$ filter (of depth $N$); the output of this layer is a tensor of size $C \times 1$. As such, the convolution performs the linear combination, with the weights of the filter acting as the ensembling weights. These outputs can then be compared against the corresponding labels using the cross-entropy loss, and the convolutional filter weights learnt using the Adam algorithm.
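A framework-free sketch of why this works (sizes and weights here are hypothetical): convolving a single $1 \times 1$ filter of depth $N$ across the $C$ class positions is exactly a per-class linear combination of the $N$ DNN outputs.

```python
import numpy as np

# Sketch: the single-coefficient MoE as a 1x1 convolution.
N, C = 3, 4                                        # hypothetical sizes
Y = np.arange(N * C, dtype=float).reshape(C, N)    # input tensor: C x N
w = np.array([0.6, 0.3, 0.1])                      # the 1x1 filter of depth N

# Applying the filter at every class position is a matrix-vector product:
conv_out = Y @ w                                   # output: C x 1 (length-C vector)
manual = sum(w[i] * Y[:, i] for i in range(N))     # explicit linear combination
assert np.allclose(conv_out, manual)
```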
V-B2 Per-class Coefficient MoE
The modelling capacity of the ensembling method can be increased by introducing more coefficients. To this end, we can introduce a separate coefficient for each class and DNN (Fig. 5(b)). As before, this can be redefined as a DNN learning problem. In this case, convolutional filters cannot be used. Instead, we introduce a new layer whose weight tensor has the same size as the input tensor ($C \times N$). An element-wise multiplication between this tensor and the input is performed, followed by summation across the last dimension, producing an output of size $C \times 1$. The output can then be compared with the corresponding labels using the cross-entropy loss, and the weights learnt using the Adam algorithm.
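The per-class layer can be sketched as follows (sizes hypothetical); note that initialising the weight tensor uniformly recovers plain averaging of the base DNNs.

```python
import numpy as np

# Sketch of the per-class coefficient layer: the weight tensor has the
# same C x N shape as the input; an element-wise product followed by a
# sum over the DNN dimension yields one score per class.
N, C = 3, 4
Y = np.arange(N * C, dtype=float).reshape(C, N)   # input: C x N
W = np.full((C, N), 1.0 / N)                      # e.g. initialised to uniform weights
out = (W * Y).sum(axis=1)                         # output: C x 1 (length-C vector)
assert np.allclose(out, Y.mean(axis=1))           # uniform weights == plain averaging
```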
V-C Regularised Dual-Stream MoE
One shortcoming of the per-class coefficient MoE is the possibility of overfitting. On the other hand, a single coefficient per DNN, with far fewer parameters, runs little risk of overfitting, but may instead be too restrictive.
As such, we also explore a novel dual-stream model that combines both approaches. In this model, we aim to have the ensembling coefficients for a particular DNN centred around some "main" value (as in the single-coefficient model), with the ability to add a small "residual" around this mean to provide per-class flexibility.
The architecture of this model can be seen in Fig. 5(c). It consists of two initial parallel streams, each receiving the input tensor. The first stream is the convolutional layer, providing the main ensembling coefficient for each DNN. The second stream is the per-class weight layer, providing the residual away from the first stream's coefficients. The outputs of the two streams (each of size $C \times 1$) are summed to produce the ensembled output vector.
In order to restrict the residual weights, we apply L2 regularisation to the weight tensor of the second stream. This encourages the distribution of these weights to have zero mean and a small standard deviation. As in the previous two approaches, the weights of both streams are learnt by optimising the cross-entropy loss using the Adam algorithm.
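The dual-stream combination can be sketched as below, under our own illustrative values (the regularisation strength and residual magnitudes are hypothetical, as they are not stated in this section).

```python
import numpy as np

# Sketch of the dual-stream combination: a per-DNN "main" coefficient
# plus a small per-class residual, with an L2 penalty on the residuals
# that would be added to the cross-entropy loss during training.
N, C = 3, 4
Y = np.arange(N * C, dtype=float).reshape(C, N)   # input: C x N
alpha = np.array([0.5, 0.3, 0.2])                 # stream 1: one weight per DNN
R = np.zeros((C, N))                              # stream 2: per-class residuals
R[0, 0] = 0.05                                    # a small class-specific adjustment
out = ((alpha[None, :] + R) * Y).sum(axis=1)      # summed streams -> length-C output
l2_penalty = 1e-3 * np.sum(R ** 2)                # hypothetical regularisation term
```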
VI Ensembling Experimental Results
In this section, we compare the performance of DNN ensembles with coefficients learnt using the methods described in Section V. We present the performance of the different DNN ensembles separately, reporting for each approach results on our validation dataset and on the Kaggle test dataset. We found that it was not possible to learn the ensembling coefficients on the training dataset: the fully connected DNNs achieve a much higher score on the training data, which caused the ensembling algorithms to converge on those and ignore the remaining DNNs.
As a result, it was necessary to use the validation dataset for learning the ensembling coefficients. However, using the entire validation dataset for this purpose leaves no remaining data for evaluating generalisation performance. One option would be to learn the ensemble weights on the full validation set and evaluate solely by uploading the ensembled Kaggle test predictions to the website; however, this would be too costly time-wise.
In order to address this issue, we split the validation dataset into two partitions, which we denote the ensemble-train and ensemble-test splits. The ensembling coefficients are learnt on the ensemble-train partition and evaluated on the ensemble-test partition.
VI-A Learnt Ensemble Weights
Fig. 6 shows how the different ensembling methods have weighted the base DNN models. For the per-class coefficient and dual-stream ensembling DNNs, we report the mean weight for each base DNN. All the methods assign approximately the same weights to the different base DNNs and, as expected, the higher the individual performance of a DNN, the higher its weight in general. Of particular interest is the negative weight assigned by all the ensemblers to the LSTM method. The evolution of the weights for each DNN can be seen in Fig. 7. For the single-coefficient DNN ensembler, the weights converge after approximately 40 epochs (Fig. 7a). One artefact we observed is that the Adam algorithm occasionally performs large modifications to the weights, resulting in sudden jumps in the weight values assigned to each DNN; however, these changes are consistent across all DNNs. For the per-class coefficient DNN ensembler, each DNN is assigned not a single weight but 4716 separate weights, one per class. The trend of these weights can nevertheless be seen by taking their mean value (for each DNN) at every epoch, shown in Fig. 7b.
Finally, the weights assigned to the first stream of the dual-stream DNN ensembler are shown in Fig. 7c. For this method, the mean weights of the second stream for each base DNN are approximately 0, due to the L2 regularisation employed here. The weights for each DNN evolve at a slower rate than for the single-coefficient DNN; however, we notice similar spikes in the weights due to the Adam algorithm.
VI-B DNN-based Ensembling Generalisation
In order to shed light on the generalisation capabilities of the different DNN-based ensembling approaches, we analyse the results on the ensemble-train and ensemble-test splits, shown for all three ensembling methods in Fig. 8. The training and test results for both the single-coefficient and dual-stream approaches are roughly the same. The per-class coefficient approach, however, shows strong indications of overfitting: its training GAP20 score improves dramatically with increasing epochs, while its test GAP20 score decreases in an equally dramatic fashion.
This can be seen more clearly in the scatterplot of training vs. test GAP20 scores across all epochs, shown in Fig. 9. Here we see a strong correlation between increases in the training and test GAP20 for the single-coefficient ensembling DNN. In contrast, there is an inverse correlation between the train and test GAP20 scores for the per-class ensembling method. Finally, for the dual-stream method, we again observe a good correlation between train and test GAP20 increases. This suggests that the single-coefficient method stands the highest chance of delivering a significant improvement in GAP20 performance over the other methods.
Table III: Validation and Kaggle test GAP20 for each ensembling method.

Ensemble Type      | Validation GAP20 | Kaggle GAP20
-------------------|------------------|-------------
Average            | 85.03            | 84.97
Correlation        | 85.12            | 85.11
Single Coeff. DNN  | 85.13            | 85.11
Per-Class DNN      | 86.16            | 83.83
Dual-Stream DNN    | 85.12            | 85.10
VI-C Kaggle Test Results
In order to produce the predictions for the Kaggle test data, we chose to learn the ensembling coefficients using all the validation data. The resulting GAP20 scores on this validation dataset can be seen in Table III. From the analysis of Section VI-B, we know that the per-class DNN score is unreliable; however, we expect the other validation scores to closely reflect the performance gain over simple averaging of multiple DNNs on the unseen Kaggle test data. The resulting GAP20 test results from the Kaggle website confirm this, and can be seen in the third column of Table III.
The baseline for the ensembling approaches is obtained by simple averaging of the outputs of all the base DNNs; uploading these averaged predictions to the Kaggle website yields a test GAP20 score of 84.977% on the unseen test data. Both the correlation-based approach and the single-coefficient DNN ensembler achieve a significant improvement, reaching 85.11% GAP20. The overfitting of the per-class DNN is confirmed by its Kaggle test GAP20 score of 83.83%. Finally, the dual-stream method achieves a Kaggle test GAP20 of 85.10%; whilst this is a significant improvement over simple averaging, it does not improve on the single-coefficient models.
The improvements obtained on the validation dataset are verified by ensembling the unseen Kaggle test data using the learnt weights. This gives a Kaggle test GAP20 of 85.108%, against the score of 84.977% when averaging is used. We also found that the Kaggle test GAP20 of 85.108% could be achieved after removing the LSTM from the DNN ensemble. This is not surprising, as the ensemble weight given to the LSTM was minimal (Fig. 6).
[Figure: (a) Single Coeff. DNN, (b) Per-Class Coeff. DNN, (c) Dual-Stream DNN]
[Figure: (a) Training GAP20, (b) Test GAP20]
[Figure: (a) Single Coeff. DNN, (b) Per-Class Coeff. DNN, (c) Dual-Stream DNN]
VII Analysis of DNN-Ensemble Performance
In the previous section, we showed that ensembling provides a significant improvement in overall GAP20, on both the validation dataset and the Kaggle test set. In this section, we analyse the improvements brought about by ensembling. To this end, we employ the same GAP20-based class accuracy scores described in Section IV-A, which measure how accurate a particular model (base DNN or ensemble) is at recommending a given class label for videos. As in the analysis of the individual DNNs, we consider the top 100 classes.
As before, to see more clearly the performance differences between classifiers, here between the base DNNs and their ensemble, we plot their difference from the mean accuracy of the base DNNs (Fig. 10a). We find that the single-coefficient DNN ensemble is significantly more accurate than the individual DNNs for the most frequent classes; in fact, the more frequent the class, the greater the accuracy improvement.
Following this, we count the number of classes for which the ensemble achieves the maximum accuracy compared with the other classifiers. The results can be seen in Fig. 10b. As expected, the ensemble is the classifier with maximum accuracy for the majority of the top-100 classes.
VIII Conclusions
In this paper, we have evaluated the performance of a wide range of DNN architectures and video features, from video-level statistics to frame-level recurrent networks such as LSTMs. Additionally, we have proposed a novel DNN architecture based on ResNet that achieves state-of-the-art performance for individual classifiers, whilst using substantially simpler video-level statistical information. We have found that, individually, each DNN achieves roughly the same range of GAP20 accuracy, from 82% to 83%. Nonetheless, analysis of their individual performance across different classes shows that they are diverse. Additionally, two ensemble-level measures of diversity provide further indication that adding more classifiers will increase diversity. This, in turn, makes a strong case for ensembling the different classifiers.
We proposed four different methods for performing DNN ensembling, differing mainly in the number of linear combination weights used. We have found that the ensembling method with only a single coefficient assigned to each base DNN has the best generalisation capability. When more weights are used, with a different weight for each class and base DNN, overfitting becomes a serious issue; whether adding more training data would overcome this remains an open question. We have found that the single-coefficient ensemble of 13 diverse base DNNs gives the state-of-the-art GAP20 performance of 85.12% on the Kaggle website. This is in contrast with the existing highest score of 84.9%, obtained by simple averaging of 25 DNN models. Analysis of the performance of the ensemble indicates a significant improvement in labelling accuracy for the most frequent classes.
References
 [1] S. Abu-El-Haija, N. Kothari, J. Lee, A. P. Natsev, G. Toderici, B. Varadarajan, and S. Vijayanarasimhan. YouTube-8M: A large-scale video classification benchmark. In arXiv:1609.08675, 2016.
 [2] R. Arandjelovic, P. Gronat, A. Torii, T. Pajdla, and J. Sivic. NetVLAD: CNN architecture for weakly supervised place recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, PP(99):1–1, 2018.
 [3] M. Bober-Irizar, S. Husain, E. Ong, and M. Bober. Cultivating DNN diversity for large scale video labelling. CoRR, abs/1707.04272, 2017.
 [4] L. Breiman. Bagging predictors. Mach. Learn., 24(2):123–140, Aug. 1996.
 [5] H. Chen and X. Yao. Multiobjective neural network ensembles based on regularized negative correlation learning. IEEE Transactions on Knowledge and Data Engineering, 22:1738–1751, 2010.
 [6] K. Cho, B. van Merriënboer, Ç. Gülçehre, D. Bahdanau, F. Bougares, H. Schwenk, and Y. Bengio. Learning phrase representations using RNN encoder–decoder for statistical machine translation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1724–1734. Association for Computational Linguistics, Oct. 2014.
 [7] B. Fernando, E. Gavves, J. O. M., A. Ghodrati, and T. Tuytelaars. Rank pooling for action recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, 39(4):773–787, April 2017.
 [8] F. A. Gers, J. A. Schmidhuber, and F. A. Cummins. Learning to forget: Continual prediction with LSTM. Neural Comput., 12(10):2451–2471, Oct. 2000.
 [9] A. Guzmán-Rivera, D. Batra, and P. Kohli. Multiple choice learning: Learning to produce multiple structured outputs. In NIPS, pages 1808–1816, 2012.
 [10] L. K. Hansen and P. Salamon. Neural network ensembles. IEEE Trans. Pattern Anal. Mach. Intell., 12(10):993–1001, Oct. 1990.
 [11] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In CVPR, pages 770–778. IEEE Computer Society, 2016.
 [12] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2016.
 [13] A. Karpathy, G. Toderici, S. Shetty, T. Leung, R. Sukthankar, and L. Fei-Fei. Large-scale video classification with convolutional neural networks. In 2014 IEEE Conference on Computer Vision and Pattern Recognition, pages 1725–1732, June 2014.
 [14] A. Krogh and J. Vedelsby. Neural network ensembles, cross validation, and active learning. In Advances in Neural Information Processing Systems, pages 231–238. MIT Press, 1995.
 [15] L. I. Kuncheva and C. J. Whitaker. Measures of diversity in classifier ensembles and their relationship with the ensemble accuracy. Machine Learning, 51(2):181–207, May 2003.
 [16] S. Lee, S. Purushwalkam, M. Cogswell, D. J. Crandall, and D. Batra. Why M heads are better than one: Training a diverse ensemble of deep networks. CoRR, abs/1511.06314, 2015.
 [17] D. G. Lowe. Distinctive image features from scale-invariant keypoints. International Journal of Computer Vision, pages 91–110, 2004.
 [18] S. Masoudnia and R. Ebrahimpour. Mixture of experts: a literature survey. Artificial Intelligence Review, 42(2):275–293, Aug 2014.
 [19] A. Miech, I. Laptev, and J. Sivic. Learnable pooling with context gating for video classification. CoRR, abs/1706.06905, 2017.
 [20] J. Y.-H. Ng, M. Hausknecht, S. Vijayanarasimhan, O. Vinyals, R. Monga, and G. Toderici. Beyond short snippets: Deep networks for video classification. In 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 4694–4702, June 2015.
 [21] F. Perronnin and C. R. Dance. Fisher kernels on visual vocabularies for image categorization. In IEEE Conference on Computer Vision and Pattern Recognition, 2007.
 [22] F. Radenovic, G. Tolias, and O. Chum. CNN image retrieval learns from BoW: Unsupervised fine-tuning with hard examples. In ECCV, 2016.
 [23] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, A. C. Berg, and L. Fei-Fei. ImageNet Large Scale Visual Recognition Challenge. International Journal of Computer Vision (IJCV), 115(3):211–252, 2015.
 [24] R. E. Schapire and Y. Freund. Boosting: Foundations and Algorithms. The MIT Press, 2012.
 [25] K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. CoRR, abs/1409.1556, 2014.
 [26] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich. Going deeper with convolutions. In Computer Vision and Pattern Recognition (CVPR), 2015.
 [27] E. K. Tang, P. N. Suganthan, and X. Yao. An analysis of diversity measures. Machine Learning, 65(1):247–271, Oct 2006.
 [28] Z. Wu, Y.-G. Jiang, X. Wang, H. Ye, and X. Xue. Multi-stream multi-class fusion of deep networks for video classification. In Proceedings of the 2016 ACM on Multimedia Conference, MM '16, pages 791–800. ACM, 2016.
 [29] S. Zhang, S. Zhang, T. Huang, W. Gao, and Q. Tian. Learning affective features with a hybrid deep model for audio-visual emotion recognition. IEEE Transactions on Circuits and Systems for Video Technology, PP(99):1–1, 2017.