Deep Architectures and Ensembles for Semantic Video Classification

Eng-Jon Ong, Sameed Husain, Mikel Bober-Irizar, Miroslaw Bober. M. Bober-Irizar was with Visual Atoms Ltd, Guildford, Surrey, UK. E. Ong, S. Husain and M. Bober were with the University of Surrey, Guildford, Surrey, UK. E-mails: {sameed.husain, m.bober}

This work addresses the problem of accurate semantic labelling of short videos. We advance the state of the art by proposing a new residual architecture, with state-of-the-art classification performance at significantly reduced complexity. Further, we propose four new approaches to diversity-driven multi-net ensembling, one based on a fast correlation measure and three incorporating a DNN-based combiner. We show that significant performance gains can be achieved by "clever" ensembling of diverse nets and we investigate factors contributing to high diversity. Based on the extensive YouTube-8M dataset, we perform a detailed evaluation of a broad range of deep architectures, including designs based on recurrent networks (RNN), feature-space aggregation (FV, VLAD, BoW), simple statistical aggregation, mid-stage AV fusion and others, presenting for the first time an in-depth evaluation and analysis of their behaviour.

Computer Vision, Artificial Neural Networks, Machine Learning Algorithms.

I Introduction

Accurate clip-level video classification, utilising a rich vocabulary of sophisticated terms, remains a challenging problem. One of the contributing factors is the complexity and ambiguity of the interrelations between linguistic terms and the actual audio-visual content of the video. For example, while a "travel" video can depict any location with any accompanying sound, it is the intent of the producer or even the perception of the viewer that makes it a "travel" video, as opposed to a "news" or "real estate" clip. Hence true understanding of the video's overall meaning is called for, and not mere recognition of a 'sum' of the constituent locations, objects or sounds.

Another factor is the multi-dimensional (space and time) and multi-modal (audio and video) characteristics of the input data, which exponentially amplifies the complexity of the task compared to the already challenging problems of semantic annotation of images or audio snippets. For videos, a successful approach has to identify and localise important semantic entities not only in space, but also in time; it has to understand not only spatial but also temporal interactions between semantic entities or events, and it also has to link and balance the sometimes contradictory clues originating from the audio and video tracks.

The recent Kaggle competition entitled "Google Cloud & YouTube-8M Video Understanding Challenge" provided a unique platform to benchmark existing methods and to develop new approaches to video analysis and classification. The associated YouTube-8M (v.2) dataset contains approximately 7 million individual video clips, corresponding to almost half a million hours (totalling 50 years!), annotated with a rich vocabulary of 4716 semantic labels [1]. The challenge is to develop classification algorithms which accurately assign video-level semantic labels.

Given the complexity of the task, where humans are known to use diverse clues, we hypothesise that a successful solution must efficiently combine different expert models. Here, we pose several important questions:

  • What are the best architectures for this task?

  • How can we construct diverse models and optimally combine them?

  • Do we need to individually train and combine discrete models, or can we simply train a very large/flexible Deep Neural Network (DNN) to obtain a fully trained end-to-end solution?

The first question clearly links to ensemble-based classifiers, where a significant body of prior work demonstrates that diversity is important. However, do we know all the different ways to promote diversity in DNN architectures? On the last question, our analysis shows that training a single network results in sub-optimal solutions as compared to an ensemble.

This manuscript is based on our work on the above Kaggle competition [3]. However, this work contains significant new contributions and advances the field in a number of ways. Firstly, we propose a new deep residual architecture for semantic classification and demonstrate that it achieves state-of-the-art classification performance with significantly faster training and reduced complexity. Secondly, in order to advance beyond the state of the art, we propose four new approaches to ensembling of multiple classifiers. We present a very simple but effective method based on optimal weight approximation, determined by a fast correlation measure. Further, we also propose and investigate three (learning) approaches incorporating DNN-based ensemblers. Our extensive experiments demonstrate that significant performance gains can be achieved by optimal ensembling of diverse nets and we investigate, for the first time, factors contributing to productive diversity. Based on the extensive YouTube-8M dataset, we study and comparatively evaluate a broad range of deep architectures, including designs based on recurrent networks (RNN, LSTM), feature-space aggregation (FV, VLAD, BoW), simple statistical aggregation, mid-stage AV fusion and others. Finally, we show that our diversity-guided solution delivers a GAP of 85.12% (on the Kaggle evaluation set), which is the best result published to date. Importantly, our solution has a significantly reduced complexity compared to the previous state of the art.

I-A Related Work

We first review existing approaches to video classification before discussing ensemble-based classifiers. Ng et al. [20] introduced two methods which aggregate frame-level features into video-level predictions: Long short-term memory (LSTM) and feature pooling. Fernando et al. [7] proposed a novel rank-based pooling method that captures the latent structure of video sequence data. Karpathy et al. [13] investigated several methods for fusing information across temporal domain and introduced Multiresolution CNNs for efficient video classification. Wu et al. [28] developed a multi-stream architecture to model short-term motion, spatial and audio information respectively. LSTMs are then used to capture long-term temporal dynamics.

DNNs are known to provide significant improvement in performance over traditional classifiers across a wide range of datasets. However, it has also been found that further significant gains can be achieved by constructing ensembles of DNNs. One example is the ImageNet Large Scale Visual Recognition Challenge (ILSVRC) [23]. Here, improvements up to 5% were achieved over individual DNN performance (e.g. GoogLeNet[26]) by using ensembles of existing networks. Furthermore, all the top entries in this challenge employed ensembles of some form.

One of the key reasons for such a large improvement was found to be due to the diversity present across different base classifiers (i.e. different classifiers specialise to different data or label subsets)[10, 14]. An increase in diversity of classifiers of equal performance will usually increase the ensemble performance. There are numerous methods for achieving this; such as random initialisation of the same models, or data modification using Bagging [4] or Boosting [24] processes. Recently, work was carried out on end-to-end training of an ensemble based on diversity-aware loss functions. Chen et al. [5] proposed to use Negative Correlation Learning for promoting diversity in an ensemble of DNNs, where a penalty term based on the covariance of classifier outputs is added to the loss function. An alternative was proposed by Lee et al [16] based on the approach of Multiple Choice Learning (MCL) [9]. Here, DNNs are trained based on a loss function that uses the final prediction chosen from an individual DNN with the lowest independent loss value.

I-B Contribution and Overview

The rest of the paper is organised as follows: In Section II, we evaluate the performance of a wide range of different DNN architectures and video features. We also propose a novel DNN architecture, inspired by ResNet [11], that achieves state-of-the-art performance for individual classifiers. We then provide detailed analysis of their individual performances across different classes to show that they are indeed diverse, and offer strong potential for ensembling (Section IV-A). In order to advance the state of the art, we propose four different methods for DNN ensembling, leading to performance that is significantly higher than individual DNN classifiers (Section V). We also provide an analysis of where improvements were obtained by an ensemble of classifiers. Finally, we draw conclusions in Section VIII.

II DNN Models

In this section, we describe the three different classes of DNN architectures that can be used for semantic labelling. The first class consists of fully connected NNs that use feature vectors based on the mean and standard deviation of the frames in a video (i.e. the frames of each video are aggregated using mean and standard deviation operations). This approach has the advantage of working with a representation that is smaller and computationally cheaper. The second class comprises the recurrent networks: LSTMs and GRUs. These have the advantage of being able to explicitly model the temporal nature of the data. Finally, we have a class of models that account for individual frames in a video via aggregation mechanisms that are agnostic to the temporal ordering. These include NetFV, NetBOW and NetVLAD.

Fig. 1: Overview of the DNN architectures evaluated, including the gating layer (a), the LSTM model (c), the AV fusion network (f), the ResNet block (g) and the FCRN architecture (h).
Architecture | Architecture details | Params. [M] | Av. GAP | Time (Hours) | Num. Epochs | Kaggle GAP
ROI FC | FC Layers: 12K-12K-12K, DO: 0.3 | 502 | 81.84 | 60 | 40 | 81.74
AV Fusion | (see Sec. II-A1) | - | 81.8 | 48 | 40 | 81.7
Gated ResNet 8K | 3 ResNet Layers: 8K-8K-8K-Gated, DO: 0.4 | - | 82.89 | 48 | 40 | 82.81
Gated ResNet 10K | 2 (2-conv) ResNet Layers: 10K-10K-Gated, DO: 0.3 | - | 82.8 | 48 | 40 | 82.55
Gated ResNet 10K | 2 (2-conv) ResNet Layers: 10K-10K-Gated, DO: 0.4 | - | 82.8 | 48 | 40 | 82.64
Gated FC 10K | 3 FC Layers: 10K-10K-10K-Gated, DO: 0.3 | - | 82.49 | 48 | 40 | 82.22
Gated FC 10K | 3 FC Layers: 10K-10K-10K-Gated, DO: 0.4 | - | 82.77 | 48 | 40 | 82.76
Gated FC 12K | 3 FC Layers: 12K-12K-12K-Gated, DO: 0.3 | - | 82.41 | 60 | 40 | 82.22
Gated FC 12K | 3 FC Layers: 12K-12K-12K-Gated, DO: 0.4 | - | 82.72 | 60 | 40 | 82.70
LSTM | 2 layers, 1024 cells | 31 | 81.88 | 160 | 12 | 81.84
GRU | 2 layers, 1200 cells | 34 | 82.15 | 199 | 9 | 82.07
Gated BOW | 4096 clusters | 39 | 81.97 | 348 | 20 | 81.79
Non-Gated BOW | 8000 clusters | 17 | 81.88 | 178 | 16 | 81.81
Gated NetVLAD | 256 clusters | 308 | 82.89 | 304 | 12 | 82.83
Gated NetRVLAD | 256 clusters | 307 | 82.95 | 140 | 12 | 82.85
Gated NetFV | 128 clusters | 308 | 82.37 | 131 | 12 | 82.23
TABLE I: Summary of different architectures with their complexity, training time and the achieved GAP performance

II-A Gated Fully Connected NN Architecture

For our work, we use a 3-hidden layer fully connected neural network, with layers FC6, FC7 and FC8. The size of the input layer is dependent on the feature vectors chosen. We also employ dropout on each hidden layer. These will be described in more detail in the sections below.

The activation function for all hidden units is ReLU. We have considered numbers of hidden units of 8000, 10000 and 12000 (also referred to as 8K, 10K and 12K), with dropout rates of 0.3 and 0.4. The output layer is again a fully connected layer, with a total of 4716 output units, one for each class. In order to provide class-prediction probability values, each output unit uses a sigmoid activation function.

We have found significant improvements can be obtained when a simple fully connected output layer is replaced by a Gating Layer, as was found in [19]. An illustration of this layer can be seen in Fig. 1a. The structure is similar to a fully connected layer with two important features: 1) The number of hidden units is the same as the input dimension; 2) the input vector is multiplied element-wise with the hidden layer output, resulting in the output vector. Essentially, the hidden layer here acts as a “gate”, determining how much of a particular input dimension to let through. Crucially, this gate is calculated using all the input values. This allows this layer to learn and exploit correlations (or de-correlations) amongst different classes.
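The gating operation described above admits a compact sketch. The following is a minimal numpy illustration (not the paper's implementation; `gating_layer` and the weight names are illustrative), showing the defining property that the gate is computed from the full input vector and applied element-wise:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gating_layer(x, W, b):
    """Context gating: the gate is computed from the whole input vector,
    then multiplied element-wise with it, letting the layer exploit
    correlations between class activations."""
    gate = sigmoid(W @ x + b)   # same dimensionality as the input
    return x * gate             # element-wise multiplication

# Toy example: with zero weights and bias the gate is 0.5 everywhere,
# so the layer simply halves every input dimension.
x = np.array([1.0, -2.0, 0.5, 3.0])
W = np.zeros((4, 4))
b = np.zeros(4)
y = gating_layer(x, W, b)
```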

II-A1 FC-NN Features

We employed the following types of input features:

  • Video-Level Mean Features (MF)
    The frame-level features were obtained by two separate Inception DNNs, one for video and another for audio. They gave 1024 and 128 output values respectively, which are then concatenated into a 1152-dimensional frame level feature. The mean feature for each video was obtained by averaging these frame-level features across the time dimension.

  • Video-Level Mean Features + Standard Deviation (MF+STD)
    We extract the standard deviation feature from each video. The signature is L2-normalised and concatenated with the L2-normalised mean feature to form a 2304-Dim representation.

  • Region of Interest pooling (ROI)
    The ROI-pooling based descriptor, proposed by Tolias et al. [22], is a global image representation for image retrieval and classification. We compute a new video-level representation using the ROI-pooling approach, where the frame-level features are max-pooled across 10 temporal-scale overlapping regions, obtained from a rigid grid covering the frame-level features, producing a single signature per region. These region-level signatures are independently L2-normalised, PCA transformed and whitened. The transformed vectors are then sum-aggregated and L2-normalised. The dimensionality of the final video-level representation is 1152, the same as that of the video-level mean features.

  • Audio-Visual Fusion network (AVF)
    The idea behind the AVF network is to perform audio-visual fusion in order to maximise the information extracted from each mode. This method comprises two stages: (i) first, the audio and visual feature networks are trained separately to minimise the classification loss, and then (ii) they are combined in a fusion network consisting of two fully-connected layers [29].

    We first train the audio and video networks individually. We use three fully connected layers similar to FC6, FC7 and FC8, respectively, all of size 4096. Each FC layer is followed by a ReLU and a dropout layer. The output of the FC8 layer is passed through another fully connected layer FC9, which computes the predictions; the network parameters are updated to minimise the cross-entropy loss over the training data.

    After training the audio and video networks, we discard their FC9 layers and connect their FC8 layers to the fusion network shown in Fig. 1(f). In this way, 4096-Dim audio and 4096-Dim video features are concatenated to form a 8192-Dim representation as an input to the fusion network. This fusion network contains two fully connected layers of size 8192, followed by a fully connected prediction layer and cross-entropy optimisation.
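As a concrete illustration of the simplest of the features above, the MF+STD aggregation can be sketched in a few lines. This is a minimal numpy sketch (`mf_std_feature` and `l2_normalise` are illustrative names, not the paper's code):

```python
import numpy as np

def l2_normalise(v, eps=1e-12):
    return v / (np.linalg.norm(v) + eps)

def mf_std_feature(frames):
    """frames: (T, 1152) array of concatenated audio-visual frame features.
    Returns the 2304-Dim MF+STD video-level representation: the L2-normalised
    mean and standard deviation over time, concatenated."""
    mean = l2_normalise(frames.mean(axis=0))
    std = l2_normalise(frames.std(axis=0))
    return np.concatenate([mean, std])

# A video with 30 frames of 1152-Dim features.
frames = np.random.RandomState(0).randn(30, 1152)
feat = mf_std_feature(frames)
```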

II-B Temporal Models: LSTMs and GRUs

In order to explicitly model the video frames as a temporal sequence, two recurrent neural networks were used: Long Short-Term Memory (LSTM) [8] and Gated Recurrent Units (GRU) [6]. Both LSTMs and GRUs encode information seen in the past using high-dimensional vectors called memory cells. These memory cells have the same dimensionality as the LSTM and GRU output vectors.

In the LSTM model (Fig. 1c), the input and memory cell vectors are linearly transformed (via weight matrices and bias vectors), with a sigmoid function applied to yield three different vectors used for gating. These gates are applied by means of element-wise multiplication with another vector and are called the input, forget and output gates. The gates, together with the current memory state, input vector and previous output vector, are recursively used to update the memory state for each time step and also produce the current output vector. In this paper, a stack of two LSTMs was used, where the output of an initial LSTM is fed as input to a second LSTM, whose output is used instead. The output (and memory state) dimensionality of the LSTMs is 1024.

GRUs can be thought of as a simplified version of the LSTM, where the output gate is removed. Additionally, GRUs do not contain the internal hidden memory state found in LSTMs. Instead, the information from previous frames is encoded in the output vector. As a result, the GRU architecture contains a smaller number of parameters than the LSTM. Despite this, the performance of GRUs is often better than that of LSTMs. As with LSTMs, a stack of two GRUs is used. The output state dimensionality of the GRUs is 1200.

The outputs of both the LSTMs and GRUs were then passed into a gated fully connected layer that provides the 4716 class output values.
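The GRU update described above (reset and update gates, no separate memory cell) can be sketched as a single recurrence. This is a simplified single-cell numpy sketch with illustrative weight names; in practice two such layers are stacked and trained by back-propagation:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gru_step(x, h, params):
    """One GRU update: the previous output h doubles as the memory."""
    Wz, Uz, Wr, Ur, Wh, Uh = params
    z = sigmoid(Wz @ x + Uz @ h)               # update gate
    r = sigmoid(Wr @ x + Ur @ h)               # reset gate
    h_tilde = np.tanh(Wh @ x + Uh @ (r * h))   # candidate state
    return (1.0 - z) * h + z * h_tilde         # blend old and new state

rng = np.random.RandomState(1)
d_in, d_h = 8, 5
params = (rng.randn(d_h, d_in) * 0.1, rng.randn(d_h, d_h) * 0.1,
          rng.randn(d_h, d_in) * 0.1, rng.randn(d_h, d_h) * 0.1,
          rng.randn(d_h, d_in) * 0.1, rng.randn(d_h, d_h) * 0.1)

# Unroll over a short sequence; the final h summarises the whole sequence.
h = np.zeros(d_h)
for x in rng.randn(10, d_in):
    h = gru_step(x, h, params)
```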

II-C Temporal-Agnostic Aggregation Models: NetVLAD, DeepFV and DeepBOW


II-C1 NetVLAD

NetVLAD [2] is a CNN architecture that is trainable in an end-to-end manner directly for computer vision tasks such as image retrieval, place recognition and action recognition. The NetVLAD network typically consists of a standard CNN (VGG [25], ResNet [12]) followed by a Vector of Locally Aggregated Descriptors (VLAD) layer that aggregates the last convolutional features into a fixed-dimensional signature; its parameters are trainable via back-propagation.

The VLAD block encodes the positions of convolutional descriptors in each Voronoi region by computing their residuals with respect to the nearest visual words. A code-book of cluster centres is first computed offline using K-means clustering. The descriptors are soft-assigned to each cluster centre and the residual vectors are accumulated to obtain cluster-level representations. The final VLAD representation is obtained by concatenating the aggregated vectors of all clusters. The VLAD block can be implemented effectively using standard CNN blocks (convolution, softmax and sum-pooling).
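The soft-assignment and residual accumulation described above can be sketched as follows. This is a minimal numpy sketch (`netvlad` and the assignment sharpness `alpha` are illustrative assumptions); in the trainable version the centres and assignment parameters are learnt by back-propagation:

```python
import numpy as np

def softmax(z, axis=-1):
    e = np.exp(z - z.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def netvlad(descriptors, centres, alpha=10.0):
    """descriptors: (T, D) frame-level features; centres: (K, D) visual words.
    Soft-assigns each descriptor to every centre, accumulates residuals,
    and returns the flattened, L2-normalised (K*D,) signature."""
    logits = alpha * descriptors @ centres.T                    # (T, K)
    a = softmax(logits, axis=1)                                 # soft assignment
    residuals = descriptors[:, None, :] - centres[None, :, :]   # (T, K, D)
    vlad = (a[:, :, None] * residuals).sum(axis=0)              # (K, D)
    vlad /= np.linalg.norm(vlad, axis=1, keepdims=True) + 1e-12 # intra-norm
    v = vlad.ravel()
    return v / (np.linalg.norm(v) + 1e-12)                      # final L2 norm

rng = np.random.RandomState(2)
v = netvlad(rng.randn(50, 16), rng.randn(8, 16))   # 8 clusters, 16-D features
```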

II-C2 Deep Fisher Vectors

Another popular method for generating global descriptors for image matching is the Fisher Vector (FV) method, which aggregates local image descriptors (e.g. SIFT [17]) based on the Fisher Kernel framework. A Gaussian Mixture Model (GMM) is used to model the distribution of local image descriptors, and the global descriptor for a video is obtained by computing and concatenating the gradients of the log-likelihoods with respect to the model parameters. One advantage of the FV approach is its encoding of higher-order statistics, resulting in a more discriminative representation and hence better performance [21]. Here, we use a network that learns the FV representation in an end-to-end manner.

II-C3 Deep BOW

The Deep Bag-of-Words encoding is another order-less representation constructed from frame-level descriptors by grouping similar audio and visual features into clusters (known as visual or audio words). A video sequence is represented as a sparse histogram over the vocabulary. The model tested uses a soft-assignment strategy, which has been shown to deliver better performance in AV retrieval and classification applications.

III Novel Residual-DNNs for Learning Semantic Video Content

III-A Fully Connected ResNet

Inspired by the success of ResNet [11] for image recognition, we propose a Fully Connected ResNet (FCRN) architecture to tackle the problem of video classification. More precisely, let $x$ be the video-level features (Mean + Standard deviation) extracted from a video. The ResNet block can be defined as:

$y = \sigma(x + M \odot F(x, W))$

where $y$ and $W$ are the output and the weights of the fully connected layers respectively. The function $F$ represents the residual mapping to be learned. The ResNet block is illustrated in Figure 1g, in which $\sigma$ denotes the ReLU and $M$ is the randomly sampled dropout mask. The operation $x + M \odot F(x, W)$ is computed by a shortcut connection and element-wise addition.
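A residual block of this kind can be sketched in a few lines. The following is a toy numpy sketch of a two-layer residual branch with a shortcut connection (weight and mask names are illustrative, and the exact placement of ReLU/dropout is an assumption):

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def fc_resnet_block(x, W1, W2, drop_mask):
    """y = sigma(x + M * F(x, W)): a two-layer residual branch with ReLU
    and a dropout mask, added back onto the input via the identity shortcut."""
    f = drop_mask * relu(W2 @ relu(W1 @ x))   # residual mapping F(x, W)
    return relu(x + f)                        # shortcut + element-wise addition

x = np.array([1.0, -0.5, 2.0])
W_zero = np.zeros((3, 3))
keep_all = np.ones(3)
# With zero weights the residual branch vanishes and the block reduces
# to a ReLU of the identity shortcut.
y = fc_resnet_block(x, W_zero, W_zero, keep_all)
```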

In the FCRN architecture (Fig. 1h), the input (Mean + Standard deviation feature vector) is first fed to a fully connected layer. The output of the FC layer is then passed through a series of ResNet blocks. Finally, the resultant representation is forwarded into a gated fully connected layer that provides the 4716 class output values.

FC layer size | ResNet block size | Number of ResNet blocks | GAP (%)
4096 | 4096 | 5 | 82.35
8192 | 8192 | 4 | 82.89
10240 | 10240 | 3 | 82.81
TABLE II: Impact of different ResNet architectures on the classification performance of FCRN network

IV Individual DNN Experimental Results

The complete YouTube-8M dataset consists of approximately 7 million YouTube videos, each approximately 2-5 minutes in length, with at least 1000 views each. There are 4716 possible classes for each video, given in a multi-label form. For the Kaggle challenge, we were provided with 6.3 million labelled videos (i.e. each video was associated with a 4716-dimensional binary label vector). For test purposes, approximately 700K unlabelled videos were provided. The resulting class test predictions from our trained models were uploaded to the Kaggle website for evaluation.

The evaluation measure used is called 'GAP20'. This is essentially the mean average precision of the top-20 ranked predictions across all examples. To calculate its value, the top-20 predictions (and their corresponding ground-truth labels) are extracted for each test video. The sets of top-20 predictions for all videos are concatenated into a long list of predictions, and a similar process is performed for the corresponding ground-truth labels. Both lists are then sorted according to the prediction confidence values, and the mean average precision is calculated on the resulting ranked list.
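The GAP20 procedure described above can be sketched directly. This is a simplified illustration (`gap_at_k` is an illustrative name; details of the official metric such as any cap on the number of positives are omitted):

```python
def gap_at_k(video_scores, video_labels, k=20):
    """video_scores[v]: dict label -> confidence for video v;
    video_labels[v]: set of ground-truth labels for video v.
    Pools every video's top-k predictions, sorts them globally by
    confidence, then computes average precision over the ranked list."""
    pooled = []
    for scores, labels in zip(video_scores, video_labels):
        topk = sorted(scores.items(), key=lambda kv: -kv[1])[:k]
        pooled.extend((conf, 1 if lab in labels else 0) for lab, conf in topk)
    pooled.sort(key=lambda p: -p[0])           # global sort by confidence
    total_pos = sum(rel for _, rel in pooled)
    hits, ap = 0, 0.0
    for i, (_, rel) in enumerate(pooled, start=1):
        hits += rel
        ap += rel * hits / i                   # precision at rank i
    return ap / max(total_pos, 1)

# Perfect ranking: every positive outranks every negative, so GAP = 1.
scores = [{"cat": 0.9, "dog": 0.2}, {"car": 0.8, "cat": 0.1}]
labels = [{"cat"}, {"car"}]
```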

The performances of the individual DNNs are summarised in Table I. Here, we show both the GAP20 scores on the validation set, where ground-truth labels are available, and the GAP20 scores on unseen test data, obtained by uploading the test-data inferences to the Kaggle website. We see that the performances of the different DNNs fall in the range of 82% to 83%.

As expected, GRUs perform better than LSTMs. The temporal-agnostic models based on VLAD and Fisher Vectors consistently achieve high Kaggle GAP20 scores, above 82%. Interestingly, the BOW models achieve lower scores, below 82%. However, using the considerably simpler mean and standard deviation features in the fully connected and ResNet models provides similar performance. We also performed experiments to find the optimum depth and size of the fully connected layer in the ResNet block. Table II demonstrates the impact of different ResNet architectures on the classification performance of the FCRN network. It can be seen that 4 residual blocks of 8K hidden units achieve the best performance of 82.89%.

However, none of the individual DNN performances exceed 83%. Nonetheless, we find that whilst the performances of the DNNs are roughly similar, different types of DNN perform well (and conversely) on different sets of classes. This in turn provides significant benefits in the GAP score after ensembling these individual DNNs together. To see this, we next provide an analysis of how well each DNN did on different classes. To achieve this, we propose a measure of how accurate each DNN is relative to the final GAP20 score.

IV-A Class Dependent DNN Performance Measure

In this section, we analyse the performance of individual DNNs in order to understand how they can contribute to improvements in the final ensembled system. To achieve this, we calculate separate accuracy scores for each classifier on each video label. Whilst this is not exactly the GAP20 score, it is highly related.

The classifier accuracy is based on "oracle" outputs; that is, for a given class and example, an oracle informs us with 1 if that example's class label was output correctly by a classifier, and 0 otherwise. Given that our classifiers output probability values, their outputs first need to be binarised. We chose a threshold of 0.5, so that any classifier output greater than or equal to the threshold is set to 1, and to 0 otherwise.

Now, let the number of classifiers be $L$. For each classifier, we obtain an oracle output matrix by comparing its thresholded outputs with the ground-truth labels. Specifically, suppose classifier $i$ has the binarised output matrix $Y^{(i)}$, with elements denoted $y^{(i)}_{c,n}$ for class $c$ and example $n$. Letting $g_{c,n}$ denote the corresponding ground-truth label, the oracle matrix for this classifier is $O^{(i)}$, with elements $o^{(i)}_{c,n} = 1$ if $y^{(i)}_{c,n} = g_{c,n}$ and $0$ otherwise.
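The oracle-matrix construction can be sketched in two lines of numpy (a minimal sketch; `oracle_matrix` is an illustrative name):

```python
import numpy as np

def oracle_matrix(probs, truth, threshold=0.5):
    """probs: (C, N) classifier probabilities for C classes and N examples;
    truth: (C, N) binary ground-truth labels. Binarise the probabilities at
    the threshold, then mark 1 wherever the prediction matches the truth."""
    binarised = (probs >= threshold).astype(int)
    return (binarised == truth).astype(int)

# Two classes, two examples: the classifier is wrong only on class 1 of
# example 0 (probability 0.4 binarises to 0 while the true label is 1).
probs = np.array([[0.9, 0.2],
                  [0.4, 0.7]])
truth = np.array([[1, 0],
                  [1, 1]])
O = oracle_matrix(probs, truth)
```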

IV-A1 Class Accuracy of Base DNNs

The accuracy for class $c$ of the DNN with index $i$ can be directly obtained from the oracle matrix as follows:

$a^{(i)}_c = \frac{1}{N} \sum_{n=1}^{N} o^{(i)}_{c,n}$

where $N$ is the number of test examples.
The performance of the classifiers for the 100 most frequent classes can be seen in Fig. 2. It can be seen from Fig. 2a that all the classifiers in the ensemble perform roughly the same. This correlates well with the overall GAP scores shown in Table I, where all the individual DNNs had fairly similar GAP20 scores. We find that the accuracy for all classes is very high. This is due to the imbalance between the occurrence of a class (i.e. video label) and its non-occurrence. That is, the majority of the videos considered will not have a specific video label associated with them. As an example, the most frequent label, "Games", only occurs in about 10% of the videos.

IV-A2 Mean Delta-Class Accuracy

There are variations amongst individual DNN performances that can be exploited by ensembling to raise the final ensemble GAP score. To see this more clearly, we compare the values of $a^{(i)}_c$ with the class mean accuracy $\bar{a}_c = \frac{1}{L}\sum_{i=1}^{L} a^{(i)}_c$. The deviation of each classifier from the mean accuracy can then be obtained: $\Delta a^{(i)}_c = a^{(i)}_c - \bar{a}_c$.

The deviation between the mean accuracy and individual DNN performances is plotted in Fig. 2b. In this figure, we see that the largest discrepancies between DNN performances occur for the most frequent labels, and decrease with label frequency.

In order to see the overall pattern, we extract a correlation matrix from the per-class performance deviations $\Delta a^{(i)}_c$. Each element of this correlation matrix is calculated as:

$R_{i,j} = \frac{\sum_{c} \Delta a^{(i)}_c \, \Delta a^{(j)}_c}{\sqrt{\sum_{c} \left(\Delta a^{(i)}_c\right)^2} \sqrt{\sum_{c} \left(\Delta a^{(j)}_c\right)^2}}$
The correlation matrix is shown in Fig. 3. Of particular interest are the negative correlations in the performances of pairs of individual DNNs (shown as dark blue). Negative correlations indicate that when a particular DNN underperforms (falls below mean accuracy), the other DNN over-performs. Here, we find that different classes of DNNs in the ensemble tend to exhibit negative correlations. In particular, BoW-type methods are negatively correlated with VLAD approaches.
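The per-class deviations and their correlation matrix can be computed as follows. This is a numpy sketch with illustrative names (`delta_accuracy` is not the paper's code), using `np.corrcoef` for the pairwise correlations of the deviation profiles:

```python
import numpy as np

def delta_accuracy(oracles):
    """oracles: (L, C, N) stack of per-classifier oracle matrices.
    Returns per-class accuracies and their deviation from the class mean."""
    acc = oracles.mean(axis=2)          # (L, C): accuracy per classifier/class
    delta = acc - acc.mean(axis=0)      # deviation from the ensemble mean
    return acc, delta

rng = np.random.RandomState(3)
oracles = (rng.rand(4, 6, 50) > 0.3).astype(int)   # 4 classifiers, 6 classes
acc, delta = delta_accuracy(oracles)
# Correlating the deviation profiles exposes complementary classifier pairs
# (negative entries mean one DNN over-performs where the other underperforms).
corr = np.corrcoef(delta)               # (4, 4) matrix, as visualised in Fig. 3
```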

Fig. 2: (a) Shows the top20 classification error for the first 100 classes. For clarity, we plot the maximum and minimum accuracy curves. This allows us to see the range of accuracies across different DNNs for a specific class. (b) shows the deviation of the errors of each DNN to the mean accuracy for the first 100 classes.
Fig. 3: Correlation matrix of the per-class performance deviations between pairs of DNNs.
Fig. 4: (a) Entropy diversity measure and (b) interrater agreement measures for different combinations of classifiers for the top 10 classes. The different colours in (a) and (b) represent different classes as indicated by the legend in (a).
Fig. 5: Overview of 3 different ensembling architectures using (a) single coefficients per DNN, (b) separate coefficients per class and DNN and (c) regularised hybrid of (a) and (b) with two parallel streams.

IV-B DNN Diversity Analysis

It is well known that by combining the outputs from a set of base classifiers with similar performance, significant improvements in accuracy can be obtained. However, this improvement is governed by the diversity present in the group of the base classifiers. Roughly speaking, a set of classifiers are deemed diverse if they perform well on different examples or classes. However, there is no agreed diversity measure for a set of classifiers.

Multiple authors have proposed a multitude of potential diversity measures. Ten such measures were considered in [15], where correlations were observed between the ensemble accuracy and the amount of diversity present in the base classifiers, across all of the diversity measures analysed. These 10 measures can be split into two categories: pairwise and non-pairwise. Pairwise measures quantify the diversity of only two classifiers; to obtain a single number representing ensemble diversity, the average over all pairs is usually taken. Non-pairwise measures combine the performance of all classifiers throughout the dataset into a single measure. The analysis here uses two non-pairwise measures: entropy and interrater agreement.

One issue with the above measures is that they assume single-label classification, whereas here each video example is associated with multiple labels. Consequently, we have chosen to extract the diversity scores per class. The aim is to show how the different diversity scores change across different classes as we add more base classifiers.

The diversity measures are all based on the oracle matrices described in Section IV-A. They all use a common quantity: the number of classifiers that have recognised class $c$ in example $n$ correctly, which we denote $l(c, n)$. We can also compute the average accuracy for class $c$ across all $L$ classifiers as:

$\bar{p}_c = \frac{1}{NL} \sum_{n=1}^{N} l(c, n)$

We can then calculate the interrater agreement for this ensemble of classifiers as:

$\kappa_c = 1 - \frac{\frac{1}{L} \sum_{n=1}^{N} l(c, n)\,\big(L - l(c, n)\big)}{N(L-1)\,\bar{p}_c (1 - \bar{p}_c)}$
The value of $\kappa_c$ indicates the amount of agreement between different classifiers in an ensemble, whilst correcting for chance. The smaller the value, the greater the diversity (since classifiers agree less with each other).

Another measure is based on entropy, and can be calculated as follows:

$E_c = \frac{1}{N} \sum_{n=1}^{N} \frac{\min\big(l(c, n),\, L - l(c, n)\big)}{L - \lceil L/2 \rceil}$
For this measure, a larger value will indicate larger diversity amongst the ensemble. Using the above two measures, we exhaustively calculated the diversity measure for all possible combinations of the DNN classifiers in Table I from a range of 2 classifiers to the final single ensemble of 14 DNNs. The results can be seen in Fig. 4.
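Both diversity measures can be computed directly from the oracle outputs of a single class. The following numpy sketch follows the definitions above (function names are illustrative); note the sanity check that identical classifiers yield zero entropy diversity and full chance-corrected agreement:

```python
import numpy as np
from math import ceil

def entropy_diversity(oracle_c):
    """oracle_c: (L, N) oracle outputs of L classifiers for one class.
    Maximal when correct votes split as evenly as possible per example."""
    L, N = oracle_c.shape
    l = oracle_c.sum(axis=0)                    # correct votes per example
    return np.mean(np.minimum(l, L - l)) / (L - ceil(L / 2))

def interrater_agreement(oracle_c):
    """Kappa-style agreement corrected for chance; lower means more diverse."""
    L, N = oracle_c.shape
    l = oracle_c.sum(axis=0)
    p_bar = l.sum() / (N * L)                   # average per-class accuracy
    return 1.0 - (l * (L - l)).sum() / (L * N * (L - 1) * p_bar * (1 - p_bar))

# Three identical classifiers: no diversity at all.
same = np.tile(np.array([1, 1, 0, 1]), (3, 1))
```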

We find that, in general, the lower bound of the entropy measure increases as classifiers are added. We also show the curve of average entropy across all possible DNN combinations for a fixed number of classifiers. This measure has an interesting artefact: odd-numbered and even-numbered ensembles have different ranges of entropy values, caused by the ceiling operator in the entropy equation above. We also observe that the upper bound of the interrater agreement decreases; both trends indicate that the diversity of the ensemble is increasing. Interestingly, we observe that the less frequent the classes are, the less diverse the ensemble becomes.

V DNN Ensembling

We have shown in Table I that individual DNNs are not able to improve beyond 83% GAP20 scores. Nonetheless, the analysis in Section IV-A indicated that, whilst the overall scores of individual DNNs are similar, the classes on which they perform well are quite diverse. This suggests that ensembling sets of individual DNNs will provide further improvements.

A common approach to learning the ensembling coefficients is first to propose a diversity measure, such as those considered in [15] or [27], and then to simultaneously train multiple classifiers to also maximise the selected measure. To this end, Lee et al. [16] proposed end-to-end training of multiple classifiers with the diversity measure factored into the loss function. This approach is infeasible here, as the individual DNNs are complex and learn at potentially different rates. Additionally, it is still an open question whether explicit optimisation of diversity scores always correlates with increased ensemble accuracy: a theoretical study of 6 different diversity measures by Tang et al. [27] found that diversity measures themselves can be ambiguous predictors of the generalisation accuracy of an ensemble.

Consequently, the approach chosen here assumes that the base-classifiers in the ensemble are fixed (i.e. pre-learnt). We have also chosen to perform ensembling by linearly combining outputs from each DNN. Therefore, the ensemble learning task is to determine the classifier coefficients.

The number of linear combination coefficients commonly falls into two classes: a single coefficient per DNN, or separate coefficients for each DNN and class. Specifically, suppose N DNNs are available, and denote the output of the i-th DNN for a C-class problem as o_i ∈ R^C. The ensembling for the single-coefficient-per-DNN approach is then:

    ŷ = α_1 o_1 + α_2 o_2 + … + α_N o_N     (3)

where the α_i are scalars. The second ensembling approach is:

    ŷ = w_1 ⊙ o_1 + w_2 ⊙ o_2 + … + w_N ⊙ o_N     (4)

where ⊙ represents element-wise multiplication and w_i ∈ R^C is the set of coefficients (one per class) for the i-th DNN. In order to learn the ensembling coefficients in Eq. 3 and Eq. 4, two approaches are considered:

  1. The first is a novel algorithm that attempts to optimise the ensembling coefficients by directly optimising the GAP20 score, as detailed in Section V-A.

  2. The second approach poses the learning of ensembling coefficients as a Mixture of Experts (MoEs) problem, as detailed in Section V-B.
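The two combination schemes above (a single scalar per DNN, and a per-class weight vector per DNN) can be sketched in a few lines of numpy. The array names and toy sizes below are ours, chosen purely for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
N, C, V = 3, 5, 4                     # base DNNs, classes, videos (toy sizes)
outputs = rng.random((N, V, C))       # class predictions of each base DNN

# single coefficient per DNN (Eq. 3)
alpha = np.array([0.5, 0.3, 0.2])
ens_single = np.tensordot(alpha, outputs, axes=1)       # shape (V, C)

# separate coefficient per DNN and per class (Eq. 4)
W = rng.random((N, C))                # one weight vector per DNN
ens_perclass = np.sum(W[:, None, :] * outputs, axis=0)  # shape (V, C)
```

The second scheme strictly generalises the first: setting every row of `W` to a constant recovers the single-coefficient combination.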

V-a Correlation-based Ensembling

In this section, we propose a novel non-DNN-based method for finding optimal DNN coefficients by greedy selection using Pearson's correlation of pairs of DNN outputs. The benefits of this method are that it is very fast compared to the DNN-based methods, and that it directly searches for the combination of weights that optimises GAP, as opposed to optimising a differentiable loss function such as cross-entropy as a proxy for GAP.

The algorithm works as follows:

  1. First, we compute a Pearson’s correlation matrix based on the predictions of all the candidate DNN models.

  2. We greedily select the pair of DNN models A, B with lowest correlation.

  3. For this pair of models, we compute the GAP score for a range of weight combinations (e.g. weight α for model A and 1 − α for model B), with α sampled at a fixed number of evenly spaced values in our experiments.

  4. We can then fit a quadratic curve to the GAP score as a function of the weight for this pair of models. We found experimentally that, across several evaluation metrics, the score curve is very well approximated by a quadratic fit, greatly cutting down the time that an exhaustive search would require.
    Using this quadratic, we can estimate the optimal weight for this pair of models, and combine the predictions linearly using this weight.

  5. Next, we compute correlations between the current ensemble and the remaining models, and once again select the model with the lowest correlation.

  6. We return to step 3, using the ensemble and candidate model as models A and B.

  7. The loop is repeated until all candidate models have been added to the ensemble and the final weights are calculated.

As the algorithm adds models iteratively, it can be terminated early while retaining almost all of the final score with only a subset of the models.
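The steps above can be sketched as follows. This is a simplified illustration, not the exact implementation: a placeholder scoring function (negative mean squared error) stands in for GAP20, which requires ranked predictions over the full test set, and all names are ours:

```python
import numpy as np

def score(pred, labels):
    # hypothetical stand-in for GAP20; any "higher is better" metric works here
    return -np.mean((pred - labels) ** 2)

def greedy_correlation_ensemble(preds, labels, n_grid=11):
    """preds: list of (videos, classes) prediction arrays from the base DNNs."""
    # step 1: Pearson correlation matrix over the flattened predictions
    corr = np.corrcoef(np.stack([p.ravel() for p in preds]))
    np.fill_diagonal(corr, np.inf)
    # step 2: pick the least-correlated pair of models
    a, b = np.unravel_index(np.argmin(corr), corr.shape)
    ensemble, candidate = preds[a], preds[b]
    remaining = [i for i in range(len(preds)) if i not in (a, b)]
    while True:
        # step 3: score a grid of weights w for the ensemble, (1 - w) for the candidate
        ws = np.linspace(0.0, 1.0, n_grid)
        scores = [score(w * ensemble + (1 - w) * candidate, labels) for w in ws]
        # step 4: quadratic fit, then take the vertex as the estimated optimum
        c2, c1, _ = np.polyfit(ws, scores, 2)
        if c2 < 0:                                    # concave curve: use the vertex
            w_opt = float(np.clip(-c1 / (2 * c2), 0.0, 1.0))
        else:                                         # degenerate fit: best grid point
            w_opt = float(ws[int(np.argmax(scores))])
        ensemble = w_opt * ensemble + (1 - w_opt) * candidate
        if not remaining:
            return ensemble
        # steps 5-6: add the remaining model least correlated with the ensemble
        corrs = [np.corrcoef(ensemble.ravel(), preds[i].ravel())[0, 1]
                 for i in remaining]
        candidate = preds[remaining.pop(int(np.argmin(corrs)))]
```

Because the placeholder score is exactly quadratic in the mixing weight, each greedy step can only improve (or match) the better of the two components being combined.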

V-B Mixture of Experts for Ensembling DNNs

In this section, we describe the approach of using mixture of experts for ensembling different DNNs. There are numerous methods commonly used for learning MoEs, and we refer the reader to [18] for a complete review. For our purpose, we have chosen to ensemble the outputs of different pre-trained DNNs using a gating network, whose weights are learnt using stochastic gradient descent.

We consider the following mixture-of-experts configurations: a single coefficient per DNN, class-dependent linear combination coefficients, and a dual-stream model that combines both approaches.

V-B1 Single Coefficient MoE

It is possible to learn the ensembling coefficients by posing the task as a single-layer convolutional network (Fig. 5(a)). The input to this network is the concatenation of the outputs of the N base DNNs into a C × N tensor. The ensembling can be achieved by introducing a convolutional layer with a single 1 × 1 filter (of depth N). The output of this layer is a tensor of size C × 1. As such, the convolution performs the linear combination, with the weights of the filter acting as the ensembling weights. These outputs can then be compared against the corresponding labels using the cross-entropy loss, and the convolutional filter weights learnt using the Adam algorithm.
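This learning problem is small enough to sketch without any deep-learning framework. Below, a minimal numpy implementation learns one weight per base DNN by running Adam on a cross-entropy loss. The toy data, array names, and the sigmoid applied to the combined score (to keep the loss well-defined) are our assumptions, not details from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)
N, C, V = 3, 4, 200                        # base DNNs, classes, videos (toy)
labels = (rng.random((V, C)) < 0.3).astype(float)
# hypothetical base-DNN outputs: noisy copies of the labels, kept in (0, 1)
preds = np.clip(labels[None] + 0.3 * rng.standard_normal((N, V, C)),
                1e-6, 1.0 - 1e-6)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cross_entropy(p):
    return -np.mean(labels * np.log(p) + (1 - labels) * np.log(1 - p))

w = np.zeros(N)                            # one ensembling weight per DNN
m, v = np.zeros(N), np.zeros(N)            # Adam moment estimates
lr, b1, b2, eps = 0.05, 0.9, 0.999, 1e-8

ce_start = cross_entropy(sigmoid(np.einsum('n,nvc->vc', w, preds)))
for t in range(1, 301):
    p = sigmoid(np.einsum('n,nvc->vc', w, preds))   # combined prediction
    grad = np.einsum('nvc,vc->n', preds, p - labels) / (V * C)
    m = b1 * m + (1 - b1) * grad                    # Adam update
    v = b2 * v + (1 - b2) * grad ** 2
    w -= lr * (m / (1 - b1 ** t)) / (np.sqrt(v / (1 - b2 ** t)) + eps)
ce_end = cross_entropy(sigmoid(np.einsum('n,nvc->vc', w, preds)))
```

The loop drives the cross-entropy below its starting value, with the learnt `w` favouring whichever base outputs best predict the labels.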

V-B2 Per-class Coefficient MoE

It is possible to increase the modelling capacity of the ensembling method by introducing more coefficients. To this end, we can introduce separate coefficients for each class and DNN (Fig. 5(b)). As before, this can be re-defined as a DNN learning problem. In this case, it is not possible to use convolutional filters. Instead, we introduce a new layer whose weight tensor has the same size as the input tensor (C × N). An element-wise multiplication between this tensor and the input is performed, followed by a sum across the last dimension, producing an output of size C. The output can then be compared with the corresponding labels using the cross-entropy loss and the weights learnt using the Adam algorithm.

V-C Regularised Dual-Stream MoE

One shortcoming of the per-class coefficient MoE is the possibility of overfitting. A single coefficient per DNN, with far fewer parameters, runs little risk of overfitting, but may instead be too restrictive.

As such, we also explore a novel dual-stream model that combines both approaches. In this model, we aim to have the ensembling coefficients for a particular DNN centred around some "main" value (as in the single-coefficient model), with the ability to learn a small "residual" from this mean to provide some per-class flexibility.

The architecture of this model can be seen in Fig. 5(c). It consists of two initial parallel streams, each receiving the input tensor. The first stream is the convolutional layer, providing the main ensembling coefficient for each DNN. The second stream is the per-class weight layer, providing the residual away from the first stream's coefficients. The outputs of the two streams (each of size C) are summed to produce the ensembled output vector.

In order to restrict the residual weights, we apply L2 regularisation to the weight tensor of the second stream. This constrains the distribution of these weights to be zero-mean with a small standard deviation. As in the previous two approaches, the weights of both streams are learnt by optimising the cross-entropy loss function using the Adam algorithm.
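The dual-stream combination for a single video can be sketched as below. All sizes, array names, and the penalty weight are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(2)
N, C = 3, 5                               # base DNNs, classes (toy sizes)
x = rng.random((C, N))                    # stacked base-DNN outputs, one video
alpha = np.array([0.5, 0.3, 0.2])         # stream 1: one "main" weight per DNN
R = 0.01 * rng.standard_normal((C, N))    # stream 2: per-class residual weights
lam = 1e-3                                # strength of the L2 penalty on R

y = x @ alpha + np.sum(R * x, axis=1)     # summed outputs of the two streams
penalty = lam * np.sum(R ** 2)            # added to the cross-entropy loss

# with the residuals driven to zero by the penalty, the model reduces to
# the single-coefficient ensembler
y_reduced = x @ alpha + np.sum(np.zeros_like(R) * x, axis=1)
```

The L2 penalty is what keeps `R` small, so the per-class flexibility acts as a perturbation around the shared per-DNN coefficients rather than replacing them.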

Vi Ensembling Experimental Results

In this section, we compare the performance of DNN ensembles whose coefficients were learnt using the methods described in Section V. We present the performance of the different DNN ensembles separately. For each approach, we report results on our validation dataset and on the Kaggle test dataset. We found that it was not possible to learn the ensembling coefficients on the training dataset: the fully connected DNNs achieve a much higher score on the training data, which caused the ensembling algorithms to converge on those and ignore the remaining DNNs.

As a result, it was necessary to use the validation dataset for learning the ensembling coefficients. However, using the entire validation dataset for this purpose would leave no remaining data for evaluating generalisation performance. One alternative would be to learn the weights on the full validation set and evaluate only by uploading the ensembled Kaggle test predictions to the competition website; however, this would be too costly time-wise.

In order to address this issue, we split the validation dataset into two partitions, which we denote as the ensemble-train and ensemble-test splits. The ensembling coefficients are learnt on the ensemble-train partition and evaluated on the ensemble-test partition.

Fig. 6: DNN weights for the different ensembling methods (Sec. V).

Vi-a Learnt Ensemble Weights

Fig. 6 shows how the different ensembling methods weight the base DNN models. For the per-class coefficient and dual-stream ensembling DNNs, we instead report the mean weights of the individual base DNNs. All methods assign approximately the same weights to the different base DNNs, and, as expected, the higher the individual performance of a DNN, the higher its weight in general. Of particular interest is the negative weighting assigned by all the ensemblers to the LSTM method. The evolution of the weights for each DNN is shown in Fig. 7. For the single coefficient DNN ensembler, the weights converge after approximately 40 epochs (Fig. 7a). An interesting artefact occurs when the Adam algorithm performs large modifications to the weights, resulting in sudden jumps in the values assigned to each DNN; however, this change is consistent across all DNNs. For the per-class coefficient DNN ensembler, each DNN is assigned not a single weight but 4716 separate weights, one per class. We can nevertheless see the trend of these weights by taking their mean (for each DNN) at every epoch, shown in Fig. 7b.

Finally, we show the weights assigned to the first stream of the dual-stream DNN ensembler in Fig. 7c. For this method, the mean weight of the second stream for each base DNN is approximately 0, due to the L2 regularisation employed. The weights for each DNN evolve more slowly than for the single coefficient DNN; however, we notice similar spikes in the weights due to the Adam algorithm.

Vi-B DNN-based Ensembling Generalisation

In order to shed light on the generalisation capabilities of the different DNN-based ensembling approaches, we analyse the results from the training and test splits. These can be seen for all three ensembling methods in Fig. 8. It can be seen that the training and test results for both single coefficient and dual-stream approaches are roughly the same. However, we find that the per-class coefficient approach shows strong indications of overfitting. In particular, the training GAP20 score is seen to dramatically improve with increasing epochs. Unfortunately, we find that the test GAP20 score decreases in an equally dramatic fashion.

This can be seen more clearly in the scatterplot of training vs test GAP20 scores across all epochs, shown in Fig. 9. We see a strong correlation between increases in training and test GAP for the single coefficient ensembling DNN. In contrast, there is an inverse correlation between the train and test GAP20 scores for the per-class ensembling method. Finally, for the dual-stream method, we still observe good correlation between train and test GAP increases. This suggests that the single coefficient method stands the highest chance of obtaining a significant improvement in GAP20 performance over the other methods.

Ensemble-Type Validation GAP20 Kaggle GAP20
Average 85.03 84.97
Correlation 85.12 85.11
Single Coeff. DNN 85.13 85.11
Per-Class DNN 86.16 83.83
Dual-Stream DNN 85.12 85.10
TABLE III: GAP20 scores on the validation dataset and Kaggle-test dataset for different DNN ensembling methods.

Vi-C Kaggle Test Results

In order to produce the predictions for the Kaggle test data, we chose to learn the ensembling coefficients using all of the validation data. The resulting GAP20 scores on this validation dataset can be seen in Table III. We know from the analysis of Section VI-B that the per-class DNN score will be unreliable. However, we expect the other validation scores to closely reflect the performance gain over simple averaging of multiple DNNs on the unseen Kaggle test data. The resulting GAP20 test results from the Kaggle website confirm this, and can be seen in the third column of Table III.

The baseline ensembling approach is simple averaging of the outputs of the base DNNs. The baseline on the unseen Kaggle test dataset, obtained by averaging the base DNN results and uploading to the Kaggle website, is a test GAP20 score of 84.977%. Both the correlation-based approach and the single coefficient DNN ensembler achieved a significant improvement, reaching 85.11% GAP20. The overfitting of the per-class DNN is confirmed by its Kaggle test GAP20 score of 83.83%. Finally, the dual-stream method achieved a Kaggle test GAP20 of 85.10%; whilst a significant improvement over simple averaging, this does not improve on the single coefficient models.

The improvements observed on the validation dataset can be verified by ensembling the unseen Kaggle test data using the learnt weights. This gives a Kaggle test GAP20 of 85.108%, compared with 84.977% when averaging is used. We found that the same Kaggle test GAP20 of 85.108% could also be achieved by removing the LSTM from the DNN ensemble. This is not surprising, as the ensemble weight given to the LSTM was minimal (Fig. 6).

(a) Single Coeff. DNN (b) Per-Class Coeff. DNN (c) Dual Stream DNN
Fig. 7: Evolution of DNN weights (or their means) across different epochs of the three different ensembling methods.
(a) Training GAP20 (b) Test GAP20
Fig. 8: The ensemble-train (a) and ensemble-test (b) GAP20 across all epochs for each of the proposed ensembling methods.
(a) Single Coeff. DNN (b) Per-Class Coeff. DNN (c) Dual Stream DNN
Fig. 9: The scatterplots of training GAP20 vs test GAP20 for the three DNN-based ensembling methods.

Vii Analysis of DNN-Ensemble Performance

In the previous section, we showed that ensembling provides a significant improvement in overall GAP20, both on the validation dataset and on the Kaggle test set. In this section, we analyse the improvements brought about by the ensembling. To this end, we employ the same GAP20-based class accuracy scores described in Section IV-A. We can then measure how accurate a particular model (base DNN or ensemble) is at recommending a class label for videos. As in the analysis of individual DNNs, we consider the top 100 classes.

As before, to see more clearly the performance differences between classifiers (here, between the base DNNs and their ensemble), we plot their difference from the mean accuracy of the base DNNs. This can be seen in Fig. 10a. We find that the single coefficient DNN ensemble is significantly more accurate than the individual DNNs for the most frequent classes; in fact, the more frequent the class, the greater the accuracy improvement.

Following this, we count the number of classes for which the ensemble achieves maximum accuracy compared with the other classifiers. The results can be seen in Fig. 10b. As expected, we find that the ensemble is the most accurate classifier for the majority of the top-100 classes.

(a) (b)
Fig. 10: (a) This graph shows the difference between classifier accuracy and the mean accuracy across all base DNNs. (b) This graph shows that in the top 100 classes, the ensemble is the most accurate for approximately 80% of the classes.

Viii Conclusions

In this paper, we have evaluated the performance of a wide range of DNN architectures and video features, ranging from video-level statistics to frame-level recurrent networks such as LSTMs. Additionally, we have proposed a novel DNN architecture based on ResNet that achieves state-of-the-art performance for individual classifiers, whilst using substantially simpler video-level statistical information. We have found that, on an individual level, each DNN achieves roughly the same range of GAP20 accuracy, from 82% to 83%. Nonetheless, analysis of their individual performance across different classes shows that they are diverse. Additionally, two ensemble-level measures of diversity provide further indication that adding more classifiers increases diversity. This in turn makes a strong case for ensembling the different classifiers.

We proposed four different methods for performing DNN ensembling, differing mainly in the number of linear combination weights used in the ensembling process. We found that the ensembling method with only a single coefficient assigned to each base DNN has the best generalisation capability. When more weights are used, with a separate weight for each class and base DNN, overfitting becomes a serious issue; whether adding more training data would overcome this remains an open question. The single coefficient ensemble of 13 diverse base DNNs gives us the state-of-the-art GAP20 performance on the Kaggle website of 85.12%, in contrast with the existing highest score of 84.9%, obtained by simple averaging of 25 DNN models. Analysis of the ensemble's performance indicates a significant improvement in labelling accuracy for the most frequent classes.


  • [1] S. Abu-El-Haija, N. Kothari, J. Lee, A. P. Natsev, G. Toderici, B. Varadarajan, and S. Vijayanarasimhan. Youtube-8m: A large-scale video classification benchmark. In arXiv:1609.08675, 2016.
  • [2] R. Arandjelovic, P. Gronat, A. Torii, T. Pajdla, and J. Sivic. Netvlad: Cnn architecture for weakly supervised place recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, PP(99):1–1, 2018.
  • [3] M. Bober-Irizar, S. Husain, E. Ong, and M. Bober. Cultivating DNN diversity for large scale video labelling. CoRR, abs/1707.04272, 2017.
  • [4] L. Breiman. Bagging predictors. Mach. Learn., 24(2):123–140, Aug. 1996.
  • [5] H. Chen and X. Yao. Multiobjective neural network ensembles based on regularized negative correlation learning. IEEE Transactions on Knowledge and Data Engineering, 22:1738–1751, 2010.
  • [6] K. Cho, B. van Merriënboer, Ç. Gülçehre, D. Bahdanau, F. Bougares, H. Schwenk, and Y. Bengio. Learning phrase representations using rnn encoder–decoder for statistical machine translation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1724–1734. Association for Computational Linguistics, Oct. 2014.
  • [7] B. Fernando, E. Gavves, J. O. M., A. Ghodrati, and T. Tuytelaars. Rank pooling for action recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, 39(4):773–787, April 2017.
  • [8] F. A. Gers, J. A. Schmidhuber, and F. A. Cummins. Learning to forget: Continual prediction with lstm. Neural Comput., 12(10):2451–2471, Oct. 2000.
  • [9] A. Guzmán-Rivera, D. Batra, and P. Kohli. Multiple choice learning: Learning to produce multiple structured outputs. In NIPS, pages 1808–1816, 2012.
  • [10] L. K. Hansen and P. Salamon. Neural network ensembles. IEEE Trans. Pattern Anal. Mach. Intell., 12(10):993–1001, Oct. 1990.
  • [11] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In CVPR, pages 770–778. IEEE Computer Society, 2016.
  • [12] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2016.
  • [13] A. Karpathy, G. Toderici, S. Shetty, T. Leung, R. Sukthankar, and L. Fei-Fei. Large-scale video classification with convolutional neural networks. In 2014 IEEE Conference on Computer Vision and Pattern Recognition, pages 1725–1732, June 2014.
  • [14] A. Krogh and J. Vedelsby. Neural network ensembles, cross validation, and active learning. In Advances in Neural Information Processing Systems, pages 231–238. MIT Press, 1995.
  • [15] L. I. Kuncheva and C. J. Whitaker. Measures of diversity in classifier ensembles and their relationship with the ensemble accuracy. Machine Learning, 51(2):181–207, May 2003.
  • [16] S. Lee, S. Purushwalkam, M. Cogswell, D. J. Crandall, and D. Batra. Why M heads are better than one: Training a diverse ensemble of deep networks. CoRR, abs/1511.06314, 2015.
  • [17] D. G. Lowe. Distinctive image features from scale-invariant keypoints. International Journal of Computer Vision, pages 91–110, 2004.
  • [18] S. Masoudnia and R. Ebrahimpour. Mixture of experts: a literature survey. Artificial Intelligence Review, 42(2):275–293, Aug 2014.
  • [19] A. Miech, I. Laptev, and J. Sivic. Learnable pooling with context gating for video classification. CoRR, abs/1706.06905, 2017.
  • [20] J. Y.-H. Ng, M. Hausknecht, S. Vijayanarasimhan, O. Vinyals, R. Monga, and G. Toderici. Beyond short snippets: Deep networks for video classification. In 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 4694–4702, June 2015.
  • [21] F. Perronnin and C. R. Dance. Fisher kernels on visual vocabularies for image categorization. In IEEE Conference on Computer Vision and Pattern Recognition, 2007.
  • [22] F. Radenovic, G. Tolias, and O. Chum. Cnn image retrieval learns from bow: Unsupervised fine-tuning with hard examples. In ECCV, 2016.
  • [23] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, A. C. Berg, and L. Fei-Fei. ImageNet Large Scale Visual Recognition Challenge. International Journal of Computer Vision (IJCV), 115(3):211–252, 2015.
  • [24] R. E. Schapire and Y. Freund. Boosting: Foundations and Algorithms. The MIT Press, 2012.
  • [25] K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. CoRR, abs/1409.1556, 2014.
  • [26] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich. Going deeper with convolutions. In Computer Vision and Pattern Recognition (CVPR), 2015.
  • [27] E. K. Tang, P. N. Suganthan, and X. Yao. An analysis of diversity measures. Machine Learning, 65(1):247–271, Oct 2006.
  • [28] Z. Wu, Y.-G. Jiang, X. Wang, H. Ye, and X. Xue. Multi-stream multi-class fusion of deep networks for video classification. In Proceedings of the 2016 ACM on Multimedia Conference, MM ’16, pages 791–800. ACM, 2016.
  • [29] S. Zhang, S. Zhang, T. Huang, W. Gao, and Q. Tian. Learning affective features with a hybrid deep model for audio-visual emotion recognition. IEEE Transactions on Circuits and Systems for Video Technology, PP(99):1–1, 2017.