# Discriminatively Trained Latent Ordinal Model for Video Classification

## 1 Introduction

The world is exploding with large amounts of visual data, and videos make up a major chunk of such data. Hundreds of hours of video are added to YouTube per day, and millions of surveillance cameras record continuous video streams day in and day out. Most of these videos are human centric, and the huge scale demands systems capable of automatically processing and understanding such data. Hence, researchers in computer vision have been actively engaged in designing methods which work with human centered video data, for tasks like recognizing human actions in videos [1] and facial analysis in videos [9].

We are interested in the challenging and relevant problem of classifying videos of faces and humans based on the property the face is exhibiting or the actions the human is performing. We work in a weakly supervised setting where only video-level labels are given, while neither the individual frames nor any selected subset of frames have labels. The usual assumption in such a weakly supervised setting is that a positive ‘bag’ contains at least one positive instance, while a negative ‘bag’ does not contain any positive instances. In weakly supervised settings, Multiple Instance Learning (MIL) [15] based methods are among the popular approaches and have been applied to the task of facial video analysis [9] with video level, and no frame level, annotations. However, the main drawbacks of most such MIL based approaches are that (i) they use only the maximum scoring vector to make the prediction [15], and (ii) the temporal/ordinal information is lost completely. While the recent work by Li and Vasconcelos [18] extends the MIL framework to consider multiple top scoring vectors, the temporal order is still not incorporated. Intuitively, the temporal order is important; recent works exploit it for related computer vision tasks, e.g. visual representation learning [19]. We propose a novel method that (i) works with weakly supervised data, (ii) mines out the prototypical and discriminative set of vectors required for the task, and (iii) learns constraints on the temporal order of such vectors. We show how modeling multiple vectors instead of the maximum one, while simultaneously considering their ordering, leads to improvements in performance.

The proposed model belongs to the family of models with structured latent variables e.g. Deformable Part Models (DPM) [21] and Hidden Conditional Random Fields (HCRF) [22]. In DPM, Felzenszwalb et al. [21] constrain the location of the parts (latent variables) to be around fixed anchor points with penalty for deviation while Wang and Mori [22] impose a tree structure on the human parts (latent variables) in their HCRF based formulation. In contrast, we are not interested in constraining our latent variables based on fixed anchors [21] or distance (or correlation) among themselves [22], but are only interested in modeling the order in which they appear. Thus, the model is stronger than models without any structure while being weaker than models with more strict structure [21].

The proposed model is also reminiscent of the Actom Sequence Model (ASM) of Gaidon et al. [2], where a temporally ordered sequence of sub-events is used to perform action recognition in videos. However, while ASM requires annotation of such sub-events in the videos, the proposed model aims to find such sub-events automatically. While ASM places absolute temporal localization constraints on the sub-events, the proposed model only cares about the order in which such sub-events occur. One advantage of doing so is the flexibility of sharing appearances between two sub-events, especially when they are automatically mined. As an example, a facial expression may start, as well as end, with a neutral face. In such a case, if the sub-event (neutral face) is tied to a temporal location, we will need two sub-events that are redundant in appearance, i.e. one at the beginning and one at the end. Here, in contrast, such sub-events merge into a single appearance model, with the symmetry encoded by similar costs for the two orderings of this sub-event, keeping the rest the same.

Informally, the proposed model is a collection of discriminative templates, which capture the appearances of the sub-events in the video, along with a cost vector corresponding to all possible permutations in which the events can occur in a video. Scoring by the proposed model is done as follows. The multiple discriminative events in the model are detected and scored in the current video and they also incur a cost depending on the temporal ordering in which they appear in the video. The mining of such discriminative events and learning of the appropriate templates, along with learning the costs associated with different orders of occurrence of the events, happens fully automatically in a weakly supervised setting with the event locations (in time) being latent variables. We propose to learn the model with a max-margin loss minimization objective, optimized with efficient stochastic gradient descent. On the task of facial video analysis, we validate the model on four challenging datasets of expression recognition (CK+ [23] and Oulu-CASIA VIS [24] datasets), clinical pain prediction (UNBC-McMaster Pain dataset [10]) and intent prediction in dyadic conversations (LILiR dataset [25]). We show that the method consistently outperforms temporal pooling and MIL based competitive baselines. In combination with complementary features, we report state-of-the-art results on these datasets with the proposed model.

On the task of human video analysis, we follow previous work and propose a second variant of the method which also takes into account the global temporal information. The model with only local discriminative sub-events assumes that the video can be factorized cleanly into such sub-events. However, for the more challenging case of human actions such a factorization is not always clean: while some actions are easily factorizable, e.g. running and jumping for high-jump, others are a complex combination of local events and full global temporal information and context, e.g. hitting. This interplay of local and global factors for action recognition has been acknowledged, either explicitly or implicitly through experiments, in previous works as well [6]. We take inspiration from such works and extend the model to also incorporate global features of the video, obtained by a pooling operation over the features of the frames. We cast the final objective as a weighted (convex) combination of the local and global parts and learn the parameters jointly. We validate the model, for human analysis, on the challenging Olympic Sports [6], Human Motion Database (HMDB) [5] and HighFive [27] (human interactions) datasets. On human analysis as well, we show consistent improvements over challenging and relevant baselines. The method achieves results that are competitive with the state of the art on the three datasets. Further, qualitative analysis of the results validates the hypothesis of the method, i.e. we show that the method is successful in mining out discriminative events and also learns a consistent ordering over their occurrences.

A preliminary version of this work appeared in [28] with the first variant of the model applied only to facial analysis tasks.

## 2 Related Works

We now describe closely related works in the following sections. We compare and contrast our model with related models in the literature while discussing works on the tasks of face and general human action video classification.

### 2.1 Facial Analysis

Facial analysis is an important area of computer vision. The representative problems include face (identity) recognition [29], identity based face pair matching [30], age estimation [31], kinship verification [33], emotion prediction [34], [35], among others. Facial analysis finds important and relevant real world applications such as human computer interaction, personal robotics, and patient care in hospitals [9]. When we work with videos of faces, we assume that face detection has been done reliably. We note that, despite reduction to just faces, the problem is still quite challenging due to variations in human faces, articulations, lighting conditions, poses, video artifacts such as blur etc. Moreover, we work in a weakly supervised setting, where only video level annotation is available and there are no annotations for individual video frames.

Early approaches for facial expression recognition used apex (maximum expression) frames [37] or pre-segmented clips, and thus were strongly supervised. Also, they were often evaluated on posed video datasets [23]. To encode the faces into numerical vectors, many successful features were proposed e.g. Gabor [39] and Local Binary Patterns (LBP) [38], fiducial points based descriptors [40]. They handled videos by either aggregating features over all frames, using average or max-pooling [1], or extending features to be spatio-temporal e.g. 3D Gabor [42] and LBPTOP [43]. Facial Action Units, representing movement of facial muscle(s) [36], were automatically detected and used as high level features for video prediction [36].

Noting that temporal dynamics are important for expressions [36], the recent focus has been more on algorithms capturing dynamics, e.g. Hidden Markov Models (HMM) [?] and Hidden Conditional Random Fields (HCRF) [46] have been used for predicting expressions. Chang et al. [46] proposed an HCRF based model that included a partially observed hidden state at the apex frame, to learn a more interpretable model where hidden states had specific meaning. The models based on HCRF are also similar to latent structural SVMs [22], where the structure is defined as a linear chain over the video frames. Other discriminative methods were proposed based on Dynamic Bayesian Networks [48] or hybrids of HMM and SVM [49]. Lorincz et al. [50] explored time-series kernels, e.g. based on Dynamic Time Warping (DTW), for comparing expressions. Another similar model used probabilistic kernels for classifying exemplar HMM models [41].

Nguyen et al. [51] proposed a latent SVM based algorithm for classifying and localizing events in a time-series. They later proposed a fully supervised structured SVM for predicting Action Unit segments in video sequences [13]. Our algorithm differs from [51]: while they use simple MIL, we detect multiple prototypical segments and further learn their temporal ordering. MIL based algorithms have also been used for predicting pain [9]. In recent works, MIL has been used with HMM [17] and also to learn embeddings for multiple concepts [16] for predicting facial expressions. Rudovic et al. [12] proposed a CRF based model that accounted for ordinal relationships between expression intensities. Our work differs from this work in handling weakly labeled data and modeling the ordinal sequence between sub-events.

### 2.2 Human Analysis

There have been many works related to understanding humans in visual data. Methods have been proposed for human action and attribute recognition from still images [52], where typical appearances (e.g. sports clothes) or poses (e.g. jumping, riding a bike) may be sufficient for recognition [59]. While many actions are highly correlated with typical poses and clothes, many require some temporal information, and using just still images is thus not sufficient. Motion, being an important cue for the task of human action recognition, has been exploited by many works [1]. One of the most popular approaches in the last decade relied on local methods that extracted descriptors such as Histograms of Oriented Gradients (HOG) and Histograms of Optical Flow (HOF) over salient 3D regions such as Space-Time Interest Points [1]. The final descriptors were obtained by popular pooling methods such as the bag-of-features [65] and Fisher vector [66] frameworks. Since pixels move over time in videos, a fixed space-time voxel may not be able to represent the complete motion of such a pixel. Therefore, techniques were proposed to describe trajectories by tracking pixel(s) over time instead of interest points [68]. Wang et al. [68] showed that densely tracking motion trajectories is superior to previous approaches and proposed to describe them using Motion Boundary Histograms (MBH) for handling camera motion. Several approaches improved upon the Dense Trajectories approach by compensating for camera motion: removing background motion using an affine transformation [71], clustering trajectories to identify dominant camera motion [72], or using image-stitching methods to generate a stabilized video prior to computing trajectories [73]. Wang et al. proposed the Improved Dense Trajectories (iDT) [3] approach, which estimated and removed camera motion using homography and also removed inconsistent feature matches by human detection. In combination with histogram based features, Fisher Vector encoding and spatio-temporal pyramids, they showed significant improvements compared to previous methods. [74] improved upon these features by proposing to stack features extracted at multiple temporal frequencies.

Many works have also focussed on improving upon standard pooling pipelines by explicitly modeling the spatial and temporal structure of human activities [6]. Niebles et al. [6] used a variant of the DPM approach to model each activity as composed of short temporal segments with anchored locations, and penalized segments that drifted away from these anchors during inference. Improving upon their approach, Tang et al. [26] proposed a more flexible variable duration Hidden Markov model that modeled an activity as composed of temporal segments with variable durations and first order transitions. Some works have used MIL based approaches for action classification [78]. Li et al. [75] proposed to dynamically pool over relevant segments in a video. A work closely related to ours modeled videos as a set of sparse key-frames that were learned weakly, but assumed a fixed ordering between the events [77]. Gaidon et al. [2] used additional training data for learning the sub-events while assuming a fixed ordering. In contrast, we not only model each video as composed of sub-events but also learn them in a weakly supervised setting along with a loose ordering between them. Moving away from latent variable modeling, Gaidon et al. [73] used a tree based kernel to compare hierarchical decompositions of videos as clusters of dense trajectories. Other approaches have used attribute dynamics [80], mid-level parts [81], spatio-temporal graphs [76], higher level pooling methods such as rank pooling [82] and distributions of classifier scores [83].

As deep learning showed excellent performance for image classification problems [84], several approaches also explored its use for human action recognition [87]. Some earlier works relied on 3D convolutional networks [93] and their extensions such as early and late fusion [94] for action recognition. Simonyan et al. [4] proposed a two-stream convolutional network that learned both spatial and temporal networks (over stacked optical flow frames). A drawback of these networks was that they learned features over only a few frames and thus were only marginally better than networks that learned features over a single frame. Ng et al. [95] proposed a network that allows training over longer periods by using either feature pooling (similar to temporal pooling) or Recurrent Neural Networks such as Long Short Term Memory (LSTM) networks. Several other approaches have also explored the use of LSTM architectures [88] for action classification and video caption generation [97]. Some recent methods have also used attention based LSTMs that perform classification while focussing on discriminative parts of a video [89]. Tran et al. [87] showed that strong and compact spatiotemporal features can be learned for action recognition by using small 3D filters. Several of these works have also shown the advantages of fusing deep models trained on spatial and temporal components with non-deep features such as iDT [89].

## 3 Approach: Latent Ordinal Model (LOMo)

We denote a video as a sequence of frames, represented as a matrix whose columns are the feature vectors of the individual frames. We work in a weakly supervised binary classification setting, where we are given a training set of videos annotated with the presence or absence of a class in the video, without any annotations for specific frames of the video. While we confine ourselves to the task of video classification in this paper, we note that our model is applicable to general vector sequence classification in a weakly supervised setting.

The proposed model learns a set of events, as well as a cost function associated with the order of occurrence of those events. The events are defined by the associated, discriminatively learned, templates. These templates capture the appearances of different sub-events, e.g. neutral, onset or offset phase of an expression, while the cost function captures the discriminative likelihood of the different temporal orders in which the sub-events appear in the videos. The model templates and the cost function are all automatically and jointly learned, from the training data. Hence, the sub-events are not constrained to be either similar or distinct w.r.t. each other, and are not fixed according to certain expected states [2]. They are mined from the data and could potentially be a combination of the sub-events generally used by humans, e.g. to describe expressions or human activities [2].

Formally, the model is defined by two sets of parameters: a set of sub-event templates, one parameter vector per sub-event, and a set of scalar ordering costs, one for each temporal order in which these templates can occur. The templates are similar to SVM hyperplane parameter vectors, which have often been visualized as templates [21]. The ordering cost function is implemented as a look-up table with size equal to the number of permutations of the sub-events. In the following sections we make the model, and especially the cost function, more concrete and give two variants of the proposed model adapted to facial analysis and human action classification, respectively. We first describe the model that takes into account the contribution of only the local temporal information, which includes scores from the detected sub-events and the ordering cost function (Section 3.1). This model works well for analyzing human facial behavior sequences that have a strong local temporal structure. However, this might not be the case with unconstrained human activities, where certain classes can be better analyzed by either local or global motion components or their combinations. Thus, we adapt the model to include both the local and global temporal information, which are adaptively weighted and learned for different classes (Section 3.2).
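As a rough illustration of this parameterisation, the sketch below stores the templates as a list of vectors and the ordering costs as a look-up table keyed by permutation. The names `LomoModel`, `templates`, `order_costs` and `init_model`, as well as the zero initialisation, are hypothetical choices made for this example and are not taken from the paper.

```python
from dataclasses import dataclass
from itertools import permutations
from typing import Dict, List, Tuple


@dataclass
class LomoModel:
    """Hypothetical container: M templates plus one cost per ordering."""
    templates: List[List[float]]               # M vectors, each of dimension d
    order_costs: Dict[Tuple[int, ...], float]  # look-up table over permutations


def init_model(num_events: int, dim: int) -> LomoModel:
    # Zero initialisation is an assumption made only for illustration.
    templates = [[0.0] * dim for _ in range(num_events)]
    # permutations() of a sorted range yields tuples in lexicographic order,
    # so the table has exactly M! entries in that order.
    order_costs = {p: 0.0 for p in permutations(range(num_events))}
    return LomoModel(templates, order_costs)


model = init_model(num_events=3, dim=4)
print(len(model.templates), len(model.order_costs))
```

With three sub-events the table holds 3! = 6 entries, which shows why the number of sub-events has to stay small for a full look-up table to remain practical.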

We learn the model by minimizing a regularized max-margin hinge loss. The objective combines a regularization term on the concatenation of the component vectors with the hinge loss of the scoring function over the training videos. The scoring function uses the model templates and the cost function to assign a confidence score to an example, and the decision boundary is the set of examples with zero score. The scoring function depends on the type of model we use and is thus defined in the following sections along with the two model variants.
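A regularized hinge loss objective of this kind is commonly written as below. This is a sketch under assumptions rather than the paper's exact formula: the symbols $\Theta$ (model), $\mathbf{w}$ (concatenated component vectors), $\lambda$ (regularization weight), $\mathcal{X}$ (training set of videos $X$ with labels $y \in \{-1, +1\}$) and $s_\Theta$ (scoring function) are introduced here for illustration, and the squared-norm regularizer and the averaging over the training set are assumed.

```latex
\min_{\Theta} \;
  \frac{\lambda}{2} \, \lVert \mathbf{w} \rVert^{2}
  + \frac{1}{\lvert \mathcal{X} \rvert}
    \sum_{(X, y) \in \mathcal{X}}
    \max\bigl( 0, \; 1 - y \, s_{\Theta}(X) \bigr)
```

Under this form an example contributes to the loss only when $y \, s_{\Theta}(X) < 1$, and the decision boundary is $s_{\Theta}(X) = 0$.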

### 3.1 Scoring Function with Local Events: LOMo

In the simplest case the model factorizes the sequence as an ordered set of local events, which are relatively few compared to the number of elements in the sequence. The model consists of a set of sub-event templates and a function which assigns weights to all the different possible orderings of the events.

Deviating from a linear SVM classifier, which has a single parameter vector, the model now has multiple such vectors which act at different temporal positions. The scoring function for a video is defined as the maximum, over the latent variables, of the average response of the sub-event templates at their selected frames plus the ordering cost of the selected frames, subject to an overlap constraint on the selected frames.

The latent variables are the frame indices assigned to the sub-events, one per template. The ordering function maps them to the index of the corresponding permutation, with the permutations enumerated in lexicographical order. The latent variables take the values of the frames on which the corresponding sub-event templates in the model give maximal response while being penalized by the cost function for the sequence of occurrence of the sub-events. The overlap function, together with a threshold, ensures that multiple latent variables do not select very close by frames. We realize the overlap function by constraining the temporal locations of different sub-events to occur at a certain distance from each other (refer to Section 3.4).
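The scoring rule of this section can be sketched by brute force for very small inputs. The code below is an illustration only: `score_exact`, `ordering_of` and `min_gap` are hypothetical names, the overlap constraint is assumed to be a minimum temporal distance between selected frames, and the ordering costs are keyed by permutation tuples instead of an integer index.

```python
from itertools import permutations
from typing import Dict, List, Optional, Sequence, Tuple


def dot(u: Sequence[float], v: Sequence[float]) -> float:
    return sum(a * b for a, b in zip(u, v))


def ordering_of(frames: Sequence[int]) -> Tuple[int, ...]:
    """Temporal order of the sub-events: event ids sorted by selected frame."""
    return tuple(sorted(range(len(frames)), key=lambda i: frames[i]))


def score_exact(
    video: List[List[float]],
    templates: List[List[float]],
    order_costs: Dict[Tuple[int, ...], float],
    min_gap: int = 1,
) -> Tuple[float, Optional[Tuple[int, ...]]]:
    """Exhaustive maximisation over all assignments of frames to sub-events."""
    m, n = len(templates), len(video)
    best_score, best_frames = float("-inf"), None
    for frames in permutations(range(n), m):   # one distinct frame per event
        srt = sorted(frames)
        if any(b - a < min_gap for a, b in zip(srt, srt[1:])):
            continue                            # selected frames are too close
        appearance = sum(dot(templates[i], video[frames[i]]) for i in range(m)) / m
        total = appearance + order_costs[ordering_of(frames)]
        if total > best_score:
            best_score, best_frames = total, frames
    return best_score, best_frames


video = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]       # 3 frames, 2-d features
templates = [[1.0, 0.0], [0.0, 1.0]]               # 2 sub-event templates
order_costs = {(0, 1): 0.5, (1, 0): -0.5}          # event 0 before event 1 is favoured
print(score_exact(video, templates, order_costs))  # (1.5, (0, 1))
```

The exhaustive search grows quickly with the number of frames and sub-events, so it is useful mainly as a reference against which a faster solver can be checked.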

### 3.2 Scoring Function with Local Events and Global Information: Adaptive LOMo

The LOMo model works well for facial videos, e.g. of facial expressions, as they are expected to be composed of temporal segments such as neutral, onset and apex states for facial expressions. However, this is not always the case for human activities, which are more complex and are both spatially and temporally unconstrained. These activities could span from simple actions, which are composed of a single motion segment, to complex or periodic activities such as “talk”, “longjump”, “hugging”, “highjump” etc. In other words, different activities have different temporal structure. We adapt LOMo to a wide range of activities by extending the model to include global information. We obtain global temporal information by using temporal pooling over all the features of a video. This has often been done in works learning decomposable temporal or spatial structure in videos [6]. Going a step further compared to such works, we optimize an objective which is a convex combination of scores based on local and global components, jointly over the parameters corresponding to the two components. A similar idea was used by the authors of [73], who also used a weighted combination of local and global components. However, (i) the issue of weakly labelled data was not addressed and (ii) the locality was defined by clustering local spatio-temporal features rather than being learned jointly with the classifier, as in the present case.

The adapted scoring function is a weighted combination of the local term, i.e. the sub-event responses together with their ordering cost, and the global temporal term. The global term is the response of a global template (a hyperplane) to the global feature, which is obtained by pooling over the feature vectors of all the frames. The remaining notation is as in Section 3.1.

In addition to the change in the scoring function, the objective function is also modified slightly by including a regularization term for the global template parameters.
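A minimal sketch of such a weighted combination is given below. The function names, the weight name `alpha`, and the use of mean pooling for the global feature are assumptions made for illustration; the text only states that some pooling operation is used and that the combination is convex.

```python
from typing import List, Sequence


def mean_pool(video: List[List[float]]) -> List[float]:
    """Global feature: per-dimension mean over all frames (assumed pooling)."""
    n = len(video)
    return [sum(column) / n for column in zip(*video)]


def adaptive_score(
    local_score: float,
    video: List[List[float]],
    global_template: Sequence[float],
    alpha: float,
) -> float:
    """Convex combination of a local (sub-event) score and a global score."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1] for a convex combination")
    pooled = mean_pool(video)
    global_score = sum(w * x for w, x in zip(global_template, pooled))
    return alpha * local_score + (1.0 - alpha) * global_score


video = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
print(adaptive_score(1.5, video, [3.0, 0.0], alpha=0.5))
```

Setting `alpha` to one keeps only the local score, and setting it to zero keeps only the pooled global score, which matches the two limiting cases discussed for the adaptive model.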

### 3.3 Discussion

Intuitively, in the proposed ordinal model, we capture the idea that each video sequence is composed of a small number of prototypical sub-events, e.g. onset followed by apex phase for a smiling face and running followed by jumping for long jump. The components in our model capture the appearance of the prototypical sub-events of the class of interest. However, instead of the sub-events being manually defined, (i) they are learned within a discriminative framework and (ii) are, thus, mined automatically with a discriminative objective. The cost function effectively learns the order in which such appearances should occur. It is expected to support the likely orders of sub-events while penalizing the unlikely ones. Even if a negative video gives reasonable sub-event detections, the order of occurrence of such false positive detections is expected to be incorrect. Thus, the negative video is expected to be penalized by the order dependent cost despite giving sub-event detections. We validate these intuitions empirically with qualitative results.

The second variant of the model, described in Section 3.2, combines local events with the global information in the video using a convex combination of the two respective terms as the optimization objective. This formulation adapts LOMo to different human action classes that could have either local or global structure, or a combination of both. The relative importance of the local versus global parts is selected using cross-validation for each class. We later discuss and show qualitative examples of classes where one of the two components is more important than the other for human actions. Adaptive LOMo is an extension of LOMo: placing all the weight on the local term makes it the same as LOMo, while placing all the weight on the global term makes it the same as global temporal pooling based methods.

### 3.4 Learning

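We propose to learn the model using a stochastic gradient descent (SGD) based algorithm with analytically calculable sub-gradients. For a sub-event template, the sub-gradient consists of the regularization term and, when the current example violates the margin, a data term that depends on the label and on the feature vector of the frame selected for that template. The expression for the sub-gradient w.r.t. the ordering cost is similar, with the template replaced by the cost entry of the selected ordering and the frame feature replaced by a constant. The algorithm randomly samples the training set and does stochastic updates based on the current example. Due to its stochastic nature, the algorithm is quite fast and is usable in online settings where the data is not entirely available in advance and arrives with time.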
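One stochastic update of this kind can be sketched as follows, assuming a score of the form "average template response plus ordering cost", a squared-norm regularizer on the templates only, and a standard hinge loss. The function name `sgd_step` and its arguments (`lr` for the learning rate, `lam` for the regularization weight) are hypothetical.

```python
from typing import Dict, List, Sequence, Tuple


def sgd_step(
    templates: List[List[float]],
    order_costs: Dict[Tuple[int, ...], float],
    video: List[List[float]],
    label: int,
    frames: Sequence[int],
    order: Tuple[int, ...],
    score: float,
    lr: float,
    lam: float,
) -> None:
    """In-place sub-gradient update for one (video, label) pair.

    `frames`, `order` and `score` are the latent frame assignment, its
    temporal ordering and its score, as returned by an inference routine.
    """
    m = len(templates)
    violated = label * score < 1.0            # hinge loss is active
    for i, w in enumerate(templates):
        x = video[frames[i]]
        for j in range(len(w)):
            grad = lam * w[j]                 # regularization term
            if violated:
                grad -= label * x[j] / m      # data term of the hinge loss
            w[j] -= lr * grad
    if violated:
        # Gradient of the hinge term w.r.t. the selected cost entry is -label.
        order_costs[order] += lr * label


templates = [[0.0, 0.0]]
order_costs = {(0,): 0.0}
sgd_step(templates, order_costs, [[1.0, 2.0]], 1, (0,), (0,), 0.0, lr=0.1, lam=0.0)
print(templates, order_costs)
```

Only the cost entry of the ordering that was actually selected is touched, so each update is cheap even when the look-up table is large.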

The scoring optimization can be solved exactly using Dynamic Programming (DP). However, in practice we found the DP based solver to be slow and we resorted to an approximate but much faster algorithm. In the experimental results reported in this paper, we solve the scoring optimization with the following approximate algorithm. We obtain the best scoring frame for a template, sequentially over the templates, and remove that template from the model and the frames in a temporal neighborhood of the selected frame from the video; we repeat these steps until every template has a corresponding frame. The size of the neighborhood is a hyperparameter that ensures temporal coverage by the model: it stops multiple templates from choosing (temporally) close frames. Using such suppression we approximately incorporate the overlap constraint in the scoring function, with the overlap threshold replaced by the neighborhood size. Once the sub-event locations are chosen, we add the ordering cost of the selected locations to their average template score. We refer the readers to the appendix for a discussion of the DP based solution for inference.
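The approximate procedure can be sketched as a greedy loop with temporal suppression. The names `greedy_inference` and `radius` are hypothetical, and processing the templates in their stored order is one reading of "sequentially"; ordering costs are again keyed by permutation tuples.

```python
from typing import Dict, List, Sequence, Tuple


def dot(u: Sequence[float], v: Sequence[float]) -> float:
    return sum(a * b for a, b in zip(u, v))


def greedy_inference(
    video: List[List[float]],
    templates: List[List[float]],
    order_costs: Dict[Tuple[int, ...], float],
    radius: int,
) -> Tuple[float, Tuple[int, ...], Tuple[int, ...]]:
    """Pick one frame per template, suppressing neighbours of earlier picks."""
    n = len(video)
    available = [True] * n
    frames: List[int] = []
    for w in templates:                          # sequentially over templates
        candidates = [k for k in range(n) if available[k]]
        if not candidates:
            raise ValueError("video too short for this radius")
        k_best = max(candidates, key=lambda k: dot(w, video[k]))
        frames.append(k_best)
        for k in range(max(0, k_best - radius), min(n, k_best + radius + 1)):
            available[k] = False                 # temporal suppression
    order = tuple(sorted(range(len(frames)), key=lambda i: frames[i]))
    appearance = sum(dot(w, video[k]) for w, k in zip(templates, frames)) / len(templates)
    return appearance + order_costs[order], tuple(frames), order


video = [[1.0, 0.0], [0.9, 0.0], [0.0, 1.0], [0.0, 0.8]]
templates = [[1.0, 0.0], [0.0, 1.0]]
order_costs = {(0, 1): 0.5, (1, 0): -0.5}
print(greedy_inference(video, templates, order_costs, radius=1))  # (1.5, (0, 2), (0, 1))
```

Because the ordering cost is added only after all frames are fixed, this greedy variant never trades appearance for a better ordering, which is the price paid for its speed relative to an exact solver.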

## 4 Experimental Results

We evaluate the proposed algorithms on two domains: facial analysis and human actions. We now present the details of the datasets and the experimental settings, and then discuss the results.

### 4.1 Facial Analysis Datasets and Settings

We empirically evaluate the proposed approach on four challenging, publicly available facial behavior datasets, of emotions, clinical pain and non-verbal behavior, in a weakly supervised setting, i.e. without frame level annotations. The four datasets range from posed expressions (recorded in a lab setting) to spontaneous expressions (recorded in realistic settings).

#### Datasets

CK+ [23] is a benchmark dataset for expression recognition, with videos from participants posing for seven basic emotions: anger, sadness, disgust, contempt, happy, surprise and fear. We use standard subject independent cross-validation and report the mean of average class accuracies over the folds. It has annotations for the apex frame and thus also allows fully supervised training and testing.

Oulu-CASIA VIS [24] is another challenging benchmark for basic emotion classification. We use the subset of expressions that were recorded under the visible light condition. The sequences cover six classes (the same as CK+ except contempt). It has a higher intra-class variability compared to CK+ due to differences among subjects. We report average multiclass accuracy and use the subject independent folds provided by the dataset creators.

UNBC McMaster Shoulder Pain [10] is used to evaluate clinical pain prediction. It consists of real world videos of subjects with pain while performing guided movements of their affected and unaffected arm in a clinical interview. The videos are rated for pain intensity by trained experts. Following [17], we label videos as “pain” for intensity above three and “no pain” for intensity zero, and discard the rest. This results in a set of videos with positive and negative samples. Following [17], we do a standard leave-one-subject-out cross-validation and report the classification rate at ROC-EER.

LILiR [25] is a dataset of non-verbal behavior, such as agreeing and thinking, in natural social conversations. It contains videos of subjects involved in dyadic conversations. The videos are annotated by multiple annotators for the displayed non-verbal behavior signals: agreeing, questioning, thinking and understanding. We generate positive and negative examples by thresholding the scores with a lower and a higher value and discarding those in between. We then generate ten folds at random and report the average area under the ROC curve; we will make our cross-validation folds public. This differs from Sheerman et al. [25], who used only a very small subset of video samples that were annotated with the highest and the lowest scores.
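Both the pain labels and the LILiR labels are obtained by thresholding a rating and discarding the ambiguous middle range. A minimal sketch of that rule follows; the function name `threshold_label` is hypothetical, and treating ratings at or below the lower threshold as negative is an assumption that reduces to "intensity zero" for non-negative pain ratings.

```python
from typing import List, Optional, Tuple


def threshold_label(rating: float, low: float, high: float) -> Optional[int]:
    """Return +1 above `high`, -1 at or below `low`, None for discarded items."""
    if rating > high:
        return 1
    if rating <= low:
        return -1
    return None


def build_labels(ratings: List[float], low: float, high: float) -> List[Tuple[int, int]]:
    """Keep (video index, label) pairs for videos that are not discarded."""
    labelled = []
    for index, rating in enumerate(ratings):
        label = threshold_label(rating, low, high)
        if label is not None:
            labelled.append((index, label))
    return labelled


# Pain example from the text: positive above three, negative at zero.
print(build_labels([0, 2, 3, 5], low=0, high=3))  # [(0, -1), (3, 1)]
```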

#### Features

We compute four types of facial descriptors. We extract facial landmark points and head-pose information using supervised gradient descent [100] and use them for aligning faces. The first set of descriptors are SIFT-based features, which we compute by extracting SIFT features around facial landmarks and thereafter concatenating them [100]. We align the faces to a fixed size and extract SIFT features (using the open source vlfeat library [101]) in a fixed-size window around each landmark. The SIFT features are normalized to unit norm. We chose the locations of landmark points around the eyes, brows, nose and mouth for extracting the features. Since SIFT features are known to contain redundant information [102], we use Principal Component Analysis to reduce their dimensionality. To each of these frame-level features, we add coarse temporal information by appending the descriptors from the next consecutive frames. The second type of features that we use are geometric features [40], which are known to contain shape or location information of permanent facial features (e.g. eyes, nose). We extract them from each frame by subtracting the coordinates of the landmark points of that frame from those of the first frame (assumed to be neutral) of the video and concatenating them into a single vector. We also compute LBP features, which represent texture information in an image as a histogram. We add spatial information to the LBP features by dividing the aligned faces into a regular grid and concatenating the histograms [37]. We also consider Convolutional Neural Network (CNN) features, using the publicly available model of Parkhi et al. [29], which was trained on a large dataset for face recognition. We use the network output from the last fully connected layer. However, we found that these performed lower than other features, e.g. on the Oulu and CK+ datasets they performed lower than LBP features. We suspect that they are not adapted to tasks other than identity discrimination and did not use them further.
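The coarse temporal context added to the frame-level descriptors can be sketched as a simple stacking operation. The function name `stack_next_frames`, the parameter `num_next`, and the choice to repeat the last frame near the end of the video are assumptions made for this example; the text does not say how the final frames are handled.

```python
from typing import List


def stack_next_frames(features: List[List[float]], num_next: int) -> List[List[float]]:
    """Append the descriptors of the next `num_next` frames to each frame."""
    n = len(features)
    stacked = []
    for t in range(n):
        row: List[float] = []
        for offset in range(num_next + 1):
            idx = min(t + offset, n - 1)   # assumed padding: repeat the last frame
            row.extend(features[idx])
        stacked.append(row)
    return stacked


frames = [[1.0], [2.0], [3.0]]
print(stack_next_frames(frames, num_next=2))
# [[1.0, 2.0, 3.0], [2.0, 3.0, 3.0], [3.0, 3.0, 3.0]]
```

Stacking multiplies the descriptor dimensionality by `num_next + 1`, which is one reason to reduce the dimensionality of the per-frame features first.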

#### Baselines

For our experiments on human facial behavior analysis we report results with baseline approaches. For the first two baselines we use average (or mean) pooling (MnP) and max temporal pooling (MxP) [41] over per-frame facial features along with an SVM. Temporal pooling is often used along with spatio-temporal features such as Bag of Words [1] and LBP [43] in video event classification, as it yields a vectorial representation for each video by summarizing variable length frame features. We select Multiple Instance Learning based on latent SVM [15] as the third baseline algorithm. We also compute the performance of the fully supervised (FS) algorithms for cases where the location of the frame that contains the apex expression is known. To make a fair comparison, we use the same implementation for SVM, MIL and LOMo.
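The two pooling baselines reduce a variable length sequence of frame features to a single vector per video. A minimal sketch, with hypothetical function names:

```python
from typing import List


def mean_pool(frames: List[List[float]]) -> List[float]:
    """Per-dimension average over all frames of a video."""
    n = len(frames)
    return [sum(column) / n for column in zip(*frames)]


def max_pool(frames: List[List[float]]) -> List[float]:
    """Per-dimension maximum over all frames of a video."""
    return [max(column) for column in zip(*frames)]


frames = [[1.0, 4.0], [3.0, 2.0]]
print(mean_pool(frames))  # [2.0, 3.0]
print(max_pool(frames))   # [3.0, 4.0]
```

Either pooled vector can then be passed to a standard linear classifier; both discard the order of the frames, which is the information the ordinal model is designed to keep.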

#### Parameters

We fix and in the current implementation, for obtaining SVM baseline results with a single vector input (mean and max pooling), and report best results across both learning rate and number of iterations. For both MIL () and LOMo, which take a sequence of vectors as input, we set the learning rate to and for MIL we set . We fix the regularization parameters and for all experiments, based on initial validation experiments. We do multiclass classification using one-vs-all strategy. For ensuring temporal coverage (see ), we set the search space for finding the next sub-event to exclude and neighboring frames from the previously detected sub-events’ locations for datasets with fewer frames per video (i.e. CK+, Oulu-CASIA VIS and LILiR datasets) and UNBC McMaster dataset, respectively. We did not do (dataset specific) cross-validation for these hyperparameters as we found the results to be stable across different choice of hyperparameters owing to similar domain of the datasets. We set for all datasets except CK+ where we set since it consists of posed expressions containing only onset and apex phase spanning a few frames. For our final implementation, we combine LOMo models, learned independently on different features, using late fusion i.e. we averaged the prediction scores with equal weights.
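The late fusion and one-vs-all steps mentioned above can be sketched as follows. The function names are hypothetical, and each inner list is assumed to hold one score per class from a model trained on a single feature type.

```python
from typing import List


def late_fusion(score_lists: List[List[float]]) -> List[float]:
    """Average per-class scores from several models with equal weights."""
    num_models = len(score_lists)
    return [sum(scores) / num_models for scores in zip(*score_lists)]


def predict_one_vs_all(class_scores: List[float]) -> int:
    """Pick the class whose one-vs-all classifier gives the highest score."""
    return max(range(len(class_scores)), key=lambda c: class_scores[c])


sift_scores = [0.2, 0.9, -0.1]   # illustrative values, not measured results
lbp_scores = [0.6, 0.1, 0.3]
fused = late_fusion([sift_scores, lbp_scores])
print(predict_one_vs_all(fused))  # 1
```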

### 4.2Human Actions Datasets and Settings

We also evaluate our approach on challenging publicly available human action classification datasets. Their videos have only video-level labels and cover a wide range of activities, such as sports, grooming and human interactions, that show wide variability in appearance, temporal structure, duration, viewpoints etc.

Olympic Sports [6] is a dataset of sports activities, e.g. snatch, clean and jerk, high jump etc. It contains video samples from sports classes with most videos collected from YouTube. Most of these activity classes are complex activities [6] in that they are composed of simpler actions, e.g. the long jump activity is composed of standing, running and jumping. In addition to simple limb movements, this dataset also involves interactions between humans and objects, e.g. the javelin in the javelin throw class. We use the train and test splits provided by the authors and report Mean Average Precision (mAP) across all classes.

High Five [27] dataset consists of videos of interactions between humans which were collected from TV shows. The dataset contains clips from classes, i.e. hug, kiss, handshake, highfive and clips from a negative class. This dataset was introduced to study two-person interactions as opposed to single-person interaction. We use the train and test folds provided by the authors and report mAP across two fixed cross-validation folds.

HMDB [5] dataset contains a wide range of human actions including facial actions (e.g. smile, laugh), facial actions with object manipulation (e.g. smoke, drink), general body movements (e.g. cartwheel, clap hands, climb, jump), body movements with object interactions (e.g. catch, brush hair, shoot bow), and body movements for human interactions (e.g. fencing, hug, kiss). It has around clips from classes, collected from YouTube, Prelinger archive etc. The dataset is very challenging; it has videos with significant camera motion (), poor quality (only high quality clips), and non-frontal camera viewpoints ( clips have frontal viewpoint). We report both mAP and mean multiclass accuracy across cross-validation folds provided by the authors.
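Since mAP is the main metric for these datasets, a minimal sketch of its computation may be useful. It assumes the non-interpolated definition of average precision (precision averaged at the rank of each positive sample); individual benchmark protocols may use an interpolated variant instead, and all names and values below are illustrative.

```python
def average_precision(scores, labels):
    """Non-interpolated AP for one class; labels are 1 (positive) or 0."""
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    hits = 0
    precisions = []
    for rank, (_, label) in enumerate(ranked, start=1):
        if label == 1:
            hits += 1
            precisions.append(hits / rank)
    return sum(precisions) / len(precisions) if precisions else 0.0


def mean_average_precision(per_class):
    """per_class is a list of (scores, labels) pairs, one per class."""
    return sum(average_precision(s, l) for s, l in per_class) / len(per_class)


# Toy one-vs-all outputs for two classes (not results from the paper).
class_a = ([0.9, 0.8, 0.7, 0.6], [1, 0, 1, 0])
class_b = ([0.2, 0.9], [1, 0])
ap_a = average_precision(*class_a)
ap_b = average_precision(*class_b)
toy_map = mean_average_precision([class_a, class_b])
```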

#### Features

We use both Improved Dense Trajectory (iDT) [3] and CNN based features [85] for our experiments on classifying human actions. We extract the iDT features using the tool provided by the authors and also use the available human bounding boxes for HMDB and Olympic dataset [3]. For HighFive dataset we use the bounding boxes provided by the dataset creators [27]. We extract iDT features with trajectory length of for HMDB and HighFive dataset, while for Olympic dataset we use trajectories of length as used in [105], as trajectories for fast moving motions (especially sports) are usually unable to describe salient objects for more than a few frames. The trajectories are described using four features, i.e. HOG, HOF and MBH along X-axis and Y-axis. We use PCA to reduce the dimensionality of these features by half and encode them separately using Fisher Vector (FV) encoding with dictionaries of elements. We then perform power normalization () and normalization for the FVs [66] and concatenate the FVs for different low-level features as the final representation (dimensionality is ). We refer to these features as iDT-64-FV denoting the different components used. For both clustering via Gaussian Mixture Modeling and extracting FV we use the vlfeat library [101].

We construct segment-level local features by using a temporal window of size of around each frame and then compute the above FVs on features lying in the temporal window. This averaging results in adding temporal context to each individual feature. The length of determines the duration of local sub-events and we set it based on the average duration of each clip and an estimate of the duration of corresponding local temporal events in each dataset. We set for Olympic dataset, for HighFive dataset and for HMDB dataset. The Olympic dataset in general includes longer videos with average duration of frames compared to in HighFive and HMDB. Also, the local temporal events in general are longer in Olympic dataset compared to HighFive and HMDB. We use the above procedure to obtain global temporal features for the Adaptive LOMo algorithm by using a temporal window that includes the entire duration of a video.

We extract CNN features from layer VGG network [85] and layer ResNet [86], both trained for image classification on ImageNet dataset. For each frame, we resize it such that the smallest dimension is of size and then crop a central region. We then compute outputs from fc-6 layer for VGG network (dimensionality is ) and pool-5 layer of ResNet (dimensionality is ). We also extract spatio-temporal C3D features as described in [87] by using a network that was pretrained on Sports dataset. We use the Caffe implementation provided by Tran et al. and extract features from the fc-6 layer over a frame window. We use the same procedure as outlined for FVs for obtaining local and global temporal features, except that we use mean temporal pooling and only perform normalization in the end.
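The construction of local and global temporal features from per-frame descriptors can be sketched as follows. This is a simplified illustration for the mean-pooled (CNN-style) case, assuming a window centered on each frame and L2 normalization at the end; `half_width` and `alpha` are hypothetical parameter names, and the power normalization helper only shows the signed-power operation applied to FVs.

```python
import math


def power_normalize(vec, alpha):
    """Signed power normalization: sign(v) * |v| ** alpha per dimension."""
    return [math.copysign(abs(v) ** alpha, v) for v in vec]


def l2_normalize(vec):
    """Scale a vector to unit Euclidean norm (assumed normalization type)."""
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec] if norm > 0 else list(vec)


def local_features(frames, half_width):
    """One mean-pooled, normalized vector per frame over a centered window."""
    pooled_all = []
    for t in range(len(frames)):
        lo = max(0, t - half_width)
        hi = min(len(frames), t + half_width + 1)
        window = frames[lo:hi]
        pooled = [sum(column) / len(window) for column in zip(*window)]
        pooled_all.append(l2_normalize(pooled))
    return pooled_all


def global_feature(frames):
    """Same pooling with a window that covers the entire video."""
    pooled = [sum(column) / len(frames) for column in zip(*frames)]
    return l2_normalize(pooled)


# Toy per-frame descriptors (illustrative values).
toy_frames = [[3.0, 4.0], [3.0, 4.0], [3.0, 4.0]]
toy_local = local_features(toy_frames, half_width=1)
toy_global = global_feature(toy_frames)
toy_power = power_normalize([4.0, -9.0], alpha=0.5)
```

The local features give the model one candidate vector per frame position, while the global feature corresponds to the input used for global temporal pooling.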

#### Baselines

We report results with five baselines for the evaluation of Adaptive LOMo (A-LOMo) on human activity classification. The first baseline is global temporal pooling (GTP), which is obtained by using global temporal features. The second baseline is LOMo, obtained with local temporal features only. We also compare with the MIL algorithm, as done for facial analysis. The fourth baseline is LOMo with the ordinal component removed, and the final baseline is Adaptive LOMo with (MIL+GTP), which essentially combines MIL with GTP.

#### Parameters

Since the human activity datasets vary in the number and type of classes, viewpoints, camera motion, duration of clips and number of samples, we opt for using fold cross validation for setting parameter. Further, since different actions have events that span different durations, we also set the temporal coverage parameter by the same cross-validation. We sweep from to and during inference we set to the minimum of the cross-validated value and the number of frames divided by the number of events. This is done to handle clips whose durations are shorter than the parameters. In order to reduce the number of cross validation steps we first set and using cross-validation for LOMo and then fix these values for cross-validating the parameter in Adaptive LOMo. We observed during our experiments that the results were not very sensitive to these parameters. The second regularization parameter is set to in our experiments. We set across all experiments based on our initial validation experiments where we observed that in general the results were stable across different classes for and then began to drop. The GTP based baseline is obtained by setting in the Adaptive LOMo algorithm, while the LOMo baseline is obtained by setting . We report the results for the MIL baseline by setting and . For our final implementation we combine Adaptive LOMo models, learned independently on different features, by weighted averaging of z-score normalized prediction scores (late fusion). We obtained the weights by doing a coarse grid search over the set . The weights obtained for the three datasets were for ResNet based features, for C3D (except for in HighFive dataset), for Objects (used in HMDB dataset), and for iDT.
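The late fusion step described above can be sketched as follows. This is an illustration only: it assumes that z-scoring is applied to the scores of each model over the set of test samples, and the weights and score values are toy numbers rather than the ones selected by the grid search.

```python
import statistics


def zscore(scores):
    """Standardize one model's scores to zero mean and unit variance."""
    mu = statistics.mean(scores)
    sigma = statistics.pstdev(scores)
    return [(s - mu) / sigma if sigma > 0 else 0.0 for s in scores]


def late_fusion(score_lists, weights):
    """Weighted average of z-scored prediction scores, one list per model."""
    normalized = [zscore(scores) for scores in score_lists]
    return [
        sum(w * value for w, value in zip(weights, per_sample))
        for per_sample in zip(*normalized)
    ]


# Toy scores of two models on three samples (different scales on purpose).
model_a = [1.0, 2.0, 3.0]
model_b = [10.0, 30.0, 20.0]
fused = late_fusion([model_a, model_b], weights=[0.5, 0.5])
```

Normalizing before averaging keeps a model with a larger score range from dominating the fused prediction.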

### 4.3Quantitative Results

#### Facial Behavior Analysis

The performances of the proposed approach, along with those of the baseline methods, are shown in . In this comparison, we use SIFT-based facial features for all datasets. Since head nod information is important for identifying non-verbal behavior such as agreeing, we also append head-pose information (yaw, pitch and roll) to the SIFT-based features for the LILiR dataset.

We see performance improvements with the proposed LOMo, in comparison to baseline methods, on out of prediction tasks. In comparison to MIL, we observe that LOMo outperforms the former method on all tasks. The improvements are and absolute, on CK+, Oulu-CASIA VIS and UNBC McMaster datasets, respectively. This improvement can be explained by the modeling advantages of LOMo, where it not only discovers multiple discriminative sub-events but also learns their ordinal arrangement. For the LILiR dataset, we see improvements in particular on the ‘Questioning’ ( absolute) and ‘Agreeing’ ( absolute) tasks, where temporal information is useful for recognition. In comparison to temporal pooling based approaches, LOMo outperforms both variants of temporal pooling, MnP and MxP, on out of tasks. This is not surprising since temporal pooling operations are known to add noise to discriminative segments of a video by adding information from non-informative segments [41]. Moreover, they discard any temporal ordering, which is often important for analyzing facial activity [9].

On both facial expression tasks, i.e. emotion (CK+ and Oulu-CASIA VIS) and pain prediction (UNBC McMaster), methods can be arranged in increasing order of performance as MnP, MxP, MIL, LOMo. A similar trend between temporal pooling and weakly supervised methods has also been reported by previous studies on video classification [9]. We again stress that LOMo performs better than the existing weakly supervised methods, which are the preferred choice for these tasks. In particular, we observe the difference to be higher between temporal pooling and weakly supervised methods on the UNBC McMaster dataset, i.e. for MnP, for MxP, for MIL and for LOMo. This is because the subjects exhibit both head movements and non-verbal behavior unrelated to pain, and thus focusing on the discriminative segment, as opposed to using a global description, leads to a performance gain. However, we do not notice a similar trend on the LILiR dataset – the differences are smaller or reversed, e.g. for ‘Understanding’ mean-pooling is marginally better than MIL ( vs. ), while LOMo is modestly better than both (). This could be because most conversation videos are pre-segmented and predicting non-verbal behavior relying on a single prototypical segment might be difficult, e.g. ‘Understanding’ includes both upward and downward head nods, which cannot be captured well by detecting a single event. In such cases we see that LOMo beats MIL by temporal modeling of multiple events. We also performed a t-test between the results of MIL and LOMo and observed that the p-values were low ( ) for the Oulu-CASIA VIS dataset and the LILiR ‘Questioning’ task. The p-values were moderately low ( ) for CK+ and LILiR-think. We attribute the higher p-values in other cases to the small number of samples in most datasets e.g. in UNBC McMaster. We also observed the results of LOMo to be higher than or equal to those of MIL in most test folds on all the datasets e.g. for UNBC McMaster.

#### Human Action Classification

The comparison of Adaptive LOMo with several baseline methods on datasets and two different features is shown in . When using iDT features we see that Adaptive LOMo outperforms the baseline methods on all the datasets except HMDB. The mAP score for Adaptive LOMo is on the Olympic dataset as compared to with LOMo, with MIL, with GTP, with LOMo without ordinal component, and with MIL+GTP. We also see that on HMDB dataset, both Adaptive LOMo () and LOMo () outperform MIL () by a significant margin, showing the benefits of using the proposed local structure instead of a single discriminative segment. These results clearly demonstrate that, by learning to adapt to different classes, Adaptive LOMo combines the strengths of both local and global temporal structure and results in performance improvements. The baseline with MIL+GTP is also outperformed by Adaptive LOMo in the majority of the cases, further demonstrating that the local ordinal structure is important in the combination of local and global information as well.

In terms of features, the improvements relative to GTP are higher for iDT features as compared to CNN features. For example, the relative improvement between Adaptive LOMo and global temporal pooling is for iDT features and for CNN features on the Olympic dataset. This trend for performance improvement is similar for the HMDB dataset ( for iDT vs. for CNN). We explain this observation by the presence of motion information in the iDT features, which results in more meaningful local temporal segments and benefits our algorithm. This is particularly true for classes where motion cues seem to be more important for discrimination compared to appearance cues [5]. For example, we observed the relative improvement between Adaptive LOMo and GTP across iDT and CNN features to be high for classes such as “chew” ( for iDT vs. none for CNN), “shootball” ( for iDT vs. none for CNN), “handstand” ( for iDT vs for CNN), “highjump” ( for iDT vs for CNN).
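For clarity, the distinction between absolute and relative improvement used in these comparisons can be written out. The sketch assumes the usual definition of relative improvement as the percentage change with respect to the baseline score; the values are toy numbers and not results from the paper.

```python
def absolute_improvement(new_score, base_score):
    """Difference in score, in the same unit as the metric."""
    return new_score - base_score


def relative_improvement(new_score, base_score):
    """Percentage change of new_score with respect to base_score."""
    if base_score == 0:
        raise ValueError("base_score must be non-zero")
    return 100.0 * (new_score - base_score) / base_score


# Toy values: the same absolute gain is a larger relative gain on a weak baseline.
rel_weak = relative_improvement(0.60, 0.50)
rel_strong = relative_improvement(0.90, 0.80)
abs_gain = absolute_improvement(0.60, 0.50)
```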

The HMDB dataset was collected by asking students to annotate parts of a clip that represented a single non-ambiguous human action [5]. Thus it is mostly composed of pre-segmented clips where for certain classes the appearance information may not vary much in comparison to the motion information. For example, in the class “shootball” appearance features such as CNN may always encode a person, basketball net and a basketball, and this representation may not vary much across frames. On the other hand, motion information can effectively encode the movement of the person and the ball. This information can then effectively represent sub-events, such as standing, jumping and shooting the ball, that describe this class. This is why our method, which combines GTP with local sub-events, gains larger performance improvements over GTP when using iDT features. On HMDB dataset we observed that Adaptive LOMo outperforms GTP on out of classes when using iDT features. On the other hand, with CNN features it outperformed GTP only in classes. We saw a similar trend for HighFive dataset where Adaptive LOMo outperformed GTP on all classes with iDT features, and on out of classes with CNN features. On the Olympic dataset, Adaptive LOMo trained with iDT features showed improvement on of classes, while with CNN features it showed an improvement on out of classes. We also see that LOMo outperforms its counterpart without ordinal information, while using iDT features, on all datasets. These trends clearly highlight that motion information is critical for learning local sub-events and that their ordinal relationship is important as well. For the remaining classes we observed that the results were either similar to GTP or lower when cross-validation was unable to estimate the correct weighting factor.

We have also shown some classes with the best relative improvement between Adaptive LOMo and GTP in . These classes show improvement over the baseline due to the addition of LOMo, which models each video as a collection of local sub-events with loose ordering. We observe that the classes with the most improvement can either be decomposed into short temporal segments, e.g. “hug”, “highjump”, “pick”, or involve repetitions of simpler motion segments, e.g. “talk”, “walk”, “climb”. We conclude from this observation that our method generalizes not only to activities composed of simple motion segments but also to categories that involve repetitions of these simpler actions. Motion information is crucial for discriminating between fine-grained classes such as sword drawing vs. sword fighting, and our method is able to learn and yield performance improvement from this information. Other examples of such classes are “cleanandjerk” and “poleactivity”, which seem similar to the class “snatch” in appearance but differ in the composition of the temporal segments. We observed during our experiments that for some classes Adaptive LOMo selected either mostly local or mostly GTP. Examples of classes where mostly one of the two components was selected are: (i) local () – “cleanandjerk”, “hug”, “handstand”, “talk”, “highjump”, “shootball”, “pullup”, “pushup” and (ii) global () – “turn”, “drawsword”, “kiss”, “shakehands”, “catch”, “kickball”, “eat”, “wave”. Notice that in the latter classes the action is expected to be spread out over the whole duration of the video. We believe that being able to adapt the classifier to different classes not only results in improvement, as seen with Adaptive LOMo, but also explains the underlying temporal structure.

### 4.4Qualitative Results

We now give some qualitative results in this section by showing detections on some test samples. The frames are overlaid with detections from the discovered sub-events with Event 1 in red color, Event 2 in green color and Event 3 in blue color. Since the LOMo model starts with random initialization the events can vary across classes in terms of both the most probable ordering and their semantics.

#### Facial Behavior Analysis

shows the detections of our approach, with model trained for “happy” expression, on two sequences from the Oulu-CASIA VIS dataset. The model is trained with three sub-events. As seen in , the three events seem to correspond to the expected semantic events, i.e. neutral, low-intensity (onset) and apex, in that order, for the positive example (left), while for the negative example (right) the events are incorrectly detected and are in the wrong order as well. To further illustrate the importance of learning the ordering, we observed that the ordering cost learned by the model for the ordering was which is much lower than for the correct order of , as observed in the positive happy example. This result highlights the modeling strength of LOMo, where it learns both multiple sub-events and a prior on their temporal order.

#### Human Action Classification

We have shown detections on test samples in from class “hug” in HighFive dataset, class “cleanandjerk” in Olympic dataset and classes “golf” and “shootball” from HMDB dataset. We see the consistency between sub-events detected for these classes. For example, for the “golf” class the three events seem to focus on backward motion with the club, the beginning of forward motion and hitting the ball with the club, respectively. We also see that since our model is learning to focus only on (discriminative) frames that correspond to the underlying activity, it can effectively filter out noisy or irrelevant frames. For example, it is filtering out the last few frames in example “hug-1” and the first few frames in examples “shootball-1” and “shootball-2”, where the person seems to be receiving the ball instead of shooting it. Similar to human facial behavior, we found that the ordering cost learns to penalize some orderings more than others. For example, in the case of class “cleanandjerk” the model allows for swaps between Event 2 and Event 3, which seem to correspond to the lifting up motion, but penalizes if Event 3 or Event 2 comes before Event 1 (where the person is beginning to lift).

Thus, we conclude that qualitatively our model supports our intuition, that not only the correct sub-events but their correct temporal order is critical for high performance in such tasks.

### 4.5Parameter Study

We now study the effect of some parameters on the proposed algorithms. (left) shows a plot of parameter (regularizer parameter for weights of the hyperplanes) for LOMo and our implementation of SVM and MIL on two facial behavior datasets. We clearly see that the results are not sensitive to and LOMo also shows clear improvements over the baseline algorithms. In order to show the advantages of using ordinal information inside the model, we selected the same classes (“hug”, “cleanandjerk”, “golf” and “shootball”) as used in the qualitative results in . These classes seem to have a distinct temporal structure and are composed of fine-grained sub-events that differ in their motion patterns. (right) shows relative improvements for classes by switching off the learned ordinal component in the scoring function of LOMo during testing. We observe improvements for these classes ( on average) while using the ordinal cost in the scoring function. We also show a plot of mAP scores versus the number of events () used to train the model in . We see that for most classes the performance goes up from to and then goes down for higher values of . The only exception is the class “shootball” where results for are higher as compared to , and this is the case since two sub-events could be better at representing the class. Also with for the “shootball” class, we found that using the ordinal part of the scoring function yields a relative improvement of .
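Switching the ordinal component on and off at test time can be illustrated with a small sketch. It assumes a scoring function of the form "average appearance score of the selected frames plus a learned term indexed by the order in which the events occur"; the function names, the dictionary `order_term` and all values are hypothetical and do not reproduce the learned model.

```python
def ordering_of(locations):
    """Rank pattern of the detected frames, e.g. (10, 30, 20) -> (0, 2, 1)."""
    by_time = sorted(range(len(locations)), key=lambda k: locations[k])
    ranks = [0] * len(locations)
    for rank, event in enumerate(by_time):
        ranks[event] = rank
    return tuple(ranks)


def video_score(appearance, locations, order_term, use_order=True):
    """Average appearance score of the chosen frames, plus the ordering term.

    appearance[k][t] is the assumed score of event k on frame t and
    locations[k] is the frame selected for event k.
    """
    score = sum(appearance[k][t] for k, t in enumerate(locations)) / len(locations)
    if use_order:
        score += order_term.get(ordering_of(locations), 0.0)
    return score


# Toy model with three events and three frames (illustrative values).
toy_appearance = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
toy_order_term = {(0, 1, 2): 0.5, (2, 1, 0): -0.5}
with_order = video_score(toy_appearance, (0, 1, 2), toy_order_term)
without_order = video_score(toy_appearance, (0, 1, 2), toy_order_term, use_order=False)
```

Comparing `with_order` and `without_order` over a test set mirrors the ablation reported above, where the ordinal term is disabled in the scoring function during testing.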

### 4.6Comparison with State-of-the-Art

#### Human Facial Analysis

In this section we compare our approach with several existing approaches on the three facial expression datasets (CK+, Oulu-CASIA VIS and UNBC McMaster). shows our results along with many competing methods on these datasets. To obtain the best performance from the model, we exploited the complementarity of different facial features by combining LOMo models learned on three facial descriptors – SIFT based, geometric and LBP. We combined the models with late fusion by averaging the outputs of all the models. With this setup, we achieve state-of-the-art results on the three datasets.

Several initial methods pooled the spatio-temporal information in the videos, e.g. (i) LBPTOP [43] – Local Binary Patterns in three planes (XY and time), (ii) HOG3D [106] – spatio-temporal gradients, and (iii) 3D SIFT [7]. We report results from Liu et al. [107], who used a similar experimental protocol. These were initial works and we see that their performances are far from current methods, e.g. compared to for the proposed LOMo, HOG3D obtains and LBPTOP obtains on the Oulu-CASIA VIS dataset.

Approaches modeling temporal information include Exemplar-HMMs [41], STM-ExpLet [107], MS-MIL [11]. While Sikka et al. (Exemplar-HMM) [41] computed distances between exemplar HMM models, Liu et al. (STM-ExpLet) [107] learned a flexible spatio-temporal model by aligning local spatio-temporal features in an expression video with a universal Gaussian Mixture Model. LOMo outperforms such methods on both the emotion classification tasks, e.g. on Oulu-CASIA VIS dataset, LOMo achieves a performance improvement of and absolute relative to STM-ExpLet and Exemplar-HMMs respectively. Sikka et al. [9] first extracted multiple temporal segments and then used MIL based on boosting MIL [11]. Chongliang et al. [17] extended this approach to include temporal information by adapting HMM to MIL. We also note the performance in comparison to both MIL based approaches (MS-MIL [9] and MIL-HMM [17]) on the pain dataset. Both the methods reported very competitive performances of and on UNBC McMaster dataset compared to obtained by the proposed LOMo. Since having a large amount of data is difficult for many facial analysis tasks, e.g. clinical pain prediction, our results also show that combining, simple but complementary, features with a competitive model leads to higher results.

#### Human Analysis

In this section, we compare our approach with several existing approaches for human activity classification on three benchmark datasets (Olympic, HighFive and HMDB). shows our results along with other competing methods on these datasets. In order to obtain the best performance from the model, we exploit the complementarity of different features by combining LOMo models learned on three different descriptors – iDT and deep features from C3D and ResNet-152 (see footnote 2). As the standard performance metric for HMDB dataset is mean multiclass accuracy, we fix to calibrate scores from different one-vs-all classifiers. We use late fusion for combining these features by doing weighted averaging over their normalized prediction scores. Since different methods report results by combining multiple approaches, for HMDB dataset we also give a brief description of the corresponding methods.
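The step from one-vs-all scores to a multiclass accuracy can be sketched as below. The sketch assumes that each sample is assigned to the class whose classifier gives the highest score and that the reported number is the average of per-class accuracies; the function name and toy scores are illustrative. This is also why the scores of the individual classifiers need to be comparable with each other.

```python
def mean_multiclass_accuracy(score_rows, labels, num_classes):
    """Average per-class accuracy of argmax predictions.

    score_rows[i][c] is the score of the one-vs-all classifier for class c
    on sample i, and labels[i] is the true class index of sample i.
    """
    correct = [0] * num_classes
    total = [0] * num_classes
    for row, label in zip(score_rows, labels):
        predicted = max(range(num_classes), key=lambda c: row[c])
        total[label] += 1
        correct[label] += int(predicted == label)
    per_class = [c / t for c, t in zip(correct, total) if t > 0]
    return sum(per_class) / len(per_class)


# Toy scores for four samples and two classes (not results from the paper).
toy_rows = [[2.0, 1.0], [0.5, 1.0], [0.0, 3.0], [0.2, 0.9]]
toy_labels = [0, 0, 1, 1]
toy_accuracy = mean_multiclass_accuracy(toy_rows, toy_labels, num_classes=2)
```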

We first consider methods that relied on encoding motion and appearance information using low-level features, followed by pooling operation to obtain a fixed length vector. These approaches generally extracted variants of motion trajectories to encode motion, e.g. Wang et al. [3] extracted improved trajectories with motion stabilization, Jain et al. [111] compensated for camera motion by removing background motion, Jiang et al. [72] clustered trajectories to model high level motion patterns. We show consistent performance improvements compared to these methods e.g. on HMDB dataset we achieve versus of Wang et al. , of Jain et al. and of Jiang et al. . We also see similar improvements on the Olympic dataset ( of Wang et al. vs by Adaptive LOMo). When compared with a recently proposed method to improve pre-existing features by stacking them at multiple scales [74], we achieve similar results on Olympic ( versus ) and better results on HMDB ( versus ) datasets. Since our method uses the standard iDT we believe that further improvements are possible by using such methods.

We also compare with approaches that additionally encoded temporal or ordering information inside the classification model while training on the above features. On the Olympic dataset we achieve mAP compared to (i) by Gaidon et al. [105] using kernels to compare motion hierarchies in videos, (ii) by Li et al. [75] using dynamic pooling, and (iii) by Liu et al. [114], who modeled complex activities as composed of decomposable actions. In comparison to a recent method by Hoai et al. [83], which used the distribution of classifier scores and relative class scores for classification, we achieve mAP versus on HMDB dataset. On HighFive dataset our performance was lower in comparison to Hoai et al. [83] by a small margin ( for LOMo versus of Hoai et al. ). On HMDB dataset we also show improvement in comparison to a recent method that used rank pooling [82] (instead of mean pooling) over iDT-256-FV features for classification ( relative improvement).

Several recent approaches have relied on using deep architectures for action recognition. Simonyan et al. [4] proposed a two-stream convolutional network for action recognition that made predictions by late fusion over RGB and optical flow based networks. On HMDB dataset we achieve mAP as compared to by this two-stream network. We also outperform methods relying on end-to-end learning of deep features with recurrent neural networks such as soft-attention based LSTMs [96] ( versus ). We also compare our method with a recent method utilizing a fully convolutional LSTM architecture that was built upon both motion and RGB information and can both classify and localize an action [89]. For a fair comparison with their method, we also fuse the classification scores of a model that used softmax scores from a CNN trained on objects as descriptors, by Jain et al. [115]. In comparison to their performance of on HMDB, we achieve without object score fusion and with object score fusion. This is interesting since their method learned an end-to-end architecture from both motion and appearance information and also fuses the score with iDT. Feichtenhofer et al. [113] report a slightly higher performance of with the fusion of an improved two-stream network with iDT features. Recently Carreira and Zisserman [116] showed the advantages of pre-training deep models with much bigger datasets ( classes with or more videos for each class), and reported large performance improvements, e.g. on HMDB dataset. We would expect to see such improvements with the proposed models as well, by using better pre-trained features. In this section we showed that our model yields results that are competitive with existing methods which use similar amounts of training data.

## 5Conclusion

We presented a (loosely) structured latent variable model that discovers prototypical and discriminative sub-events and learns a prior on the order in which they occur in the video. We learned the model with a regularized max-margin hinge loss minimization which we optimize with an efficient stochastic gradient descent based solver. We evaluated our model on four challenging datasets of expression recognition, clinical pain prediction and intent prediction in dyadic conversations, as well as three challenging datasets for human analysis in video which contain a variety of actions, e.g. sports actions, human-human interactions and human-object interactions. We demonstrated with experimental results that the proposed model consistently improves over other competitive baselines based on spatio-temporal pooling and Multiple Instance Learning, while working with one type of state-of-the-art feature. Further, in combination with complementary features, we showed that the model achieves state-of-the-art results on all the facial analysis datasets while being competitive with the state-of-the-art on the human action recognition datasets. We also showed qualitative results demonstrating the improved modeling capabilities of the proposed method for both facial and human analysis tasks. The model is a general ordered sequence prediction model and we aim to extend it to other sequence prediction tasks. Further, the classifier learning is decoupled from the features and, given the recent success of representation learning methods, we plan to explore end-to-end learning of the features and classifier as another future direction.

Acknowledgements. The authors thank Sanjoy Dasgupta and Harpreet Sawhney for discussions. Gaurav Sharma was supported by the Early Career Research Award from SERB India (ECR/2016/001158) and the Research-I foundation at IIT Kanpur. Karan Sikka was supported by NIH grant R01 NR013500. Any opinions, findings, conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the National Institutes of Health.

## 6Appendix

### 6.1Dynamic Programming for Inference

The scoring optimization can be solved exactly using Dynamic Programming (DP). However, in practice we found the DP based solver to be slow and resorted to an approximate but much faster algorithm (see Sec. 3.4). The time complexity of the greedy solution is while that of exact DP solution is , where is the number of frames, is number of events, and is dimensionality of the features. While the greedy solution is clearly a very coarse approximation of the true scoring objective, we found it to be very fast and well performing in the experiments and hence preferred it over the exact dynamic programming based solver. We compared the performance of the greedy and DP based solution on the Olympic dataset and found the running time of the DP based solution to be of greedy solution, while giving comparable final classification performance.

### 6.2Effect of Initialization on LOMo

We studied the effect of initialization on LOMo by evaluating it on the Olympic dataset with iDT features with different random seeds. The mean and standard deviation of the performance were and respectively. This mean performance is comparable to the performance of reported in Tab. 2, and the standard deviation does not change any of our conclusions on the comparison with the baseline methods.

### 6.3Visual Examples from Facial Analysis Task

In order to support our arguments regarding the results in Sec. 4.3.1 we first show the confusion matrices for MIL and LOMo on Oulu dataset in . Based on the individual class performances we observe that LOMo shows substantial improvements in comparison to MIL on classes such as ‘disgust’, ‘happy’, and ‘fear’. In particular the improvement on the ‘disgust’ class (which is also the most confusing class) is significant ( absolute). We believe this improvement results from the ability of LOMo to capture discriminative sub-events specific to the ‘disgust’ class, which makes it easy to discriminate samples from this class from visually similar classes (‘sad’ and ‘fear’). We have shown an example from the ‘disgust’ class in along with detections for individual sub-events with LOMo and MIL. The score obtained by LOMo (1.18) is higher compared to the score obtained by MIL (-0.08). We attribute this to the ability of LOMo to both detect multiple sub-events and to model a prior on their ordering (see Sec. 4.4).

shows detections on an example sequence from the UNBC McMaster dataset where subjects show multiple expressions of pain. The results show that our approach is able to detect such multiple expressions of pain as sub-events.

To better understand the model, we show the frames corresponding to each latent sub-event as identified by LOMo across different subjects. Ideally, each sub-event should correspond to a facial state and thus have a common structure across different subjects. As shown in , we see a common semantic pattern across detected events with the ‘happy’ classifier, where Event 1 seems to be similar to the neutral phase, Event 2 to the onset phase, and Event 3 to the apex phase of facial expression formation. We observed similar qualitative results across other expression classes.

### Footnotes

1. We assume, for brevity, that all videos have the same number of frames; the extension to videos with different numbers of frames is immediate.
2. We tried both VGG-16 and ResNet-152 features and opted for the latter due to its higher performance in these experiments.

### References

1. I. Laptev, M. Marszalek, C. Schmid, and B. Rozenfeld, “Learning realistic human actions from movies,” in CVPR, 2008.
2. A. Gaidon, Z. Harchaoui, and C. Schmid, “Temporal Localization of Actions with Actoms,” PAMI, vol. 35, no. 11, pp. 2782–2795, 2013.
3. H. Wang, D. Oneata, J. Verbeek, and C. Schmid, “A robust and efficient video representation for action recognition,” IJCV, pp. 1–20, 2015.
4. K. Simonyan and A. Zisserman, “Two-stream convolutional networks for action recognition in videos,” in NIPS, 2014.
5. H. Kuehne, H. Jhuang, E. Garrote, T. Poggio, and T. Serre, “HMDB: a large video database for human motion recognition,” in ICCV, 2011.
6. J. C. Niebles, C.-W. Chen, and L. Fei-Fei, “Modeling temporal structure of decomposable motion segments for activity classification,” in ECCV, 2010.
7. P. Scovanner, S. Ali, and M. Shah, “A 3-dimensional sift descriptor and its application to action recognition,” in ACM MM, 2007.
8. A. Kar, N. Rai, K. Sikka, and G. Sharma, “Adascan: Adaptive scan pooling in deep convolutional neural networks for human action recognition in videos,” in CVPR, 2017.
9. K. Sikka, A. Dhall, and M. S. Bartlett, “Classification and weakly supervised pain localization using multiple segment representation,” IVC, vol. 32, no. 10, pp. 659–670, 2014.
10. P. Lucey, J. Cohn, K. Prkachin, P. Solomon, and I. Matthews, “Painful data: The unbc-mcmaster shoulder pain expression archive database,” in FG, 2011.
11. P. Viola, J. Platt, and C. Zhang, “Multiple instance boosting for object detection,” NIPS, 2006.
12. O. Rudovic, V. Pavlovic, and M. Pantic, “Multi-output laplacian dynamic ordinal regression for facial expression recognition and intensity estimation,” in CVPR, 2012.
13. T. Simon, M. H. Nguyen, F. De La Torre, and J. F. Cohn, “Action unit detection with segment-based svms,” in CVPR, 2010.
14. G. Sharma and P. Perez, “Latent max-margin metric learning for comparing video face tubes,” in CVPRW, 2015.
15. S. Andrews, I. Tsochantaridis, and T. Hofmann, “Support vector machines for multiple-instance learning,” in NIPS, 2002.
16. A. Ruiz, J. Van de Weijer, and X. Binefa, “Regularized multi-concept mil for weakly-supervised facial behavior categorization,” in BMVC, 2014.
17. C. Wu, S. Wang, and Q. Ji, “Multi-instance hidden markov model for facial expression recognition,” in FG, 2015.
18. W. Li and N. Vasconcelos, “Multiple instance learning for soft bags via top instances,” in CVPR, 2015.
19. B. Fernando, H. Bilen, E. Gavves, and S. Gould, “Self-supervised video representation learning with odd-one-out networks,” in CVPR, 2017.
20. H.-Y. Lee, J.-B. Huang, M. K. Singh, and M.-H. Yang, “Unsupervised representation learning by sorting sequence,” in ICCV, 2017.
21. P. Felzenszwalb, R. Girshick, D. McAllester, and D. Ramanan, “Object detection with discriminatively trained part based models,” PAMI, vol. 32, no. 9, pp. 1627–1645, 2010.
22. Y. Wang and G. Mori, “Max-margin hidden conditional random fields for human action recognition,” in CVPR, 2009.
23. P. Lucey, J. Cohn, T. Kanade, J. Saragih, Z. Ambadar, and I. Matthews, “The extended cohn-kanade dataset (ck+): A complete dataset for action unit and emotion-specified expression,” in CVPRW, 2010.
24. G. Zhao, X. Huang, M. Taini, S. Z. Li, and M. Pietikäinen, “Facial expression recognition from near-infrared videos,” IVC, vol. 29, no. 9, pp. 607–619, 2011.
25. T. Sheerman-Chase, E.-J. Ong, and R. Bowden, “Feature selection of facial displays for detection of non verbal communication in natural conversation,” in ICCVW, 2009.
26. K. Tang, L. Fei-Fei, and D. Koller, “Learning latent temporal structure for complex event detection,” in CVPR. IEEE, 2012.
27. A. Patron-Perez, M. Marszalek, A. Zisserman, and I. D. Reid, “High five: Recognising human interactions in tv shows,” in BMVC, 2010.
28. K. Sikka, G. Sharma, and M. Bartlett, “Lomo: Latent ordinal model for facial analysis in videos,” in CVPR, 2016.
29. O. M. Parkhi, A. Vedaldi, and A. Zisserman, “Deep face recognition,” in BMVC, 2015.
30. G. B. Huang, M. Ramesh, T. Berg, and E. Learned-Miller, “Labeled faces in the wild: A database for studying face recognition in unconstrained environments,” UMass Amherst, Tech. Rep. 07-49, 2007.
31. F. Alnajar, Z. Lou, J. Alvarez, and T. Gevers, “Expression-invariant age estimation,” in BMVC, 2014.
32. B. Bhattarai, G. Sharma, A. Lechervy, and F. Jurie, “A joint learning approach for cross domain age estimation,” in ICASSP, 2016.
33. J. Lu, X. Zhou, Y.-P. Tan, Y. Shang, and J. Zhou, “Neighborhood repulsed metric learning for kinship verification,” PAMI, vol. 36, no. 2, pp. 331–345, 2014.
34. B. Fasel and J. Luettin, “Automatic facial expression analysis: a survey,” Pattern recognition, vol. 36, no. 1, pp. 259–275, 2003.
35. S. Kaltwang, O. Rudovic, and M. Pantic, “Continuous pain intensity estimation from facial expressions,” Advances in Visual Computing, pp. 368–377, 2012.
36. F. De la Torre and J. F. Cohn, “Facial expression analysis,” in Visual analysis of humans. Springer, 2011, pp. 377–409.
37. K. Sikka, T. Wu, J. Susskind, and M. Bartlett, “Exploring bag of words architectures in the facial expression domain,” in ECCVW, 2012.
38. T. Ojala, M. Pietikäinen, and T. Mäenpää, “Multiresolution gray-scale and rotation invariant texture classification with local binary patterns,” PAMI, vol. 24, no. 7, pp. 971–987, 2002.
39. G. Littlewort, J. Whitehill, T. Wu, I. Fasel, M. Frank, J. Movellan, and M. Bartlett, “The computer expression recognition toolbox (cert),” in FG, 2011.
40. Z. Zhang, M. Lyons, M. Schuster, and S. Akamatsu, “Comparison between geometry-based and gabor-wavelets-based facial expression recognition using multi-layer perceptron,” in FG, 1998.
41. K. Sikka, A. Dhall, and M. Bartlett, “Exemplar hidden markov models for classification of facial expressions in videos,” in CVPRW, 2015.
42. T. Wu, M. S. Bartlett, and J. R. Movellan, “Facial expression recognition using gabor motion energy filters,” in CVPRW, 2010.
43. G. Zhao and M. Pietikainen, “Dynamic texture recognition using local binary patterns with an application to facial expressions,” PAMI, vol. 29, no. 6, pp. 915–928, 2007.
44. G. Littlewort, J. Whitehill, T.-F. Wu, N. Butko, P. Ruvolo, J. Movellan, and M. Bartlett, “The motion in emotion—a cert based approach to the fera emotion challenge,” in FG, 2011.
45. J. J. Lien, T. Kanade, J. F. Cohn, and C.-C. Li, “Automated facial expression recognition based on facs action units,” in FG, 1998.
46. K.-Y. Chang, T.-L. Liu, and S.-H. Lai, “Learning partially-observed hidden conditional random fields for facial expression recognition,” in CVPR, 2009.
47. A. Quattoni, S. Wang, L.-P. Morency, M. Collins, and T. Darrell, “Hidden conditional random fields,” PAMI, vol. 29, no. 10, pp. 1848–1852, 2007.
48. Y. Zhang and Q. Ji, “Active and dynamic information fusion for facial expression understanding from image sequences,” PAMI, vol. 27, no. 5, pp. 699–714, 2005.
49. M. F. Valstar and M. Pantic, “Fully automatic recognition of the temporal phases of facial actions,” IEEE SMC, vol. 42, no. 1, pp. 28–43, 2012.
50. A. Lorincz, L. Jeni, Z. Szabo, J. F. Cohn, T. Kanade et al., “Emotional expression classification using time-series kernels,” in CVPRW, 2013.
51. M. H. Nguyen, L. Torresani, F. de la Torre, and C. Rother, “Weakly supervised discriminative localization and classification: a joint learning process,” in CVPR, 2009.
52. G. Sharma, F. Jurie, and C. Schmid, “Expanded parts model for semantic description of humans in still images,” TPAMI, vol. 39, no. 1, pp. 87–101, 2017.
53. L. Bourdev, S. Maji, and J. Malik, “Describing people: Poselet-based attribute classification,” in ICCV, 2011.
54. L. Bourdev and J. Malik, “Poselets: Body part detectors trained using 3d human pose annotations,” in ICCV, 2009.
55. S. Maji, L. Bourdev, and J. Malik, “Action recognition from a distributed representation of pose and appearance,” in CVPR, 2011.
56. G. Gkioxari, R. Girshick, and J. Malik, “Actions and attributes from wholes and parts,” in ICCV, 2015.
57. G. Sharma and F. Jurie, “Learning discriminative spatial representation for image classification,” in BMVC, 2011.
58. G. Sharma, F. Jurie, and C. Schmid, “Discriminative spatial saliency for image classification,” in CVPR, 2012.
59. H. Kuehne, H. Jhuang, E. Garrote, T. Poggio, and T. Serre, “HMDB: a large video database for human motion recognition,” in ICCV, 2011.
60. J. Zhu, B. Wang, X. Yang, W. Zhang, and Z. Tu, “Action recognition with actons,” in ICCV, 2013.
61. J. Wu, Y. Zhang, and W. Lin, “Towards good practices for action video encoding,” in CVPR, 2014.
62. L. Wang, Y. Qiao, and X. Tang, “Mining motion atoms and phrases for complex action recognition,” in ICCV, 2013.
63. P. Dollár, V. Rabaud, G. Cottrell, and S. Belongie, “Behavior recognition via sparse spatio-temporal features,” in 2005 IEEE International Workshop on Visual Surveillance and Performance Evaluation of Tracking and Surveillance. IEEE, 2005.
64. H. Wang, M. M. Ullah, A. Klaser, I. Laptev, and C. Schmid, “Evaluation of local spatio-temporal features for action recognition,” in BMVC, 2009.
65. G. Csurka, C. R. Dance, L. Fan, J. Willamowski, and C. Bray, “Visual categorization with bags of keypoints,” in Intl. Workshop on Stat. Learning in Comp. Vision, 2004.
66. F. Perronnin, J. Sánchez, and T. Mensink, “Improving the fisher kernel for large-scale image classification,” in ECCV, 2010.
67. X. Wang, L. Wang, and Y. Qiao, “A comparative study of encoding, pooling and normalization methods for action recognition,” in ACCV, 2012.
68. H. Wang, A. Kläser, C. Schmid, and C.-L. Liu, “Dense trajectories and motion boundary descriptors for action recognition,” IJCV, vol. 103, no. 1, pp. 60–79, 2013.
69. J. Sun, Y. Mu, S. Yan, and L.-F. Cheong, “Activity recognition using dense long-duration trajectories,” in ICME, 2010.
70. R. Messing, C. Pal, and H. Kautz, “Activity recognition using the velocity histories of tracked keypoints,” in ICCV. IEEE, 2009.
71. A. Jain, A. Gupta, M. Rodriguez, and L. S. Davis, “Representing videos using mid-level discriminative patches,” in CVPR, 2013.
72. Y.-G. Jiang, Q. Dai, X. Xue, W. Liu, and C.-W. Ngo, “Trajectory-based modeling of human actions with motion reference points,” in ECCV. Springer, 2012.
73. A. Gaidon, Z. Harchaoui, and C. Schmid, “Activity representation with motion hierarchies,” IJCV, vol. 107, no. 3, pp. 219–238, 2014.
74. Z. Lan, M. Lin, X. Li, A. G. Hauptmann, and B. Raj, “Beyond gaussian pyramid: Multi-skip feature stacking for action recognition,” in CVPR, 2015.
75. W. Li, Q. Yu, A. Divakaran, and N. Vasconcelos, “Dynamic pooling for complex event recognition,” in CVPR, 2013.
76. W. Brendel and S. Todorovic, “Learning spatiotemporal graphs of human activities,” in ICCV, 2011.
77. M. Raptis and L. Sigal, “Poselet key-framing: A model for human activity recognition,” in CVPR, 2013.
78. H. Bilen, V. P. Namboodiri, and L. Van Gool, “Object and action classification with latent variables,” in BMVC, vol. 25, no. 17, 2011, pp. 1–11.
79. S. Satkin and M. Hebert, “Modeling the temporal extent of actions,” in ECCV, 2010.
80. J. Liu, B. Kuipers, and S. Savarese, “Recognizing human actions by attributes,” in CVPR, 2011.
81. S. Ma, L. Sigal, and S. Sclaroff, “Space-time tree ensemble for action recognition,” in CVPR, 2015.
82. B. Fernando, E. Gavves, J. M. Oramas, A. Ghodrati, and T. Tuytelaars, “Modeling video evolution for action recognition,” in CVPR, 2015, pp. 5378–5387.
83. M. Hoai and A. Zisserman, “Improving human action recognition using score distribution and ranking,” in ACCV. Springer, 2014.
84. A. Krizhevsky, I. Sutskever, and G. E. Hinton, “Imagenet classification with deep convolutional neural networks,” in NIPS, 2012.
85. K. Simonyan and A. Zisserman, “Very deep convolutional networks for large-scale image recognition,” ICLR, 2015.
86. K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” arXiv preprint arXiv:1512.03385, 2015.
87. D. Tran, L. Bourdev, R. Fergus, L. Torresani, and M. Paluri, “Learning spatiotemporal features with 3d convolutional networks,” in ICCV, 2015.
88. N. Ballas, L. Yao, C. Pal, and A. Courville, “Delving deeper into convolutional networks for learning video representations,” arXiv preprint arXiv:1511.06432, 2015.
89. Z. Li, E. Gavves, M. Jain, and C. G. Snoek, “Videolstm convolves, attends and flows for action recognition,” arXiv preprint arXiv:1607.01794, 2016.
90. M. Rohrbach, S. Amin, M. Andriluka, and B. Schiele, “A database for fine grained activity detection of cooking activities,” in CVPR, 2012.
91. C. Feichtenhofer, A. Pinz, and A. Zisserman, “Convolutional two-stream network fusion for video action recognition,” arXiv preprint arXiv:1604.06573, 2016.
92. W. Zhu, J. Hu, G. Sun, X. Cao, and Y. Qiao, “A key volume mining deep framework for action recognition,” in CVPR, 2016.
93. S. Ji, W. Xu, M. Yang, and K. Yu, “3d convolutional neural networks for human action recognition,” PAMI, vol. 35, no. 1, pp. 221–231, 2013.
94. A. Karpathy, G. Toderici, S. Shetty, T. Leung, R. Sukthankar, and L. Fei-Fei, “Large-scale video classification with convolutional neural networks,” in CVPR, 2014.
95. J. Yue-Hei Ng, M. Hausknecht, S. Vijayanarasimhan, O. Vinyals, R. Monga, and G. Toderici, “Beyond short snippets: Deep networks for video classification,” in CVPR, 2015.
96. S. Sharma, R. Kiros, and R. Salakhutdinov, “Action recognition using visual attention,” arXiv preprint arXiv:1511.04119, 2015.
97. J. Donahue, L. Anne Hendricks, S. Guadarrama, M. Rohrbach, S. Venugopalan, K. Saenko, and T. Darrell, “Long-term recurrent convolutional networks for visual recognition and description,” in CVPR, 2015.
98. M. Pandey and S. Lazebnik, “Scene recognition and weakly supervised object localization with deformable part-based models,” in Computer Vision (ICCV), 2011 IEEE International Conference on. IEEE, 2011, pp. 1307–1314.
99. H. Bilen, V. Namboodiri, and L. Van Gool, “Classification with global, local and shared features,” Pattern Recognition, pp. 134–143, 2012.
100. X. Xiong and F. De la Torre, “Supervised descent method and its applications to face alignment,” in CVPR, 2013.
101. A. Vedaldi and B. Fulkerson, “VLFeat: An open and portable library of computer vision algorithms,” http://www.vlfeat.org/, 2008.
102. Y. Ke and R. Sukthankar, “Pca-sift: A more distinctive representation for local image descriptors,” in CVPR, 2004.
103. S. Lazebnik, C. Schmid, and J. Ponce, “Beyond bags of features: Spatial pyramid matching for recognizing natural scene categories,” in CVPR, 2006.
104. P. Turaga, R. Chellappa, V. S. Subrahmanian, and O. Udrea, “Machine recognition of human activities: A survey,” TCSVT, vol. 18, no. 11, pp. 1473–1488, 2008.
105. A. Gaidon, Z. Harchaoui, and C. Schmid, “Activity representation with motion hierarchies,” IJCV, vol. 107, no. 3, pp. 219–238, 2014.
106. A. Kläser, M. Marszałek, and C. Schmid, “A spatio-temporal descriptor based on 3d-gradients,” in BMVC, 2008.
107. M. Liu, S. Shan, R. Wang, and X. Chen, “Learning expressionlets on spatio-temporal manifold for dynamic facial expression recognition,” in CVPR, 2014.
108. Y. Guo, G. Zhao, and M. Pietikäinen, “Dynamic facial expression recognition using longitudinal facial expression atlases,” in ECCV, 2012.
109. P. Lucey, J. Howlett, J. Cohn, S. Lucey, S. Sridharan, and Z. Ambadar, “Improving pain recognition through better utilisation of temporal information,” in AVSP, 2008.
110. M. Hoai and A. Zisserman, “Talking heads: Detecting humans and recognizing their interactions,” in CVPR, 2014.
111. M. Jain, H. Jegou, and P. Bouthemy, “Better exploiting motion for better action recognition,” in CVPR, 2013.
112. H. Bilen, B. Fernando, E. Gavves, A. Vedaldi, and S. Gould, “Dynamic image networks for action recognition,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 3034–3042.
113. C. Feichtenhofer, A. Pinz, and A. Zisserman, “Convolutional two-stream network fusion for video action recognition,” in CVPR, 2016.
114. C. Liu, X. Wu, and Y. Jia, “A hierarchical video description for complex activity understanding,” IJCV, pp. 1–16, 2016.
115. M. Jain, J. C. van Gemert, and C. G. Snoek, “What do 15,000 object categories tell us about classifying and localizing actions?” in CVPR, 2015.
116. J. Carreira and A. Zisserman, “Quo vadis, action recognition? A new model and the Kinetics dataset,” in CVPR, 2017.