Weakly Supervised Learning of Heterogeneous Concepts in Videos

Sohil Shah, Kuldeep Kulkarni, Arijit Biswas, Ankit Gandhi, Om Deshmukh, Larry Davis
University of Maryland, College Park; Arizona State University; Xerox Research Centre India
Abstract

Typical textual descriptions that accompany online videos are ‘weak’: i.e., they mention the main concepts in the video but not their corresponding spatio-temporal locations. The concepts in the description are typically heterogeneous (e.g., objects, persons, actions). Certain location constraints on these concepts can also be inferred from the description. The goal of this paper is to present a generalization of the Indian Buffet Process (IBP) that can (a) systematically incorporate heterogeneous concepts in an integrated framework, and (b) enforce location constraints, for efficient classification and localization of the concepts in the videos. Finally, we develop posterior inference for the proposed formulation using mean-field variational approximation. Comparative evaluations on the Casablanca and the A2D datasets show that the proposed approach significantly outperforms other state-of-the-art techniques: 24% relative improvement for pairwise concept classification in the Casablanca dataset and 9% relative improvement for localization in the A2D dataset as compared to the most competitive baseline.

1 Introduction

Watching and sharing videos on social media has become an integral part of everyday life. We are often intrigued by the textual description of a video and want to fast-forward to the segments of interest without watching the entire video. However, these textual descriptions usually do not specify the exact segment of the video associated with a particular description. For example, someone describing a movie clip as “head-on collision between cars while Chris Cooper is driving” neither provides the time-stamps for the collision or driving events nor the spatial locations of the cars or Chris Cooper. Such descriptions are referred to as ‘weak labels’. For efficient video navigation and consumption, it is important to automatically determine the spatio-temporal locations of these concepts (such as ‘collision’ or ‘cars’). However, it is prohibitively expensive to train concept-specific models for all concepts of interest in advance and use them for localization. This shortcoming has triggered a great amount of interest in jointly learning concept-specific classification models and localizing concepts from multiple weakly labeled images [1, 2, 3] or videos [4, 5].

Video descriptions include concepts which may refer to persons, objects, scenes and/or actions and thus a typical description is a combination of heterogeneous concepts. In the running example, extracted heterogeneous concepts are ‘car’ (object), ‘head-on collision’ (action), ‘Chris Cooper’ (person) and ‘driving’ (action). Learning classifiers for these heterogeneous concepts along with localization is an extremely challenging task because: (a) the classifiers for different kinds of concepts are required to be learned simultaneously, e.g., a face classifier, an object classifier, an action classifier etc., and (b) the learning model must take into account the spatio-temporal location constraints imposed by the descriptions while learning these classifiers. For example, the concepts ‘head-on collision’ and ‘cars’ should spatio-temporally co-occur at least once and there should be at least one car in the video.

Recently there has been growing interest in jointly learning concept classifiers from weak labels [1, 5]. Bojanowski et al. [5] proposed a discriminative clustering framework to jointly learn person and action models from movies using the weak supervision provided by movie scripts. Since the weak labels are extracted from scripts, each label can be associated with a particular shot in the movie, which may last only a few seconds; i.e., the labels are well localized, and that makes the overall learning easier. However, in real-world videos one does not have access to such shot-level labels but only to video-level labels. Therefore, in our work, we do not assume the availability of such well-localized labels and tackle the more general problem of learning concepts from the weaker video-level labels. The framework in [5], when extended to long videos, does not give satisfactory results (see section 4). Such techniques, which are based on a linear mapping from features to labels and model the background using only a single latent factor, are usually inadequate to capture all the inter-class and intra-class variations. Shi et al. [1] jointly learn object and attribute classifiers from images using a weakly supervised Indian Buffet Process (IBP). Note that IBP [6, 7] allows observed features to be explained by a countably infinite number of latent factors. However, the framework in [1] is not designed to handle heterogeneous concepts and location constraints, which leads to a significant degradation in performance (section 4.3). [8] and [9] propose IBP-based cross-modal categorization/query image retrieval models which learn semantically meaningful abstract features from multimodal (image, speech and text) data. However, these unsupervised approaches do not incorporate any location constraints, which naturally arise in the weakly supervised setting with heterogeneous labels.

We propose a novel Bayesian Non-parametric (BNP) approach called WSC-SIIBP (Weakly Supervised, Constrained & Stacked Integrative IBP) to jointly learn heterogeneous concept classifiers and localize these concepts in videos. BNP models are a class of Bayesian models where the hidden structure that may have generated the observed data is not assumed to be fixed. Instead, a framework is provided that allows the complexity of the model to increase as more data is observed  [10]. Specifically, we propose:

  1. A novel generalization of IBP which for the first time incorporates weakly supervised spatio-temporal location constraints and heterogeneous concepts in an integrated framework.

  2. Posterior inference of WSC-SIIBP model using mean-field variational approximation.

We assume that the weak video labels come in the form of tuples: in the running example, the extracted heterogeneous concept tuples are ({car, head-on collision}, {Chris Cooper, driving}). (Extracting the concept tuples from textual descriptions of the videos is an interesting research problem in itself and is beyond the scope of this paper.) We perform experiments on two video datasets: (a) the Casablanca movie dataset [5] and (b) the A2D dataset [11]. We show that the proposed approach WSC-SIIBP outperforms several state-of-the-art methods for heterogeneous concept classification and localization in a weakly supervised setting. For example, WSC-SIIBP leads to relative improvements of 7%, 5% and 24% on person, action and pairwise classification accuracies, respectively, over the most competitive baselines on the Casablanca dataset. Similarly, the relative improvement in localization accuracy is 9% over the next best approach on the A2D dataset.

2 Related Work

Figure 1: Pipeline of WSC-SIIBP. Multiple videos with heterogeneous weak labels are provided as input, and the concepts are localized and classified in these videos.

In this section, we discuss relevant prior work in two broad categories.

Weakly Supervised Learning: Localizing concepts and learning classifiers from weakly annotated data is an active research topic. Researchers have learned models for various concepts from weakly labeled videos using Multi-Instance Learning (MIL) [12, 13] for human action recognition [14], visual tracking [15], etc. Cour et al. [16] use a novel convex formulation to learn face classifiers from movies and TV series using multimodal features obtained from finely aligned screenplay, speech and video data. In [17, 4], the authors propose discriminative clustering approaches for aligning videos with temporally ordered text descriptions or predefined tags, and in the process also learn action classifiers. In our approach, we consider weak labels which are neither ordered nor aligned to any specific video segment. [18] proposes a method for learning object class detectors from real-world web videos known to contain only the target class by formulating the problem as a domain adaptation task. [19] learns weakly supervised object/action classifiers using a latent-SVM formulation where the objects or actions are localized in training images/videos using latent variables. We note that both [18, 19] consider only a single weak label per video and, unlike our approach, do not jointly learn heterogeneous concepts. The authors in [20, 21] use dialogues, scene and character identification to find an optimal mapping between a book and movie shots using a shortest-path or CRF approach. However, these approaches neither jointly model heterogeneous concepts nor spatio-temporally localize them. Although [22] proposes a discriminative clustering model for coreference resolution in videos, only faces are considered in their experiments.

Heterogeneous concept learning: There is prior work on automatic image [23, 24, 25] and video [26, 27, 28] caption generation, where models are trained on pairs of images/videos and text that contain heterogeneous concept descriptions in order to predict captions for novel images/videos. While most of these approaches rely on deep learning methods to learn a mapping between an image/video and the corresponding text description, [25] uses MIL to learn visual concept detectors (spatial localization in images) for nouns, verbs and adjectives. However, none of these approaches spatio-temporally localize points of interest in videos. Perhaps the available video datasets are not large enough to train such a weakly supervised deep learning model.

To the best of our knowledge, there is no prior work that jointly classifies and localizes heterogeneous concepts in weakly labeled videos.

3 WSC-SIIBP: Model and Algorithm

In this section, we describe the details of WSC-SIIBP (see figure 1 for the pipeline). We first introduce notation and motivate our approach in sections 3.1 and 3.2, respectively. This is followed by section 3.3, where we introduce the stacked non-parametric graphical model (IBP) and its corresponding posterior computation. In sections 3.4 and 3.5, we formulate an extension of the stacked IBP model which generalizes to heterogeneous concepts and incorporates the constraints obtained from the weak labels. In section 3.6, we briefly describe the inference procedure using a truncated mean-field variational approximation and summarize the entire algorithm. Finally, we discuss how one can classify and localize concepts in new test videos using WSC-SIIBP.

3.1 Notation

Assume we are given a set of weakly labeled videos denoted by $\{(\mathcal{X}_i, \mathcal{Y}_i)\}_{i=1}^{N}$, where $\mathcal{X}_i$ indicates a video and $\mathcal{Y}_i$ denotes the heterogeneous weak labels corresponding to the $i$-th video. Although the proposed approach can be used for any number of heterogeneous concepts, for readability, we restrict ourselves to two concepts and call them subjects and actions. We also have a closed set of class labels for these heterogeneous concepts: $\mathcal{S} = \{s_1, \ldots, s_{M_1}, s_{M_1+1}\}$ for subjects and $\mathcal{A} = \{a_1, \ldots, a_{M_2}, a_{M_2+1}\}$ for actions. Let $s_{M_1+1}$ and $a_{M_2+1}$ indicate that the corresponding subject or action class label is not present, and let $N$ represent the number of videos. The video-level annotation $\mathcal{Y}_i \subseteq \mathcal{S} \times \mathcal{A}$ simply indicates that the paired concepts can occur anywhere in the video and at multiple locations.

Assume that $n_i$ spatio-temporal tracks are extracted from each video $i$, where each track $j$ is represented as an aggregation of multiple local features, $x_{ij} \in \mathbb{R}^{D}$. The spatio-temporal tracks could be face tracks, 3-D object proposals or action proposals (see section 4.1 for more details). We associate the $j$-th track in video $i$ with an infinite binary latent coefficient vector $z_{ij} \in \{0,1\}^{\infty}$ [6, 1]. Each video $i$ is represented by a bag of spatio-temporal tracks $\mathcal{X}_i = \{x_{ij}\}_{j=1}^{n_i}$. Similarly, $Z_i = \{z_{ij}\}_{j=1}^{n_i}$.
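For concreteness, the following minimal Python sketch shows one way to organize this data; the class and field names are our own illustration, not part of the model.

```python
import numpy as np
from dataclasses import dataclass, field
from typing import List, Tuple

@dataclass
class WeaklyLabeledVideo:
    """Video i: a bag of spatio-temporal tracks plus video-level weak labels."""
    # One row per track: n_i x D_1 subject features and n_i x D_2 action
    # features, each aggregated from local descriptors.
    subject_feats: np.ndarray                       # shape (n_i, D1)
    action_feats: np.ndarray                        # shape (n_i, D2)
    # Weak labels: (subject_id, action_id) pairs that occur *somewhere*
    # in the video, with no spatio-temporal localization.
    labels: List[Tuple[int, int]] = field(default_factory=list)

    @property
    def n_tracks(self) -> int:
        return self.subject_feats.shape[0]

# Toy example: a video with 5 tracks and one weak label pair
video = WeaklyLabeledVideo(
    subject_feats=np.random.randn(5, 32),
    action_feats=np.random.randn(5, 32),
    labels=[(3, 1)],  # e.g., (Chris Cooper, driving)
)
```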

3.2 Motivation

Our objective is to learn (a) a mapping between each of the tracks in video $\mathcal{X}_i$ and the labels in $\mathcal{Y}_i$ and (b) an appearance model for each label identity, such that tracks from new test videos can be classified. To achieve this, it is important for any model to discover the latent factors that can explain similar tracks across a set of videos with a particular label. In general, the number of latent factors is not known a priori and must be inferred from the data. In a Bayesian framework, IBP treats this number as a random variable that can grow with new observations, thus allowing the model to effectively explain the unbounded complexity in the data. Specifically, IBP defines a prior distribution over an equivalence class of binary matrices with a bounded number of rows (indicating spatio-temporal tracks) and infinitely many columns (indicating latent coefficients). To achieve our goals, we build on IBP and introduce the WSC-SIIBP model, which can effectively learn the latent factors corresponding to each heterogeneous concept and utilize prior location constraints to reduce ambiguity in learning through the knowledge of other latent coefficients.

3.3 Indian Buffet Process (IBP)

The spatio-temporal tracks in the videos are assumed to be generated by an underlying generative process. Specifically, we consider a stacked IBP model [1] as described below.

  • For each latent factor $k \in \{1, 2, \ldots\}$,

    1. Draw an appearance distribution with mean $A_k \sim \mathcal{N}(0, \sigma_A^2 I)$

  • For each video $i \in \{1, \ldots, N\}$,

    1. Draw a sequence of i.i.d. random variables, $v_{i1}, v_{i2}, \ldots \sim \mathrm{Beta}(\alpha, 1)$

    2. Construct the prior on the latent factors, $\pi_{ik} = \prod_{l=1}^{k} v_{il}$,

    3. For each track $j$ in the video, where $j \in \{1, \ldots, n_i\}$,

      1. Sample the state of each latent factor, $z_{ijk} \sim \mathrm{Bern}(\pi_{ik})$,

      2. Sample the track appearance, $x_{ij} \sim \mathcal{N}(z_{ij} A, \sigma_x^2 I)$

where $\alpha$ is the prior controlling the sparsity of the latent factors, and $\sigma_A^2$ and $\sigma_x^2$ are the prior appearance and noise variances shared across all factors, respectively. Each $A_k$ forms row $k$ of the matrix $A$, and the value of the latent coefficient $z_{ijk}$ indicates whether data $x_{ij}$ contains the latent factor $k$ or not. In the above model, we have used the stick-breaking construction [29] to generate the $\pi_{ik}$'s.
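The generative story above can be simulated directly. The sketch below samples from a truncated version of the stacked IBP; the truncation level K and all parameter values are illustrative assumptions, not the paper's settings.

```python
import numpy as np

def sample_stacked_ibp(n_videos, n_tracks, K, D, alpha=0.5,
                       sigma_A=1.0, sigma_x=0.1, seed=0):
    """Toy sampler for a truncated stacked-IBP generative process."""
    rng = np.random.default_rng(seed)
    A = rng.normal(0.0, sigma_A, size=(K, D))   # appearance means A_k (rows of A)
    videos = []
    for _ in range(n_videos):
        v = rng.beta(alpha, 1.0, size=K)        # v_ik ~ Beta(alpha, 1)
        pi = np.cumprod(v)                      # pi_ik = prod_{l<=k} v_il
        # z_ijk ~ Bern(pi_ik); x_ij ~ N(z_ij A, sigma_x^2 I)
        Z = (rng.random((n_tracks, K)) < pi).astype(float)
        X = Z @ A + rng.normal(0.0, sigma_x, size=(n_tracks, D))
        videos.append((Z, X))
    return A, videos

A, videos = sample_stacked_ibp(n_videos=3, n_tracks=10, K=20, D=32)
```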
Posterior: Now, we describe how the posterior is obtained for the above graphical model. Let $W = \{\pi, Z, A\}$ and $\Theta = \{\alpha, \sigma_A^2, \sigma_x^2\}$ denote the hidden variables and prior parameters, respectively. $X$ denotes the concatenation of all the spatio-temporal tracks in all videos, $X = [X_1, \ldots, X_N]$. Given the prior distribution $p_0(W|\Theta)$ and the likelihood function $p(X|W,\Theta)$, the posterior probability is given by (using Bayes theorem),

$p(W|X,\Theta) = \dfrac{p_0(W|\Theta)\, p(X|W,\Theta)}{p(X|\Theta)}$ (1)

where $p(X|\Theta)$ is the marginal likelihood. For simplicity, we denote $p(W|X,\Theta)$ as $p(W|X)$. Apart from the significance of inferring $Z$ for identifying track-level labels, inferring the prior $\pi_i$ for each video helps to identify video-level labels, while the inference of the appearance model $A$ will be used to classify new test samples (see section 3.6). Thus, learning in our model requires computing the full posterior distribution over $W$.
Regularized posterior: We note that it is difficult to infer regularized posterior distributions directly from Equation (1). Zellner [30] demonstrated that the posterior distribution in (1) can also be obtained as the solution of the following optimization problem,

$\min_{q(W) \in \mathcal{P}} \ \mathrm{KL}\big(q(W)\,\|\,p_0(W|\Theta)\big) - \mathbb{E}_{q(W)}\big[\log p(X|W,\Theta)\big]$ (2)

where $\mathrm{KL}(\cdot\,\|\,\cdot)$ denotes the Kullback–Leibler divergence and $\mathcal{P}$ is the probability simplex. As we will see later, this view enables us to learn the posterior distribution within a constrained optimization framework.

3.4 Integrative IBP

Our objective is to model heterogeneous concepts (such as subjects and actions) using a graphical model. However, the IBP model described above cannot handle multiple concepts, because it is highly unlikely that the subject and the action features can be explained by the same statistical model. Hence, we propose an extension of stacked IBP for heterogeneous concepts, where different concept types are modeled using different appearance models.

Let the subject and action features corresponding to the spatio-temporal track $j$ in video $i$ be denoted by $x_{ij}^s$ and $x_{ij}^a$, respectively, each having a different dimension ($D_1$ and $D_2$). (We often use the superscript $c \in \{s, a\}$ in place of $s$ and $a$ throughout the paper.) Unlike the IBP model, $x_{ij}^s$ and $x_{ij}^a$ are now represented using two different Gaussian noise models, $\mathcal{N}(z_{ij} A^s, \sigma_s^2 I)$ and $\mathcal{N}(z_{ij} A^a, \sigma_a^2 I)$ respectively, where $\sigma_c^2$ denotes the prior noise variance and $A^s$, $A^a$ are matrices ($K \times D_1$ and $K \times D_2$). The means of the subject and action appearance models for each latent factor are also sampled independently from Gaussian distributions with different variances $\sigma_{A^s}^2$ and $\sigma_{A^a}^2$. The new posterior probability is given by,

$p(W|X^s, X^a, \Theta) = \dfrac{p_0(W|\Theta)\, p(X^s|W,\Theta)\, p(X^a|W,\Theta)}{p(X^s, X^a|\Theta)}$ (3)

3.5 Integrative IBP with Constraints

Although the graphical model described above is capable of handling heterogeneous features, the location constraints inferred from the weak labels still need to be incorporated into the graphical model. As motivated in section 1, the concepts ‘head-on collision’ and ‘cars’ should spatio-temporally co-occur at least once and there should be at least one car in the full video. Imposing these location constraints in the inference algorithm can lead to more accurate parameter estimation of the graphical model and faster convergence of the inference procedure. These constraints can be generalized as follows,

  1. Every label tuple in $\mathcal{Y}_i$ is associated with at least one spatio-temporal track (i.e., the event occurs in the video).

  2. Spatio-temporal tracks should be assigned labels only from the list of weak labels assigned to the video. Concepts present in the video but not in the label set will be subsumed in the background models.

Ideally, in the case of noiseless labels, these constraints should be strictly enforced. However, real-world labels can be noisy, and we assume this noise is independent of the videos. Hence, we allow the constraints to be violated but penalize the violations using additional slack variables.

We associate the first $M_1$ latent factors (the rows of $A^s$ and $A^a$) with the subject classes in $\mathcal{S}$ and the following $M_2$ latent factors with the action classes in $\mathcal{A}$. The inferred values of their corresponding latent coefficients in $z_{ij}$ are used to determine the presence/absence of the associated concept in a particular spatio-temporal track. The remaining unbounded number of latent factors are used to explain away the background tracks from unknown action and subject classes in a video. With these assignments, we enforce the following constraints on the latent factors, which are sufficient to satisfy the conditions mentioned earlier.

To satisfy condition 1, we introduce the following constraints, $\forall i$ and $\forall (s_p, a_q) \in \mathcal{Y}_i$,

$\sum_{j=1}^{n_i} z_{ijp} \ \ge\ 1 - \xi_{ipq}$ (4)
$\sum_{j=1}^{n_i} z_{ij(M_1+q)} \ \ge\ 1 - \xi_{ipq}$ (5)
$\sum_{j=1}^{n_i} z_{ijp}\, z_{ij(M_1+q)} \ \ge\ 1 - \xi_{ipq}$ (6)

where $\xi_{ipq} \ge 0$ is the slack variable, and $z_{ijp}$ and $z_{ij(M_1+q)}$ are the latent factor coefficients corresponding to subject class $s_p$ and action class $a_q$ respectively.

To satisfy condition 2, we use the following constraints, $\forall i$ and $\forall j \in \{1, \ldots, n_i\}$,

$z_{ijp} = 0, \quad \forall p : s_p \notin \mathcal{Y}_i$ (7)
$z_{ij(M_1+q)} = 0, \quad \forall q : a_q \notin \mathcal{Y}_i$ (8)

The constraints defined in (4)-(8) have been used in the context of discriminative clustering [22, 5]. However, our model is the first to use these constraints in a Bayesian setup. In their simplest form, they can be enforced using a point estimate of $Z$, e.g., a MAP estimate. However, the posterior over $Z$ is defined over the entire probability space. To enforce the above constraints in a Bayesian framework, we need to account for the uncertainty in $Z$. Following [31, 32], we define effective constraints as expectations of the original constraints in (4)-(8), where the expectation is computed w.r.t. the posterior distribution in (3) (see supplementary material for the expectation constraints). The proposed graphical model, incorporating heterogeneous concepts as well as the location constraints provided by the weak labels, is shown in figure 2.
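To illustrate how the expectation constraints behave under the mean-field posterior introduced in section 3.6, the sketch below evaluates the slack incurred by constraints (4)-(6) and the masking implied by (7)-(8) on a matrix of posterior means $\nu$. The factor-indexing convention (subjects first, then actions) follows section 3.5; the function names are ours.

```python
import numpy as np

def constraint_hinge_penalty(nu, labels, M1):
    """Total hinge slack for constraints (4)-(6) of one video.

    nu:     (n_tracks, K) posterior means E_q[z_ijk]
    labels: weak label pairs (p, q); subject p -> factor p,
            action q -> factor M1 + q (our indexing assumption).
    """
    penalty = 0.0
    for p, q in labels:
        subj = nu[:, p]                                # subject factor p
        act = nu[:, M1 + q]                            # action factor M1+q
        penalty += max(0.0, 1.0 - subj.sum())          # constraint (4)
        penalty += max(0.0, 1.0 - act.sum())           # constraint (5)
        penalty += max(0.0, 1.0 - (subj * act).sum())  # constraint (6)
    return penalty

def mask_unlabeled(nu, labels, M1, M2):
    """Constraints (7)-(8): zero the factors of concepts absent from the
    video's weak label set."""
    allowed_s = {p for p, _ in labels}
    allowed_a = {q for _, q in labels}
    for p in range(M1):
        if p not in allowed_s:
            nu[:, p] = 0.0
    for q in range(M2):
        if q not in allowed_a:
            nu[:, M1 + q] = 0.0
    return nu
```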

Figure 2: WSC-SIIBP: graphical model using two heterogeneous concepts, subjects and actions. Each video $i$ (described by video-level labels $\mathcal{Y}_i$) is independently modeled using latent factor prior $\pi_i$ and contains $n_i$ tracks. Each track $j$ is represented using subject and action features $x_{ij}^s$ and $x_{ij}^a$ respectively, which are modeled using Gaussian appearance models $A^s$ and $A^a$. $z_{ij}$ are the binary latent variables indicating the presence or absence of the latent factors in each track. $\mathcal{C}_i$ denotes the set of location constraints extracted from the video labels.

We restrict the search space for the posterior distribution in Equation (3) using the expectation constraints. To obtain the regularized posterior distribution of the proposed model, we solve the following optimization problem under these expectation constraints,

$\min_{q(W) \in \mathcal{P},\, \xi \ge 0} \ \mathrm{KL}\big(q(W)\,\|\,p(W|X^s, X^a, \Theta)\big) + C \sum_{i} \sum_{(s_p, a_q) \in \mathcal{Y}_i} \xi_{ipq}$ (9)

3.6 Learning and Inference

Note that variational inference for the true posterior (in Equation (3)) is intractable over the general space of probability functions. To make the problem tractable, we adopt a truncated mean-field variational approximation [29] to the desired posterior $p(W|X^s, X^a, \Theta)$, such that the search space is constrained to the following tractable parametrized family of distributions,

$q(W) = \prod_{k=1}^{K} q_{\phi_k^s}(A_k^s)\, q_{\phi_k^a}(A_k^a) \prod_{i=1}^{N} \prod_{k=1}^{K} q_{\tau_{ik}}(v_{ik}) \prod_{j=1}^{n_i} q_{\nu_{ijk}}(z_{ijk})$ (10)

where $q_{\tau_{ik}}(v_{ik}) = \mathrm{Beta}(v_{ik}; \tau_{ik1}, \tau_{ik2})$, $q_{\phi_k^c}(A_k^c) = \mathcal{N}(A_k^c; \phi_k^c, \Phi_k^c)$ and $q_{\nu_{ijk}}(z_{ijk}) = \mathrm{Bern}(z_{ijk}; \nu_{ijk})$. In Equation (10), we note that all the latent variables are modeled independently of all other variables, hence simplifying the inference procedure. The truncated stick-breaking process of the $\pi_{ik}$'s is bounded at $K = M_1 + M_2 + K_b$, wherein $\pi_{ik} = 0$ for $k > K$. $K_b$ indicates the number of latent factors chosen to explain background tracks.
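A minimal sketch of how the variational parameters of the family in Equation (10) could be laid out in code; the shapes and initial values are our assumptions (the actual closed-form updates are derived in the supplementary material).

```python
import numpy as np

def init_variational_params(n_videos, n_tracks, K, D1, D2, seed=0):
    """Allocate the truncated mean-field parameters of Equation (10).

    Per video i: Beta parameters (tau1, tau2) for the stick weights and
    Bernoulli means nu for the latent coefficients. Shared across videos:
    Gaussian means phi (with isotropic covariance scales Phi) of the
    appearance factors for each concept type c in {subject, action}.
    """
    rng = np.random.default_rng(seed)
    return {
        "tau1": np.ones((n_videos, K)),           # q(v_ik) = Beta(tau1, tau2)
        "tau2": np.ones((n_videos, K)),
        "nu":   rng.uniform(0.1, 0.9, (n_videos, n_tracks, K)),  # q(z) = Bern(nu)
        "phi_s": rng.normal(0, 0.01, (K, D1)),    # q(A_k^s) mean
        "phi_a": rng.normal(0, 0.01, (K, D2)),    # q(A_k^a) mean
        "Phi_s": np.ones(K),                      # isotropic covariance scales
        "Phi_a": np.ones(K),
    }
```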

The optimization problem in Equation (9) is solved by searching over posterior distributions of the form in Equation (10). We obtain the parameters $\tau_{ik1}$, $\tau_{ik2}$, $\phi_k^c$, $\Phi_k^c$ and $\nu_{ijk}$ of the optimal posterior distribution (see supplementary material for details) using iterative update rules, as summarized in Algorithm 1. The mean of the binary latent coefficients $z_{ijk}$, denoted by $\nu_{ijk}$, has an update rule which leads to several interesting observations,

$\nu_{ijk} = \dfrac{\delta_{ik}}{1 + e^{-\varsigma_{ijk}}}$ (11)

$\varsigma_{ijk} = \underbrace{\text{(prior and likelihood terms)}}_{\text{see supplementary material}} + \underbrace{\text{(i)}}_{\text{from (4)}} + \underbrace{\text{(ii)}}_{\text{from (5)}} + \underbrace{\text{(iii)}}_{\text{from (6)}}$ (12)

where $\psi(\cdot)$ is the digamma function (appearing in the prior terms of (12)), $\mathbb{1}[\cdot]$ is an indicator function, $\delta_{ik}$ is an indicator variable, and $\mathcal{L}_{ik}$ is a lower bound for $\mathbb{E}_q\big[\log\big(1 - \prod_{l=1}^{k} v_{il}\big)\big]$ (cf. Equation (S31)). The indicator $\delta_{ik}$ denotes whether a concept (action/subject) is part of the video label set $\mathcal{Y}_i$ or not. If $\delta_{ik} = 0$, all the corresponding binary latent coefficients $\nu_{ijk}$, $\forall j$, are forced to 0, which is equivalent to enforcing the constraints in Equations (7) and (8). Note that the value of $\nu_{ijk}$ increases with $\varsigma_{ijk}$. The terms (i)-(iii) in the update rule for $\varsigma_{ijk}$ (Equation (12)), which arise from the location constraints in Equations (4)-(6), act as coupling terms between the $\nu$'s. For example, for any action concept, term (ii) suggests that if the location constraints are not satisfied, better localization of all the coupled subject concepts (a high value of $\nu_{ijp}$) will drive up the value of $\nu_{ij(M_1+q)}$. This implies that strong localization of one concept can lead to better localization of other concepts.
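In code, the masking behavior of Equation (11) amounts to a sigmoid gated by the label indicator. A sketch, assuming the scores $\varsigma$ have already been assembled from the prior, likelihood and coupling terms:

```python
import numpy as np

def update_nu(varsigma, delta):
    """Sketch of the nu update implied by Equation (11).

    varsigma: (n_tracks, K) pre-activation scores for one video
    delta:    (K,) 0/1 indicators; delta[k] = 0 when factor k's concept is
              absent from the weak label set, forcing nu to 0 there
              (constraints (7)-(8)).
    """
    return delta * (1.0 / (1.0 + np.exp(-varsigma)))  # nu rises with varsigma
```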

The hyperparameters $\sigma_c^2$ and $\sigma_{A^c}^2$ can be set a priori or estimated from the data. Similar to the maximization step of the EM algorithm, their empirical estimates can easily be obtained by maximizing the expected log-likelihood (see supplementary material).

1: Input: data $\{X_i^s, X_i^a, \mathcal{Y}_i\}_{i=1}^{N}$, constants $K$, $C$
2: Output: distribution $q(W)$ and hyper-parameters $\sigma_c^2$ and $\sigma_{A^c}^2$
3: Initialize: $\tau_{ik1}$, $\tau_{ik2}$, $\phi_k^c$, $\Phi_k^c$, $\nu_{ijk}$
4: repeat
5:     repeat
6:         update $\tau_{ik1}$ and $\tau_{ik2}$, $\forall i$, $k = 1$ to $K$;
7:         update $\phi_k^c$ and $\Phi_k^c$, $c \in \{s, a\}$ and $k = 1$ to $K$;
8:         update $\nu_{ijk}$ using Equations (11) and (12), $\forall i, j$ and $k = 1$ to $K$;
9:     until T iterations or convergence
10:    update the hyperparameters $\sigma_s^2$, $\sigma_a^2$, $\sigma_{A^s}^2$, $\sigma_{A^a}^2$
11: until T′ iterations or convergence
Algorithm 1: Learning algorithm of WSC-SIIBP
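The control flow of Algorithm 1 translates directly into a nested coordinate-ascent loop. The sketch below stubs out the closed-form updates (which live in the supplementary material) and keeps only the structure; all names are illustrative.

```python
import numpy as np

# Stubs standing in for the closed-form coordinate updates; only the
# control flow of Algorithm 1 is sketched here.
def update_tau(params, videos): pass               # Beta params tau_ik1, tau_ik2
def update_phi(params, videos): pass               # appearance means/covs phi, Phi
def update_all_nu(params, videos): pass            # Bernoulli means via Eqs (11)-(12)
def update_hyperparameters(params, videos): pass   # empirical Bayes (outer loop)
def regularized_objective(params, videos): return 0.0

def learn_wsc_siibp(videos, params, T_inner=30, T_outer=10, tol=1e-4):
    for _ in range(T_outer):
        prev = -np.inf
        for _ in range(T_inner):
            update_tau(params, videos)
            update_phi(params, videos)
            update_all_nu(params, videos)
            obj = regularized_objective(params, videos)
            if abs(obj - prev) < tol:              # inner convergence check
                break
            prev = obj
        update_hyperparameters(params, videos)
    return params
```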

Given the input features $x_{ij}^s$ and $x_{ij}^a$, the inferred latent coefficients $\nu_{ijk}$ estimate the presence/absence of the associated classes in a video. One can classify each spatio-temporal track by estimating the track-level labels using $\arg\max_k \nu_{ijk}$, where the maximization is over the latent coefficients corresponding to either the subject or the action concepts, depending on which label we are interested in extracting. For the concept localization task in a video with label pair $(s_p, a_q)$, the best track in the video is selected using $\arg\max_j \nu_{ijp}\, \nu_{ij(M_1+q)}$.

Test Inference: Although the above formulation is proposed for concept classification and localization in a given set of videos (transductive setting), the same algorithm can also be applied to unseen test videos. The latent coefficients for the tracks of test videos can be learned alongside the training data, except that the parameters $\phi_k^c$, $\Phi_k^c$, $\sigma_c^2$ and $\sigma_{A^c}^2$ are updated only using the training data. In the case of free annotation, i.e., absence of labels for a test video, we run the proposed approach by setting $\delta_{ik} = 1, \forall k$ in Equation (11), indicating that the tracks in a video can belong to any of the classes in $\mathcal{S}$ or $\mathcal{A}$ (i.e., no constraints as defined by (4)-(8) are enforced).
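The classification and localization rules of this section reduce to simple argmax operations over the inferred $\nu$. A sketch under our factor-indexing assumption (subjects in columns 0..M1-1, actions in M1..M1+M2-1):

```python
import numpy as np

def classify_track(nu_ij, M1, M2):
    """Track-level labels: argmax over the subject and action factors."""
    return int(np.argmax(nu_ij[:M1])), int(np.argmax(nu_ij[M1:M1 + M2]))

def localize_pair(nu_i, p, q, M1):
    """For a label pair (s_p, a_q), pick the track whose latent
    coefficients best support both concepts jointly."""
    return int(np.argmax(nu_i[:, p] * nu_i[:, M1 + q]))

# Toy usage: a video with 10 tracks and K = 15 factors (M1=7, M2=5)
nu_i = np.random.rand(10, 15)
print(classify_track(nu_i[0], M1=7, M2=5), localize_pair(nu_i, p=2, q=3, M1=7))
```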

4 Experimental Results

In this section, we present an evaluation of WSC-SIIBP on two real-world datasets, the Casablanca movie dataset and the A2D dataset, which contain typical ‘in-the-wild’ videos with weak labels on heterogeneous concepts.

4.1 Datasets

Casablanca dataset: This dataset, introduced in [5], has 19 persons (movie actors) and three action classes (sitdown, walking, background). The heterogeneous concepts used in this dataset are persons and actions. The Casablanca movie is divided into shorter segments of duration either 60 or 120 seconds. We manually annotate all the tracks in each video segment, which may contain multiple persons and actions. Given a video segment and the corresponding video-level labels (extracted from all ground-truth track labels), our algorithm maps each of these labels to one or more tracks in that segment, i.e., converts the weak labels to strong labels. Our main objective on this dataset is to compare the performance of various algorithms in classifying tracks from videos of varying length.

For our setting, we consider faces and actions as the two heterogeneous concepts, and thus we need to extract face and corresponding action track features. We extract 1094 face tracks from the full 102-minute Casablanca video. The face tracks are extracted by running the multi-view face detector from [33] in every frame and associating detections across frames using point tracks [34]. We follow [35] to generate the face track feature representations: dense rootSIFT features are extracted for each face in the track, followed by PCA and video-level Fisher vector encoding. The action tracks corresponding to the 1094 face tracks are obtained by extrapolating the face bounding boxes using a linear transformation [5]. For action features, we compute a Fisher vector encoding of dense trajectories [36] extracted from each action track.
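As a reference for this feature pipeline, the sketch below implements a standard improved-Fisher-vector encoding on PCA-reduced local descriptors. The number of GMM components and the dimensions are illustrative assumptions, not the paper's exact settings.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.mixture import GaussianMixture

def fisher_vector(desc, gmm):
    """Improved Fisher vector of local descriptors w.r.t. a
    diagonal-covariance GMM (a standard recipe)."""
    T = desc.shape[0]
    gamma = gmm.predict_proba(desc)               # (T, K) soft assignments
    w, mu, var = gmm.weights_, gmm.means_, gmm.covariances_
    diff = (desc[:, None, :] - mu[None]) / np.sqrt(var)[None]    # (T, K, d)
    g_mu = (gamma[..., None] * diff).sum(0) / (T * np.sqrt(w)[:, None])
    g_var = (gamma[..., None] * (diff**2 - 1)).sum(0) / (T * np.sqrt(2 * w)[:, None])
    fv = np.concatenate([g_mu.ravel(), g_var.ravel()])
    fv = np.sign(fv) * np.sqrt(np.abs(fv))        # power normalization
    return fv / (np.linalg.norm(fv) + 1e-12)      # L2 normalization

# Toy usage: PCA-reduce dense local features, fit a small GMM, encode a track
local_feats = np.random.randn(500, 128)
reduced = PCA(n_components=32).fit_transform(local_feats)
gmm = GaussianMixture(n_components=8, covariance_type="diag").fit(reduced)
track_fv = fisher_vector(reduced[:100], gmm)
```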

On average, each 60 sec. segment contains 11 face-action tracks and 4 face-action annotations, while each 120 sec. segment contains 21 tracks and 6 annotations. Note that our experimental setup is more difficult than the setting considered in [5], where the Casablanca movie is divided into numerous bags based on the movie script, with each bag on average 31 sec. long and containing only 6.27 face-action tracks.

A2D dataset: This dataset [11] contains 3782 YouTube videos (on average 7-10 seconds long) covering seven objects (ball, bird, car etc.) performing one of nine actions (fly, jump, roll etc.). The heterogeneous concepts considered are objects and actions. This dataset provides the bounding box annotations for every video label pair of object and action. Using the A2D dataset, we aim to analyze the track localization performance on weakly labeled videos as well as the track classification accuracy on a held-out test dataset.

We use the method proposed in [37] to generate spatio-temporal object track proposals. For computational efficiency, we consider only 10 tracks per video and use the ImageNet-pretrained VGG CNN-M network [38] to generate object feature representations. We extract convolutional layer conv-4 and conv-5 features for each track image, followed by PCA and video-level Fisher vector encoding. In this dataset, the action tracks are kept the same as the object tracks (proposals), and the action features are extracted using the same approach as for the Casablanca dataset.

4.2 Baselines

We compare WSC-SIIBP to several state-of-the-art approaches using the same features.

  1. WS-DC [5]: This approach uses weak constraints similar to (4)-(6), but in a discriminative setup where the constraints are incorporated into a biconvex optimization framework.

  2. WS-SIBP [1]: This is a weakly supervised stacked IBP model which does not use an integrative framework for heterogeneous data and only enforces constraints equivalent to (7)-(8). When using this approach, the features extracted for the heterogeneous concepts are concatenated for each spatio-temporal track.

  3. WS-S / WS-A: This is similar to WS-SIBP except that, instead of concatenating features from multiple concepts, the concepts are treated independently in two different IBPs. WS-S is used to model only the person/object features, and WS-A is used to model the action features.

  4. WS-SIIBP: This model extends WS-SIBP with the integrative framework for heterogeneous concepts (section 3.4), but without the location constraints (4)-(6).

  5. WSC-SIBP: This model is similar to WS-SIBP but additionally enforces the location constraints obtained from the weak labels.

Implementation details: For each dataset, the Fisher encoded features are PCA-reduced to an appropriate dimension $D_c$. We select the best feature length and other algorithm-specific hyper-parameters for each algorithm using cross-validation on a small set of input videos. For all IBP based models, the remaining four parameters are set as 32, 100, 30 and 0.5, respectively, for the Casablanca dataset and as 128, 160, 50 and 5, respectively, for the A2D dataset. For WS-DC, the feature dimension is set as 1024.

Figure 3: Comparison of results on the Casablanca movie dataset. (a) Classification accuracy for 60 sec. segments. (b) Recall for background vs. non-background classes (60 sec., person). (c) Recall for background vs. non-background classes (60 sec., action). (d) Classification accuracy for 120 sec. segments. (e) Recall for background vs. non-background classes (120 sec., person). (f) Recall for background vs. non-background classes (120 sec., action). (g), (h) Mean average precision for 60 and 120 sec. segments. (i) Classification accuracy obtained with and without constraints (7) and (8).

4.3 Results on Casablanca

The track-level classification performance is compared in Figure 3. From Figures 3(a) and 3(d), it can be seen that WSC-SIIBP significantly outperforms the other methods for person and action classification in almost all scenarios. For instance, on the 120 second video segments, person classification improves by 4% (a relative improvement of 7%) over the most competitive approach, WS-SIIBP. We also compare pairwise label accuracy to gain insight into the importance of the constraints in Equations (4)-(6). For any given track with non-background person and action labels, the classification is considered correct only if both the person and the action labels are correctly assigned. Even in this scenario, WSC-SIIBP performs 8.1% better (a 24% relative improvement) than the most competitive baseline. Since we combine the heterogeneous concepts along with the location constraints in an integrated framework, WSC-SIIBP outperforms all other baselines. The weak results of WS-DC in pairwise classification, though surprising, can be attributed to its action classification results, which are significantly biased towards one particular action, ‘sitdown’ (Figure 3(d); note that WS-DC performs very poorly on ‘walking’). Indeed, it should be noted that nearly 40% and 89% of the person and action labels, respectively, belong to the background class. Thus, for a fair evaluation of both background and non-background classes, we also plot the recall of the background class against the recall of the non-background classes for person and action classification in Figures 3(b), 3(c), 3(e) and 3(f). These curves were obtained by simultaneously computing recall for background and non-background classes at a range of threshold values on the score $\nu_{ijk}$. The mean average precision (mAP) of WSC-SIIBP along with all other baselines is plotted in Figures 3(g) and 3(h). The mAP values also clearly demonstrate the effectiveness of the proposed approach. From the performance of WS-SIIBP (integrative concepts, no constraints) and WSC-SIBP (no integrative concepts, constraints) in Figures 3(a) and 3(d), it is clear that the improvement of WSC-SIIBP can be attributed both to the integrative modeling of concepts and to the location constraints.

Effect of constraints (7), (8): We note that, regardless of other differences, every weakly supervised IBP model considered here enforces constraints (7) and (8). However, these constraints are not part of the original WS-DC. To make a fair comparison between WS-DC and WSC-SIIBP, we analyze the effect of these constraints in Figure 3(i). Although these additional constraints improve the performance of WS-DC, they do not surpass the performance of WSC-SIIBP. Further, we observe that these constraints improve the performance of all the weakly supervised IBP models.

4.4 Results on A2D

First, we evaluate localization performance on the full A2D dataset. We experiment with 37,820 tracks extracted from 3,782 videos with around 5000 weak labels. For every given object-action label pair, our algorithm selects the best track from the corresponding video using the approach outlined in section 3.6. The localization accuracy is measured by calculating the average IoU (Intersection over Union) of the selected track (3-D bounding box) with the ground-truth bounding box. The class-wise IoU accuracy and the mean IoU accuracy over all classes are tabulated in Tables 1 and 2, respectively. On this task too, WSC-SIIBP yields a relative improvement of 9% over the next best baseline. We also evaluate how accurately the extracted object proposals match the ground-truth bounding boxes, to estimate an upper bound on the localization accuracy (referred to as Upper Bound in Tables 1 and 2). In this case, the track maximizing the average IoU with the ground-truth annotation is selected and the corresponding IoU is reported. We plot the correct localization accuracy at varying IoU thresholds in Figure 4(a), which also shows the effectiveness of the proposed approach. Figures 4(b)-(c) show some qualitative localization results of the proposed approach on a few track images.
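The average-IoU metric can be computed per track by averaging per-frame box IoUs. A sketch under one plausible reading of the 3-D (spatio-temporal) IoU; the exact evaluation protocol may differ from the paper's.

```python
import numpy as np

def box_iou(a, b):
    """IoU of two 2-D boxes given as (x1, y1, x2, y2)."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter + 1e-12)

def track_iou(track, gt):
    """Average IoU of a spatio-temporal track against ground truth:
    per-frame box IoU averaged over frames where either is present.
    track, gt: dicts mapping frame index -> (x1, y1, x2, y2)."""
    frames = set(track) | set(gt)
    ious = [box_iou(track[f], gt[f]) if f in track and f in gt else 0.0
            for f in frames]
    return float(np.mean(ious))
```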
Test Inference: We evaluate the classification performance on held-out test samples using the same train/test partition as in [11]. We consider two setups for the evaluation: (a) using video-level labels for the test samples and (b) free annotation, where no test video labels are provided. The proposed approach is compared with GT-SVM, a fully supervised linear SVM that uses ground-truth bounding boxes and their corresponding strong labels during training. The results are tabulated in Table 3. Note that the performance of WSC-SIIBP is close to that of the fully supervised setup.

Figure 4: (a) Correct localization accuracy at various IoU thresholds. (b) and (c) Qualitative results: green boxes show the concept localization obtained by our proposed approach.

             adult  baby  ball  bird  car   cat   dog   climb  crawl  eat   fly   jump  roll  run   walk
WSC-SIIBP    28.4   43.6  9.8   37.8  37.4  40.8  42.0  37.5   47.6   46.1  24.5  29.4  50.9  25.6  37.2
Upper Bound  39.9   53.9  16.4  48.2  48.7  52.8  51.4  50.0   59.2   57.2  33.9  41.0  59.1  38.1  47.9
Table 1: Per-class mean IoU on the A2D dataset.

     Random  WS-S  WS-A   WS-SIBP  WS-SIIBP  WSC-SIBP  WSC-SIIBP  Upper Bound
IoU  25.5    29.7  30.43  31.1     31.55     31.69     34.38      45.05
Table 2: Average IoU comparison with other approaches on the A2D dataset.
                     WSC-SIIBP        GT-SVM
Setup                Obj     Act      Obj     Act
Using video labels   94.77   90.68    98.20   94.92
Free annotation      76.62   64.77    85.18   73.26
Table 3: mAP classification test accuracy on the A2D dataset.

5 Conclusion

We developed a Bayesian non-parametric approach that integrates the Indian Buffet Process with heterogeneous concepts and the spatio-temporal location constraints arising from weak labels. We performed experiments on two recent datasets containing heterogeneous concepts such as persons, objects and actions, and showed that our approach outperforms the best state-of-the-art methods. In future work, we will extend the WSC-SIIBP model to additionally localize audio concepts from speech input and develop an end-to-end deep neural network for joint feature learning and Bayesian inference.

6 Appendix

6.1 Expectation Constraints

In a Bayesian framework, the effective constraints for Equations (4)-(8) are defined as expectations [31, 32] of the original constraints and can be rewritten as,

$\mathbb{E}_{q}\big[\sum_{j=1}^{n_i} z_{ijp}\big] \ \ge\ 1 - \xi_{ipq}$ (S13)
$\mathbb{E}_{q}\big[\sum_{j=1}^{n_i} z_{ij(M_1+q)}\big] \ \ge\ 1 - \xi_{ipq}$ (S14)
$\mathbb{E}_{q}\big[\sum_{j=1}^{n_i} z_{ijp}\, z_{ij(M_1+q)}\big] \ \ge\ 1 - \xi_{ipq}$ (S15)
$\mathbb{E}_{q}[z_{ijp}] = 0, \quad \forall p : s_p \notin \mathcal{Y}_i$ (S16)
$\mathbb{E}_{q}[z_{ij(M_1+q)}] = 0, \quad \forall q : a_q \notin \mathcal{Y}_i$ (S17)

where the expectation is taken w.r.t. the posterior distribution in (3). From (3) one may note that, through the stick-breaking prior $\pi_{ik}$, the samples of $z_{ijk}$ depend on the previously sampled stick variables $v_{il}$, $l \le k$. This complicates the direct evaluation of the constraint in Equation (S13). However, due to the independence assumption, the search space over the family of tractable posterior distributions in (10) simplifies the constraints in Equations (S13)-(S17) to,

$\sum_{j=1}^{n_i} \nu_{ijp} \ \ge\ 1 - \xi_{ipq}$ (S18)
$\sum_{j=1}^{n_i} \nu_{ij(M_1+q)} \ \ge\ 1 - \xi_{ipq}$ (S19)
$\sum_{j=1}^{n_i} \nu_{ijp}\, \nu_{ij(M_1+q)} \ \ge\ 1 - \xi_{ipq}$ (S20)
$\nu_{ijp} = 0, \quad \forall p : s_p \notin \mathcal{Y}_i$ (S21)
$\nu_{ij(M_1+q)} = 0, \quad \forall q : a_q \notin \mathcal{Y}_i$ (S22)

6.2 Derivation of Posterior Update Equations

Note that the constraints in (S18)-(S20) can be rewritten as hinge loss functions and added to the objective function in Equation (9). Hence the final formulation is given by,

$\min_{q(W) \in \mathcal{P}} \ \mathrm{KL}\big(q(W)\,\|\,p_0(W|\Theta)\big) - \mathbb{E}_{q(W)}\big[\log p(X^s, X^a|W,\Theta)\big] + C \sum_{i} \sum_{(s_p, a_q) \in \mathcal{Y}_i} h_{ipq}(\nu)$ (S23)

where $h_{ipq}(\nu)$ denotes the total hinge loss incurred by the constraints (S18)-(S20) for the label pair $(s_p, a_q)$ in video $i$.

The objective function in Equation (S23) can be rewritten as,

$\mathcal{L} = \mathcal{L}_{\mathrm{KL}} + \mathcal{L}_{X} + \mathcal{L}_{h}$ (S24)

where $\mathcal{L}_{\mathrm{KL}}$ represents the KL-divergence term, $\mathcal{L}_{X}$ denotes the likelihood term and $\mathcal{L}_{h}$ is the term corresponding to the hinge loss functions $h_{ipq}$. Expanding $\mathcal{L}_{X}$, we get,

(S25)
(S26)

where the intermediate matrices and terms are defined as,

(S27)

For the KL-divergence term, we obtain $\mathcal{L}_{\mathrm{KL}}$ as a sum of divergences over the individual variational factors, where the individual terms are,

(S28)
(S29)
(S30)

where $\psi(\cdot)$ is the digamma function. As shown for the original IBP in [29], the term $\mathbb{E}_q\big[\log\big(1 - \prod_{l=1}^{k} v_{il}\big)\big]$ is approximated by its lower bound,

$\mathbb{E}_q\Big[\log\Big(1 - \prod_{l=1}^{k} v_{il}\Big)\Big] \ \ge\ \sum_{y=1}^{k} q_{ky}\Big(\psi(\tau_{iy2}) + \sum_{l=1}^{y-1} \psi(\tau_{il1}) - \sum_{l=1}^{y} \psi(\tau_{il1} + \tau_{il2})\Big) + H(q_k)$ (S31)

where the variational parameter $q_k = (q_{k1}, \ldots, q_{kk})$ is a $k$-point probability mass function and $H(q_k)$ denotes the entropy of $q_k$. The tightest bound is obtained by setting,

$q_{ky} \propto \exp\Big(\psi(\tau_{iy2}) + \sum_{l=1}^{y-1} \psi(\tau_{il1}) - \sum_{l=1}^{y} \psi(\tau_{il1} + \tau_{il2})\Big)$

where the proportionality constant is the normalization factor that enables $q_k$ to be a distribution. On replacing the term