Weakly Supervised Learning of Heterogeneous Concepts in Videos
Abstract
Typical textual descriptions that accompany online videos are ‘weak’: i.e., they mention the main concepts in the video but not their corresponding spatiotemporal locations. The concepts in the description are typically heterogeneous (e.g., objects, persons, actions). Certain location constraints on these concepts can also be inferred from the description. The goal of this paper is to present a generalization of the Indian Buffet Process (IBP) that can (a) systematically incorporate heterogeneous concepts in an integrated framework, and (b) enforce location constraints, for efficient classification and localization of the concepts in videos. Finally, we develop posterior inference for the proposed formulation using a mean-field variational approximation. Comparative evaluations on the Casablanca and A2D datasets show that the proposed approach significantly outperforms other state-of-the-art techniques: a 24% relative improvement for pairwise concept classification on the Casablanca dataset and a 9% relative improvement for localization on the A2D dataset, compared to the most competitive baseline.
1 Introduction
Watching and sharing videos on social media has become an integral part of everyday life. We are often intrigued by the textual description of a video and want to fast-forward to the segments of interest without watching the entire video. However, these textual descriptions usually do not specify the exact segment of the video associated with a particular description. For example, someone describing a movie clip as “head-on collision between cars while Chris Cooper is driving” provides neither the timestamps for the collision or driving events nor the spatial locations of the cars or Chris Cooper. Such descriptions are referred to as ‘weak labels’. For efficient video navigation and consumption, it is important to automatically determine the spatiotemporal locations of these concepts (such as ‘collision’ or ‘cars’). However, it is prohibitively expensive to train concept-specific models for all concepts of interest in advance and use them for localization. This shortcoming has triggered a great amount of interest in jointly learning concept-specific classification models and localizing concepts from multiple weakly labeled images [1, 2, 3] or videos [4, 5].
Video descriptions include concepts that may refer to persons, objects, scenes and/or actions, and thus a typical description is a combination of heterogeneous concepts. In the running example, the extracted heterogeneous concepts are ‘car’ (object), ‘head-on collision’ (action), ‘Chris Cooper’ (person) and ‘driving’ (action). Learning classifiers for these heterogeneous concepts along with localization is an extremely challenging task because: (a) the classifiers for different kinds of concepts must be learned simultaneously, e.g., a face classifier, an object classifier, an action classifier, etc., and (b) the learning model must take into account the spatiotemporal location constraints imposed by the descriptions while learning these classifiers. For example, the concepts ‘head-on collision’ and ‘cars’ should spatiotemporally co-occur at least once, and there should be at least one car in the video.
Recently, there has been growing interest in jointly learning concept classifiers from weak labels [1, 5]. Bojanowski et al. [5] proposed a discriminative clustering framework to jointly learn person and action models from movies using weak supervision provided by the movie scripts. Since the weak labels are extracted from scripts, each label can be associated with a particular shot in the movie, which may last only a few seconds; i.e., the labels are well localized, which makes the overall learning easier. However, for real-world videos, one does not have access to such shot-level labels but only to video-level labels. Therefore, in our work, we do not assume the availability of such well-localized labels, and tackle the more general problem of learning concepts from the weaker video-level labels. The framework in [5], when extended to long videos, does not give satisfactory results (see section 4). Such techniques, which are based on a linear mapping from features to labels and model the background using only a single latent factor, are usually inadequate to capture all the inter-class and intra-class variations. Shi et al. [1] jointly learn object and attribute classifiers from images using a weakly supervised Indian Buffet Process (IBP). Note that the IBP [6, 7] allows observed features to be explained by a countably infinite number of latent factors. However, the framework in [1] is not designed to handle heterogeneous concepts and location constraints, which leads to a significant degradation in performance (section 4.3). [8] and [9] propose IBP-based cross-modal categorization/query image retrieval models which learn semantically meaningful abstract features from multimodal (image, speech and text) data. However, these unsupervised approaches do not incorporate any location constraints, which naturally arise in the weakly supervised setting with heterogeneous labels.
We propose a novel Bayesian nonparametric (BNP) approach called WSCSIIBP (Weakly Supervised, Constrained & Stacked Integrative IBP) to jointly learn heterogeneous concept classifiers and localize these concepts in videos. BNP models are a class of Bayesian models in which the hidden structure that may have generated the observed data is not assumed to be fixed; instead, the framework allows the complexity of the model to increase as more data is observed [10]. Specifically, we propose:

A novel generalization of IBP which for the first time incorporates weakly supervised spatiotemporal location constraints and heterogeneous concepts in an integrated framework.

Posterior inference of WSCSIIBP model using meanfield variational approximation.
We assume that the weak video labels come in the form of tuples: in the running example, the extracted heterogeneous concept tuples are ({car, head-on collision}, {Chris Cooper, driving})^1. We perform experiments on two video datasets: (a) the Casablanca movie dataset [5] and (b) the A2D dataset [11]. We show that the proposed approach, WSCSIIBP, outperforms several state-of-the-art methods for heterogeneous concept classification and localization in a weakly supervised setting. For example, WSCSIIBP leads to relative improvements of 7%, 5% and 24% on person, action and pairwise classification accuracies, respectively, over the most competitive baselines on the Casablanca dataset. Similarly, the relative improvement in localization accuracy is 9% over the next best approach on the A2D dataset.

^1 Extracting the concept tuples from textual descriptions of the videos is an interesting research problem in itself and is beyond the scope of this paper.
2 Related Work
In this section, we discuss relevant prior work in two broad categories.
Weakly Supervised Learning: Localizing concepts and learning classifiers from weakly annotated data is an active research topic. Researchers have learned models for various concepts from weakly labeled videos using Multi-Instance Learning (MIL) [12, 13] for human action recognition [14], visual tracking [15], etc. Cour et al. [16] use a novel convex formulation to learn face classifiers from movies and TV series using multimodal features obtained from finely aligned screenplay, speech and video data. In [17, 4], the authors propose discriminative clustering approaches for aligning videos with temporally ordered text descriptions or predefined tags, and in the process also learn action classifiers. In our approach, we consider weak labels which are neither ordered nor aligned to any specific video segment. [18] proposes a method for learning object class detectors from real-world web videos known to contain only the target class by formulating the problem as a domain adaptation task. [19] learns weakly supervised object/action classifiers using a latent-SVM formulation where the objects or actions are localized in training images/videos using latent variables. We note that both [18, 19] consider only a single weak label per video and, unlike our approach, do not jointly learn the heterogeneous concepts. The authors in [20, 21] use dialogues, scene and character identification to find an optimal mapping between a book and movie shots using a shortest-path or CRF approach. However, these approaches neither jointly model heterogeneous concepts nor spatiotemporally localize them. Although [22] proposes a discriminative clustering model for coreference resolution in videos, only faces are considered in their experiments.
Heterogeneous concept learning: There is prior work on automatic image [23, 24, 25] and video [26, 27, 28] caption generation, where models are trained on pairs of images/videos and text containing heterogeneous concept descriptions to predict captions for novel images/videos. While most of these approaches rely on deep learning methods to learn a mapping between an image/video and the corresponding text description, [25] uses MIL to learn visual concept detectors (with spatial localization in images) for nouns, verbs and adjectives. However, none of these approaches spatiotemporally localize points of interest in videos; perhaps the available video datasets are not large enough to train such a weakly supervised deep learning model.
To the best of our knowledge, there is no prior work that jointly classifies and localizes heterogeneous concepts in weakly supervised videos.
3 WSCSIIBP: Model and Algorithm
In this section, we describe the details of WSCSIIBP (see figure 1 for the pipeline). We first introduce notation and motivate our approach in sections 3.1 and 3.2, respectively. This is followed by section 3.3, where we introduce the stacked nonparametric graphical model (IBP) and its corresponding posterior computation. In sections 3.4 and 3.5, we formulate an extension of the stacked IBP model which generalizes to heterogeneous concepts and incorporates the constraints obtained from weak labels. In section 3.6, we briefly describe the inference procedure using a truncated mean-field variational approximation and summarize the entire algorithm. Finally, we discuss how one can classify and localize concepts in new test videos using WSCSIIBP.
3.1 Notation
Assume we are given a set of weakly labeled videos denoted by , where indicates a video and denotes the heterogeneous weak labels corresponding to the i-th video. Although the proposed approach can be used for any number of heterogeneous concepts, for readability, we restrict ourselves to two concepts and call them subjects and actions. We also have a closed set of class labels for these heterogeneous concepts: for subjects and for actions. Let , , , indicate that the corresponding subject or action class label is not present, and represents the number of videos. The video-level annotation simply indicates that the paired concepts can occur anywhere in the video and at multiple locations.
Assume that spatiotemporal tracks are extracted from each video i, where each track j is represented as an aggregation of multiple local features, . The spatiotemporal tracks could be face tracks, 3D object proposals or action proposals (see section 4.1 for more details). We associate the j-th track in video i with an infinite binary latent coefficient vector [6, 1]. Each video i is represented by a bag of spatiotemporal tracks . Similarly, .
3.2 Motivation
Our objective is to learn (a) a mapping between each of the tracks in a video and the labels in its weak-label set, and (b) an appearance model for each label identity such that tracks from new test videos can be classified. To achieve this, it is important for any model to discover the latent factors that can explain similar tracks across a set of videos with a particular label. In general, the number of latent factors is not known a priori and must be inferred from the data. In a Bayesian framework, the IBP treats this number as a random variable that can grow with new observations, thus letting the model effectively explain the unbounded complexity of the data. Specifically, the IBP defines a prior distribution over an equivalence class of binary matrices with a bounded number of rows (indicating spatiotemporal tracks) and infinitely many columns (indicating latent coefficients). To achieve our goals, we build on the IBP and introduce the WSCSIIBP model, which can effectively learn the latent factors corresponding to each heterogeneous concept and utilize prior location constraints to reduce ambiguity in learning through the knowledge of other latent coefficients.
3.3 Indian Buffet Process (IBP)
The spatiotemporal tracks in the videos are obtained from an underlying generative process. Specifically, we consider a stacked IBP model [1] as described below.

For each latent factor ,

Draw an appearance distribution with mean


For each video ,

Draw a sequence of i.i.d. random variables, Beta

Construct the prior on the latent factors, ,

For subject track in video, where ,

Sample state of each latent factor, Bern,

Sample track appearance,


where is the prior controlling the sparsity of latent factors, and are the prior appearance and noise variance shared across all factors, respectively. Each forms a row of , and the value of the latent coefficient indicates whether the data contains the corresponding latent factor or not. In the above model, we have used the stick-breaking construction [29] to generate the s.
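As a concrete illustration, the generative process above can be sketched in a few lines of NumPy under a finite truncation of the infinite factor set. All names here (alpha for the sparsity prior, sigma_a and sigma_n for the appearance and noise variances, K for the truncation level) are placeholders we introduce for the sketch, not the paper's notation:

```python
import numpy as np

def sample_stacked_ibp(n_videos, n_tracks, K, D,
                       alpha=2.0, sigma_a=1.0, sigma_n=0.1, seed=0):
    """Sample from a truncated stick-breaking IBP generative model (sketch)."""
    rng = np.random.default_rng(seed)
    # Appearance mean for each latent factor, shared across all videos.
    A = rng.normal(0.0, sigma_a, size=(K, D))
    videos = []
    for _ in range(n_videos):
        # Stick-breaking: v_k ~ Beta(alpha, 1); pi_k = prod_{l <= k} v_l.
        v = rng.beta(alpha, 1.0, size=K)
        pi = np.cumprod(v)                       # non-increasing factor priors
        # Binary latent coefficients for each track: z_jk ~ Bern(pi_k).
        Z = (rng.random((n_tracks, K)) < pi).astype(float)
        # Track appearance: a sum of active factor means plus Gaussian noise.
        X = Z @ A + rng.normal(0.0, sigma_n, size=(n_tracks, D))
        videos.append((Z, X))
    return A, videos
```

Because pi is a cumulative product of Beta draws, later factors are active with geometrically decreasing probability, which is what lets the effective number of factors grow with the data while staying sparse.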
Posterior: We now describe how the posterior is obtained for the above graphical model. Let $\mathcal{W}$ and $\Theta$ denote the hidden variables and the prior parameters, respectively, and let $\mathbf{X}$ denote the concatenation of all the spatiotemporal tracks in all videos. Given the prior distribution $p(\mathcal{W} \mid \Theta)$ and the likelihood function $p(\mathbf{X} \mid \mathcal{W})$, the posterior probability is given by Bayes' theorem,
(1)  $p(\mathcal{W} \mid \mathbf{X}, \Theta) = \dfrac{p(\mathbf{X} \mid \mathcal{W})\, p(\mathcal{W} \mid \Theta)}{p(\mathbf{X} \mid \Theta)}$
where $p(\mathbf{X} \mid \Theta)$ is the marginal likelihood; for brevity, we write this posterior as $p(\mathcal{W})$. Apart from the significance of inferring the latent coefficients for identifying track-level labels, inferring the prior for each video helps to identify video-level labels, while the inferred appearance model is used to classify new test samples (see section 3.6). Thus, learning in our model requires computing the full posterior distribution over $\mathcal{W}$.
Regularized posterior: We note that it is difficult to infer regularized posterior distributions using Equation (1). Zellner in [30] demonstrated that the posterior distribution in (1) can also be obtained as the solution of the following optimization problem,
(2)  
where denotes the Kullback–Leibler divergence and is the probability simplex. As we will see later, this procedure enables us to learn the posterior distribution within a constrained optimization framework.
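Written out with placeholder symbols (we use $q$ for the candidate posterior, $\mathcal{W}$ for the hidden variables, $\Theta$ for the prior parameters and $\mathbf{X}$ for the data, since the paper's own symbols are not reproduced here), Zellner's result states that the posterior solves

```latex
\min_{q \in \mathcal{P}} \;
\mathrm{KL}\!\left(q(\mathcal{W}) \,\big\|\, p(\mathcal{W} \mid \Theta)\right)
\;-\;
\mathbb{E}_{q(\mathcal{W})}\!\left[\log p(\mathbf{X} \mid \mathcal{W})\right]
```

whose unique minimizer over the probability simplex $\mathcal{P}$ is exactly the Bayes posterior of Equation (1). Restricting or regularizing the feasible set of $q$ then yields the regularized posteriors used below.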
3.4 Integrative IBP
Our objective is to model heterogeneous concepts (such as subjects and actions) using a graphical model. However, the IBP model described above cannot handle multiple concepts, because it is highly unlikely that the subject and action features can be explained by the same statistical model. Hence, we propose an extension of the stacked IBP to heterogeneous concepts, where different concept types are modeled using different appearance models.
Let the subject and action features corresponding to spatiotemporal track j in video i be denoted by and , respectively, with each having different dimensions^2. Unlike the IBP model, and are now represented using two different Gaussian noise models and , respectively, where denotes the prior noise variance and are matrices. The means of the subject and action appearance models for each latent factor are also sampled independently from Gaussian distributions with different variances . The new posterior probability is given by,

^2 We often use as a replacement of and throughout the paper.
(3)  
3.5 Integrative IBP with Constraints
Although the graphical model described above is capable of handling heterogeneous features, the location constraints inferred from the weak labels still need to be incorporated into the graphical model. As motivated in section 1, the concepts ‘head-on collision’ and ‘cars’ should spatiotemporally co-occur at least once, and there should be at least one car in the full video. Imposing these location constraints in the inference algorithm can lead to more accurate parameter estimation of the graphical model and faster convergence of the inference procedure. These constraints can be generalized as follows,

Every label tuple in , is associated with at least one spatiotemporal track (i.e., the event occurs in the video).

Spatiotemporal tracks should be assigned a label only from the list of weak labels assigned to the video. Concepts present in the video but not in the label will be subsumed in the background models.
Ideally, in the case of noiseless labels, these constraints should be strictly enforced. However, we assume that real-world labels can be noisy, with the noise independent of the videos. Hence, we allow the constraints to be violated but penalize violations using additional slack variables.
We associate the first and the following latent factors (the rows of ) to the subject and action classes in and respectively. The inferred values of their corresponding latent coefficients in are used to determine the presence/absence of the associated concept in a particular spatiotemporal track. The remaining unbounded number of latent factors are used to explain away the background tracks from unknown action and subject classes in a video. With these assignments, we enforce the following constraints on latent factors which are sufficient to satisfy the conditions mentioned earlier.
To satisfy 1, we introduce the following constraints, and ,
(4)  
(5)  
(6) 
where is the slack variable, and are the latent factor coefficients corresponding to subject class and action class respectively.
To satisfy 2, we use the following constraints, and ,
(7)  
(8) 
The constraints defined in (4)–(8) have been used in the context of discriminative clustering [22, 5]. However, our model is the first to use these constraints in a Bayesian setup. In their simplest form, they can be enforced using a point estimate of the posterior, e.g., a MAP estimate. However, the posterior is defined over the entire probability space; to enforce the above constraints in a Bayesian framework, we need to account for the uncertainty in it. Following [31, 32], we define effective constraints as expectations of the original constraints in (4)–(8), where the expectation is computed w.r.t. the posterior distribution in (3) (see supplementary material for the expectation constraints). The proposed graphical model, incorporating heterogeneous concepts as well as the location constraints provided by the weak labels, is shown in figure 2.
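To make the two constraint families concrete, the following sketch scores how badly one video's inferred coefficients violate them, using expected coefficients and hinge-style penalties in the spirit of the slack-variable relaxation above. Function and variable names are our own illustration, not the paper's code:

```python
import numpy as np

def constraint_penalties(nu_s, nu_a, labels):
    """Hinge penalties for the expected location constraints (a sketch).

    nu_s, nu_a: (J, S) and (J, A) arrays of posterior means of the binary
    latent coefficients for J tracks over S subject and A action classes.
    labels: list of (s, a) weak-label index pairs for this video.
    """
    penalty = 0.0
    label_s = {s for s, _ in labels}
    label_a = {a for _, a in labels}
    for s, a in labels:
        # Constraint 1: each weak label pair must explain at least one
        # track, so the best joint score should reach 1; penalize shortfall.
        best = np.max(nu_s[:, s] * nu_a[:, a])
        penalty += max(0.0, 1.0 - best)
    # Constraint 2: classes absent from the weak labels get no tracks,
    # so any probability mass assigned to them is penalized.
    for s in range(nu_s.shape[1]):
        if s not in label_s:
            penalty += nu_s[:, s].sum()
    for a in range(nu_a.shape[1]):
        if a not in label_a:
            penalty += nu_a[:, a].sum()
    return penalty
```

A penalty of zero means both constraint families hold in expectation; in the full model, such violations would be absorbed by slack variables rather than forbidden outright.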
We restrict the search space for posterior distribution in Equation (3) by using the expectation constraints. In order to obtain the regularized posterior distribution of the proposed model, we solve the following optimization problem under these expectation constraints,
(9) 
3.6 Learning and Inference
Note that variational inference for the true posterior (in Equation (3)) is intractable over the general space of probability functions. To make the problem tractable, we employ a truncated mean-field variational approximation [29] to the desired posterior, such that the search space is constrained to the following tractable parametrized family of distributions,
(10) 
where , and . In Equation (10), note that each latent variable is modeled independently of all other variables, which simplifies the inference procedure. The truncated stick-breaking process for the ’s is bounded at , wherein for . indicates the number of latent factors chosen to explain background tracks.
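A minimal sketch of how such a truncated mean-field family might be initialized, assuming one Beta factor per stick weight, one Gaussian factor per appearance mean, and one Bernoulli mean per binary latent coefficient. All parameter names are illustrative, not the paper's:

```python
import numpy as np

def init_variational_params(J, K, D, seed=0):
    """Initialize truncated mean-field factors (illustrative sketch).

    J: number of tracks, K: truncation level, D: feature dimension.
    tau:      (K, 2) Beta parameters for the stick-breaking weights.
    phi_mean: (K, D) means of the Gaussian factors for appearance means.
    phi_var:  (K, D) variances of those Gaussian factors.
    nu:       (J, K) Bernoulli means of the binary latent coefficients.
    """
    rng = np.random.default_rng(seed)
    tau = np.ones((K, 2))                          # uniform Beta(1, 1) start
    phi_mean = rng.normal(0.0, 0.1, size=(K, D))   # small random appearance
    phi_var = np.ones((K, D))
    # Start near 0.5 but strictly inside (0, 1) so entropy terms are finite.
    nu = np.clip(rng.uniform(0.4, 0.6, size=(J, K)), 1e-6, 1 - 1e-6)
    return tau, phi_mean, phi_var, nu
```

The iterative updates of Algorithm 1 would then cycle through these factors, holding the others fixed, until the regularized objective converges.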
The optimization problem in Equation (9) is solved using the posterior distribution from Equation (10). We obtain the parameters (see supplementary material for details) , and for the optimal posterior distribution using iterative update rules as summarized in Algorithm 1. The mean of binary latent coefficients , denoted by , has an update rule which will lead to several interesting observations.
(11) 
(12)  
where is the digamma function, is an indicator function, is an indicator variable and is a lower bound for . The indicator denotes whether a concept (action/subject) is part of the video label set or not. If , all the corresponding binary latent coefficients are forced to 0, which is equivalent to enforcing the constraints in Equations (7) and (8). Note that the value of increases with . The terms (i)–(iii) in the update rule (Equation (12)), which arise from the location constraints in Equations (4)–(6), act as coupling terms between the ’s. For example, for any action concept, term (ii) suggests that if the location constraints are not satisfied, better localization of all the coupled subject concepts (a high value of ) drives up the value of . This implies that strong localization of one concept can lead to better localization of other concepts.
The hyperparameters and can be set a priori or estimated from the data. Similar to the maximization step of the EM algorithm, their empirical estimates can easily be obtained by maximizing the expected log-likelihood (see supplementary material).
Given the input features and , the inferred latent coefficients estimate the presence/absence of the associated classes in a video. One can classify each spatiotemporal track by estimating the track-level labels using . Here, the maximization is over the latent coefficients corresponding to either the subject or the action concepts, depending on which label we are interested in extracting. For the concept localization task in a video with label pair , the best track in the video is selected using .
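The decision rules just described reduce to simple argmax operations over the inferred coefficient means. The sketch below assumes the column layout of section 3.5 (subject factors first, then action factors, then background) and uses illustrative names:

```python
import numpy as np

def classify_tracks(nu, n_subject, n_action):
    """Track-level labels from inferred latent-coefficient means (sketch).

    nu: (J, K) posterior means; columns [0, n_subject) map to subject
    classes, the next n_action columns to action classes, the remainder
    to background factors.
    """
    subj = np.argmax(nu[:, :n_subject], axis=1)
    act = np.argmax(nu[:, n_subject:n_subject + n_action], axis=1)
    return subj, act

def localize_pair(nu, s, a, n_subject):
    """Pick the best track for a weak label pair (s, a): the track whose
    subject and action coefficients jointly score highest."""
    score = nu[:, s] * nu[:, n_subject + a]
    return int(np.argmax(score))
```

Scoring the pair by the product of the two coefficient means favors tracks where both concepts are simultaneously active, matching the co-occurrence reading of a weak label tuple.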
Test Inference: Although the above formulation is proposed for concept classification and localization in a given set of videos (a transductive setting), the same algorithm can also be applied to unseen test videos. The latent coefficients for the tracks of test videos are learned alongside the training data, except that the parameters , , and are updated only using the training data. In the case of free annotation, i.e., the absence of labels for a test video, we run the proposed approach by setting in Equation (11), indicating that the tracks in a video can belong to any of the classes in or (i.e., no constraints as defined by (4)–(8) are enforced).
4 Experimental Results
In this section, we present an evaluation of WSCSIIBP on two real-world databases, the Casablanca movie and the A2D dataset, which represent typical ‘in-the-wild’ videos with weak labels on heterogeneous concepts.
4.1 Datasets
Casablanca dataset: This dataset, introduced in [5], has 19 persons (movie actors) and three action classes (sit-down, walking, background). The heterogeneous concepts used in this dataset are persons and actions. The Casablanca movie is divided into shorter segments of duration either 60 or 120 seconds. We manually annotate all the tracks in each video segment, which may contain multiple persons and actions. Given a video segment and the corresponding video-level labels (extracted from all ground-truth track labels), our algorithm maps each of these labels to one or more tracks in that segment, i.e., it converts the weak labels to strong labels. Our main objective in evaluating on this dataset is to compare the performance of various algorithms in classifying tracks from videos of varying length.
For our setting, we consider faces and actions as the two heterogeneous concepts, and thus we need to extract the face and the corresponding action track features. We extract 1094 face tracks from the full 102-minute Casablanca video. The face tracks are extracted by running the multi-view face detector from [33] on every frame and associating detections across frames using point tracks [34]. We follow [35] to generate the face track feature representations: dense RootSIFT features are extracted for each face in the track, followed by PCA and video-level Fisher vector encoding. The action tracks corresponding to the 1094 face tracks are obtained by extrapolating the face bounding boxes using a linear transformation [5]. For action features, we compute a Fisher vector encoding of dense trajectories [36] extracted from each action track.
On average, each 60-second segment contains 11 face-action tracks and 4 face-action annotations, while each 120-second segment contains 21 tracks and 6 annotations. Note that our experimental setup is more difficult than the setting considered in [5], where the Casablanca movie is divided into numerous bags based on the movie script and each segment is on average only 31 seconds long, containing only 6.27 face-action tracks.
A2D dataset: This dataset [11] contains 3782 YouTube videos (on average 7–10 seconds long) covering seven objects (ball, bird, car, etc.) performing one of nine actions (fly, jump, roll, etc.). The heterogeneous concepts considered are objects and actions. The dataset provides bounding box annotations for every video-level object-action label pair. Using the A2D dataset, we aim to analyze the track localization performance on weakly labeled videos as well as the track classification accuracy on a held-out test set.
We use the method proposed in [37] to generate spatiotemporal object track proposals. For computational efficiency, we consider only 10 tracks per video and use the ImageNet-pretrained VGG CNN-M network [38] to generate object feature representations. We extract convolutional-layer conv4 and conv5 features for each track image, followed by PCA and video-level Fisher vector encoding. In this dataset, the corresponding action tracks are kept identical to the object tracks (proposals), and the action features are extracted using the same approach as for the Casablanca dataset.
4.2 Baselines
We compare WSCSIIBP to several state-of-the-art approaches using the same features.

WSSIBP [1]: This is a weakly supervised stacked IBP model which does not use an integrative framework for heterogeneous data and only enforces constraints equivalent to (7)–(8). For each spatiotemporal track, the features extracted for the heterogeneous concepts are concatenated when using this approach.

WSS / WSA: These are similar to WSSIBP except that, instead of concatenating features from multiple concepts, the concepts are treated independently in two different IBPs. WSS models only the person/object features and WSA models only the action features.

WSSIIBP: This model extends WSSIBP with the integrative framework for heterogeneous concepts.

WSCSIBP: This model is similar to WSSIBP but, unlike WSSIBP, additionally enforces the location constraints obtained from the weak labels.
Implementation details: For each dataset, the Fisher-encoded features are reduced via PCA to an appropriate dimension, . We select the best feature length and other algorithm-specific hyperparameters for each algorithm using cross-validation on a small set of input videos. For the IBP-based models, the cross-validation ranges for the hyperparameters are , and . For all IBP-based models, the parameters , , and are set to 32, 100, 30 and 0.5, respectively, for the Casablanca dataset and to 128, 160, 50 and 5, respectively, for the A2D dataset. For WSDC, is set to 1024.
4.3 Results on Casablanca
The track-level classification performance is compared in Figure 3. From Figures 3(a) and 3(d), it can be seen that WSCSIIBP significantly outperforms the other methods for person and action classification in almost all scenarios. For instance, on the 120-second video segments, person classification improves by 4% (a relative improvement of 7%) over the most competitive approach, WSSIIBP. We also compare pairwise label accuracy to gain insight into the importance of the constraints in Equations (4)–(6). For any given track with non-background person and action labels, the classification is considered correct only if both the person and the action labels are correctly assigned. Even in this scenario, WSCSIIBP performs 8.1% better (a 24% relative improvement) than the most competitive baseline. Since we combine the heterogeneous concepts along with location constraints in an integrated framework, WSCSIIBP outperforms all other baselines. The weak results of WSDC on pairwise classification, though surprising, can be attributed to its action classification results, which are significantly biased towards one particular action, ‘sit-down’ (Figure 3(d); note that WSDC performs very poorly on ‘walking’ classification). Indeed, it should be noted that nearly 40% and 89% of person and action labels, respectively, belong to the background class. Thus, for a fair evaluation of both background and non-background classes, we also plot the recall of the background class against the recall of non-background classes for person and action classification in Figures 3(b), 3(c), 3(e) and 3(f). These curves were obtained by simultaneously computing the recall for background and non-background classes at a range of threshold values on the score. The mean average precision (mAP) of WSCSIIBP and all other baselines is plotted in Figures 3(g) and 3(h). The mAP values also clearly demonstrate the effectiveness of the proposed approach.
From the performance of WSSIIBP (integrative concepts, no constraints) and WSCSIBP (no integrative concepts, constraints) in Figures 3(a) and 3(d), it is clear that the improvement of WSCSIIBP can be attributed both to the integrative modeling of concepts and to the location constraints.
Effect of constraints (7), (8): We note that, regardless of other differences, every weakly supervised IBP model considered here enforces constraints (7) and (8). However, these constraints are not part of the original WSDC. For a fair comparison between WSDC and WSCSIIBP, we analyze the effect of these constraints in Figure 3(i). Although these additional constraints improve WSDC's performance, they do not close the gap to WSCSIIBP. Further, we observe that these constraints improve the performance of all the weakly supervised IBP models.
4.4 Results on A2D
First, we evaluate localization performance on the full A2D dataset. We experiment with 37,820 tracks extracted from 3,782 videos with around 5,000 weak labels. For every given object-action label pair, our algorithm selects the best track from the corresponding video using the approach outlined in section 3.6. The localization accuracy is measured by computing the average IoU (Intersection over Union) of the selected track (3D bounding box) with the ground-truth bounding box. The class-wise IoU accuracies and the mean IoU accuracy over all classes are tabulated in Tables 1 and 2, respectively. On this task as well, WSCSIIBP yields a relative improvement of 9% over the next best baseline. We also evaluate how accurately the extracted object proposals match the ground-truth bounding boxes, to estimate an upper bound on the localization accuracy (referred to as Upper Bound in Tables 1 and 2). In this case, the track maximizing the average IoU with the ground-truth annotation is selected and the corresponding IoU is reported. We plot the correct-localization accuracy at varying IoU thresholds in Figure 4(a), which also shows the effectiveness of the proposed approach. Figures 4(b)–(c) show some qualitative localization results of the proposed approach on a few track images.
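For reference, the average-IoU localization metric can be computed per track as sketched below. This is a simplified per-frame 2D-box version of the 3D bounding-box overlap; the frame-to-box dictionary representation is our own assumption for the sketch:

```python
def track_iou(track_a, track_b):
    """Mean per-frame IoU between two tracks (a sketch of the metric).

    Each track is a dict frame -> (x1, y1, x2, y2). Frames where only one
    track is present count as zero overlap, so temporal mismatch is also
    penalized.
    """
    frames = set(track_a) | set(track_b)
    ious = []
    for f in frames:
        if f not in track_a or f not in track_b:
            ious.append(0.0)
            continue
        ax1, ay1, ax2, ay2 = track_a[f]
        bx1, by1, bx2, by2 = track_b[f]
        # Intersection rectangle, clamped to zero when boxes do not overlap.
        iw = max(0.0, min(ax2, bx2) - max(ax1, bx1))
        ih = max(0.0, min(ay2, by2) - max(ay1, by1))
        inter = iw * ih
        union = ((ax2 - ax1) * (ay2 - ay1)
                 + (bx2 - bx1) * (by2 - by1) - inter)
        ious.append(inter / union if union > 0 else 0.0)
    return sum(ious) / len(ious) if ious else 0.0
```

Averaging over the union of both tracks' frames means a proposal that covers the object spatially but misses it temporally still scores low, which matches the spatiotemporal reading of the metric.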
Test Inference: We evaluate the classification performance on held-out test samples using the same train/test partition as in [11]. We consider two setups for the evaluation: (a) using video-level labels for the test samples, and (b) free annotation, where no test video labels are provided. The proposed approach is compared with GTSVM, a fully supervised linear SVM that uses ground-truth bounding boxes and their corresponding strong labels during training. The results are tabulated in Table 3. Note that the performance of WSCSIIBP is close to that of the fully supervised setup.
Table 1: Class-wise IoU accuracy (%) on the A2D dataset.

              adult  baby  ball  bird   car   cat   dog  climb  crawl   eat   fly  jump  roll   run  walk
WSCSIIBP       28.4  43.6   9.8  37.8  37.4  40.8  42.0   37.5   47.6  46.1  24.5  29.4  50.9  25.6  37.2
Upper Bound    39.9  53.9  16.4  48.2  48.7  52.8  51.4   50.0   59.2  57.2  33.9  41.0  59.1  38.1  47.9
Table 2: Mean IoU accuracy (%) over all classes on the A2D dataset.

        Random   WSP    WSA  WSSIBP  WSSIIBP  WSCSIBP  WSCSIIBP  Upper Bound
IoU       25.5  29.7  30.43    31.1    31.55    31.69     34.38        45.05
Table 3: Test classification accuracy (%) on the A2D dataset.

                         WSCSIIBP            GTSVM
Setup                   Obj     Act       Obj     Act
Using video labels    94.77   90.68     98.20   94.92
Free annotation       76.62   64.77     85.18   73.26
5 Conclusion
We developed a Bayesian nonparametric approach that integrates the Indian Buffet Process with heterogeneous concepts and spatiotemporal location constraints arising from weak labels. We presented experimental results on two recent datasets containing heterogeneous concepts such as persons, objects and actions, and showed that our approach outperforms the best state-of-the-art methods. In future work, we will extend the WSCSIIBP model to additionally localize audio concepts from speech input and develop an end-to-end deep neural network for joint feature learning and Bayesian inference.
6 Appendix
6.1 Expectation Constraints
In a Bayesian framework, the effective constraints for Equations (4)–(8) are defined as expectations [31, 32] of the original constraints and can be rewritten as,
(S13)  
(S14)  
(S15)  
(S16)  
(S17) 
where the expectation is taken w.r.t. the posterior distribution in (3). From (3), one may note that, through , the samples of depend on the previously sampled latent coefficients such as . This complicates the applicability of the constraint in Equation (S13). However, due to the independence assumption in the family of tractable posterior distributions in (10), the constraints in Equations (S13)–(S17) simplify to,
(S18)  
(S19)  
(S20)  
(S21)  
(S22) 
6.2 Derivation of Posterior Update Equations
Now, note that the constraints in (S18)–(S20) can be rewritten as hinge loss functions and added to the objective function in Equation (9). Hence, the final formulation is given by,
(S23)  
The objective function in eq. (S23) can be rewritten as,
(S24) 
where represents the KL-divergence term, denotes the likelihood term and is the term corresponding to the hinge loss function for . Expanding , we get,
(S25)  
(S26) 
where is matrix, ; ; and
(S27) 
For the KL-divergence term, we get , where the individual terms are,
(S28)  
(S29)  
(S30) 
where is the digamma function. As shown for original IBP in [29], the term is approximated by its lower bound,
(S31)  
where the variational parameter is a k-point probability mass function and denotes the entropy of . The tightest lower bound is obtained by setting,
where is the normalization factor to enable to be a distribution. On replacing the term