Multimodal Visual Concept Learning with Weakly Supervised Techniques


Despite the availability of a huge amount of video data accompanied by descriptive texts, it is not always easy to exploit the information contained in natural language in order to automatically recognize video concepts. Towards this goal, in this paper we use textual cues as a means of supervision, introducing two weakly supervised techniques that extend the Multiple Instance Learning (MIL) framework: Fuzzy Sets Multiple Instance Learning (FSMIL) and Probabilistic Labels Multiple Instance Learning (PLMIL). The former encodes the spatio-temporal imprecision of the linguistic descriptions with Fuzzy Sets, while the latter models different interpretations of each description’s semantics with Probabilistic Labels, both formulated through a convex optimization algorithm. In addition, we provide a novel technique to extract weak labels in the presence of complex semantics, which consists of semantic similarity computations. We evaluate our methods on two distinct problems, namely face and action recognition, in the challenging and realistic setting of movies accompanied by their screenplays, contained in the COGNIMUSE database. We show that, on both tasks, our method considerably outperforms a state-of-the-art weakly supervised approach, as well as other baselines.


1 Introduction

Automatic video understanding has become one of the most essential and demanding research directions. The problems that stem from this field, such as activity recognition, saliency estimation and scene analysis, involve detecting events and extracting high-level semantics in realistic video sequences. So far, the majority of the methods designed for these tasks deal only with visual data, ignoring the presence of other modalities, such as text and sound. Nonetheless, exploiting the information these modalities provide can lead to a better understanding of the underlying semantics. In addition, most of these techniques are fully supervised and are trained on diverse and usually large-scale datasets. Recently, in an attempt to avoid the significant cost of manual annotation, there has been increasing interest in learning techniques that reduce human intervention.

Figure 1: Example of a video segment described by the text shown below the pictures. During the time interval [0:19:21 - 0:19:34] three actions take place (“standing up”, “walk”, “answering phone”) performed by the same person (Colin). The corresponding text mentions the actions as “gets up”, “walks back” and “opens his cell phone”, respectively.

Motivated by the above, in this paper we approach video understanding multimodally, where our goal is to recognize visual concepts by mining their labels from an accompanying descriptive document. Visual concepts could be loosely defined as spatio-temporally localized video segments that carry a specific structure in the visual domain, which allows them to be classified into various categories. Some specific examples are human faces, actions, scenes, objects etc. The main reason for using text as a complementary modality is the convenience that natural language provides in expressing semantics. Nowadays, there is a plethora of video data with natural language descriptions, i.e., videos on YouTube [27, 28, 41], TV broadcasts including captions [7], videos from parliament or court sessions accompanied by transcripts [22] and TV series or movies accompanied by their subtitles, scripts, or audio descriptions [5, 9, 16, 17, 21, 35, 45]. The last category has recently gathered much interest, mainly because of the descriptiveness of these texts and the realistic nature of the visual data. Inspired by such work, we apply our algorithms to movies accompanied by their scripts. In Figure 1 we illustrate an example extracted from a movie, in which different instances of the action visual concept are described by an accompanying text segment.

Towards this goal, we use a unidirectional model, where information flows from text to video data. This is modeled in terms of weak supervision, while no prior knowledge is used. Specifically, in order to extract the label from the text for each instance of a visual concept, we face two distinct problems. (i) The first is the absence of specific spatio-temporal correspondence between visual and textual elements. In particular, in the tasks mentioned above, the descriptions are never provided with spatial boundaries and the temporal ones are usually imprecise. (ii) The second major issue is the semantic ambiguity of each textual element. This means that, when it comes to inferring complex semantics from the video such as actions or emotions, the extraction of the label from the text is no longer a straightforward procedure. For example, various expressions could be used to describe the action labeled as “walking”, such as “lurching” or “going for a stroll”.

Most of the work so far has dealt only to an extent with the spatio-temporal ambiguity, while the semantic one has been totally ignored [5, 14, 21]. In this work, we introduce two novel weakly supervised techniques extending the Multiple Instance Learning (MIL)¹ discriminative clustering framework of [5]. The first one accounts for variations in the temporal ambiguity, which are modeled by Fuzzy Sets (Fuzzy Sets MIL - FSMIL), while the second models the semantic ambiguities by probability mass functions (p.m.f.) over the label set (Probabilistic Labels MIL - PLMIL). To the best of our knowledge, this is the first time that both methods are formulated in the context of MIL. In addition, we propose a method of extracting labels in complex tasks using semantic similarity computations. We further improve recognition from the perspective of visual representations, using features extracted from pre-trained deep architectures. The combination of all the above ideas leads to superior performance compared to previous work.

Finally, we focus on the recognition of faces and actions and the evaluation is performed on the COGNIMUSE database [45]. It is important to mention that our methods can be applied to other categories of concepts as long as they can be explicitly described in both modalities (video & text).

2 Related Work

During the last few years there have been various approaches to understanding videos or images using natural language. Specifically, many have approached the problem as machine translation, as in [15], where image regions are matched to words of an accompanying caption, and in [28, 35], where representations that translate video to sentences and vice versa are learned. Others have tackled it using video-to-text alignment algorithms [6, 37].

Several works have considered text as a means of supervision. In the problem of naming faces, Berg et al. [3, 4] use Linear Discriminant Analysis (LDA) followed by a modified k-means to classify faces in newspaper images, while the labels are obtained from captions. In [5, 9, 10, 17, 29, 31, 38] the authors tackle a similar problem, classifying faces in TV series or movies using the names of the speakers provided by the corresponding scripts. The proposed methods are based on semi-supervised-like techniques using exemplar tracks [17, 38], ambiguous labeling [9, 10] or MIL [5, 29, 31].

The problem of automatically annotating actions in videos has recently drawn the attention of several researchers, because of the need to create diverse and realistic datasets of human activities. For this purpose, Laptev et al. used movie scripts to collect and learn realistic actions [21]. Later on, this work was improved by incorporating information from the context, leading to the creation of the Hollywood2 dataset [25], and by more accurate temporal localization using MIL [14]. In these works, a Bag-of-Words text classifier is trained with annotated sentences in order to locate specific actions in the scripts. On the contrary, our work is based only on semantic similarity, eliminating the cost of annotation. In Bojanowski et al. [5], MIL is also used to jointly learn names and actions. The extraction of labels from the text is performed using SEMAFOR [11], a semantic role labeling parser, searching for two action frames. This method, despite its promising results, cannot be easily generalized to custom actions. Finally, all the above end up considering only the most certain labels that the text provides, ignoring possible paraphrases or synonyms. This allows an automatic collection of data with limited noise, but in general it leads to understanding only a small proportion of each individual video.

In order to learn from partially labeled data, there has been an extensive study on weakly supervised classification [19]. Learning with probabilistic labels has been examined in [20] under a probabilistic framework. Cour et al. [9] formulated a sub-category of this method, where all possible labels are distributed uniformly (candidate labels) and the classification is performed by minimizing a convex loss function. Both papers concern a single-instance setting, namely a p.m.f. over the label set is assigned to individual instances. On the contrary, we assign a p.m.f. to bags-of-instances, generalizing previous formulations. MIL has been largely studied in the machine learning community starting from Dietterich et al. [13], where drug activity was predicted. Except for the efforts on naming faces mentioned before, MIL has been used for detecting objects [44] and classifying scenes [24] in images, where annotation lacks specific spatial localization. While the definition of MIL is sufficient for most of its applications, it is sometimes important to discriminate between the instances in each bag. In order to model this case, we redefine MIL using Fuzzy Sets.

3 Multimodal Learning of Concepts

Given a video and a descriptive text that are crudely aligned [17], namely each portion of the text is attributed with temporal boundaries, we aim to localize and identify all the existing instances of a chosen visual concept, such as faces, actions or scenes. The adversities of such a task are clearly illustrated in Figure 1, where the concept examined is that of human actions. Our approach breaks down the problem into three subproblems. (a) First of all, the exact position in space and time of each visual concept is unknown, thus it needs to be detected automatically. (b) Secondly, concepts are usually expressed in the text differently from their original definition. For instance, as shown in Figure 1, the action “standing up” is mentioned by the phrase “gets up”, while the action “answering phone” is mentioned by the phrase “opens his cell phone”. In order to tackle this problem, we need to detect the part of the text that implies a concept and then mine the label information. (c) Finally, following the alignment procedure, the text is divided into segments that describe specific time intervals of the video. Each one of them might mention more than one instance of a visual concept. Thus, we need to apply a learning procedure that matches the mined labels with the detected concepts. Note here that sometimes a concept described in the text might not appear in the video or vice versa. As a result, we need to design an algorithm that learns the visual concepts globally, without restricting the matching only to the corresponding time intervals.

Solving (a) and (b) requires task-dependent systems, which are both described in section 4. The outputs of these systems are perceived as visual and linguistic objects, respectively, with their temporal boundaries determined. Following the computation of these, we address (c) and formulate the learning algorithms.

3.1 Problem Statement

We assume a dual modality scheme, where both modalities carry the same semantics. This can be modeled with two data streams flowing in parallel as time evolves (Figure 2). The first data stream consists of the unidentified visual objects that we want to recognize. We denote by $\mathcal{V} = \{v_1, \dots, v_N\}$ the set of visual objects. The second modality consists of the linguistic objects that carry, in some way, the information for the identification of each $v_i$, namely they describe the $v_i$. We denote by $\mathcal{L} = \{l_1, \dots, l_M\}$ the set of linguistic objects (i.e., words or sentences).

Figure 2: Illustration of the two modalities as parallel data streams.
Figure 3: (a): Example of the two data streams formed by linguistic and visual objects concerning the concept of human actions. Next to each visual object we demonstrate its ground truth. The formation of the streams is carried out by solving subproblems (a) and (b). (b): The construction of bags under the MIL setting with Fuzzy Sets and Probabilistic Labels. The label set here is [walk, turn, running, driving car, getting out the car].

We assume that each $v_i$ is represented in a feature space by a vector $x_i \in \mathbb{R}^d$. We define the matrix $X \in \mathbb{R}^{N \times d}$ containing all the visual features. The time interval of each $v_i$ is denoted as $T^v_i$.

Let $\mathcal{A} = \{a_1, \dots, a_K\}$ be the label set of $K$ discrete labels. Each $l_j$ is mapped to a label through a mapping $g$. This can be either deterministic, matching each $l_j$ to a sole label $a_k \in \mathcal{A}$, or probabilistic, matching each $l_j$ to a p.m.f. $p_j$ over the label set (see section 3.3.2). The time interval of each $l_j$ is denoted as $T^l_j$.

Our goal is to assign to each $v_i$ a specific label drawn from $\mathcal{A}$. We denote by $Z \in \{0,1\}^{N \times K}$ the indicator matrix, where $z_{ik} = 1$ iff the label assigned to $v_i$ equals $a_k$. We want to infer $Z$ given the visual feature matrix $X$, the mapping $g$ and the temporal intervals $T^v_i, T^l_j$.

3.2 Clustering Model

Our model is based on DIFFRAC [2], a discriminative clustering method. In particular, Bach and Harchaoui, in order to assign labels to unsupervised data, form a ridge regression loss function using a linear classifier $(W, b)$, where $W \in \mathbb{R}^{d \times K}$ and $b \in \mathbb{R}^{K}$, which is optimized as follows:

$$\min_{Z, W, b} \; \frac{1}{2N} \left\| Z - XW - \mathbf{1}_N b^\top \right\|_F^2 + \lambda \left\| W \right\|_F^2 \quad (1)$$

where $\lambda$ stands for the regularization parameter. Eq. (1) can be solved analytically w.r.t. the classifier $(W, b)$, leading to a new objective function that needs to be minimized only w.r.t. the assignment matrix $Z$:

$$\min_{Z} \; \operatorname{tr}(Z^\top B Z) \quad (2)$$

where $B$ is a matrix that depends on the parameter $\lambda$ and the Gram matrix $XX^\top$, which can be replaced by any kernel (see [2]). Relaxing the matrix $Z$ to $\hat{Z} \in [0,1]^{N \times K}$, the objective becomes a convex quadratic function, constrained by the following:

$$\hat{z}_{ik} \ge 0 \;\; \forall i, k, \qquad \sum_{k=1}^{K} \hat{z}_{ik} = 1 \;\; \forall i \quad (3)$$

where $\hat{z}_{ik}$ denotes the $(i,k)$ element of matrix $\hat{Z}$.
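To make the reduction above concrete, the sketch below (our own illustration under the standard DIFFRAC algebra, not the authors' code) eliminates the ridge-regression classifier $(W, b)$ analytically and evaluates the resulting quadratic cost $\operatorname{tr}(Z^\top B Z)$:

```python
import numpy as np

def diffrac_cost_matrix(X, lam):
    """Eliminate the ridge-regression classifier (W, b) analytically, leaving
    a cost that depends on the assignment matrix Z alone: J(Z) = tr(Z^T B Z).
    The centering projection removes the bias term b."""
    N, d = X.shape
    Pi = np.eye(N) - np.ones((N, N)) / N   # centering projection
    M = np.eye(N) - X @ np.linalg.solve(X.T @ X + N * lam * np.eye(d), X.T)
    return Pi @ M @ Pi / (2 * N)

def relaxed_objective(Z_hat, B):
    """Evaluate the convex quadratic objective tr(Z^T B Z)."""
    return np.trace(Z_hat.T @ B @ Z_hat)
```

For $\lambda > 0$ the matrix $B$ is positive semi-definite, which is what makes the relaxed problem a convex quadratic program.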

3.3 Weakly Supervised Methods

In order to incorporate in the model the weak supervision that the complementary modality provides, we have to resolve two kinds of ambiguities:

  • Which visual object $v_i$ is described by each linguistic object $l_j$?

  • Which label $a_k$ is implied by each $l_j$?

Fuzzy Sets Multiple Instance Learning - FSMIL

In an attempt to address the first question, similar to [5], we assume that each $l_j$ should describe at least one of the $v_i$ that temporally overlap with it. This leads to a multi-class MIL approach, where for each $l_j$ a bag-of-instances $\mathcal{B}_j$ is created, containing all the overlapping $v_i$:

$$\mathcal{B}_j = \{\, v_i \in \mathcal{V} : T^v_i \cap T^l_j \neq \emptyset \,\} \quad (4)$$

We extend this framework in order to discriminate between visual objects with different temporal overlaps. In fact, the longer the overlap, the more likely it is for a visual object to be described by the corresponding linguistic object $l_j$. For example, during a dialogue, the camera usually focuses on the current speaker longer than on the silent person, while the document mentions the first. Thus, we need to encode this observation in our MIL sets. This is done by defining a novel type of MIL sets using fuzzy logic (see Figure 3). Each member $v_i$ of the set is accompanied by a value that denotes its membership grade $\mu_j(v_i)$:

$$\tilde{\mathcal{B}}_j = \{\, (v_i, \mu_j(v_i)) : v_i \in \mathcal{B}_j \,\}, \qquad \mu_j(v_i) = m\!\left( \frac{|T^v_i \cap T^l_j|}{|T^v_i|} \right) \quad (5)$$

where $m : [0,1] \to [0,1]$ is an increasing membership function with $m(1) = 1$. In addition, we note that, in order to compensate for the crude alignment mistakes, we can add a hyper-parameter $\Delta t$ that adjusts the linguistic object time interval as follows: $\tilde{T}^l_j = [\, t^l_{j,s} - \Delta t\, \bar{T},\; t^l_{j,e} + \Delta t\, \bar{T} \,]$, where $T^l_j = [t^l_{j,s}, t^l_{j,e}]$ and $\bar{T}$ is the average value of $|T^v_i|$ over all $i$.
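As a small illustration of how such fuzzy bags can be built, the sketch below (a hypothetical implementation; the interval conventions and the steep-sigmoid membership function are our assumptions) assigns membership grades from temporal overlaps:

```python
import numpy as np

def overlap(a, b):
    """Length of the intersection of two intervals a = (start, end), b = (start, end)."""
    return max(0.0, min(a[1], b[1]) - max(a[0], b[0]))

def fuzzy_bag(visual_intervals, text_interval, m):
    """Build a fuzzy bag-of-instances: every visual object that temporally
    overlaps the linguistic interval enters the bag with a membership grade
    mu = m(|Tv ∩ Tl| / |Tv|)."""
    bag = {}
    for i, tv in enumerate(visual_intervals):
        ratio = overlap(tv, text_interval) / (tv[1] - tv[0])
        if ratio > 0.0:
            bag[i] = m(ratio)
    return bag

# A steep sigmoid approximates a thresholded membership function;
# theta and gamma are illustrative values.
m = lambda x, theta=0.3, gamma=1000.0: 1.0 / (1.0 + np.exp(-gamma * (x - theta)))
```

With a very large `gamma` the grade is effectively 1 for overlap ratios above `theta` and 0 below it.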

Probabilistic Labels Multiple Instance Learning - PLMIL

As mentioned before, the labels extracted from the complementary modality involve a level of uncertainty. This happens due to the fact that the extraction procedure is a classification problem on its own. Solving this problem is equivalent to inferring the mapping $g$. Obtaining the single label that the classifier predicts for each linguistic object $l_j$ renders the mapping deterministic, while obtaining the posterior probabilities that the classifier outputs renders it probabilistic.

In this work, we use a probabilistic mapping based on the posterior probabilities $p_j(a_k)$. In order to match them with the visual objects $v_i$, we perceive them as Probabilistic Labels (PLs). As mentioned in [20], matching a PL to an instance that needs to be classified accounts for an initial estimation of its true label. In our problem, we generalize the definition of [20] by matching PLs to bags-of-instances, meaning that at least one instance of the set should be described by this measure of initial confidence. In this case, the model’s input data is formed as the pairs $\{(\mathcal{B}_j, p_j)\}_{j=1}^{M}$.

We address the classification problem of text segments in an unsupervised manner. Specifically, we calculate the semantic similarity of each $l_j$ with the linguistic representation of each label $a_k$, using the algorithm of [18]. We also apply a threshold to each similarity value, in order to eliminate the noisy $l_j$ that do not imply any of the labels. Thus, for each $l_j$ we obtain a similarity vector $s_j$, which is then normalized to constitute a p.m.f. $p_j$:

$$p_j(a_k) = \frac{s_j(k)}{\sum_{k'=1}^{K} s_j(k')} \quad (6)$$
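The thresholding-and-normalization step can be sketched as follows (a minimal illustration; the L1 renormalization is our reading of the step above):

```python
import numpy as np

def probabilistic_label(similarities, tau=0.4):
    """Threshold a label-similarity vector and renormalize it into a p.m.f.
    Returns None when no label survives the threshold, i.e. the sentence is
    treated as noise."""
    s = np.asarray(similarities, dtype=float)
    s = np.where(s >= tau, s, 0.0)   # discard weak similarities
    total = s.sum()
    return None if total == 0.0 else s / total
```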
Integration of the Weak Supervision in the Clustering Model

In the MIL case each bag $\mathcal{B}_j$ is matched to a single label $a_{k(j)}$ and is represented by the following constraint:

$$\sum_{v_i \in \mathcal{B}_j} \hat{z}_{i k(j)} \ge 1 \quad (7)$$
For the purpose of accounting for noise, slack variables $\xi_j \ge 0$ are used to reformulate both the objective function and the constraints:

$$\min_{\hat{Z},\, \xi \ge 0} \; \operatorname{tr}(\hat{Z}^\top B \hat{Z}) + C \sum_{j} \xi_j \quad \text{s.t.} \quad \sum_{v_i \in \mathcal{B}_j} \hat{z}_{i k(j)} \ge 1 - \xi_j, \;\; \forall j \quad (8)$$
In our FSMIL, we intend to add different weights to the elements of each bag, depending on the membership grade:

$$\sum_{v_i \in \mathcal{B}_j} \mu_j(v_i)\, \hat{z}_{i k(j)} \ge 1 - \xi_j \quad (9)$$
For the PLMIL case, let $\mathcal{A}_j \subseteq \mathcal{A}$ be the set of the labels for which the p.m.f. $p_j$ is non-zero. For each label $a_k \in \mathcal{A}_j$ we construct a constraint formed as in (9), i.e.:

$$\sum_{v_i \in \mathcal{B}_j} \mu_j(v_i)\, \hat{z}_{ik} \ge 1 - \xi_{jk}, \quad \forall a_k \in \mathcal{A}_j \quad (10)$$
The discrimination between the various labels of $\mathcal{A}_j$ is carried out by the slack variables. In particular, we rewrite the objective function as follows:

$$\min_{\hat{Z},\, \xi \ge 0} \; \operatorname{tr}(\hat{Z}^\top B \hat{Z}) + C \sum_{j} \sum_{a_k \in \mathcal{A}_j} p_j(a_k)\, \xi_{jk} \quad (11)$$
In this way, we manage to relax each constraint inversely proportionally to the probability of the corresponding label. As a result, the higher the probability of a label, the harder its constraint is to violate.

Rounding: Similarly to [5], we choose a simple rounding procedure for $\hat{Z}$ that takes the maximum value along each of its rows and replaces it with 1. The rest of the values are replaced with 0.
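This rounding step is simple enough to sketch directly (an illustrative implementation of the per-row argmax rounding described above):

```python
import numpy as np

def round_assignment(Z_hat):
    """Round the relaxed assignment matrix: place a 1 at the maximum of each
    row of Z_hat and 0 everywhere else, yielding a hard one-hot labeling."""
    Z = np.zeros_like(Z_hat)
    Z[np.arange(Z_hat.shape[0]), np.argmax(Z_hat, axis=1)] = 1.0
    return Z
```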

4 Experiments

4.1 Dataset

The COGNIMUSE database is a video-oriented database multimodally annotated with audio-visual events, saliency, cross-media relations and emotion [45]. It is a generic database that can be used for training and evaluation of event detection and summarization tasks, for classification and recognition of audio-visual concepts, and others. Other existing databases, such as the MPII-MD [34], the M-VAD [39], the MSVD [8], the MSR-VTT [42], the VTW [43], the TACoS [32, 36], the TACoS Multi-Level [33] and the YouCook [12], are not annotated in terms of specific visual concepts, but in terms of sentence descriptions. Moreover, the datasets used in [9, 17, 31, 38] are only annotated with human faces. Finally, the Hollywood2 [25] and the Casablanca [5] datasets were not sufficient for this task, due to the fact that only automatically collected labels from the text are provided rather than the text itself. On the contrary, the COGNIMUSE database consists of long videos that are continuously annotated with action labels and are accompanied by texts in raw format. In addition, we manually annotated the detected face tracks in order to evaluate the face recognition task. All the above render COGNIMUSE more relevant and useful for the tasks we are dealing with. In this work, we used 5 of the 7 annotated 30-minute movie clips, which are: “A Beautiful Mind” (BMI), “Crash” (CRA), “The Departed” (DEP), “Gladiator” (GLA) and “Lord of the Rings - the Return of the King” (LOR).

4.2 Implementation

Detection and Feature Extraction: We spatio-temporally detect and track faces similarly to [5], where face tracks are represented by SIFT descriptors and the kernels are computed separately for each facial feature, taking into account whether a face is frontal or in profile. In contrast, we use deep features extracted from the last fully connected layer of the VGG-face pre-trained CNN [30], while a single kernel is computed for each pair of face tracks regardless of the faces’ poses. Similarly to [5, 17, 38], the kernel applied is a min-min RBF. For the problem of action recognition, we use the temporal boundaries provided by the dataset. We represent the action segments through the C3D pre-trained CNN, following the methodology of [40].
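A min-min RBF kernel of the kind mentioned above can be sketched as follows (an illustrative implementation; the bandwidth parameter is an assumption):

```python
import numpy as np

def min_min_rbf(track_a, track_b, gamma=1.0):
    """Min-min RBF kernel between two face tracks, each an (n_frames, dim)
    array of per-frame descriptors: the RBF of the smallest pairwise squared
    distance between the two tracks."""
    d2 = ((track_a[:, None, :] - track_b[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-gamma * d2.min())
```

Taking the minimum over all frame pairs makes the kernel robust to pose changes within a track: two tracks are considered similar if any pair of their frames is similar.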

Method                  Development       MAP      Test                      MAP      All MAP
Text+MIL                0.379   0.540     0.459    0.419   0.323   0.383     0.375    0.409
SIFT+MIL [5]            0.630   0.879     0.755    0.757   0.641   0.683     0.694    0.718
SIFT+FSMIL              0.693   0.883     0.788    0.793   0.690   0.747     0.743    0.761
VGG+MIL                 0.834   0.956     0.895    0.828   0.696   0.833     0.786    0.830
VGG+FSMIL (Ours)        0.864   0.953     0.909    0.864   0.731   0.902     0.833    0.863
Table 1: The Average Precision (AP) scores of each movie in the Development set (two movies) and the Test set (three movies) for the Face Recognition task. The Mean Average Precision (MAP), calculated over the two sets separately and over the database as a whole, is also shown.

Label Mining from Text: Prior to applying the label extraction algorithms, we perform a crude alignment between the script and the subtitles through a widely used DTW algorithm [17]. The label set for the face recognition task is defined using the cast list of each movie (this information was downloaded from the TMDB website [1]). The character labels are then extracted using regular expression matching, where the query expressions are the names included in the cast list. We define the label set for the action recognition task using a subset of the total classes of the COGNIMUSE database. We locate the linguistic objects by composing short sentences consisting of each sentence’s verb, as well as words linked to the verb through specific dependencies, such as the direct object and adverbs. We use the CoreNLP toolbox [23] to perform the document’s dependency parsing. Finally, we calculate the semantic similarity of every label - short sentence pair, applying an off-the-shelf sentence similarity algorithm [18]. This is a hybrid approach combining Latent Semantic Analysis (LSA) with knowledge from WordNet [26]. Similarities that do not exceed a specific threshold $\tau$, experimentally set to 0.4, are discarded.
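The regular-expression matching against the cast list can be sketched as follows (an illustrative implementation with a hypothetical cast list; the real pipeline draws the names from TMDB):

```python
import re

def mine_name_labels(script_segment, cast_list):
    """Return the character labels from cast_list that are mentioned in a
    script segment, via case-insensitive word-boundary regex matching."""
    return [name for name in cast_list
            if re.search(r'\b' + re.escape(name) + r'\b', script_segment,
                         flags=re.IGNORECASE)]
```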

4.3 Learning Experiments

In the following experiments we evaluate our methods on the tasks of (i) face and (ii) action recognition. For the FSMIL setting, after extensive experimentation, we settled on using as a membership function a specific case of S-shaped functions, defined as follows:

$$m(x) = \frac{1}{1 + e^{-\gamma (x - \theta)}} \quad (12)$$

where $\theta$ is the membership threshold and $\gamma$ is a parameter that controls how abrupt the increase above the threshold is. We need to assign $\gamma$ a large value (above 1000) in order to have $m(x) \approx 1$ above the threshold and $m(x) \approx 0$ below it. For such values there are no significant changes in the results. We tune the hyperparameters $\theta$ and $\Delta t$ on the development set, independently for the two tasks.

Face Recognition

We evaluate each method’s performance using the Average Precision (AP), as previously used in [5, 9]. We compare our model (VGG+FSMIL) to the methodology of Bojanowski et al. [5] - which has outperformed other weakly supervised methods, such as [9] and [38] - as well as to the other baselines described next:

  1. Text+MIL: We solve the problem by minimizing only the term relating to the slack variables. This method converges to an optimal point that best satisfies the constraints posed by the text, while the visual features are not taken into account. The constraints are formed using the simple MIL setting.

  2. SIFT+MIL [5]: The algorithm of Bojanowski et al. that uses SIFT descriptors as feature vectors and the simple MIL setting, without taking into account the temporal overlaps, namely the bags are constructed as noted by (4).

  3. SIFT+FSMIL: Our proposed learning method implemented with SIFT descriptors.

  4. VGG+MIL: The algorithm of Bojanowski et al. implemented with VGG-face descriptors.

  5. VGG+FSMIL (Ours): The proposed learning method implemented with VGG-face descriptors.

The comparative results for each movie are shown in Table 1. As can be clearly seen, our method outperforms [5] in every case.

First of all, the inferior performance of the Text+MIL method shows the inefficiency of using only textual information in tackling the problem. The higher accuracy accomplished by the methods implemented with VGG proves the benefits of deep learning over hand-crafted features as means of representing faces. Moreover, incorporating the information given by the overlaps of visual and linguistic objects, we improve the accuracy regardless of the nature of the representation. In particular, due to the fact that our method reduces the ambiguity in each bag-of-instances, we outperform the baseline even without the use of deep features. As expected, the combination of the above (VGG and FSMIL) shows the highest accuracy. This can be easily explained as each one of the methods improves different aspects of the learning procedure.

Action Recognition

Regarding the task of action recognition, several experiments were carried out for each movie, while changing the cardinality of the label set. In particular, the performance is evaluated using the 2, 4, 6, 8 and 10 most frequent action classes. The evaluation is performed using the Mean AP metric, which stands for averaging the APs over each movie set. The results are shown for the development and the test set in Tables 2 and 3, respectively. We also illustrate the performance of the methods on the whole dataset with the per-sample accuracy vs. proportion of total instances curves of Figure 4.

We choose as baselines the aforementioned Text+MIL, as well as a methodology similar to Bojanowski et al. [5]. In this experiment, we focus on the different ways of learning from the text rather than on the visual features, thus in all cases we use the C3D descriptor for the representation of actions. The methods compared are:

  1. Text+MIL: Same as the one described in section 4.3.1. The action labels are extracted by locating the sentences that are semantically identical to one of the labels of the set (similarity = 1).

  2. MIL ([5] modified): The learning algorithm of Bojanowski et al. mentioned in section 4.3.1. We replace the dense trajectories action descriptors with C3D. Again, we use only the sentences that are semantically identical to some label.

  3. Sim+MIL: The same learning algorithm, but labels are extracted from sentences that are semantically similar to one of the labels of the set (similarity above the threshold $\tau$). Each sentence is assigned a single label, the one with the maximum similarity.

  4. Sim+PLMIL: Our PLMIL method. We assign a probabilistic label to each sentence.

  5. Sim+FSMIL: Our FSMIL method. We construct the bags-of-instances as fuzzy sets.

  6. Sim+FSMIL+PLMIL (Ours): The combination of our contributions using semantically similar sentences, probabilistic labels and fuzzy bags-of-instances.

Figure 4: The curves show the per sample accuracy plotted against the proportion of total instances, concerning the whole dataset. Each figure corresponds to a different experiment concerning the number of classes. Note that as the number of classes increases, the model that combines our methods greatly outperforms each one of them individually as well as the baseline. Please see color version for better visibility.

action classes            2       4       6       8       10
Text+MIL                  0.270   0.275   0.178   0.160   0.157
MIL ([5] modified)        0.434   0.417   0.284   0.267   0.193
Sim+MIL                   0.837   0.298   0.243   0.339   0.202
Sim+PLMIL                 0.837   0.304   0.309   0.348   0.230
Sim+FSMIL                 0.945   0.614   0.436   0.405   0.309
Sim+FSMIL+PLMIL (Ours)    0.945   0.617   0.520   0.484   0.449

Table 2: The Mean Average Precision (MAP) scores over the Development set for five independent experiments for the Action Recognition task.

action classes            2       4       6       8       10
Text+MIL                  0.212   0.124   0.112   0.092   0.063
MIL ([5] modified)        0.451   0.428   0.183   0.189   0.212
Sim+MIL                   0.631   0.591   0.182   0.094   0.129
Sim+PLMIL                 0.585   0.563   0.299   0.141   0.149
Sim+FSMIL                 0.793   0.458   0.230   0.167   0.145
Sim+FSMIL+PLMIL (Ours)    0.730   0.527   0.286   0.253   0.265

Table 3: The Mean Average Precision (MAP) scores over the Test set for five independent experiments for the Action Recognition task.

First note that the proposed combined model demonstrates superior performance over the Text+MIL baseline, confirming the importance of using visual information, as mentioned in section 4.3.1. Higher performance is also reported over the baseline of [5] in every case, leading to an improvement of 20%–51% on the development set and 5%–28% on the test set. Moreover, Figure 4 shows that it outperforms all methods on the whole dataset, except for the case of two classes. Next, we examine each of our contributions independently.

The method of extracting labels through similarity measurements outperforms the baseline mainly when the number of classes is small (2-4), as shown in Tables 2 and 3. In this case, the concepts implied by the labels are rarely confused in terms of semantics, hence most of the similarity measurements produce correct labels. However, as this number increases, the Sim+MIL method does not prove very efficient on its own. A possible explanation is that the semantically identical labels of the baseline usually constitute a cleaner set, while the confusion introduced into the model by semantically similar labels rises. As a result, even though only a small number of bags-of-instances are annotated, the baseline algorithm is still able to make a few correct predictions with large confidence. This is illustrated in Figure 4 (c) and (d), where the most confident predictions of the baseline are accurate, contrary to those of Sim+MIL.

This confusion is partially compensated by either PLMIL or FSMIL. Regarding the first, when the classes are few, a sentence is rarely similar to more than one concept, hence the labels are mainly deterministic. However, modeling labels in a probabilistic way achieves better disambiguation of the sentences’ meanings as the number of classes grows larger, as evidenced by the fact that Sim+PLMIL outperforms Sim+MIL for 6-10 classes in both sets. As far as FSMIL is concerned, this method is expected to perform better on its own for the reasons mentioned in section 4.3.1, regardless of the number of classes. Indeed, Sim+FSMIL outperforms Sim+MIL in most of the cases.

Interestingly, the combination of our contributions manages to outperform the baseline, even where neither of them could do so independently. This can be explained by the fact that the algorithm leverages each one of them to resolve different kinds of ambiguities. Regarding the lower results on the test set compared to the development set, we noticed that the scripts of the test movies are not sufficiently well aligned to the videos, while a significant number of actions occur in the background and consequently are not described in the text.

5 Conclusion

In this work we tackled the problem of automatically learning video concepts by combining visual and textual information. We proposed two novel weakly supervised techniques that efficiently deal with temporal ambiguities (FSMIL) as well as semantic ones (PLMIL), and that can be easily generalized to other multimodal learning tasks. Contrary to previous work, we acquire richer information from the text using semantic similarity. We evaluated our models on the COGNIMUSE dataset, which contains densely annotated movies accompanied by their scripts. Our techniques provide significant improvements over a state-of-the-art weakly supervised method on both face and action recognition tasks. Regarding future work, we plan to extend our uni-directional model to a bi-directional one, where information will flow from text to video and vice versa, jointly learning visual and linguistic concepts. Finally, the generality of our formulation motivates us to explore its potential in learning from other modalities, such as the audio channel.


  1. In this paper, the term MIL does not refer only to binary classification problems with positive and negative bags, as in its original definition [13], but also to the multi-class case.


  2. F. R. Bach and Z. Harchaoui. DIFFRAC: a discriminative and flexible framework for clustering. In NIPS, 2008.
  3. T. L. Berg, A. C. Berg, J. Edwards, and D. A. Forsyth. Who’s in the picture. In NIPS, 2005.
  4. T. L. Berg, A. C. Berg, J. Edwards, M. Maire, R. White, Y.-W. Teh, E. Learned-Miller, and D. A. Forsyth. Names and faces in the news. In CVPR, 2004.
  5. P. Bojanowski, F. Bach, I. Laptev, J. Ponce, C. Schmid, and J. Sivic. Finding actors and actions in movies. In ICCV, 2013.
  6. P. Bojanowski, R. Lajugie, E. Grave, F. Bach, I. Laptev, J. Ponce, and C. Schmid. Weakly-supervised alignment of video with text. In ICCV, 2015.
  7. H. Bredin, C. Barras, and C. Guinaudeau. Multimodal person discovery in broadcast tv at mediaeval 2016. In MediaEval, 2016.
  8. D. L. Chen and W. B. Dolan. Collecting highly parallel data for paraphrase evaluation. In ACL, 2011.
  9. T. Cour, B. Sapp, C. Jordan, and B. Taskar. Learning from ambiguously labeled images. In CVPR, 2009.
  10. T. Cour, B. Sapp, and B. Taskar. Learning from partial labels. JMLR, 2011.
  11. D. Das, A. F. Martins, and N. A. Smith. An exact dual decomposition algorithm for shallow semantic parsing with constraints. In SEM, 2012.
  12. P. Das, C. Xu, R. F. Doell, and J. J. Corso. A thousand frames in just a few words: Lingual description of videos through latent topics and sparse object stitching. In CVPR, 2013.
  13. T. G. Dietterich, R. H. Lathrop, and T. Lozano-Pérez. Solving the multiple instance problem with axis-parallel rectangles. Artif Intell, 1997.
  14. O. Duchenne, I. Laptev, J. Sivic, F. Bach, and J. Ponce. Automatic annotation of human actions in video. In ICCV, 2009.
  15. P. Duygulu, K. Barnard, J. F. de Freitas, and D. A. Forsyth. Object recognition as machine translation: Learning a lexicon for a fixed image vocabulary. In ECCV, 2002.
  16. G. Evangelopoulos, A. Zlatintsi, A. Potamianos, P. Maragos, K. Rapantzikos, G. Skoumas, and Y. Avrithis. Multimodal saliency and fusion for movie summarization based on aural, visual, and textual attention. IEEE TMM, 2013.
  17. M. Everingham, J. Sivic, and A. Zisserman. “Hello! My name is… Buffy” – automatic naming of characters in TV video. In BMVC, 2006.
  18. L. Han, A. Kashyap, T. Finin, J. Mayfield, and J. Weese. UMBC_EBIQUITY-CORE: Semantic textual similarity systems. In *SEM, 2013.
  19. J. Hernández-González, I. Inza, and J. A. Lozano. Weak supervision and other non-standard classification problems: a taxonomy. Pattern Recogn Lett, 2016.
  20. R. Jin and Z. Ghahramani. Learning with multiple labels. In NIPS, 2003.
  21. I. Laptev, M. Marszalek, C. Schmid, and B. Rozenfeld. Learning realistic human actions from movies. In CVPR, 2008.
  22. S. Maji and R. Bajcsy. Fast unsupervised alignment of video and text for indexing/names and faces. In Workshop on multimedia information retrieval on The many faces of multimedia semantics, 2007.
  23. C. D. Manning, M. Surdeanu, J. Bauer, J. Finkel, S. J. Bethard, and D. McClosky. The Stanford CoreNLP natural language processing toolkit. In ACL, 2014.
  24. O. Maron and A. L. Ratan. Multiple-instance learning for natural scene classification. In ICML, 1998.
  25. M. Marszalek, I. Laptev, and C. Schmid. Actions in context. In CVPR, 2009.
  26. G. A. Miller. WordNet: a lexical database for English. Communications of the ACM, 1995.
  27. T. S. Motwani and R. J. Mooney. Improving video activity recognition using object recognition and text mining. In ECAI, 2012.
  28. S. Naha and Y. Wang. Beyond verbs: Understanding actions in videos with text. In ICPR, 2016.
  29. O. M. Parkhi, E. Rahtu, and A. Zisserman. It’s in the bag: Stronger supervision for automated face labelling. In ICCV Workshop: Describing and Understanding Video & The Large Scale Movie Description Challenge, 2015.
  30. O. M. Parkhi, A. Vedaldi, A. Zisserman, et al. Deep face recognition. In BMVC, 2015.
  31. V. Ramanathan, A. Joulin, P. Liang, and L. Fei-Fei. Linking people in videos with “their” names using coreference resolution. In ECCV, 2014.
  32. M. Regneri, M. Rohrbach, D. Wetzel, S. Thater, B. Schiele, and M. Pinkal. Grounding action descriptions in videos. TACL, 2013.
  33. A. Rohrbach, M. Rohrbach, W. Qiu, A. Friedrich, M. Pinkal, and B. Schiele. Coherent multi-sentence video description with variable level of detail. In GCPR, 2014.
  34. A. Rohrbach, M. Rohrbach, N. Tandon, and B. Schiele. A dataset for movie description. In CVPR, 2015.
  35. A. Rohrbach, A. Torabi, M. Rohrbach, N. Tandon, C. Pal, H. Larochelle, A. Courville, and B. Schiele. Movie description. IJCV, 2017.
  36. M. Rohrbach, W. Qiu, I. Titov, S. Thater, M. Pinkal, and B. Schiele. Translating video content to natural language descriptions. In ICCV, 2013.
  37. P. Sankar, C. V. Jawahar, and A. Zisserman. Subtitle-free movie to script alignment. In BMVC, 2009.
  38. J. Sivic, M. Everingham, and A. Zisserman. “who are you?”-learning person specific classifiers from video. In CVPR, 2009.
  39. A. Torabi, C. Pal, H. Larochelle, and A. Courville. Using descriptive video services to create a large data source for video annotation research. arXiv:1503.01070, 2015.
  40. D. Tran, L. Bourdev, R. Fergus, L. Torresani, and M. Paluri. Learning spatiotemporal features with 3d convolutional networks. In ICCV, 2015.
  41. Z. Wang, K. Kuan, M. Ravaut, G. Manek, S. Song, F. Yuan, K. Seokhwan, N. Chen, L. F. D. Enriquez, L. A. Tuan, et al. Truly multi-modal youtube-8m video classification with video, audio, and text. arXiv:1706.05461, 2017.
  42. J. Xu, T. Mei, T. Yao, and Y. Rui. Msr-vtt: A large video description dataset for bridging video and language. In CVPR, 2016.
  43. K.-H. Zeng, T.-H. Chen, J. C. Niebles, and M. Sun. Title generation for user generated videos. In ECCV, 2016.
  44. C. Zhang, J. C. Platt, and P. A. Viola. Multiple instance boosting for object detection. In NIPS, 2006.
  45. A. Zlatintsi, P. Koutras, G. Evangelopoulos, N. Malandrakis, N. Efthymiou, K. Pastra, A. Potamianos, and P. Maragos. COGNIMUSE: a multimodal video database annotated with saliency, events, semantics and emotion with application to summarization. EURASIP Journal on Image and Video Processing, 2017.