Hierarchical Conditional Relation Networks for Video Question Answering


Video question answering (VideoQA) is challenging as it requires modeling capacity to distill dynamic visual artifacts and distant relations and to associate them with linguistic concepts. We introduce a general-purpose reusable neural unit called Conditional Relation Network (CRN) that serves as a building block to construct more sophisticated structures for representation and reasoning over video. CRN takes as input an array of tensorial objects and a conditioning feature, and computes an array of encoded output objects. Model building becomes a simple exercise of replication, rearrangement and stacking of these reusable units for diverse modalities and contextual information. This design thus supports high-order relational and multi-step reasoning. The resulting architecture for VideoQA is a CRN hierarchy whose branches represent sub-videos or clips, all sharing the same question as the contextual condition. Our evaluations on well-known datasets achieved new SoTA results, demonstrating the impact of building a general-purpose reasoning unit on complex domains such as VideoQA.


1 Introduction

Answering natural questions about a video is a powerful demonstration of cognitive capability. The task involves acquisition and manipulation of spatio-temporal visual representations guided by the compositional semantics of the linguistic cues [7, 17, 20, 30, 33, 36]. As questions are potentially unconstrained, VideoQA requires deep modeling capacity to encode and represent crucial video properties such as object permanence, motion profiles, prolonged actions, and varying-length temporal relations in a hierarchical manner. For VideoQA, the visual representations should ideally be question-specific and answer-ready.

The current approach toward modeling videos for QA is to build neural architectures in which each sub-system is designed either for a specific tailor-made purpose or for a particular data modality. Because of this specificity, such hand-crafted architectures tend to be non-optimal under changes in data modality [17], varying video length [24] or question types (such as frame QA [20] versus action count [6]). This has resulted in a proliferation of heterogeneous networks.

(a) Question: What does the girl do 9 times? Baseline: walk HCRN: blocks a person’s punch Ground truth: blocks a person’s punch

(b) Question: What does the man do before turning body to left? Baseline: pick up the man’s hand HCRN: breath Ground truth: breath
Figure 1: Example questions for which frame relations are key toward correct answers. (a) Near-term frame relations are required for counting fast actions. (b) Far-term frame relations connect the actions in a long transition. HCRN, with its ability to model hierarchical conditional relations, handles both cases successfully, while the baseline struggles. See more examples in the supplemental materials.

In this work we propose a general-purpose reusable neural unit called Conditional Relation Network (CRN) that encapsulates and transforms an array of objects into a new array conditioned on a contextual feature. The unit computes sparse high-order relations between the input objects, and then modulates the encoding through a specified context (see Fig. 2). The flexibility of CRN and its encapsulating design allow it to be replicated and layered to form deep hierarchical conditional relation networks (HCRN) in a straightforward manner. The stacked units thus provide contextualized refinement of relational knowledge from video objects: in a stage-wise manner, the network combines appearance features with clip activity flow and linguistic context, and then integrates context from the whole-video motion and linguistic features. The resulting HCRN is homogeneous, agreeing with the design philosophy of networks such as InceptionNet [31], ResNet [9] and FiLM [27].

The hierarchy of the CRNs is as follows: at the lowest level, the CRNs encode the relations between frame appearances in a clip and integrate the clip motion as context; this output is processed at the next stage by CRNs that integrate the linguistic context; in the following stage, the CRNs capture the relations between the clip encodings and integrate the video motion as context; in the final stage, the CRN integrates the video encoding with the linguistic feature as context (see Fig. 3). By allowing the CRNs to be stacked hierarchically, the model naturally supports modeling hierarchical structures in video and relational reasoning; by allowing appropriate context to be introduced in stages, the model handles multimodal fusion and multi-step reasoning. For long videos, further levels of hierarchy can be added, enabling the encoding of relations between distant frames.

We demonstrate the capability of HCRN in answering questions in major VideoQA datasets. The hierarchical architecture with four layers of CRN units achieves favorable answer accuracy across all VideoQA tasks. Notably, it performs consistently well on questions involving appearance, motion, state transition, temporal relations, or action repetition, demonstrating that the model can analyze and combine information in all of these channels. Furthermore, HCRN scales well to longer videos simply with the addition of an extra layer. Fig. 1 shows several representative cases that were difficult for the baseline of flat visual-question interaction but can be handled by our model.

Our model and results demonstrate the impact of building general-purpose neural reasoning units that support native multimodality interaction in improving robustness and generalization capacities of VideoQA models.

2 Related Work

Our proposed HCRN model advances the development of VideoQA by addressing two key challenges: (1) efficiently representing videos as an amalgam of complementing factors including appearance, motion and relations, and (2) effectively allowing the interaction of such visual features with the linguistic query.

Spatio-temporal video representation is traditionally done by variations of recurrent networks (RNNs), many of which were used for VideoQA, such as recurrent encoder-decoder [49, 47], bidirectional LSTM [15] and two-staged LSTM [44]. To increase the memorizing ability, external memory can be added to these networks [7, 44]. This technique is more useful for videos that are longer [40] and have more complex structures, such as movies [33] and TV programs [17] with extra accompanying channels such as speech or subtitles. In these cases, memory networks [15, 24, 35] were used to store multimodal features [36] for later retrieval. Memory-augmented RNNs can also compress video into heterogeneous sets [6] of dual appearance/motion features. While in RNNs appearance and motion are modeled separately, 3D and 2D/3D hybrid convolutional operators [34, 28] intrinsically integrate spatio-temporal visual information and are also used for VideoQA [10, 20]. Multiscale temporal structure can be modeled by either mixing short and long term convolutional filters [37] or combining pre-extracted frame features with non-local operators [32, 18]. Within the second approach, the TRN network [48] demonstrates the role of temporal frame relations as another important visual feature for video reasoning and VideoQA [16]. Relations of predetected objects were also considered in a separate processing stream [11] and combined with other modalities in late fusion [29]. Our HCRN model builds on these trends by allowing all three channels of video information, namely appearance, motion and relations, to iteratively interact and complement each other in every step of a hierarchical multi-scale framework.

Earlier attempts at generic multimodal fusion for visual reasoning include bilinear operators, either applied directly [12] or through attention [12, 43]. While these approaches treat the input tensors equally in a costly joint multiplicative operation, HCRN separates conditioning factors from refined information; hence it is more efficient and also more flexible in adapting operators to conditioning types.

Temporal hierarchy has been explored for video analysis [22], most recently with recurrent networks [25, 1] and graph networks [23]. However, we believe we are the first to consider hierarchical interaction of multi-modalities including linguistic cues for VideoQA.

Linguistic query–visual feature interaction in VideoQA has traditionally been formulated as a visual information retrieval task in a common representation space of the independently transformed question and referred video [44]. The retrieval is more convenient with heterogeneous memory slots [6]. On top of information retrieval, co-attention between the two modalities provides a more interactive combination [10]. Developments along this direction include attribute-based attention [42], hierarchical attention [21, 45, 46], multi-head attention [14, 19], multi-step progressive attention memory [13] and combining self-attention with co-attention [20]. For higher-order reasoning, the question can interact iteratively with video features via episodic memory or through a switching mechanism [41]. Multi-step reasoning for VideoQA is also approached by [39] and [30] with refined attention.

Unlike these techniques, our HCRN model supports conditioning video features on linguistic cues as a context factor in every stage of the multi-level refinement process. This allows linguistic cues to be involved earlier and more deeply in the construction of the video representation than in any available method.

Neural building blocks - Beyond the VideoQA domain, the CRN unit shares the ideal of uniformity in neural architecture with other general-purpose neural building blocks such as the block in InceptionNet [31], the Residual Block in ResNet [9], the Recurrent Block in RNN, the conditional linear layer in FiLM [27], and the matrix-matrix block in neural matrix net [5]. Our CRN departs significantly from these designs by assuming an array-to-array block that supports conditional relational reasoning and can be reused to build networks for other purposes in vision and language processing.

3 Method

Figure 2: Conditional Relation Network. a) The input array $\mathcal{S}$ of $n$ objects is first processed to model $k$-tuple relations from $t$ sub-sampled size-$k$ subsets by sub-network $g^k(\cdot)$. The outputs are further conditioned with the context $c$ via sub-network $h^k(\cdot,\cdot)$ and finally aggregated by $p^k(\cdot)$ to obtain a result vector $r^k$ which represents $k$-tuple conditional relations. Tuple sizes can range from $2$ to $n-1$, which yields an $(n-2)$-dimensional output array.

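The goal of VideoQA is to deduce an answer $\tilde{a}$ from a video $\mathcal{V}$ in response to a natural question $q$. The answer $\tilde{a}$ can be found in an answer space $\mathcal{A}$, which is a pre-defined set of possible answers for open-ended questions or a list of answer candidates in the case of multi-choice questions. Formally, VideoQA can be formulated as follows:

$$\tilde{a} = \underset{a \in \mathcal{A}}{\operatorname{argmax}}\; \mathcal{F}_{\theta}(a \mid q, \mathcal{V}), \tag{1}$$

where $\theta$ denotes the model parameters of the scoring function $\mathcal{F}$.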

Visual representation We begin by dividing the video $\mathcal{V}$ of $L$ frames into $N$ equal-length clips $C = (C_1, \dots, C_N)$. Each clip $C_i$ of length $T = \lfloor L/N \rfloor$ is represented by two sources of information: frame-wise appearance feature vectors $V_i = \{v_{i,j}\}_{j=1}^{T}$, which are the pool5 output of ResNet [9] features in later experiments, and the motion feature vector $f_i$ at clip level derived by ResNeXt-101 [38, 8].

Subsequently, linear feature transformations are applied to project both appearance features and motion features into a $d$-dimensional feature space.

Linguistic representation All words in the question, and in the answer candidates in the case of multi-choice questions, are first embedded into vectors of 300 dimensions, which are initialized with pre-trained GloVe word embeddings [26]. We further pass these context-independent embedding vectors through a biLSTM. The output hidden states of the forward and backward LSTM passes are finally concatenated to form the question representation $q$.

With these representations, we now describe our new hierarchical architecture for VideoQA (see Fig. 3). We first present the core compositional computation unit that serves as the building block for the architecture in Section 3.1. In Section 3.2, we propose to design the scoring function as a layer-by-layer network architecture that can be built by simply stacking the core units in a particular manner.

Figure 3: Hierarchical Conditional Relation Networks (HCRN) Architecture for VideoQA. The CRNs are stacked in a hierarchy, embedding the video input at different granularities including frame, short clip and entire video levels. The video feature embedding is conditioned on the linguistic cue at each level of granularity. The visual-question joint representation is fed into an output classifier for prediction.

3.1 Conditional Relation Network Unit

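Notation Role
$\mathcal{S}$ Input array of $n$ objects (e.g. frames, clips)
$c$ Conditioning feature (e.g. query, motion feat.)
$k_{max}$ Maximum subset (also tuple) size considered
$k$ Each subset size from $2$ to $k_{max}$
$Q^k$ Set of all size-$k$ subsets of $\mathcal{S}$
$t$ Number of subsets randomly selected from $Q^k$
$Q^k_{selected}$ Set of $t$ selected subsets from $Q^k$
$g^k(\cdot)$ Sub-network processing each size-$k$ subset
$h^k(\cdot,\cdot)$ Conditioning sub-network
$p^k(\cdot)$ Aggregating sub-network
$R$ Result array of CRN unit on $\mathcal{S}$ given $c$
$r^k$ Member result vector of $k$-tuple relations
Table 1: Notations of CRN unit operations

Input: Array $\mathcal{S} = \{s_i\}_{i=1}^{n}$, conditioning feature $c$
Output: Array $R$
Metaparams: $k_{max}$, $t$
Build all sets of subsets $\{Q^k \mid k = 2, 3, \dots, k_{max}\}$, where $Q^k$ is the set of all size-$k$ subsets of $\mathcal{S}$
Initialize $R \leftarrow \{\}$
for $k \leftarrow 2$ to $k_{max}$ do
    $Q^k_{selected} \leftarrow$ randomly select $t$ subsets from $Q^k$
    for each subset $q_i \in Q^k_{selected}$ do
        $g_i \leftarrow g^k(q_i)$
        $h_i \leftarrow h^k(g_i, c)$
    end for
    $r^k \leftarrow p^k(\{h_i\})$
    add $r^k$ to $R$
end for
Algorithm 1 CRN Unit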

We introduce a reusable computation unit, termed Conditional Relation Network (CRN), which takes as input an array of $n$ objects $\mathcal{S} = \{s_i\}_{i=1}^{n}$ and a conditioning feature $c$, and computes an output as an array of objects of the same dimensions (see Fig. 2). We note that the objects and the conditioning feature are in the same vector space or tensor space, for instance, $\mathbb{R}^{d}$ or $\mathbb{R}^{W \times H \times d}$. Our CRN focuses on modeling high-order object relations given a global context of input features. For the sake of clarity and ease of implementation, we list the notations used in our CRN unit in Table 1 and describe how the unit works algorithmically (see Alg. 1).

In later uses of CRN units in the context of VideoQA, the input array is composed of features at either frame or short-clip levels. The objects share a great deal of mutual information, and it is redundant to consider all possible combinations of the given objects. Therefore, applying a sampling scheme is crucial for redundancy reduction and computational efficiency. We borrow the sampling trick in [48] to build sets of $t$ selected subsets $Q^k_{selected}$. Regarding the choice of $k_{max}$, we choose $k_{max} = n - 1$ in later experiments, resulting in an output array of size $n - 2$ if $n > 2$ and an array of size $1$ if $n = 2$.

As a choice in implementation, the sub-networks $g^k(\cdot)$ and $p^k(\cdot)$ are simple average-pooling; $h^k(\cdot,\cdot)$ is a nonlinear transformation on top of feature concatenation. We tie the parameters of the conditioning sub-network $h^k(\cdot,\cdot)$ across the subsets of the same size $k$. For generic operators, $g^k(\cdot)$ and $p^k(\cdot)$ can be simple linear sub-networks. On the other hand, $h^k(\cdot,\cdot)$ is expected to model the non-linear relationships between multiple input modalities. If we define $[x, y]$ as the concatenation of two tensors $x$ and $y$, a simple design of $h^k(x, y)$ for fusing appearance and motion features is:

$$h^{k}(x, y) = \mathrm{ELU}\left(W_{h_1}[x, y]\right), \tag{2}$$

where ELU is the exponential linear unit [3] and $W_{h_1}$ is a network parameter.
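The CRN unit described above can be illustrated with a minimal pure-Python sketch. It assumes average-pooling for the subset and aggregation sub-networks and an ELU-over-concatenation conditioning function with one weight matrix per subset size; the function names, the random weights and the toy dimensions are hypothetical and are not taken from the authors' implementation.

```python
import itertools
import math
import random


def elu(x):
    """Exponential linear unit applied to a scalar."""
    return x if x > 0 else math.exp(x) - 1.0


def mean_pool(vectors):
    """Average a collection of equal-length vectors element-wise."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def condition(x, c, weight):
    """Conditioning step: ELU(W [x, c]), with W given as a list of rows."""
    xc = x + c  # list concatenation plays the role of [x, c]
    return [elu(sum(w * v for w, v in zip(row, xc))) for row in weight]


def crn_unit(objects, context, k_max, t, weights, rng):
    """Return one conditioned relation vector per tuple size k = 2..k_max."""
    results = []
    for k in range(2, k_max + 1):
        # All size-k subsets, keeping the original (temporal) order.
        all_subsets = list(itertools.combinations(objects, k))
        selected = rng.sample(all_subsets, min(t, len(all_subsets)))
        # Pool each subset, then condition it on the context; the
        # weight matrix is shared by all subsets of the same size k.
        conditioned = [condition(mean_pool(s), context, weights[k])
                       for s in selected]
        # Aggregate the t conditioned vectors into one result vector.
        results.append(mean_pool(conditioned))
    return results


rng = random.Random(0)
dim, n = 4, 6  # toy feature size and array length (assumptions)
objects = [[rng.uniform(-1, 1) for _ in range(dim)] for _ in range(n)]
context = [rng.uniform(-1, 1) for _ in range(dim)]
k_max = n - 1
weights = {k: [[rng.uniform(-0.5, 0.5) for _ in range(2 * dim)]
               for _ in range(dim)]
           for k in range(2, k_max + 1)}
output = crn_unit(objects, context, k_max, t=2, weights=weights, rng=rng)
print(len(output), len(output[0]))
```

With an input array of length 6 and the largest tuple size set to 5, the sketch produces 4 output vectors, one per tuple size, each with the same feature size as the inputs.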

Because the relation formed by a particular subset may be unnecessary for modeling $k$-tuple relations, we optionally design a self-gating mechanism similar to [4] to regulate the feature flow through each CRN module. Formally, the conditioning function in that case is given by:

$$h^{k}(x, y) = \mathrm{ELU}\left(W_{h_1}[x, y]\right) * \sigma\left(W_{h_2}[x, y]\right), \tag{3}$$

where $\sigma$ is the sigmoid function, and $W_{h_1}$ and $W_{h_2}$ are parameters.
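As a small illustration of the self-gating idea, the sketch below multiplies an ELU-transformed feature element-wise by a sigmoid gate, both computed from the concatenation of an input and a conditioning vector. The weight matrices `w_feat` and `w_gate` and the toy values are hypothetical, chosen only so that the gate's effect is easy to see.

```python
import math


def elu(x):
    """Exponential linear unit applied to a scalar."""
    return x if x > 0 else math.exp(x) - 1.0


def sigmoid(x):
    """Logistic sigmoid applied to a scalar."""
    return 1.0 / (1.0 + math.exp(-x))


def matvec(weight, vec):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * v for w, v in zip(row, vec)) for row in weight]


def gated_condition(x, c, w_feat, w_gate):
    """Self-gated conditioning: ELU(W1 [x, c]) * sigmoid(W2 [x, c])."""
    xc = x + c  # concatenation of input and conditioning feature
    feat = [elu(v) for v in matvec(w_feat, xc)]
    gate = [sigmoid(v) for v in matvec(w_gate, xc)]
    return [f * g for f, g in zip(feat, gate)]


x = [0.5, -1.0]
c = [1.0, 2.0]
# The feature branch copies x; the gate branch looks only at c.
w_feat = [[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0]]
w_gate = [[0.0, 0.0, 0.0, 0.0], [0.0, 0.0, 10.0, 10.0]]
out = gated_condition(x, c, w_feat, w_gate)
print(out)
```

In this toy setting the first gate is exactly one half, so the first feature is halved, while the second gate saturates near one and lets the second feature pass almost unchanged.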

3.2 Hierarchical Conditional Relation Networks

We use CRN blocks to build a deep network architecture that exploits inherent characteristics of a video sequence, namely temporal relations, motion, and the hierarchy of video structure, and supports reasoning guided by linguistic questions. We term the proposed network architecture Hierarchical Conditional Relation Networks (HCRN) (see Fig. 3). The design of the HCRN by stacking reusable core units is partly inspired by modern CNN architectures, of which InceptionNet [31] and ResNet [9] are the most well-known examples.

A model for VideoQA should distill the visual content in the context of the question, given that much of the visual information is usually not relevant to the question. Drawing inspiration from the hierarchy of video structure, we reduce the problem of VideoQA to a process of video representation in which a given video is encoded progressively at different granularities, including short clip (consecutive frames) and entire video levels. It is crucial that the whole process is conditioned on the linguistic cue. In particular, at each hierarchy level, we use two stacked CRN units, one conditioned on motion features followed by one conditioned on linguistic cues. Intuitively, the motion feature serves as a dynamic context shaping the temporal relations found among frames (at the clip level) or clips (at the video level). As the shaping effect is applied to all relations, self-gating is not needed, and thus the simple fusion in Eq. (2) suffices. On the other hand, the linguistic cues are by nature selective, that is, not all relations are equally relevant to the question. Thus we utilize the self-gating mechanism in Eq. (3) for the CRN units which condition on the question representation.

With this particular design of the network architecture, the input array at clip level consists of frame-wise appearance feature vectors, while that at video level is the output of the clip level. Recall from the earlier description that the motion feature vector at clip level comes from ResNeXt-101. These motion features then serve as inputs to an LSTM, whose final state is used as the video-level motion feature. This particular implementation is not the only option. We believe we are the first to progressively incorporate multiple modalities of input in such a hierarchical manner, in contrast to the typical approach of treating appearance features and motion features as a two-stream network.

To handle a long video of a thousand frames, which is equivalent to dozens of short-term clips, there are two options to reduce the computational cost of CRN in handling a large powerset of the input array $\mathcal{S}$: limit the maximum size of the query subsets, or extend the HCRN to a deeper hierarchy. The former option relies on sparse sampling, which may lose critical relation information of specific subsets. The latter is able to densely sample subsets for relation modeling. Specifically, we can group $N$ short-term clips into $N_1 \times N_2$ hyper-clips, where $N_1$ is the number of hyper-clips and $N_2$ is the number of short-term clips in one hyper-clip. By doing this, our HCRN becomes a 3-level hierarchical network architecture.
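The grouping of short-term clips into hyper-clips amounts to chunking a list of consecutive clips. The sketch below is only an illustration under that reading; the helper name `group_into_hyper_clips` and the string placeholders standing in for clip features are hypothetical.

```python
def group_into_hyper_clips(clips, clips_per_group):
    """Split consecutive clips into equally sized hyper-clips."""
    if len(clips) % clips_per_group != 0:
        raise ValueError("number of clips must be a multiple of the group size")
    return [clips[i:i + clips_per_group]
            for i in range(0, len(clips), clips_per_group)]


# 24 clips grouped into 4 hyper-clips of 6 consecutive clips each.
clips = [f"clip_{i}" for i in range(24)]
hyper_clips = group_into_hyper_clips(clips, clips_per_group=6)
print(len(hyper_clips), len(hyper_clips[0]))
```

Each hyper-clip would then be processed by its own stack of CRN units, and the resulting encodings would form the input array of the video-level CRNs.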

At the end of the HCRN, we compute the average visual feature conditioned on the question representation $q$. Assuming the output of the last CRN unit at video level is an array of objects, we first stack them together into an output tensor, and then vectorize this tensor to obtain the final output. The weighted average information is given by:


where $[\cdot,\cdot]$ denotes the concatenation operation and $\odot$ is the Hadamard product.

3.3 Answer Decoders and Loss Functions

Following [10, 30, 6], we adopt different answer decoders depending on the task. Open-ended questions are treated as multi-label classification problems. For these, we employ a classifier which takes as input the combination of the information retrieved from the visual cue and the question representation $q$, and computes the label probabilities:


The cross-entropy is used as the loss function.

For the repetition count task, we use a linear regression function followed by a rounding function to obtain integer count results:


The loss for this task is Mean Squared Error (MSE).
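A minimal sketch of the counting decoder's last step and its loss is given below. It assumes that the MSE is computed on the raw regression output (rounding is not differentiable) and that rounding means round-half-up; both are assumptions for illustration, as are the function names and toy numbers.

```python
import math


def predict_count(raw_score):
    """Round a real-valued regression output to the nearest integer (half up)."""
    return math.floor(raw_score + 0.5)


def mean_squared_error(predictions, targets):
    """Average squared difference between predictions and targets."""
    return sum((p - t) ** 2 for p, t in zip(predictions, targets)) / len(targets)


raw_scores = [2.4, 5.7, 3.5]  # hypothetical regression outputs
targets = [2, 6, 5]           # hypothetical ground-truth counts
counts = [predict_count(s) for s in raw_scores]
mse = mean_squared_error(raw_scores, targets)
print(counts, mse)
```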

For multi-choice question types (such as repeating action and state transition in TGIF-QA), each answer candidate is processed in the same way as the question. In detail, we use shared-parameter HCRNs with either the question or each answer candidate as the language cue. As a result, we have a set of HCRN outputs, one conditioned on the question and the others conditioned on the answer candidates. Subsequently, these outputs, the question representation and the answer candidates are fed into a final classifier with a linear regression to output an answer index, as follows:


We use the popular hinge loss [10] of pairwise comparisons, $\max(0, 1 + s^{n} - s^{p})$, between the scores for incorrect answers $s^{n}$ and the correct answer $s^{p}$ to train the network.
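The pairwise hinge loss over answer candidates can be sketched as follows. The margin of 1 and the summation over all incorrect candidates are assumptions made for this illustration, and the scores are made-up numbers.

```python
def pairwise_hinge_loss(scores, correct_index, margin=1.0):
    """Sum of max(0, margin + s_neg - s_pos) over all incorrect candidates."""
    s_pos = scores[correct_index]
    return sum(max(0.0, margin + s_neg - s_pos)
               for i, s_neg in enumerate(scores)
               if i != correct_index)


# Five answer candidates, the second one being correct (hypothetical scores).
scores = [0.2, 1.5, -0.3, 1.0, 0.9]
loss = pairwise_hinge_loss(scores, correct_index=1)
print(loss)
```

Only the candidates whose score comes within the margin of the correct answer's score contribute to the loss; here those are the last two.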

3.4 Complexity Analysis

We provide a brief analysis here, leaving detailed derivations to the Supplement. For a fixed sampling resolution $t$, a single forward pass of CRN would take quadratic time in $k_{max}$. For an input array of length $n$ and feature size $F$, the unit produces an output array of size $k_{max} - 1$ of the same feature dimensions. The overall complexity of HCRN depends on the design choice for each CRN unit and the specific arrangement of CRN units. For clarity, let $t = 2$ and $k_{max} = n - 1$, which are found to work well in later experiments. Suppose there are $N$ clips of length $T$, making a video of length $L = NT$. A 2-level architecture as in Fig. 3 needs $2TLF$ time to compute the CRNs at the lowest level, and $2NLF$ time to compute the second level, totalling $2(T+N)LF$ time.

Let us now analyze a 3-level architecture that generalizes the one in Fig. 3. The $N$ clips are organized into $M$ sub-videos, each of which has $Q$ clips, i.e., $N = MQ$. The clip-level CRNs remain the same. At the next level, each sub-video CRN takes as input an array of length $Q$, whose elements have size $(T-4)F$. Using the same logic as before, the set of sub-video-level CRNs costs $2\frac{N}{M}LF$ time. A stack of two sub-video CRNs now produces an output array of size $(Q-4)(T-4)F$, serving as an input object in an array of length $M$ for the video-level CRNs. Thus the video-level CRNs cost $2MLF$, and the total cost for the 3-level HCRN is in the order of $2\left(T + \frac{N}{M} + M\right)LF$.

Compared to the 2-level HCRN, the 3-level HCRN reduces computation time by $2\left(N - \frac{N}{M} - M\right)LF \approx 2NLF$, assuming $N \gg \max\left\{M, \frac{N}{M}\right\}$. As $N = \frac{L}{T}$, this reduces to $2NLF = \frac{2L^{2}F}{T}$. In practice $T$ is often fixed, thus the saving scales quadratically with video length $L$, suggesting that hierarchy is computationally efficient for long videos.
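To make the comparison concrete, the sketch below evaluates leading-order cost expressions for a few video lengths. It assumes the 2-level cost is proportional to (T + N)·L·F and the 3-level cost to (T + N/M + M)·L·F, with a factor of 2 for the sampling resolution; these expressions, the function names and the example values are assumptions for illustration, and the numbers are operation counts up to a constant, not run-time predictions.

```python
def two_level_cost(num_clips, clip_len, feat_dim, t=2):
    """Leading-order cost of a 2-level hierarchy: t * (T + N) * L * F."""
    video_len = num_clips * clip_len
    return t * (clip_len + num_clips) * video_len * feat_dim


def three_level_cost(num_clips, clip_len, num_sub_videos, feat_dim, t=2):
    """Leading-order cost of a 3-level hierarchy: t * (T + N/M + M) * L * F."""
    video_len = num_clips * clip_len
    clips_per_sub = num_clips / num_sub_videos
    return t * (clip_len + clips_per_sub + num_sub_videos) * video_len * feat_dim


# Fixed clip length, growing number of clips (hypothetical values).
for num_clips in (24, 96, 384):
    flat = two_level_cost(num_clips, clip_len=16, feat_dim=512)
    deep = three_level_cost(num_clips, clip_len=16, num_sub_videos=4,
                            feat_dim=512)
    print(num_clips, flat - deep)
```

With the clip length held fixed, the printed saving grows roughly quadratically with the number of clips, in line with the argument above.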

4 Experiments

4.1 Datasets

TGIF-QA [10] This is currently the most prominent dataset for VideoQA, containing 165K QA pairs and 72K animated GIFs. The dataset covers four tasks addressing unique properties of video. Of these, the first three require strong spatio-temporal reasoning abilities: Repetition Count - an open-ended task to retrieve the number of occurrences of an action, Repeating Action - a multi-choice task to identify the action repeated a given number of times, State Transition - a multi-choice task regarding the temporal order of events. The last task - Frame QA - is akin to image QA, where a particular frame in a video is sufficient to answer the question.

MSVD-QA [39] This is a small dataset of 50,505 question answer pairs annotated from 1,970 short clips. Questions are of five types, including what, who, how, when and where.

MSRVTT-QA [40] The dataset contains 10K videos and 243K question answer pairs. Similar to MSVD-QA, questions are of five types. Compared to the other two datasets, videos in MSRVTT-QA contain more complex scenes. They are also much longer, ranging from 10 to 30 seconds long, equivalent to 300 to 900 frames per video.

We use accuracy as the evaluation metric for all experiments, except those for repetition count on the TGIF-QA dataset, where Mean Square Error (MSE) is applied.

4.2 Implementation Details

Videos are segmented into 8 clips, each containing 16 frames by default. Long videos in MSRVTT-QA are additionally segmented into 24 clips for evaluating the ability to handle very long sequences. Unless otherwise stated, the default setting is a 2-level HCRN as depicted in Fig. 3. The learning rate is halved after every 10 epochs. All experiments are terminated after 25 epochs and the results are reported at the epoch giving the best validation accuracy.
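The step decay used for training can be sketched as below. The initial learning rate is not recoverable from the text here, so `initial_lr=1.0` is a placeholder that makes the schedule read as a multiplier; zero-based epoch indexing is also an assumption.

```python
def learning_rate(epoch, initial_lr, decay_every=10, factor=0.5):
    """Step decay: multiply the rate by `factor` after every `decay_every` epochs."""
    return initial_lr * factor ** (epoch // decay_every)


# Multipliers over the 25 training epochs (initial_lr=1.0 is a placeholder).
schedule = [learning_rate(epoch, initial_lr=1.0) for epoch in range(25)]
print(schedule[0], schedule[10], schedule[24])
```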

4.3 Results

4.3.1 Benchmarking against SoTAs

We compare our proposed model with state-of-the-art methods (SoTAs) on aforementioned datasets. For TGIF-QA, we compare with most recent SoTAs, including [6, 7, 10, 20], over four tasks. These works, except for [20], make use of motion features extracted from optical flow or 3D CNNs.

Model Action Trans. Frame Count
ST-TP [10] 62.9 69.4 49.5 4.32
Co-mem [7] 68.2 74.3 51.5 4.10
PSAC [20] 70.4 76.9 55.7 4.27
HME [6] 73.9 77.8 53.8 4.02
HCRN 75.0 81.4 55.9 3.82
Table 2: Comparison with the state-of-the-art methods on TGIF-QA dataset. For count, the lower the better.
Figure 4: Performance comparison on MSVD-QA and MSRVTT-QA dataset with state-of-the-art methods: Co-mem [7], HME [6], HRA [2], and AMU [39].

The results are summarized in Table 2 for TGIF-QA, and in Fig. 4 for MSVD-QA and MSRVTT-QA. Reported numbers of the competitors are taken from the original papers and [6]. It is clear that our model consistently outperforms or is competitive with SoTA models on all tasks across all datasets. The improvements are particularly noticeable when strong temporal reasoning is required, i.e., for the questions involving actions and transitions in TGIF-QA. These results confirm the significance of considering both near-term and far-term temporal relations toward finding correct answers.

The MSVD-QA and MSRVTT-QA datasets represent highly challenging benchmarks for machines compared to TGIF-QA, owing to their open-ended nature. Our model HCRN outperforms existing methods on both datasets, achieving 36.1% and 35.6% accuracy, which are improvements of 1.7 points and 0.6 points on MSVD-QA and MSRVTT-QA, respectively. This suggests that the model can handle both small and large datasets better than existing methods.

Finally, we provide a justification for the competitive performance of our HCRN against existing rivals by comparing model features in Table 3. Whilst it is not straightforward to compare internal model designs head-to-head, it is evident that effective video modeling necessitates handling motion, temporal relations and hierarchy at the same time. We will back this hypothesis with further detailed studies in Section 4.3.2 (for motion, temporal relations, shallow hierarchy) and Section 4.3.3 (deep hierarchy).

Model Appear. Motion Hiera. Relation
ST-TP [10]
Co-mem [7]
PSAC [20]
HME [6]
Table 3: Model design choices and input modalities in comparison. See Table 2 for corresponding performance on TGIF-QA dataset.

4.3.2 Ablation Studies

Model Act. Trans. F.QA Count
65.2 75.5 54.9 3.97
66.2 76.2 55.7 3.95
65.4 76.7 56.0 3.91
65.6 75.6 56.3 3.92
65.4 75.1 56.3 3.91
67.2 76.6 56.7 3.94
66.3 76.7 56.5 3.92
64.0 75.9 56.2 3.87
66.3 75.6 55.8 4.00
73.3 81.7 56.1 3.89
72.5 81.1 56.6 3.82
75.0 81.4 55.9 3.82
75.1 81.5 55.5 3.91
73.6 82.0 54.7 3.84
75.4 81.4 55.6 3.86
74.1 81.9 54.7 3.87
1-level, vid CRN only 66.2 78.4 56.6 3.94
1.5-level, clips → pool 70.4 80.5 56.6 3.94
Motion conditioning
w/o motion 70.8 79.8 56.4 4.38
w/o short-term motion 74.9 82.1 56.5 4.03
w/o long-term motion 75.1 81.3 56.7 3.92
Linguistic conditioning
w/o linguistic cond. 66.5 75.7 56.2 3.97
w/o quest.@clip level 74.3 81.1 55.8 3.95
w/o quest.@video level 74.0 80.5 55.9 3.92
w/o gate 74.1 82.0 55.8 3.93
w/ gate quest.+motion 73.3 80.9 55.3 3.90
Full 2-level HCRN 75.1 81.2 55.7 3.88
Table 4: Ablation studies on TGIF-QA dataset. For count, the lower the better. Act.: Action; Trans.: Transition; F.QA: Frame QA. When not explicitly specified, the default relation order and sampling resolution are used.

To provide more insight into our model, we conduct extensive ablation studies on TGIF-QA with a wide range of configurations. The results are reported in Table 4. Full 2-level HCRN denotes the full model of Fig. 3. Overall, we find that ablating any of the design components or CRN units degrades the performance on temporal reasoning tasks (actions, transition and action counting). The effects are detailed as follows.

Effect of relation order and resolution Without relations ($k_{max} = 1$) the performance drops significantly on actions and events reasoning. This is expected since those questions often require putting actions and events in relation with a larger context (e.g., what happens before something else). In this case, frame QA benefits more from increasing the sampling resolution $t$ because of a better chance to find a relevant frame. However, when taking relations into account ($k_{max} > 1$), we find that HCRN is robust against the sampling resolution $t$ but depends critically on the maximum relation order $k_{max}$. The relative independence w.r.t. $t$ can be due to visual redundancy between frames, so that resampling may capture almost the same information. On the other hand, when considering only low-order object relations, the performance drops significantly in all tasks except frame QA. These results confirm that high-order relations are required for temporal reasoning. As the frame QA task requires only reasoning on a single frame, incorporating temporal information might confuse the model.

Effect of hierarchy We design two simpler models with only one CRN layer: 1-level, CRN vid on key frames only: Using only one CRN at the video level whose input array consists of key frames of the clips. Note that video-level motion features are still maintained. 1.5-level, clip CRNs pooling: Only the clip-level CRNs are used, and their outputs are mean-pooled to represent the video. The pooling operation represents a simplistic relational operation across clips. The results confirm that a hierarchy is needed for high performance on temporal reasoning tasks.

Effect of motion conditioning We evaluate the following settings: w/o short-term motion: Remove all CRN units that condition on the short-term motion features (clip level) in the HCRN. w/o long-term motion: Remove the CRN unit that conditions on the long-term motion features (video level) in the HCRN. w/o motion: Remove motion features from being used by HCRN. We find that motion, in agreement with prior art, is critical for detecting actions and hence for computing action counts. Long-term motion is particularly significant for the counting task, as this task requires maintaining global temporal context during the entire process. For other tasks, short-term motion is usually sufficient. For example, in the action task, wherein one action is repeatedly performed during the entire video, long-term context contributes little. Not surprisingly, motion does not play a positive role in answering questions on single frames, as only appearance information is needed.

Effect of linguistic conditioning and gating Linguistic cues represent a crucial context for selecting relevant visual artifacts. We therefore test the following ablations: w/o question@clip level: Remove the CRN unit that conditions on the question representation at clip level. w/o question@video level: Remove the CRN unit that conditions on the question representation at video level. w/o linguistic cond.: Exclude all CRN units conditioning on the linguistic cue, while the linguistic cue is still used in the answer decoder. Likewise, gating offers a selection mechanism. Thus we study its effect as follows: w/o gate: Turn off the self-gating mechanism in all CRN units. w/ gate quest.+motion: Turn on the self-gating mechanism in all CRN units.

We find that the conditioning question provides an important context for encoding video. Conditioning features (motion and language), through the gating mechanism in Eq. (3), offer further performance gains in action and counting tasks, possibly by selectively passing question-relevant information up the inference chain.

4.3.3 Deepening model hierarchy

Depth of hierarchy Overall Acc.
2-level, 24 clips → 1 vid 35.6
3-level, 24 clips → 4 sub-vids → 1 vid 35.6
Table 5: Results for going to a deeper hierarchy on MSRVTT-QA dataset. Run time is reduced by a factor of 4 when going from the 2-level to the 3-level hierarchy.

We test the scalability of the HCRN on long videos in the MSRVTT-QA dataset, which are organized into 24 clips (3 times longer than in the other two datasets). We consider two settings: 2-level hierarchy, 24 clips → 1 vid: The model is as illustrated in Fig. 3, where 24 clip-level CRNs are followed by a video-level CRN. 3-level hierarchy, 24 clips → 4 sub-vids → 1 vid: Starting from the 24 clips as in the 2-level hierarchy, we group the 24 clips into 4 sub-videos, each being a group of 6 consecutive clips, resulting in a 3-level hierarchy. These two models are designed to have a similar number of parameters, approx. 50M.

The results are reported in Table 5. Unlike existing methods, which usually struggle with long videos, our method scales to them by offering a deeper hierarchy, as analysed theoretically in Section 3.4. A deeper hierarchy is expected to significantly reduce the training and inference time of HCRN, especially when the video is long. In our experiments, we achieve a 4-fold reduction in training and inference time by going from the 2-level HCRN to its 3-level counterpart while maintaining the same performance.
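A rough counting argument illustrates why smaller input arrays are cheaper to relate. This is not the analysis of Section 3.4: the sketch assumes a unit would consider every subset of size 2 to n-1 of its n inputs, an upper bound that subset sampling avoids in practice, so the figures say nothing about measured run time. The function name `num_candidate_subsets` is hypothetical.

```python
from math import comb

def num_candidate_subsets(n):
    """Count subsets of size k = 2 .. n-1 drawn from n input objects."""
    return sum(comb(n, k) for k in range(2, n))

# 2-level: a single video-level unit relates all 24 clip outputs.
two_level_top = num_candidate_subsets(24)

# 3-level: four sub-video units relate 6 clip outputs each, then one
# video-level unit relates the 4 sub-video outputs.
three_level_top = 4 * num_candidate_subsets(6) + num_candidate_subsets(4)

print("2-level candidate subsets above clip level:", two_level_top)
print("3-level candidate subsets above clip level:", three_level_top)
```

Under this enumeration assumption the space of candidate relations shrinks sharply when 24 inputs are split into groups of 6, which is consistent in direction, though not in magnitude, with the 4-fold run-time reduction reported above.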

5 Discussion

We introduced a general-purpose neural unit called Conditional Relation Network (CRN) and a method to construct hierarchical networks for VideoQA using CRNs as building blocks. A CRN is a relational transformer that encapsulates and maps an array of tensorial objects into a new array of the same kind, conditioned on a contextual feature. In the process, high-order relations among input objects are encoded and modulated by the conditioning feature. This design allows flexible construction of sophisticated structures such as stacks and hierarchies, and supports iterative reasoning, making it suitable for QA over multimodal and structured domains like video. The HCRN was evaluated on multiple VideoQA datasets (TGIF-QA, MSVD-QA, MSRVTT-QA), demonstrating competitive reasoning capability.
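The input-output contract described here can be sketched in a few lines. The version below is a simplification under stated assumptions: subsets are sampled at random for each size k, both aggregation steps are mean pooling, and the conditioning step is an element-wise product standing in for the learned, gated sub-network. The names `crn`, `mean_pool`, `k_max` and `t` are illustrative choices for this sketch.

```python
import random

def mean_pool(vectors):
    """Element-wise mean of equally sized vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]

def crn(objects, cond, k_max=None, t=2, seed=0):
    """Map n input vectors to k_max - 1 output vectors, conditioned on cond."""
    rng = random.Random(seed)
    n = len(objects)
    k_max = n - 1 if k_max is None else k_max
    if not 2 <= k_max <= n:
        raise ValueError("k_max must satisfy 2 <= k_max <= len(objects)")
    outputs = []
    for k in range(2, k_max + 1):
        fused = []
        for _ in range(t):
            subset = rng.sample(objects, k)   # one k-ary relation
            pooled = mean_pool(subset)        # summarise the subset
            # Stand-in for the learned conditioning function (assumption).
            fused.append([p * c for p, c in zip(pooled, cond)])
        outputs.append(mean_pool(fused))      # one output object per k
    return outputs

# Toy example: 8 input objects of dimension 4 and one conditioning vector.
data_rng = random.Random(1)
objects = [[data_rng.uniform(-1, 1) for _ in range(4)] for _ in range(8)]
question = [1.0, 0.5, 0.0, 2.0]

level_1 = crn(objects, question)     # 8 inputs -> 6 outputs (k = 2..7)
level_2 = crn(level_1, question)     # 6 inputs -> 4 outputs (k = 2..5)
print(len(level_1), len(level_2))
```

Since the output is again an array of vectors with the same dimensionality, units can be stacked or arranged in a hierarchy without adapters, which is the property the construction of HCRN relies on.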

Unlike temporal attention-based approaches, which put effort into selecting objects, HCRN concentrates on modeling relations and hierarchy in video. This difference in methodology and design choices leads to distinctive benefits. CRN units can be further augmented with attention mechanisms to improve object selection, so that related tasks such as frame QA can be further improved.

The examination of CRN in VideoQA highlights the importance of building a generic neural reasoning unit that supports native multimodal interaction for improving the robustness of visual reasoning. We wish to emphasize that the unit is general-purpose and hence applicable to other reasoning tasks, which we will explore. These include an extension to the accompanying linguistic channels, which are crucial for the TVQA [17] and MovieQA [33] tasks.


  1. L. Baraldi, C. Grana and R. Cucchiara (2017) Hierarchical boundary-aware neural encoder for video captioning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1657–1666. Cited by: §2.
  2. M. I. H. Chowdhury, K. Nguyen, S. Sridharan and C. Fookes (2018) Hierarchical relational attention for video question answering. In 2018 25th IEEE International Conference on Image Processing (ICIP), pp. 599–603. Cited by: Figure 4.
  3. D. Clevert, T. Unterthiner and S. Hochreiter (2015) Fast and Accurate Deep Network Learning by Exponential Linear Units (ELUs). arXiv preprint arXiv:1511.07289. Cited by: §3.1.
  4. Y. N. Dauphin, A. Fan, M. Auli and D. Grangier (2017) Language modeling with gated convolutional networks. In Proceedings of the 34th International Conference on Machine Learning-Volume 70, pp. 933–941. Cited by: §3.1.
  5. K. Do, T. Tran and S. Venkatesh (2018) Learning deep matrix representations. arXiv preprint arXiv:1703.01454. Cited by: §2.
  6. C. Fan, X. Zhang, S. Zhang, W. Wang, C. Zhang and H. Huang (2019) Heterogeneous memory enhanced multimodal attention model for video question answering. In CVPR, pp. 1999–2007. Cited by: §1, §2, §2, §3.3, Figure 4, §4.3.1, §4.3.1, Table 2, Table 3.
  7. J. Gao, R. Ge, K. Chen and R. Nevatia (2018) Motion-appearance co-memory networks for video question answering. CVPR. Cited by: §1, §2, Figure 4, §4.3.1, Table 2, Table 3.
  8. K. Hara, H. Kataoka and Y. Satoh (2018) Can spatiotemporal 3d cnns retrace the history of 2d cnns and imagenet?. In Proceedings of the IEEE conference on Computer Vision and Pattern Recognition, pp. 6546–6555. Cited by: §3.
  9. K. He, X. Zhang, S. Ren and J. Sun (2016) Deep residual learning for image recognition. CVPR. Cited by: §1, §2, §3, §3.2.
  10. Y. Jang, Y. Song, Y. Yu, Y. Kim and G. Kim (2017) Tgif-qa: toward spatio-temporal reasoning in visual question answering. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2758–2766. Cited by: §2, §2, §3.3, §3.3, §4.1, §4.3.1, Table 2, Table 3.
  11. W. Jin, Z. Zhao, M. Gu, J. Yu, J. Xiao and Y. Zhuang (2019) Multi-interaction network with object relation for video question answering. In Proceedings of the 27th ACM International Conference on Multimedia, pp. 1193–1201. Cited by: §2.
  12. J. Kim, J. Jun and B. Zhang (2018) Bilinear attention networks. In Advances in Neural Information Processing Systems, pp. 1564–1574. Cited by: §2.
  13. J. Kim, M. Ma, K. Kim, S. Kim and C. D. Yoo (2019) Progressive attention memory network for movie story question answering. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 8337–8346. Cited by: §2.
  14. K. Kim, S. Choi, J. Kim and B. Zhang (2018) Multimodal dual attention memory for video story question answering. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 673–688. Cited by: §2.
  15. K. Kim, M. Heo, S. Choi and B. Zhang (2017) DeepStory: video story QA by deep embedded memory networks. In IJCAI, pp. 2016–2022. Cited by: §2.
  16. T. M. Le, V. Le, S. Venkatesh and T. Tran (2019) Learning to reason with relational video representation for question answering. arXiv preprint arXiv:1907.04553. Cited by: §2.
  17. J. Lei, L. Yu, M. Bansal and T. L. Berg (2018) Tvqa: localized, compositional video question answering. Conference on Empirical Methods in Natural Language Processing. Cited by: §1, §1, §2, §5.
  18. F. Li, C. Gan, X. Liu, Y. Bian, X. Long, Y. Li, Z. Li, J. Zhou and S. Wen (2017) Temporal modeling approaches for large-scale youtube-8m video understanding. CVPR workshop. Cited by: §2.
  19. X. Li, L. Gao, X. Wang, W. Liu, X. Xu, H. T. Shen and J. Song (2019) Learnable aggregating net with diversity learning for video question answering. In Proceedings of the 27th ACM International Conference on Multimedia, pp. 1166–1174. Cited by: §2.
  20. X. Li, J. Song, L. Gao, X. Liu, W. Huang, X. He and C. Gan (2019) Beyond RNNs: Positional Self-Attention with Co-Attention for Video Question Answering. In AAAI. Cited by: §1, §1, §2, §2, §4.3.1, Table 2, Table 3.
  21. J. Liang, L. Jiang, L. Cao, L. Li and A. G. Hauptmann (2018) Focal visual-text attention for visual question answering. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6135–6143. Cited by: §2.
  22. R. Lienhart (1999) Abstracting home video automatically. In Proceedings of the seventh ACM international conference on Multimedia (Part 2), pp. 37–40. Cited by: §2.
  23. F. Mao, X. Wu, H. Xue and R. Zhang (2018) Hierarchical video frame sequence representation with deep convolutional graph network. In Proceedings of the European Conference on Computer Vision (ECCV). Cited by: §2.
  24. S. Na, S. Lee, J. Kim and G. Kim (2017) A read-write memory network for movie story understanding. In International Conference on Computer Vision (ICCV 2017). Venice, Italy. Cited by: §1, §2.
  25. P. Pan, Z. Xu, Y. Yang, F. Wu and Y. Zhuang (2016) Hierarchical recurrent neural encoder for video representation with application to captioning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1029–1038. Cited by: §2.
  26. J. Pennington, R. Socher and C. D. Manning (2014) Glove: global vectors for word representation. In EMNLP, Vol. 14, pp. 1532–1543. Cited by: §3.
  27. E. Perez, F. Strub, H. De Vries, V. Dumoulin and A. Courville (2018) Film: visual reasoning with a general conditioning layer. In AAAI. Cited by: §1, §2.
  28. Z. Qiu, T. Yao and T. Mei (2017) Learning spatio-temporal representation with pseudo-3d residual networks. In Proceedings of the IEEE International Conference on Computer Vision, pp. 5533–5541. Cited by: §2.
  29. G. Singh, L. Sigal and J. J. Little (2019) Spatio-temporal relational reasoning for video question answering. In BMVC. Cited by: §2.
  30. X. Song, Y. Shi, X. Chen and Y. Han (2018) Explore multi-step reasoning in video question answering. In 2018 ACM Multimedia Conference on Multimedia Conference, pp. 239–247. Cited by: §1, §2, §3.3.
  31. C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke and A. Rabinovich (2015) Going deeper with convolutions. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 1–9. Cited by: §1, §2, §3.2.
  32. Y. Tang, X. Zhang, L. Ma, J. Wang, S. Chen and Y. Jiang (2018) Non-local netvlad encoding for video classification. In Proceedings of the European Conference on Computer Vision (ECCV). Cited by: §2.
  33. M. Tapaswi, Y. Zhu, R. Stiefelhagen, A. Torralba, R. Urtasun and S. Fidler (2016) Movieqa: understanding stories in movies through question-answering. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 4631–4640. Cited by: §1, §2, §5.
  34. D. Tran, H. Wang, L. Torresani, J. Ray, Y. LeCun and M. Paluri (2018) A closer look at spatiotemporal convolutions for action recognition. In Proceedings of the IEEE conference on Computer Vision and Pattern Recognition, pp. 6450–6459. Cited by: §2.
  35. A. Wang, A. T. Luu, C. Foo, H. Zhu, Y. Tay and V. Chandrasekhar (2019) Holistic multi-modal memory network for movie question answering. IEEE Transactions on Image Processing 29, pp. 489–499. Cited by: §2.
  36. B. Wang, Y. Xu, Y. Han and R. Hong (2018) Movie question answering: remembering the textual cues for layered visual contents. AAAI’18. Cited by: §1, §2.
  37. C. Wu, C. Feichtenhofer, H. Fan, K. He, P. Krahenbuhl and R. Girshick (2019) Long-term feature banks for detailed video understanding. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 284–293. Cited by: §2.
  38. S. Xie, R. Girshick, P. Dollár, Z. Tu and K. He (2017) Aggregated residual transformations for deep neural networks. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 1492–1500. Cited by: §3.
  39. D. Xu, Z. Zhao, J. Xiao, F. Wu, H. Zhang, X. He and Y. Zhuang (2017) Video question answering via gradually refined attention over appearance and motion. In Proceedings of the 25th ACM international conference on Multimedia, pp. 1645–1653. Cited by: §2, Figure 4, §4.1.
  40. J. Xu, T. Mei, T. Yao and Y. Rui (2016) Msr-vtt: a large video description dataset for bridging video and language. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 5288–5296. Cited by: §2, §4.1.
  41. T. Yang, Z. Zha, H. Xie, M. Wang and H. Zhang (2019) Question-aware tube-switch network for video question answering. In Proceedings of the 27th ACM International Conference on Multimedia, pp. 1184–1192. Cited by: §2.
  42. Y. Ye, Z. Zhao, Y. Li, L. Chen, J. Xiao and Y. Zhuang (2017) Video question answering via attribute-augmented attention network learning. In Proceedings of the 40th International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 829–832. Cited by: §2.
  43. Z. Yu, J. Yu, J. Fan and D. Tao (2017) Multi-modal factorized bilinear pooling with co-attention learning for visual question answering. In Proceedings of the IEEE international conference on computer vision, pp. 1821–1830. Cited by: §2.
  44. K. Zeng, T. Chen, C. Chuang, Y. Liao, J. C. Niebles and M. Sun (2017) Leveraging video descriptions to learn video question answering. In Thirty-First AAAI Conference on Artificial Intelligence. Cited by: §2, §2.
  45. Z. Zhao, X. Jiang, D. Cai, J. Xiao, X. He and S. Pu (2018) Multi-turn video question answering via multi-stream hierarchical attention context network. In IJCAI, pp. 3690–3696. Cited by: §2.
  46. Z. Zhao, Q. Yang, D. Cai, X. He and Y. Zhuang (2017) Video question answering via hierarchical spatio-temporal attention networks. In IJCAI, pp. 3518–3524. Cited by: §2.
  47. Z. Zhao, Z. Zhang, S. Xiao, Z. Xiao, X. Yan, J. Yu, D. Cai and F. Wu (2019) Long-form video question answering via dynamic hierarchical reinforced networks. IEEE Transactions on Image Processing 28 (12), pp. 5939–5952. Cited by: §2.
  48. B. Zhou, A. Andonian, A. Oliva and A. Torralba (2018) Temporal relational reasoning in videos. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 803–818. Cited by: §2, §3.1.
  49. L. Zhu, Z. Xu, Y. Yang and A. G. Hauptmann (2017) Uncovering the temporal context for video question answering. International Journal of Computer Vision 124 (3), pp. 409–421. Cited by: §2.