Crowd Video Captioning
This work is supported in part by the National Science Foundation China (NSFC)
61761136005, and ARC through DP190100887 and DP160104500.
Describing a video automatically in natural language is a challenging task in computer vision. News coverage of great events usually reports the on-site situation but neglects the off-site spectators at the entrances and exits, who also arouse people’s interest. Since stationing reporters at the entrances and exits costs a lot of manpower, automatically describing the behavior of a crowd of off-site spectators is a significant and still open problem.
To tackle this problem, we propose a new task called crowd video captioning (CVC) which aims to describe the crowd of spectators. We also provide baseline methods for this task and evaluate them on the dataset WorldExpo’10. Our experimental results show that captioning models have a fairly deep understanding of the crowd in video and perform satisfactorily in the CVC task.
With the rapid development of deep neural networks, computers can describe the content of a video in reasonable depth. Video captioning is practical and has a wide range of potential applications, a typical one being news broadcasting.
In most great events, there are professional commentators in the stadium to broadcast the situation in real time, and some studies about automatic sports video commentary [1, 2] have been carried out. Outside the venue, entry and exit of spectators are also important. Reports such as “The line of people snaked into the theater with joy in their faces.” often appear in the news, but deliberately assigning reporters to wait for spectators wastes manpower. Therefore, to report the situation of off-site spectators in real time, we use the surveillance camera to analyze the spectators’ crowd and generate captions with the deep learning methods.
Recently, there have been some works on crowd counting [3, 4, 5] and classification [6, 7, 8, 9], but their output is only the number of pedestrians, the state of mobility, or a label for abnormal behaviors. None of these works can produce a descriptive sentence about the crowd in the manner of a news report.
A caption of crowd video needs to describe various attributes of a crowd, such as the number of people in the crowd, the situation of movement, direction of flow, etc. Therefore, we use a captioning framework to describe the crowd in videos. In major events, if the surveillance camera uses our system, it can be directly connected to the news broadcasting system to broadcast the real-time off-site situation to the news media.
Our framework uses convolutional neural networks to extract these crowd features, then feeds them into a classifier or a language model to produce the summary. All attributes and situations of the crowd should be included in the output descriptive words of the language model.
To validate our system, we create a crowd video captioning dataset, which is based on the crowd counting dataset: WorldExpo’10. We select some of the videos in this dataset and make captions for them. Several experiments using the proposed models have been carried out to evaluate the performances of those methods.
The main contribution of our work is the proposal of a new task called crowd video captioning (CVC) which aims to generate captions for the crowd video. We provide baselines and a system framework for this task, and the results of the experiments prove the feasibility of our system.
II Related Work
II-A Crowd Counting
In recent years, many models and datasets have been proposed for crowd counting. For example, Chan et al. [3] collected UCSD at the University of California, San Diego, which is one of the earliest datasets for crowd counting, and used dynamic textures and Gaussian methods to count the crowd in videos. Since then, many new models have been proposed for this task, such as CNN [4] and ACSVP [5] (a GAN-based, U-Net-structured model).
WorldExpo’10, proposed in [4], is another large-scale dataset for crowd counting. It includes more than a thousand labeled videos captured by over one hundred monitoring cameras, all from the Shanghai World Expo in 2010. We use it in this research.
II-B Crowd Behaviors Analysis
In addition to counting, research on crowd behavior analysis is also underway. MED [6] was released as a crowd emotion dataset, but it contains only 31 videos, in which the people in the crowd merely walk around and perform a few specific actions, such as fighting and hugging.
The newest dataset, Crowd-11 [8], was provided to classify fine-grained crowd behaviors. It categorizes the flow mainly by the direction of each individual in the crowd. Models including LSTM, C3D, V3G, and ConvLSTM have been used to analyze abnormal crowd behaviors in those datasets. While these systems work efficiently, they have significant disadvantages: their accuracy on fine-grained flow classification is generally low, and they are better suited to monitoring abnormal behavior.
II-C Fine-grained Video Captioning
There are also some works on fine-grained video captioning, including broadcasting for tennis videos [1] and the Fine-grained Sports Narrative dataset [2]. Models like LSTM-YT [10] and S2VT [11] have been used for those tasks, but all of these works aim at professional sports broadcasting.
A conventional video captioning pipeline can be divided into two stages: feature extraction and caption generation.
In the first stage, we can use an image classification model as a frame feature extractor, or a video classification model as a video feature extractor. The features extracted by these two methods are different, so a suitable model should be selected for each task.
In the second stage, a sequence-to-sequence model is fed with the extracted features and generates sentences. A language model such as a recurrent neural network (RNN) can be used to construct this decoder.
III-A Frame Feature Extraction
The high-level 2D features of a frame can be extracted by a convolutional neural network (CNN) for image classification: the frames of the video are fed to it one after another, and the feature of each frame is taken from the last layer of the model.
Inception V3 [12] is one of those networks, evolved from GoogLeNet [13]. It decomposes an $n \times n$ 2D convolution mask into two 1D masks of size $1 \times n$ and $n \times 1$, respectively. This not only accelerates the calculation but also increases the depth of the network.
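As a rough illustration of why this factorization saves computation, the following sketch counts the weights of one n×n mask against a 1×n mask followed by an n×1 mask. The function names are ours, and bias terms and channel counts are ignored, so this is only a back-of-the-envelope comparison rather than a description of the actual Inception V3 layers.

```python
def full_kernel_weights(n: int) -> int:
    """Weights in a single n x n convolution mask."""
    return n * n


def factorized_kernel_weights(n: int) -> int:
    """Weights in a 1 x n mask followed by an n x 1 mask."""
    return n + n


for n in (3, 5, 7):
    full = full_kernel_weights(n)
    fact = factorized_kernel_weights(n)
    print(f"n={n}: {full} weights vs {fact} after factorization")
```

The saving grows with the mask size, and the single layer is replaced by two stacked layers, which is where the extra depth comes from.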
III-B Video Feature Extraction
C3D [15] is a network for extracting features from videos. Unlike 2D convolution applied to a 3D volume, which cannot slide along the temporal dimension, 3D convolution can extract the features of the same region at different times.
The C3D network has eight convolution and five pooling layers. Fig. 1 shows the output of the last layer (fc7) of C3D after projection with Principal Component Analysis (PCA), where every point represents the feature of its corresponding video and the 8 colors represent 8 classes. Before training (step=0), points from different categories are mingled together, but after training (step=89), points of the same category cluster together. There are 70 points in the figure, and each feature vector has length 4096.
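To make the role of the temporal axis concrete, here is a minimal pure-Python sketch of a valid 3D convolution (implemented, as in most deep learning libraries, as a cross-correlation) over a toy clip. The clip, the kernel, and the function name are illustrative assumptions and have nothing to do with the trained C3D weights.

```python
def conv3d_valid(video, kernel):
    """Valid 3D cross-correlation of video[t][y][x] with kernel[dt][dy][dx]."""
    T, H, W = len(video), len(video[0]), len(video[0][0])
    kt, kh, kw = len(kernel), len(kernel[0]), len(kernel[0][0])
    out = []
    for t in range(T - kt + 1):          # the kernel also slides along time
        plane = []
        for y in range(H - kh + 1):
            row = []
            for x in range(W - kw + 1):
                acc = 0.0
                for dt in range(kt):
                    for dy in range(kh):
                        for dx in range(kw):
                            acc += video[t + dt][y + dy][x + dx] * kernel[dt][dy][dx]
                row.append(acc)
            plane.append(row)
        out.append(plane)
    return out


# Toy clip: 4 frames of 3x3 pixels whose brightness grows by 1 per frame.
video = [[[float(t)] * 3 for _ in range(3)] for t in range(4)]
# Temporal difference kernel, 2 frames deep and 1x1 in space: frame[t+1] - frame[t].
kernel = [[[-1.0]], [[1.0]]]

motion = conv3d_valid(video, kernel)
print(len(motion), len(motion[0]), len(motion[0][0]))
```

Because the kernel spans two frames, every output value compares the same pixel at two consecutive time steps, which a purely spatial 2D kernel cannot do.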
III-C Caption Generation
S2VT [11] is a typical sequence-to-sequence network for generating video captions. It is made up of a two-layer recurrent neural network (RNN) whose cells are long short-term memory (LSTM) [16] units.
As Fig. 2 shows, S2VT takes the features extracted from the video frames as the input sequence. These feature embeddings are fed, like “words”, into the LSTM cells of the first layer one by one; then, after several iterations, the words of the caption are generated successively by the second layer.
S2VT generates each word conditioned on the words already generated. It uses <BOS> to indicate the beginning of the sentence and <EOS> as the end-of-sentence tag, and <pad> is used when there is no input at a time step. These tags are all used as references to produce the next word.
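The way the special tags line up with the time steps can be sketched as follows. The token spellings `<pad>`, `<BOS>` and `<EOS>` follow the convention of the S2VT paper and are an assumption here, and `s2vt_schedule` is a hypothetical helper that only lays out inputs and targets; it does not run any network.

```python
PAD, BOS, EOS = "<pad>", "<BOS>", "<EOS>"


def s2vt_schedule(frame_feats, caption_words):
    """Return per-time-step (frame_input, word_input, target) triples."""
    steps = []
    # Encoding stage: frames are read, no word is fed or predicted.
    for feat in frame_feats:
        steps.append((feat, PAD, None))
    # Decoding stage: the frame input is padded and words are fed one step late.
    word_inputs = [BOS] + list(caption_words)
    targets = list(caption_words) + [EOS]
    for word_in, target in zip(word_inputs, targets):
        steps.append((PAD, word_in, target))
    return steps


for step in s2vt_schedule(["f1", "f2", "f3"], ["Many", "people", "walk", "in"]):
    print(step)
```

Each decoding step receives the previous word and is trained to emit the next one, so the last word of the caption is paired with the end-of-sentence tag.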
IV Details of Models
In this section, we first introduce the definition of our crowd video captioning task and the overview of our frameworks. Then, we describe two alternative models to caption the crowd video.
IV-A Task Definition and Overview
Crowd video captioning aims to describe the attributes of a crowd in natural language, such as the number of people in the crowd, the situation of movement, and the direction of flow. Our model can be divided into two parts: an encoder and a decoder. First, we need to identify the crowd in the videos. It is easy to detect the changing areas between frames because the pixels representing pedestrians tend to move together as a whole. Then, frames are randomly selected from a video. The attributes and situations of the crowd can be extracted from the following features: crowd extent, density, individual movements, pedestrian situations, and so on. Our framework employs convolutional neural networks as the feature extractor to obtain these feature embeddings.
The features extracted from the videos include the attributes of the crowd. Therefore, the crowd’s features can be extracted directly by the extractor frame by frame and then joined into one sequence. Alternatively, the crowd’s features can be encoded into a vector sequence by the video feature extractor. Finally, this sequence is fed to a classifier or a language model, which acts as the caption generator and produces the description of the crowd.
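The frame-by-frame path can be sketched as below. This is a minimal sketch under stated assumptions: `toy_extractor` is a hypothetical stand-in for a CNN, the frames are fake lists of pixel values, and keeping the sampled indices in temporal order is our choice rather than something the text specifies.

```python
import random


def sample_frame_indices(num_frames, num_samples, seed=None):
    """Pick num_samples distinct frame indices at random, kept in temporal order."""
    rng = random.Random(seed)
    return sorted(rng.sample(range(num_frames), num_samples))


def encode_video(frames, extract_feature, num_samples, seed=None):
    """Run a per-frame extractor on the sampled frames and join the results."""
    indices = sample_frame_indices(len(frames), num_samples, seed)
    return [extract_feature(frames[i]) for i in indices]


def toy_extractor(frame):
    # Hypothetical stand-in for a CNN: mean and max pixel intensity.
    return [sum(frame) / len(frame), max(frame)]


frames = [[i, i + 1, i + 2] for i in range(100)]  # 100 fake "frames"
sequence = encode_video(frames, toy_extractor, num_samples=16, seed=0)
print(len(sequence), len(sequence[0]))
```

The resulting list of feature vectors is the sequence that a classifier or language model would consume.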
IV-B Classification Model
If there are $w$ words in the dataset, the number of $k$-word sentences that can theoretically be formed is $w^k$. But very few of those randomly generated sentences are grammatical, so we use the grammatical sentences as the labels that the classifier needs to recognize. In this model, after the video features are extracted by the encoder, a classifier is used as the decoder.
Take a $p$-category classifier based on C3D as an example: a linear layer is appended to C3D as the classifier, as shown in Fig. 3. The linear layer converts the 4096-dimensional feature vector into a $p$-dimensional output $z = (z_1, \dots, z_p)$, and the probability of each label is then calculated by softmax as follows:

$$P(y = i \mid z) = \frac{e^{z_i}}{\sum_{j=1}^{p} e^{z_j}},$$

where $i = 1, \dots, p$, and $p$ is the total number of categories. After that, the entry corresponding to the category with the maximal probability is set to 1.
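The softmax step and the final one-hot selection can be sketched in a few lines. The scores below are made up for a classifier with eight categories, and the helper names are ours.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of class scores."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def one_hot_prediction(logits):
    """Set the entry of the most probable category to 1 and the rest to 0."""
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    return [1 if i == best else 0 for i in range(len(probs))]


logits = [0.2, 1.5, -0.3, 0.0, 2.1, 0.4, -1.0, 0.9]  # made-up scores, 8 categories
print(softmax(logits))
print(one_hot_prediction(logits))
```

Subtracting the maximum score before exponentiation does not change the probabilities; it only avoids overflow for large scores.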
IV-C Captioning Model
Since the classification model can only generate captions within the label set, we adopt the captioning model in our CVC system framework. To caption the video, our captioning model needs to generate natural language sequences. Given an $n$-frame feature sequence $(x_1, \dots, x_n)$, the goal of the captioning model is to generate the proper words $(y_1, \dots, y_m)$. The model estimates the probability:

$$p(y_1, \dots, y_m \mid x_1, \dots, x_n) = \prod_{t=1}^{m} p(y_t \mid y_{<t}, x_1, \dots, x_n),$$

where $y_{<t}$ represents all the words generated before the $t$-th word. When the captioning model outputs each word, it chooses the word with the highest probability at this position, given all the words that have been output before.
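Choosing the most probable word at every position amounts to greedy decoding, which can be sketched as follows. The lookup table is a hypothetical stand-in for the trained decoder: it conditions only on the previous word, whereas a real model also conditions on the video features and on the full history. The word "Few" is a placeholder that does not come from the dataset description.

```python
def greedy_decode(next_word_probs, max_len=10, bos="<BOS>", eos="<EOS>"):
    """Repeatedly pick the most probable next word given the words so far."""
    words = [bos]
    for _ in range(max_len):
        probs = next_word_probs(words)  # dict: word -> probability
        best = max(probs, key=probs.get)
        if best == eos:
            break
        words.append(best)
    return words[1:]


# Made-up conditional probabilities keyed by the previous word only.
TABLE = {
    "<BOS>": {"Many": 0.7, "Few": 0.3},
    "Many": {"people": 0.9, "<EOS>": 0.1},
    "Few": {"people": 0.9, "<EOS>": 0.1},
    "people": {"walk": 0.6, "run": 0.4},
    "walk": {"in": 0.55, "out": 0.45},
    "run": {"in": 0.5, "out": 0.5},
    "in": {"<EOS>": 1.0},
    "out": {"<EOS>": 1.0},
}

caption = greedy_decode(lambda words: TABLE[words[-1]])
print(" ".join(caption))
```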
The overall design of our CVC system framework of captioning model is shown in Fig. 2.
First, the per-frame features or the whole-video feature of the crowd are obtained from the encoder, such as C3D, ResNet, or Inception.
After that, we take a language model, like the S2VT shown in Fig. 2, as the decoder for captioning. In addition to the LSTM, we apply a gated recurrent unit (GRU) [17] as the cell of S2VT, which has fewer parameters than LSTM and can help avoid over-fitting. Research and practical experience show that each of these two cells has its own advantages and disadvantages.
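The difference in parameter count can be estimated with the textbook formulation of the two cells, in which every gate has one input weight matrix, one recurrent weight matrix and one bias vector. The sizes below are purely illustrative, and library implementations may differ slightly (for example by keeping two bias vectors per gate).

```python
def rnn_cell_params(input_size, hidden_size, num_gates):
    """Weights and biases of a gated cell with one bias vector per gate."""
    per_gate = hidden_size * (input_size + hidden_size) + hidden_size
    return num_gates * per_gate


INPUT_SIZE, HIDDEN_SIZE = 2048, 512  # illustrative sizes

# LSTM: input, forget and output gates plus the candidate cell state.
lstm = rnn_cell_params(INPUT_SIZE, HIDDEN_SIZE, num_gates=4)
# GRU: update and reset gates plus the candidate hidden state.
gru = rnn_cell_params(INPUT_SIZE, HIDDEN_SIZE, num_gates=3)
print(lstm, gru, gru / lstm)
```

Under this count a GRU cell needs three quarters of the parameters of an LSTM cell of the same size, whatever the sizes are.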
The following steps show how the model produces the captions.
First, a vocabulary $V$ containing all the words in the dataset is built, and every word is encoded into a vector.
Next, the features are input to the model, and the word selected from the vocabulary depends on the hidden state $h_t$ and the network parameters $\theta$:

$$y_t = \arg\max_{y \in V} p(y \mid h_t; \theta).$$

Finally, supposing that the hidden states of the network for $n$ input frames and $m$ output words are $h_1, \dots, h_{n+m}$, from the beginning of the input sequence to the end of the output sentence, the model optimizes the parameters by maximizing the sum of the log-likelihoods of the generated words as follows:

$$\theta^{*} = \arg\max_{\theta} \sum_{t=1}^{m} \log p(y_t \mid h_{n+t-1}, y_{t-1}; \theta).$$
IV-D Loss Function
During training, we use the cross-entropy loss as the loss function, which is defined as follows:

$$L = -\sum_{i} y_i \log \hat{y}_i,$$

where $y$ is the ideal embedding of the words created by humans and $\hat{y}$ is the embedding of the computer-generated words. The greater the difference between the predicted results and the ground truth, the larger the gradient of the loss function and the faster the convergence.
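A small numerical sketch shows how this loss behaves when the target of each word is a one-hot vector, in which case summing the per-word losses gives the negative log-likelihood of the sentence. The three-word vocabulary and the two predicted distributions are made up for illustration.

```python
import math


def cross_entropy(target_one_hot, predicted_probs, eps=1e-12):
    """Cross-entropy between a one-hot target and a predicted distribution."""
    return -sum(t * math.log(p + eps) for t, p in zip(target_one_hot, predicted_probs))


def sentence_loss(targets, predictions):
    """Sum of per-word cross-entropies, i.e. the negative log-likelihood."""
    return sum(cross_entropy(t, p) for t, p in zip(targets, predictions))


# Toy 3-word vocabulary and a 2-word caption.
targets = [[1, 0, 0], [0, 0, 1]]
good = [[0.9, 0.05, 0.05], [0.1, 0.1, 0.8]]
bad = [[0.2, 0.5, 0.3], [0.6, 0.3, 0.1]]
print(sentence_loss(targets, good), sentence_loss(targets, bad))
```

Minimizing this quantity is therefore the same as maximizing the sum of the log-probabilities of the reference words.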
To caption the crowd videos, we select WorldExpo’10 from a series of datasets for crowd analysis. To evaluate our baseline and captioning methods on our dataset, C3D and other feature extractors are chosen as the encoder, and experiments on S2VT with LSTM and GRU cells have been conducted.
Crowds in most videos of the WorldExpo’10 dataset are messy because this dataset is mainly used for crowd counting. To simplify the task, we select 98 of its videos and caption them based on the crowd. These videos are captured by 7 surveillance cameras.
Keywords of the captions in our dataset are shown in Fig. 4; they describe the number of people in the crowd, the situation of movement, and the direction of flow, respectively. Because “running” accounts for a small percentage of the WorldExpo’10 dataset, it accounts for only 21% of our dataset. We define the direction toward the camera as “in” and the direction away from the camera as “out”.
The size of the vocabulary is 6, and the number of attribute pairs that make up the descriptive sentences is 3, so $2^3 = 8$ captions are formed in our dataset.
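The count of possible captions can be checked by enumerating one word per attribute pair. Only "Many", "walk", "in" and "out" are quoted in the text and "run" is derived from "running"; "Few" is a placeholder for the unseen counterpart of "Many", and the fixed word "people" is treated as sentence scaffolding outside the six-word vocabulary, which is also an assumption.

```python
from itertools import product

# One word pair per attribute: number of people, movement, direction.
ATTRIBUTE_PAIRS = [("Many", "Few"), ("walk", "run"), ("in", "out")]

captions = [
    f"{count} people {verb} {direction}"
    for count, verb, direction in product(*ATTRIBUTE_PAIRS)
]

vocabulary = {word for pair in ATTRIBUTE_PAIRS for word in pair}
print(len(vocabulary), len(captions))
for caption in captions:
    print(caption)
```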
V-B Baseline Based on C3D
The first baseline uses C3D directly as an 8-category classifier. Since the words in our dataset can make up $2^3 = 8$ sentences, we divide all the videos into eight categories. The label of each category is one of these sentences, such as “Many people walk in”.
The dimension of the last layer in C3D is 4096, and we add a linear fully connected layer after it as the classifier. The output dimension of this linear layer is the total number of categories, which is 8 in our experiment.
The model is first pre-trained with UCF101 . We then fine-tune the network on our dataset. The split for training, validating and testing is 70:19:9. In order to fit the C3D model, frames are resized into , and the randomly cropped to .
The accuracy and loss curves for training, validating and testing epochs are shown in Fig. 5, where the number of frames inputted to the model is 16, the learning rate is set to , and the schedule is set to divide the learning rate by 2 every 10 epochs. And the curves in Fig. 5 are smoothed by 0.8, while the original values are reported in faint polylines.
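A step schedule of this kind can be written as a one-line function. The starting value `BASE_LR` below is hypothetical and is not taken from the paper; only the halving every 10 epochs follows the description above.

```python
def step_decay(base_lr, epoch, factor, step_size):
    """Learning rate after multiplying by `factor` once every `step_size` epochs."""
    return base_lr * factor ** (epoch // step_size)


BASE_LR = 1e-3  # hypothetical starting value, not taken from the paper

for epoch in (0, 9, 10, 20, 30):
    print(epoch, step_decay(BASE_LR, epoch, factor=0.5, step_size=10))
```

The same function covers other step schedules by changing `factor` and `step_size`.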
The curves show that during training the loss almost converges to zero, and the accuracy also converges quickly. The accuracy reaches 0.9714 on the training set and 0.6842 on the validation set, but only 0.4444 on the test set. The loss can be reduced to approximately zero on the training set, but not on the validation or test set.
V-C Evaluation Metrics for Captioning
In this section, we introduce several frequently used evaluation metrics for video captioning.
BLEU. It is based on modified n-gram precision. First, the modified precision $p_n$ is defined as the candidate n-gram counts clipped by their corresponding maximum reference counts, summed, and divided by the total number of candidate n-grams. Second, supposing $c$ is the length of the candidate translation and $r$ is the effective reference corpus length, we compute the brevity penalty:

$$\mathrm{BP} = \begin{cases} 1, & c > r, \\ e^{1 - r/c}, & c \le r. \end{cases}$$

Finally, we use n-grams from 1-gram up to length $N$ to calculate the BLEU score with the weighted precision:

$$\mathrm{BLEU} = \mathrm{BP} \cdot \exp\left( \sum_{n=1}^{N} w_n \log p_n \right),$$

where the $w_n$ are positive weights that sum to one.
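The two ingredients of BLEU, clipped n-gram precision and the brevity penalty, fit in a short sentence-level sketch. It uses uniform weights and no smoothing, and it picks the reference length closest to the candidate length as the effective reference length; these are simplifying assumptions and not necessarily the settings of the evaluation toolkit used for the reported scores.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def modified_precision(candidate, references, n):
    """Candidate n-gram counts clipped by the maximum count in any reference."""
    cand_counts = Counter(ngrams(candidate, n))
    if not cand_counts:
        return 0.0
    max_ref = Counter()
    for ref in references:
        for gram, count in Counter(ngrams(ref, n)).items():
            max_ref[gram] = max(max_ref[gram], count)
    clipped = sum(min(count, max_ref[gram]) for gram, count in cand_counts.items())
    return clipped / sum(cand_counts.values())


def bleu(candidate, references, max_n=4):
    """Sentence-level BLEU with uniform weights and no smoothing."""
    precisions = [modified_precision(candidate, references, n) for n in range(1, max_n + 1)]
    if min(precisions) == 0.0:
        return 0.0
    c = len(candidate)
    # Effective reference length: the reference length closest to c.
    r = min((len(ref) for ref in references), key=lambda length: (abs(length - c), length))
    bp = 1.0 if c > r else math.exp(1.0 - r / c)
    return bp * math.exp(sum(math.log(p) for p in precisions) / max_n)


reference = "many people walk in".split()
print(bleu("many people walk in".split(), [reference]))
print(bleu("many people walk out".split(), [reference]))
print(bleu("many people walk out".split(), [reference], max_n=2))
```

With four-word captions, a single wrong word removes the only 4-gram match, so the unsmoothed 4-gram score collapses to zero while the lower orders still give partial credit.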
CIDEr. It measures the similarity of a sentence to the majority, or consensus, of how most people describe the image. For instance, a sentence such as “Mike has a baseball and Jenny has basketball” is more representative of the consensus descriptions than the sentence “Jenny brought a bigger ball than Mike”. CIDEr is proposed to capture sentences with broader consensus.
METEOR. It is based on the weighted precision $P$ and recall $R$ of the matched content and function words in the hypothesis and reference. A fragmentation penalty is defined to account for differences in word order:

$$\mathrm{Pen} = \gamma \left( \frac{ch}{m} \right)^{\beta},$$

where $ch$ is the number of chunks, each defined as a series of matches that is contiguous and identically ordered in both sentences, and $m$ is the average number of matched words over the hypothesis and reference. After the parameterized harmonic mean $F_{mean}$ of $P$ and $R$ is calculated, the METEOR score is computed as follows:

$$\mathrm{Score} = (1 - \mathrm{Pen}) \cdot F_{mean}.$$

It is always smaller than 1, even if the predicted sentences are the same as the references, because $\mathrm{Pen}$ never equals zero.
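The observation that a perfect match still scores below 1 can be checked numerically. The sketch below takes the precision, recall, chunk count and match count as given instead of computing an alignment, and the parameter values alpha = 0.9, beta = 3 and gamma = 0.5 are assumptions corresponding to the original METEOR settings; tuned releases use other values.

```python
def meteor_score(precision, recall, chunks, matches, alpha=0.9, beta=3.0, gamma=0.5):
    """METEOR from unigram precision/recall and the chunk count of the alignment."""
    if matches == 0:
        return 0.0
    f_mean = precision * recall / (alpha * precision + (1 - alpha) * recall)
    penalty = gamma * (chunks / matches) ** beta
    return (1 - penalty) * f_mean


# Hypothesis identical to a 4-word reference: P = R = 1, one chunk, four matches.
score = meteor_score(precision=1.0, recall=1.0, chunks=1, matches=4)
print(score)
```

Even a perfect alignment forms one chunk, so the penalty is positive for any sentence of finite length and shrinks as the sentence gets longer.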
ROUGE. It counts the number of overlapping units such as n-gram, word sequences, and word pairs between the predicted captions and the referential summaries.
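For the ROUGE_L variant, the overlapping unit is the longest common subsequence (LCS) of the two sentences. The following is a minimal sketch for a single reference; the value beta = 1.2 is an assumption borrowed from common captioning evaluation code, not a value stated in the text.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    table = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a, 1):
        for j, y in enumerate(b, 1):
            if x == y:
                table[i][j] = table[i - 1][j - 1] + 1
            else:
                table[i][j] = max(table[i - 1][j], table[i][j - 1])
    return table[len(a)][len(b)]


def rouge_l(candidate, reference, beta=1.2):
    """ROUGE-L F-measure; beta weights recall against precision."""
    lcs = lcs_length(candidate, reference)
    if lcs == 0:
        return 0.0
    precision = lcs / len(candidate)
    recall = lcs / len(reference)
    return (1 + beta ** 2) * precision * recall / (recall + beta ** 2 * precision)


reference = "many people walk in".split()
print(rouge_l("many people walk in".split(), reference))
print(rouge_l("many people run in".split(), reference))
```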
V-D Details and Results of Model based on S2VT
We use C3D, ResNet-152, and Inception V3 (V4) pre-trained on UCF101 [18] or ImageNet [19] as the feature extractor for every frame of our dataset, and then train the S2VT model directly on those features. LSTM and GRU are used as the RNN cell of S2VT. For the best performance, the split for training and testing is adjusted to 45:4. Otherwise, we follow the default experimental settings. We set the dimension of the video frame features to 2048 and the hidden size to 512.
|C3D and LSTM||82.76||76.89||74.24||75.64||63.46||50.30||82.83||0.625|
|C3D and GRU||92.86||88.84||86.97||90.06||72.53||56.97||91.67||0.75|
|ResNet152 and LSTM||78.52||69.13||59.37||50.92||47.02||43.24||80.57||0.625|
|ResNet152 and GRU||92.86||88.84||83.97||81.63||70.44||56.97||92.71||0.75|
|Incept.v3 and LSTM||86.21||83.54||81.27||80.95||72.07||58.38||86.99||0.75|
|Incept.v3 and GRU||96.43||95.71||94.34||95.73||81.15||67.81||95.83||0.875|
|Incept.v4 and LSTM||82.09||81.62||80.68||84.35||68.65||57.90||83.33||0.75|
|Incept.v4 and GRU||92.86||88.84||83.97||81.63||70.44||56.97||92.71||0.75|
We report BLEU, CIDEr, METEOR, and ROUGE_L captioning scores for this method; the results on the test set are provided in Table I. The learning rate is set to , the scheduler is set to decay the learning rate by 0.8 every 200 epochs.
Results obtained from S2VT with different RNN cells and features are shown in Table I; the best performances are presented in bold, and the second-best are underlined. For our task, the most appropriate feature extractor is Inception V3, followed by C3D. GRU has fewer parameters than LSTM, so it converges more easily when the dataset is small, which likely explains why GRU works better than LSTM on our dataset.
The accuracy, defined as the proportion of completely correct sentences among all generated sentences, is calculated so that it can be compared with the results of the C3D method. It is higher than that of the C3D classifier. The two incorrect hypotheses generated by LSTM from Inception V3 features contain a mistake only in the verb and in the direction, respectively, as shown in Fig. 6. The second failure is caused by several people who interfere with the judgment. Although the sentence structure is not set manually, the output is fully grammatical, for example in following the singular-plural rule.
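The sentence-level accuracy used here is an exact-match rate, which can be sketched as follows. The eight references and hypotheses are made up to mimic a test split with one wrong verb and one wrong direction; they are not the actual outputs shown in Fig. 6.

```python
def exact_match_accuracy(hypotheses, references):
    """Fraction of generated sentences that equal their reference exactly."""
    correct = sum(h.split() == r.split() for h, r in zip(hypotheses, references))
    return correct / len(references)


# Made-up outputs for an 8-video test split.
references = ["Many people walk in"] * 4 + ["Many people walk out"] * 4
hypotheses = list(references)
hypotheses[1] = "Many people run in"    # wrong verb
hypotheses[6] = "Many people walk in"   # wrong direction
print(exact_match_accuracy(hypotheses, references))
```

Unlike the n-gram metrics, this measure gives no partial credit, so a caption with a single wrong word counts as a complete miss.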
V-E Experimental Conclusion
Compared with the baseline, the method based on S2VT outperforms the C3D classifier because it comprehends the words in the sentence separately. This suggests that it is not necessary to know what the specific features represent when extracting them, and that the analysis of crowd features can be handed over directly to the language model.
Moreover, it shows that image classification models such as Inception and ResNet can extract the feature of the crowd in each frame, and that the sequence composed of these features can be used for crowd captioning.
This reflects the power of the video captioning model, especially if more information needs to be described in the summary later. Video captioning models can interpret the temporal information from the features extracted by the convolutional neural network. They can also understand which information each word represents and assemble the words into a sentence without grammatical mistakes.
VI Conclusion and Future Work
In this paper, we propose a new video captioning task, called crowd video captioning (CVC), that describes the crowd of off-site audiences or visitors. In our encoder-decoder system for this task, we use a deep convolutional neural network to extract the features of the crowd video and feed them to a language model to generate the crowd description. We create a dataset based on WorldExpo’10. On this dataset, our experiments show that our CVC system accomplishes this task well and achieves high accuracy. This indicates that the language model can comprehend information about the crowd that the feature extractors themselves do not interpret.
In our approach, S2VT with features extracted from Inception V3 works better than the other methods in the CVC task because our dataset is small and the captions are simple. For future work, these models should be adjusted to fit this fine-grained crowd captioning task. The number of videos and the complexity of the captions in the dataset need to be increased as well.
[1] M. Sukhwani and C. V. Jawahar, “TennisVid2Text: Fine-grained descriptions for domain specific videos,” in BMVC, 2015.
[2] H. Y. Yu, S. Cheng, B. B. Ni, M. S. Wang, J. Zhang, and X. K. Yang, “Fine-grained video captioning for sports narrative,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 6006-6015.
[3] A. B. Chan, Z.-S. J. Liang, and N. Vasconcelos, “Privacy preserving crowd monitoring: Counting people without people models or tracking,” in IEEE Conference on Computer Vision Pattern Recognition, 2008, pp. 1-7.
[4] C. Zhang, H. Li, X. Wang, and X. Yang, “Cross-scene crowd counting via deep convolutional neural networks,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015, pp. 833-841.
[5] Z. Shen, Y. Xu, B. Ni, M. Wang, J. Hu, and X. Yang, “Crowd counting via adversarial cross-scale consistency pursuit,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 5245-5254.
[6] H. R. Rabiee, J. Haddadnia, H. Mousavi, M. Nabi, V. Murino, and N. Sebe, “Emotion-based crowd representation for abnormality detection,” CoRR, vol. abs/1607.07646, 2016.
[7] H. Su, Y. Dong, J. Zhu, H. Ling, and B. Zhang, “Crowd scene understanding with coherent recurrent neural networks,” in IJCAI, 2016, vol. 1, p. 2.
[8] C. Dupont, L. Tobias, and B. Luvison, “Crowd-11: A dataset for fine grained crowd behaviour analysis,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, 2017, pp. 9-16.
[9] Y. Li, “A deep spatiotemporal perspective for understanding crowd behavior,” IEEE Transactions on Multimedia, vol. 20, no. 12, pp. 3289-3297, 2018.
[10] S. Venugopalan, H. Xu, J. Donahue, M. Rohrbach, R. J. Mooney, and K. Saenko, “Translating videos to natural language using deep recurrent neural networks,” in HLT-NAACL, 2015.
[11] S. Venugopalan, M. Rohrbach, J. Donahue, R. Mooney, T. Darrell, and K. Saenko, “Sequence to sequence-video to text,” in Proceedings of the IEEE International Conference on Computer Vision, 2015, pp. 4534-4542.
[12] C. Szegedy, V. Vanhoucke, S. Ioffe, J. Shlens, and Z. Wojna, “Rethinking the inception architecture for computer vision,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 2818-2826.
[13] C. Szegedy et al., “Going deeper with convolutions,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015, pp. 1-9.
[14] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 770-778.
[15] D. Tran, L. Bourdev, R. Fergus, L. Torresani, and M. Paluri, “Learning spatiotemporal features with 3d convolutional networks,” in Proceedings of the IEEE International Conference on Computer Vision, 2015, pp. 4489-4497.
[16] D. D’Informatique et al., “Long short-term memory in recurrent neural networks,” Epfl, vol. 9, no. 8, pp. 1735-1780, 2001.
[17] K. Cho et al., “Learning phrase representations using RNN encoder–decoder for statistical machine translation,” in Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), 2014, pp. 1724-1734.
[18] K. Soomro, A. R. Zamir, and M. Shah, “UCF101: A dataset of 101 human actions classes from videos in the wild,” Computer Science, 2012.
[19] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei, “Imagenet: A large-scale hierarchical image database,” in IEEE Conference on Computer Vision and Pattern Recognition, 2009, pp. 248-255.