Large-scale, Fast and Accurate Shot Boundary Detection through Spatio-temporal Convolutional Neural Networks
Shot boundary detection (SBD) is an important pre-processing step for video manipulation. Here, each segment of frames is classified as either sharp, gradual or no transition. Current SBD techniques analyze hand-crafted features and attempt to optimize both detection accuracy and processing speed. However, the heavy computation of optical flow prevents this from happening. To achieve this aim, we present an SBD technique based on spatio-temporal Convolutional Neural Networks (CNN). Since current datasets are not large enough to train an accurate SBD CNN, we are the first to present a very large SBD dataset that allows deep neural network techniques to be effectively applied. Our dataset contains more than 3.5 million frames of sharp and gradual transitions. The transitions are generated synthetically using image compositing models. Our dataset contains an additional 70,000 frames of important hard-negative no-transitions. We perform the largest evaluation to date for one SBD algorithm, on real and synthetic data, containing more than 4.85 million frames. In comparison to the state of the art, we achieve better dissolve gradual detection, competitive performance for sharp detections and a significant improvement in wipes. In addition, we are up to 11 times faster than the state of the art.
With the wide adoption of digital video, the demand for editing and manipulating video content is continuously rising. This, however, requires a better understanding of videos and their composition. Videos are composed of different camera shots placed one after another. A video shot transitions into another through several forms of visual effect. These visual effects can be classified into two main categories: sharp and gradual, as shown in Figure 1. The former is a sudden change of the shot over one frame, while gradual transitions occur over multiple frames. Gradual transitions are further classified into dissolve and non-dissolve. The former includes cases such as semi-transparent graduals, fade in and fade out (see Figure 1). Non-dissolve transitions are dominated by wipes (see Figure 1). Wipe graduals have a much wider variety than dissolve graduals.
Video post-processing techniques are rising in popularity and cover a wide range of applications. This includes video coding [2], visual quality enhancement [3, 4], graphics rendering [5, 6, 7], video understanding [8, 9] and many others [10, 11]. Such post-processing techniques, however, are based on assumptions, some of which can be violated during shot transitions. For instance, many techniques assume the presence of one layer at one spatial point, an assumption heavily violated during dissolve transitions. This can lead to unpleasant artifacts, as in the case of 2D-to-3D conversion (see Figure 2). Here, the disparity maps can suffer strong artifacts during gradual transitions. Hence, detecting video transitions and assigning a special treatment for them during post-production is an important and desirable step. However, with the high computational demand of many post-production techniques, as well as the real-time requirement of some, shot boundary detection (SBD) needs to be performed with both high detection accuracy and very fast processing speed.
Current SBD techniques analyze hand-crafted features [12, 13, 1, 14, 15, 16, 17, 18, 19, 20]. Fast techniques analyze only spatial information such as intensity histograms [13, 19], edges, mutual information and others [16, 20, 18, 17]. Such techniques, while being fast, generate poor detection. To boost detection, motion information is incorporated through optical flow [12, 1, 21, 22]. However, the heavy computation of optical flow [23, 24, 25] makes such techniques slow. As SBD techniques are commonly used as a pre-processing step for video manipulation, optimizing both their detection accuracy and processing speed is important. This, however, remains a challenging problem.
We present DeepSBD, a fast and accurate shot boundary detection technique based on convolutional neural networks (CNN). We exploit big data to achieve high detection performance. In addition, we exploit the parallelizable nature and common GPU implementations of CNNs to achieve fast processing speed. Our technique takes a segment of 16 frames as input and classifies it as either gradual, sharp or no-transition. It analyzes both spatial and temporal information through an effective 3D convolutional network for video processing, inspired by C3D.
To train our network, we need a very large, well-annotated dataset. Although datasets already exist from the TRECVID challenge and others [1, 27], experiments show they are not sufficient to train a high accuracy CNN solution. In addition, the vast majority of these datasets are used for testing and evaluating different techniques, and hence should not be used for training. To overcome this problem, we present a very large SBD dataset with clean and accurate annotations, capable of training a highly accurate CNN SBD solution. This also allows us to test on all available TRECVID data (3.9 million frames). The first dataset portion, SBD_Syn, is generated synthetically using image compositing models. It contains 220,339 sharp and gradual segments, each containing 16 frames. The second portion, SBD_BT, contains 4,427 no-transition segments. They are carefully annotated by hand in a way that improves the detector's precision; they act as hard negatives. We optionally use one TRECVID release (2005) and another SBD dataset by Baraldi et al. to further improve performance. These datasets have 18,027 total transitions with prior annotations. That is only 7% of all training data.
Aspects of novelty of our work include:
The first CNN SBD technique. We outperform dissolve gradual detection, generate competitive performance for sharp detections and produce significant improvement in wipes. In addition, we are up to 11 times faster than the state of the art.
Introduction of a new very large SBD dataset for training an accurate CNN model. Our dataset contains 3.5 million frames of synthetic transitions and 70,000 frames of hard negative no-transitions.
A large wipes dataset containing 1.1 million frames. We will release all our datasets and code to encourage future research.
The largest SBD evaluation to date, on 4.85 million frames. 3.9 million frames are from all TRECVID years, while most of the rest are synthetically generated.
The next section reviews the state of the art. Here, we discuss the main components of our solution, including current SBD techniques, currently available SBD datasets and CNN solutions for spatio-temporal video analysis. We then present our SBD solution with emphasis on our detection system and our dataset generation process. Section IV presents detailed results and analysis. The results are also supported by supplementary material (in .pdf format). Section V concludes.
II State of the Art
II-A Shot Boundary Detection Techniques
SBD techniques [1, 12, 13, 22] extract features and analyze them temporally. Detection is then performed by finding temporal profiles that fit the examined transition model. Sharp transitions undergo a sudden change in the temporal profile over one frame. Gradual transitions exhibit a more stretched change in time. Current SBD techniques are classified into two main categories: those based on spatial-only analysis and those based on spatio-temporal analysis. The former estimates the temporal profile by comparing only spatial features [12, 13, 1, 14, 15, 16, 17, 18, 19, 20]. A number of spatial features are used, such as color histograms [13, 19], edges, mutual information and entropy, wavelet representations, SURF and many others [20, 18, 17, 29].
Spatial-only SBD methods generate conservative detection accuracy with fast processing speed. Spatio-temporal techniques use optical flow to make detection more robust to scene and camera motions [12, 14, 30, 31]. Such motions can arise from camera movements and shakiness and often confuse the detection process. Hence, optical flow [23, 24, 25] between neighboring frames is estimated and removed through frame interpolation. Analysis of the temporal profile then proceeds as in the spatial-only techniques. Here, motion compensation often reduces the false detections of SBD. The main drawback of spatio-temporal techniques, however, is the heavy computation of optical flow.
Among the rich SBD literature, four of the best performing and/or most recent techniques are Liu et al., Yuan et al., Lu et al. and Priya et al. Lu et al. focus on generating fast results and hence do not incorporate motion information. Their technique is based on assessing temporal discontinuities through HSV histograms. Priya et al. proposed a wavelet-based feature vector that measures four main quantities: color, edge, texture, and motion strength. The feature vector is extracted for each frame of a sequence and the temporal profile is estimated through frame differencing. Liu et al. use a large number of features including color, histogram, edge, motion and related statistical features. Liu et al., Priya et al. and Yuan et al. all focus on high detection accuracy. This, however, comes with the high cost of optical flow. Furthermore, the techniques of Apostolidis et al. and Baraldi et al. were recently released. They analyze only spatial information such as SURF and HSV/color histograms, and hence often generate conservative performance with fast processing speed.
To the best of our knowledge, Liu et al. is the latest wipe detector. A candidate transition segment is proposed, and the difference between each frame and the start and end frames is calculated. This generates two curves, one for the start and another for the end of the segment. For wipes, the curves should have opposing gradients and be somewhat linear. Furthermore, to reduce errors due to camera and object movements, motion-compensated frame differencing is used.
II-B SBD Datasets
From 2001 to 2007, the National Institute of Standards and Technology (NIST) maintained data for the TRECVID shot boundary detection (SBD) challenge. The dataset contains a wide variety of content including color, gray-scale, indoor, outdoor, outer-space and different levels of noise. The dataset has a total of 4,333,153 frames with 24,423 transitions, some sharp and the rest gradual. Transitions were manually annotated in a way that distinguishes between sharp and gradual. Four more releases from a different challenge were maintained by NIST that contain data relevant to SBD. The releases are T2007t, T2007d, T2008 and T2009, containing 34,765,424 frames with 155,902 transitions. The annotations of these data, however, do not distinguish between sharp and gradual. Finally, one more data release related to SBD was generated by Baraldi et al. Here, the authors addressed the different application of video scene segmentation.
We collected all the SBD-related datasets. However, some TRECVID data appear not to exist anymore and/or cannot be tracked down. Despite the large size of the collection, several factors prevent it from being used for training. First, most of T2001 and T2002 should be removed due to their poor annotations. In addition, at least T2007 and the rest of T2001 should be removed as they are commonly used for evaluation [12, 13, 22]. This leaves at most 15,163 sharp and 7,274 gradual annotations from TRECVID and Baraldi et al. Experiments show this is not sufficient to train an accurate SBD CNN.
II-C Spatio-temporal analysis using CNN
Our solution analyzes both spatial and temporal information through a CNN. Hence, our network is related to the literature on video classification. Karpathy et al. proposed multiple approaches for extending the connectivity of CNNs to take advantage of spatio-temporal information. Results show that CNNs can generate a strong improvement over hand-crafted features. However, the multiple-frame models showed only a modest improvement compared to the single-frame model. Next, Simonyan et al. proposed a two-stream CNN for video classification. One network analyzes the spatial information while the second analyzes the optical flow field. Their approach generates a significant improvement over the single-frame model of Karpathy et al.
Tran et al. presented the first single-stream CNN that incorporates both spatial and temporal information at once. Their approach takes multiple frames as input and examines them with 3D spatio-temporal convolutional filters. They address the problem of activity recognition and evaluate on the UCF101 dataset. They outperformed all previous work, including [34, 35]. In addition, their technique is fast as it does not require optical flow estimation.
Our solution is a full Shot Boundary Detection (SBD) system consisting of a CNN-based classification step, a merging step and a post-processing step. At the core of our CNN classification is a spatio-temporal architecture inspired by Tran et al. Unlike Tran et al., however, our architecture uses batch normalization. Furthermore, our solution contains a component for generating very large, well-annotated datasets for training our SBD. Results show that all components of our solution, including dataset generation and our full SBD system, play an important role in outperforming the state of the art, both in detection accuracy and processing speed.
III Our Approach
III-A Algorithm Design
We present a technique for automatic detection and classification of shot boundaries. We name our technique DeepSBD. A video is divided into segments of frames. Each segment is assigned one of three labels: 1) sharp transition, 2) gradual transition or 3) no transition. We use segments of length 16, with an overlap of 8. Each segment is fed to a deep 3D-CNN that analyzes both spatial and temporal information. Our network, C3D_sbd, is inspired by C3D and is trained from scratch for shot boundary detection. The last feature layer is fed to an SVM classifier. This gives the first labeling estimate. Consecutive segments with the same labeling are merged and the result is passed to a post-processing step. This step reduces false positives that contain little motion. For such segments, we estimate the color histograms of the first and last frames. We measure the Bhattacharyya distance between these histograms. If the distance is small, we declare the segment as no-transition. We use an OpenCV implementation for both the color histogram and the Bhattacharyya distance, which is very fast.
Figure 3 shows an overview of our detection system. Our network, C3D_sbd, consists of five 3D convolutional layers (see Table I). All convolutional layers are followed by Rectified Linear Unit (ReLU) and pooling layers. The first two convolutional layers are followed by Local Response Normalization (LRN). Two fully connected layers exist, fc6 and fc7, each containing 2049 neurons. The last fully connected layer, fc8, contains only 3 neurons, one for each class (sharp, gradual and no transition). In comparison to C3D, C3D_sbd uses batch normalization after the first two convolutional layers.
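The segmentation, merging and histogram check described above can be sketched as follows. This is a minimal illustration and not the authors' implementation: it uses NumPy instead of OpenCV, the function names, the 8-bin joint RGB histogram and the `threshold` value are all assumptions, and the check is applied to every detected transition rather than only to low-motion ones. The distance is written in the form sqrt(1 - sum(sqrt(h1 * h2))) for normalized histograms.

```python
import numpy as np

def segment_starts(num_frames, length=16, overlap=8):
    """Start indices of fixed-length segments; neighbours share `overlap` frames."""
    stride = length - overlap
    return list(range(0, num_frames - length + 1, stride))

def merge_segments(starts, labels, length=16):
    """Merge consecutive segments with the same label into (first, last, label)."""
    merged = []
    for start, label in zip(starts, labels):
        last = start + length - 1
        if merged and merged[-1][2] == label:
            merged[-1] = (merged[-1][0], last, label)
        else:
            merged.append((start, last, label))
    return merged

def color_histogram(frame, bins=8):
    """Normalized joint RGB histogram of an HxWx3 uint8 frame (bin count is assumed)."""
    pixels = frame.reshape(-1, 3).astype(np.float64)
    hist, _ = np.histogramdd(pixels, bins=(bins, bins, bins), range=((0, 256),) * 3)
    return hist.ravel() / hist.sum()

def bhattacharyya_distance(h1, h2):
    """Distance between two normalized histograms: 0 if identical, 1 if disjoint."""
    coefficient = np.sum(np.sqrt(h1 * h2))
    return float(np.sqrt(max(0.0, 1.0 - coefficient)))

def suppress_static_detections(merged, frames, threshold=0.1):
    """Relabel a detected transition as "none" when its first and last frames
    have nearly identical color histograms (threshold is a hypothetical value)."""
    cleaned = []
    for first, last, label in merged:
        if label != "none":
            distance = bhattacharyya_distance(
                color_histogram(frames[first]), color_histogram(frames[last]))
            if distance < threshold:
                label = "none"
        cleaned.append((first, last, label))
    return cleaned
```

With a 40-frame video, `segment_starts(40)` yields segments starting at frames 0, 8, 16 and 24, each covering 16 frames.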
III-B Dataset Generation
Training an SBD CNN requires a large and well-annotated dataset. We present two datasets, SBD_Syn (Table II) and SBD_BT (Table III). SBD_Syn is generated synthetically, while SBD_BT is generated through bootstrapping in a way that improves the detector's precision. Figure 4 shows the process of generating both datasets. We first use SBD_Syn with T2005 and Baraldi et al. to train our solution from scratch. We run this solution on data from T2007t/d, T2008 and T2009. Due to the massive size of these datasets, however, we only examine segments originally annotated as any form of transition. Note that the original annotations here do not distinguish between sharp and gradual. We closely examine segments detected as graduals. We manually filter them into three classes: gradual, sharp and no transition. The no-transition segments represent complicated hard-negative cases such as illumination variation and fast motion (see Figure 5). Finally, we train a final solution from scratch using both SBD_Syn and SBD_BT. We optionally use T2005 and Baraldi et al. to further improve performance. Results show that SBD_BT has a great impact in reducing false detections and improving the overall performance. The supplementary material shows images from SBD_Syn and SBD_BT in Figure 1 and Figure 2.
SBD_Syn: Table II shows the content of SBD_Syn. Images from this dataset are shown in the supplementary material (Figure 1). The dataset is generated synthetically through image compositing models. A transition is modeled as a linear combination between the underlying shots:
$$I_t(\mathbf{p}) = \alpha_t(\mathbf{p})\, B_t(\mathbf{p}) + \left(1 - \alpha_t(\mathbf{p})\right) F_t(\mathbf{p}) \qquad (1)$$
Here, $I_t$ denotes the observed frame at time $t$, while $B_t$ and $F_t$ are the content from the previous and next shots respectively. $\alpha_t$ is the mixing parameter between both shots while $\mathbf{p}$ denotes image pixels. The values and distribution of $\alpha$ define the type of shot transition. If no transition exists, then $\alpha$ stays constant over time. A sharp transition, however, has a sudden temporal change, with $\alpha$ switching between 1 and 0 over one frame. For gradual transitions, $\alpha$ changes over time from 1 to 0. This change occurs over a set of $N$ frames, where $N$ is the transition duration and $t$ is the frame index counted from the last frame of the previous shot. Here, the in-between values of $\alpha$ are non-binary. This generates the dissolve nature of most gradual transitions (Figure 1). For wipes, $\alpha$ is spatially varying as well as temporally varying.
To generate SBD_Syn we need to define $B$, $F$ and $\alpha$ in Eq. 1. $B$ and $F$ must not contain any shot transitions. We use T2007t/d, T2008, T2009 and their annotations to find such frames. We sample $B$ and $F$ in a way that ensures a large offset from the nearest transition. Sharp transitions are generated by applying Eq. 1 with binary values of $\alpha$. Gradual transition generation, however, is more complex. For SBD_Syn we focus on dissolve gradual generation. We randomly select the transition duration $N$ within a fixed range. We also randomly select the transition start and end frames for both $B$ and $F$. We draw $N$ samples of $\alpha$, where $\alpha$ is modeled with a uniform distribution. We sort all values in descending order and apply Eq. 1 for each of the $N$ frames.
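The dissolve and sharp generation steps can be sketched with NumPy as below. This is an illustrative sketch under stated assumptions, not the authors' generator: it writes the linear blend as `alpha * previous + (1 - alpha) * next`, so that sorting the uniform samples in descending order moves the clip from the previous shot to the next one. The function names, the array layout `(N, H, W, C)` and the `cut` parameter are hypothetical.

```python
import numpy as np

def synth_dissolve(prev_frames, next_frames, rng):
    """Blend two transition-free clips of shape (N, H, W, C) into a dissolve.

    Draws one uniform mixing value per frame, sorts the values in descending
    order, and applies the linear blend frame by frame.
    """
    n = prev_frames.shape[0]
    alphas = np.sort(rng.uniform(0.0, 1.0, size=n))[::-1]
    a = alphas[:, None, None, None]  # broadcast over height, width, channels
    return a * prev_frames + (1.0 - a) * next_frames, alphas

def synth_sharp(prev_frames, next_frames, cut):
    """Sharp transition: mixing value 1 before frame `cut`, 0 from `cut` onwards."""
    n = prev_frames.shape[0]
    alphas = (np.arange(n) < cut).astype(np.float64)
    a = alphas[:, None, None, None]
    return a * prev_frames + (1.0 - a) * next_frames, alphas
```

A call such as `synth_dissolve(prev, nxt, np.random.default_rng(0))` returns the blended clip together with the mixing values used, which makes it easy to keep the annotation alongside the generated segment.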
We train C3D_sbd using balanced data for sharp, gradual and no-transition. We experimented with different data sizes. We found 40,000 segments for each class generate good results. We also train the SVM for sharp and gradual using 110,000 segments for each. For CNN, we use a step learning policy. Learning rate starts with a value of 0.0001 and is reduced gradually by a factor of 10 every two epochs. We use a batch size of 20, and train the model for 6 epochs. That is two epochs for each learning rate of 1e-4, 1e-5, and 1e-6. The momentum value is 0.9. All these values were set empirically to optimize performance. We also found empirically that SVM works better with features from fc8 as opposed to fc6 and fc7.
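The step learning policy described above can be written out in a few lines. This is only a sketch of the schedule arithmetic, with hypothetical function and parameter names; it does not reproduce the training framework or solver configuration used by the authors.

```python
def step_learning_rate(epoch, base_lr=1e-4, gamma=0.1, step_epochs=2):
    """Learning rate for a 0-indexed epoch: divided by 10 every `step_epochs` epochs."""
    return base_lr * gamma ** (epoch // step_epochs)

def iterations_per_epoch(num_segments, batch_size=20):
    """Number of mini-batches needed to visit every training segment once."""
    return -(-num_segments // batch_size)  # ceiling division

# Six epochs give two epochs each at roughly 1e-4, 1e-5 and 1e-6.
schedule = [step_learning_rate(epoch) for epoch in range(6)]
```

Assuming the balanced set of 40,000 segments for each of the three classes, one epoch would correspond to `iterations_per_epoch(120000)` mini-batches of 20 segments.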
We performed experiments on real data as well as on synthetically generated data. We examined 4,683,552 frames, most of which are real. Our work is the largest SBD evaluation to date for one algorithm. We assess performance quantitatively using precision (P), recall (R) and F-score (F). Here, we use the standard TRECVID evaluation metric, where a transition is detected if it overlaps with the annotations by at least one frame. We report the per-transition performances. During comparison we highlight the best performing technique in bold. To account for possible mis-annotations and system error in such a large experiment, we claim a technique is superior only if its improvement in P, R or F over the second-best technique exceeds a fixed margin. Techniques whose difference falls within that margin are claimed as competitive. We train two models, both using our datasets SBD_Syn and SBD_BT. One of them additionally uses a small amount of real data from T2005 and Baraldi et al., denoted by r, at most 7% of the total training data. Both models are competitive with each other. We report results with r in the paper and report the other model in the supplementary material.
We compare against the latest techniques (Lu et al., Priya et al., Apostolidis et al. and Baraldi et al.) as well as the best performers in the 7 years of the TRECVID challenge (Yuan et al. and Liu et al.). These techniques show the compromise between detection accuracy and processing speed commonly present in SBD. Lu et al. is the fastest of all, but generates conservative performance. Priya et al., Liu et al. and Yuan et al. generate better performance, but at the cost of heavy optical flow computation. Our results show that DeepSBD optimizes both detection accuracy and processing speed over all current techniques. That is, we achieve better gradual detection, competitive performance for sharp transitions and a significant improvement in wipe detection. In addition, we are up to 11 times faster than the state of the art. More detailed results are reported in the supplementary material.
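The overlap-based scoring can be illustrated with a short sketch. The one-frame overlap rule and the P, R and F definitions follow the description above; the greedy one-to-one matching between detections and annotations, the inclusive `(first_frame, last_frame)` interval convention and the function names are assumptions made for this example and may differ from the official TRECVID scoring tool.

```python
def overlaps(a, b):
    """True if inclusive frame intervals a and b share at least one frame."""
    return a[0] <= b[1] and b[0] <= a[1]

def evaluate(detections, annotations):
    """Per-transition precision, recall and F-score.

    Each annotation can be matched by at most one detection (assumed rule).
    """
    matched = set()
    true_positives = 0
    for detection in detections:
        for index, annotation in enumerate(annotations):
            if index not in matched and overlaps(detection, annotation):
                matched.add(index)
                true_positives += 1
                break
    false_positives = len(detections) - true_positives
    false_negatives = len(annotations) - true_positives
    found = true_positives + false_positives
    actual = true_positives + false_negatives
    precision = true_positives / found if found else 0.0
    recall = true_positives / actual if actual else 0.0
    total = precision + recall
    f_score = 2 * precision * recall / total if total else 0.0
    return precision, recall, f_score
```

For example, three detections of which two overlap one of four annotated transitions give P = 2/3, R = 1/2 and F = 4/7.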
IV-A Real Sequences
We evaluated our technique on all seven TRECVID releases, from 2001 to 2007. They have a total of 3,831,648 frames, with 8,545 gradual and 14,602 sharp transitions. No test data was included in the training. Table V shows the performance evaluation on 6 sequences commonly used in Lu et al. and Priya et al. The sequences are from T2001a (see Table IV) and present challenging videos from outer space. The videos include cases such as global illumination variation, smoke, fire and fast non-rigid motion. We outperform Lu et al. in all sequences for both gradual and sharp transitions. Furthermore, we outperform Priya et al. in the vast majority of sequences in both transition types.
|T2001a||BOR10_001, BOR10_002, NAD57, NAD58, anni001, anni005, anni006, anni007, anni00,|
|T2001b||BOR03, BOR08, BOR10, BOR12, BOR17|
Table V: Comparison with Lu et al. and Priya et al. on six T2001a sequences.
Table VI: Comparison with Priya et al. on T2007 for test sets of 9 to 17 sequences.
Table VI compares our technique against Priya et al. on T2007. Note that Priya et al. used a slightly different approach for evaluation than the one recommended by TRECVID. TRECVID recommends estimating the average performance per transition. However, Priya et al. estimated the average performance per sequence. Furthermore, Priya et al. tested on 17 sequences, 7 of which were included in their training set. This biases the results towards Priya et al. Hence, for a fair comparison these 7 sequences should be removed from the 17 test sequences and the comparison should be done on at most 10 sequences. To illustrate this point, we examined our technique with different sizes of the test dataset. Each column of Table VI shows the performance with a different size of the test data. With 10 test sequences, our technique outperforms Priya et al. significantly in gradual transitions (0.88 vs. 0.76 F-score) and generates competitive results for sharp transitions. Furthermore, we still outperform Priya et al. even with a test set of 14 sequences. Here, however, at least 4 sequences are included in the training of Priya et al. and hence results are biased towards Priya et al. Including these videos in our training is expected to improve performance even further. The spatio-temporal aspect of our solution allows us to generate these high detection accuracy results without explicitly estimating optical flow. Our experimental results showed that relying only on the spatial information generates very poor performance.
Table VII evaluates DeepSBD on T2004, 2005, 2006 and 2007. To test on 2005, we removed it from our training. We compare against the best TRECVID performers as well as Lu et al. We significantly outperform Lu et al. on T2007. Furthermore, we outperform the best TRECVID performers, Liu et al. and Yuan et al., on all four datasets. Table VIII evaluates DeepSBD on the remaining TRECVID datasets. T2001b and 2002 annotations contain significant overlap between sharp and gradual transitions. Hence, for them we show the overall combined transition performance. Furthermore, T2003 is missing 4 videos and hence we could not compare against the reported TRECVID performance. In all sequences we generate good performance. T2001b and 2002 sequences contain strong noise and jitter. Yet, our technique was robust enough to handle such artifacts. Figure 6 (a) shows the precision-recall curves for DeepSBD on all real TRECVID sequences. Table XVI shows the combined F-score for the RAI dataset. Here, we compare against the techniques of Apostolidis et al. and Baraldi et al. Results show that we significantly outperform both techniques. The supplementary material (Table II-XVI) shows the per-sequence results for each of the TRECVID and RAI datasets examined by our technique. This includes further statistics such as true positives (TP), false positives (FP) and false negatives (FN).
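The difference between averaging per transition and averaging per sequence can be made concrete with a small sketch. The counts below are invented purely for illustration and are not results from any dataset; the function names are hypothetical.

```python
def f_score(tp, fp, fn):
    """F-score from true positive, false positive and false negative counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    total = precision + recall
    return 2 * precision * recall / total if total else 0.0

def per_transition_f(counts):
    """Pool all transitions across sequences, then compute one F-score."""
    tp = sum(c[0] for c in counts)
    fp = sum(c[1] for c in counts)
    fn = sum(c[2] for c in counts)
    return f_score(tp, fp, fn)

def per_sequence_f(counts):
    """Compute an F-score per sequence, then average the scores."""
    return sum(f_score(*c) for c in counts) / len(counts)

# Hypothetical (tp, fp, fn) counts: one long, easy sequence and one short, hard one.
counts = [(90, 10, 10), (1, 4, 4)]
pooled = per_transition_f(counts)    # about 0.867, dominated by the long sequence
averaged = per_sequence_f(counts)    # 0.55, both sequences weigh equally
```

The two protocols can therefore rank the same detector quite differently when sequences differ in length or difficulty, which is why the comparison needs to use one protocol consistently.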
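Table VII: Comparison with Lu et al. and the best TRECVID performers on T2004, T2005, T2006 and T2007.
RAI dataset comparison, with columns: |Method in ||Method in ||Ours|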
Table X examines different test configurations for DeepSBD. SVM on fc8 generates better results than on fc6. The post-processing (pp) step improves the performance, especially for T2007. The best performance is obtained with fc8+svm+pp. Figure 7 shows failure cases in gradual transition detection. Very long transitions can be missed and counted as false negatives (FN). Here, not enough temporal difference is captured over our 16-frame window. FN can also be generated when both shots have similar texture and color. False positives are largely generated by computer graphics content. Such content has a gradual-like effect. However, it is not semantically classified as a shot transition.
IV-B The importance of our datasets
Table XI shows the significance and importance of our datasets SBD_Syn and SBD_BT in generating high accuracy detections. We evaluate DeepSBD on T2007 with six different training sets: 1) R_3-5, 2) R_3-6, 3) R_3-6 + BT, 4) S + r, 5) S + r + BT and 6) S + BT. S and BT are short for our datasets SBD_Syn and SBD_BT. R_3-6 represents TRECVID real videos and annotations from 2003 to 2006, and R_3-5 those from 2003 to 2005. r is T2005 and Baraldi et al. Results show that training with R_3-5 generates poor performance. In addition, it limits us to testing on just 3 datasets (T2001a, T2006 and T2007). Adding T2006 to training improves performance but limits our testing further to 2 datasets (T2001a and T2007). Adding our bootstrapping data SBD_BT (BT) improves precision and performance significantly. This shows the high quality and importance of SBD_BT. The best performance, however, is generated when both our datasets SBD_Syn and SBD_BT are used together with r for training. In addition to the highest performance, this option allows us to test on all TRECVID videos except T2005. Removing r from the training generates the second best performance. This, however, allows us to test on all TRECVID videos, including T2005. We performed this experiment on several test sets and found that S + r + BT and S + BT are always the top performers and competitive with each other (see supplementary material, Table 1). This shows the significance of our datasets and their generation process (Section III-B).
|Training set||Gradual P||Gradual R||Gradual F||Sharp P||Sharp R||Sharp F|
|R_3-6 + BT||0.755||0.705||0.729||0.961||0.961||0.961|
|S + r||0.722||0.63||0.673||0.979||0.955||0.967|
|S + r + BT||0.799||0.753||0.776||0.973||0.969||0.971|
|S + BT||0.779||0.714||0.745||0.969||0.966||0.968|
IV-C Controlled Experiments
We generated a synthetic test set. Our dataset contains 53,324 segments, divided equally between gradual, sharp, wipe and no-transition. Each segment is 16 frames long. We generated gradual and sharp transitions using image compositing as we did for SBD_Syn (see Eq. 1). Here, we constrain the shots to come from two different UCF101 videos. We present the first large wipes dataset, containing 1.1 million wipe frames, part of which is used for testing. They are also generated using Eq. 1. Here, however, the opacity values have more complicated spatio-temporal patterns than in sharp and gradual transitions. Figure 8 shows some of the 196 mattes we used. The supplementary material, Figure 3, shows frames from our wipes dataset. We call our synthetic UCF dataset UCF101_SBD.
We train the model using SBD_Syn, SBD_BT and the synthetic wipes. This model generates 4 classes. Table XII evaluates DeepSBD on UCF101_SBD. We generate high performance for all classes, including wipes. Performance is higher than that previously reported on the TRECVID sequences. This could be due to the highly accurate annotations of UCF101_SBD. Figure 6 (b) compares our wipe detector against the state of the art of Liu et al. We evaluate Liu et al. using two strategies. The first examines all frames of UCF101_SBD. The second, ‘+’, examines only frames not detected as gradual or sharp transitions by DeepSBD. Our technique outperforms both approaches significantly.
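A wipe differs from a dissolve only in that the opacity varies across the image as well as over time. The sketch below builds one simple matte, a vertical edge sweeping from left to right, and applies the same linear blend used for the other transitions. The sweep pattern, the function names and the blend orientation (opacity 1 shows the previous shot) are assumptions for illustration; the 196 mattes used in the paper are not reproduced here.

```python
import numpy as np

def horizontal_wipe_matte(num_frames, height, width):
    """Binary opacity volume of shape (T, H, W) for a left-to-right wipe.

    Opacity 1 keeps the previous shot, opacity 0 reveals the next shot.
    """
    columns = np.arange(width)[None, None, :]
    # The wipe edge moves linearly from column 0 to column `width`.
    edge = np.linspace(0, width, num_frames)[:, None, None]
    matte = (columns >= edge).astype(np.float64)
    return np.broadcast_to(matte, (num_frames, height, width)).copy()

def apply_matte(prev_frames, next_frames, matte):
    """Blend clips of shape (T, H, W, C) with a per-pixel, per-frame opacity."""
    a = matte[..., None]  # broadcast the matte over the channel axis
    return a * prev_frames + (1.0 - a) * next_frames
```

The first frame of the result shows only the previous shot and the last frame only the next shot, with the revealed region growing monotonically in between.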
IV-D Processing Speed
|Technique||Real-time speed-up factor|
|Liu et al. [22, 1]||3.24|
|Priya et al.||1.76|
|Yuan et al.||2.43|
We examined a TRECVID video of duration 4,096 seconds containing 102,400 frames. We ran the test phase of DeepSBD with different batch sizes as input. The GPU performs n iterations until all 102,400 frames are processed. The smaller the batch size, the more iterations are required, and hence the more time is needed to process all frames; less memory is required, however. Experiments show that the processing speed gain from a batch size of 10 to a batch size of 100 is not significant: the real-time speed-up factor stays between 16 and 19.3. We use a Titan X, a GPU commonly used for deep learning applications. Table XIII compares the processing speed of different SBD techniques. In comparison with the best performing optical-flow based techniques, we are 11 times faster than Priya et al., 6 times faster than Liu et al. and 9.65 times faster than Yuan et al. The supplementary material shows more analysis of the processing speed in Figure 4-5 (Section II).
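The arithmetic behind the real-time speed-up factor can be checked with a few lines. This is a worked example using only numbers quoted in this section; the function names are hypothetical, and the speed-up factor is taken to mean video duration divided by processing time.

```python
def frame_rate(num_frames, duration_seconds):
    """Frames per second of the test video."""
    return num_frames / duration_seconds

def processing_seconds(duration_seconds, speedup_factor):
    """Time needed to process a video at a given real-time speed-up factor."""
    return duration_seconds / speedup_factor

def relative_speed(our_factor, other_factor):
    """How many times faster one technique is than another."""
    return our_factor / other_factor

fps = frame_rate(102400, 4096)                    # 25 frames per second
best_case = processing_seconds(4096, 19.3)        # roughly 212 seconds
versus_priya = relative_speed(19.3, 1.76)         # roughly 11
versus_liu = relative_speed(19.3, 3.24)           # roughly 6
```

At the upper end of the measured range, the 4,096-second video is processed in about three and a half minutes.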
IV-E Deep Analysis on Network Responses
We randomly selected two segments (16 frames) from UCF101 and synthetically generated a sharp and gradual transition using Eq. 1. We treated one of the two sequences as no-transition. We examined all segments using DeepSBD. Figure 9 shows the heat map of some Conv5 filter responses for each transition type. The filters are stacked next to each other, in blocks. The green grid shows filters’ borders. Time is the y-axis and space is the x-axis. Vertical space is averaged over the horizontal space. Sharp transitions have abrupt responses in the time axis in form of bright horizontal lines. Gradual transitions have blurred responses in the time axis. No transitions do not show a specific response pattern. The patterns are consistent on several other segments. The supplementary material shows more of such results in Figure 6 (Section III).
We presented the first CNN technique for shot boundary detection. Current techniques compromise between detection accuracy and processing speed and use hand-crafted features. We exploit big data to optimize both accuracy and speed. This is important as SBD is a common pre-processing step for video manipulation. We present two large datasets containing 3.57 million frames. One set is generated synthetically while the other is carefully annotated through bootstrapping. We outperform state of the art gradual transition detections, generate competitive performance in sharp transitions and produce significant improvement in wipes detections. Our approach is up to 11 times faster than the state of the art. Future work can examine computer graphics content more closely. We will release our datasets and code to encourage future research.
-  A. F. Smeaton, P. Over, and A. R. Doherty, “Video shot boundary detection: Seven years of trecvid activity,” Computer Vision and Image Understanding (CVIU), vol. 114, no. 4, pp. 411–418, 2010.
-  J. Fan, D. K. Y. Yau, W. G. Aref, and A. Rezgui, “Adaptive motion-compensated video coding scheme towards content-based bit rate allocation,” Journal of Electronic Imaging, vol. 9, no. 4, pp. 521–533, 2000.
-  Z. Wang, D. Liu, S. Chang, Q. Ling, and Y. Yang, “D3: Deep dual-domain based fast restoration of jpeg-compressed images,” in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016, pp. 2764–2772.
-  J. Kim, J. K. Lee, and K. M. Lee, “Accurate image super-resolution using very deep convolutional networks,” in CVPR, 2016, pp. 1646–1654.
-  J. Xie, R. Girshick, and A. Farhadi, “Deep3d: Fully automatic 2d-to-3d video conversion with deep convolutional neural networks,” in IEEE International Conference on Computer Vision (ICCV), 2016, pp. 842–857.
-  K. Calagari, M. Elgharib, P. Didyk, A. Kaspar, W. Matusik, and M. Hefeeda, “Gradient-based 2d-to-3d conversion for soccer videos,” in ACM Multimedia, 2015, pp. 331–340.
-  S. Bae, M. A. Elgharib, M. Hefeeda, and W. Matusik, “Efficient and scalable view generation from a single image using fully convolutional networks,” CoRR, vol. abs/1705.03737, 2017. [Online]. Available: http://arxiv.org/abs/1705.03737
-  K. Zhang, W.-L. Chao, F. Sha, and K. Grauman, “Video summarization with long short-term memory,” in CVPR, 2016, pp. 766–782.
-  ——, “Summary transfer: Exemplar-based subset selection for video summarization.” in CVPR, 2016, pp. 1059–1067.
-  Y. Song, M. Redi, J. Vallmitjana, and A. Jaimes, “To click or not to click: Automatic selection of beautiful thumbnails from videos,” in ACM International Conference on Information and Knowledge Management (CIKM), 2016, pp. 659–668.
-  K. Templin, P. Didyk, K. Myszkowski, M. M. Hefeeda, H.-P. Seidel, and W. Matusik, “Modeling and optimizing eye vergence response to stereoscopic cuts,” ACM Transactions on Graphics (proceedings of SIGGRAPH), vol. 33, no. 4, 2014.
-  L. Priya and D. S., “Walsh hadamard transform kernel-based feature vector for shot boundary detection,” IEEE Transactions on Image Processing (TIP), vol. 23, no. 12, pp. 5187–5197, 2014.
-  Z.-M. Lu and Y. Shi, “Fast video shot boundary detection based on svd and pattern matching,” TIP, vol. 22, no. 12, pp. 5136–5145, 2013.
-  P. P. Mohanta, S. K. Saha, and B. Chanda, “A model-based shot boundary detection technique using frame transition parameters,” IEEE Transactions on Multimedia (TMM), vol. 14, no. 1, pp. 223–233, 2012.
-  D. Adjeroh, M. C. Lee, N. Banda, and U. Kandaswamy, “Adaptive edge-oriented shot boundary detection,” EURASIP Journal on Image and Video Processing, vol. 2009, no. 1, 2009.
-  Z. Cernekova, I. Pitas, and C. Nikou, “Information theory-based shot cut/fade detection and video summarization,” IEEE Transactions on Circuits and Systems for Video Technology (TCSVT), vol. 16, no. 1, pp. 82–91, 2006.
-  J. Lankinen and J.-K. Kämäräinen, “Video shot boundary detection using visual bag-of-words,” in International Conference on Computer Vision Theory and Applications (VISAPP), 2013.
-  D. Lelescu and D. Schonfeld, “Statistical sequential analysis for real-time video scene change detection on compressed multimedia bitstream,” IEEE Transactions on Multimedia, vol. 5, no. 1, pp. 106–117, 2003.
-  C. Zhang and W. Wang, “A robust and efficient shot boundary detection approach based on fisher criterion,” in ACM Multimedia, 2012, pp. 701–704.
-  D. M. Thounaojam, T. Khelchandra, K. M. Singh, and S. Roy, “A genetic algorithm and fuzzy logic approach for video shot boundary detection,” Computational intelligence and neuroscience, vol. 2016, 2016.
-  J. Yuan, W. Zheng, L. Ding, D. Wang, Z. Tong, H. Wang, J. L. J. Wu, F. Lin, and B. Zhang, “Tsinghua university at trecvid 2004: Shot boundary detection and high-level feature extraction,” in TRECVID Workshop, 2004.
-  Z. Liu, E. Zavesky, D. Gibbon, B. Shahraray, and P. Haffner, “At&t research at trecvid 2007,” in TRECVID Workshop, 2007.
-  M. W. Tao, J. Bai, P. Kohli, and S. Paris, “Simpleflow: A non-iterative, sublinear optical flow algorithm,” Computer Graphics Forum (Eurographics), vol. 31, no. 2, 2012.
-  A. Dosovitskiy, P. Fischer, E. Ilg, P. Häusser, C. Hazirbas, V. Golkov, P. v. d. Smagt, D. Cremers, and T. Brox, “Flownet: Learning optical flow with convolutional networks,” in ICCV, 2015, pp. 2758–2766.
-  S. Baker, D. Scharstein, J. Lewis, S. Roth, M. J. Black, and R. Szeliski, “A database and evaluation methodology for optical flow,” International Journal of Computer Vision (IJCV), vol. 92, no. 1, pp. 1–31, 2011.
-  D. Tran, L. Bourdev, R. Fergus, L. Torresani, and M. Paluri, “Learning spatiotemporal features with 3d convolutional networks,” in ICCV, 2015, pp. 4489–4497.
-  L. Baraldi, C. Grana, and R. Cucchiara, “A deep siamese network for scene detection in broadcast videos,” in ACM Multimedia, 2015, pp. 1199–1202.
-  E. Apostolidis and V. Mezaris, “Fast shot segmentation combining global and local visual descriptors,” in International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2014, pp. 6583–6587.
-  L. Baraldi, C. Grana, and R. Cucchiara, “Shot and scene detection via hierarchical clustering for re-using broadcast video,” in International Conference on Computer Analysis of Images and Patterns (CAIP), 2015, pp. 1–11.
-  S. Lian, “Automatic video temporal segmentation based on multiple features,” Soft Computing, vol. 15, no. 3, pp. 469–482, 2011.
-  Y. Kawai, H. Sumiyoshi, and N. Yagi, “Shot boundary detection at trecvid 2007,” in TRECVID Workshop, 2007.
-  J. Yuan, H. Wang, L. Xiao, D. Wang, D. Ding, Y. Zuo, Z. Tong, X. Liu, S. Xu, W. Zheng, X. Li, Z. Si, J. Li, F. Lin, and B. Zhang, “Tsinghua university at trecvid 2005,” in TRECVID Workshop, 2005.
-  National Institute of Standards and Technology, “http://trecvid.nist.gov/trecvid.data.html,” https://www.nist.gov/, 2017.
-  A. Karpathy, G. Toderici, S. Shetty, T. Leung, R. Sukthankar, and L. Fei-Fei, “Large-scale video classification with convolutional neural networks,” in CVPR, 2014.
-  K. Simonyan and A. Zisserman, “Two-stream convolutional networks for action recognition in videos,” in Advances in Neural Information Processing Systems (NIPS), 2014, pp. 568–576.
-  A. Levin and Y. Weiss, “User assisted separation of reflections from a single image using a sparsity prior,” IEEE Transactions on Pattern Analysis and Machine Intelligence (PAMI), vol. 29, pp. 1647–1654, 2007.
-  R. T. Network, “The rai scuola video archives,” http://www.scuola.rai.it/, 2015.
-  K. Soomro, A. R. Zamir, and M. Shah, “Ucf101: A dataset of 101 human actions classes from videos in the wild,” CoRR, vol. abs/1212.0402, 2012. [Online]. Available: http://dblp.uni-trier.de/db/journals/corr/corr1212.html#abs-1212-0402
Supplementary Material: Large-scale, Fast and Accurate Shot Boundary Detection through Spatio-temporal Convolutional Neural Networks
Appendix A Our data-sets
Figure 1 shows samples from the gradual transitions class of our dataset (SBD_Syn). Our data is synthetically generated through image compositing. It is diverse, containing a wide variety of colors, textures, objects, motion and so on. Figure 2 shows hard negative samples from our bootstrapping data (SBD_BT). The samples contain challenging cases that commonly confuse gradual transition detectors, e.g. fast motion, fast zoom-in, illumination changes, object occlusion and strong lighting. Figure 3 shows 10 sequences from our synthetically generated wipes dataset. The sequences illustrate the variety of alpha mattes used to generate wipes.
Tab. I shows the importance of our synthetic SBD_Syn and bootstrapping SBD_BT datasets in generating high-accuracy detections. We evaluate our technique, DeepSBD, on different datasets with six different training sets: 1) R_3-5, 2) R_3-6, 3) R_3-6 + BT, 4) S + r, 5) S + r + BT and 6) S + BT. S and BT are short for our datasets SBD_Syn and SBD_BT. R_3-6 represents TRECVID real videos and annotations from 2003 to 2006 (R_3-5 likewise covers 2003 to 2005), and r is T2005 and Baraldi. Results show that training with R_3-5 generates poor performance. In addition, it limits us to testing on just 3 datasets (T2001a, T2006 and T2007). Adding T2006 to training improves performance but limits our testing further to 2 datasets (T2001a and T2007). Adding our bootstrapping data SBD_BT (BT) improves precision and overall performance significantly, which shows the high quality of SBD_BT. The best performance, however, is obtained when both our datasets SBD_Syn and SBD_BT are used for training together with r. In addition to the highest performance, this option allows us to test on all TRECVID videos except T2005. Removing r from the training still generates competitive performance and allows us to test on all TRECVID videos, including T2005. We performed this experiment on several test sets and found that S + r + BT and S + BT are always the top two configurations and are competitive with each other. This shows the significance of our datasets.
Tabs. II-XVI show detailed per-video results for the different testing sets. For each testing dataset, we report the results using two different training sets (S+r+BT and S+BT). We show the number of transitions (#T), true positives (TP), false positives (FP), false negatives (FN), precision (P), recall (R) and F-measure (F).
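The relation between the reported counts and scores can be sketched as follows. This is a minimal illustration, assuming the standard definitions of precision, recall and F-measure (the harmonic mean of P and R); the function name `detection_scores` is hypothetical and is not part of the released code.

```python
def detection_scores(tp, fp, fn):
    """Return (precision, recall, f_measure) from detection counts.

    Assumes the standard definitions:
      P = TP / (TP + FP), R = TP / (TP + FN), F = 2PR / (P + R).
    Scores default to 0.0 when a denominator is zero.
    """
    precision = tp / (tp + fp) if (tp + fp) > 0 else 0.0
    recall = tp / (tp + fn) if (tp + fn) > 0 else 0.0
    denom = precision + recall
    f_measure = 2.0 * precision * recall / denom if denom > 0 else 0.0
    return precision, recall, f_measure


# Example with made-up counts (not taken from the paper's tables).
p, r, f = detection_scores(tp=90, fp=10, fn=30)
print(f"P={p:.3f} R={r:.3f} F={f:.3f}")
```

Under these definitions the number of ground-truth transitions is TP + FN, so a per-video row can be cross-checked against its #T column.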
Appendix B Processing speed
Figures 4-5 examine the processing speed (test phase) of our technique with different batch sizes as input. We ran our model on 6,394 segments. Each segment is 16 frames long, and hence our test set contains 102,304 frames. Figure 4 reports the total processing time in seconds while Figure 5 reports the real-time speed-up factor. Tab. XVII shows a detailed analysis of this experiment. For each batch size we ran our technique twice to ensure consistency. Results show that the gain in processing speed from a batch size of 10 to a batch size of 100 is not significant: the real-time speed-up factor stays between 16 and 19.3.
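The arithmetic behind these figures can be sketched as follows. This is a minimal illustration: the frame count follows directly from the text, while the 25 fps playback rate and the function names are assumptions, since the frame rate of the test videos is not stated here.

```python
def total_frames(num_segments, frames_per_segment=16):
    """Number of frames covered by non-overlapping fixed-length segments."""
    return num_segments * frames_per_segment


def realtime_speedup(num_frames, processing_seconds, fps=25.0):
    """Ratio of video duration to processing time.

    fps is an assumed playback rate; a value above 1.0 means
    the processing is faster than real time.
    """
    video_seconds = num_frames / fps
    return video_seconds / processing_seconds


frames = total_frames(6394)  # 6,394 segments of 16 frames each
print(frames)                # 102304, matching the figure quoted in the text
```

For example, under the assumed 25 fps, processing 1,000 frames in 2 seconds corresponds to a speed-up factor of 20.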
Appendix C Deep Analysis on Network Responses
Figure 6 visualizes the feature responses of our technique. We show the visualization for two different image sequences. For each sequence, we randomly selected two segments (16 frames each) from UCF101 and synthetically generated a sharp and a gradual transition using image compositing models. We treated one of the two segments as a no-transition example. We examined all segments using our technique, DeepSBD. Figure 6 shows the heat map of some Conv5 filter responses for each transition type. The filters are stacked next to each other, in blocks, and the green grid marks some of the filters' borders. Time is the y-axis and space is the x-axis. Vertical space is averaged over the horizontal space. Sharp transitions produce abrupt responses along the time axis in the form of bright horizontal lines. Gradual transitions produce blurred responses along the time axis. No-transition segments do not show a specific response pattern. The learned patterns of the three classes capture meaningful and discriminative information for the different types of shot transitions. Such information generates high detection accuracy, as shown throughout our results.
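One way to assemble such a heat map is sketched below. This is a minimal illustration, assuming the Conv5 responses are available as an array of shape (filters, time, height, width) and that the vertical spatial axis is the one averaged out; the function name `stack_filter_responses` and the array layout are hypothetical and may differ from how the figure was actually produced.

```python
import numpy as np


def stack_filter_responses(responses):
    """Turn responses of shape (F, T, H, W) into a (T, F * W) heat map.

    Rows are time steps (y-axis) and columns are horizontal positions
    (x-axis), with the F filters placed side by side as blocks.
    """
    responses = np.asarray(responses, dtype=np.float32)
    if responses.ndim != 4:
        raise ValueError("expected an array of shape (F, T, H, W)")
    # Average out the vertical spatial axis (assumed interpretation).
    collapsed = responses.mean(axis=2)               # shape (F, T, W)
    # Place the per-filter (T, W) maps next to each other.
    return np.concatenate(list(collapsed), axis=1)   # shape (T, F * W)
```

In this layout, a response that is confined to a single time step appears as a bright horizontal line spanning the filter blocks, which is the pattern described above for sharp transitions.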
|Batch size|Starting Time|End Time|# Seconds|Memory|# Iterations|Faster than real time by|