Lecture video indexing using boosted margin maximizing neural networks
This paper presents a novel approach for lecture video indexing using a boosted deep convolutional neural network system. The indexing is performed by matching high quality slide images, for which text is either known or extracted, to lower resolution video frames with possible noise, perspective distortion, and occlusions. We propose a deep neural network, integrated with a boosting framework and composed of two sub-networks targeting feature extraction and similarity determination, to perform the matching. The trained network is given as input a pair consisting of a slide image and a candidate video frame and produces the similarity between them. A boosting framework is integrated into the training process of our proposed network. Experimental results show that the proposed approach is much more capable of handling occlusion, spatial transformations, and other types of noise when compared with known approaches.
With recent advancements, more and more information is recorded in multimedia documents instead of traditional documents. In particular, classroom lectures are often recorded, shared and distributed as digital videos. Lectures generally span several hours and can produce large video files, which makes efficient access to content within lectures difficult. Automated video indexing allows quick retrieval of specific sections of interest without having to watch all the videos involved. Slides given by instructors can provide good indexes for lecture videos if matched correctly to the videos. Specifically, if the slides are linked to video segments, searching text in slides can be used to query topics of interest across a large set of lecture videos. This association can be performed by matching high quality slide images to video frames. We assume that the high quality slide images are either OCRed or that their text is available.
We use text image matching to link slide images to video frames. While the slide images and video frames share the same content, they may exhibit different perspective views, resolution, illumination, noise, and occlusions. Text image matching techniques are widely used in several areas besides matching slides to video frames for lecture video indexing. For example, when building digital libraries, numerous paper documents are scanned and archived as document images. It is important to remove duplicate pages using text image matching techniques so that only one copy of each document exists in the library; this facilitates accurate, non-redundant indexing and reduces storage costs. Postal automation is another good example of text image matching, where matching techniques can be used to identify the same letter scanned by different letter-sorting machines.
This paper makes two novel contributions. First, we propose a boosted deep neural network system to perform slide matching for lecture video indexing. The proposed approach is much more resilient to occlusion, spatial transformations, and other types of noise than existing methods. Second, the proposed neural network system is trained as a classification margin maximizer by integrating a boosting framework into the training process. We show how such a combination can efficiently train the proposed network system using just a small subset of samples drawn from a huge sample space.
During the past several decades, a variety of methods have been introduced to address the lecture video indexing problem. Yang et al. presented an automatic lecture video indexing algorithm based on video OCR technology. Tuna et al. presented an approach in which global frame difference metrics were used for lecture video segmentation, followed by OCR to extract textual metadata from slide frames for indexing. In both methods, the performance of segmentation and OCR strongly affects the quality of indexing.
Several approaches make use of global pixel-level differencing metrics (e.g. HOG) for capturing slide transitions and matching slides to video frames to provide index points. A drawback of such approaches is that global differences can fail to generate accurate video segmentation results when video frames are noisy or when slide frames change gradually; as a result, many redundant segments and indexes are created. Jeong et al. proposed a lecture video indexing method using Scale Invariant Feature Transform (SIFT) features and an adaptive threshold. In their work, SIFT features are used to measure the similarity of slide content, and an adaptive threshold selection algorithm is used to detect slide transitions. SIFT features handle various image transformations well but may fail when the two images come from different sources.
Text image matching is a key part of lecture video indexing in the proposed approach. Lopresti applied Optical Character Recognition (OCR) to each text image and used approximate string matching on the results for near-duplicate document detection. However, OCR usually requires high quality images and is both language-dependent and time-consuming. Vitaladevuni et al. employed interest points to detect near-duplicate document images. Since there are often hundreds or thousands of interest points per image, this approach may incur prohibitive computational overhead. Moreover, in document images with large amounts of noise, numerous incorrect point correspondences can be introduced and cause this type of algorithm to fail. Among CNN-based methods, a state-of-the-art approach to document image classification and retrieval uses features learned by deep convolutional neural networks (CNNs). However, that method focuses on classifying documents into different categories, and retrieval is performed by retrieving files from specified classes.
Several works related to image patch matching have been developed recently. Han et al. proposed a convolutional neural network, MatchNet, designed to unify feature and metric learning for patch-based matching. This system achieves state-of-the-art performance on a standard dataset for patch matching. However, the original MatchNet uses small image patches as input; in our problem, downsampling slide and lecture video images to such small patches loses too much textual detail. Moreover, the data sampling used in MatchNet causes slide images within one lecture video to be very similar to each other. Another popular method for image patch matching is the triplet network presented by Hoffer et al. In this method, three images are used as input: two images from the same class and one image from a different class. A metric that maximizes the distance between images from different classes is learned. The triplet network has performance comparable to the proposed approach when matching document images. However, training a triplet network is tricky: the selection of the three input images plays an important role, directly affecting both the convergence speed and the final quality of the trained network. Note that these methods are not designed for matching document images with identical text content.
Boosting is a general method for improving the accuracy of any given learning algorithm. It has achieved considerable success in different areas and applications, and its use in training neural networks, including deep neural networks, goes back many years. Bengio et al. investigated the performance of AdaBoost with neural networks and found that the bootstrap resampling contained in the boosting framework was essential to obtain a reduction in generalization error. Another line of work integrates boosting with CNNs by adding a boosting layer that incrementally selects discriminative neurons from lower layers, and shows that a CNN with such a boosting layer outperforms a traditional CNN. Instead of training multiple weak learners, a successful use of boosting within a single learner has also been introduced. Boosting can likewise help in learning deep representations of the underlying data: by combining features from different hierarchies of a CNN, a new manifold with a better feature representation is discovered. In the proposed approach, the boosting training framework as well as weighted sampling techniques are integrated into the training process. Instead of training multiple weak learners, a single network system is trained.
3 The proposed approach
In this paper, we employ a system based on a deep neural network with a boosting framework for image retrieval by measuring image similarity. The proposed network system is inspired by MatchNet. Three modifications are made to solve our problem: more layers to accommodate larger input images, an additional feature concatenation layer, and a different loss function. Furthermore, a boosting framework with a balanced data loader is integrated into the proposed system. Figure 1 shows the architecture of the proposed system.
3.1 A two-network system solution
To enable automatic feature extraction and similarity computation, the proposed system is composed of two neural networks that are connected to each other: a feature extraction network (FEN) and a similarity discrimination network (SDN). The FEN extracts features from input images and the SDN computes a similarity score between a pair of image features. The two networks are trained as a whole system in an end-to-end manner. To train the system, we use training data composed of a set of query images and a set of target images, where each query image has at least one matching target image. We approach the training of the system as a supervised learning problem in which each training sample contains a pair of images, one chosen at random from the query set and one from the target set, together with a binary label indicating whether the pair is a match. During training, we feed both images to the FEN to extract their features, which are then concatenated to form a longer feature vector.
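For concreteness, one way to assemble such labeled pairs might look as follows (a sketch in Python; the function name, the `matches` mapping and the number of negatives per query are illustrative assumptions, not the paper's exact procedure):

```python
import random

def make_training_pairs(queries, targets, matches, n_neg_per_query=1, rng=random):
    """Build (query, target, label) samples; `matches` maps each query id
    to the set of target ids it matches (hypothetical data layout)."""
    samples = []
    for q in queries:
        for t in matches[q]:
            samples.append((q, t, 1))          # positive pair
        # draw a few non-matching targets as negatives
        negatives = [t for t in targets if t not in matches[q]]
        for t in rng.sample(negatives, min(n_neg_per_query, len(negatives))):
            samples.append((q, t, 0))          # negative pair
    return samples
```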
The order of the two features in the concatenation determines how the SDN is trained. Given any pair of image features, the SDN should be order-invariant. Thus, we also create the symmetric concatenation, feed it to the SDN as well, and compute the average error over the two feature vectors. We observe that retrieval results improve when using this symmetric concatenation. We believe this is because, when the SDN is trained in an order-invariant manner, the features extracted by the FEN are more robust and representative for distinguishing a pair of images.
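The symmetric scoring described above can be sketched as follows (here `sdn` stands in for the trained similarity discrimination network; the names are illustrative):

```python
import numpy as np

def symmetric_score(feat_a, feat_b, sdn):
    """Score a feature pair in an order-invariant way: evaluate the SDN on
    both concatenation orders and average the results (sketch; `sdn` is any
    callable mapping a concatenated vector to a scalar score)."""
    ab = np.concatenate([feat_a, feat_b])
    ba = np.concatenate([feat_b, feat_a])
    return 0.5 * (sdn(ab) + sdn(ba))
```

By construction, swapping the two inputs leaves the score unchanged, which is the order-invariance the training relies on.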
3.2 Structure of the proposed network system
The proposed system contains two networks, a feature extraction network (FEN) and a similarity discrimination network (SDN). The FEN is an 8-layer network consisting of 7 convolution layers and 1 fully connected layer. Batch normalization followed by a leaky rectified linear unit (LReLU) is used for all convolution layers in the FEN; this configuration has been demonstrated to be both efficient and effective in many approaches. Kernels of a single size are used throughout. There are 16 kernels in the first convolution layer. We double the number of kernels at each new layer while, at the same time, the size of the feature maps is halved using max-pooling. A fully connected layer with 1024 nodes is stacked on top of the last convolution layer, with a tanh activation.
The SDN is a 3-layer fully connected neural network. The numbers of neurons in each layer are: 1024, 512 and 256 respectively. The tanh activation function is used for all layers except the last one where a linear output is used. This configuration will be discussed further in Section 3.3.
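The FEN's shape progression (channels doubling while spatial resolution halves) can be illustrated with a short sketch; the input resolution of 256 used here is an assumption for illustration, since the paper's exact input size is not reproduced in this text:

```python
def fen_shapes(input_size=256, n_conv=7, base_channels=16):
    """Sketch of the FEN's tensor shapes: the channel count doubles at each
    convolution layer while max-pooling halves the spatial size.
    Returns one (channels, height, width) tuple per convolution layer."""
    shapes = []
    size, channels = input_size, base_channels
    for _ in range(n_conv):
        size //= 2                     # max-pooling halves the spatial size
        shapes.append((channels, size, size))
        channels *= 2                  # kernel count doubles at the next layer
    return shapes
```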
3.3 Learning to approximately maximize the decision margin
Successfully training a well-generalized neural network usually requires a large amount of training data. For the proposed two-network system, where each training sample pairs an image from the query set with an image from the target set, covering all possible combinations would require generating an extremely large training set. With such a large number of samples, the training process becomes very inefficient.
A feasible solution to this problem is to train in each epoch on only a randomly chosen subset of the overall training samples. However, this solution easily causes two problems in practice. First, by nature our dataset is highly imbalanced: the number of negative samples is substantially larger than the number of positive samples, because for each slide image there are only a few matching frame images to form positive training samples. Training directly on imbalanced data like ours biases predictions towards negative cases. Second, the decision boundary in a binary classification problem is often defined by samples that are close to the boundary. A well-known example is the support vector machine (SVM), where the decision boundary is defined only by the support vectors that lie on the margin. Thus, a blindly chosen training subset can result in a decision boundary that moves from iteration to iteration, which may slow convergence to a stable decision boundary.
To solve the problems mentioned above, we propose a training framework similar to the boosting algorithm to maximize margins, and in addition a data loading balancer to address imbalanced learning with a huge training dataset. Details of these two components are described in the following sections.
Maximizing the margin by boosting
Boosting is a well-known technique for primarily reducing bias, and also variance, in supervised learning. It consists of iteratively learning weak classifiers with respect to a distribution and adding them to a final strong classifier. When classifiers are added, they are typically weighted in a way that is usually related to the weak learners' accuracy. After a weak learner is added, the data are reweighted: examples that are misclassified gain weight and examples that are classified correctly lose weight. Thus future weak learners focus more on the examples that previous weak learners misclassified.
Schapire et al. explain that the success of boosting is due to its property of increasing the margin. If the margin increases, the training instances are better separated and an error is less likely. This makes boosting’s aim similar to that of support vector machines.
The boosting framework adopted in this work differs from the original boosting framework, in which the weighted votes of the weak learners are linearly combined to form the final result. Here we produce a single learner: the algorithm only updates the weights of the samples, and the final classification is performed by a single network.
For each training round of our boosting framework, the data loading balancer provides a dataset for the current round. The proposed network system is trained for several epochs with mini-batches extracted from this dataset. Given the huge number of possible samples, each round exposes only a small portion of them to the network. To save memory, we do not keep track of weights for all possible samples. Instead, we use a hash table H to store the weights of all samples trained so far. Thus the weights currently in the hash table may not add up to one, and the weights of the samples extracted for the current round need to be normalized. For training samples used for the first time, we initialize their weights to a default value. Once training is done for the samples of the current round, we update their weights in the hash table accordingly.
The proposed training framework using boosting is summarized by Algorithm ?.
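The weight bookkeeping with the hash table can be sketched as follows; the initialization value, the reweighting factor `beta` and the exact update rule are illustrative stand-ins, not the paper's exact boosting update:

```python
def normalize_round_weights(H, sample_ids, init_weight=1.0):
    """Fetch (or initialize) the weights of this round's samples from the
    global hash table H and normalize them to sum to one
    (H maps sample id -> weight; hypothetical layout)."""
    for s in sample_ids:
        H.setdefault(s, init_weight)
    total = sum(H[s] for s in sample_ids)
    return {s: H[s] / total for s in sample_ids}

def update_weights(H, round_weights, errors, beta=0.5):
    """AdaBoost-style reweighting: misclassified samples gain weight,
    correctly classified ones lose weight (`beta` is a hypothetical factor)."""
    for s, w in round_weights.items():
        H[s] = w / beta if errors[s] else w * beta
```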
A balanced data loader
Normally, for a relatively small dataset, it is feasible to load all samples and keep them in memory. However, when a dataset contains a huge number of samples, such as the one used in this work, reading all samples into memory may not be possible. In addition, there are far more negative samples, in which a query image does not match a target image, than positive samples. To obtain balanced data for training an unbiased classifier, we use a data loader that works as a separate process to pre-fetch the samples used to train the proposed network system in each round.
The data loader is responsible for generating a training set with a balanced class distribution based on the current sample weights in the hash table. Its inputs are the query image set, the target image set, and the hash table. Suppose that on average each query image has a small number of matching targets. To enable an equal number of draws of positive and negative samples from the entire dataset, the probabilities of picking positive and negative samples are set accordingly. We consider samples whose weights rank at the top as hard samples, i.e. samples potentially lying close to the classification margin. After the loading process, the data loader provides a dataset for each training round.
We summarize details of loading data in Algorithm ?.
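A minimal sketch of such a balanced loader, assuming a 50/50 class split and an illustrative hard-sample fraction (the paper's exact picking probabilities and top-weight percentage are not reproduced here):

```python
import random

def balanced_batch(positives, negatives, batch_size, hard_frac=0.3,
                   weights=None, rng=random):
    """Draw a class-balanced training set: half positives, half negatives.
    Optionally, a fraction of each half is taken from the highest-weight
    ("hard") samples and the rest is drawn uniformly. The 30% hard fraction
    and all names are illustrative assumptions."""
    def draw(pool, k):
        if weights:
            ranked = sorted(pool, key=lambda s: weights.get(s, 0), reverse=True)
            n_hard = int(k * hard_frac)        # take the hardest samples first
            return ranked[:n_hard] + rng.sample(ranked[n_hard:], k - n_hard)
        return rng.sample(pool, k)
    half = batch_size // 2
    return draw(positives, half) + draw(negatives, half)
```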
By maximizing the decision margin, we not only require that samples are classified correctly, but also that correctly classified samples lie far away from the decision boundary. This design strategy is essentially the same as that of the SVM, where a hinge loss is used to maximize the margin. Thus, we use the hinge loss as the objective function when training our FEN-SDN network system. Given the ground-truth label of a sample and the classifier's output score, the hinge loss of the prediction is defined as:
Assuming class labels of -1 or +1, the hinge loss is not differentiable at the hinge point, which hinders gradient descent optimization. Hence, we employ the squared hinge loss from the L2-SVM, which penalizes violated margins more strongly (quadratically instead of linearly). The squared hinge loss has the form:
where is a smoothing parameter (e.g. ).
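For reference, the standard hinge loss and the L2-SVM squared hinge loss, as commonly defined for labels in {-1, +1}, can be implemented as follows (the paper's exact variant, including its smoothing parameter, may differ):

```python
import numpy as np

def hinge_loss(y, s):
    """Standard hinge loss for a label y in {-1, +1} and a score s:
    zero when the margin y*s >= 1, linear in the violation otherwise."""
    return np.maximum(0.0, 1.0 - y * s)

def squared_hinge_loss(y, s):
    """Squared (L2-SVM) hinge loss: penalizes margin violations
    quadratically and is differentiable at the hinge point."""
    return np.maximum(0.0, 1.0 - y * s) ** 2
```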
3.4 Training of the network
The proposed neural network system is trained using mini-batches. Since the input images are large, to avoid memory overflow we use a relatively small number of image pairs in each mini-batch, and so the batch size is set to 32. The network is trained using adaptive moment estimation (Adam) with fixed learning rate and momentum. The training is also regularized by weight decay. We initialize the network weights using the approach proposed by He et al.
In this section, we provide an experimental evaluation of the proposed approach. We first describe the generation of the dataset and then the three benchmark methods to which our proposed approach is compared.
To the best of our knowledge, there is no publicly available dataset for the task of lecture video indexing. To generate a dataset, we collected 150 videos and their associated course slides from the web. A video segmentation algorithm was applied to break the lecture videos into segments. We manually matched each slide with its corresponding video segment to create positive ground truth cases; all other combinations of slides and video frames were given negative labels.
As a result, a set of slide images was collected for training, with each slide associated with 5 video frames. The number of possible positive image pairs is much smaller than the number of negative pairs, so the dataset is very imbalanced. To augment the dataset, synthetic query video frames were generated by adding random black block noise and applying a random projective transformation to warp each video frame image. We kept the parameters of the projective transformation matrix within reasonable ranges for scale, rotation, and translation so that realistic transformations were produced. We also used 10 different levels of noise to generate the synthetic data, where the noise level indicates the ratio of blocked pixels to the total number of pixels in the image; the number of noise blocks was also varied. As with the real data, each unique target slide image was paired with 5 positive synthetic query frame images, providing extra positive and negative synthetic image pairs. The proposed data loader was responsible for creating a balanced dataset during training, as described before. The testing dataset was created from a separate set of slide images; as in the training data, each target slide image was associated with 5 video frames, yielding both positive and negative test pairs. All images were resized to a fixed resolution before being fed into the proposed deep neural network system for learning.
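The block-noise augmentation can be sketched as follows (block count, block shape and placement are illustrative assumptions; the paper also varies these):

```python
import numpy as np

def add_block_noise(image, noise_level, n_blocks=5, rng=None):
    """Occlude roughly `noise_level` of an image's pixels with black
    square blocks, split over `n_blocks` blocks. Blocks may overlap,
    so the actual occluded fraction can be smaller than requested."""
    rng = rng or np.random.default_rng(0)
    out = image.copy()
    h, w = image.shape[:2]
    target = noise_level * h * w / n_blocks      # pixels per block
    side = max(1, int(np.sqrt(target)))          # square block side length
    for _ in range(n_blocks):
        y = rng.integers(0, max(1, h - side + 1))
        x = rng.integers(0, max(1, w - side + 1))
        out[y:y + side, x:x + side] = 0          # black occlusion block
    return out
```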
We compared three methods to the proposed system. The first is a traditional interest-point-based approach: it computes SIFT features on a set of images to build a database, and given a query image, its SIFT features are extracted and their nearest neighbours in the feature database are retrieved. The second and third methods are convolutional-network-based approaches: MatchNet and the triplet network. Since the original MatchNet accepts only small fixed-size images as input, we added two additional convolutional layers to adapt it to our input images. These convolutional neural networks extract features from image patches and compute similarities between them.
Performance of the proposed method for lecture video index frame retrieval
We first evaluate the proposed deep neural network system for lecture video index frame retrieval at different noise levels. The proposed system generates a matching score for each image pair; the matching scores are sorted, and the best candidate matched slide images are selected from the target slide image set. We compare the proposed approach with the three benchmark methods. Taking into account that slides in video frames have variable quality caused by occlusion, lighting conditions and image deformation due to different viewing angles, we evaluate the proposed neural network system on video images of varying quality. We define image quality as a measure of the size of the slide area, blocked area and text area, and the extent of perspective distortion. We employed previously proposed approaches to detect text areas, slide areas and human foreground. The overall image quality is calculated by:
In this equation, the quality value of an image is computed from the size of the image (a constant in our experiments) together with its text area, slide area and blocked area. Based on the image quality value, we divide our query video frames into 10 levels. Video indexing tests are conducted on the query images of each level separately, and the top 1, top 5 and top 10 hit rates are summarized in Table ?.
As can be observed in Table ?, the proposed approach performs better than the benchmark methods, achieving the highest hit rate in all experiments. It also handles noisy query frame images better than the benchmark methods. Note that besides the key factor, the text area, other factors such as image warping parameters, illumination and resolution can also affect performance, so in some instances a set with higher quality is more difficult to match and produces a slightly lower hit rate. Figure 4 shows the results of video frame retrieval. The first image of each row is the query image; the subsequent five images are retrieval results sorted by the matching score generated by our proposed neural network system. Correct target images are found in all four cases. We can also observe that accuracy decreases as the noise level increases.
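The top-k hit rates reported above can be computed with a short sketch (the data layout here is an assumption for illustration):

```python
def topk_hit_rate(scores, ground_truth, k=5):
    """Fraction of queries whose true target appears among the k
    highest-scoring candidates. `scores[q]` maps each candidate target
    to its matching score; `ground_truth[q]` is the correct target."""
    hits = 0
    for q, cand in scores.items():
        ranked = sorted(cand, key=cand.get, reverse=True)[:k]
        hits += ground_truth[q] in ranked
    return hits / len(scores)
```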
This paper presents a novel approach for lecture video indexing using boosted deep convolutional neural networks (CNNs). The indexing is performed by matching high quality slide images, for which text is either known or extracted, to lower resolution video frames with possible noise, perspective distortion, and occlusions. The proposed deep neural network system consists of two sub-networks targeting feature extraction and similarity determination to perform image matching for lecture video indexing. The trained network system is given as input a pair consisting of a slide image and a candidate video frame and produces the similarity between them. A boosting framework is integrated into our proposed network during the training process. Experimental results show that the proposed approach is much more capable of handling occlusion, spatial transformations and other types of noise when compared with known approaches.
- D. Ma, B. Xie, and G. Agam, "A machine learning based lecture video segmentation and indexing algorithm," in Document Recognition and Retrieval XXI, vol. 9021. SPIE, 2013, pp. 8–14.
- D. Ma and G. Agam, "Enhanced features for supervised lecture video segmentation and indexing," in Imaging and Multimedia Analytics in a Web and Mobile World. SPIE, 2015.
- L. Liu, Y. Lu, and C. Y. Suen, “Novel global and local features for near-duplicate document image matching,” in Pattern Recognition (ICPR), 2014 22nd International Conference on, Aug 2014, pp. 4624–4629.
- H. Yang, M. Siebert, P. Luhne, H. Sack, and C. Meinel, "Automatic lecture video indexing using video ocr technology," in Multimedia (ISM), 2011 IEEE International Symposium on. IEEE, 2011, pp. 111–116.
- T. Tuna, J. Subhlok, L. Barker, V. Varghese, O. Johnson, and S. Shah, "Development and evaluation of indexed captioned searchable videos for stem coursework," in Proceedings of the 43rd ACM technical symposium on Computer Science Education. ACM, 2012, pp. 129–134.
- H. J. Jeong, T.-E. Kim, and M. H. Kim, "TalkMiner: a lecture webcast search engine," in Proceedings of the 10th International. ACM, 2012, pp. 245–250.
- S. Rajgure, K. Raghavan, V. Oria, R. Curtmola, E. Renfro-Michel, and P. Gouton, “Indexing multimedia learning materials in ultimate course search,” in 2016 14th International Workshop on Content-Based Multimedia Indexing (CBMI), June 2016, pp. 1–6.
- H. J. Jeong, T.-E. Kim, and M. H. Kim, "An accurate lecture video segmentation method by using sift and adaptive threshold," in Proceedings of the 10th International Conference on Advances in Mobile Computing & Multimedia. ACM, 2012, pp. 285–288.
- D. P. Lopresti, “Models and algorithms for duplicate document detection,” in Document Analysis and Recognition, 1999. ICDAR ’99. Proceedings of the Fifth International Conference on, Sep 1999, pp. 297–300.
- S. Vitaladevuni, F. Choi, R. Prasad, and P. Natarajan, “Detecting near-duplicate document images using interest point matching,” in Pattern Recognition (ICPR), 2012 21st International Conference on, Nov 2012, pp. 347–350.
- A. W. Harley, A. Ufkes, and K. G. Derpanis, “Evaluation of deep convolutional nets for document image classification and retrieval,” CoRR, vol. abs/1502.07058, 2015.
- X. Han, T. Leung, Y. Jia, R. Sukthankar, and A. C. Berg, “Matchnet: Unifying feature and metric learning for patch-based matching,” in 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2015, pp. 3279–3286.
- E. Hoffer and N. Ailon, Deep Metric Learning Using Triplet Network. Cham: Springer International Publishing, 2015, pp. 84–92.
- H. Schwenk and Y. Bengio, “Boosting neural networks,” Neural Computation, vol. 12, no. 8, pp. 1869–1887, 2000.
- S. Han, Z. Meng, A.-S. Khan, and Y. Tong, "Incremental boosting convolutional neural network for facial action unit recognition," in Advances in Neural Information Processing Systems 29, D. D. Lee, M. Sugiyama, U. V. Luxburg, I. Guyon, and R. Garnett, Eds. Curran Associates, Inc., 2016, pp. 109–117.
- S. Shalev-Shwartz, “Selfieboost: A boosting algorithm for deep learning,” arXiv preprint arXiv:1411.3436, 2014.
- N. Karianakis, T. J. Fuchs, and S. Soatto, “Boosting convolutional features for robust object proposals,” arXiv preprint arXiv:1503.06350, 2015.
- S. Ioffe and C. Szegedy, “Batch normalization: Accelerating deep network training by reducing internal covariate shift,” arXiv preprint arXiv:1502.03167, 2015.
- K. Simonyan and A. Zisserman, "Very deep convolutional networks for large-scale image recognition," CoRR, vol. abs/1409.1556, 2014. [Online]. Available: http://arxiv.org/abs/1409.1556
- A. Krizhevsky, I. Sutskever, and G. E. Hinton, "Imagenet classification with deep convolutional neural networks," in Advances in Neural Information Processing Systems 25, F. Pereira, C. J. C. Burges, L. Bottou, and K. Q. Weinberger, Eds. Curran Associates, Inc., 2012, pp. 1097–1105.
- V. Nair and G. E. Hinton, "Rectified linear units improve restricted boltzmann machines," 2010.
- X. Zhang, D. Ma, L. Gan, S. Jiang, and G. Agam, "Cgmos: Certainty guided minority oversampling," in Proceedings of the 25th ACM International on Conference on Information and Knowledge Management, ser. CIKM '16. New York, NY, USA: ACM, 2016, pp. 1623–1631. [Online]. Available: http://doi.acm.org/10.1145/2983323.2983789
- R. E. Schapire, “The strength of weak learnability,” Machine learning, vol. 5, no. 2, pp. 197–227, 1990.
- K. He, X. Zhang, S. Ren, and J. Sun, “Delving deep into rectifiers: Surpassing human-level performance on imagenet classification,” in Proceedings of the IEEE International Conference on Computer Vision, 2015, pp. 1026–1034.
- D. G. Lowe, “Distinctive image features from scale-invariant keypoints,” Int. J. Comput. Vision, vol. 60, no. 2, pp. 91–110, Nov. 2004.
- A. S. Imran, F. A. Cheikh, and S. J. Kowalski, “Automatic annotation of lecture videos for multimedia driven pedagogical platforms,” Knowledge Management & E-Learning: An International Journal (KM&EL), vol. 8, no. 4, pp. 550–580, 2016.