Online Generative-Discriminative Model for Object Detection in Video: An Unsupervised Learning Framework


Dapeng Luo, Zhipeng Zeng, Longsheng Wei, Chen Luo, Jun Chen, and Nong Sang

Dapeng Luo and Zhipeng Zeng are with the School of Mechanical Engineering and Electronic Information, China University of Geosciences, P. R. China, 430074. Longsheng Wei and Jun Chen are with the School of Automation, China University of Geosciences, P. R. China, 430074. Chen Luo is with the Huizhou School Affiliated to Beijing Normal University, Huizhou, China, 516002. Nong Sang is with the School of Automation, Huazhong University of Science and Technology, Wuhan, China, 430074.

Manuscript received xxxx; revised xxxx.

One object class may show large variations due to diverse illuminations, backgrounds and camera viewpoints, and traditional single-view object detection methods often perform poorly in unconstrained video environments. To address this problem, many modern multi-view detection approaches model complex 3D appearance representations to predict the optimal viewing angle for detection. Most of these approaches require an intensive training process on a large database collected in advance. In this paper, the proposed framework takes a remarkably different direction, resolving the multi-view detection problem in a bottom-up fashion. First, a scene-specific detector is obtained from a fully autonomous learning process triggered by marking several bounding boxes around the object in the first video frame via a mouse; no human-labeled training data or generic detector is needed. Second, this learning process is conveniently replicated in many different surveillance scenes, resulting in a particular detector for each camera viewpoint. Thus, the proposed framework can be employed in multi-view object detection applications through an unsupervised learning process. Naturally, the initial scene-specific detector, initialized from only a few bounding boxes, exhibits poor detection performance that is difficult to improve with traditional online learning algorithms. Consequently, we propose a Generative-Discriminative model that partitions the detection response space and assigns each partition an individual classifier that progressively achieves high classification accuracy. A novel online gradual learning algorithm is proposed to train the Generative-Discriminative model automatically and to focus online learning on the hard samples: the most informative samples lying around the decision boundary. The output is a hybrid-classifier-based scene-specific detector which achieves decent performance under different viewing angles.
Experimental results on several video datasets show that our approach achieves performance comparable to robust supervised methods, and outperforms state-of-the-art online learning methods under varying imaging conditions.

Multi-view object detection, unsupervised learning, hybrid classifiers, online learning, generative-discriminative model.

1 Introduction

With the development of intelligent surveillance systems, pedestrian and vehicle detection approaches have garnered profound interest from engineers and scholars, and many impressive works [1, 2, 3, 4, 5, 6, 7] have been published in the last several years. The results of these studies have been employed in a great number of applications in the multimedia community, such as video surveillance [8], abnormal event detection [9] and automatic traffic monitoring [10]. However, object detection and recognition remains considerably difficult in highly populated public places such as airports, train stations and urban arterial roads, where multiple surveillance cameras with different viewpoints, illuminations and backgrounds are distributed over cluttered environments. One object class may show large intra-class variations under these different imaging conditions, so a substantial amount of training data needs to be collected and labeled to build a detector for one object category by statistical learning. Without this, a detector trained in a constrained video environment will deliver poor performance under different environmental conditions. How to robustly and stably locate objects in arbitrary video environments through unsupervised learning is still an open issue.

Multi-view object detection methods can be used to locate objects from various viewpoints, which minimizes the influence of changing imaging conditions. However, these approaches increase the discriminability of detectors through relatively complex additional stages involving view-invariant features [11, 12, 13, 14], pose estimators [15, 16, 17, 18] or 3D object models [19, 20, 21, 22, 23, 24], which make them computationally expensive and require an intensive training process on large datasets. These top-down approaches incur considerable runtime complexity.

Transfer learning methods [25, 26] are an alternative strategy to learn detection in a scene with a different viewpoint from a pre-trained model. These methods can reduce the effort involved in collecting samples and retraining in response to such variations in appearance. However, negative transfer often occurs when transferring between very different scenes [27], significantly degrading the performance of the target-scene detector and limiting the application of transfer learning.

Consider one surveillance camera used in one specific scenario. A common strategy is to train a scene-specific object detector from a human-collected and human-labelled training sample set, since each individual has a limited range of poses in one scene. However, it is impossible to train a scene-specific detector for every scenario, considering the tedious human effort and time costs, unless that training process is fully autonomous.

A method must therefore be found to learn a scene-specific object detector without human intervention and extend it to other scenes, ensuring that every scene has its own detector and achieves satisfactory detection performance under different imaging conditions and viewpoints. Such a feasible and efficient approach would resolve the multi-view object detection problem, because the learning process of each scene-specific detector is completely automatic and adapts to the object-appearance changes caused by viewing angles and distances. Although the idea sounds attractive, this task is challenging: constructing an object model without prior knowledge is difficult, and there is no effective algorithm to collect and label training samples automatically for training the detector on the fly.

Some studies on online-learning object detection have been reported, and most adopt a similar framework including an online learning detector and a validation strategy. However, these methods are not completely unsupervised; they only minimize manual effort and are typically initialized with several hundred human-labeled training samples for one specific scene. Moreover, the number of manually labeled initial instances grows proportionally in multi-scene object detection applications. In addition, these approaches employ co-training [28, 29], background subtraction [30], generative models [31, 32] and tracking-by-detection [33, 34] as validation strategies to collect and label the online learning samples automatically, which is far from competitive with supervised methods because of the high labeling error on hard samples distributed around the decision hyperplane, compared to human labeling in a supervised training process.

Our goal is to design an unsupervised object detection framework which can train an object class detector in each particular scenario without human intervention. Instead of manually labeling several hundred initial training samples as in traditional online object detection methods, our scene-specific detector is obtained by simply marking several bounding boxes around the object in the first video frame via a mouse. This reduces human annotation effort to an effortless mouse operation within the first frame, which "determines" the object category of interest in the current surveillance video. Beyond this, neither human-labeled samples nor a generic object detector is needed. There are two processes in our framework:

In the learning process, first, an initial sample set is generated automatically by affine warping of the objects marked in the first frame. Second, a Generative-Discriminative model, trained on the initial sample set, runs as an initial detector on subsequent frames. Naturally, the initial detector has poor detection performance due to the incomplete initial training sample set. Third, to address this, we propose an online gradual learning algorithm to iteratively train the Generative-Discriminative model in an unsupervised manner. When the convergence condition is satisfied, the learning process stops and results in a hybrid classifier composed of a generative model and a discriminative model.

During the object detection process, the two models work together to determine the locations of real objects. The generative model is first used to detect objects with a sliding-window strategy, and the detected regions located near the classification boundary are further examined by the discriminative model. Regions with high confidence are considered real objects and the rest background.
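The two-stage decision just described can be sketched as follows. This is a minimal illustration, not the paper's implementation: `generative_score`, `discriminative_score` and the two thresholds are hypothetical stand-ins for the OSF confidence, the ISVM decision value and the dual boundaries.

```python
def detect(windows, generative_score, discriminative_score,
           theta_pos=0.7, theta_neg=0.3):
    """Classify candidate windows with the hybrid model.

    Windows scoring above theta_pos are accepted, below theta_neg rejected;
    only the ambiguous fraction in between is passed to the (slower)
    discriminative model.
    """
    detections = []
    for w in windows:
        s = generative_score(w)
        if s >= theta_pos:                 # confident positive
            detections.append(w)
        elif s <= theta_neg:               # confident negative
            continue
        elif discriminative_score(w) > 0:  # hard sample: ask the SVM
            detections.append(w)
    return detections
```

Because only the windows falling between the two boundaries reach the second stage, the expensive classifier runs on a small fraction of candidates.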

Our method, triggered by several bounding boxes, is fully unsupervised, with no human effort spent on collecting and labelling samples, and can easily be extended to other scenes, forming a scene-specific detector for each surveillance camera with its own viewing distance and angle. Thus, it is a bottom-up method for resolving the multi-view object detection problem.

Moreover, the combination of the Generative-Discriminative model and the online gradual learning algorithm makes our method robust in tackling the most informative samples lying at the decision boundary, and the method achieves state-of-the-art detection performance under different viewing angles, as shown in Fig. 1. By self-learning within three hours, our method achieves satisfactory performance on three different-viewpoint sequences of the CAVIAR [35] and PETS2009 [36] datasets. Moreover, we obtain competitive detection performance on the S2 sequences of the PETS2009 dataset compared to Aggregated Channel Features (ACF) [6], one of the most robust supervised object detection approaches, trained with 300 manually labeled positive samples and 900 negative samples from the same viewpoint. To the best of our knowledge, this is the first demonstration of an online learning scene-specific detector that requires no additional supervision, in contrast to other online object detection methods.

Fig. 1: (a) Shop sequence from CAVIAR dataset; (b) Walk sequence from CAVIAR dataset; (c) S2 sequence from PETS2009 dataset; (d) experiments in multi-view video sequences; (e) comparison with ACF [6].

The main contributions of this paper are:

1) We present a bottom-up method that resolves the multi-view object detection problem by training an online Generative-Discriminative model in an unsupervised manner on surveillance videos from various viewpoints, automatically achieving successful detection performance.

2) We present an online gradual learning algorithm which allows the online learning detector, initialized by marking several bounding boxes around the object and thus starting with poor detection performance, to successively improve its classification accuracy and become more dedicated to challenging samples lying near the decision boundary.

3) We present a Generative-Discriminative model whose components employ different features and are integrated into a hybrid classifier, improving the discriminability of online-learned object classifiers while remaining efficient at run-time.

The rest of this paper is arranged as follows: Section 2 briefly recalls related work. In Section 3, an analysis of our approach is provided. The generative model is described in Section 4, the discriminative model is explained in Section 5, and Section 6 presents the online gradual learning process. Experiments and results are presented in Section 7, which is followed by the conclusion.

2 Literature Review

Object class detection is a core component in most computer vision tasks, and prominent success has been achieved by single-view and supervised-learning-based approaches, such as [1, 2, 3, 4, 5, 6, 7]. Here, we focus on multi-view object detection and online learning object detection.

2.1 Multi-view Object Methods

Conventional object detection methods that locate objects from a single view cannot be employed in multi-view object detection, since objects vary widely in pose, color and shape under multi-view imaging conditions. Thus, a common idea is to model object classes by collecting distinct views to form a bank of viewpoint-dependent detectors, which can then be used to predict the optimal viewing angle for detection.

In the early stages, most multi-view detection approaches applied several single-view detectors independently and then combined their responses via arbitrary logic. Some impressive works [37, 38, 39, 40] have been reported in the domain of face detection, dealing with multiple viewpoints (frontal, semi-frontal and profile).

Following this progress, Thomas et al. [22] no longer rely on single-view detectors working independently, but develop a single integrated multi-view detector that accumulates evidence from different training views. Several other approaches, such as [23, 24], build complex 3D part models containing connections between neighboring parts and overlapping viewpoints, achieving remarkable results in predicting a discrete set of object poses. The discrete views are usually treated independently; however, [41, 42, 43, 44] require evaluating a large number of view-based detectors, resulting in considerable runtime complexity.

Recently, Pepik et al. [45] proposed 3D deformable part models, which extend the deformable part model to include viewpoint information and part-level 3D geometry. This method represents 3D object parts and synthesizes appearance models for viewpoints of arbitrary granularity on the fly, resulting in a significant speed-up. Xu et al. [46] proposed to accomplish multi-view learning with incomplete views by exploiting the connections between multiple views, enabling the incomplete views to be restored with the help of the complete ones.

In contrast to all the aforementioned approaches, our framework takes a markedly different direction: we establish a scene-specific detector in an unsupervised manner for each scenario, avoiding the integration of complex models and the labeling of substantial training samples across different viewpoints. Resolving the multi-view object detection problem in this bottom-up fashion is an improvement over other systems.

2.2 Online Learning Object Detection Framework

This paper proposes to address the multi-view object detection problem with a bottom-up online object detection framework. In essence, online learning frameworks have been published to adapt the detector to object deformation. However, it is difficult to use these approaches in the multi-view object detection domain for two major reasons:

First, traditional online learning object detection methods are inseparable from manual effort. For instance, most online learning detectors are initialized with human-labelled training samples, and detection performance usually degrades as the number of manually labelled instances is reduced. Some works [32, 47] propose to specialize a well-trained generic object detector to a specific scene; however, these works are only valuable under constrained application conditions. Recently, semi-supervised learning [48, 49], transfer learning [25, 26] and weakly-supervised learning [50, 51, 52, 53] have been employed to reduce the amount of labeled training data required by an object detector. How to minimize human effort in an online object detection system is still a hot research topic.

The authors of [54, 55] proposed to initialize an online learning classifier with a bounding box given in the first video frame and to train the classifier online to track a single object, which motivated us to train an object category detector from several bounding boxes. However, in our case the detector should detect multiple objects in video frames and improve its performance by self-training without any human-labeled samples, which is more difficult than the online learning tracker approaches that detect a specific single object in video sequences.

Second, the new samples used to train the detector online need to be collected and labeled automatically, and how to label them correctly is still a challenging topic. To date, various automatic annotation methods have been reported; they can be broadly divided into four categories: 1) co-training based, 2) background subtraction based, 3) generative model based, and 4) tracking based.

In co-training based approaches [28, 29], two classifiers are trained simultaneously and label samples for each other. In background subtraction based approaches [30], a foreground detector built on a background model is employed as an automatic labeler. Generative model based methods [31, 32] use the model reconstruction error to validate the detection responses and train the detector through a feedback process. Tracking based approaches [33, 34] collect and label online training samples using tracking-by-detection, which can interpolate missed object instances and false alarms as positive and negative samples, respectively.

However, the aforementioned methods have no special strategy for the problematic samples located around the decision boundary, the most informative and ambiguous part of the feature space describing the objects, especially in cluttered environments. Our method employs online gradual learning and a generative-discriminative model to process the detection responses hierarchically, which focuses online learning on the hard samples and reduces the online labeling error.

Beyond the above-mentioned approaches, there exist methods [56, 57] that detect unknown object classes from motion segmentation. Although these methods learn a foreground model in arbitrary scenarios without any a priori assumption, they are very different from our method, which addresses a one-object-category detection problem: these methods cannot recognize the detected responses because they use a cluster-based global optimization procedure.

Our method initializes a scene-specific detector with several bounding boxes in the first frame, and eventually realizes a state-of-the-art multiple-object detection system without human labeling effort. The online gradual learning strategy proposed in this paper is the key that allows our framework to improve from an initial detector with poor detection performance (see details in Section 6). Moreover, the output of our framework is a Generative-Discriminative model, which is very different from other online learning object detection frameworks. In this way, our framework opens up the possibility for several different classifiers, learned online, to work together to determine the locations of one object class in video.

3 Analysis of Our Method

Object detection can be viewed as a two-class classification problem [58]. However, in most applications the sample space can be divided into three groups, positive, hard and negative samples, by two decision boundaries. Both hard positive and hard negative examples have a significant effect on enhancing classification performance.

In this section, a Generative-Discriminative model is proposed to describe the three sample spaces of one surveillance domain. A novel cost function is employed to improve the model performance and speed up the convergence rate. After that, an instantiation of the Generative-Discriminative model is introduced to learn a scene-specific detector in an unsupervised manner.

3.1 Generative-Discriminative Model

A generative classifier $G$, with input $x$ and label $y$, concentrates on capturing the generation process of $x$ by modelling the class-conditional distribution $p(x \mid y)$, which is robust to partial occlusion, viewpoint changes and significant intra-class variation of object appearance. Thus, the model is suitable for describing the positive sample space $S_p$ and the negative sample space $S_n$.

A discriminative model $D$, on the other hand, directly learns the difference between categories by modelling $p(y \mid x)$. Thus, $D$ can be employed to model the hard sample space $S_h$, so as to find the optimal classification of the samples located between the positive decision boundary $B_+$ and the negative decision boundary $B_-$.

In addition, the distance between the positive and negative decision boundaries has a huge impact on the whole model. The smaller the distance, the more accurately the generative model describes the positive and negative samples. A smaller distance also means fewer hard samples, which provides a convergence condition when online training a Generative-Discriminative model (see details in Section 6).

Thus, the cost function is of the form:

$$J = \frac{1}{|S_p \cup S_n|} \sum_{x_i \in S_p \cup S_n} \ell_G(x_i, y_i) + \frac{1}{|S_h|} \sum_{x_i \in S_h} \ell_D(x_i, y_i) + \lambda\, d(B_+, B_-) \qquad (1)$$

where $\ell_G$ and $\ell_D$ are the classification losses of the generative and discriminative models, $|S_p \cup S_n|$ and $|S_h|$ are the numbers of samples in the different sample sets, and $\lambda$ is the weight of the distance item $d(B_+, B_-)$.

Many generative and discriminative models can be employed as $G$ and $D$, such as autoencoder networks, generative adversarial networks, conditional random fields, hidden Markov models and Bayesian networks. In our case, considering the real-time requirements of video object detection applications, we propose an Online Selector Fern (OSF) generative model and an unsupervised iterative SVM (ISVM) discriminative model. This results in a hybrid model which is both highly effective and computationally efficient (running at over 60 fps) and is particularly suitable for intelligent surveillance systems.

3.2 Unsupervised Training a Generative-Discriminative Model

To train the Generative-Discriminative model online, an initial training sample set is prepared by affine warping of the several patches selected in the first surveillance video frame; thus, no human-labeled training data or generic detector is needed. An OSF generative model is trained as the initial detector. Naturally, this detector has poor detection performance due to the incomplete training sample set. We therefore propose the online gradual learning algorithm to train the Generative-Discriminative model in a fully autonomous fashion, as shown in Fig. 2:
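The initial-set generation step can be sketched as below. This is an illustrative sketch only: the exact warp parameters and sampling scheme of the paper are not specified, so the ranges and the nearest-neighbour interpolation here are our assumptions.

```python
import numpy as np

def random_affine(max_rot=0.1, max_scale=0.05, max_shift=1.0, rng=None):
    """Draw a small random rotation/scale/translation as a 2x3 matrix
    (parameter ranges are assumed, not taken from the paper)."""
    rng = rng or np.random.default_rng()
    a = rng.uniform(-max_rot, max_rot)            # rotation angle (radians)
    s = 1.0 + rng.uniform(-max_scale, max_scale)  # isotropic scale
    tx, ty = rng.uniform(-max_shift, max_shift, size=2)
    return np.array([[s * np.cos(a), -s * np.sin(a), tx],
                     [s * np.sin(a),  s * np.cos(a), ty]])

def warp_patch(patch, A):
    """Map each output pixel through A about the patch centre
    (nearest-neighbour sampling; out-of-range pixels stay zero)."""
    h, w = patch.shape
    out = np.zeros_like(patch)
    cy, cx = (h - 1) / 2.0, (w - 1) / 2.0
    for y in range(h):
        for x in range(w):
            sx = A[0, 0] * (x - cx) + A[0, 1] * (y - cy) + A[0, 2] + cx
            sy = A[1, 0] * (x - cx) + A[1, 1] * (y - cy) + A[1, 2] + cy
            ix, iy = int(round(sx)), int(round(sy))
            if 0 <= ix < w and 0 <= iy < h:
                out[y, x] = patch[iy, ix]
    return out

def initial_set(patches, n_per_patch=20, rng=None):
    """Expand a few marked patches into an initial positive sample set."""
    rng = rng or np.random.default_rng(0)
    return [warp_patch(p, random_affine(rng=rng))
            for p in patches for _ in range(n_per_patch)]
```

In practice a library routine such as OpenCV's warpAffine would replace the explicit loop; the point here is only how a handful of marked boxes is expanded into a training set.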

Fig. 2: Approach overview

First, we define positive and negative decision boundaries in the OSF classifier. The detected responses are then collected as positive, negative and hard samples, respectively. To ensure a high correct-label rate, the OSF classifier is given a large initial margin between the two boundaries.

Second, an unsupervised ISVM discriminative model is proposed for online learning and labeling the hard samples.

Third, the labelled hard samples are used to train the OSF classifier online and gradually shrink the margin between the positive and negative boundaries, improving its ability to express the positive and negative sample spaces.

This process is repeated till convergence. The output is a Generative-Discriminative model, composed of an OSF classifier and an unsupervised ISVM model.

In the detection process, the scanning-window strategy is employed. Most image patches are classified by the OSF classifier, and only the small fraction located between the dual boundaries, collected as hard samples, is classified by the unsupervised ISVM model, which makes our approach robust while remaining efficient at run-time.

In the next section, we first introduce the boosting fern classifier and its online learning algorithm, realized by an online selection operator. The unsupervised iterative SVM is presented in Section 5, and the online gradual learning algorithm in Section 6.

4 Online Selector Fern(OSF) Generative Model

The traditional fern classifier is widely used in the object tracking field [59] due to its efficiency and high performance in tracking planar objects under affine transformation. [60, 61, 62] extended fern classifiers to detect objects appearing in the image under different orientations and viewpoints. In this paper, we propose the online selector fern algorithm, which integrates fern classifiers with an online feature selection strategy. Fern classifiers play the role of weak classifiers and can be boosted into a strong model by online selection operators.

4.1 Boosting Fern

Let $L_0 = \{(x_1, y_1), \dots, (x_m, y_m)\}$ denote the training samples, an image patch set and their labels, where $m$ is the number of samples, $y_i \in \{-1, 1\}$ is the sample label, and $x_i$ is an $N$-dimensional feature vector describing the sample.
We employ Local Binary Features (LBF) [59] to map an image sample to a Boolean feature space by comparing the intensities at two random positions of the sample. A fern is a feature subspace randomly sampled from the $N$-dimensional feature space. The $j$th fern can be denoted as:

$$F_j = \{ f_{j,1}, f_{j,2}, \dots, f_{j,S} \}$$

where $S$ is the number of features in a fern. For a training sample, this gives an $S$-digit binary code describing the sample appearance. In other words, each fern maps 2D image samples into a $2^S$-dimensional feature space, as shown in Fig. 3(a). Accordingly, we apply the fern to each labelled training sample and learn the posterior distribution as a histogram for each class, as shown in Fig. 3(b).
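The fern-code computation described above can be sketched as follows; the sampling of pixel pairs and the bit ordering are our illustrative choices.

```python
import numpy as np

def sample_fern(patch_shape, S, rng):
    """One fern = S random LBF features, each a pair of pixel positions."""
    h, w = patch_shape
    return [((int(rng.integers(h)), int(rng.integers(w))),
             (int(rng.integers(h)), int(rng.integers(w)))) for _ in range(S)]

def fern_code(patch, fern):
    """Map a patch to an S-digit binary code (0 .. 2^S - 1) by pairwise
    intensity comparisons between the two positions of each LBF feature."""
    code = 0
    for (y1, x1), (y2, x2) in fern:
        code = (code << 1) | int(patch[y1, x1] > patch[y2, x2])
    return code
```

Each fern thus partitions the patch space into $2^S$ bins, one per binary code.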

Fig. 3: (a) LBF features; (b) posterior distribution of a fern; (c) online selector of boosting ferns.

A single fern does not give an accurate estimation of the generative model in a specific surveillance scene, but we can build an ensemble of ferns by randomly choosing different subsets of LBF features. A weak classifier is defined by the posterior probability of a random fern's observation:

$$h_j(x) = P(y = 1 \mid F_j(x) = k) = \frac{N_{j,k}^{+} + u}{N_{j,k}^{+} + N_{j,k}^{-} + 2u}$$

where $N_{j,k}^{+}$ and $N_{j,k}^{-}$ count the positive and negative training samples falling into code $k$ of fern $j$, and $u$ is a smoothing factor. The weak classifiers are further linearly combined into an ensemble classifier. Thus, the classifier is estimated approximately as:

$$H(x) = \operatorname{sign}\Big( \frac{1}{J} \sum_{j=1}^{J} h_j(x) - \theta \Big)$$

where $\theta$ is the threshold of each weak fern classifier, usually set to 0.5. The next section introduces the weak classifier selection strategy and the online selector fern algorithm.
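A minimal sketch of the per-fern posterior histograms and the ensemble vote described above. The Laplace-style smoothing constant `u` and the averaging rule are our assumptions standing in for the paper's (unrecovered) formulation.

```python
import numpy as np

class FernClassifier:
    """Posterior histograms for one fern: P(y=1 | code), smoothed so that
    unseen codes fall back to 0.5 rather than 0 or 1."""

    def __init__(self, S, u=1.0):
        self.pos = np.zeros(2 ** S)  # positive-sample counts per code
        self.neg = np.zeros(2 ** S)  # negative-sample counts per code
        self.u = u                   # smoothing factor (assumed)

    def update(self, code, label):
        """Online update: increment the count of this code's class."""
        (self.pos if label == 1 else self.neg)[code] += 1

    def posterior(self, code):
        p, n, u = self.pos[code], self.neg[code], self.u
        return (p + u) / (p + n + 2 * u)

def ensemble_predict(ferns, codes, theta=0.5):
    """Average the per-fern posteriors and threshold at theta."""
    avg = sum(f.posterior(c) for f, c in zip(ferns, codes)) / len(ferns)
    return 1 if avg > theta else 0
```

Online learning of a weak classifier then amounts to a constant-time histogram increment per sample, which is what makes the fern ensemble cheap to update on the fly.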

4.2 Online Selector Fern(OSF)

From Section 4.1, the fern-based weak classifier is more discriminant than a single-feature-based weak classifier, because every fern is a group of features of fixed size $S$. Moreover, the online learning process of each weak classifier is simplified to updating the posterior probability of every fern. However, in the online learning process we must find a method to select the most discriminant fern classifier and thereby minimize the first item in equation (1), the cost function of the Generative-Discriminative model. We denote a set of fern-based weak classifiers $\{h_1, \dots, h_J\}$ and a selector:

$$h^{sel}(x) = h_{j^*}(x)$$

where $j^*$ is chosen according to an optimization criterion. In this paper, we use the criterion of picking the fern that minimizes the following Bhattacharyya distance between the code distributions of the object class $C_{+}$ and the background $C_{-}$:

$$B_j = \sum_{k=1}^{2^S} \sqrt{ P_j(k \mid C_{+})\, P_j(k \mid C_{-}) }, \qquad j^* = \arg\min_j B_j$$
As shown in Fig. 3(c), a fixed set of selectors is initialized randomly, each with its own fixed set of fern-based weak classifiers. When the weak classifiers of a selector receive a new training sample, they are updated by changing the posterior distribution of each fern according to the sample's location in the $2^S$-dimensional feature space partitioned by that fern. The weak classifier with the smallest Bhattacharyya distance is then selected. This procedure is repeated for all selectors, and a strong classifier is obtained by a linear combination of the selectors:

$$H(x) = \operatorname{sign}\Big( \sum_{n} \alpha_n\, h^{sel}_n(x) \Big)$$
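The selection step can be sketched as below. As we read the criterion, the overlap $\sum_k \sqrt{p_k q_k}$ between the class and background code histograms is minimized; the normalization guard is our addition.

```python
import numpy as np

def bhattacharyya(p, q):
    """Bhattacharyya coefficient between two normalized histograms;
    smaller values mean better-separated class distributions."""
    return float(np.sum(np.sqrt(p * q)))

def select_weak_classifier(pos_hists, neg_hists):
    """Pick the fern whose object/background code distributions overlap
    least, i.e. the one with minimal Bhattacharyya coefficient."""
    scores = []
    for p, n in zip(pos_hists, neg_hists):
        p = p / max(p.sum(), 1e-12)   # normalize counts to distributions
        n = n / max(n.sum(), 1e-12)
        scores.append(bhattacharyya(p, n))
    return int(np.argmin(scores))
```

Running this per selector, on that selector's own pool of ferns, yields the weak classifier each selector contributes to the strong classifier.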
We conducted an extensive set of experiments to evaluate the performance of the online selector fern for vehicle detection when learning from human-labelled online samples, as shown in Fig. 5. However, for a self-learning object detection system, the online training samples must be collected and labelled autonomously. In this paper, we employ an unsupervised discriminative model to label the online training samples and construct the online gradual learning algorithm without human-annotated training data.

5 Unsupervised iterative SVM

A standard SVM classifier for the two-class problem can be defined as:

$$\min_{w, b, \xi} \ \frac{1}{2}\|w\|^2 + C \sum_{i=1}^{m} \xi_i \quad \text{s.t.} \quad y_i (w^{\top} x_i + b) \ge 1 - \xi_i, \ \ \xi_i \ge 0$$

where $x_i$ is a training sample (feature vector), $y_i \in \{-1, 1\}$ is the label of $x_i$, and $C$ is a regularization constant.

The traditional SVM is a supervised discriminative model. However, the iterative semi-supervised SVM [63] can be trained on labelled and unlabeled samples simultaneously. This learning model can be embedded in our framework, where it evolves into an unsupervised learning algorithm.

As shown in Fig. 2, the initial training samples, generated by affine warping of the several bounding boxes in the first frame, serve as the initial labelled sample set, and the hard samples, automatically collected from the detection responses of the online-selector-fern-based detector, take the place of the unlabeled sample set. Thus, manually labelled samples are no longer needed in the online learning process. The semi-supervised SVM can then be trained in an unsupervised manner, as follows:

First, the initial SVM classifier is trained on the same initial sample set as the online selector fern, which ensures the model is correctly initialized. The HOG feature [2] is employed to train it, different from the LBF feature used in the fern classifier training process. Thus, two different types of features are integrated in our online learning framework, which is crucial for improving detection performance in cluttered environments [3].

Second, we run the classifier on the unlabeled hard sample set and record the predicted labels. The classifier does not only provide classification results: the distance from the separating hyperplane can also be seen as a measure of classification confidence.

Third, denote positive and negative thresholds on the SVM classification confidence. The labeled positive sample set is updated by adding the hard samples with high classification confidence (above the positive threshold), while the labeled negative sample set is updated by adding the hard samples with low classification score (below the negative threshold). This is a more conservative way to update the positive and negative sample sets than the conventional iterative SVM algorithm. Using the updated training set, we train a new model and perform classification on the hard samples again, recording the new predicted labels.

If all the hard-example labels are unchanged, the algorithm stops after the current iteration, and the SVM model and the predicted labels of the hard sample set are the final output. Otherwise, the second and third steps are repeated in the next iteration.
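The iterative self-labeling loop of the three steps above can be sketched as follows. The classifier is abstracted behind `fit`/`score` callables, so any SVM implementation (or, as in the test below, a trivial nearest-centroid stand-in) can be plugged in; the thresholds `tau_pos`/`tau_neg` are assumptions.

```python
def iterative_self_label(train, labels, hard, fit, score,
                         tau_pos=0.5, tau_neg=-0.5, max_iter=20):
    """Iteratively (re)label hard samples until their labels stabilize.

    `fit(X, y)` returns a model; `score(model, x)` is a signed confidence
    (for an SVM, the distance from the separating hyperplane).
    """
    train, labels = list(train), list(labels)
    model = fit(train, labels)
    prev = None
    for _ in range(max_iter):
        pred = [1 if score(model, x) > 0 else -1 for x in hard]
        if pred == prev:
            break                      # labels unchanged: converged
        prev = pred
        # conservative update: only confidently scored hard samples
        # are promoted into the labelled training set
        extra = [(x, 1) for x in hard if score(model, x) > tau_pos]
        extra += [(x, -1) for x in hard if score(model, x) < tau_neg]
        model = fit(train + [x for x, _ in extra],
                    labels + [y for _, y in extra])
    return model, prev
```

With a real SVM, `score` would be the decision-function value; the stopping rule mirrors the "labels unchanged" criterion in the text.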

The iterative learning process is triggered by an online sample-collection module without human labeling, which results in an unsupervised iterative SVM, as shown in Fig. 2. When the online selector ferns are retrained with the convergent iterative SVM and the labeled hard sample set according to the gradual online learning strategy, we obtain a Generative-Discriminative model, consisting of a generative and a discriminative component, that is more dedicated to the problematic samples. As a consequence, the updated classifier puts more emphasis on the most distinctive parts of the object, which reduces the global classification error.

6 Online Gradual Learning algorithm

With the online gradual learning algorithm, a detector with poor performance is permitted at the beginning of our online learning process and improves by iteratively learning the hard samples located close to the decision boundary, which is the key idea behind our proposed framework.

Let $H_0$ denote the initial OSF-classifier-based detector, which is first applied to the target video by sliding-window search. All detected visual examples are collected and divided into the positive sample set $S_p$, the hard sample set $S_h$ and the negative sample set $S_n$ based on the confidence calculated by the OSF classifier:

$$x \in \begin{cases} S_p, & H(x) > \theta_0 + \theta \\ S_h, & \theta_0 - \theta \le H(x) \le \theta_0 + \theta \\ S_n, & H(x) < \theta_0 - \theta \end{cases} \qquad (12)$$

where $\theta_0$ is the decision hyperplane threshold, and $\theta_0 + \theta$ and $\theta_0 - \theta$ are the positive and negative boundaries around it, with a large margin $\theta$ in the initial stage. Thus, the online selector fern classifier becomes a dual-boundary classifier, and most detected visual examples located between the two boundaries are collected as hard samples with uncertain labels.
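The dual-boundary partition of detection responses can be sketched as below; `confidence` stands in for the OSF confidence and the two thresholds for the boundaries.

```python
def partition_responses(responses, confidence, theta_pos, theta_neg):
    """Split detection responses by classifier confidence into positive,
    hard and negative sets (the dual-boundary rule described above)."""
    pos, hard, neg = [], [], []
    for r in responses:
        c = confidence(r)
        if c > theta_pos:
            pos.append(r)      # confident object
        elif c < theta_neg:
            neg.append(r)      # confident background
        else:
            hard.append(r)     # uncertain label: kept for the ISVM
    return pos, hard, neg
```

A wide initial gap between `theta_pos` and `theta_neg` keeps the automatically assigned positive/negative labels reliable, at the price of a large hard set that the ISVM must resolve.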

To minimize the third term of the cost function in the Generative-Discriminative model and obtain a robust video-specific detector, a learning process that gradually reduces the margin is necessary, which we call online gradual learning.

From equation (12), the parameter determining the margin between the dual boundaries can be reduced by equation (13):


where the sensitivity parameter controls the learning speed of the dual-boundary OSF classifier (set to 0.85 in our experiments), and the performance term measures the accuracy of the OSF classifier, making the margin-reduction process adaptive to the classifier learning process. The performance term is computed by equation (14):


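Since equations (13) and (14) are not reproduced in this text, the following sketch only illustrates the qualitative behaviour described above: the margin between the dual boundaries shrinks over iterations, with the sensitivity parameter (0.85 in our experiments) scaled by a performance term in [0, 1], so that the reduction adapts to how well the OSF classifier is doing. The multiplicative update rule below is an assumption, not the paper's formula.

```python
def update_margin(margin, e_t, gamma=0.85):
    """Shrink the dual-boundary margin, adaptively to performance.

    Assumed behaviour: a perfectly performing classifier (e_t = 1)
    shrinks the margin by the full factor gamma; a poorly performing
    one (e_t = 0) leaves the margin unchanged.
    """
    assert 0.0 <= e_t <= 1.0
    return margin * (1.0 - gamma * e_t)
```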
The overall online gradual learning process is shown in Table 1.

Table 1: The online gradual learning algorithm.
Input: The initial training sample set L0={(x1,y1),…,(xm,ym)}, generated from the affine warping of several bounding boxes in the first video frame; an empty hard sample set; an initialized online dual-boundary OSF classifier; and initialized parameters, including the margin and the iteration counter t.
Output: A Generative-Discriminative model composed of an ISVM model and a dual-boundary OSF classifier.
Train the initial SVM model from the initial training sample set L0.
while (the margin has not converged)
 – use the current detector to detect objects in the video
 – collect the positive, hard and negative sample sets, and count the collected hard samples
 – if (enough hard samples have been collected)
   – train the unsupervised iterative SVM on the current training set and the hard sample set
   – label the hard sample set with the convergent ISVM
   – update the dual-boundary OSF classifier with the labeled samples
   – classify the hard samples with the updated OSF classifier
   – calculate the performance term by equation (14)
   – update the margin by equation (13)
   – clear the hard sample set
   – t = t + 1
 – end if
end while

The output of online gradual learning is a Generative-Discriminative model that integrates an ISVM model and an OSF classifier with dual decision boundaries. When the Generative-Discriminative model is used to detect objects of an individual class in a video, most candidate windows, generated by a sliding-window strategy, are classified by the OSF classifier. Only a small fraction, located between the positive and negative boundaries, is classified by the ISVM model. This multiple-classifier system exploits the strengths of the individual classifiers by first partitioning the sample space and, second, assigning to each partition region an individual classifier that achieves high classification accuracy while being efficient at run-time. Moreover, the online gradual learning process is fully autonomous and can be extended to other surveillance scenes or object-class detection tasks. The important multi-view object detection problem can then be addressed by combining multiple self-learned single-view detectors.
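The run-time cascade described above can be sketched as follows; `osf_confidence` and `svm_predict` are placeholders for the two trained models, and the thresholds stand for the dual boundaries.

```python
def detect(windows, osf_confidence, svm_predict, theta_pos, theta_neg):
    """Two-stage hybrid classification of candidate windows.

    The cheap fern-based OSF classifier decides most windows outright;
    only windows falling between its dual boundaries are deferred to
    the (slower) SVM stage.
    """
    detections = []
    for w in windows:
        c = osf_confidence(w)
        if c >= theta_pos:
            detections.append(w)          # confident positive
        elif c > theta_neg:               # hard sample: defer to the SVM
            if svm_predict(w) == +1:
                detections.append(w)
        # else: confident negative, rejected cheaply
    return detections
```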

7 Experiments and Comparisons

The proposed method was evaluated on multi-view vehicle and pedestrian detection problems, which play a key role in current intelligent transportation systems. For the vehicle detection task, the GRAM-RTM dataset [64] and our Vehicle dataset are used to evaluate the approach. The Vehicle dataset was captured and labeled by ourselves. The two datasets, composed of 6 video sequences with different viewpoints and resolution levels, show real urban road scenes with multiple vehicles at the same time. As shown in Fig.4, three sequences from these datasets are used: Hx, Yk and Hi, which have 6415, 1663 and 7520 frames, respectively, at different resolutions. Hx has 912 ground-truth (GT) instances of the vehicle, whereas Yk and Hi have 344 and 2089 GT instances, respectively. To evaluate multi-view pedestrian detection performance, four sequences from the well-known public CAVIAR [35] and PETS2009 [36] datasets are used: WalkByShop1front.avi (Shop), OneShopLeave2Enter (Enter), Meet-Crowd.avi (Walk), and S2.L1View-001.avi (S2), which have different appearances due to different imaging viewpoints, as shown in Fig.4. The ground truth is available at [35, 36].

Fig. 4: Multi-view video sequences.

In each experiment, we trigger the video-specific object learning algorithm with several bounding boxes in the first frame, so the method can conveniently be extended to each video sequence. As for the parameters that define our Generative-Discriminative model, we note that in all the experiments described in the following sections, the dual-boundary OSF classifier has 10 selectors, with the initial margin parameter set to 1. Each selector uses 10 random fern classifiers with 6 binary local features. When training the unsupervised ISVM classifier online, HOG features are employed to describe the samples, which are divided into cells, with each group of cells integrated into a block. Among the remaining parameters, the sensitivity parameter of the online gradual learning process, which controls the update speed of the online dual-boundary OSF classifier, is set to 0.85.
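To make the fern parameters above concrete, the sketch below shows how a single random fern scores a patch: it evaluates a fixed list of binary pixel comparisons, packs the bits into a leaf index, and looks up a posterior learned by counting. In the real classifier the comparison pairs are drawn at random per fern; here they are passed in explicitly so the example is deterministic, and the Laplace smoothing is an assumption.

```python
class Fern:
    """One random fern of the OSF classifier (simplified sketch)."""

    def __init__(self, pairs):
        self.pairs = pairs                 # [((r1, c1), (r2, c2)), ...]
        n_leaves = 1 << len(pairs)         # 2^n leaves for n binary features
        self.pos = [1] * n_leaves          # Laplace-smoothed counts
        self.neg = [1] * n_leaves

    def leaf(self, patch):
        """Pack the binary comparison outcomes into a leaf index."""
        idx = 0
        for (r1, c1), (r2, c2) in self.pairs:
            idx = (idx << 1) | (1 if patch[r1][c1] > patch[r2][c2] else 0)
        return idx

    def train(self, patch, label):
        table = self.pos if label == +1 else self.neg
        table[self.leaf(patch)] += 1

    def confidence(self, patch):
        i = self.leaf(patch)
        return self.pos[i] / (self.pos[i] + self.neg[i])
```

In the OSF classifier, 6 such binary features per fern and 10 ferns per selector would be combined by averaging the per-fern confidences.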

In traditional object detection methods, detection performance varies with the scale of the detector, which makes it difficult to set appropriate detection scales for different surveillance videos. In our framework, however, we can conveniently determine the optimal detection scales for every test video from the bounding boxes in the first frame, which give the accurate object size in the surveillance video. From this reference size, we generate 11 different scales and achieve robust detection performance in each test video.
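One way to derive the 11 scales from the first-frame bounding box is a small geometric pyramid centered on the annotated size; the geometric step of 1.1 used here is an assumption for illustration, not a value from the paper.

```python
def detection_scales(box_w, box_h, n_scales=11, step=1.1):
    """Build a scale pyramid around the first-frame bounding-box size."""
    half = n_scales // 2
    return [(box_w * step ** k, box_h * step ** k)
            for k in range(-half, half + 1)]
```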

In the experiments, our approach demonstrates state-of-the-art detection performance on the test video for each viewpoint after a self-training process of no more than 5 hours, learning about 400-1500 samples without any human effort, as shown in Tab.2.

7.1 Online Generative-Discriminative Model Structure and Parameter Test

Fig. 5: Experiments on analyzing our framework structure.

We initially analyze the object detection performance of the proposed framework on the sequence Hx from the Vehicle dataset, which reveals the influence of different hybrid classifier strategies and of human annotation. As shown in Fig.5, these classifiers were initialized with the same sample set, generated by affine warping of several bounding boxes in the first frame of the sequence Hx, but have different online learning processes and classifier systems. Here, we evaluate five different alternatives:

Human fern: the vehicles are detected by the OSF classifier alone, trained online on 850 human-labeled frames, comprising 246 positive samples and 500 negative samples collected from the sequence Hx.

Human fern SVM: the detector is a Generative-Discriminative model consisting of a dual-boundary OSF classifier and an ISVM model, trained in a supervised manner on 850 frames, including 246 manually annotated positive samples and 500 negative samples.

Our approach: the detector is a Generative-Discriminative model composed of a dual-boundary OSF classifier and an ISVM classifier. The OSF classifier is applied first to detect objects with a sliding-window search strategy, and the ISVM then focuses on recognizing the hard samples distributed around the decision boundary. Note that the Generative-Discriminative model is obtained by self-learning on 1460 frames, about 336 positive samples and 687 negative samples, all collected and labeled automatically; some collected samples are shown in Fig.6.

Fig. 6: Online collected training data.

Fern 300: the detector is the OSF classifier alone, trained in a supervised manner on 300 frames under human guidance, learning about 160 positive samples and 160 negative samples.

Fern SVM 300: the detector is a Generative-Discriminative model obtained from 300 human-annotated frames, with about 160 positive samples and 160 negative samples.

The ROC curves are shown in Fig.5. The Generative-Discriminative model based detector significantly outperforms a single classifier. The results clearly demonstrate the ability of our Generative-Discriminative model to focus on the most distinctive parts of the object.

Our method has performance comparable to the supervised Generative-Discriminative model (Human fern SVM), demonstrating that our framework, which labels samples with high accuracy, can improve detection performance by focusing learning on the problematic samples located near the decision boundary.

7.2 Multi-view Object Detection in Video Sequences

Further tests check whether the system can self-adjust to viewpoint changes. Our system, initialized from several bounding boxes, was applied to the video sequences Yk and Hi from the Vehicle and GRAM-RTM datasets. For the Yk sequence, the detector self-learns 156 positive samples and 323 negative samples. For the Hi sequence, 244 positive samples and 598 negative samples were automatically collected and labeled to train the video-specific detector online. As shown in Fig.7, the new detector achieves state-of-the-art performance without any human intervention. Next, our framework is applied to multi-view pedestrian detection in the CAVIAR and PETS2009 datasets. The number of online learning samples, the self-learning duration and the detection speed differ for each video, as shown in Tab.2. The trained scene-specific detectors can detect objects in real time on a standard PC (Intel Core i5 3.2 GHz with 4 GB RAM). The detection performance is shown in Fig.7. It is worth noting that all self-learning processes are fully unsupervised, without any prior knowledge or constraints. Our method can easily be employed in other surveillance scenes to form a bottom-up multi-view object detection method.

Fig. 7: Multi-view vehicle and pedestrian detection experiments.
Tab. 2: Online learning samples, self-learning duration and detection speed for each sequence.
Sequence  Positive samples  Negative samples  Duration (s)  Speed (FPS)
Yk        156               323               815           36
Hi        244               598               2751          15
Hx        336               687               25196         10
Shop      281               479               1121          19
Walk      410               311               2728          34
S2        504               994               9787          62

7.3 Comparison with Online and Offline Learning Methods

In this section, the proposed method is compared with three online learning object detection methods [34, 29, 31], which are among the most successful online learning object detection frameworks. In addition, the proposed approach is compared with four supervised methods: Boosted fern [61], ISVM [63], ACF [6] and FernSVM, a supervised Generative-Discriminative model. To train the offline classifiers, 300 positive samples and 900 negative samples were collected and labeled manually from the S2 sequence. For the Hx sequence of the Vehicle dataset, the numbers of supervised positive and negative training samples are 200 and 500, respectively.

The ROC curves are shown in Fig.8 and Fig.9, and the corresponding F-measures are given in Tab.3 and Tab.4. The ACF method outperforms the proposed method on the Hx and S2 sequences. Our approach outperforms the other online object detection methods and achieves detection performance competitive with the supervised methods. We attribute these results primarily to our dedicated strategy for handling the hard samples, which is required to detect objects in cluttered environments. Detection results are shown in Fig.10.
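The F-measures in Tab.3 and Tab.4 are the standard harmonic mean of precision and recall; a minimal computation from raw detection counts:

```python
def f_measure(tp, fp, fn):
    """F-measure from true positives, false positives, false negatives."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)
```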

Fig. 8: Comparison with supervised learning methods.
Fig. 9: Comparison with online learning methods.
Tab. 3: F-measure comparison with online learning methods.
Sequence  Roth [31]  Qi [29]  Sharma [34]  Ours
Shop      0.5955     0.6769   0.7446       0.8225
Enter     0.5634     0.6454   0.6992       0.8061
Tab. 4: F-measure and detection speed comparison with supervised learning methods.
Method             S2 F-Measure  S2 FPS  Hx F-Measure  Hx FPS
Ours               0.9036        62      0.8779        10
ACF [6]            0.9271        14      0.9518        5
ISVM [63]          0.8438        2       0.8624        4
Boosted fern [61]  0.8496        92      0.7320        11
FernSVM            0.8391        62      0.8771        10
Fig. 10: Some detection results using our approach on the Vehicle, GRAM-RTM, CAVIAR and PETS2009 datasets. Note that the Hx, Yk and Hi sequences have high resolution, so an ROI (purple box) was set to improve detection speed; their detection results are shown in the last three rows.

8 Conclusions and Discussions

This paper presents an unsupervised online video object detection framework. In this framework, a Generative-Discriminative model, consisting of an online dual-boundary OSF classifier and an unsupervised ISVM, is trained by an online gradual learning algorithm without any human-labeled samples. The framework can easily be employed in multiple surveillance scenarios, and its hierarchical process results in a scene-specific detector that is more dedicated to the problematic samples. Consequently, the online Generative-Discriminative model puts more emphasis on the most distinctive parts of the object, which reduces the global classification error. This process resembles the autonomous learning of humans. Experimental results show that our approach achieves high accuracy on multi-view vehicle and pedestrian detection tasks.

Future investigations will integrate our self-learning detector with an online learning tracker to form a video-specific multiple-object detection and tracking system which, in an unsupervised manner, will simultaneously improve the performance of both detection and tracking.


This work was supported by the National Natural Science Foundation of China (61302137, 61603357, 61271328 and 61603354), Wuhan “Huanghe Elite Project”, Fundamental Research Funds for National University, China University of Geosciences (Wuhan) (1610491B06), Experimental technology research project, China University of Geosciences (Wuhan) (SJ-201517).


  • [1] P. Viola and M. Jones, “Rapid object detection using a boosted cascade of simple features,” in Computer Vision and Pattern Recognition, 2001. CVPR 2001. Proceedings of the 2001 IEEE Computer Society Conference on, vol. 1.   IEEE, 2001, pp. I–I.
  • [2] N. Dalal and B. Triggs, “Histograms of oriented gradients for human detection,” in Computer Vision and Pattern Recognition, 2005. CVPR 2005. IEEE Computer Society Conference on, vol. 1.   IEEE, 2005, pp. 886–893.
  • [3] X. Wang, T. X. Han, and S. Yan, “An hog-lbp human detector with partial occlusion handling,” in Computer Vision, 2009 IEEE 12th International Conference on.   IEEE, 2009, pp. 32–39.
  • [4] P. F. Felzenszwalb, R. B. Girshick, D. McAllester, and D. Ramanan, “Object detection with discriminatively trained part-based models,” IEEE transactions on pattern analysis and machine intelligence, vol. 32, no. 9, pp. 1627–1645, 2010.
  • [5] B. Li, B. Tian, Y. Li, and D. Wen, “Component-based license plate detection using conditional random field model,” IEEE Transactions on Intelligent Transportation Systems, vol. 14, no. 4, pp. 1690–1699, 2013.
  • [6] P. Dollár, R. Appel, S. Belongie, and P. Perona, “Fast feature pyramids for object detection,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 36, no. 8, pp. 1532–1545, 2014.
  • [7] P. Sermanet, D. Eigen, X. Zhang, M. Mathieu, R. Fergus, and Y. LeCun, “Overfeat: Integrated recognition, localization and detection using convolutional networks,” arXiv preprint arXiv:1312.6229, 2013.
  • [8] J. Xiao, R. Hu, L. Liao, Y. Chen, Z. Wang, and Z. Xiong, “Knowledge-based coding of objects for multisource surveillance video data,” IEEE Transactions on Multimedia, vol. 18, no. 9, pp. 1691–1706, 2016.
  • [9] V. Ramanathan, J. Huang, S. Abu-El-Haija, A. Gorban, K. Murphy, and L. Fei-Fei, “Detecting events and key actors in multi-person videos,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 3043–3053.
  • [10] B.-H. Chen and S.-C. Huang, “An advanced moving object detection algorithm for automatic traffic monitoring in real-world limited bandwidth networks,” IEEE Transactions on Multimedia, vol. 16, no. 3, pp. 837–847, 2014.
  • [11] A. Torralba, K. P. Murphy, and W. T. Freeman, “Sharing visual features for multiclass and multiview object detection,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 29, no. 5, 2007.
  • [12] B. Leibe, A. Leonardis, and B. Schiele, “Robust object detection with interleaved categorization and segmentation,” International journal of computer vision, vol. 77, no. 1-3, pp. 259–289, 2008.
  • [13] J. R. Uijlings, K. E. Van De Sande, T. Gevers, and A. W. Smeulders, “Selective search for object recognition,” International journal of computer vision, vol. 104, no. 2, pp. 154–171, 2013.
  • [14] B. Ko, J.-H. Jung, and J.-Y. Nam, “View-independent object detection using shared local features,” Journal of Visual Languages & Computing, vol. 28, pp. 56–70, 2015.
  • [15] M. Viola, M. J. Jones, and P. Viola, “Fast multi-view face detection,” in Proc. of Computer Vision and Pattern Recognition.   Citeseer, 2003.
  • [16] B. Wu and R. Nevatia, “Cluster boosted tree classifier for multi-view, multi-pose object detection,” in Computer Vision, 2007. ICCV 2007. IEEE 11th International Conference on.   IEEE, 2007, pp. 1–8.
  • [17] B. Wu, H. Ai, C. Huang, and S. Lao, “Fast rotation invariant multi-view face detection based on real adaboost,” in Automatic Face and Gesture Recognition, 2004. Proceedings. Sixth IEEE International Conference on.   IEEE, 2004, pp. 79–84.
  • [18] X. Zhu and D. Ramanan, “Face detection, pose estimation, and landmark localization in the wild,” in Computer Vision and Pattern Recognition (CVPR), 2012 IEEE Conference on.   IEEE, 2012, pp. 2879–2886.
  • [19] D. Hoiem, C. Rother, and J. Winn, “3d layoutcrf for multi-view object class recognition and segmentation,” in Computer Vision and Pattern Recognition, 2007. CVPR’07. IEEE Conference on.   IEEE, 2007, pp. 1–8.
  • [20] N. Razavi, J. Gall, and L. Van Gool, “Backprojection revisited: Scalable multi-view object detection and similarity metrics for detections,” Computer Vision–ECCV 2010, pp. 620–633, 2010.
  • [21] E. Seemann, B. Leibe, and B. Schiele, “Multi-aspect detection of articulated objects,” in Computer Vision and Pattern Recognition, 2006 IEEE Computer Society Conference on, vol. 2.   IEEE, 2006, pp. 1582–1588.
  • [22] A. Thomas, V. Ferrar, B. Leibe, T. Tuytelaars, B. Schiel, and L. Van Gool, “Towards multi-view object class detection,” in Computer Vision and Pattern Recognition, 2006 IEEE Computer Society Conference on, vol. 2.   IEEE, 2006, pp. 1589–1596.
  • [23] D.-Q. Zhang and S.-F. Chang, “A generative-discriminative hybrid method for multi-view object detection,” in Computer Vision and Pattern Recognition, 2006 IEEE Computer Society Conference on, vol. 2.   IEEE, 2006, pp. 2017–2024.
  • [24] S. Savarese and L. Fei-Fei, “3d generic object categorization, localization and pose estimation,” in Computer Vision, 2007. ICCV 2007. IEEE 11th International Conference on.   IEEE, 2007, pp. 1–8.
  • [25] M. Wang, W. Li, and X. Wang, “Transferring a generic pedestrian detector towards specific scenes,” in Computer Vision and Pattern Recognition (CVPR), 2012 IEEE Conference on.   IEEE, 2012, pp. 3274–3281.
  • [26] J. Pang, Q. Huang, S. Yan, S. Jiang, and L. Qin, “Transferring boosted detectors towards viewpoint and scene adaptiveness,” IEEE transactions on image processing, vol. 20, no. 5, pp. 1388–1400, 2011.
  • [27] M. T. Rosenstein, Z. Marx, L. P. Kaelbling, and T. G. Dietterich, “To transfer or not to transfer,” in NIPS 2005 Workshop on Transfer Learning, vol. 898, 2005.
  • [28] O. Javed, S. Ali, and M. Shah, “Online detection and classification of moving objects using progressively improving detectors,” in Computer Vision and Pattern Recognition, 2005. CVPR 2005. IEEE Computer Society Conference on, vol. 1.   IEEE, 2005, pp. 696–701.
  • [29] Z. Qi, Y. Xu, L. Wang, and Y. Song, “Online multiple instance boosting for object detection,” Neurocomputing, vol. 74, no. 10, pp. 1769–1775, 2011.
  • [30] V. Nair and J. J. Clark, “An unsupervised, online learning framework for moving object detection,” in Computer Vision and Pattern Recognition, 2004. CVPR 2004. Proceedings of the 2004 IEEE Computer Society Conference on, vol. 2.   IEEE, 2004, pp. II–II.
  • [31] P. M. Roth, H. Grabner, H. Bischof, D. Skocaj, and A. Leonardist, “On-line conservative learning for person detection,” in Visual Surveillance and Performance Evaluation of Tracking and Surveillance, 2005. 2nd Joint IEEE International Workshop on.   IEEE, 2005, pp. 223–230.
  • [32] X. Wang, G. Hua, and T. X. Han, “Detection by detections: Non-parametric detector adaptation for a video,” in Computer Vision and Pattern Recognition (CVPR), 2012 IEEE Conference on.   IEEE, 2012, pp. 350–357.
  • [33] P. Sharma, C. Huang, and R. Nevatia, “Unsupervised incremental learning for improved object detection in a video,” in Computer Vision and Pattern Recognition (CVPR), 2012 IEEE Conference on.   IEEE, 2012, pp. 3298–3305.
  • [34] P. Sharma and R. Nevatia, “Efficient detector adaptation for object detection in a video,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2013, pp. 3254–3261.
  • [35] CAVIAR: Context Aware Vision using Image-based Active Recognition, EC Funded CAVIAR project/IST 2001 37540. Dataset available at http://homepages.inf.ed.ac.uk/rbf/CAVIAR/.
  • [36] J. Ferryman and A. Shahrokni, “Pets2009: Dataset and challenge,” in Performance Evaluation of Tracking and Surveillance (PETS-Winter), 2009 Twelfth IEEE International Workshop on.   IEEE, 2009, pp. 1–6.
  • [37] Z.-G. Fan and B.-L. Lu, “Fast recognition of multi-view faces with feature selection,” in Computer vision, 2005. ICCV 2005. Tenth IEEE international conference on, vol. 1.   IEEE, 2005, pp. 76–81.
  • [38] J. Ng and S. Gong, “Multi-view face detection and pose estimation using a composite support vector machine across the view sphere,” in Recognition, Analysis, and Tracking of Faces and Gestures in Real-Time Systems, 1999. Proceedings. International Workshop on.   IEEE, 1999, pp. 14–21.
  • [39] M. Weber, W. Einhauser, M. Welling, and P. Perona, “Viewpoint-invariant learning and detection of human heads,” in Automatic Face and Gesture Recognition, 2000. Proceedings. Fourth IEEE International Conference on.   IEEE, 2000, pp. 20–27.
  • [40] S. Z. Li and Z. Zhang, “Floatboost learning and statistical face detection,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 26, no. 9, pp. 1112–1123, 2004.
  • [41] M. Stark, M. Goesele, and B. Schiele, “Back to the future: Learning shape models from 3d cad data.” in BMVC, vol. 2, no. 4.   Citeseer, 2010, p. 5.
  • [42] R. J. López-Sastre, T. Tuytelaars, and S. Savarese, “Deformable part models revisited: A performance evaluation for object category pose estimation,” in Computer Vision Workshops (ICCV Workshops), 2011 IEEE International Conference on.   IEEE, 2011, pp. 1052–1059.
  • [43] C. Gu and X. Ren, “Discriminative mixture-of-templates for viewpoint classification,” Computer Vision–ECCV 2010, pp. 408–421, 2010.
  • [44] C. M. Christoudias, R. Urtasun, and T. Darrell, “Unsupervised feature selection via distributed coding for multi-view object recognition,” in Computer Vision and Pattern Recognition, 2008. CVPR 2008. IEEE Conference on.   IEEE, 2008, pp. 1–8.
  • [45] B. Pepik, M. Stark, P. Gehler, and B. Schiele, “Multi-view and 3d deformable part models,” IEEE transactions on pattern analysis and machine intelligence, vol. 37, no. 11, pp. 2232–2245, 2015.
  • [46] C. Xu, D. Tao, and C. Xu, “Multi-view learning with incomplete views,” IEEE Transactions on Image Processing, vol. 24, no. 12, pp. 5812–5825, 2015.
  • [47] G. Shu, A. Dehghan, and M. Shah, “Improving an object detector and extracting regions using superpixels,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2013, pp. 3721–3727.
  • [48] A. Levin, P. A. Viola, and Y. Freund, “Unsupervised improvement of visual detectors using co-training.” in ICCV, 2003, pp. 626–633.
  • [49] Y. Yang, G. Shu, and M. Shah, “Semi-supervised learning of feature hierarchies for object detection in a video,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2013, pp. 1650–1657.
  • [50] A. Prest, C. Leistner, J. Civera, C. Schmid, and V. Ferrari, “Learning object class detectors from weakly annotated video,” in Computer Vision and Pattern Recognition (CVPR), 2012 IEEE Conference on.   IEEE, 2012, pp. 3282–3289.
  • [51] R. G. Cinbis, J. Verbeek, and C. Schmid, “Weakly supervised object localization with multi-fold multiple instance learning,” IEEE transactions on pattern analysis and machine intelligence, vol. 39, no. 1, pp. 189–203, 2017.
  • [52] Y. Liu, Y. Wang, A. Sowmya, and F. Chen, “Soft hough forest-erts: Generalized hough transform based object detection from soft-labelled training data,” Pattern Recognition, vol. 60, pp. 145–156, 2016.
  • [53] K. Kumar Singh, F. Xiao, and Y. Jae Lee, “Track and transfer: Watching videos to simulate strong human supervision for weakly-supervised object detection,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 3548–3556.
  • [54] Z. Kalal, J. Matas, and K. Mikolajczyk, “Pn learning: Bootstrapping binary classifiers by structural constraints,” in Computer Vision and Pattern Recognition (CVPR), 2010 IEEE Conference on.   IEEE, 2010, pp. 49–56.
  • [55] Z. Kalal, K. Mikolajczyk, and J. Matas, “Tracking-learning-detection,” IEEE transactions on pattern analysis and machine intelligence, vol. 34, no. 7, pp. 1409–1422, 2012.
  • [56] H. Celik, A. Hanjalic, and E. A. Hendriks, “Unsupervised and simultaneous training of multiple object detectors from unlabeled surveillance video,” Computer Vision and Image Understanding, vol. 113, no. 10, pp. 1076–1094, 2009.
  • [57] I. Huerta, M. Pedersoli, J. Gonzàlez, and A. Sanfeliu, “Combining where and what in change detection for unsupervised foreground learning in surveillance,” Pattern Recognition, vol. 48, no. 3, pp. 709–719, 2015.
  • [58] Y. Amit, 2D object detection and recognition: Models, algorithms, and networks.   MIT Press, 2002.
  • [59] M. Ozuysal, P. Fua, and V. Lepetit, “Fast keypoint recognition in ten lines of code,” in Computer Vision and Pattern Recognition, 2007. CVPR’07. IEEE Conference on.   IEEE, 2007, pp. 1–8.
  • [60] M. Villamizar, F. Moreno-Noguer, J. Andrade-Cetto, and A. Sanfeliu, “Efficient rotation invariant object detection using boosted random ferns,” in Computer Vision and Pattern Recognition (CVPR), 2010 IEEE Conference on.   IEEE, 2010, pp. 1038–1045.
  • [61] M. Villamizar, J. Andrade-Cetto, A. Sanfeliu, and F. Moreno-Noguer, “Bootstrapping boosted random ferns for discriminative and efficient object classification,” Pattern Recognition, vol. 45, no. 9, pp. 3141–3153, 2012.
  • [62] D. Levi, S. Silberstein, and A. Bar-Hillel, “Fast multiple-part based object detection using kd-ferns,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2013, pp. 947–954.
  • [63] A. R. Shah, C. S. Oehmen, and B.-J. Webb-Robertson, “Svm-hustle—an iterative semi-supervised machine learning approach for pairwise protein remote homology detection,” Bioinformatics, vol. 24, no. 6, pp. 783–790, 2008.
  • [64] R. Guerrero-Gómez-Olmedo, R. J. López-Sastre, S. Maldonado-Bascón, and A. Fernández-Caballero, “Vehicle tracking by simultaneous detection and viewpoint estimation,” in International Work-Conference on the Interplay Between Natural and Artificial Computation.   Springer, 2013, pp. 306–316.