A Fully-Convolutional Neural Network for Background Subtraction of Unseen Videos

M. Ozan Tezcan, Janusz Konrad, Prakash Ishwar
Department of Electrical and Computer Engineering
Boston University
Boston, MA
[mtezcan, jkonrad, pi]@bu.edu

Background subtraction is a basic task in computer vision and video processing often applied as a pre-processing step for object tracking, people recognition, etc. Recently, a number of successful background subtraction algorithms have been proposed; however, nearly all of the top-performing ones are supervised. Crucially, their success relies upon the availability of some annotated frames of the test video during training. Consequently, their performance on completely “unseen” videos is undocumented in the literature. In this work, we propose a new, supervised, background-subtraction algorithm for unseen videos (BSUV-Net) based on a fully-convolutional neural network. The input to our network consists of the current frame and two background frames captured at different time scales along with their semantic segmentation maps. In order to reduce the chance of overfitting, we also introduce a new data-augmentation technique which mitigates the impact of illumination difference between the background frames and the current frame. On the CDNet-2014 dataset, BSUV-Net outperforms state-of-the-art algorithms evaluated on unseen videos in terms of F-measure, recall and precision metrics.


 A Preprint

July 29, 2019

1 Introduction

Background subtraction (BGS) is a foundational, low-level task in computer vision and video processing. The aim of BGS is to segment an input video frame into regions corresponding to either foreground or background. It is frequently used as a pre-processing step for higher-level tasks such as object tracking, people and motor-vehicle recognition, human activity recognition, etc. Since BGS is often the first pre-processing step, the accuracy of its output has an overwhelming impact on the overall performance of subsequent steps. Therefore, it is critical that BGS produce as accurate a foreground/background segmentation as possible.

Traditional BGS algorithms are unsupervised and rely on a background model to predict foreground regions [1, 2, 3, 4, 5, 6, 7, 8, 9]. SuBSENSE [7] and WisenetMD [9] are considered to be state-of-the-art unsupervised BGS algorithms. However, since they rely on the accuracy of the background model, they encounter difficulties when applied to complex scenes. Recently, ensemble methods and a method leveraging semantic segmentation have been proposed and significantly outperform traditional algorithms [10, 11, 12].

The success of deep learning in computer vision did not bypass BGS research. A number of supervised deep-learning BGS algorithms have been developed [13, 14, 15, 16, 17, 18, 19, 20] with performance easily surpassing that of traditional methods. However, most of these algorithms have been tuned to either one specific video or to a group of similar videos, and their performance on unseen videos has not been evaluated. For example, FgSegNet [18] uses 200 frames from a test video for training and the remaining frames from the same video for evaluation. If applied to an unseen video, its performance drops significantly (Section 4.3).

In this paper, we introduce Background Subtraction for Unseen Videos (BSUV-Net), a fully-convolutional neural network for predicting foreground of an unseen video. A key feature of our approach is that the training and test sets are composed of frames originating from different videos. This guarantees that no ground-truth data from the test videos have been shown to the network in the training phase. Consequently, our network’s performance on unseen videos is expected to be on par with that of the test videos. By employing two reference backgrounds at different time scales, BSUV-Net addresses two challenges often encountered in BGS: varying scene illumination and intermittently-static objects that tend to get absorbed into the background. We also propose novel data augmentation which further improves our method’s performance under varying illumination. Furthermore, motivated by recent work on the use of semantic segmentation in BGS [12], we improve our method’s accuracy by inputting semantic information along with the reference backgrounds and current frame. The main contributions of our work are as follows:

  1. Supervised BGS for Unseen Videos: Although supervised algorithms, especially neural networks, have significantly improved BGS performance, they are tuned to a specific video and thus their performance on unseen videos deteriorates dramatically. To the best of our knowledge, BSUV-Net is the first supervised BGS algorithm that is truly generalizable to unseen videos.

  2. Data Augmentation for Increased Resilience to Varying Illumination: Changes in scene illumination pose a major challenge to BGS algorithms. To mitigate this, we develop a simple, yet effective, data augmentation technique. Using a simple additive illumination model, we differently “illuminate” the current frame and the reference background frames that are fed into BSUV-Net in training. This enables us to effectively tackle various illumination change scenarios that may be present in test videos.

  3. Leveraging Semantic and Multiple Time-Scale Information: BSUV-Net improves foreground-boundary segmentation accuracy by accepting semantic information as one of its inputs. This is unlike in an earlier BGS method [12] which used semantic information as a post-processing step. The other network inputs are the current frame (to be segmented) and a two-frame background model capturing different time scales. While one background frame based on distant history helps with the discovery of intermittently-static objects, the other frame based on recent history is key for handling dynamic factors such as illumination changes.

Based on our extensive experiments on the CDNet-2014 dataset [21], BSUV-Net outperforms state-of-the-art BGS algorithms evaluated on unseen videos.

2 Related Work

A wide range of BGS algorithms have been developed in the past. Since this is not a survey paper, we will not cover all BGS variants. Instead, we will focus only on recent top-performing methods. We divide these algorithms into three categories: (i) BGS by (unsupervised) background modeling, (ii) supervised BGS tuned to a single video or a group of videos, (iii) improving BGS algorithms by post-processing.

2.1 BGS by Background Modeling

Nearly all traditional BGS algorithms first compute a background model, and then use it to predict the foreground. While a simple model based on the mean or median of a subset of preceding frames offers only a single background value per pixel, a probabilistic Gaussian Mixture Model (GMM) [1] allows a range of background values. This idea was improved by creating an online procedure for the update of GMM parameters in a pixel-wise manner [2]. Kernel Density Estimation (KDE) was introduced into BGS [3] as a non-parametric alternative to GMMs and was subsequently improved [4]. The probabilistic methods achieve better performance compared to single-value models for dynamic scenes and scenes with small background changes.

In [5], Barnich and Droogenbroeck introduced a sample-based background model. Instead of implementing a probability model, they modeled the background by a set of sample values per pixel and used a distance-based model to decide whether a pixel should be classified as background or foreground. Since color information alone is not sufficient for complex cases, such as illumination changes, Bilodeau et al. introduced Local Binary Similarity Patterns (LBSP) to compare the current frame and background using spatio-temporal features instead of color [22]. St-Charles et al. combined color and texture information, and introduced a word-based approach, PAWCS [6]. They considered pixels as background words and updated each word’s reliability by its persistence. Similarly, SuBSENSE [7] combines LBSP and color features, and employs pixel-level feedback to improve the background model.

Recently, Isik et al. introduced SWCD, a pixel-wise, sliding-window approach [8]. They used a dynamic control system to update the background model. Lee et al. introduced WisenetMD, a multi-step algorithm to eliminate false positives in dynamic backgrounds [9].

2.2 Supervised BGS

Although background subtraction has been extensively studied in the past, the definition of a supervised BGS algorithm is still vague. Generally speaking, the aim of a supervised BGS algorithm is to learn the parameters (e.g., neural-network weights) of a complex function in order to minimize a loss function of the labeled training frames. Then, the performance of the algorithm is evaluated on a separate set of test frames. In this section, we divide the supervised BGS algorithms into three groups, namely video-optimized, video-group-optimized and video-agnostic, depending on which frames and videos they use during training and testing.

Several algorithms use some frames from a test video for training and all of the frames of the same video for evaluating the algorithm’s performance on that video. Clearly, parameter values are optimized separately for each video. We will refer to this class of algorithms as video-optimized BGS algorithms. Some other algorithms use randomly-selected frames from a group of test videos for training and all of the frames of the same videos for testing. Since some frames from all test videos are used for training, we will refer to this class of algorithms as video-group-optimized algorithms. Note that in both of these scenarios, the algorithms are neither optimized for nor evaluated on unseen videos and, to the best of our knowledge, all of the top-performing supervised BGS algorithms to date are either video-optimized or video-group-optimized. In this paper, we introduce a new category of supervised BGS algorithms, called video-agnostic algorithms, that can be applied to unseen videos with little or no loss of performance. To learn parameters, a video-agnostic algorithm uses frames from a set of training videos, but for performance evaluation it uses a completely different set of videos.

In recent years, supervised learning algorithms based on convolutional neural networks (CNNs) have been widely applied to BGS. The first CNN-based BGS algorithm was introduced in [13]. This is a video-optimized algorithm which uses a patch-wise approach, taking a fixed-size patch of pixels as input and producing a single foreground probability for the center of the patch. A method proposed in [17] uses a similar approach, but with a modified CNN which accepts input patches of a different size.

Instead of using a patch-wise algorithm, Zeng and Zhu introduced Multiscale Fully-Convolutional Neural Network (MFCN) which can predict the foreground of an input image in one step [16]. Lim and Keles proposed a triplet CNN which uses siamese networks to create features at three different scales and combines these features within a transposed CNN [18]. In a follow-up work, they removed the triplet networks and used dilated convolutions to capture the multiscale information [19]. In [15], Bakkay et al. used generative adversarial networks for BGS. The generator performs the BGS task, whereas the discriminator tries to classify the BGS map as real or fake. Although all these algorithms perform very well on various BGS datasets, it is important to note that they are all video-optimized, thus they will sacrifice performance when tested on unseen videos. In [14], Babaee et al. designed a video-group-optimized CNN for BGS. They randomly selected a subset of CDNet-2014 frames [21] as a training set and developed a single network for all of the videos in this dataset. In [20], Sakkos et al. used a 3D CNN to capture the temporal information in addition to the color information. Similarly to [14], they trained a single algorithm using 70% of frames in CDNet-2014 and then used it to predict the foreground in all videos of the dataset. Note that even these approaches do not generalize to other videos since some ground truth data from each video exists in the training set. Table 1 compares and summarizes the landscape of supervised BGS algorithms and the methodology used for training and evaluation.

As discussed above, none of the CNN-based BGS algorithms to-date have been designed for or tested on unseen videos with no ground truth at all. Although such algorithms can significantly reduce the effort when creating ground truth for a new video, they are not useful for real-world problems since it is not possible to label some frames in each new video.

| Algorithm | Are some frames from test videos used in training? | Training frames | Category |
|---|---|---|---|
| Braham-CNN-BGS [13] | Yes | First half of the labeled frames of the test video | video-optimized |
| MFCNN [16] | Yes | Randomly selected 200 frames from the first 3000 labeled frames of the test video | video-optimized |
| Wang-CNN-BGS [17], FGSegNet [18, 19], BScGAN [15] | Yes | Hand picked 200 labeled frames of the test video | video-optimized |
| Babaee-CNN-BGS [14] | Yes | A randomly selected subset of the labeled frames of all videos | video-group-optimized |
| 3D-CNN-BGS [20] | Yes | 70% of the labeled frames of all videos | video-group-optimized |
| BSUV-Net (ours) | No | No frame from test videos is used in training | video-agnostic |
Table 1: Training/evaluation methodologies of supervised BGS algorithms for CDNet-2014.

2.3 Improving BGS Algorithms by Post-Processing

Over the last few years, many deep-learning-based algorithms were developed for the problem of semantic segmentation and they achieved state-of-the-art performance. In [12], Braham and Droogenbroeck introduced a post-processing step for BGS algorithms based on semantic segmentation predictions. Given an input frame, they predicted a segmentation map using PSPNet [23] and obtained pixel-wise probability predictions for semantic labels such as person, car, animal, house etc. Then, they manually grouped these labels into two sets, foreground and background labels, and used this information to improve any BGS algorithm’s output in a post-processing step. They obtained very competitive results by using SuBSENSE [7] as the BGS algorithm.

As we have pointed out, many different algorithms have been designed to solve the BGS problem, and they all have some advantages and disadvantages. Bianco et al. introduced an algorithm called IUTIS which combines the results produced by several BGS algorithms [10]. They used genetic programming to determine how to combine several BGS algorithms using a sequence of basic binary operations, such as logical and/or, majority voting and median filtering. Their best result was achieved by using 5 top-performing BGS algorithms on the CDNet-2014 dataset at the time of publication. Zeng et al. followed the same idea, but instead of genetic programming used a fully-convolutional neural network to fuse several BGS results into a single output [11], and outperformed IUTIS on CDNet-2014.

3 Proposed Algorithm: BSUV-Net

3.1 Inputs to BSUV-Net

Segmenting an unseen video frame into foreground and background regions without using any information about the background would be an ill-defined problem. In BSUV-Net, we use two reference frames to characterize the background. One frame is an “empty” background frame, with no people or other objects of interest, which can typically be extracted from the beginning of a video. This provides an accurate reference that is very helpful for segmenting intermittently-static objects in the foreground. However, due to dynamic factors, such as illumination variations, this reference may not be valid after some time. To counteract this, we use another reference frame that characterizes recent background, for example by computing the median of 100 frames preceding the frame being processed. However, this frame might not be as accurate as the first reference frame since we cannot guarantee that there will be no foreground objects in it (if such objects are present for less than 50 frames, the temporal median will suppress them). By using two reference frames captured at different time scales, we aim to leverage the benefits of each frame type.
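The recent-background reference described above can be sketched in a few lines. This is a minimal illustration, assuming frames are stored as a NumPy array indexed by time; the function name `recent_background` and the toy video are hypothetical and not taken from the paper.

```python
import numpy as np

def recent_background(frames, t, window=100):
    """Pixel-wise temporal median of up to `window` frames preceding index t."""
    start = max(0, t - window)
    if start == t:
        raise ValueError("no preceding frames available")
    return np.median(frames[start:t], axis=0)

# Toy grayscale video: 120 frames with a constant background of 0.2.
video = np.full((120, 4, 4), 0.2)
# A bright object covers one pixel for 30 frames (fewer than half the window).
video[60:90, 1, 1] = 1.0
bg_short = recent_background(video, t=110)

# The same object staying for 70 frames (more than half the window).
video_long = np.full((120, 4, 4), 0.2)
video_long[40:110, 1, 1] = 1.0
bg_long = recent_background(video_long, t=110)
```

In the first case the object occupies fewer than half of the 100 frames in the window, so the median returns the background value at that pixel. In the second case it occupies more than half, so it leaks into the reference frame.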

It has been shown that leveraging results of semantic segmentation significantly improves the performance of a BGS algorithm [12]. Braham et al. used semantic segmentation results in a post-processing step [12]. In BSUV-Net, we follow a different idea and use semantic information as an additional input channel to our neural network. In this way, we let our network learn how to use this information. To extract semantic segmentation information, we used a state-of-the-art CNN called DeepLabv3 [24] trained on ADE20K [25], an extensive semantic-segmentation dataset with 150 different class labels and more than 20,000 images with dense annotations. Let us denote the set of object classes in ADE20K as $\mathcal{C} = \{c_0, c_1, \dots, c_{149}\}$. Similarly to [12], we divided these classes into two sets: foreground and background objects. As foreground objects, we used person, car, cushion, box, book, boat, bus, truck, bottle, van, bag and bicycle. The rest of the classes are used as background objects. The softmax layer of DeepLabv3 provides pixel-wise class probabilities $p_{c_j}$ for $c_j \in \mathcal{C}$. Let us denote the predicted probability distribution of $\mathbf{I}[m,n]$ as $\{p_{c_j}(\mathbf{I}[m,n])\}_{j=0}^{149}$, where $\mathbf{I}$ is an input frame, and $m, n$ are the spatial indices. Then, we compute a foreground probability map (FPM) $\mathbf{S}[m,n] = \sum_{c_j \in \mathcal{F}} p_{c_j}(\mathbf{I}[m,n])$, where $\mathcal{F} \subset \mathcal{C}$ stands for the set of foreground classes, for the current frame and for two reference frames.
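The foreground probability map amounts to summing, at every pixel, the softmax probabilities of the foreground classes. The sketch below assumes the segmentation network's output is already available as an array of shape (height, width, classes). The class indices in `FOREGROUND_IDS` are hypothetical placeholders, not the actual ADE20K indices of the listed classes, and the random softmax output only stands in for DeepLabv3.

```python
import numpy as np

def foreground_probability_map(class_probs, foreground_ids):
    """Sum per-pixel class probabilities over the foreground classes.

    class_probs: array of shape (H, W, C) holding a softmax output.
    Returns an (H, W) map with values in [0, 1].
    """
    return class_probs[..., list(foreground_ids)].sum(axis=-1)

# Stand-in for a segmentation network's softmax output with 150 classes.
rng = np.random.default_rng(0)
logits = rng.normal(size=(2, 3, 150))
probs = np.exp(logits) / np.exp(logits).sum(axis=-1, keepdims=True)

FOREGROUND_IDS = [12, 20, 80]  # hypothetical indices, for illustration only
fpm = foreground_probability_map(probs, FOREGROUND_IDS)
```

The same function would be applied to the current frame and to both reference frames.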

We use the current, recent and empty frames in color, each along with its FPM, as the input to BSUV-Net (Figure 1). The number of channels in BSUV-Net’s input layer is therefore 12, since each of the three frames contributes 4 channels (R, G, B, FPM).
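A minimal sketch of how such a 12-channel input could be assembled is given below. The channel order (empty, recent, current) and the channels-last layout are assumptions of this sketch; the text does not fix either, and `build_input` is a hypothetical helper name.

```python
import numpy as np

def build_input(empty_rgb, empty_fpm, recent_rgb, recent_fpm,
                current_rgb, current_fpm):
    """Stack three (R, G, B, FPM) groups into one (H, W, 12) array."""
    groups = []
    for rgb, fpm in ((empty_rgb, empty_fpm),
                     (recent_rgb, recent_fpm),
                     (current_rgb, current_fpm)):
        # Append the FPM as a fourth channel next to the color channels.
        groups.append(np.concatenate([rgb, fpm[..., None]], axis=-1))
    return np.concatenate(groups, axis=-1)

h, w = 6, 8
rng = np.random.default_rng(1)
frames = [rng.random((h, w, 3)) for _ in range(3)]
fpms = [rng.random((h, w)) for _ in range(3)]
x = build_input(frames[0], fpms[0], frames[1], fpms[1], frames[2], fpms[2])
```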

3.2 Network Architecture and Loss Function

Figure 1: Network architecture of BSUV-Net. BN stands for Batch Normalization and SD stands for spatial dropout. Grayscale images in the input representation show foreground probability maps (FPM) of the corresponding RGB frames.

We use a U-Net-type [26] fully-convolutional neural network (FCNN) with residual connections. The architecture of BSUV-Net has two parts: encoder and decoder, and is shown in Figure 1. In the encoder network, we use max-pooling operators to decrease the spatial dimensions and in the decoder network, we use up-convolutional layers (transposed convolution with a stride of 2) to increase the dimensions back to those of the input. In all convolutional and up-convolutional layers, we use $3 \times 3$ convolutions as in VGG [27]. The residual connections from the encoder to the decoder help the network combine low-level visual information gained in the initial layers with high-level visual information gained in the deeper layers. Since our aim is to increase the performance on unseen videos, we use strong batch normalization (BN) [28] and spatial dropout (SD) [29] layers to increase the generalization capacity. Specifically, we use a BN layer after each convolutional and up-convolutional layer, and an SD layer before each max-pooling layer. Since our task can be viewed as a binary segmentation, we use a sigmoid layer as the last layer in BSUV-Net. The operation of the overall network can be defined as a nonlinear map $G(\mathbf{W}): \mathbf{X} \rightarrow \widehat{\mathbf{Y}}$, where $\mathbf{X} \in \mathbb{R}^{w \times h \times 12}$ is a 12-channel input, $\mathbf{W}$ represents the parameters of the neural network $G$, and $\widehat{\mathbf{Y}} \in [0,1]^{w \times h}$ is a pixel-wise foreground probability prediction. Note that since this is a fully-convolutional neural network, it does not require a fixed input size; it can be applied to a frame of any size, but some padding may be needed to account for max-pooling operations.
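The padding remark can be made concrete: with k max-pooling stages of stride 2, the height and width need to be multiples of 2^k for the decoder to restore the input size exactly. The sketch below assumes four pooling stages and edge-replication padding. Both are illustrative choices rather than details given in the text, and the helper names are hypothetical.

```python
import numpy as np

def pad_to_multiple(x, num_pools=4):
    """Pad the height and width of an (H, W, C) array up to a multiple of 2**num_pools."""
    m = 2 ** num_pools
    h, w = x.shape[:2]
    pad_h = (-h) % m  # rows to add at the bottom
    pad_w = (-w) % m  # columns to add on the right
    padded = np.pad(x, ((0, pad_h), (0, pad_w), (0, 0)), mode="edge")
    return padded, (h, w)

def crop_back(y, original_size):
    """Remove the padding from an (H', W') network output."""
    h, w = original_size
    return y[:h, :w]

x = np.zeros((50, 70, 12))
x_padded, size = pad_to_multiple(x)
y = crop_back(x_padded[..., 0], size)
```

Cropping the prediction back to the original size afterwards keeps the evaluation on the unpadded frame.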

In most BGS datasets, the number of background pixels is much larger than the number of foreground pixels. This class imbalance creates significant problems for the commonly-used loss functions, such as cross-entropy and mean-squared error. A good alternative for unbalanced binary datasets is the Jaccard Index, but this is not a differentiable metric for probabilistic predictions. Thus, we opted for a relaxed form of the Jaccard index as the loss function, defined as follows.

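$$J_R(\mathbf{Y}, \widehat{\mathbf{Y}}) = \frac{T + \sum_{m,n} \mathbf{Y}[m,n]\,\widehat{\mathbf{Y}}[m,n]}{T + \sum_{m,n} \left( \mathbf{Y}[m,n] + \widehat{\mathbf{Y}}[m,n] - \mathbf{Y}[m,n]\,\widehat{\mathbf{Y}}[m,n] \right)}$$

where $\mathbf{Y}$ is the ground truth of the input $\mathbf{X}$, $\widehat{\mathbf{Y}}$ is the predicted foreground probability map, and $T$ is used as a smoothing parameter.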
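As an illustration, a relaxed Jaccard index can be computed by replacing the set intersection with a sum of products and the set union with a sum of y + p - y*p, which reduces to the usual Jaccard index for binary predictions. In the sketch below, the value of the smoothing constant and the choice to return 1 minus the index (so that lower is better) are assumptions, not details taken from the text.

```python
import numpy as np

def relaxed_jaccard(y_true, y_prob, smooth=1.0):
    """Differentiable Jaccard surrogate for probabilistic predictions."""
    intersection = np.sum(y_true * y_prob)
    union = np.sum(y_true + y_prob - y_true * y_prob)
    return (smooth + intersection) / (smooth + union)

def jaccard_loss(y_true, y_prob, smooth=1.0):
    """Assumed sign convention: 0 for a perfect prediction, larger is worse."""
    return 1.0 - relaxed_jaccard(y_true, y_prob, smooth)

y_true = np.array([[1.0, 0.0], [0.0, 0.0]])
perfect = jaccard_loss(y_true, y_true)
inverted = jaccard_loss(y_true, 1.0 - y_true)
```

Because the index is a ratio of foreground-weighted sums, the large number of background pixels does not dominate it the way it dominates a per-pixel average such as cross-entropy.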

3.3 Resilience to Illumination Change by Data Augmentation

Since neural networks have millions of parameters, they are very prone to overfitting. A widely-used method for reducing overfitting in computer-vision problems is to enlarge the dataset by applying several data augmentations such as random crops, rotations and noise addition. Since we are dealing with videos in this paper, we can also add augmentation in the temporal domain.

In real-life BGS problems, there might be a significant illumination difference between an empty background frame acquired at an earlier time and the current frame. However, only a small portion of videos in CDNet-2014 capture significant illumination changes. Therefore, we introduce a new data-augmentation technique for changing global illumination difference between the empty reference frame and the current frame. Suppose that $\mathbf{R}_E \in \mathbb{R}^{w \times h \times 3}$ represents the RGB channels of an empty reference frame. Then, an augmented version of $\mathbf{R}_E$ can be computed as $\widehat{\mathbf{R}}_E[m,n,c] = \mathbf{R}_E[m,n,c] + \mathbf{d}[c]$ for $c = 1, 2, 3$, where $\mathbf{d} \in \mathbb{R}^3$ represents the per-channel RGB offset in our illumination model. By choosing $\mathbf{d}$ randomly for each example during training, we can make the network resilient to illumination variations.
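A minimal sketch of such an additive illumination augmentation follows. The way the offset is sampled (a shared Gaussian term plus a per-channel Gaussian term), the standard deviations, and the clipping to [0, 1] are all assumptions made for illustration; `sample_offset` and `illumination_shift` are hypothetical names.

```python
import numpy as np

def illumination_shift(frame_rgb, offset):
    """Add one offset per color channel to an (H, W, 3) frame with values in [0, 1]."""
    shifted = frame_rgb + np.asarray(offset, dtype=float).reshape(1, 1, 3)
    return np.clip(shifted, 0.0, 1.0)  # clipping is an assumption of this sketch

def sample_offset(rng, global_std=0.1, channel_std=0.04):
    """Draw a random RGB offset; the distribution and its parameters are placeholders."""
    return rng.normal(0.0, global_std) + rng.normal(0.0, channel_std, size=3)

rng = np.random.default_rng(0)
empty_frame = np.full((4, 5, 3), 0.5)
fixed = illumination_shift(empty_frame, [0.1, 0.0, -0.1])
augmented = illumination_shift(empty_frame, sample_offset(rng))
```

Drawing independent offsets for the reference frames and the current frame exposes the network to illumination gaps that rarely occur in the training videos themselves.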

4 Experimental Results

4.1 Dataset and Evaluation Metrics

In order to evaluate the performance of BSUV-Net, we used CDNet-2014 [21], the largest BGS dataset with 53 natural videos from 11 categories including challenging scenarios such as shadows, night videos, dynamic background, etc. The spatial resolution varies from video to video. Each video has a region of interest in which every pixel is labelled as either 1) foreground, 2) background, 3) hard shadow or 4) unknown motion. When measuring an algorithm’s performance, we ignored pixels with unknown motion label and considered hard-shadow pixels as background.

We used the most common binary-classification performance metrics: precision, recall and F-measure. Since F-measure computes the harmonic mean of precision and recall, it is one of the most informative metrics for highly unbalanced datasets such as CDNet-2014. As we mentioned earlier, BGS is typically used as a pre-processing step for advanced video processing and computer vision tasks. Usually, the goal is to obtain candidate regions using a BGS algorithm and then focus a task at hand on these regions. While false positives produced by a BGS algorithm will result in an unnecessary application of the task to false-positive regions (waste of computational resources), false negatives are more problematic as they may cause misses: the task will not be applied in missed areas at all. Thus, we argue that in BGS algorithms recall is more important than precision.
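The three metrics, together with the label handling described in Section 4.1 (unknown-motion pixels ignored, hard-shadow pixels counted as background), can be sketched as follows. The integer label codes are hypothetical and chosen only for this example; the dataset's own encoding may differ, and pixels outside the region of interest would need to be masked in the same way as unknown-motion pixels.

```python
import numpy as np

# Hypothetical label codes for this sketch.
BACKGROUND, FOREGROUND, HARD_SHADOW, UNKNOWN = 0, 1, 2, 3

def bgs_scores(pred_fg, labels):
    """Return (precision, recall, f_measure) for a boolean foreground prediction."""
    pred_fg = np.asarray(pred_fg, dtype=bool)
    labels = np.asarray(labels)
    valid = labels != UNKNOWN       # unknown-motion pixels are ignored
    gt_fg = labels == FOREGROUND    # hard shadow counts as background
    p, g = pred_fg[valid], gt_fg[valid]
    tp = int(np.sum(p & g))
    fp = int(np.sum(p & ~g))
    fn = int(np.sum(~p & g))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    total = precision + recall
    f_measure = 2 * precision * recall / total if total else 0.0
    return precision, recall, f_measure

labels = [[1, 1, 0, 2], [3, 0, 0, 1]]
pred = [[True, False, False, True], [True, False, True, True]]
precision, recall, f_measure = bgs_scores(pred, labels)
```

In this toy example there are 2 true positives, 2 false positives (one of them on a hard-shadow pixel) and 1 false negative, while the prediction on the unknown-motion pixel does not count.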

4.2 Training and Evaluation Details

As discussed in Section 2.2, we use a video-agnostic evaluation strategy in all experiments. This allows us to measure an algorithm’s performance on real-world-like tasks when no ground-truth labels are available. To evaluate performance on all videos in CDNet-2014, we applied cross-validation with 18 different combinations of training/test videos. Let us denote the $k$-th combination by $(V_{\text{train}}^k, V_{\text{test}}^k)$. Then, $\bigcup_{k=1}^{18} V_{\text{test}}^k$ is equal to the set of all 53 videos in CDNet-2014. During training, we used 200 frames suggested in [16] for each video in $V_{\text{train}}^k$.
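The video-agnostic protocol can be illustrated with a simple split generator in which no test video ever appears in the corresponding training set and the test sets together cover every video. The round-robin grouping and the video names below are hypothetical; the text does not state how the 18 combinations were formed.

```python
def make_video_splits(videos, num_splits=18):
    """Return a list of (train_videos, test_videos) pairs with disjoint members."""
    # Hypothetical grouping: deal the videos round-robin into num_splits test sets.
    test_sets = [videos[i::num_splits] for i in range(num_splits)]
    splits = []
    for test in test_sets:
        held_out = set(test)
        train = [v for v in videos if v not in held_out]
        splits.append((train, test))
    return splits

videos = [f"video_{i:02d}" for i in range(53)]  # placeholder names
splits = make_video_splits(videos)
covered = set()
for train, test in splits:
    covered.update(test)
```

Splitting by video rather than by frame is what separates this protocol from the video-optimized and video-group-optimized settings of Section 2.2.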

When training with different sets $V_{\text{train}}^k$, we kept exactly the same hyperparameters to make sure that we are not tuning our network to specific videos. In all of our experiments, we used the ADAM optimizer with a single, fixed set of optimizer settings. The minibatch size was 8 and we trained for 50 epochs. As the empty background frame, whenever possible, we used the median of up to the first 100 frames of a video. However, in some videos the first few frames did not capture an empty background scene. For those videos, we hand-picked empty frames (e.g., in groups) and used their median as the empty reference. Since there is no single empty background frame in videos from the pan-tilt-zoom (PTZ) category, we slightly changed the inputs. Instead of the “empty background + recent background” pair, we used a “recent background + more recent background” pair, where the recent background is computed as the median of 100 preceding frames and the more recent background is computed as the median of 30 preceding frames.

Although BSUV-Net can accept frames of any spatial dimension, in the training process we used random crops of a fixed size to leverage parallel GPU processing. We applied random data augmentation at the beginning of each epoch. Data augmentation for illumination resilience (Section 3.3) is applied with randomly sampled illumination offsets, assuming double precision for pixel values. We also added random Gaussian noise to each frame.

In the evaluation step, we did not apply any scaling or cropping to the inputs. To obtain binary maps, we thresholded the output of the sigmoid layer of BSUV-Net. Finally, we applied binary opening followed by binary closing with a circular structuring element.
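The post-processing chain (thresholding, then binary opening, then binary closing) can be sketched with plain NumPy as below. The threshold of 0.5, the radius of the structuring element, and the treatment of pixels outside the image are assumptions of this sketch. A library routine such as SciPy's `ndimage.binary_opening` would normally be used instead of the hand-written morphology.

```python
import numpy as np

def disk(radius):
    """Boolean circular structuring element of the given radius."""
    r = np.arange(-radius, radius + 1)
    return (r[:, None] ** 2 + r[None, :] ** 2) <= radius ** 2

def dilate(mask, selem):
    """Binary dilation; pixels outside the image are treated as False."""
    rad = selem.shape[0] // 2
    h, w = mask.shape
    padded = np.pad(mask, rad, mode="constant", constant_values=False)
    out = np.zeros_like(mask)
    for dy, dx in zip(*np.nonzero(selem)):
        out |= padded[dy:dy + h, dx:dx + w]
    return out

def erode(mask, selem):
    """Binary erosion via duality (valid because the disk is symmetric)."""
    return ~dilate(~mask, selem)

def postprocess(prob, threshold=0.5, radius=1):
    mask = prob > threshold
    selem = disk(radius)
    opened = dilate(erode(mask, selem), selem)   # removes small speckles
    return erode(dilate(opened, selem), selem)   # fills small gaps

prob = np.zeros((11, 11))
prob[3:8, 3:8] = 0.9   # a solid foreground object
prob[1, 9] = 0.9       # an isolated false positive
result = postprocess(prob)
```

In this toy example the isolated pixel is removed by the opening, while the 5x5 object survives with only its four corner pixels trimmed by the radius-1 disk.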

4.3 Quantitative Results

| Metric | Method | b.w. | l.fr. | night | PTZ | ther. | sha. | i.o.m. | c.j. | d.b. | base. | turb. | overall |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| F-measure | BSUV-Net (ours) | 0.8626 | 0.7770 | 0.7188 | 0.5914 | 0.7805 | 0.9514 | 0.7826 | 0.7733 | 0.8055 | 0.9254 | 0.7192 | 0.7898 |
| F-measure | SWCD | 0.8437 | 0.7383 | 0.5559 | 0.4742 | 0.8581 | 0.8777 | 0.7092 | 0.7411 | 0.8645 | 0.9214 | 0.8393 | 0.7658 |
| F-measure | WisenetMD | 0.8596 | 0.6549 | 0.4982 | 0.3764 | 0.8152 | 0.8981 | 0.7264 | 0.8228 | 0.8376 | 0.9487 | 0.8765 | 0.7559 |
| F-measure | PAWCS | 0.8059 | 0.6433 | 0.4171 | 0.4450 | 0.8324 | 0.8910 | 0.7764 | 0.8137 | 0.8938 | 0.9397 | 0.7667 | 0.7477 |
| F-measure | SuBSENSE | 0.8594 | 0.6594 | 0.4918 | 0.3894 | 0.8171 | 0.8983 | 0.6569 | 0.8152 | 0.8177 | 0.9503 | 0.8423 | 0.7453 |
| F-measure | FgSegNet v2 | 0.2789 | 0.2115 | 0.3142 | 0.1400 | 0.3584 | 0.3809 | 0.3325 | 0.2815 | 0.2067 | 0.5641 | 0.1431 | 0.2920 |
| Recall | BSUV-Net (ours) | 0.8234 | 0.7257 | 0.7068 | 0.7930 | 0.8641 | 0.9742 | 0.7448 | 0.8732 | 0.9273 | 0.9828 | 0.7661 | 0.8347 |
| Recall | SWCD | 0.8523 | 0.7863 | 0.6566 | 0.5598 | 0.8602 | 0.9343 | 0.7606 | 0.7085 | 0.8692 | 0.9610 | 0.7907 | 0.7945 |
| Recall | WisenetMD | 0.8139 | 0.8384 | 0.6340 | 0.8490 | 0.7867 | 0.9419 | 0.7398 | 0.8235 | 0.8062 | 0.9507 | 0.8336 | 0.8198 |
| Recall | PAWCS | 0.7091 | 0.7555 | 0.3929 | 0.6611 | 0.8504 | 0.9167 | 0.7487 | 0.7840 | 0.8868 | 0.9408 | 0.8502 | 0.7724 |
| Recall | SuBSENSE | 0.8121 | 0.8435 | 0.6494 | 0.8269 | 0.8161 | 0.9414 | 0.6578 | 0.8243 | 0.7768 | 0.9520 | 0.8574 | 0.8143 |
| Recall | FgSegNet v2 | 0.8878 | 0.5286 | 0.7607 | 0.8804 | 0.9297 | 0.9291 | 0.6414 | 0.9450 | 0.6398 | 0.9510 | 0.6351 | 0.7935 |
| Precision | BSUV-Net (ours) | 0.9239 | 0.8478 | 0.7411 | 0.5884 | 0.7277 | 0.9305 | 0.9320 | 0.7163 | 0.7606 | 0.8757 | 0.7366 | 0.7982 |
| Precision | SWCD | 0.8391 | 0.7098 | 0.4830 | 0.4261 | 0.8585 | 0.8302 | 0.7372 | 0.7827 | 0.8633 | 0.8851 | 0.8969 | 0.7556 |
| Precision | WisenetMD | 0.9148 | 0.6323 | 0.4335 | 0.3096 | 0.8696 | 0.8636 | 0.7881 | 0.8278 | 0.8932 | 0.9477 | 0.9281 | 0.7644 |
| Precision | PAWCS | 0.9379 | 0.6285 | 0.5559 | 0.4859 | 0.8280 | 0.8710 | 0.8392 | 0.8660 | 0.9038 | 0.9394 | 0.7692 | 0.7841 |
| Precision | SuBSENSE | 0.9168 | 0.6276 | 0.4224 | 0.3223 | 0.8328 | 0.8645 | 0.7957 | 0.8115 | 0.8915 | 0.9495 | 0.8399 | 0.7522 |
| Precision | FgSegNet v2 | 0.1948 | 0.2951 | 0.2559 | 0.0768 | 0.2698 | 0.2598 | 0.4085 | 0.1655 | 0.1606 | 0.4565 | 0.0910 | 0.2395 |
Table 2: Comparison of the Per-Category Performance for Unseen Videos on CDNet-2014

A comparison of BSUV-Net with state-of-the-art BGS algorithms is presented in Table 2 using F-measure, recall and precision. The full names of columns are: “bad weather”, “low frame rate”, “night videos”, “pan-tilt-zoom”, “thermal”, “shadow”, “intermittent object motion”, “camera jitter”, “dynamic background”, “baseline” and “turbulence”, respectively. The “overall” column shows the mean score for all 11 categories.
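As a quick arithmetic check of how the “overall” column is formed, the snippet below averages the 11 per-category values from the first BSUV-Net row of Table 2 and compares the result with the reported overall value of 0.7898.

```python
# Per-category values from the first BSUV-Net row of Table 2, in column order.
bsuv_row = [0.8626, 0.7770, 0.7188, 0.5914, 0.7805, 0.9514,
            0.7826, 0.7733, 0.8055, 0.9254, 0.7192]

overall = sum(bsuv_row) / len(bsuv_row)
print(round(overall, 4))  # 0.7898
```

Because each category contributes equally to the mean, categories with few videos weigh as much as categories with many.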

Since BSUV-Net is video-agnostic, comparing it with video-optimized or video-group-optimized algorithms would not be fair and we omitted them. We also omitted the results of post-processing algorithms such as IUTIS [10] and SemanticBGS [12] since they can be applied to any BGS algorithm, including BSUV-Net, to improve its performance. Instead, we compared BSUV-Net with state-of-the-art unsupervised algorithms, namely WisenetMD [9], SWCD [8], SuBSENSE [7] and PAWCS [6], that, by definition, are video-agnostic. We also added FgSegNet v2 [19] to our comparison table since it is the best performing algorithm on CDNet-2014. We trained it in a video-agnostic manner using the same cross-validation that we used for BSUV-Net for fairness. As expected, this caused a huge performance decrease for FgSegNet v2 with respect to its video-optimized training. As can be seen in Table 2, BSUV-Net clearly outperforms these algorithms on all three metrics. The recall performance shows potential of BSUV-Net as a pre-processing step for more advanced algorithms, such as object tracking, while the F-measure demonstrates that BSUV-Net achieves good results without compromising either recall or precision. So as not to violate the author-anonymity policy, we did not upload our results to “changedetection.net”. Instead, we computed results for those frames of CDNet-2014 whose ground truths are available. This may cause our numbers to differ slightly from the results in “changedetection.net”.

As for the category-based results, it can be observed that BSUV-Net achieves a significant performance advantage in the “night videos” category. All videos in this category are traffic-related and many cars have headlights turned on at night which causes significant local illumination variations in time. BSUV-Net’s excellent performance for this category demonstrates that the proposed model is indeed largely illumination-invariant.

BSUV-Net performs worse than other algorithms for the “thermal” category in terms of precision and F-measure. We believe this is related to the fact that thermal images lie on a different manifold than RGB images. Since BSUV-Net is a data-driven algorithm and the number of thermal images is very low compared to the number of RGB images, the network is unable to generalize to the thermal manifold. However, results for other categories show that, given enough data, it should be possible to train a separate algorithm for thermal images and get improved results. A similar argument can be made for the “turbulence” category as well.

Two other problematic categories for BSUV-Net are “camera jitter” and “dynamic background”. In both categories, although BSUV-Net’s recall scores are better than those of the other algorithms, its precision scores are worse, indicating more false positives. We believe this is related to the empty and recent background frames we are using as input. The median operation used to compute background frames creates very blurry images for these categories since the background is not static. Thus, BSUV-Net predicts some pixels in the background as foreground and increases the number of false positives.

4.4 Visual Results

A visual comparison of BSUV-Net with SWCD [8] and WisenetMD [9] is shown in Figure 2. Each row shows a sample frame from one of the videos in each category using the same abbreviations as in Table 2. It can be observed that BSUV-Net produces visually the best results for almost all categories.

In the “night videos” category, SWCD and WisenetMD produce many false positives because of local illumination changes. BSUV-Net produces better results since it is designed to be illumination-invariant. In the “thermal” category, BSUV-Net performs much better in the shadow regions. Results in the “intermittent object motion” and “baseline” categories show that BSUV-Net can successfully detect intermittently-static objects. It is safe to say that BSUV-Net is capable of simultaneously handling the discovery of intermittently-static objects and also the dynamic factors such as illumination change.

An inspection of results in the “dynamic background” category shows that BSUV-Net has detected most of the foreground pixels but failed to detect the background pixels around the foreground objects. We believe this is due to the blurring effect of the median operation that we used in the computation of background frames. Using more advanced background models as an input to BSUV-Net might improve the performance in this category.

Figure 2: Visual Comparison for Unseen Videos on CDNet-2014.

5 Conclusions and Future Work

We introduced a novel deep-learning algorithm for background subtraction of unseen videos and proposed a video-agnostic evaluation strategy that uses cross-validation to treat each video in a dataset as unseen. The input to BSUV-Net consists of the current frame (to be segmented) and two reference frames from different time scales, along with semantic information for all three frames (computed using DeepLabv3 [24]). To increase the generalization capacity of BSUV-Net, we formulated a simple, yet effective, illumination-change model. Experimental results on CDNet-2014 show that BSUV-Net outperforms state-of-the-art unsupervised BGS algorithms in terms of F-measure, recall and precision in most of the categories. This shows the great potential of deep-learning-based BGS algorithms designed for unseen or unlabeled videos.
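
The idea behind the video-agnostic evaluation strategy can be sketched as follows. This is a minimal illustration under stated assumptions, not the split used in our experiments: the round-robin fold assignment, the number of folds and the video names are all hypothetical. The essential property is that folds are formed over whole videos, so no frame of a test video is ever available during training.

```python
def video_folds(videos, k):
    """Assign whole videos to k folds in round-robin order."""
    folds = [[] for _ in range(k)]
    for i, video in enumerate(videos):
        folds[i % k].append(video)
    return folds


def unseen_video_splits(videos, k):
    """Yield (train_videos, test_videos) pairs that share no video."""
    folds = video_folds(videos, k)
    for held_out in range(k):
        test = folds[held_out]
        train = [video
                 for i, fold in enumerate(folds) if i != held_out
                 for video in fold]
        yield train, test


# Hypothetical video identifiers; a real dataset would list its own names.
videos = ["video_a", "video_b", "video_c", "video_d", "video_e", "video_f"]
splits = list(unseen_video_splits(videos, k=3))
```

Each video appears in the test set of exactly one split, so the scores collected over all splits cover the whole dataset while every score is measured on a video that the corresponding model has not seen.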

In the future, we are planning to work on temporal data-augmentation techniques to improve performance on challenging categories, such as “dynamic background” and “camera jitter”. We will also investigate different background models for the reference frames. In this work, we kept our focus on designing a high-performance, supervised BGS algorithm for unseen videos without considering the processing speed. To bring BSUV-Net closer to real-time performance, we are also planning to study a shallow-network implementation designed for fast inference.

6 Acknowledgement

This work was supported in part by ARPA-E under agreement DE-AR0000944 and by the donation of Titan GPUs from NVIDIA Corp.

References

  • [1] Chris Stauffer and W. Eric L. Grimson. Adaptive background mixture models for real-time tracking. In Computer Society Conference on Computer Vision and Pattern Recognition (CVPR), volume 2, pages 246–252. IEEE, 1999.
  • [2] Zoran Zivkovic. Improved adaptive Gaussian mixture model for background subtraction. In International Conference on Pattern Recognition (ICPR), volume 4, pages 28–31. IEEE, 2004.
  • [3] Ahmed Elgammal, Ramani Duraiswami, David Harwood, and Larry S Davis. Background and foreground modeling using nonparametric kernel density estimation for visual surveillance. Proceedings of the IEEE, 90(7):1151–1163, 2002.
  • [4] Anurag Mittal and Nikos Paragios. Motion-based background subtraction using adaptive kernel density estimation. In Computer Society Conference on Computer Vision and Pattern Recognition (CVPR), volume 2. IEEE, 2004.
  • [5] Olivier Barnich and Marc Van Droogenbroeck. ViBe: A universal background subtraction algorithm for video sequences. IEEE Transactions on Image Processing (TIP), 20(6):1709–1724, 2011.
  • [6] Pierre-Luc St-Charles, Guillaume-Alexandre Bilodeau, and Robert Bergevin. A self-adjusting approach to change detection based on background word consensus. In Winter Conference on Applications of Computer Vision (WACV), pages 990–997. IEEE, 2015.
  • [7] Pierre-Luc St-Charles, Guillaume-Alexandre Bilodeau, and Robert Bergevin. SuBSENSE: A universal change detection method with local adaptive sensitivity. IEEE Transactions on Image Processing (TIP), 24(1):359–373, 2015.
  • [8] Şahin Işık, Kemal Özkan, Serkan Günal, and Ömer Nezih Gerek. SWCD: A sliding window and self-regulated learning-based background updating method for change detection in videos. Journal of Electronic Imaging, 27(2):023002, 2018.
  • [9] Sang-Ha Lee, Soon-Chul Kwon, Jin-Wook Shim, Jeong-Eun Lim, and Jisang Yoo. WisenetMD: Motion detection using dynamic background region analysis. arXiv preprint arXiv:1805.09277, 2018.
  • [10] Simone Bianco, Gianluigi Ciocca, and Raimondo Schettini. How far can you get by combining change detection algorithms? In International Conference on Image Analysis and Processing, pages 96–107. Springer, 2017.
  • [11] Dongdong Zeng, Ming Zhu, and Arjan Kuijper. Combining background subtraction algorithms with convolutional neural network. Journal of Electronic Imaging, 28(1), 2019.
  • [12] Marc Braham, Sébastien Piérard, and Marc Van Droogenbroeck. Semantic background subtraction. In International Conference on Image Processing (ICIP), pages 4552–4556. IEEE, 2017.
  • [13] Marc Braham and Marc Van Droogenbroeck. Deep background subtraction with scene-specific convolutional neural networks. In International Conference on Systems, Signals and Image Processing (IWSSIP), pages 1–4. IEEE, 2016.
  • [14] Mohammadreza Babaee, Duc Tung Dinh, and Gerhard Rigoll. A deep convolutional neural network for video sequence background subtraction. Pattern Recognition, 76:635–649, 2018.
  • [15] MC Bakkay, HA Rashwan, H Salmane, L Khoudour, D Puigtt, and Y Ruichek. BSCGAN: Deep background subtraction with conditional generative adversarial networks. In International Conference on Image Processing (ICIP), pages 4018–4022. IEEE, 2018.
  • [16] Dongdong Zeng and Ming Zhu. Background subtraction using multiscale fully convolutional network. IEEE Access, 6:16010–16021, 2018.
  • [17] Yi Wang, Zhiming Luo, and Pierre-Marc Jodoin. Interactive deep learning method for segmenting moving objects. Pattern Recognition Letters, 96:66–75, 2017.
  • [18] Long Ang Lim and Hacer Yalim Keles. Foreground segmentation using convolutional neural networks for multiscale feature encoding. Pattern Recognition Letters, 112:256–262, 2018.
  • [19] Long Ang Lim and Hacer Yalim Keles. Learning multi-scale features for foreground segmentation. arXiv preprint arXiv:1808.01477, 2018.
  • [20] Dimitrios Sakkos, Heng Liu, Jungong Han, and Ling Shao. End-to-end video background subtraction with 3D convolutional neural networks. Multimedia Tools and Applications, pages 1–19, 2017.
  • [21] Nil Goyette, Pierre-Marc Jodoin, Fatih Porikli, Janusz Konrad, and Prakash Ishwar. A novel video dataset for change detection benchmarking. IEEE Transactions on Image Processing (TIP), 23(11):4663–4679, 2014.
  • [22] Guillaume-Alexandre Bilodeau, Jean-Philippe Jodoin, and Nicolas Saunier. Change detection in feature space using local binary similarity patterns. In International Conference on Computer and Robot Vision (CRV), pages 106–112. IEEE, 2013.
  • [23] Hengshuang Zhao, Jianping Shi, Xiaojuan Qi, Xiaogang Wang, and Jiaya Jia. Pyramid scene parsing network. In Computer Society Conference on Computer Vision and Pattern Recognition (CVPR), pages 2881–2890, 2017.
  • [24] Liang-Chieh Chen, George Papandreou, Florian Schroff, and Hartwig Adam. Rethinking atrous convolution for semantic image segmentation. arXiv preprint arXiv:1706.05587, 2017.
  • [25] Bolei Zhou, Hang Zhao, Xavier Puig, Sanja Fidler, Adela Barriuso, and Antonio Torralba. Scene parsing through ADE20K dataset. In Computer Society Conference on Computer Vision and Pattern Recognition (CVPR), pages 633–641, 2017.
  • [26] Olaf Ronneberger, Philipp Fischer, and Thomas Brox. U-Net: Convolutional networks for biomedical image segmentation. In International Conference on Medical Image Computing and Computer-assisted Intervention, pages 234–241. Springer, 2015.
  • [27] Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014.
  • [28] Sergey Ioffe and Christian Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. arXiv preprint arXiv:1502.03167, 2015.
  • [29] Jonathan Tompson, Ross Goroshin, Arjun Jain, Yann LeCun, and Christoph Bregler. Efficient object localization using convolutional networks. In Computer Society Conference on Computer Vision and Pattern Recognition (CVPR), pages 648–656, 2015.