# Real-Time Background Subtraction Using Adaptive Sampling and Cascade of Gaussians

## Abstract

Background-foreground classification is a fundamental, well-studied problem in computer vision. Because modeling and processing are done pixel-wise, it is usually difficult to satisfy real-time constraints, and there is a trade-off between speed (limited by model complexity) and accuracy. Inspired by the rejection cascade of the Viola-Jones classifier, we decompose the Gaussian Mixture Model (GMM) into an adaptive cascade of classifiers. This way we achieve a substantial improvement in speed without compromising accuracy. In the training phase, we learn multiple kernel density estimates (KDEs) over different durations to be used as a strong prior distribution, and we detect probable oscillating pixels, which usually result in misclassifications. We propose a confidence measure for the classifier based on temporal consistency and the prior distribution. The confidence measure is used to adapt the learning rate and the thresholds of the model to improve accuracy. It is also employed to perform temporal and spatial sampling in a principled way. We demonstrate a speed-up factor of 5x to 10x and a 17 percent average improvement in accuracy over several standard videos.

## 1 Introduction

One of the most fundamental problems in computer vision is to provide a good estimate of the background in a given image sequence. Background subtraction is a critical component of surveillance applications (indoor and outdoor), action recognition, human-computer interaction, tracking, and experimental chemical procedures that require significant change detection. Work on background subtraction started in the 1970s, and it remains an active open problem today. A host of methods has been developed; below is a short review that will aid in understanding our algorithm. The survey in [6] provides an overview of common methods, including Frame Differencing (FD), Running Gaussian Average (RGA), Gaussian Mixture Model (GMM) and Kernel Density Estimation (KDE). We employ these basic methods in a structured methodology to develop our algorithm.

A survey of GMM variants, their issues and their analysis is presented in [4]. In our work, we focus on solving the variable-rate problem, in which different parts of the background change at different rates, and on improving performance. Abstractly, our work fuses several algorithms to achieve both speed and accuracy, and we list similar attempts here. [9] and [5] used hierarchical background subtraction methods that operate at different levels of data grouping, namely the pixel, region and frame levels. Both methods, however, are hierarchical in terms of the abstraction of the data they operate on (pixel, region, frame, etc.) and not in terms of processing. [12] switch between GMM and RGA, choosing the complex model for complicated backgrounds and the simple model for simpler backgrounds. They use an entropy-based measure to switch between the models (variable multi-modal nature). [11] use a two-layer Gaussian mixture model (TLGMM) in which the first layer captures gradual changes and the other layers capture rapid changes (variable-rate nature). Similarly, [7] developed a model that maintains two background models with different rates to capture dynamic and static pixels.

We briefly describe our observations and our improvements over the standard GMM of Stauffer and Grimson [8]. We observe that in most cases background subtraction is an asymmetric classification problem, with the probability of a foreground pixel being much lower than that of a background pixel. This assumption fails for scenes such as highways or busy streets. In our work, we focus mainly on surveillance scenarios with very low foreground occupancy. Our framework exploits this fact while also handling variable-rate changes in the background and improving accuracy. Our key contributions in this paper are:

1. Decomposition of the GMM into an adaptive cascade of classifiers, the Cascade of Gaussians (CoG), which handles complex scenes efficiently to obtain real-time performance.
2. A confidence estimate for each pixel's classification, which is used to vary the learning rate and thresholds of the classifiers and to drive adaptive sampling.
3. Learning a time-windowed KDE from the training data set, which acts as a prior to the Adaptive Rejection Cascade and also helps the confidence estimate.

The decomposition of the GMM into a cascade, with its increasing true positive detection rate, is inspired by the Viola-Jones rejection cascade [1]. [10] provides an optimized lookup for highly probable colors in the incoming background pixels, thus speeding up access. We try to provide a more general method by grouping pixels with similar behavior during the training phase described in Section 2. The rest of the paper is organized into three sections. Section 2 describes the components of the framework and the motivation behind each component. Section 3 contains the description of the algorithm and compares it with other algorithms. Section 4 discusses the results, improvements and future work.

## 2 Components of the Cascade

This section describes the different components of the rejection cascade and how they were determined. The rejection cascade is accompanied by the confidence measure to make an accurate background classification at each level of the cascade.

### 2.1 Training Phase

#### Scene Prior: Background Model

Distinguishing a linearly varying background from noisy pixels is challenging and critical, since the background subtraction model intrinsically has no additional attribute to separate them. To handle this scenario, we introduce a "Scene Prior" for every pixel of the frame (equation 1): a non-parametric probability distribution of the pixel value, assuming independent R, G, B channels j. The Scene Prior provides the temporal distribution of the pixel value over N frames, with higher accuracy than the GMM (equation 2) during training. The choice of N is empirical and depends on how much dynamic background and foreground is present in the training frames. To capture the complete variability we choose N as large as possible. Henceforth we refer to the Scene Prior as the prior. In the training phase we estimate the underlying temporal distribution of each pixel with a kernel function that approximates that distribution. We concentrate primarily on long surveillance videos whose training sequence contains sufficient information (no or very little foreground), and this is what decides N.
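As an illustration of the Scene Prior, the following is a minimal sketch of a per-pixel KDE over N training samples with independent colour channels. The Gaussian kernel, the bandwidth values and the names `gaussian_kernel` and `scene_prior` are assumptions made for this example; the paper does not specify them.

```python
import math

def gaussian_kernel(u):
    """Standard normal density, used here as the (assumed) kernel function."""
    return math.exp(-0.5 * u * u) / math.sqrt(2.0 * math.pi)

def scene_prior(pixel, history, bandwidth=(10.0, 10.0, 10.0)):
    """KDE probability density of an (R, G, B) pixel given its N training samples.

    Channels are treated as independent, so the joint kernel is a product
    of one-dimensional kernels, one per channel.
    """
    total = 0.0
    for sample in history:
        product = 1.0
        for value, past, h in zip(pixel, sample, bandwidth):
            product *= gaussian_kernel((value - past) / h) / h
        total += product
    return total / len(history)

# Hypothetical training history of one pixel (N = 4 frames).
history = [(100, 120, 90), (102, 118, 91), (99, 121, 89), (101, 119, 92)]
p_background = scene_prior((100, 120, 90), history)  # value close to the history
p_outlier = scene_prior((200, 30, 40), history)      # value far from the history
```

A value that resembles the training samples receives a much higher density than one far from them, which is what lets the prior separate expected background values from unexpected ones.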

For the standard GMM (assuming a diagonal covariance matrix), the parameter updates are given by equation 2.

In equation 1, the two quantities are the kernel function and its scale, or bandwidth. The kernel function is evaluated to provide the modes of each pixel. In equation 2, the mixture represents the pixel mode distribution obtained in equation 1; its parameters are the weight (ratio) of component i in the distribution of the pixel, the mean and variance of that component, an indicator that is 1 on a component match and 0 otherwise, and the learning rate of the pixel model. The learning rate is usually initialized globally for all pixels, although there has been work on adapting it based on the pixel entropy. We use the distribution of pixel gradient values to do the same.
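For reference, the textbook forms of a channel-independent KDE and of the running GMM updates are sketched below. The notation is assumed for this sketch and is not taken from the original equations: $x_t$ is the pixel value at frame $t$, $\mathcal{K}$ the kernel, $h_j$ the bandwidth of channel $j$, $\omega_{i,t}$, $\mu_{i,t}$ and $\sigma_{i,t}^2$ the weight, mean and variance of component $i$, $M_{i,t}\in\{0,1\}$ the match indicator, and $\lambda$ the learning rate. A single rate $\lambda$ is used for all three updates, which is a common simplification.

$$
P_{\mathrm{KDE}}(x_t) = \frac{1}{N}\sum_{n=1}^{N}\ \prod_{j\in\{R,G,B\}} \frac{1}{h_j}\,\mathcal{K}\!\left(\frac{x_t^{j}-x_n^{j}}{h_j}\right)
$$

$$
\begin{aligned}
P_{\mathrm{GMM}}(x_t) &= \sum_{i=1}^{K} \omega_{i,t}\,\mathcal{N}\!\left(x_t;\ \mu_{i,t},\ \sigma_{i,t}^{2}\right),\\
\omega_{i,t} &= (1-\lambda)\,\omega_{i,t-1} + \lambda\,M_{i,t},\\
\mu_{i,t} &= (1-\lambda)\,\mu_{i,t-1} + \lambda\,x_t,\\
\sigma_{i,t}^{2} &= (1-\lambda)\,\sigma_{i,t-1}^{2} + \lambda\,(x_t-\mu_{i,t})^{2}.
\end{aligned}
$$

In this form the mean and variance updates are applied only to the matched component ($M_{i,t}=1$), while the weight update is applied to every component.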

#### Determining Learning Rate Hyper-parameters

Apart from the kernel density estimate, we estimate the dynamic nature of the pixels in the scene. This is obtained from a Gaussian Mixture Model of the temporal residue between consecutive incoming pixel values. We can see from the figure below that the residue, when binned into 3 levels, provides a good way to classify pixels into static or drifting pixels, oscillating pixels and dynamic pixels. This helps us distinguish a pixel drift from a pixel jump, as shown in the example in the figure. Concretely, we compute the residue for n in [1, N] and obtain its normalized histogram using both a KDE and a plain histogram (both methods are used to observe the effect of quantization). We use the simple histogram for illustration. The normalized occupancy of the bins determines the type of pixel: after thresholding, a peaky first bin implies a drifting or static pixel, a peaky second bin implies an oscillating pixel, and all other cases are considered dynamic pixels. Based on these values we choose the weights of the confidence measure (explained in the next section), which sets the learning rate for the pixel. The mapping from the normalized binned histogram values to the learning rates (the confidence function) has been determined empirically by shape matching the histograms. This part of the work is still in progress.
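The binning described above can be sketched as follows, assuming the residue is the absolute difference between consecutive values of a grayscale pixel. The bin edges, the peak threshold and the function name `classify_pixel` are hypothetical values chosen for illustration, not the empirically determined ones.

```python
def classify_pixel(values, edges=(5.0, 30.0), peak=0.6):
    """Label a pixel's training series from its 3-bin temporal residue histogram.

    edges: upper limits of the first and second bins (assumed values).
    peak:  minimum normalized occupancy for a bin to count as 'peaky' (assumed).
    """
    residues = [abs(b - a) for a, b in zip(values, values[1:])]
    counts = [0, 0, 0]
    for r in residues:
        if r < edges[0]:
            counts[0] += 1
        elif r < edges[1]:
            counts[1] += 1
        else:
            counts[2] += 1
    occupancy = [c / len(residues) for c in counts]
    if occupancy[0] >= peak:
        return "static/drifting", occupancy
    if occupancy[1] >= peak:
        return "oscillating", occupancy
    return "dynamic", occupancy

drifting = [100 + 0.5 * n for n in range(50)]         # slow linear drift
oscillating = [100 + 20 * (n % 2) for n in range(50)]  # flips between two values
dynamic = [(n * 97) % 256 for n in range(50)]          # large jumps every frame

label_drift, _ = classify_pixel(drifting)
label_osc, _ = classify_pixel(oscillating)
label_dyn, _ = classify_pixel(dynamic)
```

A slow drift produces only small residues and fills the first bin, while a jump produces a large residue, which is how the histogram separates the two cases.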

#### Clustering Similar Background - Spatio-Temporal Grouping

The next step in the training phase is to determine background regions whose pixels behave similarly in terms of adapted variance and number of modes, so that fewer parameters and fewer instructions are needed to update the pixel models of each region. The problem can be formalized as follows. We are given a set of pixels, and for each pixel a set of matches, each stating that one pixel correlated with another pixel at a given frame number. From these N matches we construct a discrete time series for each pixel by clustering over fixed intervals of frames, giving a time series of the pixel values per frame. Intuitively, the series measures the correlation in the behavior of pixels over a time window. For convenience we assume that all time series have the same length. We group the pixel value time series so that similar behavior is captured by similarity of the series. This way we can infer which pixels have similar temporal patterns of variance and modality, and we can take the center of each cluster as the representative pattern of the group. This can be seen as a spectral clustering problem, as described in [2]. We first try a simpler approach: clustering the adapted pixel variances (matrix V) and the weights (matrix R) of the first dominant mode of each pixel's mixture model.

1. Obtain N frames and train on them to obtain the pixel mean, variance and adapted weights for all frames, where R(t) refers to the ranked weight of the first dominant mean at every pixel.
2. Form a matrix whose rows are the adapted variance and ranked weight observations and whose columns are the variables V and R.
3. Obtain the covariance matrices of V and R.
4. Perform K-means clustering with K = 3 clusters. K is set empirically to 3, quantizing the covariance according to the temporal residue of the pixel (dynamic, oscillating, drifting).
5. Threshold for pixels within …
6. Calculate the KDE of the given cluster, and calculate the joint occurrence distribution and the associated weight at the first dominant common cascade level of the grouped pixels.
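The clustering step can be sketched with a plain Lloyd's K-means over per-pixel (adapted variance, dominant-mode weight) pairs. The feature values and the choice of initial centers below are made up for illustration, and the covariance, thresholding and KDE steps are not covered.

```python
def kmeans(points, initial_centers, iterations=20):
    """Plain Lloyd's algorithm on small tuples; returns (labels, centers)."""
    centers = list(initial_centers)
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: nearest center by squared Euclidean distance.
        labels = [
            min(range(len(centers)),
                key=lambda k: sum((p - c) ** 2 for p, c in zip(point, centers[k])))
            for point in points
        ]
        # Update step: move each center to the mean of its members.
        for k in range(len(centers)):
            members = [p for p, lab in zip(points, labels) if lab == k]
            if members:
                centers[k] = tuple(sum(dim) / len(members) for dim in zip(*members))
    return labels, centers

# Hypothetical (adapted variance, first dominant mode weight) per pixel.
features = [
    (2.0, 0.95), (3.0, 0.90), (2.5, 0.92),   # stable pixels
    (40.0, 0.55), (45.0, 0.50),              # oscillating pixels
    (200.0, 0.20), (220.0, 0.25),            # dynamic pixels
]
labels, centers = kmeans(features, [features[0], features[3], features[5]])
```

In this toy data the variance dominates the distance because the two features are on very different scales; in practice the features would likely need to be normalized before clustering.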

The above process suffers from the setback that the temporally chosen variances do not correspond to the mean values associated with the maximum eigenvalue, as they would in spectral clustering. We are therefore left with the covariance matrices of the pixel variance and of the adapted weight (dominant mode).

A single Gaussian is fit over the thresholded covariance matrices (adapted variance and first dominant mode weight).

The fitted parameters are the mean and standard deviation of each cluster of pixel variances and of adapted weights of the first dominant modes. The fundamental clustering algorithm requires the data set, the number of clusters (the quantization of the adapted weights or variances), and the Gram matrix [2]. One critical point to note is that when we choose not to employ spatio-temporal grouping to reduce the number of parameters and consequent updates, we can still use the Scene Prior covariance estimation to increase the accuracy of the foreground detection. This is very similar to background subtraction based on co-occurrence of image variations. The process is depicted in Figure 3.

### 2.2 Confidence Measure

The confidence measure is a latent variable used to aid the rejection cascade: it provides a measure of fitness for the classification of a pixel based on various criteria. The confidence for a pixel is a weighted combination of several terms.

The first term is the difference between the current pixel value and the parameters of the model at the top of the ordered rejection cascade described below. As seen in the ordered tree, the first set of parameters is that of the first dominant mode. The term is evaluated at the level at which the pixel is successfully classified. The second term is the probability of occurrence of the pixel value under the KDE. The weights of these terms are determined by the normalized temporal residue distribution (explained above), and they have the following physical significance. The first weight expresses how confident the region is: stable regions (for example, the segments obtained by clustering the adapted variances and weights of the training-phase pixel models) have high values. The second weight determines how fast the pixel needs to adapt to new incoming values; faster adaptation means a lower effect of the prior distribution. The final parameter captures how consistently the pixel belongs to a model, and it changes whenever the pixel's behavior becomes much more dynamic (as opposed to being weighted by the temporal residue).
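The exact confidence expression is not reproduced in this text, so the following is only a sketch of one plausible form: a weighted sum of the prior probability, a model-fit term derived from the distance to the dominant mode, and a temporal-consistency term. The weight names `alpha`, `beta`, `gamma`, their default values, the exponential fit term and the `scale` parameter are all assumptions made for this example.

```python
import math

def confidence(prior_prob, distance, consistency,
               alpha=0.4, beta=0.3, gamma=0.3, scale=10.0):
    """Hypothetical per-pixel confidence in [0, 1].

    prior_prob:  probability of the pixel value under the KDE prior, scaled to [0, 1].
    distance:    difference between the pixel value and the dominant-mode mean.
    consistency: fraction of recent frames in which the pixel kept the same level.
    """
    fit = math.exp(-abs(distance) / scale)  # 1 when the pixel sits on the model mean
    return alpha * prior_prob + beta * fit + gamma * consistency

stable = confidence(prior_prob=0.9, distance=1.0, consistency=0.95)
unstable = confidence(prior_prob=0.1, distance=40.0, consistency=0.2)
# Setting alpha to 0 removes the prior term from the combination.
no_prior = confidence(prior_prob=0.9, distance=1.0, consistency=0.95, alpha=0.0)
```

With weights that sum to one and inputs in [0, 1], the value stays in [0, 1], which makes it straightforward to threshold.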

#### Confidence Based Spatio-Temporal Sampling

By applying multiple modes of background classifiers and observing the consistency of their model parameters (mean, variance, and connectivity), we predict the future values of these pixels. A threshold on the confidence value, determined using stable regions (found by region growing) as a reference, is used to select pixels both spatially and temporally. The confidence measure is described in more detail in section 2.3. Pixels with low confidence mark regions R of the frame with activity, and thus with a high probability of containing pixels whose labels are in transition (FG-BG). By thresholding the confidence function we therefore sub-sample the incoming pixels spatio-temporally. The intuition is that when incoming pixel values stay within the region of the first dominant mode, and even more so within the Consistent Hypothesis Propagation (CHP) level, for a large number of frames, the confidence value saturates. The region R is simply a thresholded binary map of this confidence value. This is demonstrated in the analysis in section 3.
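A minimal sketch of the sampling idea follows, assuming a per-pixel confidence map with values in [0, 1]. Pixels below the threshold form the active region R and are evaluated on every frame, while saturated pixels are revisited only periodically. The threshold, the refresh period and the name `pixels_to_process` are hypothetical.

```python
def pixels_to_process(confidence_map, frame_index, threshold=0.8, period=4):
    """Return the (row, col) coordinates to evaluate on this frame.

    confidence_map: 2-D list of per-pixel confidence values.
    threshold:      confidence below which a pixel belongs to the active region R.
    period:         stable pixels are re-checked every `period` frames (assumed).
    """
    refresh = frame_index % period == 0  # temporal sampling of stable pixels
    selected = []
    for r, row in enumerate(confidence_map):
        for c, value in enumerate(row):
            in_region = value < threshold  # spatial sampling: low confidence
            if in_region or refresh:
                selected.append((r, c))
    return selected

confidence_map = [[0.90, 0.95],
                  [0.30, 0.99]]
active_only = pixels_to_process(confidence_map, frame_index=1)
full_refresh = pixels_to_process(confidence_map, frame_index=4)
```

On most frames only the low-confidence pixel is processed, which is where the saving comes from; the periodic refresh keeps saturated pixels from going stale.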

### 2.3 Cascade of Gaussians (CoG)

The proposed method can be viewed as a decomposition of the GMM in an adaptive framework that reduces complexity and improves accuracy, using a strong prior to determine the scenarios in which these gains can be achieved. The prior is used to determine the modality of the pixel's distribution, and any new value is treated as a new mean with a variance model. The cascade consists of K Gaussians ordered according to how successfully they classify the pixel. In steady state the ordered cascade conforms to the Viola-Jones rejection cascade with decreasing positive detection rates. The cascade is headed by a Consistent Hypothesis Propagation (CHP) classifier, which simply repeats the previous label for the current pixel if its value equals its value in the previous frame. The CHP classifier is followed by an ordered set of Gaussians, including the spatio-temporally grouped parameters. The tree ordering differs from pixel to pixel and is decided by the prior distribution (KDE) of the pixel and by its temporal consistency at the different levels. When a pixel value does not belong to any of the dominant modes given by the prior, only the beta and gamma weights are considered and alpha is rejected (prior nullified).
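The level ordering can be sketched for a single grayscale pixel as follows: CHP first, then the dominant mean, then the dominant mean with variance, and finally the remaining Gaussians. The tolerances `mean_tol` and `k_sigma`, the level names and the fixed mode ordering are assumptions for illustration; the adaptive per-pixel reordering is not modelled here.

```python
def classify(value, previous_value, previous_label, modes,
             mean_tol=2.0, k_sigma=2.5):
    """Return (label, level) for one pixel.

    modes: list of (mean, std) pairs, most dominant first.
    The pixel exits the cascade at the first (cheapest) level that accepts it.
    """
    # Level 0, CHP: the value is unchanged, so the previous label is repeated.
    if value == previous_value:
        return previous_label, "CHP"
    mean, std = modes[0]
    # Level 1, dominant mean only: a fixed tolerance around the mean.
    if abs(value - mean) <= mean_tol:
        return "BG", "mean"
    # Level 2, dominant mean with variance: a k-sigma test.
    if abs(value - mean) <= k_sigma * std:
        return "BG", "mean+variance"
    # Level 3, full mixture: test the remaining modes.
    for m, s in modes[1:]:
        if abs(value - m) <= k_sigma * s:
            return "BG", "GMM"
    return "FG", "GMM"

modes = [(100.0, 4.0), (160.0, 5.0)]
results = [
    classify(100, 100, "BG", modes),  # unchanged value
    classify(101, 100, "BG", modes),  # close to the dominant mean
    classify(108, 100, "BG", modes),  # within 2.5 sigma of the dominant mode
    classify(158, 100, "BG", modes),  # matches the second mode
    classify(220, 100, "BG", modes),  # matches nothing
]
```

Only a pixel that fails every cheaper test reaches the loop over the remaining modes, so the cost per pixel depends on how early it is accepted.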

The rejection cascade is motivated by the observation that foreground detections occur far less often than background detections. The term rejection cascade was first introduced in the classic Viola-Jones paper [1]. There, the training phase produces a sequence of features with decreasing rates of negative rejections. In our case we arrange the classifiers in order of increasing complexity to maximize speed, and we observe in practice that this cascade also produces decreasing rates of negative rejections. The critical difference in our rejection cascade is that the classifier at each level evolves over time. To make adaptation efficient we adapt only the active level of the cascade, so that only one update is active at a time; the parameters are updated during a transition. The performance of the different rejection cascade elements is depicted in Figure 1, which shows the performance of cascade elements of increasing complexity (and consequently accuracy). These times were obtained over 4 videos from the Wallflower data set [9] with different types of dynamic background. This by itself indicates the amount of speedup that can be obtained when the rejection cascade is applied adaptively, based on the nature of each pixel. We also observed that the pixels in each of these 4 videos were distributed differently among the 4 levels, as seen in Figure 2. Even though the number of pixels of a given dynamic nature varies with the video, on average more pixels correspond to the low-complexity cascade elements.

The rejection cascade for background subtraction was formed, as in [1], by determining the set of background pixel classifiers (in our case models, analogous to the attentional operator in Viola-Jones) and organizing them as a degenerate tree with a decreasing false positive rate down the cascade. As the recall-precision graphs show, the false positive rates decrease as we proceed from Dominant Mean to Dominant Mean with Variance and finally GMM. Please refer to [3] for a more comprehensive list of recall-precision rates; it gives an idea of how the tree rates were determined and the cascade arranged. The mean and associated variance of the dominant modes are updated according to the conventional Running Gaussian Average (RGA) model, as in equation 2.

The mean and variance are updated with the learning rates (based on the confidence value) obtained during training from the temporal residue and the time-windowed KDE (to obtain temporal resolution). The learning rates depend on the spatio-temporal grouping and on the observed variances, or dynamic nature, of the grouped pixels. They are determined during the training phase.
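The Running Gaussian Average update mentioned above can be sketched as below. The update rule is the conventional RGA form; the mapping from confidence to learning rate (`learning_rate_from_confidence`, with its `low` and `high` bounds and its direction) is a hypothetical choice for illustration, since the actual mapping is determined empirically during training.

```python
def rga_update(mean, variance, value, learning_rate):
    """One Running Gaussian Average step for the dominant mode of a pixel."""
    new_mean = (1.0 - learning_rate) * mean + learning_rate * value
    new_variance = ((1.0 - learning_rate) * variance
                    + learning_rate * (value - new_mean) ** 2)
    return new_mean, new_variance

def learning_rate_from_confidence(confidence, low=0.005, high=0.1):
    """Assumed mapping: confident pixels adapt slowly, uncertain ones quickly."""
    return high - (high - low) * confidence

rate = learning_rate_from_confidence(0.0)           # least confident pixel
mean, variance = rga_update(100.0, 16.0, 110.0, rate)
```

Because only the active level of the cascade is updated, one such call per processed pixel is the whole adaptation cost in the common case.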

The learning rate for the model is calculated as a function of the confidence measure of the pixels. Abrupt illumination changes are detected in the final level of the rejection cascade by adding a conditional counter. This counter measures the number of pixels that are not modeled by the penultimate cascade element. If this count is above a threshold, we assume an abrupt illumination change. The threshold is around seven tenths of the total number of pixels in the frame (similar to [9]).
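The counter test can be sketched as follows, assuming a flat list of per-pixel flags that are true where the penultimate cascade element failed to model the pixel. The function name and the flag representation are assumptions; the 0.7 fraction is the approximate threshold stated above.

```python
def abrupt_illumination_change(unmodeled_flags, fraction=0.7):
    """True when more than `fraction` of the frame's pixels were not modeled
    by the penultimate cascade element."""
    flags = list(unmodeled_flags)
    return sum(flags) > fraction * len(flags)

# Hypothetical 10-pixel frames.
lights_switched = abrupt_illumination_change([True] * 8 + [False] * 2)
normal_motion = abrupt_illumination_change([True] * 5 + [False] * 5)
```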

## 3 Analysis of Cascade of Gaussians

Here we discuss two parts of the CoG. The first section discusses the analysis of the training phase, in particular the spatio-temporal grouping and initialization of the CoG. The second section discusses the cascade itself and its performance.

### 3.1 Scene Prior Analysis

Here we discuss the Scene Prior and its different components. First, with regard to clustering pixels by the similarity of their dynamic nature, we show the results of various clustering methods and the intuition behind them. The first model considers the time series of variances of the pixels over the N training frames. The covariance matrix is calculated for the variances of the pixels. This can loosely act as the affinity matrix describing the similar behavior of a pair of pixels. The weight of the first dominant mode is also considered in forming the affinity matrix.

### 3.2 Cascade Analysis

The Cascade of Gaussians is faster for two reasons. First, it is a cascade of simple-to-complex classifiers (CHP to RGA); averaging over the performance (seen in the figure), we see an improvement in speed because the simpler classification cases outnumber the complex ones. Second, it models the image as spatio-temporal groups of super pixels, each of which needs only a single set of parameters to update; moreover, when the confidence of a pixel saturates, the cascade updates are halted, providing large speedups. It should be noted, however, that the sampling window is chosen empirically and in scale with the confidence saturation values. The average speedup of the rejection tree algorithm is calculated from the level occupancies and level timings defined next.

Here x, y range over all indices of the image, n refers to the ratio of background pixels labeled mean or mean with variance with respect to the total number of background pixels in the image, and s is the normalized ratio of the time it takes the level-i BG model to evaluate and label a pixel as background. The values of n and s have been profiled over various videos for different durations. We also show the distribution of the CHP pixels and of the first 3 dominant modes within different frames of the Waving Trees and Time of Day videos, with 40 frames of training each. We can see a large occupancy of red (CHP) for both background and foreground pixels. Here we explain the confidence measure and its effect on the accuracy of the GMM. We obtain a speedup of 2x-3x with the adaptive rejection cascade based GMM. This speedup rises to 4x-5x, depending on the effectiveness and accuracy of the confidence-based spatio-temporal sampling. This is evident in the cascade level population (in the figure below).
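Since the speedup expression itself is not reproduced here, the sketch below assumes one natural reading of n and s: if n_i is the fraction of pixels resolved at level i and s_i is the time of level i normalized by the time of the full GMM, the expected per-pixel cost is the sum of n_i times s_i and the speedup is its reciprocal. The numbers are illustrative only and are not measurements from the paper.

```python
def average_speedup(occupancy, relative_time):
    """Speedup over evaluating the full model on every pixel.

    occupancy:     n_i, fraction of pixels resolved at cascade level i (sums to 1).
    relative_time: s_i, evaluation time of level i divided by the full-model time.
    """
    expected_cost = sum(n * s for n, s in zip(occupancy, relative_time))
    return 1.0 / expected_cost

# Illustrative values for 4 levels (CHP, mean, mean with variance, GMM).
occupancy = [0.60, 0.20, 0.15, 0.05]
relative_time = [0.10, 0.30, 0.50, 1.00]
speedup = average_speedup(occupancy, relative_time)
```

Under this reading the speedup is driven almost entirely by how many pixels exit at the cheap levels, which matches the observation that low-complexity cascade elements hold most of the pixels.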

## 4 Results and Conclusion

This section discusses the tests we have performed on the Toyama dataset and the PETS 2001 and 2006 datasets. We show the ratios of the different types of background encountered by the Cascade of Gaussians, that is, the ratios of pixels that obtain different speedups from the cascade depending on their level (CHP, mode 1, mode 2 and so on). This paper has demonstrated conceptually how a GMM (and variants such as AGMM) can be restructured optimally into a prior and a cascaded model, ordered by the probability of occurrence of each cascade level and by the accuracy (and complexity) of the model at each level. The spatio-temporal grouping helps evaluate similar pixels in the scene together and provides fewer parameters to update over the whole frame, while minimizing the loss in accuracy. Finally, the chosen confidence measure is shown to be a metric that is sensitive to changes in pixel values, pixel modes and the prior distribution, with the associated learning rates decided on the same basis.

### References

[1] P. Viola and M. Jones. **Robust real-time face detection.** *International Journal of Computer Vision*, 57(2):137–154, May 2004.

[2] A. Azran and Z. Ghahramani. **Spectral methods for automatic multiscale data clustering.** *Computer Vision and Pattern Recognition, 2006 IEEE Computer Society Conference on*, pages 190–197.

[3] Y. Benezeth, P. Jodoin, B. Emile, H. Laurent, and C. Rosenberger. **Review and evaluation of commonly-implemented background subtraction algorithms.** *Pattern Recognition, 2008. ICPR 2008. 19th International Conference on*, pages 1–4, 2008.

[4] T. Bouwmans, F. E. Baf, and B. Vachon. **Background modeling using mixture of gaussians for foreground detection - a survey.** *Recent Patents on Computer Science*, 1(3):219–237, Nov. 2008.

[5] O. Javed, K. Shafique, and M. Shah. **A hierarchical approach to robust background subtraction using color and gradient information.** *Motion and Video Computing, 2002. Proceedings. Workshop on*, pages 22–27, 2002.

[6] M. Piccardi. **Background subtraction techniques: a review.** *Systems, Man and Cybernetics, 2004 IEEE International Conference on*, 4(1):3099–3104, 2004.

[7] F. Porikli. **Detection of temporarily static regions by processing video at different frame rates.** *Advanced Video and Signal Based Surveillance, AVSS 2007. IEEE Conference on*, pages 236–241, Sept. 2007.

[8] C. Stauffer and W. Grimson. **Adaptive background mixture models for real-time tracking.** *Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition*, 1999.

[9] K. Toyama, J. Krumm, B. Brumitt, and B. Meyers. **Wallflower: Principles and practice of background maintenance.** *Computer Vision, 1999. The Proceedings of the Seventh IEEE International Conference on*, 1(1):255–261, 1999.

[10] B. Valentine, S. Apewokin, L. Wills, and S. Wills. **An efficient, chromatic clustering-based background model for embedded vision platforms.** *Computer Vision and Image Understanding*, 114(11):1152–1163, Nov. 2010.

[11] H. Yang, Y. Tan, J. Tian, and J. Liu. **Accurate dynamic scene model for moving object detection.** *Int Conf on Image Processing (ICIP 2007)*, 6:157–160, 2007.

[12] J. Zuo, Q. Pan, H. Z. Y. Liang, and Y. Cheng. **Model switching based adaptive background modeling approach.** *Acta Automatica Sinica*, pages 467–473.