Learning deep structured network for weakly supervised change detection
Abstract
Conventional change detection methods require a large number of images to learn background models or depend on tedious pixel-level labeling by humans. In this paper, we present a weakly supervised approach that needs only image-level labels to simultaneously detect and localize changes in a pair of images. To this end, we employ a deep neural network with DAG topology to learn patterns of change from image-level labeled training data. On top of the initial CNN activations, we define a CRF model to incorporate the local differences and context with dense connections between individual pixels. We apply a constrained mean-field algorithm to estimate the pixel-level labels, and use the estimated labels to update the parameters of the CNN in an iterative EM framework. This enables imposing global constraints on the observed foreground probability mass function. Our evaluations on four benchmark datasets demonstrate superior detection and localization performance.
1 Introduction
Identifying changes of interest in a given set of images is a fundamental task in computer vision with numerous applications in fault detection, disaster management, crop monitoring, visual surveillance, and scene analysis in general. When only two images are available, existing approaches mostly resort to strong supervision and thus require large amounts of training data with accurate pixel-level annotations to perform pixel-level analysis. To comprehend the significant effort needed for such a formidable task, consider the example of CDnet 2014 [Wang et al.2014], the largest dataset for video-based change detection, which required manual annotations for 8 billion pixel locations. Although sophisticated methods have been investigated to reduce the human effort, e.g., expert feedback in case of ambiguity [Jain and Grauman2013, Gueguen and Hamid2015], semi-automatic propagation of annotations [Badrinarayanan et al.2013], and point-wise supervision [Russakovsky et al.2015], the acquisition of accurate and dense pixel-wise labels remains a daunting task [Lin et al.2014, Song et al.2015].
Here, we address the problem of change detection within a pair of images and present a solution that uses only image-level labels to detect and localize changed regions (Fig. 1). Our method drastically reduces the effort required to collect annotations and provides an alternative to video change detection, which requires a large number of consecutive frames to model the background scene. In many real-world applications, a continuous stream of images may not always be available for a number of reasons, such as challenging acquisition conditions, limited data storage, latency in processing, and long intervals before changes happen. For example, the analysis of aerial images for change detection, in particular for damage detection, is often formulated for a pair of images acquired at different times. Other examples where only a pair of images might be available include structural defect identification, face rejuvenation tracking, and updating city street-view models.
Our algorithm jointly predicts the image-level change label and a segmentation map indicating the location of changes for a given pair of images. The central component of our method is a novel two-stream deep network model with structured outputs (Sec. 2). This model operates on a pair of images and does not need the images to be registered precisely. It can be trained with only weak image-level labels (Sec. 4.2). The network has a Directed Acyclic Graph (DAG) architecture in which the initial layers are shared, while the latter part splits into two branches that make separate (but coupled) predictions for change detection and localization. In this manner, our deep network differs from the popular single-stream convolutional neural networks (CNNs) for object classification [Khan et al.2015], detection [Girshick et al.2014, Khan et al.2016] and semantic labeling tasks [Papandreou et al.2015, Long et al.2015, Pinheiro and Collobert2015].
In order to jointly predict the image-level and pixel-level labels, we introduce a constrained mean-field inference algorithm (Sec. 2.3) that employs a factorizable approximate posterior distribution with global linear constraints. Using a global constraint on the foreground (changed pixels) probability mass function, we suppress the bias towards the background (no-change labeled pixels) and encourage the assignment of change labels to non-identical regions. Such global constraints enable us to derive an efficient mean-field inference procedure, while eliminating the need for approximate biases [Papandreou et al.2015] and object-based priors [Pinheiro and Collobert2015, Russakovsky et al.2015]. Furthermore, based on the novel inference algorithm, we apply a variational Expectation-Maximization (EM) learning algorithm that maximizes a lower bound on the log-likelihood of the image-level labels. We extensively evaluate our approach on three publicly available datasets (CDnet2014, PCD2015 and AICD2012) and a custom-built satellite image dataset (GASI2015) (Sec. 4.2). Our experimental results demonstrate that the proposed approach outperforms the state of the art by a large margin (Sec. 4.3). The key contributions of our work include:

To the best of our knowledge, this is the first work to address the weakly supervised change detection problem.

Our proposed CNN model jointly detects and localizes changes in image pairs.

We present a modified mean-field algorithm with additional constraints to efficiently localize changes.

We introduce a new satellite image dataset (GASI2015) for change detection. Furthermore, we perform a rigorous evaluation on three other relevant datasets.
2 Two-stream CNNs for Change Localization
We address the problem of joint change detection and localization with only image-level weak supervision. To this end, we propose a two-stream deep convolutional neural network model with structured outputs, which can be learned from weakly labeled image pairs. We describe our model next.
2.1 Model Overview
Given a pair of inputs, which can be images or (short) video clips, our goal is to predict the categories of change events in the pair and to localize the changes more precisely at the pixel level. For simplicity, we focus on the image-pair scenario in the following; video clips can be processed in a similar manner.
Specifically, let each input consist of a pair of images, $x = (x^1, x^2)$. We associate an image-level output label $y$ to indicate the occurred change event, i.e., change ($y = 1$) or no-change ($y = 0$). It is important to note here that the no-change category ($y = 0$) covers the static background, irrelevant changes and dynamic-background change patterns, while the change category refers to changes of interest. In order to localize the change events at the pixel level, we introduce a set of binary variables $h = \{h_i\}_{i=1}^{N}$ to denote the labels of individual pixel locations for each image pair, where the image has $N$ pixel locations.
We formulate change detection and localization as the joint prediction of the image-level and pixel-level change variables. To achieve this, we consider a deep structured model that defines a joint probabilistic model on $y$ and $h$ as $P(y, h \mid x) \propto \exp(-E(y, h \mid x))$, where the Gibbs energy is defined as:

$E(y, h \mid x) = \phi(y, x; \theta_c) + \sum_{i} \psi_u(h_i, x; \theta_s) + \sum_{i} \psi_c(y, h_i) + \sum_{i < j} \psi_p(h_i, h_j \mid x)$   (1)

where $\phi(y, x; \theta_c)$ is the unary term for the image-level label $y$, modeled by a CNN with parameters $\theta_c$, and $\psi_u(h_i, x; \theta_s)$ is the unary term for the pixel-level labels, modeled by a Fully Convolutional Network (FCN) with parameters $\theta_s$. The pairwise energy consists of two terms, $\psi_p$ and $\psi_c$, which enforce the spatial smoothness of the pixel-level labels and capture the coupling between the image-level and pixel-level labels, respectively. The joint prediction can be formulated as inferring the MAP estimate of the model distribution,

$(y^*, h^*) = \arg\max_{y, h} P(y, h \mid x) = \arg\min_{y, h} E(y, h \mid x)$   (2)
A graphical illustration of the model is shown in Fig. 3.
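To make the structure of the model concrete, the following is a minimal numpy sketch that evaluates the Gibbs energy of Eq. (1) for a toy binary (change/no-change) configuration. The function name and the simplified forms of the coupling and Potts terms are our illustrative assumptions, not the paper's exact parameterization:

```python
import numpy as np

def gibbs_energy(y, h, phi, psi_unary, w_couple, pairwise):
    """Toy evaluation of the Gibbs energy in Eq. (1).

    y         : image-level label (0 = no-change, 1 = change)
    h         : (N,) binary pixel-level labels
    phi       : (2,) image-level unary energies (classification branch)
    psi_unary : (N, 2) pixel-level unary energies (segmentation branch)
    w_couple  : weight of the image/pixel coupling term (assumed form)
    pairwise  : (N, N) symmetric Potts smoothness weights (dense CRF)
    """
    e = phi[y]                                   # image-level unary
    e += psi_unary[np.arange(len(h)), h].sum()   # pixel-level unaries
    if y == 0:                                   # coupling: change pixels are
        e += w_couple * h.sum()                  # penalized under a no-change label
    disagree = (h[:, None] != h[None, :])        # Potts: pay when labels differ
    e += 0.5 * (pairwise * disagree).sum()       # each unordered pair counted once
    return float(e)
```

A configuration with change pixels under a no-change image label pays both the coupling and smoothness penalties, which is exactly the bias the model exploits.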
2.2 Deep Network Architecture
We build the deep structured model by first introducing a two-stream deep CNN for the unary terms, as shown in Fig. 3. The underlying architecture of the network is similar to the VGG-net (configuration D, the winner of the classification and localization challenge, ILSVRC'14) [Simonyan and Zisserman2014], but with several major differences. Most importantly, the network operates on multi-channel inputs (6 channels for paired color images) and divides into two branches after the fourth pooling layer. From our initial experiments (consistent with [Zagoruyko and Komodakis2015]), a multi-channel network performs better than the traditionally used Siamese network for paired images. The two branches compute the probabilities of the image-level and pixel-level labels, and are therefore referred to as the classification and the segmentation branch, respectively. The segmentation branch in our architecture is similar to the FCN-VGG16-16s network [Long et al.2015], which demonstrated state-of-the-art performance on the Pascal VOC segmentation dataset. The initial shared layers in our architecture combine the initial (essentially similar) portions of the VGG and FCN networks, which results in a significant decrease in trainable parameters without any drop in performance. We now describe the details of the two branches of the network architecture.
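Because the branches split after the fourth pooling layer of a VGG-style trunk, the segmentation head sees feature maps at 1/16 of the input resolution. A tiny helper (our name, not the paper's) makes this downsampling explicit:

```python
def feature_map_size(h, w, n_pools=4):
    """Spatial size after n_pools 2x max-pooling stages of a VGG-style trunk.

    The two branches split after the fourth pooling layer, so the
    segmentation head operates at 1/16 of the input resolution.
    """
    for _ in range(n_pools):
        h, w = h // 2, w // 2
    return h, w
```

For example, a 512x512 input yields a 32x32 feature map at the split point, which is why enlarged input pairs are needed to obtain a usable coarse segmentation map.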
Image-level change unary energy:
The classification branch predicts the image-level label probability and has more layers to collapse the filter responses from the initial layers. Specifically, the classification branch output of the CNN architecture models $P(y \mid x)$, predicting the image-level change energy as:

$\phi(y, x; \theta_{sh}, \theta_c) = -f_y(x; \theta_{sh}, \theta_c)$   (3)

where $f_y$ is the deep network score before the final softmax operator, $\theta_{sh}$ are the weight parameters shared with the segmentation branch, and $\theta_c$ are the weight parameters for the classification branch only.
Pixel-level change unary energy:
The segmentation branch generates a downsampled coarse segmentation map for each change category. After the shared layers, the branch has three fully connected layers, which are implemented as convolution layers as in the FCN [Long et al.2015]. Formally, the segmentation branch of the CNN model generates the pixel-level change label energy as follows,

$\psi_u(h_i, x; \theta_{sh}, \theta_s) = -g_{i, h_i}(x; \theta_{sh}, \theta_s)$

where $g_{i, h_i}$ denotes the segmentation branch score of the CNN architecture at pixel $i$ before the softmax operator and $\theta_s$ are the weights for the fully connected layers.
We now describe the pairwise energy that encodes the compatibility relations between the image-level and the pixel-level variables as well as the spatial smoothness. Specifically, on top of the fully connected layers, we add a densely connected Conditional Random Field (CRF) to impose spatial smoothness on the pixel labeling. Unlike previous models [Papandreou et al.2015], our dense CRF depends on the output label of the classification branch, and thus couples the image-level and pixel-level predictions.
Formally, we define the compatibility relations between the output variables $y$ and $h$ by the following energy functions,

$\psi(y, h \mid x) = \sum_i \psi_c(y, h_i) + \sum_{i<j} \psi_p(h_i, h_j \mid x)$   (4)

where $\psi_c$ enforces all hidden variables to be zero if the category label predicts no-change and encourages $h_i$ to take a change label otherwise:

$\psi_c(y, h_i) = \begin{cases} \alpha \, h_i, & y = 0 \\ -\alpha \, d_i \, h_i, & y \neq 0 \end{cases}$

where $\alpha$ is a weight parameter and $d_i$ is the color difference between the two images at pixel $i$. The fully-connected pairwise term $\psi_p(h_i, h_j \mid x)$ defines the smoothing term between the latent variables given input features $f_i, f_j$. These energies have the functional form of a weighted Potts model in which the weight is defined using the Gaussian kernels of [Krähenbühl and Koltun2011]:

$\psi_p(h_i, h_j \mid x) = \mu(h_i, h_j) \left[ w_1 k_{app}(f_i, f_j) + w_2 k_{sm}(f_i, f_j) \right]$   (5)

where $w_1$, $w_2$ are the kernel weights, $\mu(h_i, h_j) = [h_i \neq h_j]$ is the Potts compatibility, while $k_{app}$, $k_{sm}$ are the appearance and smoothness kernels [Krähenbühl and Koltun2011].
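The kernel weights in Eq. (5) can be written out explicitly. Below is a brute-force O(N^2) numpy sketch; the paper instead evaluates these kernels implicitly via permutohedral-lattice filtering, and the bandwidths used here are arbitrary placeholders:

```python
import numpy as np

def dense_crf_weights(positions, colors, w_app=1.0, w_smooth=1.0,
                      theta_alpha=8.0, theta_beta=13.0, theta_gamma=3.0):
    """Pairwise weights of the weighted Potts model as an explicit (N, N) matrix.

    positions : (N, 2) pixel coordinates
    colors    : (N, 3) pixel colours
    The appearance kernel couples nearby pixels with similar colour; the
    smoothness kernel couples nearby pixels regardless of colour.
    """
    pos2 = ((positions[:, None, :] - positions[None, :, :]) ** 2).sum(-1)
    col2 = ((colors[:, None, :] - colors[None, :, :]) ** 2).sum(-1)
    k_app = np.exp(-pos2 / (2 * theta_alpha ** 2) - col2 / (2 * theta_beta ** 2))
    k_smooth = np.exp(-pos2 / (2 * theta_gamma ** 2))
    k = w_app * k_app + w_smooth * k_smooth
    np.fill_diagonal(k, 0.0)        # no self-connections
    return k
```

The explicit matrix is only practical for small N; it is useful for checking that distant or dissimilar pixels receive smaller smoothing weights than nearby, similar ones.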
2.3 Model Inference for Change Localization
Given the two-stream CNN+CRF model, we predict the image-level and pixel-level change labels by inferring the MAP estimate of the joint probability model in Eq. (1). In order to compute the most likely configuration efficiently, we adopt a sequential prediction approach that first infers the image-level change label, followed by the pixel-level change mask. Specifically, we compute the change label prediction approximately as follows,

$y^* = \arg\max_{y} P(y \mid x), \qquad h^* = \arg\max_{h} P(h \mid y^*, x)$

This prioritized inference procedure allows us to compute the (more reliable) image-level label first and to run an efficient mean-field inference for the pixel-level labels only once.¹

¹ In general, we note that we can compute the MAP estimate jointly by enumerating $y$'s values and running mean-field inference multiple times, which is less efficient.
We now derive a constrained mean-field inference algorithm for inferring the pixel-level change labeling $h$. We note that the efficient mean-field algorithm [Krähenbühl and Koltun2011] usually leads to an over-smoothing of the pixel-level labeling and assigns most of the pixels to the 'no-change' class. In this work, we incorporate an additional global constraint on the proportion of 'change' label values in the image. Unlike previous methods (e.g., [Papandreou et al.2015, Pinheiro and Collobert2015, Russakovsky et al.2015]), we enforce such constraints on the approximate probability family, which allows us to derive an efficient modified mean-field procedure.
Formally, we assume the foreground label proportion to be $\rho$, which is fixed during training by cross-validation. For each test image pair, we find closely matching pairs from the training set using a KNN search and average their foreground label proportions to estimate $\rho$ (details in Sec. 4.3). To enforce the proportion constraint, we introduce the following factorized approximate probability family with a global constraint:

$Q(h) = \prod_{i=1}^{N} Q_i(h_i), \quad \text{s.t.} \; \sum_{i=1}^{N} Q_i(h_i = 1) = \rho N$   (6)

where the constraint implies that the overall foreground probability mass is $\rho N$. Following [Krähenbühl and Koltun2013], we minimize the approximate KL-divergence,

$D(Q \,\|\, P) \approx \sum_i Q_i^\top u_i + \frac{1}{2} \sum_{i \neq j} Q_i^\top \Psi_{ij} Q_j - \sum_i H(Q_i) + \log Z$   (7)

where $u_i$ is the unary term vector (including $\psi_u$ and $\psi_c$), $\Psi_{ij}$ is the compatibility matrix computed from $\psi_p$, $H$ is the entropy, and $\log Z$ is the log partition function. We use the CCCP algorithm [Yuille and Rangarajan2003] to minimize $D(Q \,\|\, P)$ iteratively.
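To make the constrained update concrete, here is a simplified binary sketch: a plain mean-field sweep followed by a projection onto the foreground-mass constraint of Eq. (6), implemented as a bisection on a shared bias (a Lagrange-multiplier view of the constraint). The paper minimizes the constrained KL objective with CCCP, so this is only an illustrative stand-in:

```python
import numpy as np

def constrained_meanfield(unary, pairwise, rho, n_iters=10, tol=1e-6):
    """Mean-field for a binary dense CRF with a global foreground-mass constraint.

    unary    : (N, 2) unary energies (columns: no-change, change)
    pairwise : (N, N) Potts weights between pixels (zero diagonal)
    rho      : target average foreground probability per pixel
    Returns (N, 2) approximate marginals Q with mean(Q[:, 1]) ~= rho.
    """
    N = unary.shape[0]
    q_fg = np.full(N, rho)
    for _ in range(n_iters):
        # binary Potts mean-field logit: a label gains support from neighbours
        # that already favour it, and loses the unary energy it pays
        s = (unary[:, 0] - unary[:, 1]) + pairwise @ (2 * q_fg - 1)
        # enforce sum_i Q_i(change) = rho * N: bisect on a shared bias lam
        lo, hi = -50.0, 50.0
        while hi - lo > tol:
            lam = 0.5 * (lo + hi)
            if (1.0 / (1.0 + np.exp(-(s + lam)))).mean() > rho:
                hi = lam
            else:
                lo = lam
        q_fg = 1.0 / (1.0 + np.exp(-(s + 0.5 * (lo + hi))))
    return np.stack([1.0 - q_fg, q_fg], axis=1)
```

Without the projection, strong background unaries drive all marginals towards 'no-change'; the shared bias redistributes mass so that the pixels with the strongest change evidence keep the foreground probability.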
3 EM Learning with Weak Supervision
We now consider a weakly supervised learning approach to estimate the parameters of the two-stream CNN+CRF model (Sec. 2). In particular, as labeling pixel-level change patterns is tedious and impractical, we assume only image-level change annotations are available, which can be obtained with much less effort. Let us denote the dataset comprising $M$ labeled image pairs as $\mathcal{D} = \{(x_n, y_n)\}_{n=1}^{M}$.
The learning objective is to maximize the log conditional likelihood $\log P(y \mid x; \theta)$, and we consider a variational mean-field energy lower bound as follows,

$\log P(y \mid x; \theta) \geq \mathbb{E}_{Q}\left[\log P(y, h \mid x; \theta)\right] + \mathcal{H}(Q)$

where $\mathbb{E}_Q$ and $\mathcal{H}$ denote the expected value and the entropy function, respectively, and $Q$ is an approximate posterior probability factorizing over $h$ as defined in Eq. (6). In other words, the posterior probability can be expressed as the product of independent marginals: $Q(h) = \prod_i Q_i(h_i)$. We then derive a variational expectation-maximization (EM) algorithm for learning our two-stream CNN+CRF in the following, which alternately maximizes the objective function above.
3.1 Mean-field E Step
We update the approximate function $Q$ by maximizing the objective w.r.t. $Q$ given the model parameters from the previous iteration. Note that given the model structure, this leads to a mean-field update equation for computing $Q$. The update equation requires message passing between all pairs of latent variables, which is computationally expensive. Efficient message passing is achieved using high-dimensional Gaussian filtering based on the permutohedral lattice structure [Adams et al.2010].
Given the approximate posterior marginals, we can compute the (approximate) most likely configuration of the latent variables $h$,

$h_i^* = \arg\max_{h_i} Q_i(h_i), \quad i = 1, \ldots, N$   (8)

The marginal mode $h^*$ will be used in the M step for the CNN+CRF learning.
3.2 M Step for CNN+CRF Training
Once we have the posterior marginal distribution and its mode, we update the model parameters with the posterior mode configuration $h^*$ and the ground-truth $y$. Specifically, we treat them as the ground truth for the pixel-level and image-level labels, and learn the two-stream deep CNN+CRF in a stage-wise manner. Our stage-wise learning first estimates the parameters of the unary terms, i.e., the two deep CNNs, and then validates the parameters of the pairwise term. This strategy is similar to piecewise learning in the CRF literature.
We first use backpropagation to train the two branches of the deep CNN separately with the corresponding training data. More precisely, the averaged gradient from the two streams is backpropagated to update the shared parameters $\theta_{sh}$, while the individual gradients are computed using $y$ and $h^*$ as ground truths to update $\theta_c$ and $\theta_s$ for the classification and segmentation branches, respectively. Concretely, the model parameters are updated to maximize the data likelihood as follows,

$\theta^* = \arg\max_{\theta} \sum_{n=1}^{M} \left[ \log P(y_n \mid x_n; \theta_{sh}, \theta_c) + \log P(h_n^* \mid x_n; \theta_{sh}, \theta_s) \right]$

After the two-stream deep network component is trained, we estimate the parameters in Eq. (4) by cross-validation.
The overall EM procedure starts with an M step using an initial estimate of the hidden variables, which we assume to be consistent with the image-level labels. The model parameters are fine-tuned by training the two-stream CNN+CRF with those initial labels. This is important because the CNN is pre-trained for object recognition on ImageNet, and therefore the estimation of change regions in an initial E step would not generate reasonable ground truths.
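The alternation described above can be summarised in a short driver loop. The helper callables (`init_masks`, `train_cnn`, `infer_masks`) are hypothetical stand-ins for the label initialisation, the CNN+CRF training (M step), and the constrained mean-field inference (E step):

```python
def weakly_supervised_em(pairs, labels, init_masks, train_cnn, infer_masks,
                         n_rounds=3):
    """Variational EM driver (Sec. 3): start with an M step, then alternate.

    pairs, labels : weakly labeled image pairs and their image-level labels
    init_masks    : callable producing pixel labels consistent with `labels`
    train_cnn     : callable (M step) fitting the two-stream CNN+CRF
    infer_masks   : callable (E step) running constrained mean-field inference
    """
    masks = init_masks(pairs, labels)            # initial pixel-level labels
    model = train_cnn(pairs, labels, masks)      # initial M step (fine-tuning)
    for _ in range(n_rounds):
        masks = infer_masks(model, pairs)        # E step: posterior mode h*
        model = train_cnn(pairs, labels, masks)  # M step: refit parameters
    return model
```

Passing the three components as callables keeps the control flow explicit: one initial M step, then E and M steps strictly alternating.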
4 Experiments
4.1 CNN Implementation
The network weights are initialized from a VGG network pre-trained on ImageNet. The network splits into two branches after the fourth pooling layer. As we need a coarse segmentation map at the output of the segmentation branch, enlarged paired images are fed to the CNN. Moreover, the convolution filter size in FC1 (segmentation branch) is kept small (in contrast to the larger filter in the original FC1) to avoid an additional decrease in the resolution of the output map.
The unary energies of our CRF model are defined using the CNN activations, while the Gaussian edge potentials proposed by [Krähenbühl and Koltun2011] are used as pairwise terms. Note that changes of interest can occur in either of the two paired images, and it is therefore not desirable to remain restricted to detecting changes in only one of the images (w.r.t. the other). For this purpose, the ground truth with which we compare our final segmentation results includes the changes in both images (see Fig. 4). During the mean-field inference step, we find the segmentation map of each image using its respective edge potentials. Subsequently, the two output maps are combined to obtain the final estimate of the hidden variables. The resulting segmentation map is used as ground truth during the CNN training (M step).
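Combining the two per-image maps amounts to a union, since a change in either image should appear in the final mask. A minimal sketch (assuming binary masks of equal size):

```python
import numpy as np

def combine_change_maps(map_a, map_b):
    """Union of the binary change maps inferred separately for the two images.

    map_a, map_b : (H, W) binary masks, one per image of the pair
    """
    return np.logical_or(map_a, map_b).astype(np.uint8)
```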
4.2 Datasets and Protocols
We evaluate our method on the following four datasets. All of them include pixel-level change ground truth, from which we derive the image-level annotations for weakly supervised learning. The pixel-level labels are not used in the training of our deep network.
CDnet 2014 Dataset:
The original video database consists of 53 videos with frame-by-frame ground truths available within specified regions of interest (ROIs). Various types of changes (e.g., shadow, object motion and motion blur) under different conditions (e.g., challenging weather, air turbulence and dynamic background) are included in this database [Wang et al.2014]. It is also important to note that the paired images are not registered, and the background can therefore change across paired images [Wang et al.2014]. Distinct image pairs are generated at random from the video sequences. In each pair, both images belong to the same video but are captured at different time instances.
AICD 2012 Dataset:
The Aerial Image Change Detection (AICD) dataset [Bourdis et al.2011] consists of pairs of large-sized images. It is a synthetic dataset in which the images are generated using the realistic rendering engine of a computer game (Battle Station 2). The scenes in this dataset contain several real-world objects, including buildings, trees and vehicles, and are generated under varying conditions with significant changes in viewpoint, shadows and types of changes. Because the change regions are very small in satellite/aerial images, we work at the patch level and extract patches from each image with minimal overlap. This provides a large set of paired images, facilitating the training of a model with a large number of parameters.
GASI 2015 Dataset:
The Geoscience Australia Satellite Image (GASI) dataset is a custom-built dataset based on changes that occurred in an area east of the city of Melbourne in Victoria, Australia [Khan et al.2017]. For each region of interest, we have a time-lapse sequence (between 1999 and 2015) of surface reflectance data and the corresponding pixel quality maps. Due to the severe artifacts caused by clouds and band saturation, modelling the temporal trends is very challenging. In contrast, the acquisition of paired images captured at different times is much easier.

The GASI dataset provides annotations for two types of changes, namely fire and harvest. We generate pairs of image patches for distinct regions of interest identified by experts. Since the raw data contains artifacts, we improved its quality by filling in data across different time instances.

There exists a large disparity among the sizes of the change regions in the GASI dataset. For very large regions, we cropped the region bounded by a tightly fit bounding box. For small regions (mostly changes due to forest harvesting), we cropped a bounding box with dimensions equal to three times that of the tightly fit bounding box. Since there are large variations in the size of the changes across the identified change regions, we resized all regions to a uniform size to obtain a consistent segmentation map.
PCD 2015 Dataset:
The Panoramic Change Detection (PCD) dataset [Sakurada and Okatani2015] consists of 200 pairs of panoramic images of street scenes and tsunami-hit areas. From each panoramic image we extract patches with minimal overlap for training and testing. It is important to mention that the two images are not perfectly registered. As a result, there are temporal differences in camera viewpoints, illumination and acquisition conditions.
4.3 Results
The change detection results of our approach on the four CD datasets are shown in Table 1. As a baseline, we consider only the classification branch of the CNN network initialized with the pre-trained VGG-net (configuration D; see Table 1). Paired images are fed to this network architecture and feature vectors are extracted from the FC2 layer. A linear SVM classifier is then trained for classification using the LIBLINEAR package [Fan et al.2008]. The average precision (AP) and the overall accuracy of our approach are significantly higher than those of this baseline (specifically, a 6.8% and 5.3% boost in AP for the CDnet and GASI datasets, respectively). As a stronger baseline, we also report the performance of the network when only the fine-tuned classification branch is used (Table 1). We note that our full model outperforms the results from the fine-tuned classification branch.
Table 1: Change detection results (AP and overall accuracy, %).

Dataset    | Classification Branch | Fine-tuned Classification Branch | This Paper
           | AP     Acc.           | AP     Acc.                      | AP     Acc.
CDnet2014  | 88.8   92.0           | 94.0   96.6                      | 95.6   98.7
AICD2012   | 92.7   95.4           | 95.7   98.0                      | 97.3   99.1
GASI2015   | 80.3   82.5           | 83.0   84.4                      | 85.6   86.5
PCD2015    | 67.1   73.5           | 72.9   81.1                      | 74.9   84.2
We report the segmentation performance of our approach in Table 2 in terms of the mean intersection-over-union (mIOU) score. To compare our change localization results, we report several baseline procedures. Specifically, we compare against random segmentation masks (RS), thresholding applied to a difference map obtained from the pair of images (DT), thresholding applied to the output of a pre-trained network (weights of the segmentation branch initialized from VGG-net; PN), thresholding applied to the output of the fine-tuned network (Th.), and graph-cuts inference [Boykov et al.2001] using the CNN outputs as unaries and contrast-based pairwise potentials with a Potts model (GC). We note that random segmentation provides a lower baseline, while our results after training with ground truths (shown in the last column of Table 2) set an upper bar on the performance. Another important trend is that the thresholding and graph-cuts performances do not differ by a large margin. However, our weakly supervised approach achieves significantly higher mIOU scores due to the additional potentials and constraints (Sec. 2).
We also report segmentation results for two additional baselines which use cardinality-based pattern potentials (Table 3). These baselines are the higher-order potential (HOP) based dense and grid CRF models of Vineet et al. [Vineet et al.2014] and Kohli et al. [Kohli et al.2007], respectively. For both baselines, we define HOPs on segments generated using mean-shift segmentation. Due to the absence of pixel-level supervision, we use the parameters from [Kohli et al.2007]. We note that the dense CRF model with HOP [Vineet et al.2014] performs better than the grid CRF model [Kohli et al.2007]; however, our deep structured prediction model outperforms both of these strong baselines by a fair margin in terms of mIOU score.
Table 3: Comparison with higher-order potential (HOP) baselines.

Method | Dense CRF + HOP [Vineet et al.2014] | Grid CRF + HOP [Kohli et al.2007] | This Paper
mIOU%  | 42.0                                | 38.3                              | 46.2
Table 4: Ablation study on CDnet2014 (change segmentation).

Method                        | mIOU%
with only segmentation branch | 40.7
w/o CD fine-tuning            | 37.4
w/o difference term           | 41.5
w/o proportion constraint     | 41.3
Qualitative results for change localization on the CDnet2014, GASI2015 and PCD2015 datasets are shown in Fig. 4. The proposed approach performs well in localizing both small and large changes (Fig. 4). Moreover, it shows good results for images acquired under varying conditions (e.g., night, snow, rainfall, dynamic background) and with different capturing devices (e.g., thermal camera, PTZ). For the CDnet2014 dataset, it is interesting to note that our method localizes several changes in the regions outside the ROIs (shown in blue in the ground truth). Similarly, the qualitative results indicate the good performance of our method for satellite-image-based change detection.
We performed an ablative study on the CDnet2014 dataset for the change segmentation task (Table 4). The experimental results show that the localization performance decreases without the feedback from the classification branch (whose predictions are more accurate). Moreover, since the pre-trained network is not trained to detect changes from multi-channel inputs, its performance is considerably lower than that of the fine-tuned network. The difference term in the unary potential of the dense CRF and the global proportion constraint on the foreground probability mass also contribute a fair share to the final mIOU score.
Table 5: Test segmentation scores on CDnet2014 with different fixed values of the normalized foreground proportion.

Normalized proportion | 0.1  | 0.2  | 0.3  | 0.4  | 0.5  | 0.6
mIOU% (CDnet14)       | 30.8 | 28.4 | 36.7 | 41.1 | 32.5 | 25.4
At test time, we use KNN to estimate an image-adaptive foreground proportion, which gives a better estimate and improves performance. We perform the KNN search using the Euclidean distance on the features from the FC1 layer of the CNN model (classification branch). We use a fast approximate KNN search method based on KD-trees, which has an average complexity of at most O(log n). Note that for training, we validate a fixed-value proportion, which is faster than using KNN. As pixel-level labels are unavailable during training, we set the proportion to a value which covers at least a minimum fraction of each image on a validation set. To compare the performance of the image-adaptive proportion with a fixed-value one, we include the test segmentation scores on the CDnet14 dataset for different fixed values in Table 5. Furthermore, we evaluate the segmentation performance with different numbers of nearest neighbours used in the estimate and find that the best performance is achieved when k is set to 6 (Fig. 5). Finally, we present some error cases of our approach in Fig. 6.
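The image-adaptive estimate can be sketched as a brute-force nearest-neighbour search (the paper uses an approximate KD-tree search; function and variable names here are ours):

```python
import numpy as np

def estimate_rho(test_feat, train_feats, train_rhos, k=6):
    """Image-adaptive foreground proportion via k-NN in CNN feature space.

    test_feat   : (D,) FC1 feature of the test image pair
    train_feats : (M, D) FC1 features of the training pairs
    train_rhos  : (M,) foreground proportions of the training pairs
    Returns the mean proportion of the k nearest training pairs.
    """
    dists = np.linalg.norm(train_feats - test_feat, axis=1)
    nearest = np.argsort(dists)[:k]
    return float(train_rhos[nearest].mean())
```

For large training sets, the linear scan would be replaced by a KD-tree (e.g., an approximate search) to reach the stated O(log n) average query time.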
5 Conclusion
This paper tackles the problem of weakly supervised change detection in paired images. We developed a novel CNN-based model which jointly predicts change events and their locations. Our approach defines a dense CRF model on top of the CNN activations and uses a modified mean-field inference procedure to enforce the compatibility between image-level and pixel-level predictions. The proposed algorithm achieves a significant boost in both the detection and the localization of change events compared to strong baseline procedures. Our work is the first effort in the area of weakly supervised change detection using paired images and has potential applications in damage detection, structural monitoring and automatic 3D model updating systems. In the future, we will explore multi-class change detection in pairs of images and videos.
References
 [Adams et al.2010] Andrew Adams, Jongmin Baek, and Myers Abraham Davis. Fast high-dimensional filtering using the permutohedral lattice. In Computer Graphics Forum, volume 29, pages 753–762. Wiley Online Library, 2010.
 [Badrinarayanan et al.2013] Vijay Badrinarayanan, Ignas Budvytis, and Roberto Cipolla. Semi-supervised video segmentation using tree-structured graphical models. IEEE Transactions on Pattern Analysis and Machine Intelligence, 35(11):2751–2764, 2013.
 [Bourdis et al.2011] Nicolas Bourdis, Denis Marraud, and Hichem Sahbi. Constrained optical flow for aerial image change detection. In IGARSS, pages 4176–4179. IEEE, 2011.
 [Boykov et al.2001] Yuri Boykov, Olga Veksler, and Ramin Zabih. Fast approximate energy minimization via graph cuts. IEEE Transactions on Pattern Analysis and Machine Intelligence, 23(11):1222–1239, 2001.
 [Fan et al.2008] Rong-En Fan, Kai-Wei Chang, Cho-Jui Hsieh, Xiang-Rui Wang, and Chih-Jen Lin. LIBLINEAR: A library for large linear classification. The Journal of Machine Learning Research, 9:1871–1874, 2008.
 [Girshick et al.2014] Ross Girshick, Jeff Donahue, Trevor Darrell, and Jitendra Malik. Rich feature hierarchies for accurate object detection and semantic segmentation. In CVPR, pages 580–587. IEEE, 2014.
 [Gueguen and Hamid2015] Lionel Gueguen and Raffay Hamid. Large-scale damage detection using satellite imagery. CVPR, 2(2):3, 2015.
 [Jain and Grauman2013] Suyog Dutt Jain and Kristen Grauman. Predicting sufficient annotation strength for interactive foreground segmentation. In ICCV, pages 1313–1320. IEEE, 2013.
 [Khan et al.2015] Salman H Khan, Mohammed Bennamoun, Ferdous Sohel, and Roberto Togneri. Cost sensitive learning of deep feature representations from imbalanced data. arXiv preprint arXiv:1508.03422, 2015.
 [Khan et al.2016] Salman H Khan, Mohammed Bennamoun, Ferdous Sohel, and Roberto Togneri. Automatic shadow detection and removal from a single image. IEEE transactions on pattern analysis and machine intelligence, 38(3):431–446, 2016.
 [Khan et al.2017] Salman H Khan, Xuming He, Fatih Porikli, and Mohammed Bennamoun. Forest change detection in incomplete satellite images with deep neural networks. IEEE Transactions on Geosciences and Remote Sensing, 2017.
 [Kohli et al.2007] Pushmeet Kohli, M Pawan Kumar, and Philip HS Torr. P3 & beyond: Solving energies with higher order cliques. In CVPR, pages 1–8. IEEE, 2007.
 [Krähenbühl and Koltun2011] Philipp Krähenbühl and Vladlen Koltun. Efficient inference in fully connected CRFs with Gaussian edge potentials. In NIPS, pages 109–117, 2011.
 [Krähenbühl and Koltun2013] Philipp Krähenbühl and Vladlen Koltun. Parameter learning and convergent inference for dense random fields. In ICML, pages 513–521, 2013.
 [Lin et al.2014] Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C Lawrence Zitnick. Microsoft COCO: Common objects in context. In ECCV, pages 740–755. Springer, 2014.
 [Long et al.2015] Jonathan Long, Evan Shelhamer, and Trevor Darrell. Fully convolutional networks for semantic segmentation. In CVPR, pages 3431–3440. IEEE, 2015.
 [Papandreou et al.2015] George Papandreou, Liang-Chieh Chen, Kevin P Murphy, and Alan L Yuille. Weakly- and semi-supervised learning of a deep convolutional network for semantic image segmentation. In ICCV, pages 1742–1750. IEEE, 2015.
 [Pinheiro and Collobert2015] Pedro O Pinheiro and Ronan Collobert. From image-level to pixel-level labeling with convolutional networks. In CVPR, pages 1713–1721. IEEE, 2015.
 [Russakovsky et al.2015] Olga Russakovsky, Amy L Bearman, Vittorio Ferrari, and Fei-Fei Li. What's the point: Semantic segmentation with point supervision. arXiv preprint arXiv:1506.02106, 2015.
 [Sakurada and Okatani2015] Ken Sakurada and Takayuki Okatani. Change detection from a street image pair using CNN features and superpixel segmentation. In BMVC, 2015.
 [Simonyan and Zisserman2014] Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014.
 [Song et al.2015] Shuran Song, Samuel P Lichtenberg, and Jianxiong Xiao. SUN RGB-D: A RGB-D scene understanding benchmark suite. In CVPR, pages 567–576, 2015.
 [Vineet et al.2014] Vibhav Vineet, Jonathan Warrell, and Philip HS Torr. Filter-based mean-field inference for random fields with higher-order terms and product label-spaces. International Journal of Computer Vision, 110(3):290–307, 2014.
 [Wang et al.2014] Yi Wang, Pierre-Marc Jodoin, Fatih Porikli, Janusz Konrad, Yannick Benezeth, and Prakash Ishwar. CDnet 2014: An expanded change detection benchmark dataset. In CVPR Workshops, pages 393–400. IEEE, 2014.
 [Yuille and Rangarajan2003] Alan L Yuille and Anand Rangarajan. The concave-convex procedure. Neural computation, 15(4):915–936, 2003.
 [Zagoruyko and Komodakis2015] Sergey Zagoruyko and Nikos Komodakis. Learning to compare image patches via convolutional neural networks. arXiv preprint arXiv:1504.03641, 2015.