Coupled dictionary learning for unsupervised change detection between multisensor
remote sensing images^{1}^{1}1Part of this work has been supported by Coordenação de Aperfeiçoamento de Ensino Superior (CAPES), Brazil, the EU FP7 through the ERANETMED JCWATER program [MapInvPlnt Project ANR15NMED000202] and the ANR3IA Artificial and Natural Intelligence Toulouse Institute (ANITI).
Abstract
Archetypal scenarios for change detection generally consider two images acquired through sensors of the same modality. However, in some specific cases such as emergency situations, the only images available may be those acquired through sensors of different modalities. This paper addresses the problem of unsupervisedly detecting changes between two observed images acquired by sensors of different modalities with possibly different resolutions. These sensor dissimilarities introduce additional issues in the context of operational change detection that are not addressed by most of the classical methods. This paper introduces a novel framework to effectively exploit the available information by modelling the two observed images as a sparse linear combination of atoms belonging to a pair of coupled overcomplete dictionaries learnt from each observed image. As they cover the same geographical location, codes are expected to be globally similar, except for possible changes in sparse spatial locations. Thus, the change detection task is envisioned through a dual code estimation which enforces spatial sparsity in the difference between the estimated codes associated with each image. This problem is formulated as an inverse problem which is iteratively solved using an efficient proximal alternating minimization algorithm accounting for nonsmooth and nonconvex functions. The proposed method is applied to real images with simulated yet realistic and real changes. A comparison with stateoftheart change detection methods evidences the accuracy of the proposed strategy.
1 Introduction
Ecosystems exhibit permanent variations at different temporal and spatial scales caused by natural, anthropogenic, or even both factors (Coppin et al., 2004). Monitoring spatial variations over a period of time is an important source of knowledge that helps understanding the possible transformations occurring on Earth’s surface. Therefore, due to the importance of quantifying these transformations, change detection (CD) has been an ubiquitous issue addressed in the remote sensing and geoscience literature (Lu et al., 2004).
Remote sensing CD methods can be first classified with respect to (w.r.t.) their supervision (Bovolo and Bruzzone, 2007), depending on the availability of prior knowledge about the expected changes. More precisely, supervised CD methods require ground reference information about at least one of the observations. Conversely, unsupervised CD can be contextualized as automatic detection of changes without the need for any further external knowledge. Each class of CD methods present particular competitive advantages w.r.t. the others. For instance, supervised CD methods generally achieve better accuracy for predefined modalities whereas unsupervised methods are characterised by their flexibility and genericity. Nevertheless, implementing supervised methods require the acquisition of relevant ground information, which is a very challenging and expensive task, in terms of human and time resources (Bovolo and Bruzzone, 2007). Relaxing this constraints makes unsupervised methods more suitable for operational CD.
CD methods can also be categorized w.r.t. the imagery modalities the method is able to handle. As remote sensing encompasses many different types of imagery modalities (e.g., single and multiband optical images, radar, LiDAR), dedicated CD methods have been specifically developed for each one by exploiting its acquisition process and the intrinsic characteristics of the resulting data. Thus, due to differences in the physical meaning and statistical properties of images from different sensor modalities, a general CD method able to handle all modalities is particularly difficult to design and to implement. For this reason, most of the CD methods focus on a pair of images from one single target modality. In this case, the images are generally compared pixelwisely using the underlying assumption of same spatial resolutions (Singh, 1989; Bovolo and Bruzzone, 2015). Nevertheless, in some practical scenarios such as, e.g., emergency missions due to natural disasters, when the availability of data and the responsiveness are strong constraints, CD methods may have to handle observations of different modalities and/or resolutions. This highlights the need for robust and flexible CD techniques able to deal with this kind of observations.
The literature about multimodal CD is very limited, yet a few relevant references include the works by Kawamura (1971), Bruzzone et al. (1999), Inglada (2002), Lu et al. (2004), Alberga et al. (2007), Mercier et al. (2008) and Prendes et al. (2015). However, multimodal CD has always been an important topic since the initial development of CD methods. Earlier work by Kawamura (1971) described the potential of CD between a multimodal collection of datasets (e.g., photographic, infrared and radar), applied to weather prediction and land surveillance. Three features are extracted from the pair and the CD algorithm is trained on a learning set. According to Lu et al. (2004), various methods dedicated to CD between images from different sources of data are grouped as geographical information systembased methods. For instance, Solberg et al. (1996) proposed a supervised classification of multisource satellite images using Markov random fields. The work of Bruzzone et al. (1999) uses compound classification to detect changes in multisource data. The method uses artificial neural networks to estimate the posterior probability of classes. Moreover, Inglada (2002) studies the relevance of several similarity measures between multisensor data. These measures are implemented in a CD context (Alberga et al., 2007). A preprocessing technique based on conditional copula that contributes to better statistically modeling multisensor images was proposed by Mercier et al. (2008). Besides, Brunner et al. (2010) presented a strategy to assess building damages using a pair of very high resolution (VHR) optical and radar images by geometrically modeling buildings in both modalities. Chabert et al. (2010) updated information databases by means of logistic regression. More recently, the work of Prendes et al. (2015) presented a supervised method to infer changes after learning a manifold defined by pairs of patches extracted from the two images. Although some of these methods present relatively high detection performance, they are often restrained to a specific modality or to a specific target application. For instance, SolanoCorrea et al. (2018) proposed an approach to detect changes between multispectral images with different spatial and spectral resolutions by homogenization of radiometric and geometric image properties. However, this approach relies in particular on a taxonomy of possible radiometric changes observed in very high resolution images. Moreover, some methods are only suitable for building damage assessment taking benefit of their highlevel modeling, but show a poor adaptability to other scenarios (Brunner et al., 2010; Chabert et al., 2010). The other ones estimate some metrics from unchanged trained samples, which prevents their application within a fully unsupervised context (Bruzzone et al., 1999; Prendes et al., 2015; Mercier et al., 2008).
Recently, an unsupervised multisource CD method based on coupled dictionary learning was addressed by Gong et al. (2016). In the proposed methodology, the CD is based on the reconstruction error of patches approximated thanks to estimated coupled dictionary and independent sparse codes. Atoms of the dictionary are learnt from pairs of patches jointly extracted from the observed images. Following the same principle, in Lu et al. (2017), a semisupervised method was used to handle multispectral images based on joint dictionary learning. Both methods rely on the rationale that the coupled dictionary estimated from the observed images tends to produce stronger reconstruction errors in change regions rather than in unchanged ones. Because of the multimodality, the problem has not been formulated in the image space, but rather in a latent space formed by the coupled dictionary atoms. However, both methods exhibit some crucial issues that may impair their relative performance. First, the underlying optimization problem is highly nonconvex and no convergence guarantees are ensured, even by using some traditional dictionary learning methods (Aharon et al., 2006). Then, the considered CD problem has been split into two distinct steps: dictionary learning and code estimation. The errors in code estimation may produce false alarms in the final CD even with reliable dictionary estimates. Also, the statistical model of the noise – inherent to each sensor modality – has not been taken into consideration explicitly, which may dramatically impact the CD performance (Campbell and Wynne, 2011). Finally, these methods do not consider overlapping patches, which potentially would increase their robustness and thus do not explicitly handle the problem of possible differences in spatial resolutions (Ferraris et al., 2017b, a). The adequacy between the size of patches and the image scale is not discussed, although it may have a negative impact on the dictionary coupling and thus on the detection performance.
Overcoming these limitations, this paper proposes a similar methodology to learn coupled dictionaries able to conveniently model multimodal remote sensing images. Specifically, contrary to the aforementioned methods, the problem is fully formulated without splitting the learning and coding steps. Also, an appropriate statistical model is derived to describe the image from each specific remote sensing modality. Besides, the proposed method explicitly allows patch overlapping within the overall estimation process. To couple images with different resolutions, additional scaling matrices inspired by the work by Seichepine et al. (2014) are jointly estimated within the whole process. Finally, as the problem is highly nonconvex, it is iteratively solved based on the proximal alternating linearized minimization (PALM) algorithm (Bolte et al., 2014), which ensures convergence towards a critical point for some nonconvex nonsmooth problems. Note that the proposed patchbased method departs from segmentationbased methods which generally extract change information at an objectlevel, whose resolution is implicitly defined by the chosen segmentation procedure (Feng et al., 2018). Instead, the proposed method linearly decomposes overlapping square patches onto an appropriate common latent space, which allows CD to be operated at a pixel level.
This manuscript is organized as follows. Generic and welladmitted image models are introduced in Section 2. Capitalizing on these image models, Section 3 formulated the CD problem as a coupled dictionary learning. Section 4 proposes an algorithmic solution to minimize the resulting CDbased objective function. Section 5 reports experimental results obtained on synthetic images, considering three distinct simulation scenarios. Experiments conducted on real images are presented in Section 5.3. Finally, Section 6 concludes the manuscript.
2 Image models
2.1 Forward model
Let us consider that the image formation process inherent to all digital remote sensing imagery modalities is modeled as a sequence of transformations, denoted . This sequence applies to the original scene to produce the sensor output image. This output image is referred to as the observed image and is denoted by consisting of voxels stacked lexicographically that is from left to right, row by row. The voxel dimension may represent different quantities depending on the modality of the data. For instance, it stands for the number of spectral bands in the case of multiband optical images (Ferraris et al., 2017b) or for the number of polarization modes in the case of polarimetric synthetic aperture radar (POLSAR) images. The observed image provides a limited representation of the original scene with properties imposed by the image signal processor characterizing the sensor. The original scene cannot be exactly represented because of its continuous nature, but it can be conveniently approximated by a latent (i.e., unobserved) image related to the observed image as follows
(1) 
The sequence of transformations operated by the sensor over the latent image is often referred to as the degradation process. It may represent resolution degradations accounting for the spatial and/or spectral characteristics of the sensor (Ferraris et al., 2017a, b). In this paper, it specifically models the intrinsic noise corruption associated to the sensor modality (Sun and Févotte, 2014). The latent image can be understood, in this context, as a noisefree version of the observed image with the same resolution.
More precisely, the transformation underlies the likelihood function which statistically models the observed image conditionally to the latent image by taking into account the noise statistics. The noise statistical model mainly depends on the modality and rely on some classical distributions, e.g., the Gaussian distribution for optical images or the Gamma distribution for multilook SAR images. Moreover, as already pointed out by Févotte et al. (2009) in a different application context, for a wide family of distributions, this likelihood function relies on a divergence measure between the observed and latent images, which finally defines an explicit datafitting term through a negativelog transformation
(2) 
where and are parameters characterizing the distributions. In A, the divergence measures are derived for two of the most common remote sensing image modalities, namely optical multiband and SAR images, considered in this work.
2.2 Latent image sparse model
Sparse representations have been an ubiquitous and welladmitted tool to model images in various applications and taskdriven contexts (Mairal, 2014). Indeed, natural images are known to be compressible in a transformed domain, i.e., they can be efficiently represented by a few expansion coefficients acting on basis functions (Mallat, 2009). This finding has motivated numerous works on image understanding, compression and denoising (Olshausen and Field, 1997; Chen et al., 2001). In earlier works, this transformed domain, equivalently defined by the associated basis functions, was generally fixed in advance and chosen in agreement with the expected spatial content of the images (Mallat, 2009). Thus, the basis functions belonged to predetermined families with specific representation abilities, such as cosines, wavelets, contourlets, shearlets, among others. More recently, the seminal contribution by Aharon et al. proposed a new paradigm by learning an overcomplete dictionary jointly with a sparse code (Aharon et al., 2006). This dictionary learningbased approach exploits the key property of selfsimilarity characterizing the images to provide an adaptive representation. Indeed, it aims at identifying elementary patches that can be linearly and sparsely combined to approximate the observed image patches. In this paper, following the approach by Aharon et al. (2006), we propose to resort to this dictionarybased representation to model the latent image . More precisely, the image is first decomposed into a set of 3Dpatches with . Let denote a binary operation modeling the extraction, from the image, of the th patch () such that
(3) 
where stands for the th pixel patch in its vectorized form. The integer defines the spatial size of the patches, i.e., its number of rows and columns before being vectorized. Note that the number of patches is such that and patches may overlap. The choice of the number of patches will be more deeply discussed in Section 3 in the specific context of CD. The conjugate of the patchextraction operator^{2}^{2}2Note that, despite a slight abuse of notation, the operator (resp., ) does not stand for a matrix, but rather for a linear operator acting on the image (resp., the patch ) directly., denoted , acts on to produce a zeropadded image composed by the unique patch located at the th spatial position.
In accordance with dictionarybased representation principles, these patches are assumed to be approximately and independently modeled as sparse combinations of atoms belonging to an overcomplete dictionary
(4) 
where stands for the userdefined number of atoms composing the dictionary, commonly referred to as dictionary size and represents the code (coefficients) of the current patch over the dictionary, is the error covariance matrix and . Let denote the matrix that stacks the set of all, possibly overlapping, patches extracted from the latent image at spatial positions arranged on a generally regular spatial grid and enumerated in a lexicographical order (i.e., from left to right and top to bottom of the image). The matrix is the code matrix in which each column represents the code for each corresponding column of . The overcompletness property of the dictionary, occurring when the number of atoms is greater than the effective dimensionality of the input space, , allows for the sparsity of the representation (Olshausen and Field, 1997). The overcompletness implies redundancy and nonorthogonality between atoms. This property is not necessary for the decomposition, but has been proved to be very useful in some applications like denoising and compression (Aharon et al., 2006). Given the image patch matrix , dictionary learning aims at recovering the set of atoms and the associated code matrix and it is generally tackled through a 2step procedure. First, inferring the code matrix associated with the patch matrix and the dictionary can be formulated as a set of sparsitypenalized optimization problems. Sparsity of the code vectors () can be promoted by minimizing its norm. However, since this leads to a nonconvex problem (Chen et al., 2001), it is generally substituted by the corresponding convex relaxation, i.e., an norm. Within a probabilistic framework, taking into account the expected nonnegativeness of the code, this choice can be formulated by assigning a singleside exponential (i.e., Laplacian) prior distribution to the code components, assumed to be a priori independent
(5) 
where is the hyperparameter adjusting the sparsity level over the code. Conversely, learning the dictionary given the code can also be formulated as an optimization problem. As the number of solutions for the dictionary learning problem can be extremely large, one common assumption is to constrain the energy of each atom, thereby preventing to become arbitrarily large (Mairal et al., 2009). Moreover, in the particular context considered in this work, to promote the positivity of the reconstructed patches, the atoms are also constrained to positive values. Thus, each atom will be constrained to the set
(6) 
2.3 Optimization problem
Adopting a Bayesian probabilistic formulation of the image model introduced in Sections 2.1 and 2.2, the posterior probability of the unknown variables , and can be derived using the probability chain rule (Gelman, 2004)
(7) 
where is the likelihood function relating the observation data to the latent image through the direct model (1), is the dictionarybased prior model of the latent image, and are the (hyper)prior distributions associated with the dictionary and the sparse code. Under a maximum a posteriori (MAP) paradigm, the joint MAP estimator can be derived by minimizing the negative logposterior, leading to the following minimization problem
(8) 
with
(9)  
where represents the indicator function on the set ,
(10) 
and is the datafitting term associated with the image modality.
This model has been widely advocated in the literature, e.g., for denoising images of various modalities (Elad and Aharon, 2006; Ma et al., 2013). Particularly, in Ma et al. (2013), an additional regularization of the latent image was introduced as the target modalities may present strong fluctuations due to their inherent image formation process, i.e. Poissonian or multiplicative gamma processes. The final objective function (9) can thus be rewritten as
(11)  
where, for instance, can stand for a totalvariation (TV) regularization (Ma et al., 2013).
The next section expands the proposed image models to handle a pair of observed images in the specific context of CD.
3 From change detection to coupled dictionary learning
3.1 Problem statement
Let us consider two geographically aligned observed images and acquired by two sensors and at times and , respectively. The ordering of acquisition times is indifferent, i.e., either or are possible cases and the order does not impact the applicability the proposed method. The problem addressed in this paper consists in detecting significant changes between these two observed images. This is a challenging task mainly due to the possible dissimilarities in terms of spatial and/or spectral resolutions and of modality. Indeed, resolution dissimilarity prevents any use of classical CD algorithms without homogenization of the resolutions as a preprocessing step (Singh, 1989; Bovolo and Bruzzone, 2015). Moreover modality dissimilarity, which makes most of the CD algorithms inoperative because their inability of handling images of different nature (Ferraris et al., 2017a, b). To alleviate this issue, this work proposes to improve and generalize the CD methods introduced by Seichepine et al. (2014); Gong et al. (2016); Lu et al. (2017). Following the widely admitted forward model described in Section 2.1 and adopting consistent notations, the observed images and can be related to two latent images and
(12a)  
(12b) 
where and denote two degradation operators imposed by the sensors and . Note that (12) is a double instance of the model (1). In particular, in the CD context considered in this work, the two latent images and are supposed to represent the same geographical region provided the observed images have been beforehand coregistered.
Both latent images can be represented thanks to a dedicated dictionarybased decomposition, as stated in Section 2.2. More precisely, a pair of homologous patches extracted from each image represents the same geographical spot. Each patch can be reconstructed from a sparse linear combination of atoms of an imagedependent dictionary. In the absence of changes between the two observed images, the sparse codes associated with the corresponding latent image are expected to be approximately the same and the two learned dictionaries are coupled (Yang et al., 2010, 2012; Zeyde et al., 2010). This coupling can be understood as the ability of deriving a joint representation for homologous multiple observations in a latent coupled space (Gong et al., 2016). Akin to the manifold proposed by Prendes et al. (2015), this representation offers the opportunity to analyze images of different modalities in a common dual space. In the case where a pair of homologous patches has been extracted from two images representing the same scene, given perfect estimated coupled dictionaries, each patch should be exactly reconstructed thanks to the same sparse code. In other words, the pair of patches is an element of the latent coupled space. Nevertheless, in the case where the pair of homologous patches does not represent exactly the same scene, owing to a change that occurs between acquisitions, perfect reconstruction cannot be achieved using the same code. This means that the pair of patches does not belong to the coupled spaces. Using the same code for reconstruction amounts to estimate the point in the coupled spaces that best approximates the patch pair. Thereby, relaxing this constraint in some possible change locations may provide an accurate reconstruction of both images while spatially mapping change locations. In the specific context of CD addressed in this work, this finding suggests to evaluate any change between the two observed, or equivalently latent, images by comparing the corresponding codes
(13) 
where and denotes the code change vector associated with the th patch, . Then, to spatially locate the changes, a natural approach consists in monitoring the magnitude of , summarized by the code change energy image (Bovolo and Bruzzone, 2007)
(14) 
with
Note that, in the case of analyzing a pair of optical images, Zanetti et al. (2015) proposed to describe the components of the energy vector thanks to a RayleighRice mixture model whose parameters can be estimated to locate the changes. Conversely, in this work we propose to derive the CD rule directly from this magnitude. When the CD problem in the th patch is formulated as the binary hypothesis testing
a patchwise statistical test can be written by thresholding the code change energy
where the threshold implicitly adjusts the target probability of false alarm or, reciprocally, the probability of detection. The final binary CD map denoted can be derived as
The spatial resolution of this CD map is defined by the number of homologous patches extracted from the latent images and . This number can be tailored by the user according to the adopted strategy of patch extraction. In practice, to reach the highest resolution, overlapping patches should be extracted according to the regular grid defined by the observed image of highest resolution, i.e., .
Finally, to solve the multimodal image CD problem, the key issue lies in the joint estimation of the pair of representation codes or, equivalently, to the joint estimation of one code matrix and of the change code matrix, i.e. of , as well as of the pair of coupled dictionary and consequently of the pair of latent images from the joint forward model (12). The next paragraph introduces the CDdriven optimization problem to be solved.
3.2 Coupled dictionary learning for CD
The single dictionary estimation problem presented on Section 2.3 can be generalized to take into account the modeling presented in Section 3.1. Nevertheless, some previous considerations must be carefully handled in order to provide good coupling of the two dictionaries.
As the prior information about the dictionaries constrains each atom into the set of unitary energy defined by (6), an unbiased estimation of the code change vector would allow a pair of unchanged homologous patches to be reconstructed with exactly the same code, while changed patches would exhibit differences in their code. Obviously, this can only be achieved if the coupled dictionaries represent data with the same dynamics and resolutions. However, when analyzing images of different modalities and/or resolutions, this assumption can be not fulfilled. To alleviate this issue, we propose to resort to the strategy proposed by Seichepine et al. (2014), by introducing an additional diagonal scaling matrix constrained to the set where is the size of the dictionary . This scaling matrix gathers the code energy differences originated from different modalities for each pair of coupled atoms. This is essential to ensure that the sparse codes of the two observed images are directly comparable, following (13), and then properly estimated. Therefore, considering a pair of homologous patches, their joint representation model derived from (4) can be written as
(15)  
where represents the pair of homologous patches and is the diagonal scaling matrix.
Since the codes and are now elementwise comparable, a natural choice to enforce coupling between them should be the equality . This has been a classical assumption in various coupled dictionary learning applications (Yang et al., 2010; Zeyde et al., 2010; Yang et al., 2012). Nevertheless, in a CD context, some spatial positions may not contain the same objects. To account for possible changes in some specific locations while most of the patches remain unchanged, as in Ferraris et al. (2017b), the code change energy matrix defined by (14) is expected to be sparse. As a consequence, the corresponding regularizing function is chosen as the sparsityinducing norm of the code change energy matrix or, equivalently, as the norm of the code change matrix
(16) 
This regularization is a specific instance of the nonoverlapping grouplasso penalization (Bach, 2011) which has been considered in various applications to promote structured sparsity (Wright et al., 2009; Févotte and Dobigeon, 2015; Ferraris et al., 2017b).
Then, a Bayesian model extending the one derived for a single image (7) leads to the posterior distribution of the parameters of interest
(17)  
By incorporating all previously defined prior distributions (or, equivalently, regularizations), the joint MAP estimator of the quantities of interest can be obtained by minimizing the negative logposterior, leading to the following minimization problem
(18) 
with
(19)  
The next section describes an iterative algorithm which solves the minimization problem in (18).
4 Minimization Algorithm
Given the nature of the optimization problem (18), which is genuinely nonconvex and nonsmooth, the adopted minimization strategy relies on the proximal alternating linearized minimization (PALM) scheme (Bolte et al., 2014). PALM is an iterative, gradientbased algorithm which generalizes the GaussSeidel method. It performs iterative proximal gradient steps w.r.t. each block of variables from and ensures convergence to a local critical point . It has been successfully applied in many matrix factorization cases (Bolte et al., 2014; Cavalcanti et al., 2017; Thouvenin et al., 2016). Now, the goal is to generalize the single factorization to coupled factorization. The resulting CDdriven coupled dictionary learning (CDL) algorithm, whose main steps are described in the following paragraphs, is summarized in Algorithm 1. \setstretch1.35
4.1 PALM implementation
The PALM algorithm was proposed by Bolte et al. (2014) for solving a broad class of problems involving the minimization of the sum of finite collections of possibly nonconvex and nonsmooth functions. Particularly, the target optimization function is composed by a coupling function gathering the block of variables, denoted , and regularization functions for each block. Nonconvexity constraint is assumed for either coupling or regularization functions. One of the main advantages of the PALM algorithm over classical optimization algorithms is that each bounded sequence generated by PALM converges to a critical point. The rationale of the method can be seen as an alternating minimization approach for the proximal forwardbackward algorithm (Combettes and Wajs, 2005). Some assumptions are required in order to solve this problem with all guarantees of convergence (c.f (Bolte et al., 2014, Assumption 1, Assumption 2)). The most restrictive one (Bolte et al., 2014, Assumption 2(ii)) requires that the partial gradient of the coupling function is globally Lipschitz continuous for each block of variable keeping the remaining ones fixed. Indeed, it is a classical assumption for proximal gradient methods which guarantees a sufficient descent property.
Therefore, given the objective function to be minimized (19) and considering the same structure proposed by Bolte et al. (2014) and the Lipschitz property for linear combinations of functions (Eriksson et al., 2004), let us define the coupling function as
(20) 
This coupling function defined accordingly does not fulfill (Bolte et al., 2014, Assumption 2(ii)) because some of its terms are nonsmooth, specifically the TV regularizations and the norm sparsity promoting regularizations applied to . Thus, to ensure such a coupling function is in agreement with the required assumptions, smooth relaxations of and are applied by using the pseudoHuber function (Fountoulakis and Gondzio, 2016; Jensen et al., 2012).
The remaining terms of (19) are composed of the regularization functions associated with each variable block. Within the PALM structure, a gradient step applied to the coupling function w.r.t. a given variable block is followed by proximal step associated with the corresponding regularization functions. As a consequence, those regularization functions must be proximallike where their proximal mappings or projections must exist and have closedform solutions. It is important to keep in mind that, even if the convergence is guaranteed for all optimization orderings, it should not vary during iterations. Thus, the updating rules for each optimization variable in Algorithm 1 are defined. More details about the proximal operators and projections involved in this section are given in B.
4.2 Optimization with respect to
Considering the single block optimization variable , and assuming that the remaining variables are fixed, the PALM updating step can be written
(21) 
with
(22)  
where should be understood as an elementwise operation and is the associated Lipschitz constant
(23) 
Note that can be simply computed by considering the positive part of the softthresholding operator (Parikh et al., 2014).
4.3 Optimization with respect to
Similarly, considering the single block optimization variable and consistent notations, the PALM update can be derived as
(24) 
where
(25)  
and
(26) 
The proximal operator can be simply computed as a group softthresholding operator (Ferraris et al., 2017b), where each group is composed by each column of .
4.4 Optimization with respect to
As before, considering the single block optimization variable with , the PALM updating steps can be written as
(27) 
where
(28) 
and is the Lipschitz constant
(29) 
with and . Note that the projection can be computed as in Mairal et al. (2009), keeping only the values greater than zero.
4.5 Optimization with respect to
The updating rule of the scaling matrix can be written as
(30) 
where
(31) 
and is the Lipschitz constant related to
(32) 
The projection constrains all diagonal elements of to be nonzero.
4.6 Optimization with respect to
Finally, the updates of the latent images () are achieved as follows
(33) 
with
(34)  
and
(35) 
and where stands for the discrete divergence (Chambolle, 2004). Note that, represents the proximal mapping for the divergence measure associated with the likelihood function characterizing the modality of the observed image . For the most common remote sensing modalities, e.g., optical and radar, these divergences are well documented and A presents the corresponding proximal operators.
5 Performance analysis
Real datasets with corresponding ground truth are too scarce to statistically assess the performance of CD algorithms. Indeed, this assessment would require a huge number of pairs of images acquired at two different dates, geometrically coregistered and presenting changes. These pairs should also be accompanied by a ground truth (i.e., a binary CD mask locating the actual changes) to allow quantitative figuresofmerit to be computed. As a consequence, in a first step, we illustrate the algorithm and stateoftheart method outcomes over such rare examples corresponding to real images, real changes and associated ground truth (section 5.3.1). This first set of experiments is conducted on images of same spatial resolutions. Thus we also exhibit another set of examples that involves images with different resolutions, but alas without ground truth. For this second example, the accuracy of the proposed method cannot be quantified, but can be evaluated through visual inspection (section 5.3.2). In a second step, to conduct a complete performance analysis, the algorithm and comparable methods will be tested on image pairs that are representative of possible scenarios considered in this paper, i.e., coming from multimodal images. This test set is composed of real images, however simulated changes and associated ground truth (section 5.4).
5.1 Compared methods
As the number of unsupervised multimodal CD methods is rather reduced, the proposed technique has been compared to the unsupervised fuzzybased (F) method proposed by Gong et al. (2016), that is able to deal with multimodal images, to the robust fusion (RF) method proposed by Ferraris et al. (2017b) which deals exclusively with multiband optical images and with unsupervised segmentationbased (S) technique proposed by Huang et al. (2015). The fuzzybased method by Gong et al. (2016) relies on a coupled dictionary learning methodology using a modified KSVD (Aharon et al., 2006) with an iterative patch selection procedure to provide only unchanged patches for the coupled dictionary training phase. Then, the sparse code for each observed image is estimated separately from each other allowing to compute the crossimage reconstruction errors. Finally, a local fuzzy CMeans is applied to the mean of the crossimage reconstruction errors in order to separate change and unchanged classes. Equivalently to the proposed one, this method makes no assumption about the joint observation model. On the other hand, the robust fusion method by Ferraris et al. (2017b) is based on a more constrained joint observation model, considering that the two latent images share the same resolutions and differ only in changed pixels. Finally, the method proposed by Huang et al. (2015) replaces the pixelbased approach used on all previous methods to a featurebased approach. In this approach, features are derived from the segmentation of each image. Then, a difference map is generated based on metrics computed for the matching features extracted on previous steps at different scales. This strategy is used in order to provide finer details. At the end, the change map is generated using the histogram of difference map. The final change maps estimated by these algorithms are denoted as , and , respectively, while the proposed PALMCDL method provides a change map denoted .
5.2 Figuresofmerit
The CD performance of the different methods has been assessed through the empirical receiver operating characteristics (ROC) curves, representing the estimated pixelwise probability of detection () as a function of the probability of false alarm (). Moreover, two quantitative criteria derived from these ROC curves have been computed, namely, i) the area under the curve (AUC), corresponding to the integral of the ROC curve and ii) the distance between the no detection point and the point at the interception of the ROC curve with the diagonal line defined by . For both metrics, the greater the criterion, the better the detection.
5.3 Illustration through real images with real changes
In a first step, experiments are conducted on real images with real changes to emphasize the reliability of the proposed CD method and to illustrate the performance of the proposed algorithmic framework. Three distinct scenarios involving pairs of images of different modalities and resolutions, are considered namely,

Scenario 1 considers two optical images: the acquisition process is very similar for the two images and the image formation processes is characterized by an additive Gaussian noise corruption for both sensors.

Scenario 2 considers two SAR images: the image formation process is not the same as for optical images, in particular differing on the noise model, i.e., multiplicative Gamma noise instead of additive Gaussian model.

Scenario 3 considers a SAR image and an optical image: for this more challenging situation, there is no similarity between the noise corruption models for the two sensors.
To summarize, Scenarios 1 and 2 are dedicated to a pair of images with the same modality, but with a variation on the properties of images between scenarios, e.g., the noise statistics. Note that the proposed CDL algorithm has not been designed to specifically handle these conventional scenarios. However, they are still considered to evaluate the performance of the proposed method, in particular w.r.t. the methods specifically designed to address scenario 1 or 2. Conversely, Scenario 3 handles images of different modalities. All considered images have been manually geographically aligned to fulfil the requirements imposed by the considered CD setup.
5.3.1 Case of same resolutions and ground truth
For this first set of experiments, images of same spatial resolutions are considered. They are accompanied by a ground truth in the form of an actual change map to be estimated.
Scenario 1: optical vs. optical – The observed images are two multispectral (MS) optical images with 3 channels representing an urban region in the south of Toulouse, France, before (Figure 0(a)) and after (Figure 0(b)) the construction of a road. These pixel images are both characterized by a cm spatial resolution. The groundtruth change mask is represented in Figure 0(c). Figure 1 depicts the observed images at each date, the groundtruth change mask and the change maps estimated by the four compared methods.
The quantitative results for Scenario 1 are reported in Table 1 (lines 1 and 2) and the corresponding ROC curves in Figure 1(a). The analysis of these results shows that the proposed method outperforms stateoftheart methods for this scenario which involves common changes in urban areas and in MS optical images. Note that, besides the changes, this kind of situation involves a lot of small differences between the two observed images due to the variations in sun the illumination, in the vegetation cover, etc. These effects sometimes are classified as changes, increasing the false alarm rate, especially for stateoftheart methods. Besides, the proposed method still provides the best detection for this dataset.
Sc. 1 
AUC  

Dist.  
Sc. 2 
AUC  
Dist.  
Sc. 3 
AUC  
Dist. 
Scenario 2: SAR vs. SAR – For this scenario, two intensity radar images acquired over the Lake Mulargia region in Sardegna, by the Sentinel1 satellite in 05/21/2016 (Figure 2(a)) and 10/30/2016 (Figure 2(b)) are considered. These pixel images are both characterized by a m spatial resolution. This dataset mostly presents seasonal changes, in particular the variation of flooding areas around the Lake Mulargia. The groundtruth change mask is represented in Figure 2(c). Figure 3 depicts the two observed images, the groundtruth change mask and the change maps estimated by the four compared methods.
The quantitative results for Scenario 2 are reported in Table 1 (lines 3 and 4) and the corresponding ROC curves in Figure 1(b). The analysis of these results shows that the proposed method also outperforms the stateoftheart methods for this scenario. It is a good indication of its flexibility w.r.t. image modalities. Note that, due to the multiplicative noise and the consequent strong fluctuations, the stateoftheart methods present a lot of false alarms. This effect seems to be attenuated by the proposed method, probably thanks the TV regularization.
Scenario 3: optical vs. SAR – In order to test the performance of the different compared methods in a multimodality situation, we consider two images acquired over the Gloucester region, UK, before and after a catastrophic flooding accident in 2007. The beforeflooding image, presented in Figure 3(a), is a multispectral optical image with 3 channels acquired by Google Earth while the afterflooding image, depicted in Figure 3(b), is a radar image acquired by TerraSARX. These pixel images are both characterized by a m spatial resolution. The groundtruth change mask is represented on Figure 3(c). Figure 4 depicts the observed images at each date, the groundtruth change mask and the change maps estimated by the four comparative methods.
Table 1 (lines 5 and 6) reports the quantitative results for Scenario 3 and the corresponding ROC curves are displayed in Figure 1(c). Similarly as in the previous scenarios, the analysis of these results shows that the proposed method outperforms the stateoftheart methods even in this more challenging situation involving different image modalities with changes in rural and urban areas. As for Scenario 2, the TV regularization seems to be beneficial to smooth the fluctuations due to the nature of the noise in radar images, which may affect the other compared methods producing more false alarms.
Through this first illustration, we can observe that the segmentationbased method severly underperforms the other methods for scenarios 1 and 2 and underperfoms the proposed method under scenario 1. For brevity, in the following, we will compare only the F and RF methods to the proposed CDL one.
5.3.2 Case of different resolutions without ground truth
The previous set of experiments considered pair of images characterized by the same spatial resolution. As a complementary analysis, this section reports experiments conducted on real images of different spatial resolutions with real changes. However, for these 3 pairs of images, corresponding to the three scenarios, no ground truth is available. We first consider a Sentinel1 SAR image (European Space
Agency, 2017a) acquired on October 28th 2016. This image is a interferometric wide swath high resolution ground range detected multilooked SAR intensity image with a spatial resolution of m according to 5 looks in the range direction. Moreover, we also consider two multispectral Landsat 8 (United States Geological
Survey, 2017) pixel images with m spatial resolution and composed of the RGB visible bands (Band 2 to 4), acquired over the same region on April 15th 2015 and on September 22th 2015, respectively. Unfortunately, no groundtruth information is available for the chosen dates, as experienced in numerous experimental situations (Bovolo and Bruzzone, 2015). However, this region is characterized by interesting natural meteorological changes occurring along the seasons (e.g., drought of the Mud Lake, snow falls and vegetation growth), which helps to visually infer the major changes between observed images and to assess the relevance of the detected changes. All considered images have been manually geographically and geometrically aligned to fulfill the requirements imposed by the considered CD setup. Each scenario is individually studied considering the same denominations as in Section 5.3 and the same compared methods as in Section 5.1.
Scenario 1: optical vs. optical – In this scenario, two different situations are going to be explored, namely, observed images with the same or different resolutions. The first case considers both Landsat 8 images. Figure 5 depicts the two observed images and the change maps estimated by the three compared methods. These change maps have been generated according to (3.1) where the threshold has been adjusted such that each method reveals the most important changes, i.e., the drought of the Mud Lake. As expected, the robust fusion method presents better accuracy in detection since it was specifically designed to handle such a scenario. Nevertheless, the proposed method exhibits very similar results. It is worth noting that some of the observed differences are due to the patch decomposition required by the proposed method. The fuzzy method is able to localize the strongest changes, but low energy changes are not detected. The fuzzy method also suffers from resolution loss due to the size of the patches. Contrary to the proposed method, it does not take the patch overlapping into account, which contributes to decrease the detection accuracy.
Under the same scenario (i.e. optical vs. optical), an additional pair of observed images is used to better understand the algorithm behavior when facing to images of the same modality but with different spatial resolutions. The observed image pair is composed of the Sentinel2 image acquired on April 12th 2016 and the Landsat 8 image acquired in September 22th 2015. Note that the two observed images have the same spectral resolution, but different spatial resolutions. Figure 6 depicts the observed images as well as the change maps estimated by the comparative methods. Once again, it is possible to state the similarity of the results provided by the robust fusion method and the proposed one. It also shows the very poor detection performance of the fuzzy method. This may be explained by the difficulty of coupling due to differences in resolutions.
Scenario 2: SAR vs. SAR – In this scenario, observed SAR images acquired by the same sensor (Sentinel1) are used to assess the performance of the fuzzy method and the proposed one. The robust fusion method has not been considered due to the poor results obtained on the synthetic dataset (see Section 5.4 below). Figure 7 presents the observed images at each date and the change maps recovered by the two compared methods. The same strategy of threshold selection as for Scenario 1 has been adopted to reveal the most important changes. As expected, the proposed method presents a higher accuracy in detection than the fuzzy method. Possible reasons that may explain this difference are i) the fuzzy method is unable to handle overlapping patches and ii) the fuzzy method does not exploit appropriate datafitting terms, in opposite to the proposed one. Besides, as SAR images present strong fluctuations due to their inherent image formation process, the additional TV regularization of the proposed method may contribute to smooth such fluctuations and better couple the dictionaries.
Scenario 3: optical vs. SAR – For this scenario, once again, two different situations are addressed: images with the same or different spatial resolutions. The first one considers the Sentinel2 MS image acquired on April 12th 2016 and the Sentinel1 SAR image acquired in October 28th 2016. Figure 8 presents the observed images and the change maps derived from the fuzzy and proposed methods. To derive the change maps, the thresholding strategy is the same as for all previous scenarios. Once again, the proposed method shows better detection accuracy performance than the fuzzy one. It is important to emphasize the similarity of the results achieved in Scenario 3 and Scenario 2 for images acquired at the same date. Note also that this similarity can be observed for the proposed method, which contributes to increase its reliability for CD between multimodal images.
The second observed image pair consists in a Sentinel1 SAR image acquired on April 12th 2016 and a Landsat 8 MS image acquired on September 22th 2015. This pair represents the most challenging situation among all presented images, namely differences in both modalities and resolutions. Figure 9 presents the observed images at each date and the recovered change maps. For this last experiment, the proposed method presents better accuracy in detection than the fuzzy one. All differences in all previous situations can be observed in this scenario, culminating in the poor detection performance of the fuzzy method and a reliable change map for the proposed one.
5.4 Statistical performance assessment
Finally, the last set of experiments aims at statistically evaluating the performance of compared algorithms thanks to simulations on real images affected by synthetic changes. More precisely, in the case of multiband images, a dedicated CD evaluation protocol was proposed by Ferraris et al. (2017a) based on a single high spatial resolution hyperspectral reference image. The experiments conducted in this work follow the same strategy. Two multimodal reference images acquired at the same date have been selected as changefree latent images. By conducting simple copypaste of regions, as in Ferraris et al. (2017a), changes have been generated in both images as well as their corresponding groundtruth maps. This process allows synthetic yet realistic changes to be incorporated within one of these latent images, w.r.t. a predefined binary reference change mask locating the pixels affected by these changes and further used to assess the performance of the CD algorithms. This process is detailed in what follows.
5.4.1 Simulation protocol
Reference images – The reference images and used in this experiment comes from two largely studied open access satellite sensors, namely Sentinel1 (European Space
Agency, 2017a) and Sentinel2 (European Space
Agency, 2017b) operated by the European Spatial Agency. These images have been acquired over the same geographical area, i.e., the Mud Lake region in Lake Tahoe, at the same date on April 12th 2016. To fulfil the requirements imposed by the considered CD setup, both have been manually geographically and geometrically aligned. The Sentinel2 image used in this section is a image with m spatial resolution and composed of spectral bands corresponding to visible RGB (Bands to ). On the other hand, Sentinel1 reference image is a interferometric wide swath high resolution ground range detected multilooked SAR intensity image with a spatial resolution of m according to 5 looks in the range direction.
Generating the changes – Using a procedure similar to the one proposed by Ferraris et al. (2017a), given the reference images (), and a previously generated change mask , a change image can be generated as
(36) 
where the changeinducing functions is defined to simulate realistic changes in some pixels of the reference images. A set of predefined change masks has been designed according to specific copypaste change rules similar as the ones introduced by Ferraris et al. (2017a).
Generating the observed images – The observed images are generated under the previously defined distinct scenarios involving pairs of images, namely,

Scenario 1 considers two optical images,

Scenario 2 considers two SAR images,

Scenario 3 considers a SAR image and an optical image.
Each test set pair is formed by considering with for Scenario 1 and for Scenario 2. Conversely, for Scenario 3 handling multimodal images, two test pairs can be formed considering , i.e., .
5.4.2 Results
The ROC curves displayed in Fig. 10 with corresponding metrics in Table 2 correspond to the CD results obtained for each specific scenario. These results are discussed below.
Scenario 1: optical vs. optical – The ROC curves displayed in Fig. 9(a) with corresponding metrics in Table 2 (first two rows) correspond to the CD results obtained from a pair of optical observed images. These results show that the robust fusion achieves the best CD performance. This method has the benefit of exploring the joint model between the optical images, contrary to the fuzzy and proposed methods. Nevertheless, the proposed method achieves very similar performance. More importantly, they provide almost perfect detection even for very low PFA, i.e., for very low energy changes. On the other hand, the fuzzy method suffers from non detection and false alarm, even when applying the iterative strategy with a similar parameter selection approach as in Gong et al. (2016). This happens mostly in low energy change regions. One possible explanation is that the iterative selection is not able to distinguish between low energy and unchanged pixels, which may bias the coupling of dictionaries. Also, the disjoint reconstruction cannot properly deal with low energy changes because coupling is not perfect. In addition, as the methods directly work with the observed images without estimating the latent image, noise could be interpreted as a change, thus increasing the false alarm rate.
Scenario 2: SAR vs. SAR – As in the previous case, this dual scenario considers homologous observed SAR images. In this case the ROC curves are displayed in Fig. 9(b) with corresponding metrics in Table 2 (rd and th rows). Fig. 9(b) shows that the proposed method offers the highest precision among the compared methods and keeps a high level of detection compared to the Scenario 1. The fuzzy method presents a better accuracy result compared to optical images. One of the reasons is that optical images are generally characterised by richer information, which makes the dictionary coupling more difficult than for two SAR images. At the end, the robust fusion CD method shows a very low detection accuracy as it is not suited to deal with SAR images.
Scenario 3: optical vs. SAR – This scenario corresponds to a more difficult problem than the previous one. The physical information extracted in each image cannot be directly related in the observational space, contrary to the previous scenarios. The ROC curve are displayed in Fig. 9(c) with corresponding metrics in Table 2 (last two rows). As in Scenario 2, Fig. 9(c) shows that the proposed method still offers the highest detection accuracy, while the other methods present a very poor performance. Regarding the fuzzy method, the dictionary and the subsequent sparse code estimations are severely affected by the differences in terms of dimensionality of measurements and dynamics. Even by tuning the algorithmic parameters to increase the weight of the image of lowest dynamics (or lowest resolution), the dictionaries are not properly coupled. Note that, to use the robust fusion method in this challenging scenario, a spectral degradation has been artificially applied to reach the same spectral resolution for the two images. This has been achieved by considering a bandaveraging to finally form a panchromatic image. Resulting detection performance is even poorer than the fuzzy method because it supposes the same physical information between images. Only strong related changes are detected in this case.
Scenario 1 
AUC  

Dist.  
Scenario 2 
AUC  
Dist.  
Scenario 3 
AUC  
Dist. 
5.5 Implementation details
This paragraph briefly discusses some computational aspects of the proposed method. Concerning the algorithmic implementation, the code was implemented in Matlab and run on a Windows platform equipped with a Intel Core i7 8GB RAM CPU. Depending on the size of the images and on the number of iterations, analyzing a pair of images as those considered in the experiments described in this section may require one hour. The computational bottleneck is the memory required to store parameters for each image and possibly large temporary byproduct variables, in particular those relying on large matrix computation such as or . The size of such matrices depends directly on the numbers and of pixels of the observed images and the chosen size of the dictionaries. Note however that some computations could have been conducted in parallel to speed up the procedure, e.g., optimizing independently w.r.t. the dictionaries and (see Section 4.4), optimizing independently w.r.t. the latent images and