Efficient Image Splicing Localization via Contrastive Feature Extraction
Abstract
In this work, we propose a new data visualization and clustering technique for discovering discriminative structures in high-dimensional data. This technique, referred to as cPCA++, utilizes the fact that the interesting features of a “target” dataset may be obscured by high variance components during traditional PCA. By analyzing what is referred to as a “background” dataset (i.e., one that exhibits the high variance principal components but not the interesting structures), our technique is capable of efficiently highlighting the structure that is unique to the “target” dataset. Similar to another recently proposed algorithm called “contrastive PCA” (cPCA), the proposed cPCA++ method identifies important dataset-specific patterns that are not detected by traditional PCA in a wide variety of settings. However, the proposed cPCA++ method is significantly more efficient than cPCA, because it does not require the parameter sweep in the latter approach. We applied the cPCA++ method to the problem of image splicing localization. In this application, we utilize authentic edges as the background dataset and the spliced edges as the target dataset. The proposed method is significantly more efficient than state-of-the-art methods, as the former does not require iterative updates of filter weights via stochastic gradient descent and backpropagation, nor the training of a classifier. Furthermore, the cPCA++ method is shown to provide performance scores comparable to the state-of-the-art Multi-task Fully Convolutional Network (MFCN).
I Introduction
With the proliferation of low-cost and highly sophisticated image editing software, the generation of realistic-looking manipulated images has never been more easily accessible. At the same time, the prevalence of social media applications, such as Twitter, has made it very easy to quickly circulate these manipulated or fake images. Thus, there has been an increasing interest in developing forensic techniques to detect and localize forgeries (also referred to as manipulations or attacks) in images. One of the most common types of forgery is the image splicing attack. A splicing attack is a forgery in which a region from one image (i.e., the donor image) is copied and pasted onto another image (i.e., the host image). Forgers often use splicing to give a false impression that there is an additional object present in the image, or to remove an object from the image. Image splicing can potentially be used in generating false propaganda for political purposes. For example, the altered or manipulated image in the top row of Figure 1 shows Bill Clinton shaking hands with Saddam Hussein, although this event never occurred. The authentic images used to create the fake image can be seen in the bottom row of Figure 1.
An additional splicing example (from https://www.nist.gov/itl/iad/mig/media-forensics-challenge) is shown in Figure 2. The top row shows (from left to right): the manipulated image (also referred to as the probe or spliced image), the host image, and the donor image. In this example, the murky water from the donor image was copied and pasted on top of the clear water in the host image. This gives the false appearance that the water in the pool is not clean. The murky water in the manipulated image is referred to as the spliced surface or region. The bottom row in Figure 2 shows (from left to right): the spliced surface (or region) ground truth mask and the spliced boundary (or edge) ground truth mask. These two types of masks provide different ways of highlighting the splicing manipulation. The surface ground truth mask is a per-pixel binary mask which specifies whether each pixel in the given manipulated image is part of the spliced surface (or region). We use the color white to denote a pixel belonging to the spliced surface and the color black to denote a pixel not belonging to the spliced surface. The edge ground truth mask is a per-pixel binary mask which specifies whether each pixel in the given probe image is part of the boundary of the spliced surface. In this case, we use the color white to denote a pixel belonging to the spliced boundary and the color black to denote a pixel not belonging to the spliced boundary.
There are two main problems in the splicing literature: detection and localization. The detection problem refers to the problem of classifying an image as a whole as either spliced or authentic, without localizing the spliced region or boundary. Many of the current techniques only address the detection problem, and do not address the localization problem. However, recently, more effort has been made in the forensics community in addressing the localization problem, which aims to classify each pixel in an image as either belonging to the spliced surface/boundary or not. Most of the recent techniques utilize deep-learning-based approaches such as convolutional neural networks (CNNs) to localize splicing attacks. Although CNNs have yielded some improvement in the localization of splicing attacks, they rely on careful selection of hyperparameters, network architecture, and initial filter weights. Furthermore, CNNs require a very long training time because they utilize stochastic gradient descent and backpropagation to iteratively update the filter or kernel weights. It is desired to have an approach that relies more on fundamental mathematical theory, does not require a significant amount of experimental tuning, and does not require long training and testing times.
In this paper, we propose such an approach, the basis of which is Principal Component Analysis (PCA). As we will see, traditional PCA is not adequate for the task of image splicing localization, and so we will first present a new dimensionality-reduction technique for performing discriminative feature extraction, referred to as cPCA++, which is inspired by a recently proposed algorithm called “contrastive PCA” (cPCA) [1]. We then propose a new approach for image splicing localization based on cPCA++, which is significantly more efficient than state-of-the-art techniques such as the Multi-task Fully Convolutional Network (MFCN) [2], and still achieves comparable performance scores. Also, cPCA++ is derived via matrix factorization, which allows us to identify underlying bases and corresponding weightings in the data that can be used for denoising applications (see App. B for such an example).
The rest of the paper is organized as follows. Sec. II reviews existing splicing localization techniques. Sec. III presents cPCA++ as a new dimensionalityreduction technique, discusses how it differs from traditional PCA, and presents simulation results. The general framework or pipeline of the proposed cPCA++ approach for splicing localization is presented in Sec. IV. Experimental results on splicing datasets are presented in Sec. V. Finally, concluding remarks are given in Sec. VI.
II Related Work
Zampoglou et al. [3] conducted a comprehensive review of non-deep-learning-based techniques for image splicing localization, and provided a comparison of their performance. These techniques can be roughly grouped into the following three categories based on the type of feature or artifact they exploit: noise patterns [4, 5, 6, 7], Color Filter Array (CFA) interpolation patterns [8, 9], and JPEG-related traces [10, 11, 12, 13, 14, 15, 16, 17, 18]. Recently, there has been an increasing interest in the application of deep-learning-based techniques to general image forensics and splicing detection/localization [2, 19, 20, 21, 22, 23, 24, 25, 26]. Specifically, convolutional neural networks (CNNs) have attracted a significant amount of attention in the forensics community, due to the promising results they have yielded on a variety of image-based tasks such as object recognition and semantic segmentation [27, 28, 29]. One of the state-of-the-art CNN-based techniques for image splicing localization is the Multi-task Fully Convolutional Network (MFCN), proposed in [2]. In the MFCN, one branch is used to learn the spliced surface ground truth mask (or label), while the other branch is used to learn the spliced boundary/edge ground truth mask. In addition to the surface ground truth mask, the boundaries between inserted regions and their host background can be an important indicator of a manipulated area. It is shown in [2] that simultaneously training on the surface and edge labels can yield finer localization of the splicing manipulation, as compared to training only on the surface label. It is also important to note that the MFCN-based method and the proposed cPCA++ method both output an edge-based probability map (as we will see in Sec. IV), which allows for a direct comparison.
Although CNN-based approaches have yielded promising results in the field of image forensics, they rely on careful selection of hyperparameters, network architecture, and initial filter weights. The behavior of the resulting CNN classifier, after learning of the coupled feature extractor and classifier blocks, is difficult to explain mathematically. Furthermore, CNNs require a long training time since the filter weights need to be iteratively updated via stochastic gradient descent and backpropagation. It is desired to have an approach that relies more on fundamental mathematical theory, does not require a significant amount of experimental tuning, and does not require long training/testing times.
In this paper, we propose such an approach, based on a new version of Principal Component Analysis (PCA). In the context of image splicing localization, the two classes of interest (i.e., spliced and authentic boundaries) are very similar in terms of their covariance matrices, and traditional PCA is not able to effectively discriminate between the two classes. Instead, we propose a new version of PCA, referred to as cPCA++, that is able to perform discriminative feature extraction when dealing with extremely similar classes. We then propose a new approach for image splicing localization based on cPCA++. The proposed approach is mathematically tractable and does not require a significant amount of experimental tuning. Unlike CNNs, the proposed approach does not require iterative updates of filter weights via stochastic gradient descent and backpropagation, and thus is much more efficient than CNNbased approaches. In addition, we will see that the cPCA++ approach does not require the training of a classifier (e.g., support vector machines or random forests), which greatly speeds up the method. Also, the proposed approach can be readily parallelized due to its lack of dependence on inherently serial methods such as stochastic gradient descent. In the next section, we will derive the cPCA++ method, and then apply it to the image splicing localization problem in Sec. IV.
III cPCA++
III-A The cPCA++ Method
As we will see in Sec. IV, one characteristic of the image splicing localization problem is the strong similarity between authentic and spliced edges, especially when spliced edges have been smoothed over with a low-pass filter. Thus, traditional Principal Component Analysis (PCA) is not able to effectively discriminate between spliced and authentic edges. In order to deal with this issue, we will first study the problem of dimensionality reduction for extremely similar classes. Doing so will allow us to develop a new and efficient dimensionality-reduction technique that we refer to as cPCA++, which is inspired by a recently proposed algorithm called “contrastive PCA” (cPCA) [1]. In order to derive the cPCA++ method, we shall first consider the traditional PCA method.
Suppose we are presented with a data matrix $Z \in \mathbb{R}^{M \times N}$, where $M$ denotes the original feature dimension of the data and $N$ denotes the number of instances included in $Z$. PCA first computes the empirical covariance matrix of $Z$:

$R = \frac{1}{N} Z Z^\top$ (1)

where we assumed that $Z$ has zero mean. Next, PCA computes the subspace spanned by the top or leading eigenvectors of $R$ (i.e., those corresponding to the largest eigenvalues). The basis for this space constitutes the filters used to process the input data:

$W = [w_1, w_2, \ldots, w_K]$ (2)

where $K$ denotes the number of principal eigenvectors to return and $w_k$ denotes the eigenvector of $R$ associated with its $k$-th largest eigenvalue. Now, a low-dimensional version of the input data can be obtained as:

$Y = W^\top Z \in \mathbb{R}^{K \times N}$ (3)

It is known that the PCA filters preserve the most energy in $Z$ after the transformation. Observe that this property may not yield separability of classes in $Y$. That is, if the data matrix $Z$ is composed of multiple classes (e.g., spliced and authentic edges in our case), performing traditional PCA will not necessarily allow us to find a representation in which we can separate the classes.
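The traditional PCA pipeline of Eqs. (1)-(3) can be sketched in a few lines of numpy; the function name and the synthetic data below are ours, not part of the original exposition:

```python
import numpy as np

def pca_filters(Z, K):
    """Return the top-K PCA filters (principal eigenvectors of the
    empirical covariance) for a zero-mean data matrix Z (M features x N instances)."""
    N = Z.shape[1]
    R = (Z @ Z.T) / N                     # empirical covariance, as in Eq. (1)
    eigvals, eigvecs = np.linalg.eigh(R)  # R is symmetric: real, ascending eigenvalues
    return eigvecs[:, ::-1][:, :K]        # top-K eigenvectors as the columns of W

rng = np.random.default_rng(0)
Z = rng.standard_normal((10, 500))
Z -= Z.mean(axis=1, keepdims=True)        # enforce the zero-mean assumption
W = pca_filters(Z, K=3)
Y = W.T @ Z                               # low-dimensional representation, Eq. (3)
```

Since `numpy.linalg.eigh` returns orthonormal eigenvectors, the resulting filter matrix satisfies $W^\top W = I_K$; as discussed later, this orthonormality does not carry over to the cPCA++ filters.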
In this work, we will take a different approach, inspired by a recently proposed algorithm called the “contrastive PCA” (cPCA) method [1], which obtains discriminative filters. This approach focuses on finding directions that yield large variations for one dataset, referred to as the “target” or “foreground” dataset, while simultaneously yielding small variations for another dataset, referred to as the “background” dataset. There are two problem setups that benefit from this scheme, and we examine them in the following remark.
Remark 1.
In [1], the problem is to discover structures that are unique to one dataset relative to another. There are two ways of utilizing the solution to this problem. They are described as follows.
 1)

Consider the example described more thoroughly in Sec. III-B2. We are provided a “target” or “foreground” dataset with digits superimposed atop grass background images. Observe that the target dataset contains multiple classes (i.e., two different digits). Examples of instances of this dataset are illustrated in Fig. 4. We wish to learn filters that are tuned to the digits, as opposed to the relatively strong grass background images. In order to accomplish this task, consider having access to a “background” dataset that only contains instances of grass images (i.e., without digits). The task is to discover the structures that are unique to the target dataset and not present in the background dataset. If this is accomplished, the learned filters should be able to differentiate between the two digits in the “target” dataset.
 2)

In this alternative setup, we would like to solve a binary classification problem with the caveat that the two classes are very similar, and are dominated by the structures that are present in both classes. This is the case in the image splicing localization problem, where one class is the spliced edges and the other class is the authentic edges. In this scenario, we consider one class (i.e., the spliced edges) to be the target dataset and the other class (i.e., the authentic edges) to be the background dataset. If we are able to learn filters that are tuned to the structures unique to the spliced edges, then these filters form a basis for an efficient classifier that is able to differentiate between spliced and authentic edges. We will see the form of this classifier in Sec. IV.
We will take a detection-based approach to dimensionality reduction. We will set up a detector whose task is to identify whether or not a presented data matrix contains the interesting or special structures present in the foreground dataset (observe that the data matrix may contain background instances as well). By following this route, we will see that the detector will tune itself to specifically look for these special structures in the presented data matrix. Fortunately, the detector will look for these structures by transforming the incoming data matrix with a set of linear filters. This means that we may simply utilize the output of these filters as a low-dimensionality representation of the input data matrix, tuned specifically to the interesting structures.
To set up the problem mathematically, we assume that the data matrix $Z \in \mathbb{R}^{M \times N}$ is collected in an independent and identically distributed (i.i.d.) manner from the background dataset and has the following distribution:

$Z \sim \mathcal{N}\big(0_{M \times N}, \Sigma_b\big)$ (4)

where $\Sigma_b$ denotes an unknown covariance matrix of the background dataset, and $Z \sim \mathcal{N}(A, \Sigma)$ with $A \in \mathbb{R}^{M \times N}$ and $\Sigma$ positive-definite means that $Z$ has the following probability density function (pdf):

$p(Z) = \frac{1}{(2\pi)^{MN/2} |\Sigma|^{N/2}} \exp\left(-\frac{1}{2} \mathrm{tr}\!\left[\Sigma^{-1} (Z - A)(Z - A)^\top\right]\right)$ (5)

i.e., the columns of $Z$ are i.i.d. Gaussian vectors with covariance $\Sigma$. On the other hand, we assume that when the instances of $Z$ are sampled independently from the foreground dataset, $Z$ has the following distribution:

$Z \sim \mathcal{N}\big(F Y, \Sigma_f\big)$ (6)

where the mean of $Z$ can be factored as the product of a basis matrix $F \in \mathbb{R}^{M \times K}$ and a low-dimensionality representation matrix $Y \in \mathbb{R}^{K \times N}$. The inner dimension $K$ represents the underlying rank of the data matrix $Z$, which is generally assumed to be much smaller than the initial dimension $M$.
Now, it is assumed that we are presented with a data matrix $Z$ that could be drawn completely from the background dataset (we refer to this case as the null hypothesis $\mathcal{H}_0$) or contain instances from the background dataset and some instances from the foreground dataset (we refer to this case as hypothesis $\mathcal{H}_1$). That is, the data matrix under $\mathcal{H}_1$ can be written as:

$Z = [Z_{fg} \;\; Z_{bg}]$ (7)

where $Z_{fg} \in \mathbb{R}^{M \times N_{fg}}$ and $Z_{bg} \in \mathbb{R}^{M \times N_{bg}}$ denote data instances from the foreground and background datasets, respectively. When $\mathcal{H}_1$ is active, we assume that we know the value of $N_{fg}$. Given (4) and (6) above, we have that the mean of $Z$ under $\mathcal{H}_1$ can be written as:

$\mathbb{E}[Z \mid \mathcal{H}_1] = [F Y_{fg} \;\; 0_{M \times N_{bg}}] = F \bar{Y}$ (8)

where we have introduced the matrix $\bar{Y} \in \mathbb{R}^{K \times N}$, where $N = N_{fg} + N_{bg}$, with the following structure:

$\bar{Y} = [Y_{fg} \;\; 0_{K \times N_{bg}}]$ (9)

Therefore, the detection/classification problem we wish to examine is the determination of the active hypothesis (either $\mathcal{H}_0$ or $\mathcal{H}_1$), where the data matrix $Z$ is described, under each hypothesis, by:

$\mathcal{H}_0: \; Z \sim \mathcal{N}\big(0_{M \times N}, \Sigma_0\big), \qquad \mathcal{H}_1: \; Z \sim \mathcal{N}\big(F \bar{Y}, \Sigma_1\big)$ (10)

Note that $\Sigma_0$ and $\Sigma_1$ denote the covariance matrix under $\mathcal{H}_0$ and $\mathcal{H}_1$, respectively. The underlying assumption is that when the data matrix contains only instances from the background dataset, then the mean would be zero. We thus assume that all data instances of $Z$ have the mean of the respective partitions subtracted off. For example, if the raw (unprocessed) data matrix is given by $Z^{raw} = [Z_{fg}^{raw} \;\; Z_{bg}^{raw}]$, then the partitions of the data matrix are given by:

$Z_{fg} = Z_{fg}^{raw} - \hat{\mu}_{fg} \mathbb{1}^\top, \qquad \hat{\mu}_{fg} \triangleq \frac{1}{N_{fg}} Z_{fg}^{raw} \mathbb{1}$ (11)

$Z_{bg} = Z_{bg}^{raw} - \hat{\mu}_{bg} \mathbb{1}^\top, \qquad \hat{\mu}_{bg} \triangleq \frac{1}{N_{bg}} Z_{bg}^{raw} \mathbb{1}$ (12)
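This per-partition centering step is straightforward to implement; the sketch below is ours (function and variable names are not from the original text) and simply subtracts each partition's own empirical mean feature vector:

```python
import numpy as np

def center_partitions(Z_fg_raw, Z_bg_raw):
    """Subtract each partition's own empirical mean feature vector,
    so that the background partition matches the zero-mean assumption
    and the foreground mean reflects only foreground structure."""
    mu_fg = Z_fg_raw.mean(axis=1, keepdims=True)  # per-feature mean, foreground
    mu_bg = Z_bg_raw.mean(axis=1, keepdims=True)  # per-feature mean, background
    return Z_fg_raw - mu_fg, Z_bg_raw - mu_bg

rng = np.random.default_rng(0)
# Partitions with very different raw means; shapes are (M, N_fg) and (M, N_bg).
Z_fg, Z_bg = center_partitions(rng.normal(5.0, 1.0, (4, 50)),
                               rng.normal(-3.0, 1.0, (4, 80)))
```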
In solving this detection problem, we will be able to obtain the matrix $Y_{fg}$, which is the desired low-dimensionality representation of the foreground dataset. While it may initially appear that the detection problem is relatively simple (especially when the mean under the foreground dataset is relatively large in magnitude compared to the power in the covariance matrices), the detection problem is difficult when the mean is small compared to the covariance power (i.e., it is masked by large variations that naturally occur in both the background and foreground datasets), as outlined earlier and in [1]. This is true in the context of image splicing localization, so it is assumed that the mean under the foreground dataset is small compared to the covariance power.
Now, we would like to determine if a given data matrix $Z$ belongs to the distribution in the null hypothesis ($\mathcal{H}_0$) or the alternative hypothesis ($\mathcal{H}_1$). A classical approach to such problems is the likelihood ratio test, which is derived via the test:

$\Pr(\mathcal{H}_1 \mid Z) \;\underset{\mathcal{H}_0}{\overset{\mathcal{H}_1}{\gtrless}}\; \Pr(\mathcal{H}_0 \mid Z) \;\overset{(a)}{\Longleftrightarrow}\; p(Z \mid \mathcal{H}_1)\,\pi_1 \;\underset{\mathcal{H}_0}{\overset{\mathcal{H}_1}{\gtrless}}\; p(Z \mid \mathcal{H}_0)\,\pi_0$ (13)

where step $(a)$ is due to Bayes’ rule, and $\pi_0$ and $\pi_1$ denote the prior probabilities of hypotheses $\mathcal{H}_0$ and $\mathcal{H}_1$, respectively. The above can readily be rearranged as:

$\frac{p(Z \mid \mathcal{H}_1)}{p(Z \mid \mathcal{H}_0)} \;\underset{\mathcal{H}_0}{\overset{\mathcal{H}_1}{\gtrless}}\; \frac{\pi_0}{\pi_1}$ (14)

This is referred to as the “likelihood ratio test.” It evaluates the likelihood ratio of the data matrix under each hypothesis and compares it to a threshold given by $\pi_0/\pi_1$. The likelihood ratio on the left-hand side of (14) is desired to have a large value when the hypothesis $\mathcal{H}_1$ is active, but is desired to take on a small value when the hypothesis $\mathcal{H}_0$ is active. The threshold on the right-hand side of (14) is generally swept to achieve different performance that trades the probability of detection against the probability of false alarm. Observe that the likelihood ratio test is only useful when the probabilities $p(Z \mid \mathcal{H}_0)$ and $p(Z \mid \mathcal{H}_1)$ can be computed (i.e., the probability density function is completely known or all of its parameters can be estimated). In our case, however, we do not have knowledge of many parameters, including the mean components $F$ and $Y_{fg}$, as well as the covariance matrices $\Sigma_0$ and $\Sigma_1$. In order to deal with this situation, a common approach is to utilize what is referred to as the generalized likelihood ratio test (GLRT), which optimizes over parameters that are unknown from within the ratio. In our case, the GLRT is given by [30, p. 200]:

$\mathcal{L}(Z; F) \triangleq \frac{\max_{\bar{Y}, \Sigma_1} p_1\big(Z; F\bar{Y}, \Sigma_1\big)}{\max_{\Sigma_0} p_0\big(Z; \Sigma_0\big)} \;\underset{\mathcal{H}_0}{\overset{\mathcal{H}_1}{\gtrless}}\; \eta$ (15)

where $\eta$ denotes the threshold parameter, $p_1(\cdot)$ denotes the Gaussian pdf of $Z$ under the hypothesis $\mathcal{H}_1$, and $p_0(\cdot)$ denotes the Gaussian pdf of $Z$ under the null hypothesis $\mathcal{H}_0$. That is,

$p_1\big(Z; F\bar{Y}, \Sigma_1\big) = \frac{1}{(2\pi)^{MN/2} |\Sigma_1|^{N/2}} \exp\left(-\frac{1}{2} \mathrm{tr}\!\left[\Sigma_1^{-1} (Z - F\bar{Y})(Z - F\bar{Y})^\top\right]\right)$ (16)

$p_0\big(Z; \Sigma_0\big) = \frac{1}{(2\pi)^{MN/2} |\Sigma_0|^{N/2}} \exp\left(-\frac{1}{2} \mathrm{tr}\!\left[\Sigma_0^{-1} Z Z^\top\right]\right)$ (17)
In (15), we clearly indicate that the resulting GLRT statistic will be a function of the matrix $F$, which is the basis for the feature mean under $\mathcal{H}_1$. We will return to this point later in the derivation. Now, we can optimize each pdf over the unknown covariance matrix, independently, as prescribed by the GLRT in (15), to obtain that the optimal covariance matrix for each case is given by the empirical covariance matrix (i.e., the maximum-likelihood covariance matrix estimator for a Gaussian distribution [31, p. 89]), so that:

$\hat{\Sigma}_1 = \frac{1}{N} (Z - F\bar{Y})(Z - F\bar{Y})^\top, \qquad \hat{\Sigma}_0 = \frac{1}{N} Z Z^\top$ (18)

and the pdf for hypothesis $\mathcal{H}_i$ ($i \in \{0, 1\}$), evaluated at its optimal covariance estimate, is given by

$p_i\big(Z; \hat{\Sigma}_i\big) = \frac{1}{(2\pi)^{MN/2} \big|\hat{\Sigma}_i\big|^{N/2}} \exp\left(-\frac{MN}{2}\right)$ (19)
Thus, the GLRT simplifies to:

$\mathcal{L}(Z; F) = \max_{\bar{Y}} \left(\frac{\big|\hat{\Sigma}_0\big|}{\big|\hat{\Sigma}_1\big|}\right)^{N/2} \overset{(a)}{\equiv} \max_{\bar{Y}} \frac{\big|\hat{\Sigma}_0\big|}{\big|\hat{\Sigma}_1\big|}$ (20)
where step $(a)$ states that the GLRT statistic is equivalent to (20), since raising to the power $2/N$ is a monotonically increasing transformation that does not alter the test. Now, using the derivation in [32], we have that the GLRT statistic can be written as (see equation (8) and related derivations in App. A from [32]):
(21) 
We will now optimize over our free variable $F$. When the null hypothesis $\mathcal{H}_0$ is active, it can be shown that the GLRT statistic is independent of $F$. On the other hand, when $\mathcal{H}_1$ is active, we have that the GLRT is dependent on $F$ and is given by:

(22)

where the partitions $Z_{fg}$ and $Z_{bg}$ are obtained via (7) and we have defined the following quantities:

$R_{bg} \triangleq \frac{1}{N_{bg}} Z_{bg} Z_{bg}^\top, \qquad R_{fg} \triangleq \frac{1}{N_{fg}} Z_{fg} Z_{fg}^\top$ (23)

where $R_{bg}$ denotes the second-order statistic associated with the background dataset only, while $R_{fg}$ denotes the second-order statistic associated with the foreground dataset only. Now, to ensure that the gap in the GLRT statistic between the two hypotheses is as large as possible, we must maximize the GLRT value over $F$ when the hypothesis $\mathcal{H}_1$ is active (since the GLRT value is independent of $F$ when the null hypothesis is active). It turns out that the value of $F$ that maximizes the GLRT is obtained by solving a generalized eigenvalue problem [33] [34, pp. 454–455]. The optimal choice for $F$ turns out to be the principal eigenvectors of a matrix of the form $I_M + c\, R_{bg}^{-1} R_{fg}$ (with $c > 0$), or, equivalently, the principal eigenvectors of the matrix $R_{bg}^{-1} R_{fg}$ when the eigenvalues of $R_{bg}^{-1} R_{fg}$ are nonnegative. The equivalence holds because, for a diagonalizable square matrix $B$ with nonnegative eigenvalues, the matrix $I + cB$ with $c > 0$ modifies each eigenvalue of $B$ by adding one to a positively-scaled version of it; this does not change the ordering of the eigenvalues, and thus $I + cB$ and $B$ share the same principal eigenvectors. We will show later that the eigenvalues of $R_{bg}^{-1} R_{fg}$ are indeed nonnegative: the eigenvalues of $R_{bg}^{-1/2} R_{fg} R_{bg}^{-1/2}$ will be shown to be nonnegative, and $R_{bg}^{-1} R_{fg}$ is similar to $R_{bg}^{-1/2} R_{fg} R_{bg}^{-1/2}$ since $R_{bg}^{-1/2}$ is nonsingular [35, p. 53], so the eigenvalues of $R_{bg}^{-1} R_{fg}$ are also nonnegative. While the solution for $F$ itself is not directly interesting for the problem of image splicing localization (it is relevant for applications related to matrix factorization, denoising, and dictionary learning [36], which we explore in App. B), the reduced-dimensionality matrix $\hat{\bar{Y}}$ is much more relevant for dimensionality reduction purposes.
The maximum-likelihood estimator for $\bar{Y}$ (which was used to obtain (21)) is written as

$\hat{\bar{Y}} = W^\top Z$ (24)

where $W \in \mathbb{R}^{M \times K}$ is a bank of filters yet to be determined, but has the following dependence on $F$ [32]:

(25)
In fact, the filters $W$ in (25) can be shown to be the principal eigenvectors of the matrix:

$Q \triangleq R_{bg}^{-1} R_{fg}$ (26)

This can be seen by noting that the optimal $F$ (being a matrix of eigenvectors of $Q$) must satisfy $Q F = F \Lambda$ for some diagonal matrix $\Lambda$ that contains the eigenvalues. From this, we have that $R_{fg} F = R_{bg} F \Lambda$. Substituting this relation into (25) shows that $W = F D$ for some diagonal, invertible scaling matrix $D$, so that the columns of $W$ are themselves (scaled) eigenvectors of $Q$. The matrix $D$ can be shown to be diagonal since the columns of the eigenvector matrix of the associated real and symmetric matrix $R_{bg}^{-1/2} R_{fg} R_{bg}^{-1/2}$ are orthogonal for distinct eigenvalues, and the columns that correspond to the same eigenvalue can be chosen to be orthogonal [37, p. 345]. It can also be shown that $D$ is positive definite (and thus invertible) as long as $F$ has full column rank. Furthermore, the matrix $F$ has full column rank because the matrix $Q$ can be shown to be diagonalizable. This is because $Q$ can be shown to be similar to $R_{bg}^{-1/2} R_{fg} R_{bg}^{-1/2}$ [38, p. 81], which is real and symmetric and thus diagonalizable [38, p. 97].
The obtained filters $W$ attempt to examine the specific structures that differentiate samples of the foreground dataset from samples of the background dataset. This leads to a relatively straightforward and efficient method for dimensionality reduction, which is listed in Alg. 1. Also, note that it is assumed that the matrix $R_{bg}$ is positive-definite. If it is not positive-definite, it is customary to apply diagonal loading (addition of $\epsilon I_M$ to the covariance matrix, for a small $\epsilon > 0$) to force the covariance matrix to be positive-definite when it is rank-deficient (although, if enough diverse data samples are available, it is rare that the covariance matrix is rank-deficient).
Algorithm 1: The cPCA++ method

Input: centered background data matrix $Z_{bg} \in \mathbb{R}^{M \times N_{bg}}$, centered target/foreground data matrix $Z_{fg} \in \mathbb{R}^{M \times N_{fg}}$, and the desired output dimension $K$.

Compute:

$R_{bg} = \frac{1}{N_{bg}} Z_{bg} Z_{bg}^\top$ (27) $\qquad R_{fg} = \frac{1}{N_{fg}} Z_{fg} Z_{fg}^\top$ (28)

Perform an eigenvalue decomposition on

$Q = R_{bg}^{-1} R_{fg}$ (29)

Compute the top-$K$ right-eigenvectors $W$ of $Q$ and return them as the filter matrix.
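A compact numpy sketch of Alg. 1 follows; the function name, argument order, and the diagonal-loading default `eps` are our own choices, not part of the original listing:

```python
import numpy as np

def cpca_pp_filters(Z_bg, Z_fg, K, eps=1e-8):
    """cPCA++ filters per Alg. 1: the top-K right-eigenvectors of
    Q = R_bg^{-1} R_fg, where Z_bg (M x N_bg) and Z_fg (M x N_fg) are
    zero-mean data matrices. eps is a small diagonal-loading constant
    applied in case R_bg is rank-deficient."""
    M = Z_bg.shape[0]
    R_bg = Z_bg @ Z_bg.T / Z_bg.shape[1]   # background statistic, Eq. (27)
    R_fg = Z_fg @ Z_fg.T / Z_fg.shape[1]   # foreground statistic, Eq. (28)
    R_bg = R_bg + eps * np.eye(M)          # diagonal loading
    Q = np.linalg.solve(R_bg, R_fg)        # Q = R_bg^{-1} R_fg, Eq. (29)
    lam, V = np.linalg.eig(Q)              # eigenvalues of Q are real, nonnegative
    order = np.argsort(lam.real)[::-1]     # sort descending by eigenvalue
    return V[:, order[:K]].real            # filter matrix W (M x K)
```

Dimensionality reduction of any new data matrix is then a single matrix product, $Y = W^\top Z$, with no parameter sweep and no iterative training.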
It is not immediately clear that $W$ is a real matrix, since $Q = R_{bg}^{-1} R_{fg}$ may not be symmetric. However, since the matrix $Q$ is similar to the real and symmetric matrix $S \triangleq R_{bg}^{-1/2} R_{fg} R_{bg}^{-1/2}$ [38, p. 81] (this is because $R_{bg}^{1/2} Q R_{bg}^{-1/2} = S$), where $R_{bg}^{1/2}$ denotes the matrix square root of $R_{bg}$, we have that $Q$ and $S$ share a spectrum, which is real due to $S$ being a real symmetric matrix [38, p. 78]. It is important to note that the eigenvalues of $Q$ are also nonnegative, due to $Q$ being similar to $S$, and $S$ being a positive-semidefinite matrix. Indeed, a positive semidefinite matrix $B$ is one that is symmetric and satisfies $x^\top B x \geq 0$ for all $x$; in our case, $x^\top S x = \big(R_{bg}^{-1/2} x\big)^\top R_{fg} \big(R_{bg}^{-1/2} x\big) = y^\top R_{fg}\, y \geq 0$, where we defined $y \triangleq R_{bg}^{-1/2} x$ as a transformation of $x$. That is, since $S$ is positive-semidefinite, its eigenvalues are nonnegative, and since $Q$ is similar to $S$, its eigenvalues are also nonnegative. This was a requirement for our earlier derivation to show that the principal eigenvectors of $I_M + c\,Q$ are the same as the principal eigenvectors of $Q$. We also know that the eigenvectors of the matrix $S$ are real since $S$ is real and symmetric [38, p. 97]. Let us now denote one of the eigenvectors of $S$, associated with eigenvalue $\lambda$, as $v$. Then, we have that $S v = \lambda v$ and $Q w = \lambda w$ for some eigenvector $w$ of $Q$ (since $\lambda$ is an eigenvalue of $Q$, as we have shown above; we will see, however, that we can always find eigenvectors of $Q$ in $\mathbb{R}^M$, as opposed to $\mathbb{C}^M$). Multiplying both sides of $S v = \lambda v$ on the left by $R_{bg}^{-1/2}$, we obtain:

$R_{bg}^{-1} R_{fg} \left(R_{bg}^{-1/2} v\right) = \lambda \left(R_{bg}^{-1/2} v\right)$ (30)

Then, we immediately obtain that:

$Q\, w = \lambda\, w$ (31)

where

$w = R_{bg}^{-1/2}\, v$ (32)

denotes the corresponding eigenvector of $Q$, which is real since $R_{bg}^{-1/2}$ and $v$ are both real. Thus, we see that $W$ is indeed a real matrix. Note that $W$ (which contains the principal eigenvectors of $Q$ as its columns) may not satisfy $W^\top W = I_K$, since the eigenvectors of $Q$ are not guaranteed to be orthonormal.
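The whitening argument of Eqs. (30)-(32) can be checked numerically. The sketch below (our own construction, with arbitrary test matrices) forms $S = R_{bg}^{-1/2} R_{fg} R_{bg}^{-1/2}$, confirms its eigenvalues are real and nonnegative, and maps its eigenvectors back to real eigenvectors of $Q = R_{bg}^{-1} R_{fg}$:

```python
import numpy as np

rng = np.random.default_rng(1)
M, K = 6, 2
A = rng.standard_normal((M, M))
R_bg = A @ A.T + M * np.eye(M)      # positive-definite background statistic
B = rng.standard_normal((M, M // 2))
R_fg = B @ B.T                      # positive-semidefinite foreground statistic

# S = R_bg^{-1/2} R_fg R_bg^{-1/2} is real symmetric PSD, so its eigenvalues
# are real and nonnegative and its eigenvectors are real and orthonormal.
w_bg, U = np.linalg.eigh(R_bg)
R_bg_inv_sqrt = U @ np.diag(1.0 / np.sqrt(w_bg)) @ U.T
S = R_bg_inv_sqrt @ R_fg @ R_bg_inv_sqrt
lam, V = np.linalg.eigh(S)          # ascending, real eigenvalues

# Map the top-K eigenvectors of S to eigenvectors of Q via w = R_bg^{-1/2} v,
# mirroring Eqs. (30)-(32).
W = R_bg_inv_sqrt @ V[:, ::-1][:, :K]
Q = np.linalg.solve(R_bg, R_fg)
```

Because $W$ is obtained through the non-orthogonal map $R_{bg}^{-1/2}$, its columns are generally not orthonormal, consistent with the remark above.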
To see the advantage of the cPCA++ method, it is useful to compare it with the “contrastive PCA” (cPCA) algorithm from [1]. The cPCA algorithm from [1] performs an eigendecomposition of the following matrix:

$R_{fg} - \alpha R_{bg}$ (33)

where $\alpha$ is a contrast parameter that is swept to achieve different dimensionality reduction output. We observe that the cPCA algorithm is inherently sensitive to the relative scales of the covariance matrices $R_{fg}$ and $R_{bg}$. For this reason, a parameter sweep for $\alpha$ (which adjusts the scaling between the two covariance matrices) was used in [1]. In other words, multiple eigendecompositions are required in the cPCA algorithm. On the other hand, the cPCA++ algorithm does not require this parameter sweep, as the relative scale between the two covariance matrices is resolved when performing the eigendecomposition of the matrix $Q = R_{bg}^{-1} R_{fg}$. To see this property analytically, consider Example 1.
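For reference, a minimal sketch of the cPCA contrast step of Eq. (33) is shown below (the function name and the example diagonal matrices are ours); note that one eigendecomposition is needed per value of the swept parameter $\alpha$:

```python
import numpy as np

def cpca_directions(R_fg, R_bg, alphas, K=2):
    """Contrastive PCA (cPCA) [1]: for each contrast parameter alpha,
    return the top-K eigenvectors of R_fg - alpha * R_bg. A sweep over
    alpha is required because the right relative scaling of the two
    covariance matrices is not known in advance."""
    out = []
    for a in alphas:
        lam, V = np.linalg.eigh(R_fg - a * R_bg)  # symmetric, ascending order
        out.append(V[:, ::-1][:, :K])             # top-K eigenvectors
    return out
```

With toy diagonal statistics such as `R_fg = diag(3, 1)` and `R_bg = diag(5, 0.1)`, the leading direction flips between the two coordinate axes as $\alpha$ moves from 0 to 1, illustrating why the sweep matters.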
Example 1 (A needle in a haystack).
We consider the following feature example in order to better understand the traditional PCA, cPCA, and cPCA++ methods, and how they react to a small amount of information hidden within a significant amount of “noise.” Let the covariance matrices of the background and target datasets be respectively given by (where we note that $R_{bg}$ is positive-definite):

$R_{bg} = \sigma_u^2\, u u^\top + \epsilon I_M$ (34)

$R_{fg} = \sigma_u^2\, u u^\top + \sigma_v^2\, v v^\top$ (35)

where $u, v \in \mathbb{R}^M$ and it is assumed that $\|u\| = \|v\| = 1$ and $u^\top v = 0$, making the vectors $u$ and $v$ orthonormal. It is further assumed that $\sigma_v^2 \ll \sigma_u^2$, which means that the vector that is not common to both background and target (i.e., $v$) has very small power in comparison to what is common between them. It is also assumed that $\epsilon \ll \sigma_v^2$ and $\epsilon \ll \sigma_u^2$, which means that the diagonal loading should not affect the result significantly. It is assumed that $K = 1$. In this example, we would like to obtain a filter $w \in \mathbb{R}^M$ to reduce the dimension to $K = 1$ from our original $M$ features. Note that (35) describes a matrix with eigenvalues $\sigma_u^2$ and $\sigma_v^2$ and associated eigenvectors $u$ and $v$. The reason why we are interested in the above problem is that, by examining the covariance structure above, we see that the data is inherently noise-dominated. This noise (i.e., the $\sigma_u^2\, u u^\top$ term) appears in both the background dataset and the target dataset, but the target dataset may contain some class-specific information (i.e., the $\sigma_v^2\, v v^\top$ term) that is not present in the background dataset. However, this “interesting” information is dominated by the noise due to the eigenvector $u$. We wish to perform dimensionality reduction that rejects variance due to the eigenvector $u$ but preserves information due to the eigenvector $v$ of the target dataset. The obvious solution is to choose the dimensionality reduction filter to be $w = v$. We will examine how each algorithm performs.
 PCA

When traditional PCA is executed on the target dataset, the principal eigenvector (and thus $w$) will be found to be $u$, since $\sigma_v^2 \ll \sigma_u^2$. In addition, since $\epsilon \ll \sigma_u^2$, $u$ is also the principal eigenvector of the background dataset, which is not useful in extracting the interesting structure from the foreground dataset. This is because $u^\top v = 0$, which means that this filter will in fact null out the “interesting” eigenvector $v$.
 cPCA

When cPCA is executed on the background and target datasets, we have that the contrast matrix in (33) will be given by:

$R_{fg} - \alpha R_{bg} = (1 - \alpha)\,\sigma_u^2\, u u^\top + \sigma_v^2\, v v^\top - \alpha \epsilon I_M$ (36)

Now, (36) allows us to see exactly the reasoning behind sweeping $\alpha$. When $\alpha = 0$, we obtain the PCA method operating on the target dataset (which would pick the filter defined by $u$). When $\alpha \to \infty$, we obtain the PCA method operating on the background dataset (which still would pick the filter defined by $u$). Instead, the desired value of $\alpha$ is one that nulls out the eigenvector $u$, e.g., $\alpha = 1$. Observe that when this choice of $\alpha$ is made (or chosen through a sweep), we have that the principal eigenvector (and thus $w$) of (36) will be $v$. This is precisely what we want the filter to be in order to differentiate between the two datasets (or extract the useful information from the target dataset relative to the background). Although the optimal value of $\alpha$ is evident in this simple analytic example (due to the fact that we know the underlying structure indicated in (34) and (35)), this is not the case in most experiments, and thus it is necessary to sweep over the unbounded range $\alpha \in [0, \infty)$.
 cPCA++

For the analysis of the cPCA++ method, let us first consider the inverse of $R_{bg}$, via the matrix inversion lemma:

$R_{bg}^{-1} = \frac{1}{\epsilon}\left(I_M - \frac{\sigma_u^2}{\sigma_u^2 + \epsilon}\, u u^\top\right) \overset{(a)}{\approx} \frac{1}{\epsilon}\left(I_M - u u^\top\right)$ (37)

where step $(a)$ is due to the assumption that $\epsilon \ll \sigma_u^2$, and $I_M - u u^\top$ denotes the projection onto the orthogonal complement of $u$ [38, p. 22] (since $(I_M - u u^\top)\, u = 0$). Now, multiplying by $R_{fg}$, we obtain:

$Q = R_{bg}^{-1} R_{fg} \approx \frac{1}{\epsilon}\left(I_M - u u^\top\right)\left(\sigma_u^2\, u u^\top + \sigma_v^2\, v v^\top\right) \overset{(a)}{=} \frac{\sigma_v^2}{\epsilon}\, v v^\top$ (38)

where step $(a)$ is due to the fact that $u$ and $v$ are orthonormal. This means that the matrix $Q$ in the case of the cPCA++ method is approximately a rank-$1$ matrix with principal eigenvector $v$. Clearly then, the filter chosen by cPCA++ will be $w = v$, which is the desired result. Observe that this result was obtained without the need for a hyperparameter sweep (i.e., like the sweep over the $\alpha$ parameter in cPCA).
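The conclusion of this example can be verified numerically. The sketch below is one concrete instantiation consistent with the setup above (the dimension, the specific powers, and the construction of the orthonormal pair are our own choices): PCA on the foreground picks the strong common direction $u$, while the principal eigenvector of $R_{bg}^{-1} R_{fg}$ recovers the weak foreground-only direction $v$.

```python
import numpy as np

rng = np.random.default_rng(0)
M = 20
# Orthonormal pair: u is the strong common "noise" direction, v the weak
# foreground-only direction.
Qmat, _ = np.linalg.qr(rng.standard_normal((M, 2)))
u, v = Qmat[:, 0], Qmat[:, 1]
sig_u2, sig_v2, eps = 100.0, 1.0, 1e-6     # sig_v2 << sig_u2 and eps << sig_v2

R_bg = sig_u2 * np.outer(u, u) + eps * np.eye(M)            # as in Eq. (34)
R_fg = sig_u2 * np.outer(u, u) + sig_v2 * np.outer(v, v)    # as in Eq. (35)

# Traditional PCA on the foreground picks the noise direction u ...
pca_dir = np.linalg.eigh(R_fg)[1][:, -1]

# ... while cPCA++ (principal eigenvector of R_bg^{-1} R_fg) recovers v.
lam, V = np.linalg.eig(np.linalg.solve(R_bg, R_fg))
cpca_dir = V[:, np.argmax(lam.real)].real
cpca_dir /= np.linalg.norm(cpca_dir)
```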
Note that another recent publication [39], also based on the general principle of [1], was brought to our attention. However, careful review of this article shows that the approach in [39] differs from ours. In [39], the algorithm was obtained in a similar manner as the cPCA algorithm (i.e., it was formulated as a slightly modified maximization problem). On the other hand, our proposed cPCA++ approach is fundamentally different because it directly addresses the matrix factorization and dimensionality reduction nature of our problem, and therefore gives rise to an intuitive per-instance classification algorithm (see Alg. 2 further ahead) that can be applied to image splicing localization, and also addresses image denoising problems (see App. B).
In the next subsection, we will simulate both the cPCA method and the cPCA++ method for a collection of experiments from [1] (we will also simulate the popular t-SNE dimensionality reduction method [40]). We will observe that the cPCA++ algorithm will be able to discover features from within the target dataset without the need to perform a parameter sweep, which will improve the running time of the method significantly (see Sec. III-C).
III-B Performance Comparison of Feature Extraction Methods
In this subsection, we will examine many of the same experiments that the cPCA algorithm^{5} (^{5}https://github.com/abidlabs/contrastive) was applied to in [1]. Table I lists the parameters of the various datasets that will be examined in this section. In all cases, the desired number of reduced dimensions is the same for all methods (i.e., after feature reduction). We also incorporated the t-Distributed Stochastic Neighbor Embedding (t-SNE) algorithm [40] in the results^{6} (^{6}https://lvdmaaten.github.io/tsne/#implementations), even though t-SNE does not provide an explicit function for the discovered feature reduction. In contrast, after learning the filters through Alg. 1, it is possible to efficiently apply them to new data matrices. In the case of t-SNE, however, the dimensionality reduction is done in relation to the other samples in the dataset, so when a new data matrix is to be reduced in dimension, the t-SNE algorithm must be re-executed. Nevertheless, we include it in this exposition since it is a common algorithm for modern feature reduction. The main takeaway from these experiments is that the cPCA++ algorithm can obtain an explicit filter matrix and does not require a parameter sweep, which makes the algorithm much more efficient. At the same time, it is still able to discover the interesting structures in the target dataset.
Example  

Synthetic  
MNIST over Grass  
Mice Protein Expression  
MHealth Measurements  
Single Cell RNA-Seq of Leukemia Patient
III-B1 Synthetic Data
We first consider a synthetic data example to help illustrate the differences between the traditional PCA, cPCA, and cPCA++ algorithms. Consider the following original feature structure (i.e., prior to feature reduction) for an arbitrary sample from the foreground or target data matrix:
(39) 
in which each sample is composed of three independent subvectors, and the class index determines the distribution of the sample. It is assumed that the first subvector has the largest variance but an identical distribution across all classes. On the other hand, the other two subvectors are more useful for differentiating across the different classes:
(40) 
while the third subvector is distributed according to:
(41) 
Knowing the above structure, we can deduce that to differentiate between the different classes, we need only use the second and third subvectors, while completely ignoring the first subvector, as it does not help differentiate between the classes. On the other hand, traditional PCA executed on the target data matrix would readily pick features from the first subvector, since they have the largest variance, even though they do not help in distinguishing between the different classes.
Instead, it is assumed that there is a background dataset that, while not containing the differentiating features, follows the distribution:
(42) 
where the subvectors are independent. We observe that the background dataset informs us only as to the relative variance across the dimensions, but not the explicit mean values (i.e., feature structure) within the dimensions. It can be shown that the filters obtained by the cPCA++ algorithm are given by (please see App. A):
(43) 
where the filters are standard basis vectors aligned with the discriminative dimensions. Note that these are the ideal filters to extract the signal structure from the target dataset described by (39). We ran the cPCA++ method on this synthetic dataset, and compared the results with the other feature reduction methods in Fig. 3. Note that the t-SNE method operated on the target or foreground dataset. For all methods examined, we extract the dimensionality-reduced dataset in two dimensions for visualization. All classes in the figure are equally likely. In the top row of plots in Fig. 3, we ran the cPCA algorithm for three positive choices of the contrast parameter. We see that one contrast factor yields a correct separation among the classes, while the other values of the contrast factor fail to do so. In the bottom-left plot, we have the performance of the traditional PCA algorithm on the target dataset; it is in fact a special case of the cPCA method outlined in [1] for a contrast parameter of zero. We observe that traditional PCA fails to cluster the data. This is expected from our investigation above, since the traditional PCA method is designed to pick the directions that best explain the most variance of the dataset, which are the directions of the uninformative first subvector. In the bottom-center plot, we executed the t-SNE algorithm [40] on the foreground dataset (no significant improvement was obtained by running t-SNE longer). We see that t-SNE also fails to select the correct dimensions with which to visualize the data. Finally, in the bottom-right plot, we show the data plotted against the feature directions obtained by the cPCA++ algorithm. We observe that our feature extractor was able to perform the correct feature selection in one step, without the need for a parameter sweep. We note that while this is a synthetic example that can be solved analytically, it helps in illustrating how the method actually works.
In the next few simulations, we compare and contrast the performance of the cPCA++, cPCA, traditional PCA, and t-SNE algorithms on more complex datasets.
III-B2 MNIST over High-Variance Backdrop
In this experiment (also conducted in [1]), the MNIST dataset [41] is used and the images corresponding to two of the digits are extracted. Using traditional PCA processing, it is possible to separate these two particular digits.
However, the authors of [1] (as we do here) superimposed these digits atop high-variance grass backgrounds in order to mask the fine details that make each digit recognizable and distinct. Figure 4 shows six examples of these “target” images. We see that the digits themselves are difficult to see atop the high-variance background images. The task now is to apply the feature reduction techniques to still separate these digits.
In order to obtain a “background” dataset for the cPCA and cPCA++ methods, grass images (without the MNIST digits) randomly sampled from the ImageNet dataset [42] were used to estimate the background covariance matrix. Note that the same grass images used to generate the target images need not be utilized; grass images in general will do.
In Fig. 5, we evaluate the performance of the dimensionality reduction techniques on the target images illustrated in Fig. 4. It can be seen that both traditional PCA and t-SNE have difficulty clustering the two digits, due to the high-variance background masking the fine structure of the digits. On the other hand, the cPCA and cPCA++ methods are able to achieve some separation between the two digits, although in both cases there is some overlap. Again, note that cPCA++ does not require a hyperparameter sweep.
III-B3 Mice Protein Expression
This next example utilizes the public dataset from [43], which was also used in [1]. In this dataset, we have access to protein expression measurements from mice that have received shock therapy and mice that have not. We would like to analyze the foreground dataset (mice with shock therapy applied) to attempt to separate mice that have developed Down Syndrome from those that have not. We also have access to a background dataset, which consists of protein expression measurements from mice that did not experience shock therapy. In Fig. 6 we illustrate the results of the different feature reduction techniques. It can be seen that traditional PCA is unable to effectively cluster the data. On the other hand, the t-SNE, cPCA, and cPCA++ methods are all able to achieve some separation of the data. Although t-SNE was able to achieve good separation in this particular example, we reiterate that t-SNE is not an appropriate method for extracting filters or transformations from the high-dimensional space to the lower-dimensional space, as the algorithm must be re-executed on new samples in order to map previously unmapped data. There is one other significant observation about this example. As noted above, there is a very small number of samples in the background dataset. This caused the background covariance matrix to be rank-deficient. For this reason, diagonal loading was employed to ensure that the background covariance matrix is invertible.
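Diagonal loading, as used here, amounts to adding a small multiple of the identity matrix to the rank-deficient covariance matrix before inversion. A minimal sketch follows; the function name, the scaling by the average eigenvalue (trace divided by dimension), and the epsilon value are our assumptions, since the paper does not specify the loading factor.

```python
import numpy as np

def diag_load(R, eps=1e-6):
    """Add a small scaled identity to a (possibly rank-deficient)
    covariance matrix R so that it becomes invertible.  The loading is
    scaled by the average eigenvalue trace(R)/d, one common convention,
    so that eps is dimensionless."""
    d = R.shape[0]
    return R + eps * (np.trace(R) / d) * np.eye(d)
```

Because the loading is tiny relative to the matrix's own scale, it leaves the dominant eigen-structure essentially unchanged while guaranteeing invertibility.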
III-B4 MHealth Measurements
In the next experiment (also conducted in [1]), we use the IMU data from [44]. In this dataset, a variety of sensors are used to monitor subjects that are performing squats, jogging, cycling, or lying still. The sensors include gyroscopes, accelerometers, and electrocardiograms (EKG). In the top-left subplot of Fig. 7, we see that performing traditional PCA on a dataset of subjects performing squats or cycling does not yield a reasonable clustering of those two activities. For cPCA, the optimal contrast parameter found through the sweep was used, and the resulting output is illustrated in the top-right subplot of Fig. 7. For both the cPCA and cPCA++ methods, the background covariance matrix was obtained from subjects lying still. We observe that the cPCA algorithm was able to cluster the two activities effectively. The t-SNE output is shown in the bottom-left subplot of the figure. We see that t-SNE was unable to distinguish the two activities. Finally, the output of the cPCA++ method is shown in the bottom-right subplot of the figure. We see that cPCA++ is able to effectively cluster the two activities, without the need for a parameter sweep.
III-B5 Single-Cell RNA-Seq of Leukemia Patient
Our final dataset in this section was obtained from [45], and this analysis was also conducted in [1]. The target or foreground dataset contains single-cell RNA expression levels of a mixture of bone marrow mononuclear cells (BMMCs) from a leukemia patient before and after stem cell transplant, and the goal is to separate the pre-transplant samples from the post-transplant samples. The original features were reduced to only 500 genes based on the value of the variance divided by the mean. Once these features are obtained, we execute the cPCA algorithm for different values of the contrast parameter and plot the results in the top row of Fig. 8. In this example, the background covariance matrix was obtained from RNA-Seq measurements of a healthy individual’s BMMCs, for both the cPCA and cPCA++ methods. As noted in [1], this should allow the dimensionality reduction methods to focus on the novel features of the target dataset, as opposed to being overwhelmed by the variance due to the heterogeneous population of cells as well as variations in experimental conditions. We observe that the data is close to being separable for one of the contrast values. Next, we plot the traditional PCA output in the bottom-left subplot of Fig. 8, and observe that it is incapable of clustering the data. In the bottom-center plot, we illustrate the output of the t-SNE algorithm, and see that while it does achieve some separation of the data, it is not on par with the cPCA algorithm. Finally, the cPCA++ output is illustrated in the bottom-right subplot of Fig. 8, and we see that it obtains a clustering performance similar to that of the optimal cPCA result. Note that cPCA++ does not require a parameter sweep. The background covariance matrix was rank-deficient in this example, so we utilized diagonal loading to make it invertible.
III-C Computational Time Performance Comparison
In this section, we examine the computational time requirements for the various datasets studied in Sec. III-B. We neglect measuring the computation time required by the traditional PCA algorithm, since it is very similar to that of the cPCA++ algorithm and since traditional PCA is not an effective algorithm for discriminative dimensionality reduction, as we saw in the results of Sec. III-B. As shown in Table II, the proposed cPCA++ method is significantly faster than both the cPCA method and the t-SNE method. The reason the cPCA++ method is, on average (i.e., averaged over the experiments outlined in Sec. III-B), many times faster than the cPCA method is that while both methods require the eigen-decomposition of similarly sized covariance matrices, the cPCA algorithm requires a sweep over the contrast hyperparameter (i.e., multiple eigen-decompositions). On the other hand, the cPCA++ method does not require this hyperparameter tuning, since the scale discrepancy is automatically resolved. It is also notable that cPCA++ is, on average, far faster than t-SNE. The t-SNE method has a very high running time due to the fact that it works by finding relationships between all the points in a dataset, rather than by determining a compact set of parameters. In addition, it is not well suited for dimensionality reduction when future data samples will be obtained (since it has to be re-executed on the entire new dataset).
Example  cPCA  t-SNE  cPCA++ 

Synthetic  
MNIST over Grass  
Mice Protein Expression  
MHealth Measurements  
RNA-Seq of Leukemia Patient  
Average cPCA++ Speedup  x  x  x 
IV The cPCA++ Framework for Image Splicing Localization
In this section, we describe the general framework of the cPCA++ approach for image splicing localization. In this approach, we focus on detecting the spliced boundary or edge.
We take the approach of determining the spliced boundary, rather than the spliced region or surface, because there is an ambiguity in determining which image is the donor and which is the host. For example, consider Figure 9, which shows a probe image and the corresponding spliced-boundary ground truth mask. Note that we use the color white to denote a pixel belonging to the spliced boundary, and the color black to denote a pixel not belonging to the spliced boundary. It is possible to consider region A as the spliced region and region B as the authentic region, or vice versa. Thus, there is an ambiguity in how one labels the spliced and authentic regions in a given manipulated image. However, this ambiguity is resolved if we instead consider the spliced boundary (rather than the spliced surface/region).
The “training” phase consists of two main steps: 1) collect labeled or “training” samples/patches, 2) build a feature extractor based on the training samples. As we will see later, the cPCA++ approach does not require the training of a classifier, such as a support vector machine or random forest. However, we will still refer to patches used in this phase as “training” samples, since we are using their labeled information (i.e., spliced vs. authentic edges). In the testing phase, we are given a new image (i.e., not seen during the training phase), and the goal is to output a probability map that indicates the probability that each pixel belongs to the spliced boundary. The testing phase consists of the following main steps: 1) divide a given test image into overlapping patches, 2) extract the feature vector for each test patch, 3) obtain the probability output for each test patch, and 4) reconstruct the final output map for the given test image. The training and testing phases are elaborated below.
In the training phase, the first step is to collect labeled or “training” samples/patches from a given set of images. In order to do this, we need access to two masks for each training image: 1) the spliced surface ground truth mask and 2) a mask which highlights the superset of all edges/boundaries in the image (i.e., spliced boundaries as well as natural/authentic boundaries), referred to as the edge detection output mask. Note that although we utilize the surface ground truth mask in selecting the training patches, the output of the cPCA++ method will be edge-based (i.e., the output will represent an estimate of the true spliced edge/boundary). This will be explained in more detail in the next paragraph. We utilized the CASIA v2.0 dataset for training purposes. Since the CASIA v2.0 dataset does not provide surface ground truth masks, we generated them using the provided reference information. In particular, for a given spliced image in the CASIA dataset, the corresponding donor and host images are provided, and we used this information to generate the ground truth mask. The second mask that is needed for collecting the training patches is the edge detection output mask. We utilized structured edge detection for detecting spliced and natural edges [46].
Once we have these two masks, we can proceed to collect the training patches. Each training image is divided into overlapping patches, and we then select target/foreground and background samples in the following manner. We would like target/foreground samples to be those that contain a boundary between a spliced region and an authentic region (referred to as a spliced boundary). In order to achieve this, we utilize the surface ground truth masks, and we select foreground samples to be those that have a splicing fraction that lies in a certain range (e.g., 30-70 percent of the patch is a spliced area/surface). On the other hand, background samples do not contain a boundary between a spliced and authentic region. We impose an additional constraint on the background samples so that they contain a minimum amount of authentic edges (as specified by the structured edge detector). Figure 10 shows an example of a foreground patch and a background patch. Note that although we are using the surface ground truth masks in selecting foreground and background samples, what is actually important is the spliced boundary/edge (or lack thereof).
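The patch-sampling rules above can be sketched as follows. The 30-70 percent splicing-fraction range for foreground patches comes from the text; the function name, the minimum-edge threshold value, and the restriction of background candidates to fully authentic patches are our assumptions.

```python
import numpy as np

def select_patch(gt_patch, edge_patch, frac_lo=0.3, frac_hi=0.7, min_edge=0.05):
    """Classify a candidate training patch following the sampling rules
    described in the text.  gt_patch is the binary spliced-surface
    ground-truth patch; edge_patch is the binary structured-edge-detector
    output for the same patch."""
    frac = float(np.mean(gt_patch))
    if frac_lo <= frac <= frac_hi:
        return 'foreground'       # patch contains a spliced boundary
    if frac == 0.0 and float(np.mean(edge_patch)) >= min_edge:
        return 'background'       # authentic patch with enough authentic edges
    return None                   # discard: ambiguous or edge-free patch
```

Patches that fall outside both rules are simply discarded, so the two training sets stay well separated.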
The next step is to build a feature extractor based on the covariance matrices of the foreground and background training samples. Suppose we have collected a set of background patches and a set of foreground patches, and let the size of each patch be n × n × c, where c is the number of channels. Since we are working with RGB images, c will be equal to 3. Each training patch can be flattened into a vector of dimension n²c. We can represent the flattened background and foreground patches as two data matrices (background and foreground, respectively), such that different columns of each matrix correspond to different training samples.
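The flattening step can be sketched as follows (the function name is ours); each RGB patch becomes one column of the resulting data matrix, matching the column-per-sample convention in the text.

```python
import numpy as np

def patches_to_matrix(patches):
    """Flatten a list of patches (each an n x n x c array) into a data
    matrix whose columns are the flattened patches, so that different
    columns correspond to different training samples."""
    return np.stack([p.reshape(-1) for p in patches], axis=1)
```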
To see why it is necessary to use the cPCA++ method as opposed to traditional PCA for the image splicing localization task, consider the top 50 eigenvectors of the background covariance matrix (i.e., authentic edges) and the top 50 eigenvectors of the foreground covariance matrix (i.e., spliced edges). We compute the power of the foreground eigenvectors after projecting them onto the background subspace, in order to quantify the similarity of the two subspaces. This computation indicates that the subspaces spanned by the top principal components of each dataset are mostly overlapping, hinting that any classifier operating on traditional PCA features will be overwhelmed by the similarity of the features of these two datasets. For this reason, we will utilize the cPCA++ method instead to alleviate this situation and perform discriminative feature extraction.
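The subspace-similarity measure described above can be sketched as follows, where the projection onto the background subspace is built from the orthonormal background eigenvectors. The function name is ours, and the specific power value reported in the paper is not reproduced here.

```python
import numpy as np

def subspace_overlap(U_fg, U_bg):
    """Fraction of the energy of the foreground eigenvectors captured by
    the background subspace: ||P_bg U_fg||_F^2 / ||U_fg||_F^2, where
    P_bg = U_bg U_bg^T projects onto span(U_bg).  The columns of U_fg
    and U_bg are assumed orthonormal.  Returns a value in [0, 1]; values
    near 1 mean the two principal subspaces are mostly overlapping."""
    P_bg = U_bg @ U_bg.T
    num = np.linalg.norm(P_bg @ U_fg, 'fro') ** 2
    den = np.linalg.norm(U_fg, 'fro') ** 2
    return num / den
```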
We thus utilize Alg. 1, described in Sec. III, to build the feature extractor. We first center the data matrices by subtracting their respective means. We then calculate the second-order statistics (i.e., the background and foreground covariance matrices). If the background covariance matrix is rank-deficient, we utilize diagonal loading in order to make it invertible. We then compute the top right-eigenvectors of the product of the inverse background covariance matrix and the foreground covariance matrix. The resulting matrix of eigenvectors is used to extract features during the testing phase, and is referred to as the transform matrix.
In the testing phase, we divide each test image into overlapping patches. Similar to what was done during the training phase, we flatten the test patches and represent them as a data matrix. We then utilize Algorithm 2 to obtain a probability output for each patch. The motivation behind Algorithm 2 is explained as follows. The cPCA++ method will naturally find filters that yield a small norm for the reduced-dimensionality features of background samples and a larger norm for foreground samples. This means that once the filters are obtained, they are expected to yield small-valued output when applied to background patches and larger-valued output when applied to foreground patches. This presents us with a very simple and efficient technique for obtaining the output: simply measure the output power after dimensionality reduction and then convert the raw value to a probability-based one. The full output map for a given test image can then be reconstructed by averaging the contributions from overlapping regions. After reconstruction, we perform an element-wise multiplication of the output map and the structural edge detection output mask. The reason for doing this is explained as follows. In this task, we are attempting to classify the spliced edge. Therefore, if the structural edge detector labeled a given pixel as non-edge (i.e., neither an authentic edge nor a spliced edge), then the corresponding pixel value in the cPCA++ output map should automatically be set to zero.
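The per-patch scoring step can be sketched as follows. The use of the projected output power follows the text; the function name and the min-max rescaling to [0, 1] are our simplification of the probability conversion in Alg. 2.

```python
import numpy as np

def patch_scores(Z_test, F, mu=None):
    """Sketch of the per-patch scoring step.

    Z_test: (d, n) matrix of flattened test patches (columns = patches).
    F:      (d, k) cPCA++ transform matrix from the training phase.
    mu:     optional (d,) training mean to subtract before projection.

    The cPCA++ filters yield small-norm features for background
    (authentic-edge) patches and larger norms for foreground
    (spliced-edge) patches, so the power of the projected patch serves
    as a raw score, rescaled here to [0, 1]."""
    if mu is not None:
        Z_test = Z_test - mu[:, None]
    feats = F.T @ Z_test                    # (k, n) reduced features
    power = np.sum(feats ** 2, axis=0)      # output power per patch
    lo, hi = power.min(), power.max()
    return (power - lo) / (hi - lo + 1e-12)
```

The per-patch scores would then be scattered back into an image-sized map, averaged over overlapping patches, and masked by the structural edge detection output as described above.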
V Experimental Analysis
V-A Evaluation/Scoring Procedure
In this subsection, we explain the procedure used to score the output maps from the cPCA++ method, as well as from the Multi-task Fully Convolutional Network (MFCN)-based method [2] against which we compare. The MFCN-based method is a state-of-the-art approach that also outputs an edge-based probability map, and thus can be directly compared with the proposed cPCA++ method. The MFCN achieved the highest score in the splicing localization task in the 2017 Nimble Challenge, and the second highest score in the 2018 Media Forensics Challenge (both part of the DARPA MediFor program^{7}, ^{7}https://www.darpa.mil/program/mediaforensics). We evaluated the performance of the methods using the F1 and Matthews Correlation Coefficient (MCC) metrics, which are popular per-pixel localization metrics. As noted previously, the edge-based approach that we propose avoids the ambiguity in the labeling of the spliced and authentic surfaces/regions. Because of the ambiguity in surface-based labeling, most surface-based scoring procedures score the original output map as well as the inverted version of the output map, and select the one that yields the better score. However, in edge-based scoring procedures, there is no need to invert the edge-based output map, and the output can be scored directly. For a given spliced image, the F1 metric is defined as

F1 = 2·TP / (2·TP + FN + FP),
where TP represents the number of pixels classified as true positive, FN represents the number of pixels classified as false negative, and FP represents the number of pixels classified as false positive. In this case, the target/foreground patches can be viewed as the positive samples and the background patches can be viewed as the negative samples. The F1 metric ranges in value from 0 to 1, with a value of 1 being the best. For a given spliced image, the MCC metric is defined as

MCC = (TP·TN − FP·FN) / √((TP+FP)(TP+FN)(TN+FP)(TN+FN)),

where TN represents the number of pixels classified as true negative.
The MCC metric ranges in value from −1 to 1, with a value of 1 being the best.
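For concreteness, the two per-pixel metrics can be computed directly from the pixel counts; these are the standard F1 and MCC definitions, and the function names are ours.

```python
import math

def f1_score(tp, fp, fn):
    """F1 = 2*TP / (2*TP + FP + FN); ranges in [0, 1], 1 is best."""
    return 2.0 * tp / (2.0 * tp + fp + fn)

def mcc_score(tp, tn, fp, fn):
    """Matthews Correlation Coefficient; ranges in [-1, 1], 1 is best.
    Returns 0 when any marginal count is zero (undefined denominator)."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0
```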
V-B Experimental Results
We first compared the training times of the proposed cPCA++ method and the MFCN-based method, and found the proposed method to be significantly more efficient. The MFCN-based method requires a training time of approximately 11.5 hours on an NVIDIA GeForce GTX Titan X GPU, while the proposed cPCA++ method only requires a training time of approximately 1.5 hours on a CPU (Intel Xeon Gold 6126). Please note that the cPCA++ code has not yet been optimized to make full use of the multi-core processor. Next, we evaluated the proposed cPCA++ method on the Columbia and Nimble WEB datasets and compared its performance with the MFCN-based method. Tables III and IV show the Matthews Correlation Coefficient (MCC) and F1 scores, respectively. It can be seen that the cPCA++ method yields higher scores (in terms of both MCC and F1) as compared to the MFCN-based method.
Figures 11 and 12 show multiple examples of localization output from the Columbia and Nimble WEB datasets, respectively. Each row shows (from left to right) a manipulated or probe image with the spliced edge highlighted in pink, the structural edge detection output mask highlighting both spliced and authentic edges, the cPCA++ raw probability output map, and the MFCN-based raw probability output map. In these figures, it can be seen that the cPCA++ method yields a finer localization output than the MFCN-based method.
Dataset  cPCA++  MFCN 

Columbia  
Nimble WEB 
Dataset  cPCA++  MFCN 

Columbia  
Nimble WEB 
VI Conclusion
In conclusion, we proposed cPCA++, a new technique for discovering discriminative features in high-dimensional data. The proposed approach is able to discover structures that are unique to a target dataset, while at the same time suppressing uninteresting high-variance structures present in both the target dataset and a background dataset. The proposed cPCA++ approach was compared with a recently proposed algorithm, called contrastive PCA (cPCA), and we showed that cPCA++ achieves similar discriminative performance in a wide variety of settings, even though it eliminates the need for the hyperparameter sweep required by cPCA. Following this discussion, the cPCA++ approach was applied to the problem of image splicing localization, in which the two classes of interest (i.e., spliced and authentic edges) are extremely similar in nature. In the context of this problem, the target dataset contains the spliced edges and the background dataset contains the authentic edges. We showed that the cPCA++ approach is able to effectively discriminate between the spliced and authentic edges. The resulting method was evaluated on the Columbia and Nimble WEB splicing datasets. The proposed method achieves scores comparable to the state-of-the-art Multi-task Fully Convolutional Network (MFCN), and it does so very efficiently, without the need to iteratively update filter weights via stochastic gradient descent and backpropagation, and without the need to train a classifier.
Appendix A Derivation of Filters for Synthetic Example
The sample covariance of the background dataset specified in (42) will be:
(44)  
(45) 
where ⊗ denotes the Kronecker product. On the other hand, the sample covariance of the target/foreground dataset is given by (when the classes are equally likely):
(46) 
Then, the matrix analyzed by the cPCA++ algorithm becomes:
(47) 
This matrix turns out to be symmetric (and thus diagonalizable) and has the following block-diagonal structure:
(48) 
where
(49)  
(50)  
(51) 
The eigenvalues of the block-diagonal matrix are given by the eigenvalues of its diagonal blocks.^{8} ^{8}This is because an eigenvalue of a matrix by definition satisfies the characteristic equation (i.e., the determinant of the matrix minus the eigenvalue times the identity is zero). However, since the matrix is block-diagonal, its characteristic polynomial factors into the product of the characteristic polynomials of the diagonal blocks, and so its eigenvalues are exactly the union of the eigenvalues of those blocks.