Real or Fake? Spoofing State-Of-The-Art Face Synthesis Detection Systems
The availability of large-scale facial databases, together with the remarkable progress of deep learning technologies, in particular Generative Adversarial Networks (GANs), has led to the generation of extremely realistic fake facial content, which raises obvious concerns about the potential for misuse. These concerns have fostered research on manipulation detection methods that, contrary to humans, have already achieved astonishing results in some scenarios. In this study, we focus on entire face synthesis, which is one specific type of facial manipulation. The main contributions of this study are: i) a novel strategy to remove GAN “fingerprints” from synthetic fake images in order to spoof facial manipulation detection systems while preserving the visual quality of the resulting images, ii) an in-depth analysis of state-of-the-art detection approaches for entire face synthesis, iii) a complete experimental assessment of this type of facial manipulation considering state-of-the-art detection systems, highlighting how challenging this task is in unconstrained scenarios, and finally iv) a novel public database named FSRemovalDB, produced by applying our proposed GAN-fingerprint removal approach to original synthetic fake images.
The observed results lead us to conclude that more effort is required to develop facial manipulation detection systems that are robust against unseen conditions and spoofing techniques such as the one proposed in this study.
Fake images and videos, understood here as digital manipulations rather than physical ones (e.g., face masks [2015_ISPM_PAs, 2019_TIFS_Fingerprint_PAs_Tolosana]), have recently become a great public concern [BBCNews_deepfake]. Until recently, the number and realism of fake contents were limited by the lack of sophisticated editing tools, the high domain expertise required, and the complex and time-consuming process involved. However, it is becoming increasingly easy to automatically synthesise non-existent faces or even manipulate the face of a real person in an image/video, thanks to the free access to large public databases and to the improvement of deep learning techniques that eliminate manual editing steps. As a result, accessible open software and mobile applications such as ZAO and FaceApp have led to large amounts of synthetically generated fake videos [zao, faceapp].
Currently, facial manipulation methods can be categorised into four groups according to the level of manipulation [Jain2019facialManipulation, rossler2019faceforensics++]: i) entire face synthesis, ii) face swapping/identity swap, iii) facial attributes, and iv) facial expression.
In this study, we focus in particular on the entire face synthesis manipulation. This manipulation creates non-existent faces, usually through powerful Generative Adversarial Networks (GANs) [goodfellow2014generative], e.g., through the recent StyleGAN approach proposed in [Karras_2019_CVPR]. This type of facial manipulation provides astonishing results, generating high-quality facial images with a high level of realism. Nevertheless, contrary to humans, most state-of-the-art detection systems provide very good results against this type of facial manipulation, which shows how easy it is for them to detect the GAN “fingerprints” that distinguish fake images from real ones.
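Table I: Comparison of the most relevant detection approaches for the entire face synthesis manipulation.
|Study||Features||Classifiers||Best Performance||Databases|
|[mccloskey2018detecting]||Color-related Features||SVM||AUC = 70%||NIST MFC2018|
|[yu2018attributing]||GAN-related Features||CNN||Acc. = 99.50%||Real: CelebA; Fake: ProGAN, SNGAN, CramerGAN, MMDGAN|
|[wang2019fakespotter]||Neuron Behavior Features||SVM||Acc. = 84.78%||Real: CelebA-HQ, FFHQ; Fake: InterFaceGAN, StyleGAN|
|[Jain2019facialManipulation]||Image-related Features||CNN + Attention Mechanism||EER = 0.05%||Real: CelebA, FFHQ, FaceForensics++; Fake: ProGAN, StyleGAN|
|[wang2019detecting]||Image-related Features||DRN||AP = 99.8%||Fake: Adobe Photoshop|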
The main contributions of this study are:
A novel approach based on GAN-fingerprint removal in order to spoof state-of-the-art facial manipulation detection systems, while keeping the visual quality of the resulting images. Fig. 1 graphically summarises our proposed approach based on the use of autoencoders.
An in-depth analysis of state-of-the-art detection approaches for the entire face synthesis manipulation, explaining the key aspects of the detection systems, databases, and results achieved in each of them.
A thorough experimental assessment of this type of facial manipulation considering state-of-the-art detection systems and different experimental conditions, i.e., controlled and in the wild scenarios.
A novel database named Face Synthesis Removal (FSRemovalDB), produced by applying our proposed GAN-fingerprint removal approach to original synthetic images (https://github.com/BiDAlab/FSRemovalDB).
The remainder of the paper is organised as follows. Sec. II summarises previous studies focused on the detection of the entire face synthesis manipulation. Sec. III explains all details of our proposed GAN-fingerprint removal approach. Sec. IV summarises the key features of the real and fake databases considered in the experimental framework. Sec. V and VI describe the proposed experimental setup and results achieved, respectively. Finally, Sec. VII draws the final conclusions and points out some lines for future work.
II Related Works
Several studies have recently evaluated how easy it is to detect manipulations based on entire face synthesis. Table I compares the most relevant approaches in this area. For each study, we include information related to the features, classifiers, best performance, and databases considered.
In [mccloskey2018detecting], the authors analysed the architecture of GANs in order to detect artifacts that differ between fake and real images. They proposed a detection system based on colour features and a linear Support Vector Machine (SVM) for the final classification. Their approach achieved a best performance of 70% Area Under the Curve (AUC) on the NIST MFC2018 dataset [NIST_challenge]. Later on, Yu et al. analysed in [yu2018attributing] the existence and uniqueness of GAN fingerprints in order to detect fake images. In particular, they proposed a learning-based formulation based on an attribution network architecture that maps an input image to its corresponding fingerprint image. They learned a model fingerprint for each source (each GAN instance plus the real world), such that the correlation index between one image fingerprint and each model fingerprint serves as the softmax logit for classification. Their approach was tested using real faces from the CelebA database [celeba] and synthetic faces created through different GAN approaches (ProGAN [pgan], SNGAN [sngan], CramerGAN [cramergan], and MMDGAN [mmdgan]), achieving a best accuracy of 99.50%. In [wang2019fakespotter], Wang et al. conjectured that monitoring neuron behaviour could also help to detect fake faces, since layer-by-layer neuron activation patterns may capture more subtle features that are important for the facial manipulation detection system. Their approach, named FakeSpotter, extracted as features the neuron coverage behaviours of real and fake faces from deep face recognition systems (i.e., VGG-Face [vggface], OpenFace [openface], and FaceNet [facenet]), and then trained an SVM for the final classification. The authors tested their approach using real faces from the CelebA-HQ [celebahq] and FFHQ [Karras_2019_CVPR] databases and synthetic faces created through InterFaceGAN [shen2019interpreting] and StyleGAN [Karras_2019_CVPR], achieving a best accuracy of 84.78% with the FaceNet model.
More recently, Stehouwer et al. carried out in [Jain2019facialManipulation] a complete analysis of different facial manipulation methods. They proposed to use attention mechanisms to process and improve the feature maps of Convolutional Neural Network (CNN) models. For the facial manipulation method considered in our study (i.e., entire face synthesis), the authors achieved a final 0.05% Equal Error Rate (EER) considering real faces from the CelebA [celeba], FFHQ [Karras_2019_CVPR], and FaceForensics++ [rossler2019faceforensics++] databases and fake images created through the ProGAN [pgan] and StyleGAN [Karras_2019_CVPR] approaches. Finally, Wang et al. carried out in [wang2019detecting] an interesting study using publicly available commercial software from Adobe Photoshop to synthesise new faces [adobetool], as well as a professional artist who manipulated 50 real photographs. The authors first ran a human study through Amazon Mechanical Turk (AMT), showing real and fake images to the participants and asking them to classify each image into one of the two classes. The results highlight how challenging the task is for humans, with a final 53.5% performance (chance = 50%). After the human study, the authors proposed an automatic detection system based on Dilated Residual Networks (DRN), achieving Average Precisions (AP) of 99.8% and 97.4% for automatic and manual face synthesis manipulation, respectively.
Finally, we also include for completeness some important references to other recent studies focused on the detection of general GAN-based image manipulations, not facial ones. In particular, we refer the reader to [zhang2019detecting, huh2018fighting, zhou2018learning, nataraj2019detecting].
To summarise, this section highlights: i) how challenging it is for humans to detect this type of facial manipulation, and ii) the good performance achieved by most current automatic detection systems against this type of manipulation, as they are able to learn the GAN fingerprints present in fake images.
III Proposed Approach
Our proposed approach aims to remove all the discriminative information that allows state-of-the-art facial manipulation detection systems to easily distinguish fake images from real ones. Concretely, we propose to use a convolutional autoencoder.
In general, an autoencoder comprises two distinct parts, the encoder $\phi$ and the decoder $\psi$:
$$\phi : X \rightarrow l, \qquad \psi : l \rightarrow X'$$
where $X$ denotes the input image of the network, $l$ the latent feature representation of the input image after passing through the encoder $\phi$, and $X'$ the reconstructed image learned from $l$ after passing through the decoder $\psi$. Thus, $\phi$ and $\psi$ can be learned by minimising the reconstruction loss $\mathcal{L}$:
$$\phi, \psi = \arg\min_{\phi, \psi} \mathcal{L}(X, X'), \qquad \mathcal{L}(X, X') = \lVert X - \psi(\phi(X)) \rVert^{2}$$
Therefore, when $\mathcal{L}$ is nearly 0, $\phi$ is able to discard all redundant information from $X$ and code it into $l$ properly. However, for a reduced size of the latent feature representation vector, $\mathcal{L}$ will increase and $\phi$ will be forced to encode in $l$ only the most discriminative information. In this sense, we claim that the autoencoder can act as a GAN-fingerprint removal system.
Fig. 2 describes the details of our proposed approach. It comprises convolutional and max-pooling layers. As described in the figure, our proposed approach is trained using only real face images from the development dataset. Later on, in evaluation, once the autoencoder is trained, we can pass synthetic face images through the autoencoder to remove the GAN fingerprint information, and enhance them with statistical information learned from the real face images.
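The idea of training on real images only and then passing synthetic images through the bottleneck can be illustrated with a deliberately small toy example. The sketch below is not the convolutional autoencoder of Fig. 2: it assumes a linear autoencoder with tied weights, two-pixel "images", and a one-dimensional latent code, and all function names are hypothetical. It only shows the mechanism: components that never occur in the real training data (here, an off-line offset standing in for a GAN fingerprint) cannot pass through the latent representation.

```python
def encode(w, x):
    """Encoder: project the input onto a 1-D latent code l = w . x."""
    return sum(wi * xi for wi, xi in zip(w, x))


def decode(w, latent):
    """Decoder with tied weights: reconstruct x' = w * l."""
    return [wi * latent for wi in w]


def reconstruct(w, x):
    """Full pass through the autoencoder."""
    return decode(w, encode(w, x))


def reconstruction_loss(w, data):
    """Mean squared reconstruction error over a data set."""
    total = 0.0
    for x in data:
        total += sum((xr - xi) ** 2 for xr, xi in zip(reconstruct(w, x), x))
    return total / len(data)


def train(data, w, steps=2000, lr=0.1):
    """Plain gradient descent on the mean reconstruction loss."""
    w = list(w)
    for _ in range(steps):
        grad = [0.0] * len(w)
        for x in data:
            latent = encode(w, x)
            err = [xr - xi for xr, xi in zip(decode(w, latent), x)]
            err_dot_w = sum(e * wi for e, wi in zip(err, w))
            for k in range(len(w)):
                # d/dw_k of ||w (w . x) - x||^2
                grad[k] += 2.0 * (err[k] * latent + err_dot_w * x[k])
        w = [wi - lr * g / len(data) for wi, g in zip(w, grad)]
    return w


# Toy "real" images: two-pixel samples that all lie on the line x0 == x1.
real = [(i / 10.0, i / 10.0) for i in range(-10, 11)]
w = train(real, [0.5, 0.1])

# Toy "synthetic" sample: a real-like point plus an off-line "fingerprint".
fake = (1.0 + 0.2, 1.0 - 0.2)
cleaned = reconstruct(w, fake)  # the (+0.2, -0.2) offset is discarded
```

In this toy setting the trained autoencoder reconstructs real samples almost perfectly, while the offset added to the synthetic sample is projected away. The real system relies on the same principle, but with a non-linear convolutional model and a much richer notion of what real faces look like.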
Four different public databases are considered in the experimental framework. Fig. 3 shows some examples of each database. We now summarise the most important features.
IV-A Real Face Images
IV-A1 CASIA-WebFace [yi2014learning]
This database contains 494,414 face images from 10,575 actors and actresses listed on IMDb. The face images exhibit random variations in pose, illumination, facial expression, and resolution.
IV-A2 VGGFace2 [cao2018vggface2]
This database contains 3.31 million images from 9,131 different subjects, with an average of 363 images per subject. Images were downloaded from the internet and contain large variations in pose, age, illumination, ethnicity, and profession (e.g., actors, athletes, and politicians).
IV-B Synthetic Face Images
This database comprises 150,000 unique faces, collected from the website https://thispersondoesnotexist.com. The synthetic images are based on the recent StyleGAN approach [Karras_2019_CVPR] trained with the FFHQ database [FFHQ].
IV-B2 100K-Faces [100kfaces]
This database contains 100,000 synthetic images generated using StyleGAN [Karras_2019_CVPR]. In this database the StyleGAN network was trained using around 29,000 photos of 69 different models, producing face images with a flat background.
V Experimental Setup
In order to perform a fair experimental evaluation, we first remove from all real and synthetic images those factors that are related not to the face itself but to the specific conditions of each database. In this study we focus on two such factors.
Background: this is a clearly distinctive aspect between real and synthetic face images, as different acquisition conditions are considered in each database.
Head pose: images generated by GANs hardly ever deviate much from the frontal pose [Jain2019facialManipulation], in contrast with the most popular real face databases such as CASIA-WebFace and VGGFace2. Therefore, this factor may falsely improve the performance of the detection systems, since non-frontal images are more likely to be real faces.
To remove these factors from all real and synthetic images, we first extract 68 face landmarks using [Kazemi_2014_CVPR]. Given the landmarks of the eyes, an affine transformation is determined so that the eyes appear at the same distance from the borders in all images. This step removes all the background information of the image while keeping the maximum amount of facial region. Regarding the head pose, the landmarks are used to estimate whether a face is frontal or not. We keep only frontal face images in our experimental framework in order to avoid biased results. After this pre-processing stage, we always provide images of 224x224 pixels as input to the systems. Fig. 3 shows examples of the cropped faces of each database after applying the pre-processing stage considered in this experimental framework.
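The eye-based alignment described above can be sketched as follows. This is a minimal illustration, assuming a similarity transform (rotation, uniform scale, and translation, which is a special case of an affine transform) that is fully determined by the two eye centres. The function names and the target eye positions inside the 224x224 crop are hypothetical and are not taken from the paper.

```python
def eye_alignment_transform(left_eye, right_eye, target_left, target_right):
    """Similarity transform z -> a*z + b that maps the detected eye centres
    onto fixed target positions. Points are (x, y) tuples, handled
    internally as complex numbers x + iy."""
    p1, p2 = complex(*left_eye), complex(*right_eye)
    q1, q2 = complex(*target_left), complex(*target_right)
    if p1 == p2:
        raise ValueError("eye landmarks must be distinct")
    a = (q2 - q1) / (p2 - p1)  # rotation and uniform scale
    b = q1 - a * p1            # translation
    return a, b


def apply_transform(a, b, point):
    """Map one (x, y) point through the transform."""
    z = a * complex(*point) + b
    return (z.real, z.imag)


def as_affine_matrix(a, b):
    """Equivalent 2x3 affine matrix, the form usually expected by
    image-warping routines."""
    return [[a.real, -a.imag, b.real],
            [a.imag, a.real, b.imag]]


# Hypothetical eye landmarks in the source image and target positions
# in the 224x224 output crop.
a, b = eye_alignment_transform((100, 120), (180, 120), (70, 90), (154, 90))
matrix = as_affine_matrix(a, b)
```

Once the eyes are pinned to the same coordinates in every image, cropping a fixed 224x224 window around them discards the background in a consistent way across databases.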
V-B Facial Manipulation Detection Systems
Two different state-of-the-art detection approaches are considered in this study.
On the one hand, we consider the Xception network proposed in [chollet2017xception], as it provides the best detection results in most recent studies [Jain2019facialManipulation, rossler2019faceforensics++, dolhansky2019deepfake]. In particular, we follow the same training approach considered in [rossler2019faceforensics++]: i) we first consider the XceptionNet model pre-trained with ImageNet [deng2009imagenet], ii) we replace the original fully-connected layer of the ImageNet model with a new one (two classes, real or synthetic image), iii) we fix all weights up to the final layers and pre-train the network for a few epochs, and finally iv) we train the network for 20 more epochs and choose the best-performing model based on validation accuracy.
On the other hand, we replicate the recent technique proposed in [nataraj2019detecting], inspired by classical steganalysis. This approach consists of computing co-occurrence matrices directly on the image pixels of each of the red, green, and blue channels, and then passing this information through a CNN, which allows the network to extract robust non-linear features.
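To make the steganalysis-inspired features more concrete, the following sketch computes one pixel co-occurrence matrix per colour channel. It is an illustration under stated assumptions rather than the implementation of [nataraj2019detecting]: it counts pairs of horizontally adjacent pixels only, represents an image as nested Python lists of (R, G, B) tuples, and uses hypothetical function names.

```python
def cooccurrence_matrix(channel, levels=256, offset=(0, 1)):
    """Count how often intensity i is followed by intensity j at the
    given (dy, dx) offset. `channel` is a list of rows of integers in
    the range [0, levels)."""
    dy, dx = offset
    matrix = [[0] * levels for _ in range(levels)]
    height = len(channel)
    for y in range(height):
        for x in range(len(channel[y])):
            ny, nx = y + dy, x + dx
            if 0 <= ny < height and 0 <= nx < len(channel[ny]):
                matrix[channel[y][x]][channel[ny][nx]] += 1
    return matrix


def rgb_cooccurrence(image, levels=256):
    """One co-occurrence matrix per channel. `image` is a list of rows
    of (r, g, b) tuples. The three matrices can then be stacked and
    used as the input tensor of a CNN."""
    return [
        cooccurrence_matrix([[pixel[c] for pixel in row] for row in image], levels)
        for c in range(3)
    ]


# Tiny single-channel example with 3 grey levels.
toy_channel = [[0, 1, 1],
               [2, 2, 0]]
toy_matrix = cooccurrence_matrix(toy_channel, levels=3)
```

For 8-bit images each matrix has 256x256 entries regardless of the image size, so the CNN receives a fixed-size input that summarises local pixel statistics rather than image content.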
All experiments are implemented in the PyTorch framework on an NVIDIA Titan X GPU. The Adam optimiser is used with a learning rate of , dropout for model regularisation with a rate of , and a loss function based on binary cross-entropy.
|Development||Evaluation||XceptionNet [chollet2017xception]||Steganalysis [nataraj2019detecting]|
The experimental protocol designed in this study intends to perform an exhaustive analysis of state-of-the-art facial manipulation detection systems. Thus, three different experiments are considered: i) controlled scenarios, ii) in the wild scenarios, and finally iii) GAN-fingerprint removal.
Each database is divided into two different datasets, one for the development and training of the systems (70%) and the other one for the final evaluation (30%). Additionally, the development dataset is divided into two different subsets, training (75%) and validation (25%). The same number of real and synthetic images are considered in the experimental framework. In addition, for real face images, different users are considered in the development and evaluation datasets in order to avoid biased results.
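The protocol above can be sketched as a subject-disjoint split. The snippet is only an illustration: it assumes that the 70%/30% and 75%/25% proportions are applied at the subject level and that the training and validation subsets are also subject-disjoint, neither of which is stated in the text, and the function name and seed are hypothetical.

```python
import random


def split_protocol(subject_ids, dev_fraction=0.70, train_fraction=0.75, seed=0):
    """Split subjects into development (train + validation) and evaluation
    sets so that no subject appears in more than one set."""
    subjects = sorted(set(subject_ids))
    rng = random.Random(seed)  # fixed seed for a reproducible split
    rng.shuffle(subjects)

    n_dev = round(len(subjects) * dev_fraction)
    development, evaluation = subjects[:n_dev], subjects[n_dev:]

    n_train = round(len(development) * train_fraction)
    return {
        "train": development[:n_train],
        "validation": development[n_train:],
        "evaluation": evaluation,
    }


# Example with 200 hypothetical subject identifiers.
ids = [f"subject_{i:03d}" for i in range(200)]
parts = split_protocol(ids)
```

Splitting by subject rather than by image prevents the detector from being evaluated on identities it has already seen, which is the source of bias that the protocol aims to avoid for real faces.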
VI Experimental Results
VI-A Controlled Scenarios
In this first experimental section we evaluate how easy it is to detect entire face synthesis manipulations in controlled scenarios, i.e., when samples from the same databases are considered for both development and final evaluation of the systems. This is the strategy commonly used in most studies, resulting in very good classification results (see Sec. II).
In total, four different experiments are carried out in this section, from Exp. A.1 to Exp. A.4. Table II describes the development and evaluation databases considered in each experiment together with the corresponding final evaluation results in terms of EER. Additionally, we represent in Fig. 4 the evolution of the loss/accuracy of the XceptionNet and Steganalysis detection systems for the Exp. A.1, for completeness.
Analysing Fig. 4, both the XceptionNet and Steganalysis approaches are able to learn discriminative features to distinguish real from synthetic face images. The training process is faster for the XceptionNet detection system than for the Steganalysis one, converging to a lower loss value in fewer epochs (close to zero after 20 epochs). The best validation accuracies achieved in Exp. A.1 by the XceptionNet and Steganalysis approaches are 99% and 95%, respectively. Similar trends are observed in the other experiments.
We now analyse in Table II the final evaluation results obtained for each experiment (from Exp. A.1 to Exp. A.4) and detection approach. The XceptionNet system obtains good results in most experiments, with values around 1% EER. These results agree with previous studies on the topic (see Sec. II), proving the potential of the XceptionNet model in controlled scenarios. Finally, we also analyse the results achieved by the Steganalysis approach. In general, a high degradation of the system performance is observed compared with the XceptionNet approach, especially for the 100K-Faces database, e.g., a final 20.93% EER is obtained in Exp. A.4.
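Since all results in this section are reported in terms of EER, the following sketch shows one simple way to compute it from detector scores. It is an illustrative approximation, assuming that higher scores mean "more likely synthetic" and that the EER is taken at the observed threshold where the two error rates are closest. The function name is hypothetical and the scores are made-up toy values, not results from our experiments.

```python
def equal_error_rate(real_scores, fake_scores):
    """Approximate EER: the error rate at the threshold where the rate of
    real images flagged as fake is closest to the rate of fake images
    accepted as real."""
    best_gap, best_eer = None, None
    for threshold in sorted(set(real_scores) | set(fake_scores)):
        # Fake images scoring below the threshold are missed.
        miss_rate = sum(s < threshold for s in fake_scores) / len(fake_scores)
        # Real images scoring at or above the threshold are false alarms.
        false_alarm_rate = sum(s >= threshold for s in real_scores) / len(real_scores)
        gap = abs(miss_rate - false_alarm_rate)
        if best_gap is None or gap < best_gap:
            best_gap, best_eer = gap, (miss_rate + false_alarm_rate) / 2
    return best_eer


# Toy scores: one real image (0.6) scores higher than one fake image (0.4).
toy_real = [0.1, 0.2, 0.3, 0.6]
toy_fake = [0.4, 0.7, 0.8, 0.9]
toy_eer = equal_error_rate(toy_real, toy_fake)  # 0.25, i.e. 25% EER
```

An EER of 0% corresponds to perfectly separable score distributions, while an EER of 50% corresponds to a detector that performs no better than chance on balanced data.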
VI-B In the Wild Scenarios
This section evaluates the performance of facial manipulation detection systems in more realistic scenarios, i.e., in the wild. The following two aspects are considered: i) different development and evaluation databases, and ii) different image resolutions between the development and the evaluation of the models. This last aspect is crucial, as the quality of raw images/videos is usually modified when they are uploaded to social media, for example. The effect of image resolution has been preliminarily analysed in previous studies [rossler2019faceforensics++, korshunov2018deepfakes], but for different facial manipulation groups, i.e., face swapping/identity swap and facial expression manipulations. The main goal of this section is to analyse the real generalisation capacity of state-of-the-art detection systems in unconstrained scenarios.
First, we focus on the scenario of considering different development and evaluation databases, from Exp. B.1 to Exp. B.8 in Table II. Analysing the scenario of considering different synthetic databases (from Exp. B.1 to B.4), a high degradation of the system performance is observed regardless of the facial manipulation detection approach. For the XceptionNet, the average EER is 3.93%, i.e., over 5 times higher than the results achieved in Exp. A.1-A.4 (0.69% average EER). Regarding the Steganalysis approach, the average EER is 18.22%, i.e., almost 2 times higher than the results achieved in Exp. A.1-A.4 (9.77% average EER). This system performance degradation is produced as different GAN models are used in each database to generate the synthetic face images. Therefore, images of each synthetic database contain different GAN fingerprints, as mentioned in previous studies [yu2018attributing].
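The degradation factors quoted above follow directly from the average EERs. The short computation below simply reproduces that arithmetic from the values reported in the text; the function name is hypothetical.

```python
def degradation_factor(eer_in_the_wild, eer_controlled):
    """Ratio between two average EERs, both expressed in %."""
    return eer_in_the_wild / eer_controlled


# Average EERs (%) for Exp. B.1-B.4 versus Exp. A.1-A.4, as reported above.
xception_factor = degradation_factor(3.93, 0.69)
steganalysis_factor = degradation_factor(18.22, 9.77)

print(f"XceptionNet: {xception_factor:.2f}x")    # XceptionNet: 5.70x
print(f"Steganalysis: {steganalysis_factor:.2f}x")  # Steganalysis: 1.86x
```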
This performance degradation is even higher when we consider different development and evaluation databases for both real and synthetic faces, as shown from Exp. B.5 to B.8 in Table II. In this case, average EERs of 6.69% and 20.29% are obtained for the XceptionNet and Steganalysis approaches, respectively. This worsening is especially critical for the XceptionNet, whose average EER is almost 10 times higher than in the controlled scenarios of Sec. VI-A (0.69% average EER).
Finally, we also analyse how the image resolution affects this type of facial manipulation detection system. We focus only on the XceptionNet model, as it provides much better results than the Steganalysis approach. Fig. 5 depicts the system performance results in terms of EER (%), from lower to higher modifications of the image resolution. The facial manipulation detection systems trained from Exp. B.5 to B.8 are considered in this analysis. Therefore, the detection systems considered were trained with the raw image resolution provided in the databases. It is important to remark that we only modify the image resolution, not the image size (i.e., 224x224 pixels in all experiments). In general, we observe in all experiments a high degradation of the system performance as the image resolution decreases. For example, when the image resolution is reduced by 4/7, the average EER is 14.04%, an average absolute worsening of 7.43% EER compared with the raw image resolution (raw corresponds to 7/7). This performance degradation is even higher when we further reduce the image resolution, with EERs close to 50%.
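One way to change the resolution of an image while keeping its size is to downsample it and then upsample it back to the original dimensions. The sketch below illustrates this idea under several assumptions that are not specified in the text: nearest-neighbour interpolation, images stored as nested lists, and a hypothetical `factor` parameter giving the fraction of the original resolution that is kept.

```python
def resize_nearest(image, new_height, new_width):
    """Nearest-neighbour resize of an image stored as a list of rows."""
    old_height, old_width = len(image), len(image[0])
    return [
        [image[(y * old_height) // new_height][(x * old_width) // new_width]
         for x in range(new_width)]
        for y in range(new_height)
    ]


def degrade_resolution(image, factor):
    """Reduce the effective resolution by `factor` (0 < factor <= 1) while
    keeping the original image size."""
    height, width = len(image), len(image[0])
    low_height = max(1, round(height * factor))
    low_width = max(1, round(width * factor))
    low = resize_nearest(image, low_height, low_width)  # discard detail
    return resize_nearest(low, height, width)           # restore the size


# 4x4 toy image whose pixel values encode their position.
toy_image = [[row * 4 + col for col in range(4)] for row in range(4)]
toy_degraded = degrade_resolution(toy_image, 0.5)
```

In the toy example the output is still 4x4, but each 2x2 block now holds a single repeated value, which mimics the loss of fine detail that the detectors appear to depend on.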
These results prove the poor generalisation capacity of state-of-the-art facial manipulation detection systems to unseen conditions.
VI-C GAN-Fingerprint Removal
This section analyses our novel strategy based on the removal of the GAN-fingerprint information from the synthetic fake images in order to spoof current state-of-the-art facial manipulation detection systems. The XceptionNet detection system is trained using real and synthetic face images, as in the previous experiments. In particular, we use the experimental protocol of Exp. B.6 in Table II. Fig. 6 depicts the system performance results in terms of EER for the original B.6 experiment (4.43% EER) and for the different sizes of the latent feature representation of the autoencoder (our proposed GAN-fingerprint removal approach). Moreover, we also include face image examples in order to show how the latent feature representation size of the autoencoder impacts the visual quality of the image. In general, we observe a high degradation of the system performance when considering our proposed GAN-fingerprint removal approach compared with the original results achieved in Exp. B.6. In particular, the average EER when considering our proposed approach is 19.66%, i.e., over 4 times higher than the result achieved in the original Exp. B.6 (4.43% EER). Finally, it is also interesting to note that the system performance does not vary significantly with the latent feature representation size, suggesting that the performance decay observed is not due to a downsizing effect caused by the autoencoder, but to the removal of the GAN-fingerprint information from the synthetic face images. This aspect can be seen in the face image examples included in Fig. 6 (bottom), as the reconstructed images are visually similar to the original versions. This is also confirmed by the very low loss of the autoencoder.
In order to confirm our theory, we perform a thorough analysis to be sure that the autoencoder is really removing the GAN-fingerprint information and not just changing the image resolution of the images (as illustrated in Fig. 5). Thus, we first train the XceptionNet model considering different levels of image resolution, and then we pass our GAN-fingerprint removal images through the detection systems. Fig. 7 shows the system performance results in terms of EER for each different detection system and latent feature representation size of the autoencoder. Therefore, five different detection systems are separately trained per image resolution. The results achieved prove that EERs do not significantly decrease when using downsized synthetic images in training, concluding that our proposed approach is actually removing the GAN-fingerprint information.
The main aim of this work is to contribute to further improvements in the performance of fake face detection methods by describing a relatively simple method that spoofs state-of-the-art fake detection techniques. We started by training a deep autoencoder on public genuine face databases, so that it models the typical spatial correlations between the pixels of real facial images and simultaneously removes their high-frequency components, which are the components that make it possible to perceive the “fingerprints” of the models used to generate synthetic images. At test time, this autoencoder is fed with synthetic data to produce manipulated versions whose properties are deliberately changed in order to spoof fake face detection systems. In the empirical validation of our approach, we used various well-known face datasets and reached three major conclusions about the performance of state-of-the-art fake face detection methods: i) the existing systems attain almost perfect performance when the evaluation data is derived from the same source used in the training phase, which suggests that these systems have actually learned the GAN “fingerprints”; ii) the observed performance decreases substantially (by up to ten times) when the methods are exposed to data from unseen databases, and by over seven times if the image resolution is substantially reduced; and iii) the accuracy of the existing methods also drops significantly when analysing synthetic data manipulated by the approach described in this paper. In short, our experiments suggest that the existing fake face detection methods still have a poor generalisation capability and are highly susceptible to image transformations, even simple ones such as decreases in resolution or the manipulation proposed in this work.
While loss of resolution may not be particularly concerning in terms of the potential misuse of the data, it is important to note that our approach is capable of confounding detection methods while maintaining a high visual similarity with the original image.
This work has been supported by projects: BIBECA (RTI2018-101248-B-I00 MINECO/FEDER), Bio-Guard (Ayudas Fundación BBVA a Equipos de Investigación Científica 2017). Also, we gratefully acknowledge the donation of the NVIDIA Titan X GPU used for this research made by NVIDIA Corporation.
João Neves received the B.Sc., M.Sc., and Ph.D. degrees in Computer Science from the University of Beira Interior, Portugal, in 2011, 2013, and 2018, respectively. He is currently an assistant professor at the University of Beira Interior. His research interests broadly include computer vision and pattern recognition, with a particular focus on biometrics and surveillance.
Ruben Tolosana received the M.Sc. degree in Telecommunication Engineering and the Ph.D. degree in Computer and Telecommunication Engineering from Universidad Autonoma de Madrid, in 2014 and 2019, respectively. In April 2014, he joined the Biometrics and Data Pattern Analytics - BiDA Lab at the Universidad Autonoma de Madrid, where he is currently collaborating as a PostDoctoral researcher. Since then, Ruben has been granted several awards such as the FPU research fellowship from Spanish MECD (2015) and the European Biometrics Industry Award (2018). His research interests are mainly focused on signal and image processing, pattern recognition, deep learning, and biometrics, particularly in the areas of handwriting and handwritten signature. He is the author of several publications and also collaborates as a reviewer for many high-impact conferences (e.g., ICDAR, ICB, BTAS, EUSIPCO, etc.) and journals (e.g., IEEE TPAMI, TIFS, TCYB, TIP, ACM Computing Surveys, etc.). Finally, he has participated in several National and European projects focused on the deployment of biometric security throughout the world.
Ruben Vera-Rodriguez received the M.Sc. degree in telecommunications engineering from Universidad de Sevilla, Spain, in 2006, and the Ph.D. degree in electrical and electronic engineering from Swansea University, U.K., in 2010. Since 2010, he has been affiliated with the Biometric Recognition Group, Universidad Autonoma de Madrid, Spain, where he has been an Associate Professor since 2018. His research interests include signal and image processing, pattern recognition, and biometrics, with emphasis on signature, face, and gait verification and forensic applications of biometrics. He is actively involved in several National and European projects focused on biometrics. Ruben was Program Chair for the IEEE 51st International Carnahan Conference on Security and Technology (ICCST) in 2017 and for the 23rd Iberoamerican Congress on Pattern Recognition (CIARP 2018) in 2018.
Vasco Lopes received his B.Sc. (2017) and M.Sc. (2019) degrees in Computer Science and Engineering from the University of Beira Interior, Portugal, where he is now pursuing a Ph.D. in the area of computer vision. His current research interests include computer vision, robotics, and artificial intelligence.
Hugo Proença, B.Sc. (2001), M.Sc. (2004), and Ph.D. (2007), is an Associate Professor in the Department of Computer Science, University of Beira Interior, and has been researching mainly in biometrics and visual surveillance. He is the coordinating editor of the IEEE Biometrics Council Newsletter and the area editor (ocular biometrics) of the IEEE Biometrics Compendium Journal. He is a member of the Editorial Boards of Image and Vision Computing and the International Journal of Biometrics, and served as Guest Editor of special issues of the Pattern Recognition Letters, Image and Vision Computing, and Signal, Image and Video Processing journals.