On the Learning of Deep Local Features for Robust Face Spoofing Detection


Biometrics emerged as a robust solution for security systems. However, given the widespread use of biometric applications, criminals are developing techniques to circumvent them by simulating the physical or behavioral traits of legal users (spoofing attacks). Despite face being a promising trait due to its universality, acceptability and the presence of cameras almost everywhere, face recognition systems are extremely vulnerable to such frauds since they can be easily fooled with common printed facial photographs. State-of-the-art approaches, based on Convolutional Neural Networks (CNNs), present good results in face spoofing detection. However, these methods do not exploit the importance of learning deep local features from each facial region, even though it is known from face recognition that different face regions have very different visual aspects, which can also be exploited for face spoofing detection. In this work we propose a novel CNN architecture trained in two steps for this task. Initially, each part of the neural network learns features from a given facial region. Afterwards, the whole model is fine-tuned on whole facial images. Results show that such a pretraining step allows the CNN to learn different local spoofing cues, improving the performance and convergence speed of the final model and outperforming state-of-the-art approaches.

I Introduction

Biometric systems are increasingly common in our everyday activities [1]. By recognizing people by their own physical, physiological or behavioral traits, they inhibit most frauds when compared with security systems based on knowledge (e.g., passwords) or tokens (keys, cards, etc.). However, criminals are nowadays already developing techniques to accurately simulate biometric characteristics of valid users, such as face, fingerprint and iris, in order to gain access to places or systems, a process known as a spoofing attack [2, 3]. In this context, robust countermeasure methods must be developed and integrated with traditional biometric applications in order to prevent such frauds. Despite face being a promising trait due to its convenience for users, universality and acceptability, traditional face recognition systems can be fooled with common printed facial photographs of legal users [2], which can be easily obtained by criminals on the Internet, especially due to the spread of social media and networks.

Spatial image information, i.e., the distribution of and relationship between pixel values at neighboring positions in the two-dimensional coordinate system of the image, is extremely important in tasks involving faces, such as face detection [4] and face recognition [5, 6]. The different patterns of each facial region, i.e., the distribution of the facial elements, encode rich and discriminative information that helps distinguish faces from other objects, and also a given face from other ones. Regarding face spoofing detection and the importance of each facial region for such a task, different approaches based on handcrafted features, such as [7, 8], also mention that different spoofing cues can be extracted from different facial regions.

Recently, deep learning architectures emerged as good alternatives for solving complex problems and have reached state-of-the-art results in many tasks due to their great power of abstraction and robustness, working with abstract and high-level features self-learned from the training data [9, 10]. Among the proposed deep learning architectures, Convolutional Neural Networks (CNNs) [11] emerged as one of the most important classes of deep neural networks, able to deal with digital images with great performance.

Some CNN-based state-of-the-art approaches were recently proposed for face spoofing detection, such as [13, 12, 14, 15]. However, none of them take into account the different visual aspects of each facial region and, consequently, the different local spoofing cues that could be learned by the neural networks to improve their performance. All these methods work on whole faces, in a holistic way, or with small random patches, i.e., they train the networks with samples extracted from random regions of the faces, all together. This can also degrade the performance of the training algorithm, since the backpropagation method can get distracted by the different visual information of the regions from which the random patches were extracted, instead of making the neural network learn (by updating its weights) the real differences between real and fake faces in each facial region (regions with similar visual aspects, differing only due to the spoofing artifacts).

In this context, we propose a novel CNN architecture trained in two steps for a better performance in face spoofing detection: (i) a local pretraining phase, in which each part of the model is trained on one of the main facial regions, learning deep local features for attack detection and initializing the whole model at a good position in the search space (the network will learn to detect multiple and different spoofing cues from all the facial regions); and (ii) a global fine-tuning phase, in which the whole model is fine-tuned, based on the weights learned independently by its parts, on whole real and fake facial images, in order to improve the model generalization. Results obtained on two major datasets for the evaluation of face spoofing detection techniques, the Replay-Attack [8] and CASIA FASD (Face Anti-Spoofing Database) [16] datasets, show that the pretraining step on local regions of the face improves the performance of the final model and its convergence speed. The proposed approach outperformed state-of-the-art methods while working with an efficient CNN architecture.

II Technical Background

In this section we briefly present some concepts regarding the importance of spatial information and the differences between facial regions for face detection and recognition, as well as the face spoofing detection problem and some related works.

II-A Facial Regions and Spatial Information

The spatial relationship between the facial elements and regions in images encodes rich information that helps distinguish a face from the background or other objects, or even differentiate the faces of two different individuals [4, 5]. The initial works on automated face detection and recognition already used this kind of information extensively, achieving great results with high efficiency.

Regarding face detection, the early work of Viola and Jones [4] used Haar-like features to identify the presence of faces in digital images. In short, they apply, to each area of a given image, a cascade classifier which verifies, hierarchically, whether all main facial features are present at that position, discarding the area if the test fails for any of the facial features considered. Such features capture, in short, the contrast differences between neighboring regions (the intensity of their pixels) typically found in human faces. Fig. 1 shows some of these features and their correspondence to the regions of human faces. The black rectangles represent regions expected to be darker and the white ones, regions expected to be brighter. The feature shown in the middle focuses on darker and brighter areas corresponding to the eyes (especially due to the eyebrows) and cheeks, respectively. The feature on the right searches for the contrast between the nose and eyes in human faces.

Fig. 1: Original face detected (left) based on the features used by Viola and Jones [4] (two of them shown on top) to quickly detect the presence of a human face in a given image. The method searches for the different visual patterns of the neighboring facial regions. Image extracted from the documentation of the well-referenced OpenCV [17] library for Computer Vision, which implements the method.
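Such rectangle contrast features can be evaluated in constant time with an integral image (summed-area table). The sketch below, not part of the original method's implementation, illustrates this with a hypothetical two-rectangle "edge" feature (bright region minus dark region) in NumPy:

```python
import numpy as np

def integral_image(img):
    # Summed-area table: ii[y, x] = sum of img[:y, :x].
    return np.pad(img, ((1, 0), (1, 0))).cumsum(0).cumsum(1)

def rect_sum(ii, y, x, h, w):
    # Sum of pixels in the h x w rectangle with top-left corner (y, x),
    # computed in O(1) from the integral image.
    return ii[y + h, x + w] - ii[y, x + w] - ii[y + h, x] + ii[y, x]

def two_rect_feature(ii, y, x, h, w):
    # Haar-like "edge" feature: sum of the top half minus the bottom half.
    top = rect_sum(ii, y, x, h // 2, w)
    bottom = rect_sum(ii, y + h // 2, x, h // 2, w)
    return top - bottom
```

A large positive response of such a feature over the eye/cheek area, e.g., is the kind of evidence the cascade accumulates before declaring a face.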

Based on this initial work, which allowed, for the first time, automated face detection in reasonable time for real applications, many studies, including recent ones, have been reported, most of them based on variations of these initial features, but still relying on contrasts between neighboring facial regions, such as [18, 19, 20].

When dealing with face recognition, the first effective method for real scenarios, given the highly complex process of human face analysis and matching, was proposed by Turk and Pentland [5], based on Principal Component Analysis (PCA) [21], which can be used to find the most discriminative eigenvectors that best describe the variance of the data, in our case faces, and to reduce the dimensionality of the problem. Given the similarity of such eigenvectors (when represented as 2D images) to facial images, Turk and Pentland called them eigenfaces. It is possible to identify the facial elements and regions (and their spatial relationship) in the eigenfaces, indicating that this kind of information is important to differentiate faces of different people, i.e., when reducing the dimensionality of the data (faces), this kind of information has a great discriminative power for face recognition algorithms.
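The eigenface computation above can be sketched in a few lines of NumPy. This is an illustrative reimplementation, not the code of [5]; the function names and the SVD route (rather than an explicit covariance matrix) are our choices:

```python
import numpy as np

def eigenfaces(faces, k):
    # faces: (n_samples, n_pixels) matrix of flattened face images.
    mean_face = faces.mean(axis=0)
    centered = faces - mean_face
    # SVD of the centered data; the rows of vt are the principal
    # directions ("eigenfaces" when reshaped back to image shape).
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return mean_face, vt[:k]

def project(face, mean_face, components):
    # Low-dimensional code used for matching faces.
    return components @ (face - mean_face)
```

Matching then reduces to comparing the k-dimensional projections instead of raw pixel vectors.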

The great majority of recently proposed works, most of them based on the ideas of this original study, also consider the main facial regions for recognizing people. Works based on other transformations for reducing the dimensionality of the facial image space, such as those based on Linear Discriminant Analysis (LDA) [23], also obtain, as the “basis” of the new coordinate system, vectors that resemble human faces when viewed as 2D images, with the different facial regions being noticeable in them. Deep learning based architectures themselves, which self-learn the most discriminative features for face representation from the data, also capture the spatial information and relationships between facial elements and regions, presenting weights that serve, in most cases, as edge and facial element detectors (eyes, nose, etc.), such as in the VGG-Face model [24], which was originally trained on 2.6 million facial images.

Research in Psychology, such as [25, 26], also showed that human beings have an extreme ability to detect faces, more accurately and much faster than any other object, and highlighted the importance of spatial information and the positioning of the facial regions and elements for face detection and recognition. In [26], e.g., the authors found that the presentation time required for a group of people to detect a visual stimulus as a face was shorter for normal faces (38 ms) than for jumbled faces (56 ms), i.e., faces with some parts out of place (the region of the mouth above the region of the eyes, etc.). Despite all this, no work, to the best of our knowledge, has studied and used deep local features to improve the performance of state-of-the-art deep learning architectures, which is the main motivation and goal of this work.

II-B Face Spoofing Detection

According to Ratha, Connell and Bolle [3], as in any other security system, there are many ways to attack a biometric system. In short, attacks on biometric applications can be divided into two groups: direct and indirect attacks. In direct attacks (spoofing attacks), criminals generate synthetic samples of biometric traits of legal users, such as photographs (face simulation), gelatin fingers (fingerprint simulation), contact lenses (iris simulation), among others, to obtain access to places or systems. Criminals try to fool the capture sensor with such samples, the most vulnerable point of the biometric recognition system [3].

In indirect attacks, criminals, after investigating the inner workings of the system and based on some fragility, act by modifying the algorithms used to match templates or the internal messages exchanged by the system modules [3]. Fig. 2 shows the main points of attack of a biometric system. It is important to note, however, that the great majority of attacks on biometric systems are direct ones, due to the simplicity for the attackers, who do not need to investigate the inner workings of the system.

Fig. 2: Points of attack in a traditional biometric system. The spoofing attacks occur at point “1”, i.e., by fooling the sensor (presentation of fake traits) [27].

Among the main biometric traits, face is a promising one especially due to its convenience, low cost of acquisition and acceptability by users [1], being very suitable for a wide variety of environments, including mobile ones. However, despite all these advantages, face recognition systems are the ones that suffer the most from spoofing attacks, since they can be easily fooled even with common printed photographs [2]. Fig. 3 shows some real and fake faces from the Replay-Attack [8] dataset. As can be seen, it is very difficult to distinguish between real and fake faces.

Fig. 3: Images from real (first row) and fake faces (second row) from the Replay-Attack [8] dataset.

Regarding face spoofing attacks, they can be performed by presenting to the cameras of the biometric systems a static face image (printed, shown on displays of mobile devices, or a 3D mask) or a dynamic set of face images (videos recorded from the faces of legal users and displayed on mobile devices) [2]. As one can observe in Fig. 3, different spoofing cues can be analyzed in each facial region, such as shadows (more common, when considering 2D fake faces, in the outer regions of real faces).

II-C Related Works

Face spoofing detection methods have been proposed in the literature in the last years. Regarding the approaches that work with handcrafted features, most of them focus on detecting spoofing artifacts and image quality distortions in order to identify fake faces. Some of them, e.g., extract color features, based on the assumption that fake faces, when recaptured by the cameras of the biometric systems, present distortions in color, reflectance, etc., due to the properties of the materials they are made of. In [28], e.g., the authors argue that fake faces tend to present darker colors and different contrasts, as well as more low-frequency areas, than real faces. They use such information to extract features for classifying faces as real or fake.

Other works such as [29, 30, 31] extract texture features based on the LBP (Local Binary Patterns) [32] descriptor and its variations to characterize real and fake faces, presenting good results. In [29], e.g., they extract specific features from each facial region and combine them into a final feature vector at the end of the process for classification, considerably improving the results of the method compared to working with features from the whole face. Some of these works also mention that the best features were extracted from specific facial regions, especially the central one and the forehead area [29].

Among the approaches for face spoofing detection which use deep learning architectures, more specifically Convolutional Neural Networks (CNNs) [11], since for this task they obtained the state-of-the-art results, to the best of our knowledge, all of them work on whole faces, learning global spoofing cues, or on small random patches extracted from the faces, not focusing on learning the local spoofing cues of each facial region. In [12], e.g., the authors apply a Transfer Learning algorithm in order to adapt the VGG-Face [24] model, a benchmark CNN for face recognition trained on 2.6 million facial images, to spoofing detection, given the similar domains of the problems, obtaining great results. In [33], the authors also apply a similar Transfer Learning algorithm to the VGG-Face model, using it for feature extraction without modifying the original model, focusing on efficiency. In [34], a more time-consuming Transfer Learning algorithm is applied to the VGG-Face [24] model, in which layers of the original CNN are updated for the spoofing detection task, obtaining great results, but also making the process more expensive and requiring more processing power (advanced GPUs) and time.

All three mentioned studies based on VGG-Face work with whole facial images as input. Other important works in the literature such as [15, 35] also extract global deep spoofing cues from the faces based on other architectures. In [14], e.g., the authors propose a CNN model and integrate it with a Long Short-Term Memory (LSTM) [36] neural network for learning temporal holistic features from the faces in sequences of images (videos), also obtaining a good performance.

In [13], the authors explore random patches for face spoofing detection. They use this approach for augmenting the dataset, but present the patches all together (from random and different parts of the faces) to train their CNN architecture. Despite the good results, given the extremely different visual patterns of each facial region, the neural network can get distracted and base its predictions for attack detection mainly on the structural information of the faces (the visual patterns of the facial components in each region), which is much more evident in the images (and useful only for face recognition), instead of focusing on the spoofing cues in the patches. In other words, the backpropagation algorithm, i.e., the weight updates of the CNN, can be more influenced by the structural aspects of the facial elements (the presence or absence, size, shape, etc. of the eyes in a given patch, e.g.) than by the subtle spoofing cues.

Another well-known patch-based approach for face spoofing detection, presented in [7], works with small, non-fixed patches (regions) of the faces to train classification models. In each face, after an extensive analysis based on several metrics, they select the best patches to represent the whole facial image in order to classify it as real or fake. They use many metrics to determine which patch should be selected to represent the face, and these patches are obtained from different regions of the faces for each sample, which also degrades the performance of the method in learning spoofing cues (they work with traditional classifiers).

Despite the lack of attention to deep local features in face spoofing detection, Krizhevsky, Sutskever and Hinton [37] demonstrated, on other image classification tasks, that the use of local (and fixed) regions of the images (local visual information), in an initial training step of the deep learning model, tends to improve its performance, also helping to avoid poor local minima in the optimization. Ba et al. [38] also suggested the use of facial patches for initializing deep models applied to face recognition, based on studies in Neuroscience. Another interesting work [39] uses such an initial training step based on fixed image patches to improve vehicle classification in images.

III Proposed Approach

In this work, we propose a novel CNN architecture for face spoofing detection, which we call lsCNN (Locally Specialized CNN), with a novel training algorithm for a more effective learning of deep local spoofing features, based on two steps: (i) a local pretraining phase, in which each part of the model is trained on one of the main facial regions (predefined and fixed), learning deep local features for attack detection from such areas and allowing to initialize the whole model at a good position in the search space; and (ii) a global fine-tuning phase, in which the whole model is fine-tuned, based on the weights learned independently by its parts on the facial regions, in order to improve the model generalization.

III-A lsCNN Architecture

Basically, lsCNN presents 4 convolutional and pooling layers (Conv1 to Pool4) at the bottom, each convolutional layer being immediately followed by batch normalization, scale and signal rectification (ReLU - Rectified Linear Unit) layers. The batch normalization and scale layers serve to normalize the output feature maps obtained in the convolutional layers, improving learning. The rectification function, in each neuron, acts as the activation function, eliminating negative values in the resulting feature maps and also accelerating training. At the top of the network there is a fully-connected layer, also followed by batch normalization, scale and ReLU layers, as well as a dropout one. Finally, there is a softmax layer with two neurons in order to classify the faces as real or fake. Tab. I presents the lsCNN architecture in terms of its layers, i.e., kernel sizes, strides, and sizes of input and output feature maps.

Layer Kernel Size Stride Input Maps Output Maps
Conv1 1
Pool1 2
Conv2 1
Pool2 2
Conv3 1
Pool3 2
Conv4 1
Pool4 2
TABLE I: Architecture of the proposed lsCNN. The inputs of lsCNN are RGB (3 channels) facial images with pixels: maps.
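The spatial size of the feature maps can be traced through the four conv/pool stages with the standard output-size formula. The sketch below is illustrative only: the kernel sizes (3x3 convolutions with padding 1, 2x2 pooling) and the 96x96 input are assumptions, since these values do not survive in the table above:

```python
def conv_out(size, kernel, stride=1, pad=0):
    # Standard convolution/pooling output-size formula.
    return (size + 2 * pad - kernel) // stride + 1

def lscnn_shapes(input_size, kernel=3, pool=2):
    # Trace the spatial size through 4 conv (stride 1) + pool (stride 2)
    # stages, as in Tab. I. Kernel/input sizes here are assumed, not
    # taken from the paper.
    sizes = [input_size]
    s = input_size
    for _ in range(4):
        s = conv_out(s, kernel, stride=1, pad=kernel // 2)  # conv keeps size
        s = conv_out(s, pool, stride=pool)                  # pool halves it
        sizes.append(s)
    return sizes
```

Under these assumptions, each pooling stage halves the spatial resolution, so a 96-pixel input would shrink to 48, 24, 12 and finally 6 pixels before the fully-connected layer.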

As shown in Tab. I, lsCNN expects as input 3-channel facial images in the RGB color space. Although other color spaces allow dealing more accurately with illumination issues, we opted for this representation over other color systems or even, e.g., texture-based images, in order to approximate the model to the inner workings of human eyes (which capture only red, green and blue light waves) and their perception in natural conditions, and also because most digital cameras originally register their images in RGB mode.

III-B Local Pretraining

Similarly to [37], in order to initialize the whole lsCNN model at a good position in the search space and make it specialized in deep local spoofing features from each region of the faces, we split each training face into its 9 main regions (patches), shown in Fig. 4, regions also adopted in some works on face recognition.

Fig. 4: A face image ( pixels) obtained from the Replay-Attack dataset [8] split into 9 fixed patches (non-overlapping regions of pixels).

Afterwards, we also split the lsCNN architecture into 9 independent smaller CNNs, called PatchNets for simplicity, each of them presenting a ninth of the size of the original model and being trained on one of the 9 main facial regions considered, from p1 to p9. Each PatchNet has as input RGB patches with pixels from the respective region of the training faces. Tab. II shows the architecture of each PatchNet and Fig. 5 illustrates the training process of the 9 instances of this smaller neural network on the facial regions of a given image. As one can observe, on top of each PatchNet there are 2 softmax neurons, since they are also trained to classify their respective patches as real or fake.

Layer Kernel Size Stride Input Maps Output Maps
Conv1 1
Pool1 2
Conv2 1
Pool2 2
Conv3 1
Pool3 2
Conv4 1
Pool4 2
TABLE II: Architecture of each smaller CNN (PatchNet), part of the lsCNN, trained on each facial region, from p1 to p9 (fixed patches with pixels, also in RGB color space).
Fig. 5: Illustration of the local pretraining process of lsCNN. Given a facial image, it is split into its 9 main regions, from p1 to p9, and 9 instances of the smaller CNN architecture (PatchNet) are trained, one on each of them.
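The 3x3 grid split described above can be sketched as follows. This is an illustrative NumPy helper, not the original implementation; it assumes the crop dimensions are divisible by 3:

```python
import numpy as np

def split_into_patches(face, grid=3):
    # Split an H x W x C face crop into a grid x grid set of fixed,
    # non-overlapping patches (p1..p9 for grid=3), in row-major order.
    h, w = face.shape[0] // grid, face.shape[1] // grid
    return [face[i * h:(i + 1) * h, j * w:(j + 1) * w]
            for i in range(grid) for j in range(grid)]
```

Each of the 9 returned patches then feeds one PatchNet during the local pretraining phase.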

III-C Global Fine-tuning

After training the 9 smaller neural networks on their respective facial regions, their weights and biases are used to initialize the parts of the whole lsCNN, for a fine-tuning step of the larger model on the whole training facial images, in order to improve its generalization.

As shown in Fig. 6, each smaller network initializes the weights of the connections and the biases of a ninth of the lsCNN model. The connections of lsCNN between neurons from different parts of the larger model are zero-initialized. The weights of the two fully-connected layers on top are randomly initialized from a normal distribution in order to further improve the generalization of the model, as in [37]. Their biases are also zero-initialized. In Fig. 6, for simplicity, in each part of lsCNN, only the connections from one neuron in a given feature map (FM) to the neurons of the previous layer are shown, along with the connections of the selected neuron in the first part of lsCNN to its receptive fields in the other parts of the whole model. However, lsCNN has all the connections of a traditional CNN.

Fig. 6: Illustration of the initialization of the lsCNN model based on the weights of the 9 PatchNets. The thicker colored lines each actually represent many connections, and are initialized with the weights learned locally by each PatchNet. The black dotted lines indicate zero-initialized connections and the thinner green ones are initialized with random values from a normal distribution (zero-mean and standard deviation of , by default).

After the initialization, the same training facial images (split into patches in the former step) are used to fine-tune the weights of the whole lsCNN model, also allowing it to detect some global or more generic features which were not learned locally in the pretraining step.
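For a layer viewed as a single weight matrix, this initialization amounts to placing the learned PatchNet weights on the block diagonal and zeros in the cross-part positions. The sketch below is a simplified 2-D illustration of that idea (real conv layers store 4-D tensors; the function name and flattened layout are our assumptions):

```python
import numpy as np

def assemble_weights(patch_weights):
    # patch_weights: list of (out_i, in_i) weight matrices, one per
    # PatchNet, for corresponding layers. The matching lsCNN layer is
    # block-diagonal: learned weights on the diagonal, zeros for the
    # cross-part connections (adjusted later by global fine-tuning).
    out = sum(w.shape[0] for w in patch_weights)
    inp = sum(w.shape[1] for w in patch_weights)
    big = np.zeros((out, inp))
    r = c = 0
    for w in patch_weights:
        big[r:r + w.shape[0], c:c + w.shape[1]] = w
        r += w.shape[0]
        c += w.shape[1]
    return big
```

During fine-tuning, the zero blocks are free to grow nonzero values, letting the network combine cues across facial regions.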

IV Experiments, Results and Discussion

We evaluated the proposed lsCNN architecture on three important face spoofing detection databases: (i) the NUAA Impostor Database [40]; (ii) the Replay-Attack dataset [8]; and (iii) the CASIA FASD (Face Anti-Spoofing Database) [16]. Subsecs. IV-A and IV-B describe the experiments and the results obtained, as well as some discussions.

IV-A NUAA Impostor Database

The NUAA Photograph Impostor Database [40] contains grayscale facial photographs (already cropped) obtained from real and fake faces: 3,491 images for training (1,743 from real faces and 1,748 from printed ones) and 9,123 test images (3,362 real and 5,761 fake facial images). We performed an initial experiment on this smaller dataset and, for this, we had to reduce the depth of the lsCNN model, eliminating the third and fourth convolutional and pooling layers due to the small size of the input faces ( pixels - input patches with only pixels). Given this reduction in depth, for this experiment we augmented the width of the original lsCNN: the first convolutional layer presented output feature maps and the second, . The fully-connected layer presented neurons and, following [11], kernels (with a stride of 2 pixels) were used in the convolutions, given the resulting shallow architecture.

The whole lsCNN model was also divided into 9 parts and we initialized all weights of the PatchNets with random values from a zero-mean normal distribution (with standard deviation of ), and normalized the input facial images (before splitting them) by subtracting the mean value of the training set and dividing the pixel values by 128, in order to ensure that most of them would belong to the interval . The biases of the neurons were all zero-initialized.
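The normalization step can be sketched as below (an illustrative helper; we assume the target interval is roughly [-1, 1], which dividing mean-centered 8-bit values by 128 produces):

```python
import numpy as np

def normalize(images, train_mean):
    # Subtract the training-set mean and divide by 128 so that most
    # pixel values fall roughly in [-1, 1] (applied before the split
    # into patches).
    return (images.astype(np.float32) - train_mean) / 128.0
```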

As the optimizer, we used the Adam method with the following parameters: 64 training images per batch; base learning rate of , first momentum of and second momentum of . We trained the 9 PatchNets for iterations using the Caffe framework [41], initialized the whole lsCNN model with their learned weights and biases and trained the whole CNN for iterations on the whole training faces. For performance comparison, we also assessed a CNN with the same architecture as lsCNN, but traditionally trained, i.e., by initializing all its weights with random values extracted from a normal distribution with zero mean and standard deviation of , and training it on the whole faces, also for iterations. Fig. 7 shows the ROC (Receiver Operating Characteristic) curves of lsCNN and of the CNN traditionally trained on whole faces, learning global features. As one can observe, the proposed approach presented a much higher ROC curve than the traditional CNN. The proposed CNN architecture obtained an Equal Error Rate (EER) of 14.10%, while the traditionally trained CNN obtained a much worse EER of 23.11%.

Fig. 7: ROC curves for lsCNN and a CNN with the same architecture, but trained traditionally, i.e., on the whole faces of NUAA [40], without a local pretraining step. The higher the curve, the better.

IV-B Replay-Attack and CASIA Databases

In order to allow a more robust analysis of lsCNN, we performed two larger experiments: on the Replay-Attack [8] and CASIA [16] databases. The Replay-Attack dataset contains 360 videos for training (60 videos of real faces and 300 of fake ones), 360 videos for validation, used to calibrate the threshold of the system that determines whether a given face image (extracted from a video frame) is real or fake, and a test set with 80 videos of real users and 400 videos of fake faces. The CASIA [16] dataset presents videos of 50 subjects, 12 videos per subject, 3 of them real and 9 fake. The dataset is divided into a training set (20 subjects, i.e., 240 videos) and a test set (30 subjects - 360 videos). There is no validation set explicitly defined for this database.

We detected and cropped the faces in the frames of the videos in both experiments using the robust MTCNN [43] deep neural network, for an accurate face segmentation. Based on the eye landmarks of a face, returned as output by this method, we applied a scale transformation to the respective image in order to normalize the distance between the eyes to 60 pixels (using the MATLAB algorithm based on interpolation and on the values of the nearest pixels). After detecting and normalizing the face in each frame, we cropped it based on the eyes, capturing the whole facial region (both ears, forehead and chin), with a fixed size of pixels in the RGB color space. Some cropped faces from the Replay-Attack dataset are shown in Fig. 3. In both experiments, in order to classify a video, we considered a majority voting scheme over the faces in its frames. Frames with no face detected by the MTCNN architecture were discarded.
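The eye-distance normalization reduces to computing a single scale factor from the two eye landmarks. A minimal sketch (the function name is ours; the 60-pixel target is from the text):

```python
def eye_scale_factor(left_eye, right_eye, target_dist=60.0):
    # Scale factor that normalizes the inter-eye distance to
    # target_dist pixels, given (x, y) eye landmarks such as those
    # returned by a face detector.
    dx = right_eye[0] - left_eye[0]
    dy = right_eye[1] - left_eye[1]
    return target_dist / (dx ** 2 + dy ** 2) ** 0.5
```

The whole frame is then resized by this factor (with interpolation) before the fixed-size crop around the eyes is taken.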

Differently from the experiment with the NUAA dataset, in the experiments with the Replay-Attack and CASIA databases we considered the original architecture of lsCNN, given the larger facial images obtained. After cropping the faces of all frames of all training videos, an augmentation process was performed on the datasets. In each of them, initially and for each facial image, we generated two new versions of it by increasing or decreasing the values of the R, G and B channels by 50. This was done in order to force the network not to rely on brightness for spoofing detection (we did not apply techniques for attenuating the shadows on the faces, since they are important to distinguish real faces from 2D fake ones).

For each of the three versions of each original training facial image, we also applied noise or blur transformations at three levels each (with low magnitudes, so as not to affect the images much), in order to make the neural network also learn smoother features and not rely too much on noise. Again we used the MATLAB toolbox for applying blur and Gaussian noise to the images. The blur operation was applied at three levels (using a Gaussian filter with standard deviations of , and ), as well as the Gaussian noise (with standard deviations of , and ). These transformations were applied in isolation, so we obtained, for each of the three initial images from a given face, 6 representations of each. In this sense, we augmented our dataset 19 times (original images and transformed images).
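The brightness and noise parts of this augmentation can be sketched in NumPy as below. This is an illustrative reimplementation (the original used MATLAB); function names are ours and the noise level shown is a placeholder, since the exact standard deviations do not survive in the text:

```python
import numpy as np

def brightness_variants(img, shift=50):
    # Original image plus brighter/darker versions (R, G and B shifted
    # by +/- shift), clipped to the valid 8-bit range.
    return [img,
            np.clip(img.astype(np.int16) + shift, 0, 255).astype(np.uint8),
            np.clip(img.astype(np.int16) - shift, 0, 255).astype(np.uint8)]

def add_gaussian_noise(img, sigma, rng):
    # Low-magnitude additive Gaussian noise, one of the three levels.
    noisy = img.astype(np.float32) + rng.normal(0.0, sigma, img.shape)
    return np.clip(noisy, 0, 255).astype(np.uint8)
```

Blur variants would be produced analogously with a Gaussian filter at three standard deviations.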

For the Replay-Attack dataset we obtained training facial images, and for the CASIA dataset, images. Again, we initialized all weights of the smaller PatchNets with random values from a zero-mean normal distribution (standard deviation of ) and normalized each channel of the input facial images by subtracting its mean value and dividing all the image values by 128 (before splitting them), again in order to ensure that most of them would belong to the interval . The biases of the neurons were all zero-initialized. As the optimizer, we also used the Adam method in both cases, with the same parameters: 64 training images per batch; base learning rate of , first momentum of and second momentum of .

In both experiments, we trained the 9 smaller PatchNets for iterations on the facial patches using the Caffe framework [41] and then initialized the whole lsCNN model with their weights. We then fine-tuned it for up to iterations. For the Replay-Attack dataset, the best model was obtained (considering results on the validation set of videos) at iteration . For the CNN with the same architecture, traditionally initialized with random values drawn from a zero-mean normal distribution with standard deviation of and trained on the whole faces, the best model was obtained much later, at iteration . The results of the proposed approach and of state-of-the-art methods are shown in Tab. III.
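The patch extraction that feeds the 9 PatchNets can be sketched as a non-overlapping 3×3 split of the face (this grid layout is our assumption, consistent with 9 patch networks; the paper's exact patch geometry may differ):

```python
import numpy as np

def split_into_patches(face, grid=3):
    """Split a facial image into a `grid` x `grid` arrangement of
    non-overlapping patches (9 patches for grid=3), one per PatchNet.
    Assumes the image dimensions are divisible by `grid`."""
    h, w = face.shape[:2]
    ph, pw = h // grid, w // grid
    return [face[r * ph:(r + 1) * ph, c * pw:(c + 1) * pw]
            for r in range(grid) for c in range(grid)]
```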

Method EER HTER
Whole Fine-Tuned VGG-Face [34] – 1.20
Efficient Fine-Tuned VGG-Face [33] – 16.62
Patch Based Handcrafted Approach [7] – 5.0
Fine-Tuned VGG-Face [12] 8.40 4.30
Li et al. [12] 2.90 6.10
Random Patches Based CNN [13] 2.50 1.25
Boulkenafet et al. [42] 0.40 2.90
lsCNN Traditionally Trained 0.33 1.75
lsCNN 0.33 2.50
TABLE III: Results on the Replay-Attack dataset: Equal Error Rate (EER, %) on the validation set and Half-Total Error Rate (HTER, %) on the test set ("–" indicates a value not reported). Best values are highlighted.
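For reference, the EER and HTER metrics reported in Tab. III can be computed from classifier scores as sketched below (a simplified threshold scan of our own; in practice the threshold is fixed on the validation set at the EER point and then reused to compute the test-set HTER):

```python
import numpy as np

def far_frr(genuine, impostor, thr):
    """False acceptance / false rejection rates at threshold `thr`,
    assuming higher scores mean 'more likely genuine'."""
    far = float(np.mean(np.asarray(impostor) >= thr))
    frr = float(np.mean(np.asarray(genuine) < thr))
    return far, frr

def eer_threshold(genuine, impostor):
    """Threshold where FAR and FRR are closest (the EER operating
    point), found by scanning every candidate score value."""
    cands = np.sort(np.concatenate([genuine, impostor]))
    rates = [far_frr(genuine, impostor, t) for t in cands]
    i = int(np.argmin([abs(a - b) for a, b in rates]))
    return cands[i], rates[i]

def hter(genuine, impostor, thr):
    """Half-Total Error Rate: mean of FAR and FRR at a threshold
    chosen beforehand (on the validation set)."""
    far, frr = far_frr(genuine, impostor, thr)
    return (far + frr) / 2.0
```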

As one can observe, besides obtaining the best EER, lsCNN presented a very low HTER, much better than expensive methods such as [34], which relies on an extremely complex and large CNN (VGG-Face). Although lsCNN obtained a slightly worse HTER than the traditionally trained network, it reached these results much faster (at a much earlier training iteration), as mentioned.

Regarding the CASIA experiment, the best model for lsCNN was obtained at iteration , and for the traditionally trained CNN at iteration . In order to compare these methods with state-of-the-art approaches, we measured the EER, since this dataset provides a predefined test set. Tab. IV shows the results.

Method EER
Fine-tuned VGG-Face [12] 5.20
LSTM-CNN [14] 5.17
Yang et al. [15] 4.92
Patch Based Handcrafted Approach [7] 4.65
Li et al. [12] 4.50
Random Patches Based CNN [13] 4.44
lsCNN Traditionally Trained 4.44
lsCNN 4.44
TABLE IV: Results on the CASIA [16] dataset of the proposed network architecture (lsCNN) and other state-of-the-art methods. Best values are highlighted.

As one can observe, lsCNN obtained the best EER on the CASIA dataset, tied with the traditionally trained CNN and the work of [13], outperforming approaches that require complex and expensive architectures. Besides, compared with the traditionally trained CNN, its training was much faster (lsCNN reached its best performance at iteration , against iteration for the same architecture traditionally trained).

V Conclusion

Face spoofing detection is a critical task nowadays, given the widespread use of face recognition systems and the development, by criminals, of attack techniques that simulate the faces of legal users and can easily fool traditional face recognition systems with common printed facial photographs, available on social media and networks.

Although face recognition and detection methods take into account the different regions of the human face, to the best of our knowledge no technique so far has used deep local spoofing cues for attack detection, as we propose. The results show a significant increase in the performance of CNNs when initialized from a local pretraining stage, obtaining state-of-the-art results on three datasets even with a quite compact model, much more efficient than benchmark CNNs such as VGG-Face, widely used for this task through Transfer Learning.

The proposed learning algorithm can also be easily applied to train other CNN models, including larger architectures and CNNs with higher capacity, further improving the results obtained. The dataset augmentation, as we performed it, will be critical in such a process.


  1. A. K. Jain, A. A. Ross, and K. Nandakumar, Introduction to Biometrics. United States: Springer, 2011.
  2. K. Patel, H. Han, and A. K. Jain, “Secure face unlock: spoof detection on smartphones,” IEEE Transactions on Information Forensics and Security, vol. 11, no. 10, pp. 2268-2283, 2016.
  3. N. Ratha, J. Connell, and R. Bolle, “An analysis of minutiae matching strength,” in Proc. of International Conference on Audio- and Video-Based Biometric Person Authentication, pp. 223-228, 2001.
  4. P. Viola and M. Jones, “Rapid object detection using a boosted cascade of simple features”, in Proc. of IEEE Conference on Computer Vision and Pattern Recognition, 2001.
  5. M. Turk and A. Pentland, “Face recognition using eigenfaces”, in Proc. of IEEE Conference on Computer Vision and Pattern Recognition, 1991.
  6. G. Chiachia, A. X. Falcão, N. Pinto, A. Rocha, and D. Cox, “Learning person-specific representations from faces in the wild,” IEEE Trans. on Information Forensics and Security, vol. 9, no. 12, pp. 2089-2099, 2014.
  7. Z. Akhtar and G. Foresti, “Face spoof attack recognition using discriminative image patches,” Journal of Electr. and Comp. Engineering, 2016.
  8. I. Chingovska, A. Anjos, and S. Marcel, “On the effectiveness of Local Binary Patterns in face anti-spoofing,” in Proc. of International Conference of Biometrics Special Interest Group (BIOSIG), 2012.
  9. Y. Bengio, “Deep learning of representations: looking forward,” Statistical Language and Speech Processing, vol. 7978, pp. 1-37, 2013.
  10. A. Canziani, A. Paszke, E. Culurciello, “An analysis of deep neural network models for practical applications,” ArXiv preprint arXiv:1605.07678v4, 2017.
  11. Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner, “Gradient-based learning applied to document recognition,” Proc. of the IEEE, vol. 86, no. 11, pp. 2278-2324, 1998.
  12. L. Li, X. Feng, Z. Boulkenafet, Z. Xia, M. Li, and A. Hadid, “An original face anti-spoofing approach using partial convolutional neural network,” in Proc. of International Conference on Image Processing Theory Tools and Applications, pp. 1-6, 2016
  13. Y. Atoum, Y. Liu, A. Jourabloo, and X. Liu, “Face anti-spoofing using patch and depth-based CNNs,” in Proc. of International Joint Conference on Biometrics, 2017.
  14. Z. Xu, S. Li, and W. Deng, “Learning temporal features using LSTM-CNN architecture for face anti-spoofing,” in Proc. in Asian Conference on Pattern Recognition, pp. 141-145, 2015.
  15. J. Yang, Z. Lei, and S. Z. Li, “Learn convolutional neural network for face anti-spoofing,” CoRR, abs/1408.5601, 2014.
  16. Z. Zhang, J. Yan, S. Liu, Z. Lei, D. Yi, and S. Z. Li, “A face antispoofing database with diverse attacks,” in Proc. of Int. Conf. on Biometrics, 2012.
  17. G. Bradski, “The OpenCV library”, Dr. Dobb’s Journal of Software Tools, 2000.
  18. T. Mita, T. Kaneko, and O. Hori, “Joint Haar-like features for face detection”, in Proceedings of IEEE International Conference on Computer Vision, vol. 1, 2005.
  19. M. Mathias, R. Benenson, M. Pedersoli, and L. Van Gool, “Face detection without bells and whistles”, in Proceedings of European Conference on Computer Vision, pp. 720-735, 2014.
  20. S. Ma, L. Bai, “A face detection algorithm based on Adaboost and new Haar-Like feature”, in Proceedings of IEEE International Conference on Software Engineering and Service Science, 2016.
  21. K. Pearson, “On lines and planes of closest fit to systems of points in space”, Philosophical Magazine, vol. 2, no. 6, pp. 559-572, 1901.
  22. M. Dusenberry, “On eigenfaces: creating ghost-like images from a set of faces”, 2015. Available at: https://mikedusenberry.com/on-eigenfaces
  23. R. Fisher, “The use of multiple measurements in taxonomic problems”, Annals of Eugenics, vol. 7, no. 2, pp. 179–188, 1936.
  24. O. M. Parkhi, A. Vedaldi, and A. Zisserman, “Deep face recognition,” in Proc. of British Machine Vision Conference, 2015.
  25. M. Lewis and H. Ellis, “How we detect a face: a survey of psychological evidence”, International Journal of Imaging Systems and Technology - Facial Image Processing, Analysis, and Synthesis, vol. 13, no. 1, pp. 3-7, 2003.
  26. G. Purcell and A. Stewart, “The face-detection effect”, Bulletin of Psychonomic Society, vol. 24, pp. 118-120, 1986.
  27. J. Galbally, J. Fierrez, and J. O. Garcia, “Vulnerabilities in biometric systems: attacks and recent advances in liveness detection,” Database, vol. 1, no. 3, pp. 1-8, 2007.
  28. R. Tronci et al., “Fusion of multiple clues for photo-attack detection in face recognition systems,” in Proc. of International Joint Conference on Biometrics, pp. 1-6, 2011.
  29. J. Määttä, A. Hadid, and M. Pietikäinen, “Face spoofing detection from single images using micro-texture analysis,” in Proc. of International Joint Conference on Biometrics, pp. 1-7, 2011.
  30. T. F. Pereira et al., “LBP-TOP based countermeasure against facial spoofing attacks,” in Proc. of Asian Conference on Computer Vision - Workshop on Computer Vision With Local Binary Pattern Variants, 2012.
  31. S. Parveen et al., “Face liveness detection using Dynamic Local Ternary Pattern (DLTP),” Computers, vol. 5, no. 2, 2016.
  32. T. Ojala, M. Pietikäinen, and T. Mäenpää, “Multiresolution gray-scale and rotation invariant texture classification with local binary patterns,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 24, no. 7, pp. 971-987, 2002.
  33. G. Souza, D. Santos, R. Pires, A. Marana, J. Papa, “Efficient Transfer Learning for robust face spoofing detection”, in Proceedings of Iberoamerican Congress on Pattern Recognition, 2017.
  34. O. Lucena, A. Júnior, V. Moia, R. Souza, E. Valle, R. Lotufo, “Transfer Learning using Convolutional Neural Networks for face anti-spoofing”, in Proceedings of International Conference Image Analysis and Recognition, 2018.
  35. D. Menotti, G. Chiachia, A. Pinto, W. Schwartz, H. Pedrini, A. Falcão, “Deep representations for iris, face, and fingerprint spoofing detection”, IEEE Transactions on Information Forensics and Security, vol. 10, no. 4, 2015.
  36. S. Hochreiter, J. Schmidhuber, “Long-short term memory”, Neural Computation, vol. 9, no. 8, pp. 1735-1780, 1997.
  37. A. Krizhevsky, I. Sutskever, and G. E. Hinton, “ImageNet classification with deep convolutional neural networks,” in Proc. of Neural Information Processing Systems, pp. 1-9, 2012.
  38. J. Ba, G. Hinton, V. Mnih, J. Leibo, C. Ionescu, “Using fast weights to attend to the recent past”, ArXiv:1610.06258, 2016.
  39. D. Santos, G. Souza, A. Marana, “A 2D Deep Boltzmann Machine for robust and fast vehicle classification”, Proceedings of Conference on Graphics, Patterns and Images (SIBGRAPI), pp. 155-162, 2017.
  40. X. Tan, Y. Li, J. Liu, and L. Jiang, “Face liveness detection from a single image with sparse low rank bilinear discriminative model,” in Proc. of European Conference on Computer Vision, 2010, pp. 504-517.
  41. Y. Jia, E. Shelhamer, J. Donahue, S. Karayev, J. Long, R. Girshick, S. Guadarrama, and T. Darrell, “Caffe: convolutional architecture for fast feature embedding,” ArXiv preprint arXiv:1408.5093v1, 2014.
  42. Z. Boulkenafet, J. Komulainen, and A. Hadid, “Face antispoofing based on color texture analysis,” in Proc. of International Conference on Image Processing, pp. 2636-2640, 2015.
  43. K. Zhang, Z. Zhang, Z. Li, and Y. Qiao, “Joint face detection and alignment using multitask cascaded convolutional networks”, IEEE Signal Processing Letters, vol. 23, no. 10, pp. 1499-1503, 2016.