Cross-Domain Deep Face Matching for Real Banking Security Systems


Ensuring the security of transactions is currently one of the major challenges facing banking systems. The use of the face for biometric user authentication is being adopted worldwide due to its convenience and acceptability, and because almost all computers and mobile devices now have built-in cameras. This authentication approach is attracting large investments from banking and financial institutions, especially in cross-domain scenarios, in which facial images from ID documents are compared with digital self-portraits (selfies) taken with the cameras of mobile devices, for the automated opening of new checking accounts or the authorization of financial transactions. In this work, besides collecting a large cross-domain face database, with 27,002 real facial images of selfies and ID documents (13,501 subjects) captured from the systems of the largest public Brazilian bank, we propose a novel approach for such cross-domain face matching based on deep features extracted by two well-referenced Convolutional Neural Networks (CNN). Results obtained on the collected dataset, which we called FaceBank, with accuracy rates higher than 93%, demonstrate the robustness of the proposed approach to the cross-domain problem (comparing faces in IDs and selfies) and the feasibility of its application in real banking security systems.

I Introduction

In the last decades, Biometrics has emerged as a robust solution for automated people recognition. Among the main biometric traits, the face is one of the most convenient, since its capture does not require much user collaboration and cameras are present almost everywhere, including in mobile devices [1, 2, 3]. Currently, state-of-the-art methods for face recognition and authentication are based on Convolutional Neural Networks (CNN) [4, 5, 6], deep neural networks inspired by the workings of the human brain, which have achieved high accuracy in many complex tasks involving images. CNNs have been applied in different face recognition and authentication systems, including commercial ones.

According to [7], financial institutions must have effective and reliable methods to authenticate their customers. An effective authentication system should protect customers’ data, prevent money laundering and terrorist financing, reduce fraud, inhibit identity theft and promote the legal enforceability of the agreements on electronic transactions [8]. Performing financial transactions with unauthorized or improperly identified people in a banking environment can result in huge financial losses, damage to the reputation of the company, and breach of bank secrecy.

In this context, banks are investing in robust methods for face authentication [3] in order to improve the user experience of their systems, especially in mobile banking, as well as to prevent fraud. A current trend in the financial industry is the use of facial images from different sources (the cross-domain problem), usually photographs of ID documents and digital self-portraits (selfies), for user authentication in order to allow automated opening of new checking accounts, authorization of financial transactions and registration of mobile devices [9].

In this work, we collected a novel and large cross-domain face database, which we called FaceBank, composed of 27,002 real facial images of selfies and ID documents (13,501 subjects) captured from the systems of the largest public Brazilian bank; to the best of our knowledge, it is the largest dataset of this kind. We also propose a novel approach for this cross-domain face matching problem based on two well-referenced CNNs, VGG-Face [5] and OpenFace [6]: we extract deep and robust features from the facial images and train classifiers to identify genuine and imposter cross-domain matchings, a highly complex problem given the significant differences found in facial images captured from different sources. Besides working with deep, high-level features, we also apply normalization techniques to the images themselves and to their feature vectors to further address the cross-domain issues. Results show that the proposed architecture achieved accuracy rates higher than 93% with low processing times, making it suitable for use in real banking security systems. Although VGG-Face was slightly more accurate than OpenFace for feature extraction, the latter is more efficient and therefore more appropriate for mobile banking.

This paper is organized as follows: Section II presents an overview on banking systems and identity fraud; Section III briefly describes some studies on cross-domain face matching; Section IV presents the FaceBank dataset; Section V describes the proposed approach for deep cross-domain face matching; Sections VI and VII present the experiments, results, discussions and conclusions of our work.

II Banking Security Systems and Identity Fraud

An increase in the occurrences of identity fraud has been observed around the world in the last years. According to the Centre for Counter Fraud Studies (University of Portsmouth), identity fraud has grown steadily over the past 10 years and the estimated damages, only in the United Kingdom, reached about 5.4 billion pounds per year on average during this period [10].

The 2017 Identity Fraud report [11] shows that despite the efforts of the global community, in 2016 about 15.4 million consumers were victims of identity theft or fraud. In 2017, criminals successfully targeted 2 million more victims and stole about 16 billion dollars. It is possible to notice an epidemic increase in the number of fraud attempts, with systematic and specialized methods developed by attackers in many cases.

A recent report [12] from leading market analysts suggested that the rapid digitization of consumers’ lives and enterprises’ records will increase the cost of data breaches to 2.1 trillion dollars globally by 2019, almost four times the previously estimated cost of breaches for 2015. This report also highlighted the increasing professionalism of cybercrime, with the emergence of cybercrime products over the past years.

The cost of electronic fraud to Brazilian banks, for instance, keeps increasing and nears 624 million dollars per year [13]. Given the emergence of mobile banking in Brazil, whose traded volumes increased fourfold in the last three years alone, some banks are already using fingerprint authentication in order to achieve a better level of security, as in other countries [13]. The customer, if previously registered, can carry out financial transactions on mobile devices or even on ATMs by presenting their registered fingers to the sensor. However, not all mobile phones used in poor countries have fingerprint sensors, although almost all of them have digital cameras. In the case of ATMs, people need to touch a (not always clean) surface, which often leaves customers unsatisfied with the identification system. Besides, ID documents usually do not carry their owner's fingerprints, making it unfeasible, for instance, to open checking accounts through mobile devices by means of fingerprint matching.

For all these reasons, and considering the convenience of using facial features for people authentication, banking institutions, following their digital strategies, are increasingly implementing tools that allow automated opening of checking accounts and authorization of transactions and devices, entirely online through smartphones, by means of face authentication. People interested in such a service do not need to go to a physical branch to present the required documentation. Instead, by using their mobile phones, they can take photographs of their ID documents, containing their facial images, and a digital self-portrait (selfie), proving the possession of the document by its legal owner [2]. The matching of the faces in the photographs taken can occur directly on the device as well as on the bank server. Fig. 1 illustrates the process of matching faces from a selfie and an ID document. Cases like this show that the face is a growing trend as a biometric trait for people authentication in banking environments [2].

Fig. 1: Illustration of the matching process of faces from an ID document and two selfies (from different people). After detecting, cropping and normalizing the face in the document, it is matched with the face in the selfie in order to authenticate the person, validating the document. It is possible to observe different visual aspects in the two kinds of images (ID and selfies) taken with the same mobile phone.

From November 2016 to February 2018, the largest public Brazilian bank received about 1.5 million requests for opening checking accounts through smartphones, all of them manually inspected. Of this total, 22% were rejected by the human experts. Besides requests presenting different faces in the ID document and selfie (indicating fraud), most of the requests presented low quality facial images due to issues with illumination, facial occlusion, expression changes, low resolution, or even scratched documents. Since the matching is still performed by humans and given the high number of requests being received, this process is expensive for the bank, slow and also subject to failures. An automated method could, at least, automatically discard some of the requests, saving time and resources for the financial institution. By the end of 2018, the expectation is to reach a total of 3.35 million checking accounts opened through mobile devices at this bank.

III Cross-Domain Face Matching

As previously stated, the face is one of the most convenient biometric traits, since its capture can be performed at a distance, in a non-intrusive and even non-cooperative way [1, 2]. Besides, cameras are nowadays found almost everywhere, including in mobile devices, the main technological basis for the present and future of banking transactions [7]. When the work involves comparing cross-domain images (e.g., matching faces obtained from ID documents and selfies or faces extracted from surveillance videos), the matching process, which is already a complex problem in Machine Learning, becomes substantially more challenging. In cross-domain conditions, the performance of classification algorithms usually drops due to the different visual aspects of the images from different sources, such as different kinds of blur, illumination changes, noise or even changes in facial expression.

Folego et al. [9] explored approaches for cross-domain face authentication, comparing selfies to ID photographs based on features extracted by the VGG-Face [5] deep neural network. They approach the problem with proper image photometric adjustments and data normalization techniques, together with deep learning architectures, to extract the most prominent and robust features from the original images, reducing the effects of domain differences. However, their dataset was composed of relatively few images (dozens of individuals) and not obtained from a real banking scenario.

In order to deal with typical face cross-domain issues such as illumination, alignment, noise, or even facial expression changes, the method proposed by Ho and Gopalan [14] works by deriving a latent subspace for the original faces, characterizing their multifactor variations. Images were synthesized in order to produce different illumination and other 2D perturbations, forming tensors to represent the faces. Results indicated that the method is effective on constrained and unconstrained datasets.

To the best of our knowledge, no evaluation regarding face authentication on large cross-domain datasets with real images was reported in the literature, especially for banking scenarios, the main target of our work.

IV FaceBank Dataset

The amount of data for training is an important issue when dealing with Machine Learning algorithms, especially with Deep Learning approaches. Given the high capacity of the deep neural networks due to their large numbers of free parameters to be tuned, the quality of their predictions improves with experience [15]. Face recognition and authentication systems built by large private corporations present, in general, top accuracies, since they are trained on huge private datasets, containing millions of facial images, usually obtained from social media, far more than the number of images in the datasets usually available for research.

Regarding cross-domain face authentication (ID document photo and selfie, for instance), there is no large dataset available. Usually, given the difficulty to collect data, researchers evaluate their new methods with images from few individuals. Besides, no dataset with real images from banking systems was used in past evaluations, and this is an essential issue to be considered when evaluating techniques and deep neural networks in order to obtain reliable results.

Based on these considerations, we obtained authorization from the largest public Brazilian bank to collect a large dataset, which we called FaceBank, from its databases of facial images (selfies and scanned ID documents), in order to conduct this work. Initially, about 150,000 images in RGB color space were collected, comprising selfies from profiles of individuals in the bank’s internal social network and ID documents from the same individuals. However, we detected that many of these images presented no faces (especially the profile images) or faces with low resolution (in the case of the IDs). To eliminate such bad images and avoid a decrease in the performance of our model, and given the processing and time restrictions we had to observe, we applied a fast technique based on the efficient face detection algorithm of Viola and Jones [16] to detect which images presented real faces with, at least, regular resolution. Whenever one image of a user (selfie or ID) was discarded, we discarded the other image of that user as well.

After this process, we obtained 27,002 facial images from 13,501 subjects (two images per subject, i.e., selfie and ID document). In order to crop the faces in these remaining images more precisely, a more robust algorithm based on HOG (Histogram of Oriented Gradients) [17] was applied to detect the faces again, as well as some landmark points (such as eye and mouth coordinates). Fig. 2 shows examples of images that compose the collected FaceBank dataset. Even visually, it is possible to note the huge differences between the facial images of the same person from different domains, i.e., selfie and ID document, and also the regular quality of the resulting images, typical of real banking scenarios. All this demonstrates the high complexity of the cross-domain face authentication problem in banking security systems.

Fig. 2: Examples of real facial images of selfies and ID documents of the FaceBank dataset. The dataset contains a total of 27,002 images (13,501 individuals).

V Proposed Approach

In this work, besides collecting the large FaceBank dataset with real banking face images of selfies and ID documents, we also propose a robust approach for cross-domain face matching based on two well-referenced Convolutional Neural Networks (CNN), VGG-Face [5] and OpenFace [6], to extract deep and robust features from the faces, with a good level of invariance to the domain differences. We also applied normalization techniques to the facial images and to their feature vectors to further attenuate such issues and improve the model performance.

After normalizing the selfies and IDs, extracting and normalizing their deep feature vectors using the VGG-Face [5] or OpenFace [6] CNN models, we trained and assessed four classifiers (Linear Support Vector Machine - Linear SVM [18], Power Mean SVM - PmSVM [19], Random Forest - RF [20], and RF with Ensemble Vote Classifier - Voting RF [21]) in order to verify which one performed better in the task of classifying a pair of face images (ID and selfie) as genuine or imposter and compare their results in the banking context.

In summary, given a test pair of selfie and ID document images, a sequence of steps including face normalization, deep features extraction, feature vectors normalization, as well as classification, are performed in order to verify whether such pair of facial images were captured from the same person (genuine pair) or from different individuals (imposter pair). Fig. 3 shows these steps (proposed architecture), which are described in subsections V-A to V-E.

Fig. 3: Overview of the proposed architecture for cross-domain face matching: facial images are taken with the mobile phone and normalized; their features vectors are extracted using deep neural networks (VGG-Face or OpenFace); they are also normalized and subtracted from each other; and then the final difference feature vector is classified as a genuine access (same person in both images) or imposter request (distinct people in the images) by a classifier.

V-A Face Detection, Cropping and Alignment

In order to carry out face detection and cropping, as said, in this work we used a robust and efficient algorithm available in Dlib library [22], which is based on HOG (Histogram of Oriented Gradients) [17] features. This algorithm returns the coordinates of the rectangle that contains the detected face in the input image, as well as the coordinates of the left and right eyes. With this information it is possible to align, crop and resize all the face images from the dataset before starting to compare them.

As in [9], in the face cropping step, we included the ears, chin, and hair in the Region of Interest (ROI) by expanding the initial rectangle returned by the Dlib algorithm by 22%. This expanded ROI tends to improve the face matching results.

Face alignment is performed by rotating the face until the coordinates of both eyes are in line with the $x$-axis. Finally, the cropped and aligned face images are resized by using bilinear interpolation. For the VGG-Face feature extraction based approach, the face image size must be $224 \times 224$ pixels, whilst for the OpenFace feature extraction based approach the face image size must be $96 \times 96$ pixels.
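The geometry of the cropping and alignment steps can be sketched as follows. This is only an illustrative sketch, not the implementation used in this work: the function names are hypothetical, the face rectangle and eye coordinates are assumed to come from an external detector such as Dlib, and the 22% expansion is assumed to be applied to both width and height, split evenly between opposite sides.

```python
import math

def eye_alignment_angle(left_eye, right_eye):
    """Angle (degrees) between the line joining the eyes and the x-axis.

    Rotating the image by the negative of this angle about the eye
    midpoint places both eyes on a horizontal line.
    """
    dx = right_eye[0] - left_eye[0]
    dy = right_eye[1] - left_eye[1]
    return math.degrees(math.atan2(dy, dx))

def expand_roi(left, top, right, bottom, img_w, img_h, factor=0.22):
    """Grow a face rectangle by `factor` of its width and height.

    Assumption: the growth is split evenly between opposite sides and
    the result is clipped to the image bounds.
    """
    dx = factor * (right - left) / 2.0
    dy = factor * (bottom - top) / 2.0
    return (
        max(0, round(left - dx)),
        max(0, round(top - dy)),
        min(img_w, round(right + dx)),
        min(img_h, round(bottom + dy)),
    )
```

For a 100 x 100 detection in a large image, the expanded ROI measures 122 x 122 pixels, which would then be rotated and resized to the input size expected by the chosen CNN.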

V-B Face Normalization

A typical problem of comparing photos of documents with selfies is the large difference in lighting due to the change in the application domain. Other issues, such as facial pose and expression changes, as well as the different resolutions of the images, are also problematic. Aiming to mitigate some of these problems, especially illumination differences, and before extracting the features of the faces in the images, we applied the Automatic Color Equalization (ACE) [23] to normalize the cropped facial images.

The ACE technique is based on a computational model of the human visual system that performs a photometric transformation on the images in order to equalize simultaneously global and local effects of illumination [23]. It obtains good contrast enhancements even when the quality of the images is poor. We use this technique as an effort to approximate the two kinds of images under analysis.

V-C Data Augmentation

Obtaining a huge amount of real training samples is expensive and not always possible. Aiming to add more training data to the 27,002 images of FaceBank, we generated new ones by applying some transformations to them. This data augmentation strategy is a common practice when working with CNNs and other classifiers [24].

According to Nielsen [25], despite the fact that artificial images do not substitute the potential of real samples, it is conceivable that adding to the training data transformed images based on the original ones might help the deep neural networks learn more about the patterns being addressed. By making small modifications to the original images, it is possible to expand the training database substantially. Common augmentation methods include noise addition, image equalization, random crop, scale change, jitter, brightness and contrast modifications. In this work, three of these methods were applied.

Initially, we enlarged the FaceBank dataset by adding white Gaussian noise to the original facial images in the regions near the eyes. As a generic approach, we used a sampling mechanism that added uncorrelated Gaussian noise $\epsilon$ to the visual input $x$. If $i$ indexes the raw pixels, a new sample $\tilde{x}$ is given by: $\tilde{x}_i = x_i + \epsilon_i$, with $\epsilon_i \sim \mathcal{N}(0, \sigma^2)$.
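A minimal sketch of this noise-based augmentation is given below. The function name, the noise standard deviation, the optional `mask` (standing in for the region near the eyes) and the clipping to the 8-bit range are all assumptions made for illustration; the text does not specify them.

```python
import random

def add_gaussian_noise(pixels, sigma=10.0, mask=None, seed=None):
    """Add independent zero-mean Gaussian noise to each selected pixel.

    pixels: flat list of intensities in [0, 255].
    mask:   optional list of booleans; noise is added only where True
            (hypothetical stand-in for the eye region).
    """
    rng = random.Random(seed)
    noisy = []
    for i, p in enumerate(pixels):
        if mask is None or mask[i]:
            p = p + rng.gauss(0.0, sigma)
        # Clip so the result remains a valid 8-bit intensity.
        noisy.append(min(255.0, max(0.0, float(p))))
    return noisy
```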

The second transformation, applied to the original and noisy images, was to randomly increase or decrease the brightness of a given training image, so that the model would learn not to rely on brightness information. As in [26], new images were generated with different brightness by first converting the images to the HSV (Hue, Saturation, and Value) color space and scaling the V channel up or down (converting the image back to the RGB color space after that).
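The brightness transformation can be illustrated with the standard-library `colorsys` module. This is a sketch under assumptions: the function name and the scaling factor are hypothetical, pixels are taken as 8-bit RGB tuples, and the scaled V channel is clipped to its valid range.

```python
import colorsys

def scale_brightness(rgb_pixels, factor):
    """Scale the V channel of each RGB pixel in HSV space by `factor`."""
    out = []
    for r, g, b in rgb_pixels:
        # colorsys works with channel values in [0, 1].
        h, s, v = colorsys.rgb_to_hsv(r / 255.0, g / 255.0, b / 255.0)
        v = min(1.0, v * factor)  # clip so V stays valid
        r2, g2, b2 = colorsys.hsv_to_rgb(h, s, v)
        out.append((round(r2 * 255), round(g2 * 255), round(b2 * 255)))
    return out
```

A factor below 1 darkens the image and a factor above 1 brightens it; hue and saturation are left untouched.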

Finally, in order to further augment the database obtained after the two previous transformations, we applied the Contrast Limited Adaptive Histogram Equalization (CLAHE) [27] technique to the training images, which divides the input image into small blocks, applies a conventional histogram equalization in each block, and then checks if any histogram bin is above the contrast limit. At the end of the whole data augmentation process, we obtained 216,016 images ($27{,}002 \times 2^3$, since we applied 3 transformations, each doubling the size of the dataset), comprising selfies and ID documents. This set of images was called Augmented FaceBank.

V-D Feature Extraction, Normalization and Difference

In order to extract robust features from the facial images given their different domains, we used the well-referenced Convolutional Neural Network (CNN) called VGG-Face [5], originally trained using a dataset with more than 2.6 million (same domain) facial images of 2,622 different people, which achieved state-of-the-art results in face recognition. By using the trained VGG-Face model, a very deep CNN containing 16 layers, we avoided issues such as overfitting to our dataset, despite its size, and obtained good generalization power due to the high capacity of the network (huge number of parameters) and its large original training set.

We used the trained model of VGG-Face for Transfer Learning, i.e., we passed our facial images (from the Augmented FaceBank dataset) through the network and extracted their feature vectors based on the output of the layer “fc6” of the network (the third layer from top to bottom). Despite the fact that other studies usually extract features from the layer “fc7” (on top of “fc6”) from VGG-Face trained model when performing such a task, we explored the layer “fc6”, a fully connected layer with 4,096 neurons, since in [9] it allowed obtaining the best results for cross-domain face matching.

In order to compare the results with the performance of a different deep neural network, in this work we also assessed another well-referenced CNN: OpenFace [6]. OpenFace is an open source model, also implemented and trained on large datasets of facial images from the literature. Besides being usable in commercial applications due to its open license, another interesting aspect of OpenFace is that it maps each face into a Euclidean space (onto a hypersphere within it) by a 128-dimensional feature vector, the output of its top layer. Its training algorithm is mainly based on the Triplet Learning [28] approach, in which the network is trained on genuine (same person) and imposter (different people) pairs of faces and tries to ensure that the faces of genuine pairs are closer in such Euclidean space than faces from different people, given a tolerance margin, following Eq. 1:


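$\| f(x_i^a) - f(x_i^p) \|_2^2 + \alpha < \| f(x_i^a) - f(x_i^n) \|_2^2, \quad \forall \, (x_i^a, x_i^p, x_i^n) \in \mathcal{T}$ (1)

where $f(x_i^a)$ and $f(x_i^p)$ indicate feature vectors of faces (selfie and ID) from the same person, $f(x_i^n)$ the feature vector from the face of another person, $\alpha$ is the tolerance margin (usually set to $0.2$), and $\mathcal{T}$ the training set for the neural network. The similarity degree of two faces is measured based on the Euclidean distance between their feature vectors.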
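The triplet constraint described above is commonly turned into a hinge-style loss that is zero whenever the constraint holds. The sketch below shows that standard formulation on plain Python lists; the function names are hypothetical and the default margin of 0.2 is an assumption taken from common practice, not a value confirmed by this text.

```python
def squared_l2(u, v):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))

def triplet_loss(anchor, positive, negative, margin=0.2):
    """Hinge form of the triplet constraint.

    Returns 0 when the genuine pair (anchor, positive) is closer than
    the imposter pair (anchor, negative) by at least `margin`, and a
    positive penalty otherwise.
    """
    violation = squared_l2(anchor, positive) - squared_l2(anchor, negative) + margin
    return max(0.0, violation)
```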

The VGG-Face model has more parameters (higher capacity for feature learning) and allows extracting larger feature vectors. However, OpenFace (its default model), besides having a slightly different training algorithm and an open license, is also interesting for our problem due to the strong results reported in other applications [6] and its efficiency, which makes it especially suitable for mobile banking.

Given the cross-domain problem, feature vectors extracted from images of different domains might have values with significantly different magnitudes. To mitigate this problem when comparing such vectors, we applied $L_2$ normalization to them. The $L_2$-norm of a feature vector $v$ is given by:

$\| v \|_2 = \sqrt{\sum_{i=1}^{n} v_i^2}$ (2)

where $n = 4{,}096$ for VGG-Face and $n = 128$ for OpenFace.

The $L_2$-normalized version of each feature vector is given by:

$\hat{v} = \frac{v}{\| v \|_2}$ (3)

After normalizing the feature vectors of the faces from the Augmented FaceBank, for each pair of selfie and ID, we also combined their normalized vectors, $\hat{v}_s$ and $\hat{v}_d$, respectively, into a final feature vector in order to emphasize their different properties and train the classifiers (to identify genuine and impostor face matchings), by using the absolute value of the subtraction, $|\hat{v}_s - \hat{v}_d|$, which proved to be the best choice for our problem.
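The normalization and combination of a selfie/ID pair can be sketched as follows, assuming Euclidean-norm normalization and an element-wise absolute difference. The function names are hypothetical, and the handling of an all-zero vector is an added safeguard rather than something described in the text.

```python
import math

def l2_normalize(v):
    """Scale a vector to unit Euclidean length."""
    norm = math.sqrt(sum(x * x for x in v))
    if norm == 0.0:
        return list(v)  # safeguard: leave an all-zero vector unchanged
    return [x / norm for x in v]

def difference_vector(selfie_features, id_features):
    """Element-wise absolute difference of the two normalized vectors."""
    s = l2_normalize(selfie_features)
    d = l2_normalize(id_features)
    return [abs(a - b) for a, b in zip(s, d)]
```

Because both inputs are normalized first, two vectors that differ only in magnitude, such as `[3, 4]` and `[6, 8]`, yield an all-zero difference vector.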

Pair Generation

As said, to train the classifiers, we extracted the deep features of each pair of faces (selfie and ID) from the Augmented FaceBank using one of the CNNs evaluated. Then, the extracted features were normalized and stored into feature vectors. In order to verify if a pair of selfie and ID images is from the same person (genuine matching), or from distinct persons (imposter matching), the ID feature vector is subtracted from the selfie feature vector and the element-wise absolute value of this difference is finally presented to the classifier. For the pair generation task, we performed a random split of the individuals of the dataset into two disjoint sets: training and test. The training set contained 80% of the individuals of the Augmented FaceBank dataset, while the test set had 20%. Then, in each set, we generated random pairs of two images (selfie and ID) for representing genuine matchings and imposter matchings.
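One possible way to implement the subject-disjoint split and the pair generation is sketched below. The function names, the fixed seeds and the choice of exactly one random imposter per subject are assumptions for illustration; the text only states that the split is random, subject-disjoint and 80%/20%, and that genuine and imposter pairs are generated within each set.

```python
import random

def split_subjects(subject_ids, train_fraction=0.8, seed=0):
    """Randomly split subjects into disjoint training and test lists."""
    ids = list(subject_ids)
    random.Random(seed).shuffle(ids)
    cut = int(len(ids) * train_fraction)
    return ids[:cut], ids[cut:]

def make_pairs(subject_ids, seed=0):
    """Build (selfie_subject, id_subject, label) tuples.

    For every subject, emit one genuine pair (label 1) and one imposter
    pair (label 0) whose ID belongs to a randomly chosen other subject.
    """
    ids = list(subject_ids)
    if len(ids) < 2:
        raise ValueError("need at least two subjects to form imposter pairs")
    rng = random.Random(seed)
    pairs = []
    for sid in ids:
        pairs.append((sid, sid, 1))
        other = rng.choice(ids)
        while other == sid:
            other = rng.choice(ids)
        pairs.append((sid, other, 0))
    return pairs
```

Generating pairs separately inside each split keeps every subject, and therefore every face, exclusively in either the training or the test set.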

V-E Classification

Given the genuine and imposter pairs of selfies and IDs and their difference vectors, we assessed different and well-referenced classifiers to determine the best option for our banking cross-domain face matching problem. We selected four effective and also efficient classifiers from the literature in order to evaluate their performances in our cross-domain problem: Linear Support Vector Machine (Linear SVM) [18]; Power Mean SVM (PmSVM) [19]; Random Forest (RF) [20]; and RF with Ensemble Vote Classifier (Voting RF) [21]. In our case, the Voting RF combines the decisions of 5 RFs. For the robustness and efficiency of the code and the reproducibility of the experiments, we used the implementations of such methods available in the well-referenced scikit-learn library [29].
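The way a voting ensemble combines the decisions of several base classifiers can be illustrated with a simple majority vote. This sketch assumes hard (label-based) voting, which the text does not confirm, and it operates on already-computed predictions rather than on trained Random Forests; the function name is hypothetical.

```python
from collections import Counter

def majority_vote(predictions_per_model):
    """Combine per-model label lists into one label per sample.

    predictions_per_model: list with one list of predicted labels per
    base classifier, all of the same length.
    """
    n_samples = len(predictions_per_model[0])
    voted = []
    for i in range(n_samples):
        votes = Counter(preds[i] for preds in predictions_per_model)
        voted.append(votes.most_common(1)[0][0])
    return voted
```

With an odd number of voters, such as the 5 RFs mentioned above, a binary genuine/imposter decision can never end in a tie.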

As in [9], we also decided to evaluate the Linear SVM and PmSVM given their good performances in many tasks and their reported efficiency. Although Linear SVM presents inferior results to SVMs with other kernels in many tasks, it is fast, being more appropriate for environments with hardware restrictions, such as mobile devices. Regarding PmSVM, compared with state-of-the-art methods for large-scale image classification, it has achieved the highest learning speed and highest accuracy in many cases [30]. The RF-based classifiers are also robust and very efficient since they are based on decision trees.

In order to measure the accuracy of the selected classifiers, we used the global accuracy metric since some of them only had as output the class of each test sample, following Eq. 4:


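$Acc = \frac{1}{N} \sum_{i=1}^{N} \mathbb{1}(\hat{y}_i = y_i)$ (4)

where $\hat{y}_i$ is the predicted label for the $i$-th test sample, $y_i$ is the real label of such sample, $N$ is the number of test samples, and $\mathbb{1}(\cdot)$ equals 1 when its argument is true and 0 otherwise.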
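The global accuracy metric amounts to the fraction of test samples whose predicted label matches the real one. A minimal sketch, with a hypothetical function name:

```python
def global_accuracy(y_true, y_pred):
    """Fraction of samples whose predicted label equals the real label."""
    if len(y_true) != len(y_pred) or len(y_true) == 0:
        raise ValueError("label lists must be non-empty and of equal length")
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)
```

For example, three correct predictions out of four test pairs give an accuracy of 0.75, i.e., 75%.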

VI Experiments, Results and Discussion

In order to assess the proposed approach and the performance of the assessed classifiers and to analyze the feasibility of its application in real banking security systems, especially for mobile devices, we considered one imposter pair for each genuine pair in the training and test stages of the classification, for a balanced training.

We evaluated the performance of the proposed architecture several times, varying the number of subjects and the total number of difference feature vectors being considered, in order to verify in more detail its robustness regarding the amount of data for training and test. For all classifiers, we used the default hyper-parameters defined in the scikit-learn library [29]. For the PmSVM, the default value of the regularization parameter was set to 0.01. Tab. I shows the results obtained for all the classifiers given the features extracted by VGG-Face. In the first test, for example, we considered only 10,000 subjects from the Augmented FaceBank dataset (20,000 pairs of faces: 10,000 genuine and 10,000 imposter). We set, as said, 80% of the subjects (and their respective difference feature vectors) for training and 20% for test.

Linear SVM   PmSVM   RF      Voting RF
91.65        89.57   89.65   93.45
92.43        89.87   89.13   93.28
92.69        88.09   89.26   93.51
92.75        89.87   89.77   92.67
92.81        90.91   88.95   92.82
TABLE I: Accuracy results (%), given the features extracted by VGG-Face, on the Augmented FaceBank dataset, considering one imposter pair for each genuine pair of faces (ID and selfie) and varying the number of subjects and difference feature vectors under analysis. The best result for each classifier is highlighted.

As one can observe, the Voting RF obtained the best overall performance and its best accuracy result occurred when we considered 50,000 subjects. When working with 108,008 individuals, this classifier presented only a slight decrease in performance compared with the previous tests, still presenting better results than all other classifiers and demonstrating its robustness. Regarding processing time, voting RF is also very efficient by working with decision trees. It spent, on average, only 30 milliseconds for classification of each test sample.

As shown, the performance of the Linear SVM and PmSVM generally increased with the sizes of the training and test sets, also demonstrating their robustness to large datasets (often found in real scenarios), although the Linear SVM was slower (it spent about 45 milliseconds per test sample). The RF classifier presented its best performance with 80,000 subjects.

The results obtained by the classifiers given the feature vectors extracted by OpenFace are shown in Tab. II. It is important to note that the results obtained were very close to those of VGG-Face. Besides, OpenFace works by default with smaller images ($96 \times 96$ pixels), saving computational resources in the forward pass of the images through the network for feature extraction, and generates a much more compact representation for the faces (a 128-dimensional feature vector for each face, against a 4,096-dimensional vector for VGG-Face), allowing classifiers to be faster and more suitable for mobile applications. The forward pass of each facial image in VGG-Face took about 2.89 seconds, while in OpenFace it took only 0.14 seconds per image.

Linear SVM   PmSVM   RF      Voting RF
89.17        85.45   88.72   91.50
89.91        85.67   89.48   90.65
89.82        86.81   89.17   90.86
89.88        86.89   89.52   90.70
89.89        86.73   89.39   90.61
TABLE II: Accuracy results (%), given the features extracted by OpenFace, on the Augmented FaceBank dataset, considering one imposter pair for each genuine pair of faces (ID and selfie) and varying the number of subjects and difference feature vectors under analysis. The best result for each classifier is highlighted.

The best result regarding all experiments, of accuracy, was obtained by the VGG-Face neural network with the Voting RF classifier when working with subjects. Voting RF obtained the best results in all experiments with both CNNs, which makes it well suited to the cross-domain face matching problem, also given the efficiency it inherits from decision trees.
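This section does not restate how the Voting RF combines its members, so the following is only a generic sketch of hard majority voting over the genuine/impostor decisions of several already trained classifiers (for example, several random forests). The function name and the tie-breaking rule are assumptions made for illustration, not the authors' configuration.

```python
from collections import Counter


def majority_vote(predictions, tie_label=0):
    """Combine the labels given by several classifiers to one sample.

    predictions: list of labels (1 = genuine, 0 = impostor), one per classifier.
    tie_label:   label returned on a tie (assumption: fall back to impostor).
    """
    counts = Counter(predictions).most_common()
    if len(counts) > 1 and counts[0][1] == counts[1][1]:
        return tie_label
    return counts[0][0]


# Votes of three hypothetical forests for four test pairs.
votes_per_sample = [[1, 1, 0], [0, 0, 1], [1, 1, 1], [0, 1, 0]]
decisions = [majority_vote(v) for v in votes_per_sample]
print(decisions)
```

Falling back to the impostor label on a tie is a conservative choice for a security setting: an undecided ensemble never accepts a pair on its own.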

To better visualize the performance of the CNNs with such a powerful classifier, Fig. 4 shows the accuracies obtained by the Voting RF as the size of the dataset varies. As can be seen in Fig. 4, the performance of the Voting RF classifier tends to decrease, as expected, when more subjects are considered. However, this deterioration in accuracy is not pronounced for either CNN.

Fig. 4: Comparison of the performance (global accuracy) of the Voting RF classifier given the feature vectors (their difference versions) extracted by each CNN: VGG-Face and OpenFace.

VII Conclusion

Banking identity fraud is becoming increasingly common worldwide, causing huge financial losses to banks and the financial system and leading them to invest massively in higher-level security systems, mainly based on biometric recognition. Among the main biometric traits, face is one of the most important due to its convenience and the availability of digital cameras almost everywhere, including in mobile phones. Besides, a current trend is to open new checking accounts through mobile devices in an automated way, by matching facial images from selfies and photographs of ID documents. This cross-domain problem is a highly complex task, especially due to the differences between the two kinds of images.

In this work, we collected a large dataset, which we called FaceBank, with 27,002 real images of selfies and ID documents (13,501 subjects) from the databases of the largest public Brazilian bank. We also proposed a robust approach for cross-domain face matching, comparing selfies and IDs, based on two well-referenced CNNs, VGG-Face and OpenFace, which obtained accuracy rates higher than 93% even in such a difficult task. To the best of our knowledge, FaceBank is the largest cross-domain face dataset collected with real banking images, and this is the first large-scale work on this kind of dataset. We plan to make FaceBank available for future research, subject to the agreement of the bank that provided the images to us.

The use of deep face features extracted by well-referenced CNNs (VGG-Face and OpenFace), together with proper image processing techniques, feature vector normalization and robust classifiers, especially the Voting RF, significantly attenuates the effects of domain differences, allowing good results even when working with a large number of facial images. Based on the accuracy obtained (higher than 93%) and on its efficiency, it is possible to conclude that the proposed architecture for cross-domain deep face matching is feasible for real banking applications, including mobile ones. The proposed approach can also be applied, for instance, to help human experts in extremely critical environments by rejecting the matches with very low scores. All this will save crucial time and resources for financial institutions.
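The pre-filtering idea mentioned above (automatically rejecting matches with very low scores so that human experts only see the remaining cases) could look like the minimal sketch below. The score range, the threshold value and the function name are hypothetical; in practice the threshold would have to be tuned on validation data.

```python
def prefilter(score, reject_below=0.2):
    """Route a match score in [0, 1] before human inspection.

    reject_below is a hypothetical threshold, not a value from the paper.
    """
    if not 0.0 <= score <= 1.0:
        raise ValueError("score must be in [0, 1]")
    if score < reject_below:
        return "reject"        # very low score: rejected automatically
    return "human_review"      # everything else goes to the expert


scores = [0.05, 0.55, 0.97]
routes = [prefilter(s) for s in scores]
print(routes)
```

Only the pairs routed to `human_review` would consume expert time, which is where the savings mentioned in the text would come from.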


The authors are grateful to bank X for authorizing the collection of the images and this study on them, and to Institution Y (grant #Z) and Institution W (grant #K).


  1. A. K. Jain, A. A. Ross, and K. Nandakumar. Introduction to Biometrics. United States: Springer, 2011.
  2. R. Gonzalo, N. Poh, R. Wong, and R. Reillo, “Time evolution of face recognition in accessible scenarios”, Human-Centric Computing and Information Sciences, vol. 5, no. 4, pp. 1-24, 2015.
  3. R. Srinivasan and A. R. Chowdhury, “Robust face recognition based on saliency maps of sigma sets”, in Proceedings of International Conference on Biometrics: Theory, Applications, and Systems, 2015.
  4. L. Bottou, Y. Bengio, P. Haffner, and Y. LeCun, “Gradient-based learning applied to document recognition”, Proceedings of the IEEE, vol. 86, 1998.
  5. O. M. Parkhi, A. Vedaldi, and A. Zisserman, “Deep face recognition”, in Proceedings of British Machine Vision Conference, 2015.
  6. B. Amos, B. Ludwiczuk, and M. Satyanarayanan, “OpenFace: a general-purpose face recognition library with mobile applications”, CMU School of Computer Science Technical Report - CMU-CS-16-118, 2016.
  7. G. Fosse, S. Leo, C. S. Rodriguez, and N. Gratão, “FEBRABAN survey on Banking Technology”, Technical Report, 2017.
  8. Federal Financial Institutions Examination Council (FFIEC), “Authentication in an internet banking environment”, Financial Institution Letter - FIL-103-2005, 2005.
  9. G. Folego, M. A. Angeloni, J. A. Stuchi, A. Godoy, and A. Rocha, “Cross-domain face verification: matching ID document and self-portrait photographs”, in Proceedings of Workshop on Computer Vision (WVC), 2016.
  10. M. Button, D. Shepherd, D. Blackbourn, and M. Tunley, “Annual fraud indicator 2016”, Experian, PKF Littlejohn and the University of Portsmouth’s Centre for Counter Fraud Studies, 2016.
  11. Javelin Strategy & Research, “2017 Identity fraud: securing the connected life”, Technical Report, 2017.
  12. J. Moar, “The future of cybercrime and security: financial and corporate threats and mitigation”, Technical Report, 2017.
  13. C. Rodriguez, A. Mompean, E. Ribeiro, A. Gabiatti, and M. Motta, “Combate sem trégua às fraudes eletrônicas 2017”. Available at:
  14. H. T. Ho and R. Gopalan, “Model-driven domain adaptation on product manifolds for unconstrained face recognition”, International Journal of Computer Vision, vol. 109, no. 1-2, pp. 110-125, 2014.
  15. A. Krizhevsky, I. Sutskever, and G. E. Hinton, “ImageNet classification with deep convolutional neural networks”, in Proceedings of Int. Conference on Neural Information Processing Systems, pp. 1097-1105, 2012.
  16. P. Viola and M. Jones, “Rapid object detection using a boosted cascade of simple features”, in Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, pp. 511-518, 2001.
  17. N. Dalal and B. Triggs, “Histograms of oriented gradients for human detection”, in Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, pp. 886-893, 2005.
  18. C. Cortes and V. Vapnik, “Support-vector networks”, Machine Learning, vol. 20, no. 3, pp. 273-297, 1995.
  19. J. Wu, “Power mean SVM for large scale visual classification”, in Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, 2012.
  20. L. Breiman, “Random forests”, Machine Learning, vol. 45, no. 1, pp. 5-32, 2001.
  21. S. Raschka et al. Mlxtend - rasbt/mlxtend: Version 0.10.0. DOI 10.5281/zenodo.1127706, 2017.
  22. D. E. King, “Dlib-ML: a machine learning toolkit”, Journal of Machine Learning Research, vol. 10, no. 1, pp. 1755-1758, 2009.
  23. A. Rizzi, C. Gatta, and D. Marini, “A new algorithm for unsupervised global and local color correction”, Pattern Recognition Letters, vol. 24, no. 11, pp. 1663-1677, 2003.
  24. K. Grm, V. Struc, A. Artiges, M. Caron, and H. K. Ekenel, “Strengths and weaknesses of deep learning models for face recognition against image degradations”, IET Biometrics, vol. 7, no. 1, pp. 81-89, 2018.
  25. M. A. Nielsen. Neural Networks and Deep Learning. Determination Press, 2015.
  26. S. Du, H. Guo, and A. Simpson, “Self-driving car steering angle prediction based on image recognition”, Technical Report, Stanford, 2017.
  27. S. M. Pizer, E. P. Amburn, J. D. Austin, R. Cromartie, A. Geselowitz, T. Greer, B. T. H. Romeny, and J. B. Zimmerman, “Adaptive histogram equalization and its variations”, Computer Vision, Graphics, and Image Processing, vol. 39, no. 3, pp. 355-368, 1987.
  28. F. Schroff, D. Kalenichenko, and J. Philbin, “FaceNet: a unified embedding for face recognition and clustering”, in Proceedings of IEEE Conference on Computer Vision and Pattern Recog., pp. 815-823, 2015.
  29. F. Pedregosa et al., “Scikit-learn: Machine Learning in Python”, Journal of Machine Learning Research, pp. 2825-2830, 2011.
  30. J. Wu and H. Yang, “Linear regression-based efficient SVM learning for large-scale classification”, IEEE Transactions on Neural Networks and Learning Systems, vol. 26, no. 10, pp. 2357-2369, 2015.